Hello!

The main idea here is to understand box plots, what they are, how to built one in R and how to interpret them.

Box plots

A pretty popular manner of analising data (between statisticians) is by using box plots

A box plot shows a continuous variable, divided into groups by a categorical variable. For more information about variables types, please visit this link.

The following pictures shows a Box plot created using the ToothGrowth data set which compares the effect of vitamin C over the growth of tooth cells in Guinea pigs, in the x axis we put the categorical variable supp, or supplement type which could be OJ for orange juice and VC for ascorbic acid, and in the Y axis is the continuous variable len or the size of odontoblasts (the cells) of each animal:

The code:

ToothGrowth %>% #lendo o data set
  ggplot(aes(x = supp, y = len, fill = supp)) +  
    geom_boxplot() + 
    labs(
      title = "The effect of vitamin C in tooth growth of Guinea pigs", #plot title
      subtitle = "by: Guilherme Bastos Gomes", #subtitle
      caption = "Source: ToothGrowth data set", #description
      x = "Supplement",
      y = "Length" 
    ) +
    theme_classic() #picking a theme from the ggthemes package

How to interpret

Boxplot has two main parts: the box and the whiskers (two lines that extends outside the box)

First, we divide our distribution into 4 parts, each part is known as a quartile
The box holds the data between the 25 and 75 quartiles, distance known as IQR or interquartile range.
In the middle of the box it is possible to see a line that represents the median, a.k.a. 50 quartile of our distribution. These three lines gives us information about the distribution of numbers and also tell us if these distributions could be considered symmetric or skimmed.
There are two whiskers, a line towards the top, and another one towards the bottom. These lines extends until the point 1.5 times away of each edge of the box.
Dots that are far away are called outliers, the famous points out of a curve, those that does not appear very often in our data set.

ToothGrowth

The response is the length of odontoblasts (cells responsible for tooth growth) in 60 guinea pigs. Each animal received one of three dose levels of vitamin C (0.5, 1, and 2 mg/day) by one of two delivery methods, orange juice or ascorbic acid (a form of vitamin C and coded as VC).

we should have two categorical variables then: supp and dose

Although you may notice that the variable dose it is not yet considered a categorical by R.

str(ToothGrowth$dose)

num [1:60] 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 ...

Let’s transform this variable from numeric into factors to plot a box plot with even more information on it:

ToothGrowth$dose <- as.factor(ToothGrowth$dose)

Now we can call the geom_boxplot() function using both variables:

ToothGrowth %>% 
  ggplot(aes(x = dose, y = len, fill = supp)) + 
    geom_boxplot() +
    labs( 
      title = "The effect of vitamin C in tooth growth of Guinea pigs", #plot title
      subtitle = "by: Guilherme Bastos Gomes", #subtitle
      caption = "Source: ToothGrowth data set", #description
      x = "Supplement",
      y = "Length" 
    ) +
    theme_classic()

This graph is even more interesting, we can even compare 3 variables at the same time.

Boxplot comprehension with Dotplot

Let’s do a Boxplot with a Dotplot within it, also known as a graph of points that represents where data is accumulated in our data set:

Easier to pick up, right?

Now we can visualise how an aplication of dose of each supplements has something to do in the measure of cell length.

Another pretty interesting graph to show this information are the Violin Plots

Violin Plots

The code:

ToothGrowth %>% 
  ggplot(aes(x = dose, y = len)) + 
    geom_violin(trim = FALSE) +
    labs( 
      title = "The effect of vitamin C in tooth growth of Guinea pigs", #plot title
      subtitle = "by: Guilherme Bastos Gomes", #subtitle
      caption = "Source: ToothGrowth data set", #description
      x = "Supplement",
      y = "Length" 
    ) +
    theme_classic()

The most interesting part of the Violin plot is that it also shows us the probability of kernel density of all the values from our data, a concept to explore in future analyses.

Generally Violin Plots includes a mark with the median information, and also a box that indicates the IQR, just like box plots do.

To build a Violin Plot, instead of geom_boxplot() we use the geom_violin() function.

Mean and Median inside the Violin

This time, let’s mark the mean of our data in our geom_violin:

ToothGrowth %>% 
  ggplot(aes(x = dose, y = len)) + 
    geom_violin(trim = FALSE) +
    stat_summary(fun=mean, geom="point", size=2, color="FireBrick") +
    labs( 
      title = "The effect of vitamin C in tooth growth of Guinea pigs", #plot title
      subtitle = "by: Guilherme Bastos Gomes", #subtitle
      caption = "Source: ToothGrowth data set", #description
      x = "Supplement",
      y = "Length" 
    ) +
    theme_classic()

And using the median:

ToothGrowth %>% 
  ggplot(aes(x = dose, y = len)) + 
    geom_violin(trim = FALSE) +
    stat_summary(fun=median, geom="point", size=2, color="FireBrick") +
    labs( 
      title = "The effect of vitamin C in tooth growth of Guinea pigs", #plot title
      subtitle = "by: Guilherme Bastos Gomes", #subtitle
      caption = "Source: ToothGrowth data set", #description
      x = "Supplement",
      y = "Length" 
    ) +
    theme_classic()

Add median and the quartile (a boxplot!)

To wrap up, let’s add some colours, using the ggthemes package:

ToothGrowth %>% 
  ggplot(aes(x = dose, y = len, fill = dose)) + 
    geom_violin(trim = FALSE) +
    stat_summary(fun=median, geom="point", size=2, color="MidnightBlue") +
    geom_boxplot(width=0.1) +
    theme_wsj()

I hope you can use the power of this kind of visualisation in your own analysis, to comprehend the distribution of your data!

What would explain the great concentration of data in each part of the graph? Let’s explore even more of these ideas in future posts!

Learning how to build Boxplots and ViolinPlots using R (EN-AU)