In this post I am showing how to analyse Boxplots and Violin Plots using the ggplot2 package in the R programming language.
The main idea here is to understand box plots, what they are, how to built one in R and how to interpret them.
A pretty popular manner of analising data (between statisticians) is by using box plots
A box plot shows a continuous variable, divided into groups by a categorical variable. For more information about variables types, please visit this link.
The following pictures shows a Box plot created using the
ToothGrowth data set which compares the effect of vitamin C
over the growth of tooth cells in Guinea pigs, in the x axis we put the
categorical variable supp, or supplement type which could
be OJ for orange juice and VC for ascorbic acid, and in the Y axis is
the continuous variable len or the size of odontoblasts
(the cells) of each animal:

The code:
ToothGrowth %>% #lendo o data set
ggplot(aes(x = supp, y = len, fill = supp)) +
geom_boxplot() +
labs(
title = "The effect of vitamin C in tooth growth of Guinea pigs", #plot title
subtitle = "by: Guilherme Bastos Gomes", #subtitle
caption = "Source: ToothGrowth data set", #description
x = "Supplement",
y = "Length"
) +
theme_classic() #picking a theme from the ggthemes package
Boxplot has two main parts: the box and the whiskers (two lines that extends outside the box)
First, we divide our distribution into 4 parts, each part is
known as a quartile
The box holds the data between the 25 and 75 quartiles, distance known as IQR or interquartile range.
In the middle of the box it is possible to see a line that represents the median, a.k.a. 50 quartile of our distribution. These three lines gives us information about the distribution of numbers and also tell us if these distributions could be considered symmetric or skimmed.
There are two whiskers, a line towards the top, and another one towards the bottom. These lines extends until the point 1.5 times away of each edge of the box.
Dots that are far away are called outliers, the
famous points out of a curve, those that does not appear very often in
our data set.
The response is the length of odontoblasts (cells responsible for tooth growth) in 60 guinea pigs. Each animal received one of three dose levels of vitamin C (0.5, 1, and 2 mg/day) by one of two delivery methods, orange juice or ascorbic acid (a form of vitamin C and coded as VC).
we should have two categorical variables then: supp and
dose
Although you may notice that the variable dose it is not
yet considered a categorical by R.
str(ToothGrowth$dose)
num [1:60] 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 ...
Let’s transform this variable from numeric into
factors to plot a box plot with even more information on
it:
ToothGrowth$dose <- as.factor(ToothGrowth$dose)
Now we can call the geom_boxplot() function using both
variables:

ToothGrowth %>%
ggplot(aes(x = dose, y = len, fill = supp)) +
geom_boxplot() +
labs(
title = "The effect of vitamin C in tooth growth of Guinea pigs", #plot title
subtitle = "by: Guilherme Bastos Gomes", #subtitle
caption = "Source: ToothGrowth data set", #description
x = "Supplement",
y = "Length"
) +
theme_classic()
This graph is even more interesting, we can even compare 3 variables at the same time.
Let’s do a Boxplot with a Dotplot within it, also known as a graph of points that represents where data is accumulated in our data set:

Easier to pick up, right?
Now we can visualise how an aplication of dose of each supplements has something to do in the measure of cell length.
Another pretty interesting graph to show this information are the Violin Plots

The code:
ToothGrowth %>%
ggplot(aes(x = dose, y = len)) +
geom_violin(trim = FALSE) +
labs(
title = "The effect of vitamin C in tooth growth of Guinea pigs", #plot title
subtitle = "by: Guilherme Bastos Gomes", #subtitle
caption = "Source: ToothGrowth data set", #description
x = "Supplement",
y = "Length"
) +
theme_classic()
The most interesting part of the Violin plot is that it also shows us the probability of kernel density of all the values from our data, a concept to explore in future analyses.
Generally Violin Plots includes a mark with the median information, and also a box that indicates the IQR, just like box plots do.
To build a Violin Plot, instead of geom_boxplot() we use
the geom_violin() function.
This time, let’s mark the mean of our data in our
geom_violin:
ToothGrowth %>%
ggplot(aes(x = dose, y = len)) +
geom_violin(trim = FALSE) +
stat_summary(fun=mean, geom="point", size=2, color="FireBrick") +
labs(
title = "The effect of vitamin C in tooth growth of Guinea pigs", #plot title
subtitle = "by: Guilherme Bastos Gomes", #subtitle
caption = "Source: ToothGrowth data set", #description
x = "Supplement",
y = "Length"
) +
theme_classic()

And using the median:
ToothGrowth %>%
ggplot(aes(x = dose, y = len)) +
geom_violin(trim = FALSE) +
stat_summary(fun=median, geom="point", size=2, color="FireBrick") +
labs(
title = "The effect of vitamin C in tooth growth of Guinea pigs", #plot title
subtitle = "by: Guilherme Bastos Gomes", #subtitle
caption = "Source: ToothGrowth data set", #description
x = "Supplement",
y = "Length"
) +
theme_classic()


To wrap up, let’s add some colours, using the ggthemes
package:

ToothGrowth %>%
ggplot(aes(x = dose, y = len, fill = dose)) +
geom_violin(trim = FALSE) +
stat_summary(fun=median, geom="point", size=2, color="MidnightBlue") +
geom_boxplot(width=0.1) +
theme_wsj()
I hope you can use the power of this kind of visualisation in your own analysis, to comprehend the distribution of your data!
What would explain the great concentration of data in each part of the graph? Let’s explore even more of these ideas in future posts!