In this post I combine some of the tidyverse functions to obtain interesting stats from the data. The main idea is to work with the tools we’ve already used. In the case of a new tool showing up, I will explain it, but always remeber that you can read about R functions by tiping “?nameofthefunction” in the console!

Remembering that in this post I’ve explained about each of these functions!
Today we will use the mpg data set from the
tidyverse, so let’s start by loading the package:
library(tidyverse)
Let’s check what the data.frame contains, or have a
glimpse() on it:
glimpse(mpg)
Rows: 234
Columns: 11
$ manufacturer <chr> "audi", "audi", "audi", "audi", "audi", "audi", "audi", "audi", "audi"~
$ model <chr> "a4", "a4", "a4", "a4", "a4", "a4", "a4", "a4 quattro", "a4 quattro", ~
$ displ <dbl> 1.8, 1.8, 2.0, 2.0, 2.8, 2.8, 3.1, 1.8, 1.8, 2.0, 2.0, 2.8, 2.8, 3.1, ~
$ year <int> 1999, 1999, 2008, 2008, 1999, 1999, 2008, 1999, 1999, 2008, 2008, 1999~
$ cyl <int> 4, 4, 4, 4, 6, 6, 6, 4, 4, 4, 4, 6, 6, 6, 6, 6, 6, 8, 8, 8, 8, 8, 8, 8~
$ trans <chr> "auto(l5)", "manual(m5)", "manual(m6)", "auto(av)", "auto(l5)", "manua~
$ drv <chr> "f", "f", "f", "f", "f", "f", "f", "4", "4", "4", "4", "4", "4", "4", ~
$ cty <int> 18, 21, 20, 21, 16, 18, 18, 18, 16, 20, 19, 15, 17, 17, 15, 15, 17, 16~
$ hwy <int> 29, 29, 31, 30, 26, 26, 27, 26, 25, 28, 27, 25, 25, 25, 25, 24, 25, 23~
$ fl <chr> "p", "p", "p", "p", "p", "p", "p", "p", "p", "p", "p", "p", "p", "p", ~
$ class <chr> "compact", "compact", "compact", "compact", "compact", "compact", "com~
Interesting, right? It’s possible to check that in that
data.frame we can find 11 columns and 234 rows. To
understand even more about this data you can type ?mpg in
the console, but here is a brief explanation:
Fuel economy data from 1999 to 2008 for 38 popular models of cars.
It is possible to use the function colnames() to check
what is in it:
colnames(mpg)
[1] "manufacturer" "model" "displ" "year" "cyl"
[6] "trans" "drv" "cty" "hwy" "fl"
[11] "class"
Another pretty interesting way of receiving some stats happens by
using the function summary()
This is a very generic functio from R base, and serves mainly to obtain a summary (yeah) with statistics of an object:
summary(mpg)
manufacturer model displ year cyl
Length:234 Length:234 Min. :1.600 Min. :1999 Min. :4.000
Class :character Class :character 1st Qu.:2.400 1st Qu.:1999 1st Qu.:4.000
Mode :character Mode :character Median :3.300 Median :2004 Median :6.000
Mean :3.472 Mean :2004 Mean :5.889
3rd Qu.:4.600 3rd Qu.:2008 3rd Qu.:8.000
Max. :7.000 Max. :2008 Max. :8.000
trans drv cty hwy fl
Length:234 Length:234 Min. : 9.00 Min. :12.00 Length:234
Class :character Class :character 1st Qu.:14.00 1st Qu.:18.00 Class :character
Mode :character Mode :character Median :17.00 Median :24.00 Mode :character
Mean :16.86 Mean :23.44
3rd Qu.:19.00 3rd Qu.:27.00
Max. :35.00 Max. :44.00
class
Length:234
Class :character
Mode :character
Wow! This is a lot of information our data with only one function!
Let’s now combine some of the functions from tidyverse to obtain even more stats!
mpg %>%
select(motor_type = displ,
year,
city_miles_per_gallon = cty,
highway_miles_per_gallon = hwy) %>%
group_by(year) %>%
summarise(mean_cty_miles_per_gallon = mean(city_miles_per_gallon),
mean_hwy_miles_per_gallon = mean(highway_miles_per_gallon))
Here we see that in average, cars from 1999 used to run 0.3 more miles per gallon in the city than in 2008, although the average does not changes much from the average on the hailway:
| year | mean_cty_miles_per_gallon | mean_hwy_miles_per_gallon |
|---|---|---|
| 1999 | 17.0 | 23.4 |
| 2008 | 16.7 | 23.5 |
Notice how inside the select() function we could add the
name of columns that would appear in our sub set.
So inside the summarise() we should include those new
names
But how about the many types of motors?
mpg %>%
select(motor_type = displ,
year,
city_miles_per_gallon = cty,
highway_miles_per_gallon = hwy) %>%
group_by(motor_type, year) %>%
summarise(mean_cty_miles_per_gallon = mean(city_miles_per_gallon),
mean_hwy_miles_per_gallon = mean(highway_miles_per_gallon))
Agora agrupamos por duas colunas, primeiro por tipo do motor, segundo pelo ano:
| motor_type | year | mean_cty_miles_per_gallon | mean_hwy_miles_per_gallon |
|---|---|---|---|
| 1.6 | 1999 | 24.8 | 31.6 |
| 1.8 | 1999 | 20.7 | 29.4 |
| 1.8 | 2008 | 25.8 | 35.6 |
| 1.9 | 1999 | 32.3 | 43 |
| 2 | 1999 | 19.8 | 27.5 |
| 2 | 2008 | 20.5 | 28.7 |
| 2.2 | 1999 | 20.7 | 27.3 |
| 2.4 | 1999 | 18.8 | 26.7 |
| 2.4 | 2008 | 21.3 | 30.7 |
| 2.5 | 1999 | 18.3 | 25.5 |
| 2.5 | 1999 | 18.3 | 25.5 |
Great, now we have comparisons about motor types! We can also
arrange() our table:
mpg %>%
select(motor_type = displ,
year,
city_miles_per_gallon = cty,
highway_miles_per_gallon = hwy) %>%
group_by(motor_type, year) %>%
summarise(mean_cty_miles_per_gallon = mean(city_miles_per_gallon),
mean_hwy_miles_per_gallon = mean(highway_miles_per_gallon)) %>%
arrange(year)
Cool, now that we have our organized data we can split two more data sets from this one, by using a filter we can include only those cars from 1999 and 2008 in two separate data sets:
mpg_1999 <- mpg %>%
select(motor_type = displ,
year,
city_miles_per_gallon = cty,
highway_miles_per_gallon = hwy) %>%
group_by(motor_type, year) %>%
summarise(mean_cty_miles_per_gallon = mean(city_miles_per_gallon),
mean_hwy_miles_per_gallon = mean(highway_miles_per_gallon)) %>%
arrange(year) %>%
filter(year == 1999)
mpg_2008 <- mpg %>%
select(motor_type = displ,
year,
city_miles_per_gallon = cty,
highway_miles_per_gallon = hwy) %>%
group_by(motor_type, year) %>%
summarise(mean_cty_miles_per_gallon = mean(city_miles_per_gallon),
mean_hwy_miles_per_gallon = mean(highway_miles_per_gallon)) %>%
arrange(year) %>%
filter(year == 2008)
This is pretty useful when we are dealing with a lot of data and to improve our analyses.
Have a look in both sub sets and find out ways of comparing those values.
Did the cars became more or less efficient in time?
In the next post: One of the easiest ways of comparing data is by using plots. Next post I’m going to use ggplot to create some of them and compare results.
Follow me on twitter: @gimbgomes