Hello! In this post I will use some of the functions from the tidyverse to wrangle one of its built-in data sets.

Remember that I have already explained each of these functions in this post!

Today we will use the mpg data set from the tidyverse, so let’s start by loading the package:

library(tidyverse)

Let’s check what the data frame contains by taking a glimpse() at it:

glimpse(mpg)

Rows: 234
Columns: 11
$ manufacturer <chr> "audi", "audi", "audi", "audi", "audi", "audi", "audi", "audi", "audi"~
$ model        <chr> "a4", "a4", "a4", "a4", "a4", "a4", "a4", "a4 quattro", "a4 quattro", ~
$ displ        <dbl> 1.8, 1.8, 2.0, 2.0, 2.8, 2.8, 3.1, 1.8, 1.8, 2.0, 2.0, 2.8, 2.8, 3.1, ~
$ year         <int> 1999, 1999, 2008, 2008, 1999, 1999, 2008, 1999, 1999, 2008, 2008, 1999~
$ cyl          <int> 4, 4, 4, 4, 6, 6, 6, 4, 4, 4, 4, 6, 6, 6, 6, 6, 6, 8, 8, 8, 8, 8, 8, 8~
$ trans        <chr> "auto(l5)", "manual(m5)", "manual(m6)", "auto(av)", "auto(l5)", "manua~
$ drv          <chr> "f", "f", "f", "f", "f", "f", "f", "4", "4", "4", "4", "4", "4", "4", ~
$ cty          <int> 18, 21, 20, 21, 16, 18, 18, 18, 16, 20, 19, 15, 17, 17, 15, 15, 17, 16~
$ hwy          <int> 29, 29, 31, 30, 26, 26, 27, 26, 25, 28, 27, 25, 25, 25, 25, 24, 25, 23~
$ fl           <chr> "p", "p", "p", "p", "p", "p", "p", "p", "p", "p", "p", "p", "p", "p", ~
$ class        <chr> "compact", "compact", "compact", "compact", "compact", "compact", "com~

Interesting, right? The output shows that this data frame has 234 rows and 11 columns. To understand even more about the data you can type ?mpg in the console, but here is a brief explanation:

mpg

Fuel economy data from 1999 to 2008 for 38 popular models of cars.

You can use the function colnames() to list the names of its columns:

colnames(mpg)

[1] "manufacturer" "model"        "displ"        "year"         "cyl"         
[6] "trans"        "drv"          "cty"          "hwy"          "fl"          
[11] "class"
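If you only want the size of the data frame, here is a small sketch using the base R functions dim(), nrow() and ncol(). They are not used elsewhere in this post, so take this as an optional extra:

```r
library(tidyverse)

# dim() returns the number of rows followed by the number of columns
mpg_dim <- dim(mpg)
mpg_dim
# [1] 234  11

# nrow() and ncol() return each value on its own
nrow(mpg)
ncol(mpg)
```

These are the same 234 rows and 11 columns that glimpse() reported.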

Another pretty interesting way of getting some stats is the function summary().

summary()

This is a very generic function from base R, and it mainly serves to obtain a summary (yeah) of an object's statistics:

summary(mpg)

 manufacturer          model               displ            year           cyl       
 Length:234         Length:234         Min.   :1.600   Min.   :1999   Min.   :4.000  
 Class :character   Class :character   1st Qu.:2.400   1st Qu.:1999   1st Qu.:4.000  
 Mode  :character   Mode  :character   Median :3.300   Median :2004   Median :6.000  
                                       Mean   :3.472   Mean   :2004   Mean   :5.889  
                                       3rd Qu.:4.600   3rd Qu.:2008   3rd Qu.:8.000  
                                       Max.   :7.000   Max.   :2008   Max.   :8.000  
    trans               drv                 cty             hwy             fl           
 Length:234         Length:234         Min.   : 9.00   Min.   :12.00   Length:234        
 Class :character   Class :character   1st Qu.:14.00   1st Qu.:18.00   Class :character  
 Mode  :character   Mode  :character   Median :17.00   Median :24.00   Mode  :character  
                                       Mean   :16.86   Mean   :23.44                     
                                       3rd Qu.:19.00   3rd Qu.:27.00                     
                                       Max.   :35.00   Max.   :44.00                     
    class          
 Length:234        
 Class :character  
 Mode  :character

Wow! That is a lot of information about our data with only one function!
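As a small sketch, summary() also works on a single column. For character columns it only reports length, class and mode, so one possible alternative (my suggestion, not something covered earlier in this post) is count() from dplyr. The names hwy_summary and drv_counts are just illustrative:

```r
library(tidyverse)

# summary() on one numeric column returns only that column's statistics
hwy_summary <- summary(mpg$hwy)
hwy_summary

# For a character column such as drv, counting the values is usually
# more informative than the Length/Class/Mode summary
drv_counts <- mpg %>%
  count(drv, sort = TRUE)
drv_counts
```

The numbers printed for hwy should match the hwy column in the full summary(mpg) output.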

Let’s now combine some of the functions from tidyverse to obtain even more stats!

combining functions using a pipe “%>%”

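mpg %>%
  # keep four columns, renaming three of them (new_name = old_name)
  select(motor_type = displ,
         year,
         city_miles_per_gallon = cty,
         highway_miles_per_gallon = hwy) %>%
  # one group per year
  group_by(year) %>%
  # average city and highway mileage within each group
  summarise(mean_cty_miles_per_gallon = mean(city_miles_per_gallon),
            mean_hwy_miles_per_gallon = mean(highway_miles_per_gallon))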

Here we see that, on average, cars from 1999 ran 0.3 more miles per gallon in the city than cars from 2008, while the highway average barely changes:

year	mean_cty_miles_per_gallon	mean_hwy_miles_per_gallon
1999	17.0	23.4
2008	16.7	23.5

Notice how, inside the select() function, we can rename the columns as we pick them (new_name = old_name), and those new names are the ones that appear in our subset.

So inside summarise() we have to refer to the columns by those new names.
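To make the renaming clearer, here is a minimal sketch comparing select() with rename(), which is also part of dplyr. The object names renamed_subset and renamed_all are only illustrative:

```r
library(tidyverse)

# select() keeps only the listed columns and renames with new_name = old_name
renamed_subset <- mpg %>%
  select(motor_type = displ, year)
colnames(renamed_subset)
# [1] "motor_type" "year"

# rename() changes the name but keeps every other column
renamed_all <- mpg %>%
  rename(motor_type = displ)
ncol(renamed_all)
# [1] 11
```

So select() is handy when you want fewer columns, and rename() when you want to keep them all.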

But how about the many types of motors? Here displ, the engine displacement in litres, stands in for the motor type:

mpg %>%
  select(motor_type = displ,
         year,
         city_miles_per_gallon = cty,
         highway_miles_per_gallon = hwy) %>%
  # one group per combination of motor type and year
  group_by(motor_type, year) %>%
  summarise(mean_cty_miles_per_gallon = mean(city_miles_per_gallon),
            mean_hwy_miles_per_gallon = mean(highway_miles_per_gallon))

Now we have grouped by two columns, first by motor type and then by year:

motor_type	year	mean_cty_miles_per_gallon	mean_hwy_miles_per_gallon
1.6	1999	24.8	31.6
1.8	1999	20.7	29.4
1.8	2008	25.8	35.6
1.9	1999	32.3	43
2	1999	19.8	27.5
2	2008	20.5	28.7
2.2	1999	20.7	27.3
2.4	1999	18.8	26.7
2.4	2008	21.3	30.7
2.5	1999	18.3	25.5

Great, now we can compare motor types! We can also arrange() our table, which sorts its rows by the column we give it:

mpg %>%
  select(motor_type = displ,
         year,
         city_miles_per_gallon = cty,
         highway_miles_per_gallon = hwy) %>%
  group_by(motor_type, year) %>%
  summarise(mean_cty_miles_per_gallon = mean(city_miles_per_gallon),
            mean_hwy_miles_per_gallon = mean(highway_miles_per_gallon)) %>%
  # sort the summary so that all 1999 rows come before the 2008 rows
  arrange(year)
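arrange() can also sort by more than one column, and desc() reverses the order. Here is a small sketch. The shorter column names mean_cty and mean_hwy and the `.groups = "drop"` argument (available in dplyr 1.0.0 or later) are my own choices for brevity, not what the pipeline above uses:

```r
library(tidyverse)

# Averages per engine size and year, returned as an ungrouped tibble
mpg_by_motor <- mpg %>%
  group_by(displ, year) %>%
  summarise(mean_cty = mean(cty),
            mean_hwy = mean(hwy),
            .groups = "drop")

# Sort by year first, then from the highest to the lowest highway average
mpg_sorted <- mpg_by_motor %>%
  arrange(year, desc(mean_hwy))
mpg_sorted
```

With this ordering the most efficient engine size of each year shows up at the top of its block.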

Cool, now that our data is organized, we can split it into two more data sets. By using filter() we can keep only the cars from 1999 in one data set and only the cars from 2008 in another:

mpg_1999 <- mpg %>%
  select(motor_type = displ,
         year,
         city_miles_per_gallon = cty,
         highway_miles_per_gallon = hwy) %>%
  group_by(motor_type, year) %>%
  summarise(mean_cty_miles_per_gallon = mean(city_miles_per_gallon),
            mean_hwy_miles_per_gallon = mean(highway_miles_per_gallon)) %>%
  arrange(year) %>%
  # keep only the rows for 1999
  filter(year == 1999)

mpg_2008 <- mpg %>%
  select(motor_type = displ,
         year,
         city_miles_per_gallon = cty,
         highway_miles_per_gallon = hwy) %>%
  group_by(motor_type, year) %>%
  summarise(mean_cty_miles_per_gallon = mean(city_miles_per_gallon),
            mean_hwy_miles_per_gallon = mean(highway_miles_per_gallon)) %>%
  arrange(year) %>%
  # keep only the rows for 2008
  filter(year == 2008)

Splitting like this is pretty useful when we are dealing with a lot of data, and it helps to keep our analyses focused.
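Since the two pipelines above differ only in the last filter(), one possible way to avoid the repetition is to store the shared summary once and filter it twice. This is just a sketch: the name mpg_summary and the `.groups = "drop"` argument (dplyr 1.0.0 or later) are my additions.

```r
library(tidyverse)

# Shared part of both pipelines, computed a single time
mpg_summary <- mpg %>%
  select(motor_type = displ,
         year,
         city_miles_per_gallon = cty,
         highway_miles_per_gallon = hwy) %>%
  group_by(motor_type, year) %>%
  summarise(mean_cty_miles_per_gallon = mean(city_miles_per_gallon),
            mean_hwy_miles_per_gallon = mean(highway_miles_per_gallon),
            .groups = "drop")

# Each subset is now a one-line filter on the stored summary
mpg_1999 <- mpg_summary %>% filter(year == 1999)
mpg_2008 <- mpg_summary %>% filter(year == 2008)
```

The values are the same as before. The only difference is that these tibbles are ungrouped, because of `.groups = "drop"`.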

Have a look at both subsets and think of ways to compare those values.

Did the cars become more or less efficient over time?
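One possible way to start answering that, as a rough sketch only: join the two subsets by engine size with inner_join() and subtract the 1999 averages from the 2008 ones. The short column names, the suffixes and the name mpg_change are all illustrative choices of mine.

```r
library(tidyverse)

# Averages per engine size and year (short names used for brevity)
mpg_summary <- mpg %>%
  group_by(displ, year) %>%
  summarise(mean_cty = mean(cty),
            mean_hwy = mean(hwy),
            .groups = "drop")

mpg_1999 <- mpg_summary %>% filter(year == 1999) %>% select(-year)
mpg_2008 <- mpg_summary %>% filter(year == 2008) %>% select(-year)

# inner_join() keeps only the engine sizes that exist in both years;
# suffix marks which year each average came from
mpg_change <- inner_join(mpg_1999, mpg_2008,
                         by = "displ",
                         suffix = c("_1999", "_2008")) %>%
  mutate(cty_change = mean_cty_2008 - mean_cty_1999,
         hwy_change = mean_hwy_2008 - mean_hwy_1999) %>%
  arrange(desc(hwy_change))
mpg_change
```

A positive value in cty_change or hwy_change means that engine size ran more miles per gallon, on average, in 2008 than in 1999. Keep in mind that engine sizes present in only one of the two years are left out of this comparison.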

In the next post: one of the easiest ways of comparing data is by using plots, so I’m going to use ggplot2 to create some of them and compare the results.

Thank you for your time!

Follow me on twitter: @gimbgomes

Using tidyverse functions
