Using tidyverse functions

In this post I combine some of the tidyverse functions to obtain interesting stats from the data. The main idea is to work with the tools we’ve already used. In the case of a new tool showing up, I will explain it, but always remeber that you can read about R functions by tiping “?nameofthefunction” in the console!

Guilherme Bastos Gomes https://gbastosg.github.io/guilhermesportfolio/
2022-04-23

Hello! In this post I will use some of the functions from tidyverse to wrangle built-in R data

Remembering that in this post I’ve explained about each of these functions!

Today we will use the mpg data set from the tidyverse, so let’s start by loading the package:

library(tidyverse)

Let’s check what the data.frame contains, or have a glimpse() on it:

glimpse(mpg)

Rows: 234
Columns: 11
$ manufacturer <chr> "audi", "audi", "audi", "audi", "audi", "audi", "audi", "audi", "audi"~
$ model        <chr> "a4", "a4", "a4", "a4", "a4", "a4", "a4", "a4 quattro", "a4 quattro", ~
$ displ        <dbl> 1.8, 1.8, 2.0, 2.0, 2.8, 2.8, 3.1, 1.8, 1.8, 2.0, 2.0, 2.8, 2.8, 3.1, ~
$ year         <int> 1999, 1999, 2008, 2008, 1999, 1999, 2008, 1999, 1999, 2008, 2008, 1999~
$ cyl          <int> 4, 4, 4, 4, 6, 6, 6, 4, 4, 4, 4, 6, 6, 6, 6, 6, 6, 8, 8, 8, 8, 8, 8, 8~
$ trans        <chr> "auto(l5)", "manual(m5)", "manual(m6)", "auto(av)", "auto(l5)", "manua~
$ drv          <chr> "f", "f", "f", "f", "f", "f", "f", "4", "4", "4", "4", "4", "4", "4", ~
$ cty          <int> 18, 21, 20, 21, 16, 18, 18, 18, 16, 20, 19, 15, 17, 17, 15, 15, 17, 16~
$ hwy          <int> 29, 29, 31, 30, 26, 26, 27, 26, 25, 28, 27, 25, 25, 25, 25, 24, 25, 23~
$ fl           <chr> "p", "p", "p", "p", "p", "p", "p", "p", "p", "p", "p", "p", "p", "p", ~
$ class        <chr> "compact", "compact", "compact", "compact", "compact", "compact", "com~

Interesting, right? It’s possible to check that in that data.frame we can find 11 columns and 234 rows. To understand even more about this data you can type ?mpg in the console, but here is a brief explanation:

mpg

Fuel economy data from 1999 to 2008 for 38 popular models of cars.

It is possible to use the function colnames() to check what is in it:

colnames(mpg)

[1] "manufacturer" "model"        "displ"        "year"         "cyl"         
[6] "trans"        "drv"          "cty"          "hwy"          "fl"          
[11] "class"

Another pretty interesting way of receiving some stats happens by using the function summary()

summary()

This is a very generic functio from R base, and serves mainly to obtain a summary (yeah) with statistics of an object:

summary(mpg)

 manufacturer          model               displ            year           cyl       
 Length:234         Length:234         Min.   :1.600   Min.   :1999   Min.   :4.000  
 Class :character   Class :character   1st Qu.:2.400   1st Qu.:1999   1st Qu.:4.000  
 Mode  :character   Mode  :character   Median :3.300   Median :2004   Median :6.000  
                                       Mean   :3.472   Mean   :2004   Mean   :5.889  
                                       3rd Qu.:4.600   3rd Qu.:2008   3rd Qu.:8.000  
                                       Max.   :7.000   Max.   :2008   Max.   :8.000  
    trans               drv                 cty             hwy             fl           
 Length:234         Length:234         Min.   : 9.00   Min.   :12.00   Length:234        
 Class :character   Class :character   1st Qu.:14.00   1st Qu.:18.00   Class :character  
 Mode  :character   Mode  :character   Median :17.00   Median :24.00   Mode  :character  
                                       Mean   :16.86   Mean   :23.44                     
                                       3rd Qu.:19.00   3rd Qu.:27.00                     
                                       Max.   :35.00   Max.   :44.00                     
    class          
 Length:234        
 Class :character  
 Mode  :character
 

Wow! This is a lot of information our data with only one function!

Let’s now combine some of the functions from tidyverse to obtain even more stats!

combining functions using a pipe “%>%”

mpg %>%
  select(motor_type = displ,
  year, 
  city_miles_per_gallon = cty, 
  highway_miles_per_gallon = hwy) %>%
  group_by(year) %>%
  summarise(mean_cty_miles_per_gallon = mean(city_miles_per_gallon), 
  mean_hwy_miles_per_gallon = mean(highway_miles_per_gallon))

Here we see that in average, cars from 1999 used to run 0.3 more miles per gallon in the city than in 2008, although the average does not changes much from the average on the hailway:

year mean_cty_miles_per_gallon mean_hwy_miles_per_gallon
1999 17.0 23.4
2008 16.7 23.5

Notice how inside the select() function we could add the name of columns that would appear in our sub set.

So inside the summarise() we should include those new names

But how about the many types of motors?

mpg %>%
  select(motor_type = displ,
  year, 
  city_miles_per_gallon = cty, 
  highway_miles_per_gallon = hwy) %>%
  group_by(motor_type, year) %>%
  summarise(mean_cty_miles_per_gallon = mean(city_miles_per_gallon), 
  mean_hwy_miles_per_gallon = mean(highway_miles_per_gallon))

Agora agrupamos por duas colunas, primeiro por tipo do motor, segundo pelo ano:

motor_type year mean_cty_miles_per_gallon mean_hwy_miles_per_gallon
1.6 1999 24.8 31.6
1.8 1999 20.7 29.4
1.8 2008 25.8 35.6
1.9 1999 32.3 43
2 1999 19.8 27.5
2 2008 20.5 28.7
2.2 1999 20.7 27.3
2.4 1999 18.8 26.7
2.4 2008 21.3 30.7
2.5 1999 18.3 25.5
2.5 1999 18.3 25.5

Great, now we have comparisons about motor types! We can also arrange() our table:

mpg %>%
  select(motor_type = displ,
  year, 
  city_miles_per_gallon = cty, 
  highway_miles_per_gallon = hwy) %>%
  group_by(motor_type, year) %>%
  summarise(mean_cty_miles_per_gallon = mean(city_miles_per_gallon), 
  mean_hwy_miles_per_gallon = mean(highway_miles_per_gallon)) %>%
  arrange(year)

Cool, now that we have our organized data we can split two more data sets from this one, by using a filter we can include only those cars from 1999 and 2008 in two separate data sets:

mpg_1999 <- mpg %>%
  select(motor_type = displ,
  year, 
  city_miles_per_gallon = cty, 
  highway_miles_per_gallon = hwy) %>%
  group_by(motor_type, year) %>%
  summarise(mean_cty_miles_per_gallon = mean(city_miles_per_gallon), 
  mean_hwy_miles_per_gallon = mean(highway_miles_per_gallon)) %>%
  arrange(year) %>%
  filter(year == 1999)
mpg_2008 <- mpg %>%
  select(motor_type = displ,
  year, 
  city_miles_per_gallon = cty, 
  highway_miles_per_gallon = hwy) %>%
  group_by(motor_type, year) %>%
  summarise(mean_cty_miles_per_gallon = mean(city_miles_per_gallon), 
  mean_hwy_miles_per_gallon = mean(highway_miles_per_gallon)) %>%
  arrange(year) %>%
  filter(year == 2008)

This is pretty useful when we are dealing with a lot of data and to improve our analyses.

Have a look in both sub sets and find out ways of comparing those values.

Did the cars became more or less efficient in time?

In the next post: One of the easiest ways of comparing data is by using plots. Next post I’m going to use ggplot to create some of them and compare results.

Thank you for your time!

Follow me on twitter: @gimbgomes