In this post I am showing you how to use the basic tools from the R package tidyverse to analyse data. These tools are useful to deal with a whole sort of data!

All these verbs are also great functions to deal with data from the real world!
Let’s have a look in the Iris data set which gives us
the measurements in centimeters from three species of plants, the
variables are: sepal length, sepal width, petal length and petal width,
respectively, for 50 flowers from each of 3 species of the
Iris gender. The species are Iris setosa, Iris
versicolor, and Iris virginica.
And since we are dealing with the tidyverse package we
have to load it:
library(tidyverse)
we can call it and have a look:
iris
or just a part of it:
head(iris)
or just a glimpse:
glimpse(iris)
First of all let’s start with select and since we are
using a tidyverse way of thinking, we are going to start using
pipes from now on, these are also known as %>% or
|>, depending on which you prefer to write.
To select a sort of columns you wish you can:
iris %>% #here we are calling the dataframe iris and piping to the next function
select(Sepal.Length, Species) #here we are calling the function and choosing two columns `Sepal.Length` and `Species`
Note the pipe “%>%” which means the output of one function will be
the input of the next one, and the first pipe just includes the iris
data set into the next function
select(Sepal.Length, Species)
Pretty easy, right?
Now you should have a smaller data set (a sample), only the selected
columns from Iris to work with, if you want to check it,
just call the same script again, but using the head()
function to see only the first 10 lines:
iris %>%
select(Sepal.Length, Species) %>%
head()
Just a tip, this would also work (but some may consider harder to read):
head(select(iris, Sepal.Length, Species))
Now you should have something like this:
| Sepal.Length | Species |
|---|---|
| 5.1 | setosa |
| 4.9 | setosa |
| 4.7 | setosa |
| 4.6 | setosa |
| 5.0 | setosa |
| 5.4 | setosa |
To make our work even easier, we should create an object with our
smaller iris dataframe:
It’s good to name it as it is, in this case I will use the following
name: selected_iris
selected_iris <- iris %>%
select(Sepal.Length, Species) %>%
head()
Suppose you want now to quick check the ordered distribution of Sepal
Lengths, we can either call the view() function inside
RStudio or use the arrange() function, like this:
arrange(selected_iris) %>%
head()
You should have something like this:
| Sepal.Length | Species |
|---|---|
| 4.3 | setosa |
| 4.4 | setosa |
| 4.4 | setosa |
| 4.4 | setosa |
| 4.5 | setosa |
| 4.6 | setosa |
You are going to notice that dataframe is now ordered
from the lowest to highest values of Sepal.Length. But what
if we wanted to arrange them in a descending order? Then we
should just use the function desc() inside of our
arrange() calling:
arrange(desc(selected_iris)) %>%
head()
Now you have:
| Sepal.Length | Species |
|---|---|
| 7.9 | virginica |
| 7.7 | virginica |
| 7.7 | virginica |
| 7.7 | virginica |
| 7.7 | virginica |
| 7.6 | virginica |
Now that we already know how to select() and
arrange(), let’s learn another two verbs that are great for
data manipulation and analysis:
filter() function to gather only
Iris virginica species from our dataset:iris %>%
filter(Species == "virginica")
Note that we are using the == operator, meaning we want
to filter everything from the Species column that is equal
to our text "virginica".
iris %>%
filter(Species != "virginica")
Now we have used the != which stands for different or
not equal, then we will have received a data set with Iris
setosa and Iris versicolor species and to check if this is
true let’s create objects receiving our filtered data sets:
iris_virginica <-
iris %>%
filter(Species == "virginica")
#And let's create the other one
not_iris_virginica <-
iris %>%
filter(Species != "virginica")
Both of these operators == (equal) and !=
(not equal) can be used in different contexts in R, for example to check
if everything was okay with our filtering:
#here I am going to use another form of the select funcion:
select(iris_virginica, Species) == "virginica"
And this should return TRUE for all the rows inside our
data set. When we test our affirmations with conditionals, R can check
if that is TRUE or FALSE.
Let’s try another one:
#just calling the data set inside the function again:
select(not_iris_virginica, Species) == "virginica"
And now you should receive FALSE for all the rows inside
our not_iris_virginica data set. Meaning our filter worked,
cool right?
I will talk about conditionals and all those boolean things in
another post, since there are many other operators (>=,
<=, >, <,
&, |, and so on…)
Now I am going to continue with another tidyverb ;)
The last and, perhaps most important, function we are going to learn
is mutate()
This is a great function to manipulate data and to transform or create new columns in our data set
mutate() works like this:
mutate(column_you_want_to_create = equation)
Suppose you want to know the ratio between Petal and Sepal length of all the species:
iris %>%
mutate(ratio_petal_sepal = Petal.Length/Sepal.Length)
There are many uses for the mutate verb, another one is to sort data
combining it with the ifelse() function, which creates a
condition:
Imagine we want to classify our flowers as being “large” or “small”
based on our ratio, that means “if ratio_petal_sepal < 0.5, then it
receives a small tag, otherwise it receives a
large tag:
iris %>%
mutate(ratio_petal_sepal = Petal.Length/Sepal.Length) %>%
mutate(size = ifelse(ratio_petal_sepal < 0.5, "small", "large"))
Check that we are using a double mutate calling, just because we have not created an object, let’s tidy it:
ratio_iris <- iris %>%
mutate(ratio_petal_sepal = Petal.Length/Sepal.Length)
Now we have the object ratio_iris and I can explain how
mutate()combined with ifelse() works:
ratio_iris %>%
mutate(size = ifelse(ratio_petal_sepal < 0.5, "small", "large"))
ratio_iris %>% #putting the data set through our pipe
mutate(size = #calling mutate and creating the column "size"
ifelse(ratio_petal_sepal < 0.5, #if this statement is `TRUE` and the number inside of ratio_petal_sepal is lesser than 0.5
"small", #then the column "size" will receive that tag "small"
"large")) #else or otherwise it will receive "large"
You are going to have a complete sorted data set.
Next post I am going to show you how to count() and to
gather some stats from data.
Hope you liked this one!
Now you can manipulate data and start a data analysis project using
the tidyverse
Try using those functions in the other data sets from R, like the
mpg for example!
Follow me on twitter: @gimbgomes