Hello! A pretty important concept in data analysis is EDA.

EDA stands for Exploratory Data Analysis.

We do an EDA by using functions that help us understand our data, and also with graphs, which we are going to develop in a future post.

In this post I’m going to show you how to use the following functions for data analysis and manipulation:

count()
summarise()
group_by()
top_n()

Just a reminder that in a previous post I wrote about the following functions:

select()
arrange()
filter()
mutate()

Ok, since we are dealing with the tidyverse, don’t forget to load it in your session:

library(tidyverse)

We will continue to use the iris data set, which is already built into R.

Planning the best way to proceed

Our goal is to use the function count() to find the total number of each species in our data set. First, let’s take a look at the first rows with head():

head(iris)

This should give you something like this:

Sepal.Length	Sepal.Width	Petal.Length	Petal.Width	Species
5.1	3.5	1.4	0.2	setosa
4.9	3.0	1.4	0.2	setosa
4.7	3.2	1.3	0.2	setosa
4.6	3.1	1.5	0.2	setosa
5.0	3.6	1.4	0.2	setosa
5.4	3.9	1.7	0.4	setosa

Notice that the column Species holds a value that can be used to form groups in our data set.

Well, to better understand the values in each column, and also to create a strategy to deal with the data, just use the command glimpse():

glimpse(iris)

This should give you something like this:

Rows: 150  
Columns: 5  
$ Sepal.Length <dbl> 5.1, 4.9, 4.7, 4.6, 5.0, 5~  
$ Sepal.Width  <dbl> 3.5, 3.0, 3.2, 3.1, 3.6, 3~  
$ Petal.Length <dbl> 1.4, 1.4, 1.3, 1.5, 1.4, 1~  
$ Petal.Width  <dbl> 0.2, 0.2, 0.2, 0.2, 0.2, 0~  
$ Species      <fct> setosa, setosa, setosa~

It seems that we have a column Species with names that we can group by, but what are those names? There are several ways to find out, and one of them is count().
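Before we get to count(), here is a small sketch of two other ways to list those names. It assumes only that Species is a factor (as glimpse() showed) and loads dplyr on its own, which the tidyverse already includes:

```r
library(dplyr)  # part of the tidyverse; loading it alone is enough here

# Base R: Species is a factor, so its levels are the group names
levels(iris$Species)

# dplyr: keep only the unique values of the Species column
iris %>%
  distinct(Species)
```

Both show the three species names, but neither tells us how many rows each one has.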

count()

Make friends with this function, hahaha

We can use it as:

iris %>%
  count(Species)

Now you should have something like this:

Species	n
setosa	50
versicolor	50
virginica	50

We get one column with the groups and another with the number of occurrences of each one in our data set.

There are 50 samples of each species: Iris setosa, Iris versicolor and Iris virginica.
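To see what count() is doing for us, here is a minimal sketch that gets the same numbers the long way, with group_by() and summarise() (both explained later in this post). The object names by_count and by_summarise are just made up for this example:

```r
library(dplyr)  # part of the tidyverse; loading it alone is enough here

# The short way: count() groups by Species and counts the rows
by_count <- iris %>%
  count(Species)

# The long way: group first, then count the rows of each group with n()
by_summarise <- iris %>%
  group_by(Species) %>%
  summarise(n = n())

# Both approaches should give the same counts per species
all.equal(by_count$n, by_summarise$n)
```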

Counting with weights

If we want to sum the values of another variable within each group instead of counting rows, we just add a comma inside the function and pass the argument wt = column_name:

iris %>%
  count(Species, wt = Petal.Length)
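As a rough check of what wt does, the sketch below compares the weighted count with an explicit sum per group. It relies on my reading of the dplyr documentation, where n becomes the sum of the wt column; the names weighted and summed are hypothetical:

```r
library(dplyr)  # part of the tidyverse; loading it alone is enough here

# Weighted count: n is the sum of Petal.Length in each species
weighted <- iris %>%
  count(Species, wt = Petal.Length)

# The same idea written out with group_by() and summarise()
summed <- iris %>%
  group_by(Species) %>%
  summarise(n = sum(Petal.Length))

# The two n columns should match
all.equal(weighted$n, summed$n)
```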

Arguments are a useful way to improve our analysis by using the full potential of a function. To read about the arguments of a function, just type “?” followed by the name of the function in RStudio:

?count()

All the documentation related to the function should appear in a side window. There you can see another cool argument of count(): sort. The default is FALSE, but we can switch it to TRUE to order the result from the largest n to the smallest:

iris %>%
  count(Species, wt = Petal.Length, sort = TRUE)

This way, we don’t need to use the arrange() function:

Species	n
virginica	277.6
versicolor	213.0
setosa	73.1

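Flowers of the Iris virginica species have a much larger total petal length than those of Iris setosa. Since each species has 50 samples, this suggests that virginica petals are longer on average. Yay, information with only one function!

Ok, cool. But what is the mean petal size for these species?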

Now let’s gather some stats.

group_by() and summarise()/summarize()

Two other functions that are really useful to understand our data set are group_by() and summarise() (or summarize() for those who prefer American English).

summarise()

This function is really useful to explore and reduce data at the same time. With it we can obtain different information from our data set:

iris %>%
  # Summarising to find the minimum, maximum, and mean sepal length in the whole data set
  summarise(min_sepal = min(Sepal.Length),
            max_sepal = max(Sepal.Length),
            mean_sepal = mean(Sepal.Length))

min_sepal	max_sepal	mean_sepal
4.3	7.9	5.84

It’s also possible to get several pieces of information about different columns at the same time:

iris %>%
  # Summarising to find the minimum petal length, the maximum sepal length, and the mean of both
  summarise(min_petal = min(Petal.Length),
            max_sepal = max(Sepal.Length),
            mean_sepal = mean(Sepal.Length),
            mean_petal = mean(Petal.Length),
            total_n = n())

min_petal	max_sepal	mean_sepal	mean_petal	total_n
1	7.9	5.843333	3.758	150

Pretty nice!
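If you need the same statistics for several columns, one possible shortcut is across(). This is only a sketch and assumes dplyr 1.0.0 or later; the object name stats is made up for this example:

```r
library(dplyr)  # part of the tidyverse; loading it alone is enough here

stats <- iris %>%
  summarise(
    # Apply min, max and mean to both length columns in one go
    across(
      c(Sepal.Length, Petal.Length),
      list(min = min, max = max, mean = mean)
    ),
    total_n = n()
  )

# Columns are named <column>_<function>, for example Petal.Length_mean
stats
```

The trade-off is that the column names are generated for you, so they are longer than the ones we typed by hand above.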

group_by()

Now if we want the same stats, but for groups, we can simply add the function group_by() to our pipe before summarise():

iris %>%
  # Using summarise() combined with group_by()
  group_by(Species) %>% # now we will have stats for each group
  summarise(min_petal = min(Petal.Length),
            max_sepal = max(Sepal.Length),
            mean_sepal = mean(Sepal.Length),
            mean_petal = mean(Petal.Length))

We will have this table:

Species	min_petal	max_sepal	mean_sepal	mean_petal
setosa	1	5.8	5.01	1.46
versicolor	3	7	5.94	4.26
virginica	4.5	7.9	6.59	5.55
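Grouped summaries also combine nicely with the functions from the previous post. As a small sketch, here we add the group size and the standard deviation, then sort the species with arrange(). The column names sd_petal and n and the object name species_stats are my own choices:

```r
library(dplyr)  # part of the tidyverse; loading it alone is enough here

species_stats <- iris %>%
  group_by(Species) %>%
  summarise(mean_petal = mean(Petal.Length),
            sd_petal = sd(Petal.Length),  # spread of petal length within each species
            n = n()) %>%                  # number of rows in each group
  arrange(desc(mean_petal))               # largest mean petal length first

species_stats
```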

top_n()

This function helps us find the top values in our data. Here we keep the row with the largest Sepal.Length in each species:

iris %>%
  group_by(Species) %>%
  top_n(1, Sepal.Length)

Sepal.Length	Sepal.Width	Petal.Length	Petal.Width	Species
<dbl>	<dbl>	<dbl>	<dbl>	<fct>
5.8	4	1.2	0.2	setosa
7	3.2	4.7	1.4	versicolor
7.9	3.8	6.4	2	virginica
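A note if you are on a recent dplyr: since version 1.0.0, top_n() is marked as superseded in favour of slice_max() and slice_min(). The sketch below should return the same three rows; the object name longest_sepals is hypothetical:

```r
library(dplyr)  # part of the tidyverse; assumes dplyr 1.0.0 or later

longest_sepals <- iris %>%
  group_by(Species) %>%
  # Keep the row with the largest Sepal.Length in each species;
  # ties are kept by default (with_ties = TRUE), as with top_n()
  slice_max(Sepal.Length, n = 1)

longest_sepals
```

top_n() still works, so there is no rush to change existing code.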

Now you have even more tools to analyse and manipulate data. I hope this was useful!

In a future post, I will write about the abbreviations that appeared in our tables: <dbl>, <fct>, <chr>, etc.

Thank you for your time!

Follow me on Twitter: @gimbgomes

Grouping and counting in the tidyverse