In this post, I will show you how to group and count your data to generate some stats. During an analysis it’s pretty common to spend some time with EDA.

EDA stands for Exploratory Data Analysis.
We do an EDA by using functions that help us understand our data, and also with graphs that we are going to develop in the future.
The functions for today are count(), summarise(), group_by() and top_n(). Just a reminder that in an earlier post I wrote about the following functions:
- select()
- arrange()
- filter()
- mutate()
Ok, since we are dealing with the tidyverse, don’t forget to load it in your session:
library(tidyverse)
We will continue to use the iris data set, which is already built into R.
Later we will use the function count() to find the total number of each species in our data set. First, let’s take a look at the first rows with head():
head(iris)
This should give you something like this:
| Sepal.Length | Sepal.Width | Petal.Length | Petal.Width | Species |
|---|---|---|---|---|
| 5.1 | 3.5 | 1.4 | 0.2 | setosa |
| 4.9 | 3.0 | 1.4 | 0.2 | setosa |
| 4.7 | 3.2 | 1.3 | 0.2 | setosa |
| 4.6 | 3.1 | 1.5 | 0.2 | setosa |
| 5.0 | 3.6 | 1.4 | 0.2 | setosa |
| 5.4 | 3.9 | 1.7 | 0.4 | setosa |
Notice that the column Species holds a value that can be
used to form groups in our data set.
Well, to get a better understanding of the values in each column, and also to create a strategy to deal with the data, just use the command glimpse():
glimpse(iris)
This should give you something like this:
Rows: 150
Columns: 5
$ Sepal.Length <dbl> 5.1, 4.9, 4.7, 4.6, 5.0, 5~
$ Sepal.Width <dbl> 3.5, 3.0, 3.2, 3.1, 3.6, 3~
$ Petal.Length <dbl> 1.4, 1.4, 1.3, 1.5, 1.4, 1~
$ Petal.Width <dbl> 0.2, 0.2, 0.2, 0.2, 0.2, 0~
$ Species <fct> setosa, setosa, setosa~
It seems that we have a column Species with names that we can group, but what are those names? There are several ways to get this answer, and one of them is count().
Be a friend of this function, hahaha.
We can use it as:
iris %>%
  count(Species)
Now you should have something like this:
| Species | n |
|---|---|
| setosa | 50 |
| versicolor | 50 |
| virginica | 50 |
A column with the groups and another with the number of occurrences of each one in our data set.
There are 50 samples of each species: Iris setosa, Iris versicolor and Iris virginica.
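One way to think about count() is as a shortcut for grouping and then counting rows. Here is a minimal sketch of that idea, assuming only dplyr and the built-in iris data (the names by_count and by_hand are just illustrative):

```r
library(dplyr)

# count() in a single step
by_count <- iris %>%
  count(Species)

# The same counts written out with group_by() and summarise()
by_hand <- iris %>%
  group_by(Species) %>%
  summarise(n = n())

# Both approaches should report the same number of rows per species
all.equal(by_count$n, by_hand$n)
```

If the two versions agree, all.equal() returns TRUE.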
If we want to add up the values of another variable inside our counting, instead of just counting rows, we put a comma in the function and call the argument wt = column_name. The n column then holds the sum of that column for each group:
iris %>%
  count(Species, wt = Petal.Length)
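To see what wt is doing, the weighted count can be compared with an explicit sum() per group. This is only a small sketch for checking the idea; the names weighted and summed are made up for the example:

```r
library(dplyr)

# Weighted count: n is the total Petal.Length per species
weighted <- iris %>%
  count(Species, wt = Petal.Length)

# The same totals computed by hand
summed <- iris %>%
  group_by(Species) %>%
  summarise(n = sum(Petal.Length))

# The two n columns should match
all.equal(weighted$n, summed$n)
```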
Arguments are a useful way of improving our analysis by using the full potential of a function. To read about the arguments of a function, just call “?” with the name of the function inside RStudio:
?count
All the documentation related to the function should appear in a side window. There you can see another cool argument of the count() function: sort. The default is FALSE, but we can switch it to TRUE in the following way:
iris %>%
  count(Species, wt = Petal.Length, sort = TRUE)
This way the result comes back sorted from the largest n to the smallest, and we don’t need to use the arrange() function:
| Species | n |
|---|---|
| virginica | 277.6 |
| versicolor | 213.0 |
| setosa | 73.1 |
Since every species has 50 samples, the larger total means that petals from the Iris virginica species are much longer than those from Iris setosa. Yay, information with only one function!
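If you are curious, sort = TRUE should give the same order as piping the result into arrange(desc(n)). A quick sketch to check that, with illustrative object names:

```r
library(dplyr)

# Sorting inside count()
sorted_in_count <- iris %>%
  count(Species, wt = Petal.Length, sort = TRUE)

# Sorting afterwards with arrange()
sorted_by_hand <- iris %>%
  count(Species, wt = Petal.Length) %>%
  arrange(desc(n))

# Compare the order of the species in both results
identical(as.character(sorted_in_count$Species),
          as.character(sorted_by_hand$Species))
```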
Ok, cool. But what is the mean petal size for each of these species?
Now let’s gather some stats.
Two other functions that are really useful to understand our data set are group_by() and summarise() (or summarize() for those who prefer American English).
summarise() is really useful to explore and reduce data at the same time. You will see that with it we can obtain different information from our data set:
iris %>%
  # Summarising to find the minimum, maximum, and mean sepal length in the whole data set
  summarise(min_sepal = min(Sepal.Length),
            max_sepal = max(Sepal.Length),
            mean_sepal = mean(Sepal.Length))
| min_sepal | max_sepal | mean_sepal |
|---|---|---|
| 4.3 | 7.9 | 5.84 |
It’s also possible to get several pieces of information at the same time:
iris %>%
  # Summarising petal and sepal length together, plus the total number of rows
  summarise(min_petal = min(Petal.Length),
            max_sepal = max(Sepal.Length),
            mean_sepal = mean(Sepal.Length),
            mean_petal = mean(Petal.Length),
            total_n = n())
| min_petal | max_sepal | mean_sepal | mean_petal | total_n |
|---|---|---|---|---|
| 1 | 7.9 | 5.843333 | 3.758 | 150 |
Pretty nice!
Now if we want the same stats, but for groups, we can simply add the
function group_by() to our pipe before
summarise():
iris %>%
  # Using summarise() combined with group_by()
  group_by(Species) %>% # now we will have stats for each group
  summarise(min_petal = min(Petal.Length),
            max_sepal = max(Sepal.Length),
            mean_sepal = mean(Sepal.Length),
            mean_petal = mean(Petal.Length))
We will have this table:
| Species | min_petal | max_sepal | mean_sepal | mean_petal |
|---|---|---|---|---|
| setosa | 1 | 5.8 | 5.01 | 1.46 |
| versicolor | 3 | 7 | 5.94 | 4.26 |
| virginica | 4.5 | 7.9 | 6.59 | 5.55 |
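These grouped means connect back to the totals we got from count() with wt: dividing the total Petal.Length of a species by its number of flowers should give the same mean. A minimal sketch of that check, where the column names total_petal and n_flowers are just choices for this example:

```r
library(dplyr)

grouped <- iris %>%
  group_by(Species) %>%
  summarise(total_petal = sum(Petal.Length),  # same totals as count(wt = Petal.Length)
            n_flowers = n(),                  # same counts as count(Species)
            mean_petal = mean(Petal.Length))

# total / count should match the mean for every species
all.equal(grouped$total_petal / grouped$n_flowers, grouped$mean_petal)
```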
The top_n() function helps us find the top values in our data. Here we keep the flower with the largest Sepal.Length in each species:
iris %>%
  group_by(Species) %>%
  top_n(1, Sepal.Length)
| Sepal.Length | Sepal.Width | Petal.Length | Petal.Width | Species |
|---|---|---|---|---|
| <dbl> | <dbl> | <dbl> | <dbl> | <fct> |
| 5.8 | 4 | 1.2 | 0.2 | setosa |
| 7 | 3.2 | 4.7 | 1.4 | versicolor |
| 7.9 | 3.8 | 6.4 | 2 | virginica |
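A side note: in recent versions of dplyr (1.0.0 and later), top_n() is marked as superseded in favour of slice_max() and slice_min(). Assuming you have one of those versions installed, the same idea could be sketched like this (top_flowers is just an illustrative name):

```r
library(dplyr)

# Keep the row(s) with the largest Sepal.Length within each species
top_flowers <- iris %>%
  group_by(Species) %>%
  slice_max(Sepal.Length, n = 1)

top_flowers
```

Like top_n(), slice_max() keeps ties by default, so a group can return more than one row when several flowers share the maximum.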
Now you have even more tools to analyse and manipulate data. I hope this was useful!
In a future post, I will write about these abbreviations that appeared in our table: <dbl>, <fct>, <chr>, etc.
Follow me on twitter: @gimbgomes