In this post I show the main tools we use to analyse data in R with the tidyverse, and touch on variation and on categorical and continuous variables.

In the last post I wrote about histograms and the importance of
understanding frequency in our data. Today we are going to perform some
exploratory data analysis on a data set that ships with the tidyverse,
so we start by loading the package:
library(tidyverse)
During this phase of exploratory analysis, we should acquire a better understanding of our data. The best way to understand data is to question what is in it. Questions are tools that help us a lot, because they guide our exploration. Since we are always improving our tools, we are also always improving our questions. During exploration we will keep changing our questions to make them more precise and to help us develop a broad comprehension of our data. This part is really all about curiosity and creativity.
Although there is no single right way to explore data, Hadley Wickham puts it this way:
“There are two types of questions that will certainly help you find out more about your data: 1. What type of variation exists within my variables? 2. What type of covariation occurs between my variables?”
Understanding these concepts can be a little complicated at first, but at bottom our main goal here is to ask lots of questions and to improve them along the way.
Allow me to paraphrase the book “R for Data Science” here to explain what variation and covariation are.
Variation is the tendency of the values of a variable to change from measurement to measurement. You can see variation easily in real life; if you measure any continuous variable twice, you will get two different results.
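
Before drawing anything, variation can also be glimpsed numerically. The following is only a minimal sketch, assuming the cty (city miles per gallon) column of the mpg data set that is used later in this post; the summary column names are made up for illustration.

```r
library(tidyverse)

# Numeric glimpse of how city mileage (cty) varies across the cars in mpg.
# The column names n_values, lowest, highest and std_dev are illustrative.
cty_variation <- mpg %>%
  summarise(
    n_values = n_distinct(cty),  # how many different values appear
    lowest   = min(cty),         # smallest observed value
    highest  = max(cty),         # largest observed value
    std_dev  = sd(cty)           # spread around the mean
  )

cty_variation
```

If every car had the same mileage, n_values would be 1 and std_dev would be 0, which would mean no variation at all.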
One of the best ways of analysing this variation is with graphical tools!
A good way to visualize the distribution of a categorical variable is the classic bar chart.
Let’s use the ggplot2 package to plot the manufacturer column of the mpg data set:
ggplot(data = mpg) +
  geom_bar(mapping = aes(x = manufacturer))

The height of each bar is simply the number of rows per manufacturer, the same numbers we get from:
mpg %>% count(manufacturer)
| manufacturer | n |
|---|---|
| <chr> | <num> |
| audi | 18 |
| chevrolet | 19 |
| dodge | 37 |
| ford | 25 |
| honda | 9 |
| hyundai | 14 |
| jeep | 8 |
| land rover | 4 |
| lincoln | 3 |
| mercury | 4 |
| nissan | 13 |
| pontiac | 5 |
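
As a side note, the same tally can be listed from the most to the least frequent manufacturer, which makes the numbers easier to compare. This is a small sketch that relies only on the sort argument of dplyr's count(); the name manufacturer_counts is just an illustrative choice.

```r
library(tidyverse)

# Same count as above, but sorted from the largest group to the smallest
manufacturer_counts <- mpg %>%
  count(manufacturer, sort = TRUE)

manufacturer_counts
```

The first row is the manufacturer with the most cars in the sample, so the ranking can be read straight from the top of the table.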
Our chart still has visualization problems: first, the bars are
unordered, which makes comparisons hard; second, the manufacturer
names overlap on the x axis. We can fix the first issue with the
function fct_infreq of the forcats package (a
great package for dealing with factors in R; it’s already loaded in our
session with the tidyverse, oh, and its name is also an anagram of “factors”
;D):
mpg %>%
  ggplot() +
  geom_bar(mapping = aes(x = fct_infreq(manufacturer)))

Cool, right? If we want to reverse the order, we can wrap it in fct_rev:
mpg %>%
  ggplot() +
  geom_bar(mapping = aes(x = fct_rev(fct_infreq(manufacturer))))
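
To see what these two forcats functions do without a chart, we can inspect the factor levels directly. This is only a sketch; the variable name by_frequency is made up for illustration.

```r
library(tidyverse)

# fct_infreq() orders the levels from the most to the least frequent value
by_frequency <- fct_infreq(mpg$manufacturer)
levels(by_frequency)

# fct_rev() reverses whatever level order it is given
levels(fct_rev(by_frequency))
```

ggplot2 draws the bars in level order, which is why changing the levels changes the order of the bars.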

The names still overlap. To fix that, we can flip the axes of our chart:
mpg %>%
  ggplot() +
  geom_bar(mapping = aes(x = fct_rev(fct_infreq(manufacturer)))) +
  coord_flip()

Or, with the bars in the opposite order:
mpg %>%
  ggplot() +
  geom_bar(mapping = aes(x = fct_infreq(manufacturer))) +
  coord_flip()
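
As an alternative sketch, and assuming ggplot2 version 3.3.0 or later, horizontal bars can also be drawn by mapping the variable to y instead of x, with no coord_flip() call. The name horizontal_bars and the axis label are illustrative choices, not part of the original analysis.

```r
library(tidyverse)

# Assumes ggplot2 >= 3.3.0, where geom_bar() accepts a y-only mapping
# and draws the bars horizontally
horizontal_bars <- mpg %>%
  ggplot() +
  geom_bar(mapping = aes(y = fct_rev(fct_infreq(manufacturer)))) +
  labs(y = "manufacturer")  # replace the long default axis title

horizontal_bars
```

Either approach gives room for the manufacturer names; which one to use is mostly a matter of taste.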

It seems that in this sample of cars there are many more Dodges than Lincolns. What could have happened to these companies over time? Well, that is a question for another moment. For now, let’s continue to analyse variation!
In the last post we made histograms to analyse continuous variables, that is, variables that hold numbers, dates or times.
Another extremely useful function for analysing this type of
variation is geom_freqpoly():
ggplot(data = mpg, mapping = aes(x = cty, color = as.factor(year))) +
  geom_freqpoly(binwidth = 1)

Notice how this function compares two distributions at once, one line per year, showing us where most of our data accumulates, that is, which values show up most frequently.
Here is how it is calculated:
This type of chart divides the x axis into equally spaced
bins, then uses the height of the line at each bin to show the number of observations that fall in it.
This graph is even more useful when we have more than one category in our data set.
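
The binning described above can be approximated by hand. This is a minimal sketch that uses ggplot2's cut_width() helper to split cty into bins one unit wide and then counts rows per year and bin; that these bins line up exactly with the ones geom_freqpoly() draws is an assumption, and the name cty_bins is illustrative.

```r
library(tidyverse)

# Count observations per year and per bin of cty, each bin 1 mpg wide,
# mirroring binwidth = 1 in the geom_freqpoly() call
cty_bins <- mpg %>%
  count(year, bin = cut_width(cty, width = 1))

cty_bins
```

Each row is one point on one of the two lines: the bin gives the position on the x axis and n gives the height.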
From here, our analysis could go in many directions.
There are many options! We will continue to develop these ideas in the next post!