Exploratory Data Analysis tools using R

In this post I’m showing the main tools that we use to analyse data in R with the tidyverse package. Also a little about variation, categorical and continuous variables.

Guilherme Bastos Gomes https://gbastosg.github.io/guilhermesportfolio/
2022-05-06

Hello!

In the last post I wrote about histograms and the importance of understanding frequency in our data. Today we are going to perform some exploratory data analysis in a data set from tidyverse, therefore we should load the package:

library(tidyverse)

Exploratory Data Analysis (EDA)

During this phase of exploratory analysis, we should acquire a better understand of our data. The best way to understand data, is to question what is in it. Questions are tools that helps us a lot, they guide our exploration. Since we are always improving out tools, we are also always improving our questions. During exploration we will be mutating our questions to make them more precise and to help us develop a wide comprehension of our data. This part is really all about curiosity and creativity.

Although there is no right way of exploring data, Hadley Wickham puts it in this way:

“There are two types of questions that will certainly help you find out more about your data: 1. What type of variation exists within my variables? 2. What type of covariation occurs between my variables?

Understanding this concepts could be a little complicated in the start, but in the bottom of it our main goal here is to ask lots of questions and to improve them along.

Variation

Allow me to rephrase the book “R for data science” here, to explain what is variation or co variation in this post.

Variation is the tendency of the values of a variable to change from measurement to measurement. You can see variation easily in real life; if you measure any continuous variable twice, you will get two different results.

One of the best ways of analising this variations is by using graphic tools!

Visualizing distributions

A good visualization of a distribution with categorical values is by using the classical bar chart:

Bar chart

Let’s use the ggplot package to analyse the data set mpg that possesses the column manufacturer:

ggplot(data = mpg) +
geom_bar(mapping = aes(x = manufacturer))

This kind of graphic show us something similar to:

mpg %>% count(manufacturer)
manufacturer n
<chr> <num>
audi 18
chevrolet 19
dodge 37
ford 25
honda 9
hyundai 14
jeep 8
land rover 4
lincoln 3
mercury 4
nissan 13
pontiac 5

Rearraging bars from the geom_bar() function by count

Our chart still has visualization problems, first because it’s unordered and that makes it hard to perform comparisons, second, because names are still overposed. We can fix the first issue by using the function fct_infreq of the forcats package (a great package to deal with factors in R, it’s already loaded in our space within the tidyverse, oh and it’s also an anagram ;D):

mpg %>%
ggplot() +
geom_bar(mapping = aes(x = fct_infreq(manufacturer)))

Cool right, if we want to reverse the order:

mpg %>%
ggplot() +
geom_bar(mapping = aes(x = fct_rev(fct_infreq(manufacturer))))

Names are still overposing, to change that we can revert the axes of our chart:

mpg %>%
ggplot() +
geom_bar(mapping = aes(x = fct_rev(fct_infreq(manufacturer)))) +
coord_flip()

Inversely:

mpg %>%
ggplot() +
geom_bar(mapping = aes(x = fct_infreq(manufacturer))) +
coord_flip()

It seems that in this car sample there are many more dodges than lincolns, what could have happened to these companies during time? Well, this is a question for another moment. For now, let’s continue to analyse variations!

In the last post we’ve done histograms to analyse continuous variables, that is variables that possess numbers, dates or times.

Another extremely useful function is to analyse this types of variation is geom_freqpoly()

ggplot(data = mpg, mapping = aes(x = cty, color = as.factor(year))) +
  geom_freqpoly(binwidth = 1)

Notice how this function compared two histograms at the same time, showing us where most of our data is accumulated, or show up more frequently.

Here is the calculation for it:

This type of chart divides the x axis into equally spaced bins, then it uses the bar height (peak of the line) to show the number of observations that are in each of the bins.

This graph is even more useful when we have more than one category in our data set.

From this moment on, we could go to many sides of our analysis:

There are many options! We shall continue to develop these ideas in the next post!!!

Thank you!