In this post I write about the correlation between two variables: how to compute it and how to interpret it.

Correlation is a statistical measure that expresses the degree of association between two variables. It’s important to say that correlation measures the strength of an association between our numbers; it doesn’t show us whether one CAUSES the other, or vice-versa. It also doesn’t show us whether the association happens because of another variable in our data.
From two variables we can obtain a numeric value known as the correlation coefficient, which varies from -1 to 1.
This coefficient describes the relation between the two distributions: when the association is positive, the coefficient is closer to 1; when one variable has the inverse effect on the other, the coefficient approaches -1. If variation in one variable doesn’t affect the other, there is no linear correlation and the value approaches 0.
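To make the extremes concrete, here is a minimal sketch (not from the original post) using three toy vectors: one that moves exactly with x, one that moves exactly against it, and one that is just random noise:

x <- 1:10
cor(x, 2 * x)      # perfect positive association: returns 1
cor(x, -2 * x)     # perfect negative association: returns -1
set.seed(42)       # hypothetical seed, just for reproducibility
cor(x, rnorm(10))  # unrelated noise: a value close to 0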
There are several methods to obtain the correlation coefficient in R; let’s test some of them on the mpg data set:
library(tidyverse)
The correlation coefficient can be calculated using the functions cor() or cor.test(), where:
- cor() gives us the correlation coefficient
- cor.test() gives us an association test between two samples of our variables; besides the correlation coefficient, this function also gives us the significance level of the test, the p-value. We will talk more about it in the future!
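Before applying them to real data, here is a quick sketch (simulated values, hypothetical seed) contrasting what each function returns:

set.seed(123)                 # hypothetical seed, for reproducibility
a <- rnorm(50)
b <- a + rnorm(50, sd = 0.5)  # b is built to be associated with a
cor(a, b)                     # a single number: the coefficient
cor.test(a, b)                # a full test: coefficient, p-value, confidence interval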
We can use these functions, together with a plot, in the following way:
Is there a correlation between the variables displ and cty?

The code:
mpg %>%
  ggplot(aes(x = displ, y = cty, color = class)) +
  geom_point() +
  labs(
    title = "Motor efficiency of different cars",
    subtitle = "by: Guilherme Bastos Gomes",
    caption = "Source: mpg data set",
    x = "Engine displacement (litres)",
    y = "City miles per gallon"
  ) +
  theme_classic()
Notice that there is a trend in our plot, which can be modeled with a line using the function geom_smooth:

Check how the line explains (in its own way) the relation between the variables:
mpg %>%
  ggplot(aes(x = displ, y = cty, color = class)) +
  geom_point() +
  geom_smooth(method = "lm") +
  labs(
    title = "Motor efficiency of different cars",
    subtitle = "by: Guilherme Bastos Gomes",
    caption = "Source: mpg data set",
    x = "Engine displacement (litres)",
    y = "City miles per gallon"
  ) +
  theme_classic()
The line points in the direction of the correlation; here we have:
As the value of displ increases, the value of cty decreases.
Now let’s use correlation tests to obtain the correlation coefficient:
cor(mpg$displ, mpg$cty, method = "pearson")
[1] -0.798524
Notice how the correlation coefficient is close to -1, indicating a negative correlation.
The three most used methods to obtain a correlation coefficient are "pearson", "spearman", and "kendall".
To gather more evidence about the correlation, we can compute the coefficient with the other methods as well:
cor(mpg$displ, mpg$cty, method = "spearman")
[1] -0.8809049
cor(mpg$displ, mpg$cty, method = "kendall")
[1] -0.7210828
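As a compact alternative (a sketch, not part of the original post), sapply() can compute all three coefficients in one call:

# compute the coefficient for each method over the same pair of variables
methods <- c("pearson", "spearman", "kendall")
sapply(methods, function(m) cor(mpg$displ, mpg$cty, method = m))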
Notice how the Spearman method gave us a correlation coefficient even closer to -1! It’s important to say that correlation shows us a pattern between variables, but it doesn’t tell us that one is CAUSING the other.
In our data we are showing the negative relationship between displ and cty: the bigger the engine, the more gallons are needed per mile in the city. The explanation for this kind of correlation may seem obvious, but it shows that engines regarded as powerful tend to consume more fuel, which directly affects whoever pays for the product.
In the last post we investigated the correlation between the variables Petal.Length and Petal.Width for each species of the iris data set:
iris %>%
  ggplot(aes(x = Petal.Length, y = Petal.Width, color = Species)) +
  geom_point() +
  labs(
    title = "Comparison between petal sizes in 3 species of the genus Iris",
    subtitle = "by: Guilherme Bastos Gomes",
    caption = "Source: Edgar Anderson's Iris Data set",
    x = "Petal.Length",
    y = "Petal.Width"
  ) +
  theme_classic() +
  facet_grid(~Species) +
  geom_smooth(method = "lm")

Now let’s do a correlation test between the variables Petal.Length and Petal.Width:
cor.test(iris$Petal.Length, iris$Petal.Width,
         method = "pearson")
	Pearson’s product-moment correlation

data:  iris$Petal.Length and iris$Petal.Width
t = 43.387, df = 148, p-value < 2.2e-16
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 0.9490525 0.9729853
sample estimates:
      cor 
0.9628654 
In the results we obtained from the test we can see:
- t is the t-test statistic (t = 43.387)
- df represents the degrees of freedom (df = 148)
- p-value is the significance level of the t-test (p-value < 2.2e-16)
- conf.int is the confidence interval of the correlation coefficient at 95% (conf.int = [0.9490525, 0.9729853])
- sample estimates gives us the correlation coefficient (cor = 0.9628654)
Each of these quantities can also be pulled out of the returned object, as sketched below.
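A minimal sketch (these element names are the standard components of the "htest" object that cor.test() returns):

res <- cor.test(iris$Petal.Length, iris$Petal.Width, method = "pearson")
res$statistic   # the t statistic: 43.387
res$parameter   # the degrees of freedom: 148
res$conf.int    # the confidence interval: 0.9490525 0.9729853
res$estimate    # the correlation coefficient: 0.9628654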
Here we have a strong positive correlation, with strong evidence that one measure varies together with the other, which allows us to generate ideas about the petal sizes of these three species as we find them in nature.
Going back to the mpg data set, where we have a negative correlation between the variables:
ct <- cor.test(mpg$displ, mpg$cty,
               method = "pearson")
ct
	Pearson’s product-moment correlation

data:  mpg$displ and mpg$cty
t = -20.205, df = 232, p-value < 2.2e-16
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 -0.8406782 -0.7467508
sample estimates:
      cor 
-0.798524 
The cor.test() function returns a list, and we can access its elements directly:
ct$p.value
ct$estimate
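These elements are handy for reuse. For example (a hypothetical use, not from the original post), we can format the stored coefficient for a plot annotation:

# round the named estimate and build a label string
paste0("Pearson's r = ", round(ct$estimate, 3))  # "Pearson's r = -0.799"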
In statistics, the p-value is the probability of obtaining results at least as extreme as the ones observed in a hypothesis test, assuming that the null hypothesis is true.
I won’t go into the details of all of this right now; I will write about the p-value in another post!
In general, for any two numeric columns:
cor.test(dataframe$column1, dataframe$column2,
         method = "pearson")