Hello!

In this post we will analyse two continuous variables in the same plot, which is a very useful way to understand the relationship between variables.

In the last post, we talked about variation and in this one we are going to talk about covariation!

Covariation

If variation describes the behaviour within a variable, covariation describes the behaviour between variables.

Covariation is the tendency of the values of two or more variables to vary together.
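To make that definition a bit more concrete, here is a minimal sketch using the built-in iris data set (the same data used in the plots in this post). It only counts how often a flower with an above-average petal length also has an above-average petal width. The names long_petal, wide_petal and agreement are just illustrative, and comparing against the mean is an assumption made for the sake of the example, not a formal measure of covariation.

```r
# Flag flowers whose petal length / width is above the overall average
long_petal <- iris$Petal.Length > mean(iris$Petal.Length)
wide_petal <- iris$Petal.Width > mean(iris$Petal.Width)

# Cross-tabulate the two flags: if the variables vary together,
# most flowers fall on the diagonal (FALSE/FALSE or TRUE/TRUE)
table(long_petal, wide_petal)

# Share of flowers where both flags agree
agreement <- mean(long_petal == wide_petal)
agreement
```

If most flowers land on the diagonal of the table, the two variables tend to be high together and low together, which is what covariation is about.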

A good way of understanding these relations is by using Scatter Plots

A Scatter Plot shows the relationship between two continuous variables. We can build one in R using the function geom_point() from the ggplot2 package.

Since ggplot2 is part of the tidyverse, you have to load tidyverse to make it work in your R session:

library(tidyverse)

The code:

library(tidyverse)
library(ggthemes)


iris %>% 
  ggplot(aes(x = Sepal.Length, y = Sepal.Width, color = Species)) +  
    geom_point() + 
    labs( 
      title = "Measurements in centimeters of flowers from 3 species",
      subtitle = "by: Guilherme Bastos Gomes", 
      caption = "Source: Edgar Anderson's Iris Data set", 
      x = "Sepal Length",
      y = "Sepal Width" 
    ) +
    theme_classic()

Notice how this plot gives us information about the two variables together and, through colour, about the categorical variable (Species):

The Iris setosa group tends to have shorter sepals but greater sepal widths. These points form a distinct cluster.
The other two groups, Iris versicolor and Iris virginica, overlap much more with each other than with Iris setosa. Still, Iris virginica seems to have greater sepal lengths, which separates it from the others.
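A quick way to check these visual impressions is to look at the group averages. This is only a sketch using base R's aggregate(); the name sepal_means is mine, and a mean per group is just one possible summary.

```r
# Mean sepal length and width for each species
sepal_means <- aggregate(
  cbind(Sepal.Length, Sepal.Width) ~ Species,
  data = iris,
  FUN = mean
)
sepal_means
```

In this summary Iris setosa has the smallest average sepal length and the largest average sepal width, while Iris virginica has the largest average sepal length, which agrees with what the plot suggests.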

Therefore we can start to understand how these numbers behave inside their groups. From this point the next step would be to model, but let’s continue working with Scatter Plots for now, since they are really powerful, mainly when we are dealing with patterns.

Patterns can bring us to models

Patterns in data provide evidence about the relationships between variables. When facing a pattern, we should ask ourselves the questions suggested in the “R for Data Science” book:

Could this pattern be happening by coincidence (randomly)?
How is it possible to describe the relationship evidenced by the pattern?
How strong is the relationship shown by the pattern?
Which other variables can affect this relationship?
Does this relationship change if we look at subgroups in the data?

In this data set there are groups defined by a categorical variable, Species, so we can do the following to analyse the relationship between the length and width of the petals within each group:

iris %>% 
  ggplot(aes(x = Petal.Length, y = Petal.Width, color = Species)) +  
    geom_point() + 
    labs( 
      title = "Measurements in centimeters of flowers from 3 species",
      subtitle = "by: Guilherme Bastos Gomes", 
      caption = "Source: Edgar Anderson's Iris Data set", 
      x = "Petal Length",
      y = "Petal Width" 
    ) +
    theme_classic() +
    facet_grid(~Species)

Here we are using the function facet_grid() to divide our graph into the groups defined by the variable Species.
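As a small sketch of how the facets can be laid out, facet_grid() can take the faceting variable either as columns or as rows. The object names p, p_cols and p_rows below are just illustrative, and I am assuming a ggplot2 version recent enough (3.0.0 or later) to support the rows and cols arguments with vars().

```r
library(ggplot2)

# Base scatter plot, reused for both layouts
p <- ggplot(iris, aes(x = Petal.Length, y = Petal.Width, color = Species)) +
  geom_point()

# One panel per species, side by side (same layout as facet_grid(~Species))
p_cols <- p + facet_grid(cols = vars(Species))

# One panel per species, stacked vertically
p_rows <- p + facet_grid(rows = vars(Species))

p_cols
p_rows
```

Side-by-side panels line up on a common y axis, which makes the widths easy to compare; stacked panels line up on a common x axis, which makes the lengths easy to compare.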

Now we can analyse the relationship between the two variables within each group. Could we somehow model these relationships? We will in the future!

For the next plot, I needed to install the hexbin package:

install.packages("hexbin")

Then:


library(hexbin)
iris %>% 
  ggplot(aes(x = Petal.Length, y = Petal.Width)) +  
    geom_hex() + 
    # Labels describe the iris petal measurements being plotted
    labs( 
      title = "Measurements in centimeters of flowers from 3 species",
      subtitle = "by: Guilherme Bastos Gomes", 
      caption = "Source: Edgar Anderson's Iris Data set", 
      x = "Petal Length",
      y = "Petal Width" 
    ) +
    facet_grid(~Species)

Observing several Scatter Plots of your data can be a very powerful tool to understand and explain relationships.

Using R base to plot several scatter plots

The pairs() function from base R helps us plot several relationships at the same time.
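A minimal sketch of the simplest call, assuming we only want the four numeric measurement columns of iris (the name measurements is mine):

```r
# Keep only the four numeric columns (drop Species)
measurements <- iris[, 1:4]

# Draw a scatter plot for every pair of variables
pairs(measurements)
```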

To add groups in this kind of plot, we should first define a vector with the desired colours:

colours <- c("#0000FF","#00FF00","#FF0000")
pairs(iris[,1:4], pch = 19,  cex = 0.5,
      col = colours[iris$Species],
      lower.panel=NULL)
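The indexing colours[iris$Species] works because Species is a factor: when a factor is used as an index, R uses its integer codes, so each level picks one position in the colour vector. Here is a small sketch to see this (the colours vector is repeated so the snippet runs on its own, and species_colours is just an illustrative name):

```r
colours <- c("#0000FF", "#00FF00", "#FF0000")

# The levels, in order, line up with positions 1, 2, 3 of the colour vector
levels(iris$Species)
# "setosa" "versicolor" "virginica"

# Indexing with the factor gives one colour per flower
species_colours <- colours[iris$Species]

# Rows 1, 51 and 101 are the first flower of each species
species_colours[c(1, 51, 101)]
# "#0000FF" "#00FF00" "#FF0000"
```

So with this vector Iris setosa is drawn in blue, Iris versicolor in green and Iris virginica in red.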

From this point on we can start talking about correlation, but I am going to leave that concept for a future post!

Thank you!
