Chapter 2 Basics of ggplot2 and Correlation Plot
Load these packages by running the code below.
library(tidyverse) # includes the ggplot2 package
library(cowplot) # lets you save figures as .png files
library(smplot)
2.1 Uploading data
Sample data: mpg
- I will be using an example from the book R for Data Science (https://r4ds.had.co.nz/data-visualisation.html).
- Question: Do cars with large engines use more fuel than those with small ones?
- Let’s open mpg, which is a data frame stored in the ggplot2 package.
- mpg contains data about cars in the US. You can type ?mpg for more information.
  - displ: the size of the car’s engine in liters.
  - hwy: highway fuel efficiency in miles per gallon. If it’s high, then the car uses less fuel per distance.
mpg
## # A tibble: 234 x 11
##    manufacturer model      displ  year   cyl trans      drv     cty   hwy fl    class
##    <chr>        <chr>      <dbl> <int> <int> <chr>      <chr> <int> <int> <chr> <chr>
##  1 audi         a4           1.8  1999     4 auto(l5)   f        18    29 p     compa~
##  2 audi         a4           1.8  1999     4 manual(m5) f        21    29 p     compa~
##  3 audi         a4           2    2008     4 manual(m6) f        20    31 p     compa~
##  4 audi         a4           2    2008     4 auto(av)   f        21    30 p     compa~
##  5 audi         a4           2.8  1999     6 auto(l5)   f        16    26 p     compa~
##  6 audi         a4           2.8  1999     6 manual(m5) f        18    26 p     compa~
##  7 audi         a4           3.1  2008     6 auto(av)   f        18    27 p     compa~
##  8 audi         a4 quattro   1.8  1999     4 manual(m5) 4        18    26 p     compa~
##  9 audi         a4 quattro   1.8  1999     4 auto(l5)   4        16    25 p     compa~
## 10 audi         a4 quattro   2    2008     4 manual(m6) 4        20    28 p     compa~
## # ... with 224 more rows
- Notice that some columns and rows are not shown. You can type View(mpg) to see the entire data frame.
- Each row is a unique observation.
- Each column is a unique variable/condition.
View(mpg)
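Besides View(), a few functions give a quick overview of a data frame directly in the console. The sketch below is one way to do it; mpg_dims is just an illustrative variable name.

```r
library(ggplot2)  # mpg is stored in the ggplot2 package

mpg_dims <- dim(mpg)  # number of rows, then number of columns
mpg_dims

names(mpg)  # column (variable) names

# Summary statistics for the two columns used in this chapter
summary(mpg[, c("displ", "hwy")])
```

The row and column counts should match the tibble header printed above.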
2.2 Basics of ggplot2
Let’s make some graphs
- Question: Do cars with large engines use more fuel than those with small ones?
- To answer our question, we need to plot the mpg data. The x-axis should be displ, and the y-axis should be hwy.
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy))
- We find that cars with smaller engines have higher fuel efficiency and that cars with larger engines have lower fuel efficiency. In other words, we see a negative relationship between displ and hwy.
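The direction of this relationship can also be checked numerically. A minimal sketch using cor() from base R, which computes Pearson's correlation coefficient by default (r_displ_hwy is just an illustrative name):

```r
library(ggplot2)  # for the mpg data

# Correlation between engine size and highway fuel efficiency
r_displ_hwy <- cor(mpg$displ, mpg$hwy)
r_displ_hwy  # a negative value means hwy tends to fall as displ rises
```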
How ggplot works
- When you are making a graph with ggplot2, always begin by typing the function ggplot().
  - The data you want to plot is the first argument here. Ex. ggplot(data = mpg).
- However, ggplot(data = mpg) alone does not create a graph. You will need to add (by typing +) more layers, such as geom_point().
- geom_point() adds points to your graph. You will need to specify (or map) the x- and y-axes in the aes() function, which stands for aesthetics. This process is called mapping.
  - As you might expect, there are other geom functions, such as geom_bar(), geom_boxplot() and geom_errorbar(). They plot bar graphs, boxplots and error bars, respectively.
- Here is the template for using ggplot2 (copied from R for Data Science).
ggplot(data = <DATA>) +
<GEOM_FUNCTION>(mapping = aes(<MAPPINGS>))
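As an illustration of how the template is filled in, the sketch below swaps in a different geom function. The choice of geom_boxplot() with class on the x-axis is only an example, and boxplot_hwy is an illustrative variable name.

```r
library(ggplot2)

# <DATA> = mpg, <GEOM_FUNCTION> = geom_boxplot, <MAPPINGS> = x = class, y = hwy
boxplot_hwy <- ggplot(data = mpg) +
  geom_boxplot(mapping = aes(x = class, y = hwy))
boxplot_hwy
```

Only the geom function and the mappings change; the ggplot(data = ...) part stays the same.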
Different color of points for each unique group
- You can apply different colors by the class of each car (each car = each row of the mpg data frame).
  - Include the class variable in the aes() function.
  - This maps the third variable, class, into your graph.
  - aes() stands for aesthetics (ex. color, shape, etc.).
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy, color = class))
- You can also set different shapes for each group of the data.
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy, shape = class))
- You can also map size or transparency (alpha) to each group, although this is not recommended for a categorical variable such as class. But you get the idea. Using aes() in a geom function (ex. geom_point()), you can label different groups of points.
# different levels of transparency (alpha) for each group
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy, alpha = class))
# different sizes of the points for each group
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy, size = class))
Different color & shape for each group
- You can also apply different color & shape for each group of the data.
- Exercise: Try it on your own before you look at the code below.
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy, color = class,
shape = class))
Same shape across all groups
- So far, you have put arguments such as shape and color inside the function aes().
  - This has enabled you to apply a different shape and color for each group.
- If you put the argument for shape, color or size outside of aes() in the geom function, then all data points will have the specified shape, color, etc., even if they are in different groups.
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy,
color = class), shape = 17)
- Notice that the color is different for each group because it is inside the function aes(). However, all the points are triangles because we have typed shape = 17 outside the function aes().
- Exercise: try changing the shape of the points to the circle with the border.
  - When shape = 19, the shape is the circle without the border.
  - When shape = 20, the shape is the small circle without the border.
  - When shape = 21, the shape is the circle with the border.
  - So let’s set shape to 21.
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy, color = class), shape = 21)
- Notice that the border color is different for each group, but not the color that fills the circle.
  - Shapes without borders (15-20) are filled with color.
  - Shapes with borders (21-25) are filled with fill, and their borders are colored with color.
  - So let’s change color = class to fill = class.
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy, fill = class), shape = 21)
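To see which number corresponds to which shape, one option is to draw all the shape codes (0 to 25) in a small chart. This is only a sketch: the grid layout, the orange fill and the names shapes and shape_chart are arbitrary choices for illustration.

```r
library(ggplot2)

# One row per shape code, laid out on a small grid
shapes <- data.frame(
  shape = 0:25,
  x = (0:25) %% 6,
  y = (0:25) %/% 6
)

shape_chart <- ggplot(shapes, aes(x = x, y = y)) +
  # color draws the point or its outline; fill only affects shapes 21-25
  geom_point(aes(shape = shape), size = 5, color = "black", fill = "orange") +
  geom_text(aes(label = shape), nudge_y = 0.35) +
  scale_shape_identity()  # use the numbers in the shape column as shape codes
shape_chart
```

In the resulting chart, only shapes 21 to 25 should appear with an orange interior, which is why fill is needed for them.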
How do we draw the best-fit line of the graph?
- Here is our graph.
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy))
- There seems to be a negative relationship.
- How do we draw the best-fit line of the graph’s negative relationship?
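- Use another geom function, geom_smooth(). By default it draws a smoothed curve (loess, as the message below reports) with a shaded confidence band rather than a straight line.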
ggplot(data = mpg) +
geom_smooth(mapping = aes(x = displ, y = hwy))
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
geom_point() + geom_smooth()
- Now let’s combine geom_point() + geom_smooth() into one graph.
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy)) +
geom_smooth(mapping = aes(x = displ, y = hwy))
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
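geom_smooth() picks its smoothing method automatically; the message above shows that loess (a flexible curve) was used here. If a straight best-fit line is wanted instead, one option is to request a linear model. A minimal sketch, where linear_fit_plot is just an illustrative name:

```r
library(ggplot2)

linear_fit_plot <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy)) +
  # method = "lm" fits a straight line; se = FALSE hides the confidence band
  geom_smooth(mapping = aes(x = displ, y = hwy),
              method = "lm", formula = y ~ x, se = FALSE)
linear_fit_plot
```

Stating method and formula explicitly should also stop ggplot2 from printing the message about which defaults it chose.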
- ggplot() acts as a system where you can add multiple geom objects, such as geom_point() and geom_smooth().
  - You can add multiple layers of geoms in a single plot, as shown here.
- ggplot() and at least one geom function are necessary to draw a graph. ggplot() alone does not draw a graph. Try it on your own.
ggplot(data = mpg)
Writing shorter code
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy)) +
geom_smooth(mapping = aes(x = displ, y = hwy))
- Notice that we have typed mapping = aes(x = displ, y = hwy) twice. This is repetitive.
- If you type the mapping argument in ggplot(), you won’t need to type it again in the subsequent geom functions.
ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
geom_point() +
geom_smooth()
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
This is exactly the same as the previous graph. In both cases, the mapping has been set so that the x-axis is displ and the y-axis is hwy in both geom_point() and geom_smooth().
Now let’s apply a different color to the points and fit a line for each group.
ggplot(data = mpg, mapping = aes(x = displ, y = hwy, color = class)) +
geom_point() +
geom_smooth()
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
- Okay, this is extremely messy and probably a bad idea.
  - You might have gotten warnings, but you can usually ignore them.
- Let’s plot the best-fit line across all groups (i.e., one best-fit line) but apply a different color for each class (i.e., many colors).
  - To do so, type color = class in geom_point(), not ggplot(). This specifies that a different color is applied for each class only in geom_point() but not in geom_smooth().
ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
geom_point(aes(color = class)) +
geom_smooth()
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
2.3 Improve Data Visualization using smplot
- Although the default theme of ggplot2 graphs is clean, there are some things that I do not like:
- The fonts are too small.
- The grey background is distracting.
- There are too many grid lines.
ggplot(data = mpg, mapping = aes(x = displ, y = hwy, color = class)) +
geom_point()
- Let’s make this graph prettier by using functions from smplot.
  - In this example, let’s use sm_corr_theme(). I’ve made this function as a theme suitable for correlation plots.
  - Disclaimer: the smplot package has been built based on my preference.
  - smplot is not necessary to make a ggplot graph or change its style. It is possible to change every aspect of the graph with ggplot2, but this requires about 8-20 lines of code (based on my experience). Instead, an smplot function does so in one line of code.
ggplot(data = mpg, mapping = aes(x = displ, y = hwy, color = class)) +
geom_point() +
sm_corr_theme()
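For comparison, here is a rough ggplot2-only sketch that addresses the three complaints listed earlier (small fonts, grey background, many grid lines). It is not a reproduction of what sm_corr_theme() does internally; the specific settings and the name themed_plot are assumptions chosen for illustration.

```r
library(ggplot2)

themed_plot <- ggplot(data = mpg, mapping = aes(x = displ, y = hwy, color = class)) +
  geom_point() +
  theme_bw(base_size = 14) +             # white background and larger fonts
  theme(
    panel.grid.minor = element_blank(),  # drop the minor grid lines
    legend.title = element_blank()       # hide the legend title
  )
themed_plot
```

Further tweaks (axis text, legend position and so on) each need their own line inside theme(), which is where the longer code comes from.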
- Now let’s remove the border within sm_corr_theme() by setting borders = FALSE.
ggplot(data = mpg, mapping = aes(x = displ, y = hwy, color = class)) +
geom_point() +
sm_corr_theme(borders = FALSE)
- Exercise: You can also set borders = TRUE and see what happens.
ggplot(data = mpg, mapping = aes(x = displ, y = hwy, color = class)) +
geom_point() +
sm_corr_theme(borders = TRUE)
You might notice that the borders come back. This is exactly what happens when you do not include the borders argument in sm_corr_theme(). This is because sm_corr_theme() is set to borders = TRUE by default.
I think the one with the border looks better.
You can also remove the legend by setting legends = FALSE in sm_corr_theme().
ggplot(data = mpg, mapping = aes(x = displ, y = hwy, color = class)) +
geom_point() +
sm_corr_theme(legends = FALSE)
- Exercise: Set legends = TRUE and see what happens. Type ?sm_corr_theme to see why legends appear without directly writing legends = TRUE.
ggplot(data = mpg, mapping = aes(x = displ, y = hwy, color = class)) +
geom_point() +
sm_corr_theme(legends = TRUE)
- However, in this case, I think we need a legend because there are many classes.
Positive relationship between x- and y-axes
- Let’s plot another scatterplot using the mtcars data.
- Set the x-axis to drat and the y-axis to mpg (here mpg is a column of mtcars, miles per gallon, not the mpg data frame used above).
- Since you are making a scatterplot, you will need to use geom_point().
- Set the size of all points to 3 by typing size = 3.
- Set the shape of all points to the circle with a border by typing shape = 21.
- Set the filled color of all points to green by typing fill = '#0f993d'.
- Set the border color to white by typing color = 'white'.
  - Since shape = 21 refers to the circle with a border, fill is the color that fills the points and color is the border color.
ggplot(data = mtcars, mapping = aes(x = drat, y = mpg)) +
geom_point(shape = 21, fill = '#0f993d', color = 'white',
size = 3)
drat and mpg have a positive relationship.
Now let’s make it pretty by adding sm_corr_theme().
ggplot(data = mtcars, mapping = aes(x = drat, y = mpg)) +
geom_point(shape = 21, fill = '#0f993d', color = 'white', size = 3) +
sm_corr_theme()
- You can remove borders too by setting borders = FALSE in sm_corr_theme().
ggplot(data = mtcars, mapping = aes(x = drat, y = mpg)) +
geom_point(shape = 21, fill = '#0f993d', color = 'white', size = 3) +
sm_corr_theme(borders = FALSE)
Reporting statistics from a paired correlation
- smplot also offers a function that plots the best-fit line of a scatterplot (i.e., correlation plot) and prints statistical values, such as the p- and R-values.
  - The p-value is used to check for statistical significance. If it’s less than 0.05, it’s conventionally regarded as statistically significant. However, for a correlation of a given strength, it gets smaller as the sample size gets larger.
  - The R-value (correlation coefficient) measures the strength and the direction of the correlation. It ranges from -1 to 1. Unlike the p-value, it does not systematically shrink or grow with the sample size.
- Let’s add a function sm_statCorr(). The statistical results are from Pearson’s correlation test.
ggplot(data = mtcars, mapping = aes(x = drat, y = mpg)) +
geom_point(shape = 21, fill = '#0f993d', color = 'white', size = 3) +
sm_corr_theme() +
sm_statCorr()
## `geom_smooth()` using formula 'y ~ x'
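The same kind of statistics can be computed without a plot using cor.test() from base R. This is a sketch for cross-checking the values printed on the graph, not a description of how sm_statCorr() works internally; pearson_result is an illustrative name.

```r
# mtcars ships with base R (datasets package), so no extra package is needed
pearson_result <- cor.test(mtcars$drat, mtcars$mpg, method = "pearson")

pearson_result$estimate  # correlation coefficient (R)
pearson_result$p.value   # p-value of the test
```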
- I don’t really like how the line color is different from that of the points. Let’s change the color to green.
- Also let’s get results from Spearman’s correlation test rather than from Pearson’s.
  - To do so, type corr_method = 'spearman' in the function sm_statCorr(). You will get an R value different from 0.68, which is the value from Pearson’s correlation test.
ggplot(data = mtcars, mapping = aes(x = drat, y = mpg)) +
geom_point(shape = 21, fill = '#0f993d', color = 'white', size = 3) +
sm_corr_theme() +
sm_statCorr(color = '#0f993d', corr_method = 'spearman')
## `geom_smooth()` using formula 'y ~ x'
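Spearman's correlation is computed on the ranks of the data rather than on the raw values, which is why its coefficient differs from Pearson's. A minimal base R sketch for comparing the two (the variable names are illustrative):

```r
# exact = FALSE avoids a warning about tied values when computing the p-value
spearman_result <- cor.test(mtcars$drat, mtcars$mpg,
                            method = "spearman", exact = FALSE)
spearman_rho <- unname(spearman_result$estimate)
pearson_r <- cor(mtcars$drat, mtcars$mpg)  # Pearson is the default method

spearman_rho
pearson_r
```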
- Exercise: Set corr_method = 'pearson' and see what happens.
ggplot(data = mtcars, aes(x = drat, y = mpg)) +
geom_point(shape = 21, fill = '#0f993d', color = 'white', size = 3) +
sm_corr_theme() +
sm_statCorr(color = '#0f993d', corr_method = 'pearson')
You will see that this is exactly the same as when the corr_method argument is not included in sm_statCorr(). In short, the default correlation method for sm_statCorr() is 'pearson'. So, if you don’t write anything for corr_method, it will give results from Pearson’s correlation test. Type ?sm_statCorr to see the default of line_type.
#0f993d is a specific green that I like.
Now, let’s change the color. Replace '#0f993d' with 'green' in geom_point() and sm_statCorr().
- This 'green' is the default green color of R.
ggplot(data = mtcars, mapping = aes(x = drat, y = mpg)) +
geom_point(shape = 21, fill = 'green', color = 'white', size = 3) +
sm_corr_theme() +
sm_statCorr(color = 'green')
## `geom_smooth()` using formula 'y ~ x'
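To see how the two greens differ, col2rgb() from base R converts a color name or hex code into its red, green and blue components (each from 0 to 255). A small sketch; greens is an illustrative name:

```r
# col2rgb() is part of base R (grDevices package)
greens <- col2rgb(c("#0f993d", "green"))
greens  # one column per color; rows are red, green and blue
```

The named color 'green' is pure green with no red or blue, while '#0f993d' is a darker green that mixes in a little red and blue.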
- Which one do you prefer? fill = '#0f993d' vs fill = 'green'
- I personally like #0f993d more. However, R does not recognize this color by the name green.
- So how are you supposed to remember the color code?
  - You do not have to. You can type sm_color('green') instead. This is a function from the smplot package.
  - sm_color() accepts the name of the color. If you want to get the hex codes (color codes) for red and green, type sm_color('red','green').
sm_color('red','green')
- Again, sm_color() has been built based on my preference. So it returns the hex codes of colors that I use most often.
- There are many more color themes that are available in R. For more information, please check out Chapter 28 of R for Data Science (https://r4ds.had.co.nz/graphics-for-communication.html).
ggplot(data = mtcars, mapping = aes(x = drat, y = mpg)) +
geom_point(shape = 21, fill = sm_color('green'), color = 'white',size = 3) +
sm_corr_theme() +
sm_statCorr(color = sm_color('green'))
## `geom_smooth()` using formula 'y ~ x'
- Exercise: Change the color of the points and the best-fit line to blue using sm_color(). If you want to see all the color options for sm_color(), type ?sm_color. There are 16 colors in total.
ggplot(data = mtcars, mapping = aes(x = drat, y = mpg)) +
geom_point(shape = 21, fill = sm_color('blue'), color = 'white',size = 3) +
sm_corr_theme() +
sm_statCorr(color = sm_color('blue'))
Different color for each group but with other colors
- Let’s go back to the mpg data. Set the x-axis to displ and the y-axis to hwy. Then make a scatterplot using geom_point().
  - Set the size of the points to 2 across all groups. So type size = 2 outside of aes() in geom_point().
- Let’s apply a different color for each class of the cars by writing color = class in aes() from ggplot().
  - fill = class is needed when the shape of the point is set to 21-25.
ggplot(data = mpg, mapping = aes(x = displ, y = hwy, color = class)) +
geom_point(size = 2)
- To use other colors, we could use a function from ggplot2 called scale_color_manual().
  - scale_fill_manual() is used when the shape of the point has borders (shape = 21-25).
- To find how many colors we need total, we need to find how many groups exist.
unique_classes <- unique(mpg$class)
- In R, you can extract data from one column by using $. You can try it with different variables too.
- unique() returns unique values in the selected data.
  - Then compute the number of unique values using the length() function.
number_of_classes <- length(unique_classes)
number_of_classes
## [1] 7
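The two steps can also be combined, and table() shows how many cars fall into each class. A short sketch in base R; n_classes and class_counts are illustrative names:

```r
library(ggplot2)  # for the mpg data

n_classes <- length(unique(mpg$class))  # same count in a single step
n_classes

class_counts <- table(mpg$class)  # number of cars (rows) in each class
class_counts
```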
- sm_palette() accepts the number of colors as input. It returns colors that I use most often.
  - Now that we know we need 7 colors in total, we can type sm_palette(7) or sm_palette(number_of_classes) for values in scale_color_manual().
ggplot(data = mpg, mapping = aes(x = displ, y = hwy, color = class)) +
geom_point(size = 2) +
scale_color_manual(values = sm_palette(number_of_classes)) +
sm_corr_theme()
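scale_color_manual() accepts any vector of colors, as long as it has at least as many entries as there are groups. As a sketch without smplot, the example below supplies seven hex codes by hand; the particular colors (taken from the Okabe-Ito palette) and the names my_colors and manual_color_plot are arbitrary choices for illustration.

```r
library(ggplot2)

# Seven hex codes, one per class of car
my_colors <- c("#E69F00", "#56B4E9", "#009E73", "#F0E442",
               "#0072B2", "#D55E00", "#CC79A7")

manual_color_plot <- ggplot(data = mpg,
                            mapping = aes(x = displ, y = hwy, color = class)) +
  geom_point(size = 2) +
  scale_color_manual(values = my_colors)
manual_color_plot
```

The colors are assigned to the classes in the order the classes appear in the legend.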
- Let’s store this graph using a variable called figure1.
figure1 <- ggplot(data = mpg, mapping = aes(x = displ, y = hwy, color = class)) +
geom_point() +
scale_color_manual(values = sm_palette(number_of_classes)) +
sm_corr_theme()
- Notice that when you store a figure in a variable, the figure is not displayed when you run the code that makes the figure, ex. figure1 <- ggplot(data = mpg, mapping = ... . To display the figure, please type the variable name in the console.
figure1 # it will appear again by calling this variable
Let’s save the plot as an image in your folder LearnR by using the variable figure1.
- To save the figure as an image, we will use a function from the cowplot package.
- The function is save_plot().
- There is one important argument: base_asp.
  - This is the aspect ratio of your image (width/height). I usually set it to 1.4. So let’s type base_asp = 1.4 in save_plot().
  - If base_asp is larger than 1, the image is wider than it is tall. This is recommended when you have a legend.
  - If there is no legend, then base_asp = 1 is recommended.
save_plot('figure1.png', figure1, base_asp = 1.4)
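As an alternative, ggplot2 has its own saving function, ggsave(), where the width and height are given directly (in inches by default). The sketch below is not part of the smplot or cowplot workflow; the plot, the 7 x 5 size (a ratio of 1.4) and the temporary file path are assumptions for illustration.

```r
library(ggplot2)

example_plot <- ggplot(data = mpg, mapping = aes(x = displ, y = hwy, color = class)) +
  geom_point()

# Save into a temporary folder so this sketch does not clutter your project
out_file <- file.path(tempdir(), "figure_check.png")
ggsave(out_file, plot = example_plot, width = 7, height = 5, dpi = 300)

file.exists(out_file)  # confirms that the image file was written
```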
- Exercise: try to save it again with the name figure1b.png by typing:
save_plot('figure1b.png', figure1)
How’s the picture? Why does it look different? Type ?save_plot to see what the default base_asp is.
Done! The graph (as a png file) should be in your LearnR folder.
Exercise: Try to open Microsoft Word or PowerPoint and insert figure1. The figure should look the same as it appears in the slides.
Exercise: Remove the legend and save the scatterplot with base_asp = 1.
Congratulations! You can now make correlation plots with R.
2.4 Summary
- You have learned the basics of ggplot2.
  - You begin by writing a ggplot() function.
  - If aesthetics (color, shape, etc.) are specified outside of the aes() function, then there is no group difference.
  - If aesthetics are specified in aes(), different groups of data will have different looks.
  - You have learned to add geom layers such as geom_point(), which shows points, and geom_smooth(), which plots the best-fit function.
  - You have learned to plot geom_point() and geom_smooth() in the same graph.
- smplot functions can be used to improve ggplot2 graphs visually.
  - For correlation plots, add sm_corr_theme().
  - You can report statistical results and plot the linear regression line of a correlation with sm_statCorr().
  - You can also select colors using sm_color().
- Save the graph as an image file in your working directory.
  - The working directory has to be set in RStudio (Session -> Set Working Directory -> Choose Directory).
  - Then use save_plot() from cowplot to save the image in your directory (folder LearnR).
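The working directory can also be checked and set from the console with getwd() and setwd(). In the sketch below, the LearnR folder is created inside a temporary directory purely so that the example runs anywhere; on your own computer you would point setwd() at the real location of your LearnR folder instead.

```r
getwd()  # shows the current working directory

# Hypothetical location used only for this sketch
learnr_dir <- file.path(tempdir(), "LearnR")
dir.create(learnr_dir, showWarnings = FALSE)

old_dir <- setwd(learnr_dir)  # switch to LearnR; the previous directory is returned
current_folder <- basename(getwd())
current_folder

setwd(old_dir)  # switch back to where you started
```

Files saved with a bare file name, such as 'figure1.png', are written to whichever directory getwd() reports at that moment.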