In today’s class, we will begin to explore how R can be used to make graphics from data, making customized static graphics with the ggplot2 package. This is part of Hadley Wickham’s tidyverse, so you already have it installed from last week.
Download the data for this session from here, unzip the folder and place it on your desktop. It contains the following files:
disease_democ.csv: Data illustrating a controversial theory suggesting that the emergence of democratic political systems has depended largely on nations having low rates of infectious disease, from the Global Infectious Diseases and Epidemiology Network and Democratization: A Comparative Analysis of 170 Countries, as used in week 1.
food_stamps.csv: U.S. Department of Agriculture data on the number of participants, in millions, and costs, in $ billions, of the federal Supplemental Nutrition Assistance Program from 1969 to 2015.
kindergarten.csv: Data from the California Department of Public Health, documenting enrollment and the number of children with complete immunizations at entry into kindergartens in California from 2001 to 2015. Contains the following variables:
  sch_code: Unique identifying code for each school.
  pub_priv: Whether school is public or private.
  enrollment: Number of children enrolled.
  complete: Number of children with complete immunizations.
  start_year: Year of entry (for the 2015-2016 school year, for example, this would be 2015).
nations.csv: Data from the World Bank World Development Indicators portal, giving data on population, GDP per capita, life expectancy, birth rate, neonatal mortality rate, region and income group for the world’s nations, from 1990 onwards, as used in week 3.
The “gg” in ggplot2 stands for “grammar of graphics,” an approach to drawing charts devised by the statistician Leland Wilkinson. Rather than thinking in terms of finished charts like a scatter plot or a column chart, it starts by defining the coordinate system (usually the X and Y axes of a cartesian system), maps data onto those coordinates, and then adds layers such as points, bars and so on. This is the logic behind ggplot2 code.
Some key things to understand about ggplot2:
ggplot: This is the master function that creates a ggplot2 chart.
aes: This function, named for “aesthetic mapping,” is used whenever data values are mapped onto a chart. So it is used when you define which variables are plotted onto the X and Y axes, and also if you want to change the size or color of parts of the chart according to values for a variable.
geom: All of the functions that add layers to a chart start with geom, followed by an underscore, for example geom_bar(). The code in the brackets for any geom layer styles the items in that layer, and can include aes mappings of values from data.
theme: This function modifies the appearance of elements of a plot, used, for example, to set size and font face for text, the position of a legend, and so on.
scale: Functions that begin with scale, followed by an underscore, are used to modify the way an aes mapping of data appears on a chart. They can change the axis range, for example, or specify a color palette to be used to encode values in the data.
+: This is used each time you add a layer, a scale, a theme, or elements like axis labels and a title. After a + you can continue on the same line of code or move to the next line. I usually write a new line after each +, which makes the code easier to follow.
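To make these pieces concrete, here is a minimal sketch of how they fit together. The toy data frame and its column names are invented for illustration and are not part of this week's data.

```r
library(ggplot2)

# Hypothetical toy data, invented purely for illustration
toy <- data.frame(
  x = c(1, 2, 3, 4, 5),
  y = c(2, 4, 5, 4, 6)
)

# ggplot() names the data frame and uses aes() to map variables to the axes
toy_chart <- ggplot(toy, aes(x = x, y = y)) +
  # each + adds a layer, axis labels, a scale or a theme
  geom_point(size = 3) +
  xlab("X value") +
  ylab("Y value") +
  theme_minimal()

# draw the chart
print(toy_chart)
```

Nothing is drawn until the chart is printed, which is why the same object can be built up piece by piece.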
We’ll start by making and modifying a scatter plot from the disease and democracy data that we first encountered in week 1.
Open a new R script in RStudio, save the blank script to the folder with the data for this week, and then set your working directory to this location by selecting from the top menu Session>Set Working Directory>To Source File Location.
Now copy the following code into your script and run to load readr, ggplot2, and then load the disease and democracy data:
    # load required packages
    library(ggplot2)
    library(readr)

    # load disease and democracy data
    disease_democ <- read_csv("disease_democ.csv")
Copy this code into your R script and run:
    # map values in data to X and Y axes
    ggplot(disease_democ, aes(x = infect_rate, y = democ_score))

The brackets after the ggplot function define the data frame to be used, followed by the aes mapping of variables in the data to the chart’s X and Y axes.
The following chart should appear in the Plots panel at bottom right:
The axis ranges are automatically set to values in the data, but at this point there is just a blank chart grid, because we haven’t added any geom layers to the chart.
By default, the axis labels will be the names of the variables in the data. But it’s easy to customize, using the following code:
    # customize axis labels
    ggplot(disease_democ, aes(x = infect_rate, y = democ_score)) +
      xlab("Infectious disease prevalence score") +
      ylab("Democratization score")
The default gray theme of ggplot2 has a rather academic look. See here and here for how to use the theme option to customize individual elements of a chart. However, for my charts, I typically use one of the ggplot2 built-in themes, and then customize the fonts.

    # Change the theme
    ggplot(disease_democ, aes(x = infect_rate, y = democ_score)) +
      xlab("Infectious disease prevalence score") +
      ylab("Democratization score") +
      theme_minimal(base_size = 14, base_family = "Georgia")
Notice how base_family and base_size can be used with a built-in theme to change font face and size. R’s basic fonts are fairly limited (run names(postscriptFonts()) to view those available). However, you can use the extrafont package to make other fonts available.
If you wish to develop your own customized theme, I recommend using this web app to select your theme options. When you are satisfied with the appearance of the chart in the app, click the R script for theme (run every R session) button to download your theme as an R script.
If you then load and run this script at the start of your R session, your ggplot2 charts for that session will use the downloaded theme.
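One way a script can make a theme the default for a whole session is ggplot2's theme_set function. Whether the script downloaded from the web app does exactly this is an assumption on my part, so treat the following as a sketch of the general idea, using invented toy data.

```r
library(ggplot2)

# theme_set() makes a theme the default for the rest of the session
# and invisibly returns the theme that was previously in place
old_theme <- theme_set(theme_minimal(base_size = 14))

# Hypothetical toy data, invented for illustration
toy <- data.frame(x = 1:5, y = c(2, 4, 5, 4, 6))

# No theme is specified here, so the session default set above applies
ggplot(toy, aes(x = x, y = y)) +
  geom_point()

# Restore the previous default when finished
theme_set(old_theme)
```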
You can save a ggplot2 chart as an object in your environment using the <- assignment operator. So we’ll do that here to save the basic template, with no geom layers.

    # save chart template, and plot
    disease_democ_chart <- ggplot(disease_democ, aes(x = infect_rate, y = democ_score)) +
      xlab("Infectious disease prevalence score") +
      ylab("Democratization score") +
      theme_minimal(base_size = 14, base_family = "Georgia")

    plot(disease_democ_chart)

There should now be an object of type gg in your Environment called disease_democ_chart. The plot function will plot a saved ggplot2 object.
This code will add a geom layer with points to the template:

    # add a layer with points
    disease_democ_chart +
      geom_point()

    # add a trend line
    disease_democ_chart +
      geom_point() +
      geom_smooth()
The following code modifies the two geom layers to change their appearance.

    # customize the two geom layers
    disease_democ_chart +
      geom_point(size = 3, alpha = 0.5) +
      geom_smooth(method = lm, se = FALSE, color = "red")
In the geom_point layer, we have increased the size of each point, and made the points semi-transparent using alpha. In the geom_smooth function, we have changed the color of the line, removed the ribbon showing the se or “standard error,” a measure of the uncertainty surrounding the fit to the data, and changed the method used to fit the data from a smoothed fit, using a method called locally-weighted scatterplot smoothing, to a linear regression, or linear model (lm).
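To see what a linear model does with the data, it can help to fit one directly with R's lm function. This is a sketch on invented toy data, not the disease and democracy data; the straight line that geom_smooth draws with method = lm corresponds to the intercept and slope reported here.

```r
# Hypothetical toy data: y rises by roughly 2 for each unit of x
toy <- data.frame(
  x = c(1, 2, 3, 4, 5, 6),
  y = c(2.1, 3.9, 6.2, 7.8, 10.1, 11.9)
)

# Fit a straight line for y against x
fit <- lm(y ~ x, data = toy)

# Intercept and slope of the fitted line
coef(fit)

# Standard errors of the two estimates, which reflect the
# uncertainty surrounding the fit
summary(fit)$coefficients[, "Std. Error"]
```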
When setting colors in ggplot2 you can use their R color names, or their HEX values. This code will produce the same result:
    # customize the two geom layers
    disease_democ_chart +
      geom_point(size = 3, alpha = 0.5) +
      geom_smooth(method = lm, se = FALSE, color = "#FF0000")
Until you are familiar with the options for each geom, you will need to look up how to change the appearance of each layer: Follow the links for each geom from here.
The following code customizes the trend line further, and includes an aes mapping to set the color of the points so that they reflect the categorical variable of World Bank income group.

    # customize again, coloring the points by income group
    disease_democ_chart +
      geom_point(size = 3, alpha = 0.5, aes(color = income_group)) +
      geom_smooth(method = lm, se = FALSE, color = "black", linetype = "dotdash", size = 0.3)
Notice how the aes function colors the points by values in the data, rather than setting them to a single color. ggplot2 recognizes that income_group is a categorical variable, and uses its default qualitative color palette.
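The difference between setting a color and mapping a color with aes can be checked on a small example. This sketch uses invented toy data and ggplot2's layer_data function, which returns the values ggplot2 computes for a layer, including the color assigned to each point.

```r
library(ggplot2)

# Hypothetical toy data with a two-level categorical variable
toy <- data.frame(
  x = c(1, 2, 3, 4),
  y = c(3, 1, 4, 2),
  group = c("a", "a", "b", "b")
)

# Setting: color is outside aes(), so every point gets the same color
set_chart <- ggplot(toy, aes(x = x, y = y)) +
  geom_point(color = "red", size = 3)

# Mapping: color is inside aes(), so it depends on the values of group
mapped_chart <- ggplot(toy, aes(x = x, y = y)) +
  geom_point(aes(color = group), size = 3)

# One distinct color when set, one per category when mapped
unique(layer_data(set_chart, 1)$colour)
unique(layer_data(mapped_chart, 1)$colour)
```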
Now run this code, to see the different effect of setting the aes color mapping for the entire chart, rather than just one geom layer.

    # color the entire chart by income group
    ggplot(disease_democ, aes(x = infect_rate, y = democ_score, color = income_group)) +
      xlab("Infectious disease prevalence score") +
      ylab("Democratization score") +
      theme_minimal(base_size = 14, base_family = "Georgia") +
      geom_point(size = 3, alpha = 0.5) +
      geom_smooth(method = lm, se = FALSE, linetype = "dotdash", size = 0.3)
Because here we mapped the variable income_group to color for the whole chart, and not just the points, it also affects the geom_smooth layer, so a separate trend line, colored the same as the points, is calculated for each income group.

    # set the axis ranges, change color palette
    disease_democ_chart +
      geom_point(size = 3, alpha = 0.5, aes(color = income_group)) +
      geom_smooth(method = lm, se = FALSE, color = "black", linetype = "dotdash", size = 0.3) +
      scale_x_continuous(limits = c(0,70)) +
      scale_y_continuous(limits = c(0,100)) +
      scale_color_brewer(name = "Income group", palette = "Set1")
Notice how the first two scale functions are used to set the ranges for the axes, which are entered as a vector, using the c function we saw last week.
You can apply ColorBrewer qualitative palettes by using the scale_color_brewer function. Add the text you want to appear as a legend title using name.
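If you want to see which colors a ColorBrewer palette contains before using it, the RColorBrewer package can list and display them. This sketch assumes RColorBrewer is installed, which it normally is alongside ggplot2.

```r
library(RColorBrewer)

# The first four colors of the qualitative Set1 palette, as HEX values
set1_colors <- brewer.pal(4, "Set1")
set1_colors

# Draw swatches of all of the qualitative ColorBrewer palettes
display.brewer.all(type = "qual")
```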
Having made a series of charts, you can browse through them using the blue arrows at the top of the Plots tab in the panel at bottom right. The broom icon will clear all of your charts; the icon to its immediate left removes the chart in the current view.
You can export any chart by selecting Export>Save as Image.... At the dialog box, you can select the desired image format and size. If you wish to subsequently edit or annotate the chart in a vector graphics editor such as Adobe Illustrator, export as an SVG file.
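Charts can also be exported from code using ggplot2's ggsave function, which is handy when you need an exact size. The following is a sketch using invented toy data and a temporary file path chosen just for this example; width and height are given in inches, so the pixel dimensions depend on the dpi setting.

```r
library(ggplot2)

# Hypothetical toy data and chart, invented for illustration
toy <- data.frame(x = 1:5, y = c(2, 4, 5, 4, 6))
toy_chart <- ggplot(toy, aes(x = x, y = y)) +
  geom_point()

# Hypothetical output path; a temporary folder keeps the example self-contained
out_file <- file.path(tempdir(), "toy_chart.png")

# 7.5 x 5 inches at 100 dots per inch gives an image of 750 x 500 pixels
ggsave(out_file, plot = toy_chart, width = 7.5, height = 5, dpi = 100)
```

The image format is taken from the file extension, so changing the extension changes the type of file written.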
You can also save your final ggplot2 chart as an object in your R environment:
    # save final disease and democracy chart
    final_disease_democ_chart <- disease_democ_chart +
      geom_point(size = 3, alpha = 0.5, aes(color = income_group)) +
      geom_smooth(method = lm, se = FALSE, color = "black", linetype = "dotdash", size = 0.3) +
      scale_x_continuous(limits = c(0,70)) +
      scale_y_continuous(limits = c(0,100)) +
      scale_color_brewer(name = "Income group", palette = "Set1")
Now we will explore a series of other geom functions using the food stamps data.

    # load data
    food_stamps <- read_csv("food_stamps.csv")

    # save basic chart template
    food_stamps_chart <- ggplot(food_stamps, aes(x = year, y = participants)) +
      xlab("Year") +
      ylab("Participants (millions)") +
      theme_minimal(base_size = 14, base_family = "Georgia")

    # line chart
    food_stamps_chart +
      geom_line()

    # customize the line, add a title
    food_stamps_chart +
      geom_line(size = 1.5, color = "red") +
      ggtitle("Line chart")
ggtitle adds a title to the chart.
    # Add a second layer to make a dot-and-line chart
    food_stamps_chart +
      geom_line() +
      geom_point() +
      ggtitle("Dot-and-line chart")

    # Make a column chart
    food_stamps_chart +
      geom_bar(stat = "identity") +
      ggtitle("Column chart") +
      theme(panel.grid.major.x = element_blank(),
            panel.grid.minor.x = element_blank())
geom_bar works a little differently to the geoms we have considered previously. If you have not mapped data values to the Y axis with aes, its default behavior is to set the heights of the bars by counting the number of records for values along the X axis. If you have mapped a variable to the Y axis, and want the heights of the bars to represent values in the data, you must use stat = "identity".
Notice also the use of theme options to remove the X axis grid lines.
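The two behaviors of geom_bar can be compared on a small example. This sketch uses invented toy data; layer_data is a ggplot2 function that returns the values computed for a layer, and geom_col is ggplot2's shorthand for geom_bar with stat = "identity".

```r
library(ggplot2)

# Hypothetical toy data: one row per record
records <- data.frame(fruit = c("apple", "apple", "pear", "apple", "pear"))

# No Y mapping: geom_bar() counts the records for each value on the X axis
count_chart <- ggplot(records, aes(x = fruit)) +
  geom_bar()
layer_data(count_chart, 1)$count

# Hypothetical pre-summarized data: one row per bar
totals <- data.frame(fruit = c("apple", "pear"), total = c(10, 4))

# Y is mapped, so stat = "identity" uses the values as the bar heights
identity_chart <- ggplot(totals, aes(x = fruit, y = total)) +
  geom_bar(stat = "identity")
layer_data(identity_chart, 1)$y

# geom_col() draws the same chart with less typing
col_chart <- ggplot(totals, aes(x = fruit, y = total)) +
  geom_col()
```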
    # Make a bar chart
    food_stamps_chart +
      geom_bar(stat = "identity") +
      ggtitle("Bar chart") +
      theme(panel.grid.major.x = element_blank(),
            panel.grid.minor.x = element_blank()) +
      coord_flip()
coord_flip switches the X and Y axes.
With geom_bar, you can set color for the outline as well as the interior of the shape.

    # set color and fill
    food_stamps_chart +
      geom_bar(stat = "identity", color = "#888888", fill = "#CCCCCC", alpha = 0.5) +
      ggtitle("Column chart")
When setting colors, color refers to the outline, fill to the interior of the shape.

    # fill the bars according to values for the cost of the program
    food_stamps_chart +
      geom_bar(stat = "identity", color = "white", aes(fill = costs))
This code uses an aes mapping to color the bars according to values for the costs of the program, in billions of dollars. ggplot2 recognizes that costs is a continuous variable, but its default sequential scheme applies more intense blues to lower values, which is counterintuitive.

    # use a colorbrewer sequential palette
    food_stamps_chart +
      geom_bar(stat = "identity", color = "#888888", aes(fill = costs)) +
      scale_fill_distiller(name = "Cost\n($ billion)", palette = "Reds", direction = 1)
scale_fill_distiller (and scale_color_distiller) work like scale_color_brewer, but set color gradients for ColorBrewer’s sequential and diverging color palettes; direction = 1 ensures that larger numbers are mapped to more intense colors (direction = -1 reverses the color mapping).
Notice also the \n in the title for the legend. This introduces a new line.
This code uses the theme function to move the legend from its default position to the right of the chart, to use some empty space on the chart itself.

    food_stamps_chart +
      geom_bar(stat = "identity", color = "#888888", aes(fill = costs)) +
      scale_fill_distiller(name = "Cost\n($ billion)", palette = "Reds", direction = 1) +
      theme(legend.position = c(0.15,0.8))

The coordinates for the legend are given as a vector: The first number sets the horizontal position, from left to right, on a scale from 0 to 1; the second number sets the vertical position, from bottom to top, again on a scale from 0 to 1.
The scales package allows you to format axes to display as currency, as percentages, or so that large numbers use commas as thousands separators.
Here are some examples of using dplyr, ggplot2, and scales to process data and make charts.
    # load required packages
    library(scales)
    library(dplyr)

    # load data
    immun <- read_csv("kindergarten.csv")

    # proportion incomplete, entire state, by year
    immun_year <- immun %>%
      group_by(start_year) %>%
      summarize(enrolled = sum(enrollment, na.rm = TRUE),
                completed = sum(complete, na.rm = TRUE)) %>%
      mutate(incomplete = round(((enrolled - completed) / enrolled), 4))

    # proportion incomplete, by county, by year
    immun_counties_year <- immun %>%
      group_by(county, start_year) %>%
      summarize(enrolled = sum(enrollment, na.rm = TRUE),
                completed = sum(complete, na.rm = TRUE)) %>%
      mutate(incomplete = round(((enrolled - completed) / enrolled), 4))

    # identify five counties with the largest enrollment over all years
    top5 <- immun %>%
      group_by(county) %>%
      summarize(enrolled = sum(enrollment, na.rm = TRUE)) %>%
      arrange(desc(enrolled)) %>%
      head(5) %>%
      select(county)

    # proportion incomplete, top 5 counties by enrollment, by year
    immun_top5_year <- semi_join(immun_counties_year, top5)
The code above uses dplyr to group and summarize the data for the charts that follow. Notice the use of na.rm = TRUE with the sum function. This is needed when summarizing data using functions like median if there are any missing values in the data. It is a good idea to get into the habit of including it when using these functions.
The calculated variable incomplete gives the proportion of children who did not have complete immunizations, as a decimal fraction between 0 and 1. In the ggplot2 code that follows, we will use scales to display these numbers as percentages.
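To see why na.rm = TRUE matters, here is a sketch of the same summary on a tiny invented data frame with one missing record. The values are made up for illustration, although the column names follow kindergarten.csv.

```r
library(dplyr)

# Hypothetical toy data: the second school has missing values
schools <- data.frame(
  start_year = c(2014, 2014, 2015, 2015),
  enrollment = c(100, NA, 80, 120),
  complete   = c(90, NA, 72, 98)
)

# Without na.rm = TRUE, a single missing value makes the whole sum NA
sum(schools$enrollment)

# With na.rm = TRUE, missing values are dropped before summing
sum(schools$enrollment, na.rm = TRUE)

# The same grouped summary as used for the immunization data
by_year <- schools %>%
  group_by(start_year) %>%
  summarize(enrolled = sum(enrollment, na.rm = TRUE),
            completed = sum(complete, na.rm = TRUE)) %>%
  mutate(incomplete = round((enrolled - completed) / enrolled, 4))

by_year
```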
    # bar chart by year, entire state
    ggplot(immun_year, aes(x = start_year, y = incomplete)) +
      geom_bar(stat = "identity", fill = "red", alpha = 0.7) +
      theme_minimal(base_size = 12) +
      scale_y_continuous(labels = percent) +
      scale_x_continuous(breaks = c(2002,2004,2006,2008,2010,2012,2014)) +
      xlab("") +
      ylab("Incomplete") +
      ggtitle("Immunization in California kindergartens, entire state") +
      theme(panel.grid.major.x = element_blank(),
            panel.grid.minor.x = element_blank())
Here, the code scale_y_continuous(labels = percent) uses labels = percent from scales to format the decimal fractions in incomplete as percentages.
scale_x_continuous(breaks = c(2002,2004,2006,2008,2010,2012,2014)) manually sets the positions of the X axis tick labels, rather than accepting the default values selected by ggplot2.
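The formatting functions from scales can also be called directly on numbers, which is a quick way to check what the axis labels will look like. This is a small sketch with invented values; the accuracy argument is used here to ask for whole-number percentages.

```r
library(scales)

# Decimal fractions formatted as percentages, rounded to whole numbers
percent(c(0.05, 0.1, 0.15), accuracy = 1)  # "5%" "10%" "15%"

# Numbers formatted as currency in dollars
dollar(c(1000, 25000))                     # "$1,000" "$25,000"

# Large numbers with commas as thousands separators
comma(1234567)                             # "1,234,567"
```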
    # dot and line chart, top5 counties, by year
    ggplot(immun_top5_year, aes(x = start_year, y = incomplete, color = county)) +
      scale_color_brewer(palette = "Set1", name = "") +
      geom_line(size = 1) +
      geom_point(size = 3) +
      theme_minimal(base_size = 12) +
      scale_y_continuous(labels = percent, limits = c(0,0.15)) +
      scale_x_continuous(breaks = c(2002,2004,2006,2008,2010,2012,2014)) +
      xlab("") +
      ylab("Incomplete") +
      theme(legend.position = "bottom") +
      ggtitle("Immunization in California kindergartens\n(five largest counties)")

    # heat map, all counties, by year
    ggplot(immun_counties_year, aes(x = start_year, y = county)) +
      geom_tile(aes(fill = incomplete), colour = "white") +
      scale_fill_gradient(low = "white", high = "red", name = "", labels = percent) +
      scale_x_continuous(breaks = c(2002,2004,2006,2008,2010,2012,2014)) +
      theme_minimal(base_size = 12) +
      xlab("") +
      ylab("County") +
      theme(panel.grid.major = element_blank(),
            panel.grid.minor = element_blank(),
            legend.position = "bottom",
            legend.key.height = unit(0.4, "cm")) +
      ggtitle("Immunization in California kindergartens, by county")
This code uses geom_tile to make a heat map, and scale_fill_gradient to create a color gradient by manually setting the colors for the ends of the scale. This is an alternative to using scale_fill_distiller to use a ColorBrewer sequential palette.
    # load data
    nations <- read_csv("nations.csv")

    # filter for 2014 data only
    nations2014 <- nations %>%
      filter(year == 2014)

    # make bubble chart
    ggplot(nations2014, aes(x = gdp_percap, y = life_expect)) +
      xlab("GDP per capita") +
      ylab("Life expectancy at birth") +
      theme_minimal(base_size = 12, base_family = "Georgia") +
      geom_point(aes(size = population, color = region), alpha = 0.7) +
      scale_size_area(guide = FALSE, max_size = 15) +
      scale_x_continuous(labels = dollar) +
      stat_smooth(formula = y ~ log10(x), se = FALSE, size = 0.5, color = "black", linetype = "dotted") +
      scale_color_brewer(name = "", palette = "Set2") +
      theme(legend.position = c(0.8,0.4))
In this code, scale_size_area ensures that the size of the circles scales by their area according to the population data, up to the specified max_size. guide = FALSE within the brackets of this function prevents a legend for size being drawn.
labels = dollar from scales formats the X axis labels as currency in dollars.
stat_smooth works like geom_smooth but allows you to use a formula to specify the type of curve to use for the trend line fitted to the data, here a logarithmic curve.
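To see what a formula such as y ~ log10(x) means, you can fit it directly with R's lm function. This sketch uses invented toy data that lie exactly on a logarithmic curve, so the fitted coefficients recover the numbers used to make the data. Adding method = lm alongside the formula in stat_smooth, as in the last lines, is one way to ask for a strictly logarithmic fit; that combination is my illustration and is not part of the bubble chart code above.

```r
library(ggplot2)

# Hypothetical toy data lying exactly on the curve y = 2 + 3 * log10(x)
toy <- data.frame(x = c(1, 10, 100, 1000, 10000))
toy$y <- 2 + 3 * log10(toy$x)

# Fit y as a straight-line function of log10(x)
fit <- lm(y ~ log10(x), data = toy)

# The intercept and slope match the 2 and 3 used to create the data
coef(fit)

# The same formula used for a trend line in a chart
log_chart <- ggplot(toy, aes(x = x, y = y)) +
  geom_point() +
  stat_smooth(method = lm, formula = y ~ log10(x), se = FALSE)
```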
For both charts, you will first need to create a new variable in the data, using mutate from dplyr, giving the GDP of each country in trillions of dollars, by multiplying gdp_percap by population and dividing by a trillion. When drawing both charts with ggplot2 you should also manually set the X axis tick mark labels (otherwise the default will be to include a label for 2015, for which there is no data).
For the first chart, you will need to filter the data with dplyr for the four desired countries. When making the chart with ggplot2 you will need to add both geom_point and geom_line layers, and use the Set1 ColorBrewer palette.
For the second chart, using dplyr you will need to group_by region and year, and then summarize using sum. There will be null values, or NAs, in this data, so you will need to use na.rm = TRUE. When drawing the chart with ggplot2, you will need to use geom_area and the Set2 ColorBrewer palette. Think about the difference between fill and color when making the chart, and put a very thin white line around each area.
File each chart as a JPEG image, setting its size to 750 pixels wide and 500 pixels high.
Also file your R script containing the code used to process the data and draw your charts.
Hadley Wickham: ggplot2: Elegant Graphics For Data Analysis
ggplot2 cheat sheet from RStudio.