The simple features or sf R package
allows geodata to be imported as objects with the type
sf, which are essentially R data frames with the map
geometry stored as coordinates in a variable called
geometry
. This allows you to work with geodata as you would
with regular data frames or tibbles, using functions from the dplyr package
and the rest of the tidyverse,
including the charting package ggplot2 for
making maps.
If you are working in the computers provided at the NICAR meeting, the packages listed in the code chunk below should already be installed. If you are working on your own computer, you will first need to install these packages. For sf to work correctly, you must also have the Geospatial Data Abstraction Library, or GDAL, installed on your computer. If you routinely work with QGIS, this should already be installed. If it is not, follow the instructions to install GDAL for your operating system given on the sf package home page.
Also if working on your own computer, click on the Materials link in
the top navigation bar to download a zipped folder containing the data,
notes, and code for this class, which is in the file
r-sf-mapping-geo-analysis.Rmd
. Open that file in RStudio,
and switch to the Visual
Markdown Editing view, by clicking the Visual
button at
top left above.
With the required packages all installed, run the code below to load them for use in the current session:
# load required packages
library(sf)
library(tidyverse)
library(tidygeocoder)
First we will import the file seismic.geojson
using the
function st_read
. The data contains a single variable
ValueRange
giving ranges for the annual percentage chance
of experiencing a damaging earthquake across the continental US, as
calculated by the US
Geological Survey, plus the geometry
for a map. The
coordinates define polygons, or two-dimensional shapes.
# load geojson file
seismic <- st_read("seismic.geojson")
# examine the data
glimpse(seismic)
view(seismic)
sf objects can be exported to other common geodata
formats using the st_write
function, as shown in the
following code chunk:
# save as kml
st_write(seismic, "seismic.kml", delete_dsn = TRUE)
# save as shapefile
dir.create("seismic")
st_write(seismic, "seismic/seismic.shp", delete_dsn = TRUE)
Including delete_dsn = TRUE
in the st_write
function allows any previously saved version to be overwritten, so
exclude this from the code if you do not wish to allow this behavior.
Shapefiles consist of multiple files, so the code to save as a shapefile
first creates a folder to hold those files, and then writes the files
into that folder.
Next we will use the tidygeocoder
package to geocode the list of San Francisco addresses using the ArcGIS
geocoder. The data is in the file addresses.tsv
, which
contains a single column with the header address
.
# load addresses data
addresses <- read_tsv("addresses.tsv")
# geocode
addresses <- geocode(addresses,
address = address,
method = "arcgis",
full_results = TRUE)
# look at the geocoded data
view(addresses)
Including full_results = TRUE
in this code provides some
information about the estimated accuracy of the geocoding.
We can now convert the geocoded data frame of San Francisco addresses to an sf object with the following code:
# convert to sf object with equirectangular projection (EPSG:4326)
addresses <- addresses %>%
st_as_sf(coords = c("long","lat"),
crs = st_crs("EPSG:4326"))
# examine data
view(addresses)
The function st_as_sf
converts a regular data frame with
latitude and longitude coordinates into an sf object.
In this code "long"
and "lat"
are the names of
the variables containing longitude and latitude coordinates. The
sf object now has a geometry
column
containing the point coordinates.
Notice also that the st_as_sf
function includes
st_crs("EPSG:4326")
. This sets the projection, or
coordinate reference system, for the geometry.
Because the Earth is roughly spherical, any map other than a globe is a distortion of reality. Just as you can’t peel an orange and arrange the skin as a perfect rectangle, circle, or ellipse, it is impossible to plot the Earth’s surface in two dimensions and accurately represent distances, areas, shapes and directions. So maps rely on a projection or coordinate reference system to convert locations on a sphere to two-dimensional maps. Some projections are optimized to minimize the distortion of area; others aim to preserve shape or distance; yet others keep directions constant.
Projections have numeric codes, defined originally by the European Petroleum Survey Group.
Here are the codes for some common projections:
Equirectangular: EPSG:4326
Plots
degrees of latitude against degrees of longitude. Used in the code
above.
Web Mercator: EPSG:3857
Used for
OpenStreetMap and other web maps. Mercator was originally designed for
navigation and keeps direction/compass bearing constant.
Albers Equal Area Conic, centered on continental
US: EPSG:5070
Keeps areas constant, widely used for maps of
the entire US.
Mollweide: ESRI:54009
Equal area
projection suitable for mapping the entire world.
Robinson: ESRI:54030
A “compromise”
projection for showing the entire world, minimizes distortion across
each of area, shape, distance, and compass bearing but represents none
of them perfectly.
Search here for projection codes.
You can change the projection of an sf object with
the function st_transform
and check the projection of an
sf object with the function st_crs
:
# change projection to Web Mercator
addresses <- addresses %>%
st_transform("EPSG:3857")
# what is the projection?
st_crs(addresses)
# change projection back to equirectangular
addresses <- addresses %>%
st_transform("EPSG:4326")
# what is the projection?
st_crs(addresses)
Before making a map from the seismic
data, we will
convert the ValueRange
variable from text to a categorical
variable, or factor, with the values arranged in
increasing order of risk. This will ensure that the colors get applied
to the values in the ascending order of risk.
This is achieved with the following code:
# convert to ordered factor/categorical variable
seismic <- seismic %>%
mutate(ValueRange = factor(ValueRange,
levels = c("< 1","1 - 2","2 - 5","5 - 10","10 - 14")))
Let’s also check the projection of the data:
# what is the projection of the seismic risk data?
st_crs(seismic)
This reveals that the data is in EPSG:4326
.
The following code creates a map of the annual risk of experiencing a damaging quake with ggplot2:
# make a map of the annual risk of experiencing a damaging quake in the continental US
ggplot() +
geom_sf(data = seismic,
aes(fill = ValueRange),
linewidth = 0,
color = NA) +
scale_fill_brewer(palette = "Reds",
name = "% chance") +
coord_sf(crs = "EPSG:5070",
default_crs = "EPSG:4326") +
theme_void()
geom_sf
will add a layer to a ggplot2
map from an sf object, plotting points, lines, or polygons
(i.e. two-dimensional shapes that can be filled with color) depending on
the coordinates stored in the geometry variable.
In the code above, the seismic risk data is added inside the
geom_sf
function and then the color to fill the polygons
with is set to the values in ValueRange
using an
aes
function. linewidth = 0
sets the size of
boundary lines between the different seismic risk zones to zero, so they
do not appear; color = NA
should be redundant, but ensures
that any line has no color.
The scale_fill_brewer
function is then used to apply a
ColorBrewer palette to the
data.
coord_sf
applies a custom projection to the map.
crs = "EPSG:5070"
applies an Albers Equal Area Conic
projection, centered on the continental US;
default_crs = "EPSG:4326"
defines the pre-existing
projection of the data, so that the conversion is handled correctly.
theme_void
is a ggplot2 theme
that is good for maps because it omits grid lines, axes, and so on.
Now we will add data on earthquakes to the map. The code below loads GeoJSON returned from the US Geological Survey earthquakes search API for quakes larger than magnitude 5.5 since the beginning of 1960 within 3,000 kilometers of the geographic center of the continental US:
# load quakes data from USGS earthquakes API
quakes <- st_read("https://earthquake.usgs.gov/fdsnws/event/1/query?starttime=1960-01-01T00:00:00&minmagnitude=5.5&format=geojson&latitude=39.828175&longitude=-98.5795&maxradiuskm=3000&orderby=magnitude")
# examine the data
view(quakes)
Looking at the data, the geometry
column contains three
coordinates. The first two give latitude and longitude in degrees. The
third is depth of the quake in kilometers and may create a problem when
we try to map the points. This can be solved by removing that dimension
from the coordinates with the function st_zm
.
# remove the depth dimension from the coordinates
quakes <- quakes %>%
st_zm()
We should also check the projection:
# what is the projection of the quakes data?
st_crs(quakes)
Again, the projection is EPSG:4326
. Before making a map
in ggplot2 with multiple sf layers,
make sure they are all in the same projection so that they can all be
handled in the same way by the coord_sf
function. I would
recommend EPSG:4326
.
Now we can add the quakes to the map:
# add the historical quakes to the map
ggplot() +
geom_sf(data = seismic,
aes(fill = ValueRange),
linewidth = 0,
color = NA) +
geom_sf(data = quakes,
alpha = 0.1,
aes(size = 10^mag)) +
scale_fill_brewer(palette = "Reds",
name = "% chance" ) +
scale_size_area(max_size = 20, guide = "none") +
coord_sf(crs = "EPSG:5070",
default_crs = "EPSG:4326") +
theme_void()
The code above adds the quakes using another geom_sf
layer, making them 90% transparent by setting their opacity with
alpha = 0.1
. The points are sized using an aes
function according to the values of 10 raised to the power of
mag
, or the magnitude of each quake. Using this formula is
a quirk of working with earthquakes, where magnitude is measured on a
logarithmic scale, so that a difference of 1 corresponds to a 10-fold
difference in earth movement, as recorded on a seismograph. Raising 10
to the power of the earthquake magnitude correctly scales the circles
according to the amount of shaking they caused.
When scaling circles on a map for most forms of data you would simply
set size
to correspond to the values of a variable in the
data.
The maximum size for the resulting circles is then set using the
scale_size_area
function; guide = "none"
prevents a legend for size being drawn for the size of the circles.
Using scale_size_area
also ensures that the circles scale
correctly by area right down to a value of zero; without this there will
be a minimum size below which the points will not scale with the data
values.
The map as it currently stands has no wider context for the positions of nearby quakes that occurred outside of the boundaries of the continental US, and is now zoomed out to the 3,000km-radius circle containing all of the quakes we pulled from the USGS API. So now we will add data for the world’s nations and change the zoom of the map.
The data used is from Natural Earth, a widely used repository of geodata, converted to a local GeoJSON file. It can also be accessed from R directly using the rnaturalearth package.
# load a world map of country boundaries
countries <- st_read("countries.geojson")
Now we can add this layer to the map beneath the seismic risk and
quakes layers with a third geom_sf
function, setting the
fill to a light gray defined by the hex code #cccccc
and
the country borders to white. Altering size
for this
geom_sf
layer will change the thickness of the border
lines.
# add the countries layer and zoom into the continental US
ggplot() +
geom_sf(data = countries,
fill = "#cccccc",
color = "white",
linewidth = 0.3) +
geom_sf(data = seismic,
aes(fill = ValueRange),
linewidth = 0,
color = NA) +
geom_sf(data = quakes,
alpha = 0.1,
aes(size = 10^mag)) +
scale_fill_brewer(palette = "Reds",
name = "% chance") +
scale_size_area(max_size = 20, guide = "none") +
coord_sf(crs = "EPSG:5070",
default_crs = "EPSG:4326",
xlim = c(-135,-65),
ylim = c(20,55)) +
theme_void()
Adding limits to the longitude coordinates with xlim
and
latitude coordinates with ylim
to the coord_sf
function zoomed the map further in.
The sf package includes many functions for working with geodata that allow you to calculate distances and areas, or to perform geometric operations such finding the intersections between two geometries, drawing a bounding box around features, and so on. Here is a full reference guide. (If you are used to working with PostGIS, many of these functions will seem familiar.)
To provide a taste of the possibilities, here are a couple of examples working with the data we used above.
For working with distances, we will first convert to a projection
that uses meters as its units of distance, rather than the degrees of
latitude and longitude used by EPSG:4326
. (For functions
that calculate distance and area, sf will work with the
units of the projection; for EPSG:4326
it should fall back
to meters and square meters as a default.)
# change projection to Web Mercator
addresses <- addresses %>%
st_transform("EPSG:3857")
# check units for that projection
st_crs(addresses)
Now we can create a buffer around each address using the
sf function st_buffer
and see the results
by drawing a quick map with ggplot2. This operation
would be useful for stories such as investigating the impact of
exclusion zones for sex offenders around schools.
# create a buffer of half a mile (804.672 meters) around each address
addresses_buffer <- addresses %>%
st_buffer(dist = 804.672)
# look at data types
glimpse(addresses_buffer)
# draw a quick map
ggplot() +
geom_sf(data = addresses_buffer) +
geom_sf(data = addresses)
Notice that the geometry
column now contains the
coordinates for polygons rather than points.
You may wish to combine multiple polygons into one. This can be done
for sf objects with the function summarize
from the dplyr package.
addresses_buffer <- addresses_buffer %>%
summarize()
# look at data types
glimpse(addresses_buffer)
# draw a quick map
ggplot() +
geom_sf(data = addresses_buffer) +
geom_sf(data = addresses)
To do this, we will again first convert the projection of the
seismic
data to an Albers Equal Area projection, or
EPSG:5070
. Then we will combine the multiple polygons for
each risk category into one polygon per category. This can be done using
summarize
after first grouping by ValueRange
with the dplyr function group_by
.
# change projection to Albers Equal Area for continental US
seismic <- seismic %>%
st_transform("EPSG:5070")
# collapse the seismic data into one polygon per risk category
seismic <- seismic %>%
group_by(ValueRange) %>%
summarize()
Having prepared the data, we can now use the sf
function st_area
to calculate the area covered by each risk
category. That creates a vector of numbers with the units
m^2
. We can then make a data frame, or tibble, with columns
for ValueRange
from the seismic
data and the
areas converted to simple numbers using the function
as.numeric
, and finally calculate the percentages in a
dplyr mutate
function.
# calculate the area for each category in square meters
areas <- st_area(seismic)
view(areas)
# create a data frame with risk categories and and calculate % area in each category
pc_area_by_risk <- tibble(risk = seismic$ValueRange, area = as.numeric(areas)) %>%
mutate(percent = area/sum(area)*100)
view(pc_area_by_risk)
We have only scratched the surface of what you can do with sf. Read the full sf documentation, articles, and download the cheat sheet to learn more.
sf also works very well with the tidycensus package, which makes it easy to import data from key US Census Bureau APIs. See the vignette on Spatial data in tidycensus for more details. To use tidycensus, you will need to obtain a Census API key.