Text mining in R with tidytext

Introducing tidytext

This class assumes you’re familiar with using R, RStudio and the tidyverse, a coordinated series of packages for data science. If you’d like a refresher on basic data analysis in the tidyverse, try this class from the 2018 NICAR meeting.

tidytext is an R package that applies the principles of the tidyverse to analyzing text.

(We will also touch upon the quanteda package, which is good for quantitative tasks like counting the number of words and syllables in a body of text.)

If working on your own computer, you will need to install tidyverse, tidytext, and quanteda.
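One way to handle the installation is sketched below. It assumes you have an internet connection and a CRAN mirror configured, which RStudio sets up by default; the variable names are just for illustration.

```r
# packages this class relies on
needed <- c("tidyverse", "tidytext", "quanteda")

# install only the ones that are not already present
missing_pkgs <- setdiff(needed, rownames(installed.packages()))
if (length(missing_pkgs) > 0) {
  install.packages(missing_pkgs)
}
```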

The data we will use

The data for this session should be loaded on the computers at NICAR. If you’re working through this class remotely or on your own computer at the meeting, download the data from here, unzip the folder and place it on your desktop. It contains the following files:

trump_tweets.json Tweets from Donald Trump’s personal Twitter account from his inauguration on Jan. 20, 2017 to the end of 2019; downloaded from the Trump Twitter Archive. Contains the following variables:

  • source Platform/device used to tweet.
  • text The text of the tweet.
  • created_at When the tweet was sent.
  • retweet_count Number of times the tweet was retweeted.
  • favorite_count Number of times the tweet was favorited.
  • is_retweet Was it a retweet? TRUE or FALSE.
  • id_str Unique tweet ID.

sou.csv Text of annual presidential State of the Union addresses from 1790 to 2020, from the American Presidency Project. Contains the following variables:

  • link URL for the address at the time the data was obtained.
  • president Name of President.
  • message Title of address.
  • date Date of delivery.
  • text Text of the address.
  • party President’s party.

Some of the addresses were written, some spoken. Where there was both a spoken address and a written message, the text is from the speech. In 1973, Richard Nixon sent an overview, plus multiple reports to Congress on various areas of policy; here the text is from his overview message.

Analyzing Trump’s tweets

To understand how tidytext works, we’ll look at tweets from Trump’s personal Twitter account. As we’ll see, most but not all of these tweets seem to be from Trump himself, and they’re rather different from those apparently tweeted by his aides. The analysis below is inspired by an earlier exploration of Trump’s tweets, run during the 2016 presidential campaign by David Robinson, one of the authors of tidytext.

Setting up

The code below loads the packages we’ll be using for this analysis; their role is explained in the code comments. It also loads the data, and includes a regular expression to help parse the tweets, which we’ll use to filter out URLs, RT, and other extraneous characters that we don’t want to include in the text analysis.

At the end of this chunk, I’ve also included some code to parse the created_at variable into a standard timestamp, to convert that from UTC to US Eastern Time, and to extract the year and month from the US Eastern timestamp. While we won’t be analyzing tweets over time in this class, this should set you up to perform some month-by-month analyses of Trump’s most frequently used words or phrases and the sentiment of his tweets.

# load required packages
library(readr) # reading and writing delimited text files
library(dplyr) # SQL-style data processing
library(tidytext) # text analysis in R
library(stringr) # working with text strings
library(lubridate) # working with times and dates
library(jsonlite) # reading and writing JSON
library(tidyr) # data reshaping

# load data
tweets <- fromJSON("trump_tweets.json")

# regex for parsing tweets
replace_reg <- "https?://[^\\s]+|&amp;|&lt;|&gt;|\\bRT\\b"

# create date elements
tweets <- tweets %>%
  mutate(created_at = paste(substr(created_at,27,30),
                      substr(created_at,5,7),
                      substr(created_at,9,10),
                      substr(created_at,12,20)),
         utc_timestamp = ymd_hms(created_at),
         est_timestamp = with_tz(utc_timestamp, tz = "US/Eastern"),
         year = year(est_timestamp),
         month = month(est_timestamp))
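To see what the regular expression and the timestamp code do, here is a small sketch run on a single made-up tweet. The sample text and created_at value are invented for illustration, and the str_squish step is an addition to tidy up leftover spaces. The pattern writes the word boundary as \\b on both sides of RT, because a single \b inside an R string is a backspace character rather than a regex word boundary.

```r
library(stringr)
library(lubridate)

# hypothetical tweet text and created_at value, for illustration only
sample_text <- "RT Big news &amp; more at https://example.com/abc today"
sample_created_at <- "Fri Jan 20 17:51:25 +0000 2017"

# URLs, HTML entities and a standalone RT; \\b is the regex word boundary
replace_reg <- "https?://[^\\s]+|&amp;|&lt;|&gt;|\\bRT\\b"

# strip the matches, then collapse the whitespace they leave behind
cleaned <- str_squish(str_replace_all(sample_text, replace_reg, ""))
print(cleaned)

# rebuild "year month day time" from fixed character positions
# (the setup code takes characters 12 to 20, which also picks up a trailing space)
rebuilt <- paste(substr(sample_created_at, 27, 30),
                 substr(sample_created_at, 5, 7),
                 substr(sample_created_at, 9, 10),
                 substr(sample_created_at, 12, 19))

# parse as UTC, then express the same instant in US Eastern Time
utc_timestamp <- ymd_hms(rebuilt)
est_timestamp <- with_tz(utc_timestamp, tz = "US/Eastern")
print(est_timestamp)
```

Because the date pieces are pulled out by position, this approach relies on every created_at value having exactly the same layout as the sample.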

What devices/platforms has @realDonaldTrump tweeted on?

The source variable allows us to make a quick tally of the devices and platforms used to tweet from this account.

sources <- tweets %>%
  group_by(source) %>%
  count() %>%
  arrange(-n)
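The same kind of tally can be written more compactly with dplyr's count and its sort argument. The sketch below uses a small made-up data frame rather than the real tweets, so the source labels and counts are illustrative only.

```r
library(dplyr)

# hypothetical source values, for illustration only
toy_tweets <- tibble(source = c("Twitter for iPhone",
                                "Twitter for iPhone",
                                "Twitter Media Studio",
                                "Twitter for iPhone",
                                "Twitter Ads"))

# count(source, sort = TRUE) groups, counts and sorts in one step
toy_tally <- toy_tweets %>%
  count(source, sort = TRUE)

print(toy_tally)
```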



When David Robinson ran his analysis, he found that Trump’s characteristically angry tweets came from his Android phone. Since assuming the presidency, Trump has mostly used an iPhone. In the class, we’ll briefly examine the tweets from different devices and platforms. The ones that obviously stand out from the rest are those tweeted using Twitter’s Media Studio or Twitter Ads, which are professional tools used for Twitter campaigns. These, I assume, are tweets sent by Trump’s communications staff, so let’s create a new variable in the data making that distinction, calling it source2.

tweets <- tweets %>%
  mutate(source2 = case_when(grepl("Ads|Media",source) ~ "aides",
                            TRUE ~ "trump"))
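To check how the pattern in grepl sorts the sources, it can help to run the same case_when on a handful of labels first. The labels below are made up for illustration and may not match the exact values in the data.

```r
library(dplyr)

# hypothetical source labels, for illustration only
toy_sources <- tibble(source = c("Twitter for iPhone",
                                 "Twitter Media Studio",
                                 "Twitter Ads",
                                 "Twitter for Android",
                                 "Twitter Web Client"))

# any source containing "Ads" or "Media" is attributed to aides,
# and everything else falls through to "trump"
toy_sources <- toy_sources %>%
  mutate(source2 = case_when(grepl("Ads|Media", source) ~ "aides",
                             TRUE ~ "trump"))

print(toy_sources)
```

Because grepl matches anywhere in the string, any other source whose name happens to contain "Ads" or "Media" would also be labeled as aides, so it is worth inspecting the tally of sources before relying on the split.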

Split the text into words and word pairs

The key to tidytext is the function unnest_tokens, which splits up text in a defined manner according to the token used, creating a new row in the data for each of the split sections. Run the code below and examine the result:

# split into words
words <- tweets %>%
  filter(is_retweet == FALSE & substring(text,1,2) != "RT" & substring(text,1,4) != "http") %>%
  mutate(text = str_replace_all(text, replace_reg, "")) %>%
  unnest_tokens(word, text, token = "tweets")

This code first filters out all retweets, along with tweets that start with a link (typically those in which the account tweeted only a link), so that we are only looking at original text written by the @realDonaldTrump account. The mutate step then removes URLs and other unwanted characters using our regular expression, before unnest_tokens does its magic, splitting the text of each tweet into individual words. Usually, when splitting text into words, you would use token = "words". Here, we used token = "tweets", a variant that retains the # and @ symbols of hashtags and mentions. (By default, unnest_tokens also converts text to lower case.) Recent versions of tidytext have dropped the "tweets" tokenizer, so if this line fails on your installation, token = "words" is the fallback, at the cost of losing those symbols.
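To make the one-row-per-token idea concrete, here is a minimal sketch on two made-up tweets. It uses token = "words" rather than token = "tweets", and the id column and text are invented for illustration.

```r
library(dplyr)
library(tidytext)

# two hypothetical tweets, for illustration only
toy_tweets <- tibble(id = c(1, 2),
                     text = c("Great rally in Ohio tonight!",
                              "Thank you Ohio"))

# one row per word; other columns (here, id) are carried along,
# punctuation is dropped and the text is lower-cased
toy_words <- toy_tweets %>%
  unnest_tokens(word, text, token = "words")

print(toy_words)
```

The first tweet contributes five rows and the second three, each still tagged with the id of the tweet it came from.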

But notice that the words include common words like the and this. To analyze someone’s distinctive word use, you want to remove these words. That can be done with an anti_join to tidytext’s list of stop_words. (See the Twitter chapter from the Tidy Text Mining With R book, recommended below, for a more sophisticated way to filter out stop words that will also remove stop words preceded by a hashtag.)

To understand why this works, we’ll first view the stop_words object to see that it contains a variable called word, listing stop words from a number of different lexicons. So we can anti_join to our data frame on this column to filter out any rows that match these words.

View(stop_words)

# remove stop words
words <- words %>%
  anti_join(stop_words, by = "word")
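As a small sketch of what the anti_join does, the made-up data frame below has four words, two of which appear in stop_words. The example words are chosen for illustration only.

```r
library(dplyr)
library(tidytext)

# hypothetical tokens, for illustration only
toy_words <- tibble(word = c("the", "rally", "this", "ohio"))

# anti_join keeps only the rows of toy_words with no match in stop_words
toy_kept <- toy_words %>%
  anti_join(stop_words, by = "word")

print(toy_kept)

# stop_words also records which lexicon each word came from
print(count(stop_words, lexicon))
```

If the full list turns out to be too aggressive for your text, filtering stop_words on its lexicon column before the join is one way to use a single lexicon instead.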

I have found that analyzing word pairs, or bigrams, can often be more revealing than looking at individual words. So let’s also create a data frame of bigrams.

Here, unnest_tokens uses token = "ngrams" and n = 2 to split the text into word pairs.

# split into word pairs
bigrams <- tweets %>% 
  filter(is_retweet == FALSE & substring(text,1,4) != "http" & substring(text,1,2) != "RT") %>%
  mutate(text = str_replace_all(text, replace_reg, "")) %>%
  unnest_tokens(bigram, text, token = "ngrams", n = 2)
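The bigrams overlap: each word, apart from the first and last, appears in two consecutive pairs. A quick sketch on one made-up sentence shows the pattern.

```r
library(dplyr)
library(tidytext)

# a hypothetical four-word tweet, for illustration only
toy_tweet <- tibble(text = "Build the wall now")

# n = 2 slides a two-word window along the text, one step at a time
toy_bigrams <- toy_tweet %>%
  unnest_tokens(bigram, text, token = "ngrams", n = 2)

print(toy_bigrams)
```

A four-word text therefore yields three bigrams.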

Removing word pairs that contain stop words is a little more involved in this case. First, we split each bigram into its individual components using the separate function from the tidyr package. Having done that, we need two anti_joins, specifying how each join should be made, to remove any bigrams that contain a stop word.

The last part of the code below filters the data to remove any bigrams containing “words” without any alphabetic characters.

# remove stop words
bigrams <- bigrams %>%
  separate(bigram, into = c("first","second"), sep = " ", remove = FALSE) %>%
  anti_join(stop_words, by = c("first" = "word")) %>%
  anti_join(stop_words, by = c("second" = "word")) %>%
  filter(str_detect(first, "[a-z]") &
         str_detect(second, "[a-z]"))
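The effect of each step is easier to see on a few hand-picked bigrams. The four pairs below are made up for illustration: one starts with a stop word, one ends with a stop word, one ends with a number, and one should survive.

```r
library(dplyr)
library(tidyr)
library(stringr)
library(tidytext)

# hypothetical bigrams, for illustration only
toy_bigrams <- tibble(bigram = c("the wall", "wall is", "border wall", "election 2020"))

toy_kept <- toy_bigrams %>%
  # split into two columns, keeping the original bigram column
  separate(bigram, into = c("first", "second"), sep = " ", remove = FALSE) %>%
  # drop pairs whose first word is a stop word, then those whose second is
  anti_join(stop_words, by = c("first" = "word")) %>%
  anti_join(stop_words, by = c("second" = "word")) %>%
  # drop pairs in which either part has no lower-case letter
  filter(str_detect(first, "[a-z]") &
         str_detect(second, "[a-z]"))

print(toy_kept)
```

Only "border wall" remains. The "[a-z]" test is enough here because unnest_tokens has already converted the text to lower case.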

What words were used most often by Trump and by his aides?

Having created tidy data frames with one row for each mention of a word or bigram, it’s easy to analyze the frequency of word use with dplyr:

words_count <- words %>%
  group_by(source2, word) %>%
  count()

trump_words <- words_count %>%
  filter(source2 == "trump") %>%
  arrange(-n)
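If you want the top words for both groups in a single data frame, one option is slice_max, sketched here on made-up counts. This assumes dplyr 1.0.0 or later, and the toy words are invented for illustration.

```r
library(dplyr)

# hypothetical tidy words, one row per mention, for illustration only
toy_words <- tibble(
  source2 = c("trump", "trump", "trump", "trump", "trump", "trump",
              "aides", "aides", "aides"),
  word = c("wall", "wall", "wall", "border", "border", "rally",
           "jobs", "jobs", "economy"))

# count mentions of each word within each group
toy_count <- toy_words %>%
  count(source2, word)

# keep the single most frequent word for each value of source2
toy_top <- toy_count %>%
  group_by(source2) %>%
  slice_max(n, n = 1) %>%
  ungroup()

print(toy_top)
```

Here the first n names the count column to rank by and n = 1 sets how many rows to keep per group; by default, tied rows are all kept.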