This class assumes you’re familiar with using R, RStudio and the tidyverse, a coordinated series of packages for data science. If you’d like a refresher on basic data analysis in tidyverse, try this class from the 2018 NICAR meeting.
tidytext is an R package that applies the principles of the tidyverse to analyzing text.
(We will also touch upon the quanteda package, which is good for quantitative tasks like counting the number of words and syllables in a body of text.)
If working on your own computer, you will need to install tidyverse, tidytext, and quanteda.
The data for this session should be loaded on the computers at NICAR. If you’re working through this class remotely or on your own computer at the meeting, download the data from here, unzip the folder and place it on your desktop. It contains the following files:
trump_tweets.json: Tweets from Donald Trump’s personal Twitter account from his inauguration on Jan. 20, 2017 to the end of 2019; downloaded from the Trump Twitter Archive. Contains the following variables:
source: Platform/device used to tweet.
text: The text of the tweet.
created_at: When the tweet was sent.
retweet_count: Number of times the tweet was retweeted.
favorite_count: Number of times the tweet was favorited.
is_retweet: Was it a retweet?
id_str: Unique tweet ID.
sou.csv: Text of annual presidential State of the Union addresses from 1790 to 2020, from the American Presidency Project. Contains the following variables:
link: URL for the address at the time the data was obtained.
president: Name of President.
message: Title of address.
date: Date of delivery.
text: Text of the address.
Some of the addresses were written, some spoken. Where there was both a spoken address and a written message, the text is from the speech. In 1973, Richard Nixon sent an overview, plus multiple reports to Congress on various areas of policy; here the text is from his overview message.
To understand how tidytext works, we’ll look at tweets from Trump’s personal Twitter account. As we’ll see, most but not all of these tweets seem to be from Trump himself, and they’re rather different from those apparently tweeted by his aides. The analysis below is inspired by an earlier exploration of Trump’s tweets, run during the 2016 presidential campaign by David Robinson, one of the authors of tidytext.
The code below loads the packages we’ll be using for this analysis; their role is explained in the code comments. It also loads the data, and includes a regular expression to help parse the tweets, which we’ll use to filter out URLs, RT, and other extraneous characters that we don’t want to include in the text analysis.
At the end of this chunk, I’ve also included some code to parse the created_at variable into a standard timestamp, to convert that from UTC to US Eastern Time, and to extract the year and month from the US Eastern timestamp. While we won’t be analyzing tweets over time in this class, this should set you up to perform some month-by-month analyses of Trump’s most frequently used words or phrases and the sentiment of his tweets.
    # load required packages
    library(readr)     # reading and writing delimited text files
    library(dplyr)     # SQL-style data processing
    library(tidytext)  # text analysis in R
    library(stringr)   # working with text strings
    library(lubridate) # working with times and dates
    library(jsonlite)  # reading and writing JSON
    library(tidyr)     # data reshaping

    # load data
    tweets <- fromJSON("trump_tweets.json")

    # regex for parsing tweets: URLs, HTML entities and a standalone RT
    replace_reg <- "https?://[^\\s]+|&amp;|&lt;|&gt;|\\bRT\\b"

    # create date elements
    tweets <- tweets %>%
      mutate(created_at = paste(substr(created_at,27,30),
                                substr(created_at,5,7),
                                substr(created_at,9,10),
                                substr(created_at,12,20)),
             utc_timestamp = ymd_hms(created_at),
             est_timestamp = with_tz(utc_timestamp, tz = "US/Eastern"),
             year = year(est_timestamp),
             month = month(est_timestamp))
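To see what the date-handling step does, here is a minimal sketch that applies the same substr and ymd_hms logic to a single timestamp. It assumes created_at follows the "Fri Jan 20 17:51:25 +0000 2017" layout, which is what the character positions in the code above imply; the example value itself is made up.

```r
library(lubridate)

# hypothetical created_at value, assumed to follow the layout implied above
created_at <- "Fri Jan 20 17:51:25 +0000 2017"

# reorder the pieces into year, month, day, time so that ymd_hms can parse them
reordered <- paste(substr(created_at, 27, 30),  # year
                   substr(created_at, 5, 7),    # abbreviated month name
                   substr(created_at, 9, 10),   # day of month
                   substr(created_at, 12, 19))  # hh:mm:ss

utc_timestamp <- ymd_hms(reordered)                         # parsed as UTC
est_timestamp <- with_tz(utc_timestamp, tz = "US/Eastern")  # same instant, Eastern clock time

year(est_timestamp)
month(est_timestamp)
hour(est_timestamp)
```

Note that with_tz changes only how the instant is displayed, not the instant itself, so tweets sent late in the evening UTC can fall on a different calendar day, month or year once converted.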
The source variable allows us to make a quick tally of the devices and platforms used to tweet from this account.
    sources <- tweets %>%
      group_by(source) %>%
      count() %>%
      arrange(-n)
When David Robinson ran his analysis, he found that Trump’s characteristically angry tweets came from his Android phone. Since assuming the presidency, Trump has mostly used an iPhone. In the class, we’ll briefly examine the tweets from different devices and platforms. The ones that obviously stand out from the rest are those tweeted using Twitter’s Media Studio or Twitter Ads, which are professional tools used for Twitter campaigns. These, I assume, are tweets sent by Trump’s communications staff, so let’s create a new variable in the data making that distinction, calling it source2.
    tweets <- tweets %>%
      mutate(source2 = case_when(grepl("Ads|Media",source) ~ "aides",
                                 TRUE ~ "trump"))
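As a quick check of how this recoding behaves, the sketch below runs the same case_when on a tiny data frame. The source labels are illustrative stand-ins, not values copied from the real file.

```r
library(dplyr)

# toy data; the source labels are illustrative, not taken from the real file
toy <- tibble(source = c("Twitter for iPhone",
                         "Twitter Media Studio",
                         "Twitter Ads",
                         "Twitter for Android"))

# any source containing "Ads" or "Media" is labelled "aides"; everything else "trump"
toy <- toy %>%
  mutate(source2 = case_when(grepl("Ads|Media", source) ~ "aides",
                             TRUE ~ "trump"))

toy
```

Because case_when evaluates its conditions in order, the TRUE line acts as a catch-all for every row that did not match the first pattern.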
The key to tidytext is the function unnest_tokens, which splits up text in a defined manner according to the token used, creating a new row in the data for each of the split sections. Run the code below and examine the result:
    # split into words
    words <- tweets %>%
      filter(is_retweet == FALSE & substring(text,1,2) != "RT" & substring(text,1,4) != "http") %>%
      mutate(text = str_replace_all(text, replace_reg, "")) %>%
      unnest_tokens(word, text, token = "tweets")
This code first filters out all retweets, and tweets in which the account tweeted only a link, so that we are only looking at original text written by the @realDonaldTrump account. The mutate section then removes URLs and other unwanted characters using our regular expression, before unnest_tokens does its magic. It splits the text of each tweet into individual words. Usually, when splitting text into words, you would use token = "words". Here, we used token = "tweets", which is a variant that retains hashtag # and @ symbols. (By default, unnest_tokens also converts text to lower case.)
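The clean-then-tokenize pattern can be tried on a couple of made-up tweets. This is a minimal sketch: the toy text and the names toy, toy_reg and toy_words are hypothetical, it assumes the raw text contains HTML entities such as &amp;, and it uses the standard token = "words" tokenizer rather than the tweet-specific one.

```r
library(dplyr)
library(stringr)
library(tidytext)

# made-up tweet text, for illustration only
toy <- tibble(id = 1:2,
              text = c("Big RALLY tonight https://example.com/abc",
                       "Jobs &amp; growth are UP"))

# a pattern in the spirit of replace_reg: URLs, HTML entities, standalone RT
toy_reg <- "https?://[^\\s]+|&amp;|&lt;|&gt;|\\bRT\\b"

toy_words <- toy %>%
  mutate(text = str_replace_all(text, toy_reg, "")) %>%  # strip unwanted pieces
  unnest_tokens(word, text, token = "words")             # one row per word, lower-cased

toy_words
```

The result has one row per word and keeps the id column, so each word can still be traced back to the tweet it came from.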
But notice that the words include common words like “this”. To analyze someone’s distinctive word use, you want to remove these words. That can be done with an anti_join to tidytext’s list of stop_words. (See the Twitter chapter from the Tidy Text Mining With R book, recommended below, for a more sophisticated way to filter out stop words that will also remove stop words preceded by a hashtag.)
To understand why this works, we’ll first view the stop_words object to see that it contains a variable called word, listing stop words from a number of different lexicons. So we can anti_join to our data frame on this column to filter out any rows that match these words.
    View(stop_words)

    # remove stop words
    words <- words %>%
      anti_join(stop_words, by = "word")
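A small sketch of the anti_join on four hand-picked words shows the effect: rows whose word appears in stop_words are dropped and the rest are kept. The names toy_words and kept are hypothetical.

```r
library(dplyr)
library(tidytext)

# toy data frame with one word per row, as unnest_tokens would produce
toy_words <- tibble(word = c("the", "economy", "is", "booming"))

# keep only the rows that have no match in stop_words$word
kept <- toy_words %>%
  anti_join(stop_words, by = "word")

kept
```

An anti_join never adds columns from the second table; it only filters the first, which is why the output keeps the same structure as the input.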
I have found that analyzing word pairs, or bigrams, can often be more revealing than looking at individual words. So let’s also create a data frame of bigrams. Here we use unnest_tokens(bigram, text, token = "ngrams", n = 2) to split the text into word pairs.
    # split into word pairs
    bigrams <- tweets %>%
      filter(is_retweet == FALSE & substring(text,1,4) != "http" & substring(text,1,2) != "RT") %>%
      mutate(text = str_replace_all(text, replace_reg, "")) %>%
      unnest_tokens(bigram, text, token = "ngrams", n = 2)
Removing word pairs that contain stop words is a little more involved in this case. First, we split each bigram into its individual components using the separate function from the tidyr package. Having done that, we need two anti_joins, specifying how each join should be made, to remove any bigrams that contain a stop word.
The last part of the code below filters the data to remove any bigrams containing “words” without any alphabetic characters.
    # remove stop words
    bigrams <- bigrams %>%
      separate(bigram, into = c("first","second"), sep = " ", remove = FALSE) %>%
      anti_join(stop_words, by = c("first" = "word")) %>%
      anti_join(stop_words, by = c("second" = "word")) %>%
      filter(str_detect(first, "[a-z]") & str_detect(second, "[a-z]"))
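The bigram steps can be followed end to end on one made-up sentence. This sketch mirrors the code above on toy data; the sentence and the names toy and toy_bigrams are hypothetical.

```r
library(dplyr)
library(tidyr)
library(stringr)
library(tidytext)

# a single made-up sentence, for illustration only
toy <- tibble(text = "The fake news media is the enemy")

toy_bigrams <- toy %>%
  # overlapping word pairs: "the fake", "fake news", "news media", ...
  unnest_tokens(bigram, text, token = "ngrams", n = 2) %>%
  # split each pair into two columns, keeping the original bigram column
  separate(bigram, into = c("first", "second"), sep = " ", remove = FALSE) %>%
  # drop pairs where either word is a stop word
  anti_join(stop_words, by = c("first" = "word")) %>%
  anti_join(stop_words, by = c("second" = "word")) %>%
  # drop pairs where either "word" has no letters
  filter(str_detect(first, "[a-z]") & str_detect(second, "[a-z]"))

toy_bigrams
```

Only pairs in which neither word is a stop word survive, so a sentence of seven words can shrink to just a couple of bigrams.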
Having created tidy data frames with one row for each mention of a word or bigram, it’s easy to analyze the frequency of word use with dplyr:
    words_count <- words %>%
      group_by(source2, word) %>%
      count()

    trump_words <- words_count %>%
      filter(source2 == "trump") %>%
      arrange(-n)
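To make the counting step concrete, here is the same group_by and count pattern on a handful of invented rows. The data and the names toy_words, toy_count and toy_trump are hypothetical.

```r
library(dplyr)

# invented word-level data in the same shape as the words data frame
toy_words <- tibble(source2 = c("trump", "trump", "trump", "aides", "aides"),
                    word    = c("jobs", "wall", "jobs", "jobs", "rally"))

# one row per source2/word combination, with n giving the number of mentions
toy_count <- toy_words %>%
  group_by(source2, word) %>%
  count()

# words from the "trump" group, most frequent first
toy_trump <- toy_count %>%
  filter(source2 == "trump") %>%
  arrange(-n)

toy_trump
```

dplyr’s count(source2, word, sort = TRUE) is a more compact way to get a sorted tally of the same combinations in one call.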