Interviewing data

Interviewing data: R, RStudio, and the tidyverse for data analysis

R and RStudio

Last quarter we worked with R and RStudio to process data and make static, interactive, and animated graphics. Here, we will explore how to “interview” data to find leads and context for stories.

The data we will use

Download the data for these classes from here, unzip the folder and place it on your desktop. It contains the following files:

  • ca_discipline.csv Disciplinary alerts and actions issued by the Medical Board of California from 2008 to 2017. Processed from downloads available here. Contains the following variables:
    • alert_date Date alert issued.
    • last_name Last name of doctor/health care provider.
    • first_name First name of doctor/health care provider.
    • middle_name Middle/other names.
    • name_suffix Name suffix (Jr., II etc)
    • city City of practive location. +state State of practice location.
    • license California medical license number.
    • action_type Type of action.
    • action_date Date of action.
  • ca_medicare_opioids.csv Data on prescriptions of opioid drugs under the Medicare Part D Prescription Drug Program by doctors in California, from 2013 to 2015. Filtered from the national data downloads available here. It contains many variables, including:
    • npi National Provider Identifier (NPI) for the doctor/organization making the claim. This is a unique code for each health care provider.
    • nppes_provider_last_org_name For individual doctors, their last name. For organizations, the organziation name.
    • nppes_provider_first_name First name for individual doctors, blank for organizations.
    • nppes_provider_city City where the provider is located.
    • nppes_provider_state State where the provider is located; “CA” for all of these records.
    • specialty_description Provider’s medical speciality, reported on their medicare claims. For providers that have more than one Medicare specialty code reported on their claims, the code associated with the largest number of services.
    • drug_name Includes both brand names (drugs that have a trademarked name) and generic names (drugs that do not have a trademarked name).
    • generic_name The chemical ingredient of a drug rather than the trademarked brand name under which the drug is sold.
    • bene_count Total number of unique Medicare Part D beneficiaries (i.e. patients) with at least one claim for the drug. Counts fewer than 11 are suppressed and are indicated by a blank.
    • total_claim_count Number of Medicare Part D claims; includes original prescriptions and refills. If less than 11, counts are not included in the data file.
    • total_drug_cost Total cost paid for all associated claims; includes ingredient cost, dispensing fee, sales tax, and any applicable fees.
    • year 2013, 2014, or 2015.
  • npi_license.csv Crosswalk file to join NPI identifiers to state license numbers, processed from the download available here to include license numbers potentially matching California doctors. This will provide one way of joining the precription data to the medical board disciplinary data. As we shall see, problems with the data mean that it is not infallible. Contains the following variables:
    • npi National Provider Identifier, as described above.
    • plicnum State license number, from the original file.
    • license Processed from pclicnum to conform to the format of California medical license numbers.

Some words of caution, before we start

The US is currently in the grip of an epidemic of opioid abuse and addiction. Although widespread medical prescription of opioids helped drive addiction, a majority of overdoses now occur through the consumption of drugs purchased illegally.

Opioids have important medical uses, and just because a doctor prescribes large amounts of the drugs doesn’t necessarily mean they are practising irresponsibly.

As ProPublica explained, in the methods for its stories based on Medicare Part D prescription data:

“The data could not tell us everything. We interviewed many high-volume prescribers to better understand their patients and their practices. Some told us their numbers were high because they were credited with prescriptions by others working in the same practice. In addition, providers who primarily work in long-term care facilities or busy clinics with many patients naturally may write more prescriptions.”

Reminder on R code basics

We reviewed these basics in the previous quarter:

  • <- is known as an “assignment operator.” It means: “Make the object named to the left equal to the output of the code to the right.”
  • & means AND, in Boolean logic.
  • | means OR, in Boolean logic.
  • ! means NOT, in Boolean logic.
  • When referring to values entered as text, or to dates, put them in quote marks, like this: "United States", or "2016-07-26". Numbers are not quoted.
  • When entering two or more values as a list, combine them using the function c, for combine, with the values separated by commas, for example: c("2017-07-26","2017-08-04")
  • As in a spreadsheet, you can specify a range of values with a colon, for example: c(1:10) creates a list of integers (whole numbers) from one to ten.
  • Some common operators:
  • + - add, subtract.
  • * / multiply, divide.
  • > < greater than, less than.
  • >= <= greater than or equal to, less than or equal to.
  • != not equal to.

  • Equals signs can be a little confusing, but see how they are used in the code we use today:

  • == test whether an object is equal to a value. This is often used when filtering data, as we will see.
  • = make an object equal to a value; similar to <-, but used within a function (see below).

  • Handling null values:
  • Nulls are designated as NA.
  • is.na(x) looks for nulls within variable x.
  • !is.na(x) looks for non-null values within variable x.

Here, is.na is a function. Functions are followed by parentheses, and act on the data/code in the parentheses.

Important: Object and variable names in R should not contain spaces.

Reminder on dplyr basics

We can use the dplyr package process and analyze data, using these basic operations:

  • Sort: Largest to smallest, oldest to newest, alphabetical etc.

  • Filter: Select a defined subset of the data.

  • Summarize/Aggregate: Deriving one value from a series of other values to produce a summary statistic. Examples include: count, sum, mean, median, maximum, minimum etc. Often you’ll group data into categories first, and then aggregate by group.

  • Join: Merging entries from two or more datasets based on common field(s), e.g. unique ID number, last name and first name.

Here, again, are some of the most useful functions in dplyr:

  • select Choose which columns to include.
  • filter Filter the data.
  • arrange Sort the data, by size for continuous variables, by date, or alphabetically.
  • group_by Group the data by a categorical variable.
  • summarize Summarize, or aggregate (for each group if following group_by). Often used in conjunction with functions including:
    • mean(x) Calculate the mean, or average, for variable x.
    • median(x) Calculate the median.
    • max(x) Find the maximum value.
    • min(x) Find the minimum value.
    • sum(x) Add all the values together.
    • n() Count the number of records. Here there isn’t a variable in the brackets of the function, because the number of records applies to all variables.
    • n_distinct(x) Count the number of unique values in variable x.
  • mutate Create new column(s) in the data, or change existing column(s).
  • rename Rename column(s).
  • bind_rows Merge two data frames into one, combining data from columns with the same name.

There are a number of join functions in dplyr to combine data from two data frames. Here are the most useful:

  • inner_join returns values from both tables only where there is a match.
  • left_join returns all the values from the first-mentioned table, plus those from the second table that match.
  • semi_join filters the first-mentioned table to include only values that have matches in the second table.
  • anti_join filters the first-mentioned table to include only values that have no matches in the second table.

Here is a useful reference for managing joins with dplyr.

Setting up

OpenRStudio, save a new script in the same folder as the downloaded data and set the working directory to this folder by selecting from the top menu Session>Set Working Directory>To Source File Location. (Doing so means we can load the files in this directory without having to refer to the full path for their location, and anything we save will be written to this folder.)

Also install the DT package, which turns R data frames into searchable web tables.

Now load the packages we’ll use today:

# load packages to read and write csv files, process data, work with dates, and make searchable web tables
library(readr)
library(dplyr)
library(lubridate)
library(DT)

At this point, and at regular intervals, save your script, by clicking the save/disk icon in the script panel, or using the ⌘-S/Ctrl-S keyboard shortcut.

Load and examine the disciplinary actions data

# load ca medical board disciplinary actions data
ca_discipline <- read_csv("ca_discipline.csv")

# view structure of data
glimpse(ca_discipline)
## Observations: 7,561
## Variables: 10
## $ alert_date  <date> 2008-04-18, 2008-04-21, 2008-04-23, 2008-04-28, 200…
## $ last_name   <chr> "Boyajian", "Cragen", "Chow", "Gravich", "Kabacy", "…
## $ first_name  <chr> "John", "Richard", "Hubert", "Anna", "George", "Kama…
## $ middle_name <chr> "Arthur", "Darin", "Wing", NA, "E.", "Fouad", "A.", …
## $ name_suffix <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ city        <chr> "Boise", "Temecula", "San Gabriel", "Los Angeles", "…
## $ state       <chr> "ID", "CA", "CA", "CA", "WA", "WA", "WV", "NM", "NV"…
## $ license     <chr> "A25855", "A54872", "G45435", "A40805", "G13766", "C…
## $ action_type <chr> "Surrendered", "Surrendered", "Superior Court Order/…
## $ action_date <date> 2008-04-18, 2008-04-21, 2008-03-10, 2008-04-25, 200…

Filter, sort, and clean data

To get used to working with dplyr, we will now start filtering and sorting the data according to the doctors’ locations, and the types of disciplinary actions they faced.

First, let’s look at the types of disciplinary actions in the data:

# look at types of disciplinary actions
types <- ca_discipline %>%
  select(action_type) %>%
  unique()

This code first copies the ca_disciplinary data frame into a new object called types. Then (%>%) it selects the action_type variable only. Finally, it uses a function called unique to display the unique values in that variable, with no duplicates. This should be the result:

Now make a searchable web table to display the data:

# make a searchable web table
datatable(types, extensions = "Responsive")

datatable is the function from the DT that creates a searchable web table from a data frame; extensions = "Responsive" means that the table will be condensed to fit the device/screen on which it is viewed – any columns that don’t fit will be hidden, but can be opened up and examined for any row by clicking on a green cross that will appear if the table has been condensed in this way.

Many of the values for action_types seem to be subtle variations of the same thing. If we were going to all analyze the types in detail, we may need to group some of these together, after speaking to an expert to understand what they all mean. If you search the table for “reprimand”“, however, it’s pretty obvious that the four entries containing that word all amount to the same thing. To get a sense of how we could write code to clean up these categories, let’s convert those to all say”Reprimand" :

# clean data for reprimands
ca_discipline <- ca_discipline %>%
  mutate(action_type = case_when(grepl("reprimand", ignore.case = TRUE, action_type) ~ "Reprimand",
                                 TRUE ~ action_type))

The case_when function allows you to set up conditional statements. Here we are changing the action_type variable so that if the text contains the word “reprimand,” matched using the grepl function, irrespective of the case of the letters, the entry is changed to “Reprimand,” otherwise (the TRUE condition) it is left with the original action_type value. You can string together multiple conditional statements in a case_when function to edit the data as you wish. If cleaning a dataset involves a few simple changes like this, I often do it in R rather than using Open Refine, like we did last quarter.

Some of the action types are clear and unambiguous: Revoked is the medical board’s most severe sanction, which cancels a doctor’s license to practise. So let’s first filter the data to look at those actions:

# filter for license revocations only
revoked <- ca_discipline %>%
  filter(action_type == "Revoked")
  
datatable(revoked, extensions = "Responsive")


Some of these doctors are not actually based in California. Doctors can be licensed to practise in more than one state, and the Medical Board of California typically issues its own sanction if a doctor is disciplined in their home state. So let’s now filter the data to look only at doctors with revoked licenses who were based in California, and sort them by city.

# filter for license revocations by doctors based in California, and sort by city
revoked_ca <- ca_discipline %>%
  filter(action_type == "Revoked"
         & state == "CA") %>%
  arrange(city)

datatable(revoked_ca, extensions = "Responsive")


Here, the filter combines two conditions with &. That means that both have to be met for the data to be included. Then arrange sorts the data, which for a text variable will be in alphabetical order. If you wanted to sort in reverse alphabetical order, the code would be: (arrange(desc(city)).

Group and summarize data

Next we will group and summarize data by counting disciplinary actions by year and by month. But before doing that, we need to use the year and month functions from the lubridate package to extract the year and month from action_date.

# extract year and month from action_date
ca_discipline <- ca_discipline %>%
  mutate(year = year(action_date),
         month = month(action_date))

Now we have two extra columns in the data, giving the year and the month as a number.

Previously we used the dplyr mutate function to modify an existing variable. Here we used it to create new variables. You can create or modify multiple variables in the same mutate function, separating each one by commas. Notice the use of = to make a variable equal to the output of code, within the mutate function.

Now we can calculate the number of license revocations for doctors based in California by year:

# license revokations for doctors based in California, by year
revoked_ca_year <- ca_discipline %>%
  filter(action_type == "Revoked" 
         & state == "CA") %>%
  group_by(year) %>%
  summarize(revocations = n())

The code first filters the data, as before, then groups by the new variable year using group_by, then summarizes by counting the number of records for each year using summarize. The last function creates a new variable called revocations from n(), which is a count of the rows in the data for each year.

This should be the result:

datatable(revoked_ca_year, extensions = "Responsive")

Looking at this result and the raw ca_discipline data, we only have partial data for 2008. So if we want to count the number of license revocations by month over all the years, we should first filter out the data for 2008, which will otherwise skew the result.

# license revokations for doctors based in California, by month
revoked_ca_month <- ca_discipline %>%
  filter(action_type == "Revoked" 
         & state == "CA"
         & year >= 2009) %>%
  group_by(month) %>%
  summarize(revocations = n())

datatable(revoked_ca_month)


Notice how we used >= to filter for data where the year was 2009 or greater. The following code will achieve the same result:

# license revocations for doctors based in California, by month
revoked_ca_month <- ca_discipline %>%
  filter(action_type == "Revoked" 
         & state == "CA"
         & year != 2008) %>%
  group_by(month) %>%
  summarize(revocations = n())

We can group and summarize by more than one variable at a time. The following code counts the number of actions of all types, by month and year. Again, we will first filter out the incomplete data for 2008.

# disciplinary actions for doctors in California by year and month, from 2009 to 2017
actions_year_month <- ca_discipline %>%
  filter(state == "CA"
         & year >= 2009) %>%
  group_by(year, month) %>%
  summarize(actions = n()) %>%
  arrange(year, month)

datatable(actions_year_month, extensions = "Responsive")


Notice that the group_by and arrange functions group and sort the data, respectively, by two variables, separated by commas.

Load and examine the Medicare opioid prescription data

Now we will work with the larger data on doctors in California prescribing opioids under Medicare Part D, learn how to join data in different data frames, and see how to make some simple charts to help make sense of your data.

Load and view the data:

# load opioid prescription data
ca_opioids <- read_csv("ca_medicare_opioids.csv")

glimpse(ca_opioids)
## Observations: 346,911
## Variables: 22
## $ npi                           <int> 1003001363, 1003001363, 1003001363…
## $ nppes_provider_last_org_name  <chr> "STEVENS", "STEVENS", "STEVENS", "…
## $ nppes_provider_first_name     <chr> "CHARLES", "CHARLES", "CHARLES", "…
## $ nppes_provider_city           <chr> "EL CENTRO", "EL CENTRO", "EL CENT…
## $ nppes_provider_state          <chr> "CA", "CA", "CA", "CA", "CA", "CA"…
## $ specialty_description         <chr> "Anesthesiology", "Anesthesiology"…
## $ description_flag              <chr> "S", "S", "S", "S", "S", "S", "S",…
## $ drug_name                     <chr> "ACETAMINOPHEN-CODEINE", "BUTRANS"…
## $ generic_name                  <chr> "ACETAMINOPHEN WITH CODEINE", "BUP…
## $ bene_count                    <int> 12, 16, 46, 84, 288, 32, 25, 33, 1…
## $ total_claim_count             <int> 25, 53, 90, 427, 981, 139, 142, 16…
## $ total_30_day_fill_count       <dbl> 25.0, 53.0, 92.0, 431.0, 989.9, 13…
## $ total_day_supply              <int> 750, 1496, 2700, 12809, 29170, 406…
## $ total_drug_cost               <dbl> 692.97, 14974.74, 6922.46, 81409.3…
## $ bene_count_ge65               <int> NA, NA, 23, 51, 189, 12, NA, NA, 3…
## $ bene_count_ge65_suppress_flag <chr> "#", "#", NA, NA, NA, NA, "*", "*"…
## $ total_claim_count_ge65        <int> NA, 37, 49, 272, 602, 52, 18, 34, …
## $ ge65_suppress_flag            <chr> "#", NA, NA, NA, NA, NA, NA, NA, N…
## $ total_30_day_fill_count_ge65  <dbl> NA, 37.0, 51.0, 273.0, 608.3, 52.0…
## $ total_day_supply_ge65         <int> NA, 1046, 1490, 8144, 17959, 1520,…
## $ total_drug_cost_ge65          <dbl> NA, 9706.65, 3889.49, 49675.50, 17…
## $ year                          <int> 2013, 2013, 2013, 2013, 2013, 2013…

Create a summary, showing the number of opioid prescriptions written by each doctor, the total cost of the opioids prescribed, and the cost per claim

# Create a summary, showing the number of opioid prescriptions written by each provider, the total cost of the opioids prescribed, and the cost per claim
provider_summary <- ca_opioids %>% 
  group_by(npi,
           nppes_provider_last_org_name,
           nppes_provider_first_name,
           nppes_provider_city,
           specialty_description) %>%
  summarize(prescriptions = sum(total_claim_count),
            cost = sum(total_drug_cost)) %>%
  mutate(cost_per_prescription = cost/prescriptions) %>%
  arrange(desc(prescriptions))

datatable(provider_summary, extensions = "Responsive")

Remember, when summarizing data, any NA values will cause the summary to be calculated as NA, unless you specify that nulls should be removed. If any of the cost or prescription values had been missing, for example, the sum functions would have to be written in this form: sum(total_claim_count, na.rm = TRUE).

Draw some charts from this summary data

While running data analyses, it’s often useful to create charts to look at summaries of the data. Particularly useful are histograms, to examine a distribution, and scatterplots, to see how one variable varies with another.

To make simple charts, we will use ggplot2, which we explored in more depth in the previous quarter.

ggplot2 and the grammar of graphics

As a reminder, the “gg” in ggplot2 stands for “grammar of graphics,” an approach to drawing charts devised by the statistician Leland Wilkinson. Rather than thinking in terms of finished charts like a scatter plot or a column chart, it starts by defining the coordinate system (usually the X and Y axes), maps data onto those coordinates, and then adds layers such as points, bars and so on. This is the logic behind ggplot2 code.

A reminder of ggplot2 basics:

  • ggplot This is the master function that creates a ggplot2 chart.
  • aes This function, named for “aesthetic mapping,” is used whenever data values are mapped onto a chart. So it is used when you define which variables are plotted onto the X and Y axes, and also if you want to change the size or color of parts of the chart according to values for a variable.
  • geom All of the functions that add layers to a chart start with geom, followed by an underscore, for example geom_point or geom_bar. The code in the parentheses for any geom layer styles the items in that layer, and can include aes mappings of values from data.
  • theme This function modifies the appearance of elements of a plot, used, for example, to set size and font face for text, the position of a legend, and so on.
  • scale Functions that begin with scale, followed by an underscore, are used to modify the way an aes mapping of data appears on a chart. They can change the axis range, for example, or specify a color palette to be used to encode values in the data.
  • + is used each time you add a layer, a scale, a theme, or elements like axis labels and a title. After a + you can continue on the same line of code or move the next line. I usually write a new line after each +, which makes the code easier to follow.

First load ggplot2 and the scales package, which allows numbers to be displayed as currency, with thousands separated by commas, and so on. Install scales if you don’t have it already. We’ll also load plotly, to allow us to explore the data on a chart using tooltips.

library(ggplot2)
library(scales)
library(plotly)

Make a histogram of the prescriptions data

It’s always a good idea to look at the distribution of any quantitative variable.

# histogram of the costs data
ggplot(provider_summary, aes(x = prescriptions)) +
  geom_histogram()

This code first stated that the doctor_summary data would be used for the chart in the ggplot function, then specificied that values for the number of prescriptions should be mapped to the X, or horizontal axis. We didn’t need to specify a variable for the Y axis for a histogram, because that is automatically a count of the number of records (here the number of doctors) in each bin.

The following code customizes the width of the bins, to 50 prescriptions, changes the X axis range between 0 and 3,000, and uses labels from the scales package to plot numbers with comma separators for thousands. As we learned last quarter, theme_minimal is one of the built-in themes in ggplot2, which gives a basic chart with grid lines on a white background.

ggplot(provider_summary, aes(x = prescriptions)) +
  geom_histogram(binwidth = 50) +
  theme_minimal() +
  scale_x_continuous(limits = c(0,3000),
                     labels = comma) +
  scale_y_continuous(labels = comma)

This is a very skewed distribution. Viewing this, we would conclude that the mean would not be a good summary statistic for this data; if we want to compare any doctor’s prescriptions of opioids to that for a “typical” doctor, it would be better to compare to the median number of prescriptions. So let’s calculate that number:

median(provider_summary$prescriptions)
## [1] 101

Make a scatterplot of prescriptions and costs data

Scatterplots are good for seeing how one variable relates to one another, and also for spotting outliers in data. Here we would expect the total cost of the drugs prescribed to increase with the total number of prescriptions. A scatterplot allows us to confirm whether that is the case, and allows us to spotinteresting outliers.

This is a good case for making a chart interactive, so we can hover over each point to find out who any provider is. (However, on this web page I have only included an image of the static chart.)

#### Make a scatterplot of prescriptions and costs data
scatterplot <- ggplot(provider_summary, aes(x = prescriptions, 
                                            y = cost,
                                            text = paste0("<b>Name: </b>",nppes_provider_first_name," ",nppes_provider_last_org_name,"<br>",
                                                          "<b>Location: </b>",nppes_provider_city,"<br>",
                                                          "<b>Specialty: </b>",specialty_description,"<br>",
                                                          "<b>NPI: </b>",npi))) +
  geom_point(alpha = 0.3) +
  theme_minimal() +
  scale_x_continuous(labels = comma) +
  scale_y_continuous(labels = dollar)

plot(scatterplot)

# make interactive version with tooltip
ggplotly(scatterplot, tooltip = "text") %>%
  config(displayModeBar = FALSE)

In this code, we mapped the number of prescriptions to the X axis and their total cost to the Y or vertical axis. We then used geom_point to add the points, and made them partially transparent using alpha = 0.3 (0 would be completely transparent; 1 would be completely opaque). labels = dollar formatted the Y axis so the numbers were shown in dollars.

Remember the words of caution: There may be some story leads here, but we would have to find out more about each doctor’s practise and patients to determine whether outliers in terms of prescriptions and costs are justified.

Join data to find doctors who prescribed opioids who have also been subject to disciplinary actions.

We will join the provider_summary data to the disciplinary actions data to see which of the doctors prescribing opioids have been in trouble with the Medical Board of California. The unique identifier in the prescriber data is their National Provider Identifier, whereas the unique identifier in the disciplinary data are the doctors’ California medical license numbers. So we also need to load the file npi_license.csv, which relates these two identifiers to one another.

# load data
npi_license <- read_csv("npi_license.csv")

First we will join these two files to one another:

# join those two data frames
ca_discipline_npi <- left_join(ca_discipline, npi_license)

By default, dplyr will join by any variables with matching names, but you can also specify the variables on which to join. So this will achieve the same result:

# join those two data frames
ca_discipline_npi <- left_join(ca_discipline, npi_license, by = "license")

datatable(ca_discipline_npi, extensions = "Responsive")


The data returned reveals some of the limitations of the npi_license data. Some of the doctors have not been joined to an npi code, while others appear to have more than one npi code (this is why there are 9,680 rows in the joined data, and only 7,651 in the ca_discipline data).

Having made this join, we can now join to the provider_summary data:

# join disciplinary action data to the opioid prescription data
provider_summary_actions <- inner_join(provider_summary, ca_discipline_npi, by = "npi") %>%
  arrange(desc(prescriptions))

datatable(provider_summary_actions,  extensions = "Responsive")


Individual doctors may appear multiple times in the data if they have more than one disciplinary action on their file, because a join will be made to each individual action.

We can search for the disciplinary records and look at public documents at this site.

The second doctor on our list, Naga Thota, surrendered his California medical license on March 22, 2017. This public document reveals that he was charged with prescribing excessive quantities of opioids to an addicted patient with whom he also engaged in a sexual relationship. In addition to disciplinary action from the Medical Board of California, he faced a criminal investigation by the federal Drug Enforcement Adminstration. In March 2017 sentenced to two years, he was sentenced to two years in a federal prison.

The prescriptions data reveal that, prior to his conviction and the loss of his medical license, Thota was also one of the largest prescribers of opioids in the state of California.

There are likely to be other story leads from this join, which you could follow up through public records and other reporting.

We’ve already seen that the npi_license data has some limitations. Another (less reliable) way to join the data would be by using the doctors’ first names, last names, and cities. First, however, we need to convert those variables in the ca_discipline_npi data to upper case, so they match the format of the ca_opioids data, using the function toupper.

# change case of variables to be used in the join
ca_discipline_npi <- ca_discipline_npi %>%
  mutate(last_name = toupper(last_name),
         first_name = toupper(first_name),
         city = toupper(city))

# join disciplinary action data to the opioid prescription data
provider_summary_actions_2 <- inner_join(provider_summary, ca_discipline_npi, by = c("nppes_provider_last_org_name" = "last_name", 
                                                                                 "nppes_provider_first_name" = "first_name",
                                                                                 "nppes_provider_city" = "city")) %>%
  arrange(desc(prescriptions))

Notice how the code is written to define the variables to be used for a custom join.

Now we can use an anti_join to see if this join picked up any doctors who were missing from the first.

# join disciplinary action data to the opioid prescription data
provider_summary_actions_extra <- anti_join(provider_summary_actions_2, provider_summary_actions)

datatable(provider_summary_actions_extra, extensions = "Responsive")


Whenever making an join in this way, it crucial to verify through further reporting that the individuals matched are the same person. For example, there are several doctors called Peter Lee in Los Angeles. The one with the disciplinary record is Peter Geon Lee, a plastic surgeon. This is not Peter Chong Lee, the physical medicine and rehabilitation specialist, with the NPI of 1063508786, listed in the ca_opioids data. There may well be other false matches in the data, joined in this way.

Some more practice

  • Using the ca_discipline data, count the number of revoked licenses in each city in 2017 only, and sort so that the cities with the most revoked licenses appear first.

  • Count the number of disciplinary actions of any type by city and by year, not including the incomplete data for 2008.

  • Find all of the unique drugs, by generic_name, in the ca_opioids data.

  • Make a summary by provider of total prescriptions, total costs pf those prescriptions, and cost per prescriptions, for fentanyl alone. (Look carefully at your list of generic drugs before doing this!)

Further reading

Introduction to dplyr

RStudio Data Wrangling Cheet Sheet Also introduces the tidyr package, which can manage wide-to-long transformations, and text-to-columns splits, among other data manipulations.

Stack Overflow For any work involving code, this question-and-answer site is a great resource for when you get stuck, to see how others have solved similar problems. Search the site, or browse R questions

R Graphics Cookbook by Winston Chang (Chang also has a helpful website with much of the same information, available for free.)

RStudio ggplot2 Cheat Sheet

ggplot2 documentation