The Tidy Approach

R
tutorial
Author

Josiah Parry

Published

April 14, 2018

Over the past several weeks I have been helping students, career professionals, and people of other backgrounds learn R. During this time one this has become apparent, people are teaching the old paradigm of R and avoiding the tidyverse all together.

I recently had a student reach out to me in need of help with the first programming assignment from the Coursera R-Programming course (part of the Johns Hopkins Data Science Specialization). This particular student was struggling with combining the her new knowledge of R data types, conditional statements, looping, control statements, scoping, and functions to solve the assignment problem set. I provided her with a walk through of each question in base R, the style of the course. I couldn’t help but empathize with her as I too learned the long way first. However I thought that she shouldn’t be learning the hard way first (see David Robinson’s blog post, “Don’t teach students the hard way first”), she should be learning the effective way.

In my written response to her, I gave her solutions to her problems in base R and using the tidyverse. Here, I will go over the problems and adress them from a tidy perspective. This will not serve as a full introduction to the tidyverse. For an introduction and a reason why the tidyverse is superior to base R, I leave you with Stat 545: Introduction to dplyr

The assignment utilizes a directory of data called `specdata` which can be downloaded here, and describes it:

The zip file contains 332 comma-separated-value (CSV) files containing pollution monitoring data for fine particulate matter (PM) air pollution at 332 locations in the United States. Each file contains data from a single monitor and the ID number for each monitor is contained in the file name. For example, data for monitor 200 is contained in the file “200.csv”. Each file contains three variables:

• Date: the date of the observation in `YYYY-MM-DD` format (year-month-day)
• sulfate: the level of sulfate PM in the air on that date (measured in micrograms per cubic meter)
• nitrate: the level of nitrate PM in the air on that date (measured in micrograms per cubic meter)

For this programming assignment you will need to unzip this file and create the directory ‘specdata’. Once you have unzipped the zip file, do not make any modifications to the files in the ‘specdata’ directory. In each file you’ll notice that there are many days where either sulfate or nitrate (or both) are missing (coded as NA). This is common with air pollution monitoring data in the United States.

# Part I

Problem 1:

Write a function named ‘pollutantmean’ that calculates the mean of a pollutant (sulfate or nitrate) across a specified list of monitors. The function ‘pollutantmean’ takes three arguments: ‘directory’, ‘pollutant’, and ‘id’. Given a vector monitor ID numbers, ‘pollutantmean’ reads that monitors’ particulate matter data from the directory specified in the ‘directory’ argument and returns the mean of the pollutant across all of the monitors, ignoring any missing values coded as NA. A prototype of the function is as follows

`````` pollutantmean <- function(directory, pollutant, id = 1:332){
## 'directory' is a character vector of length 1 indicating
## the location of the CSV files

## 'pollutant' is a character vector of length 1 indicating
## the name of the pollutant for which we will calculate the
## mean; either "sulfate" or "nitrate"

## 'id' is an integer vector indicating the monitor ID numbers
## to be used

## Return the mean of the pollutant across all monitors list
## in the 'id' vector (ignoring NA values)
## NOTE: Do not round the result!
}``````

Before we tackle the function, I believe the best approach is to first solve the problem in a regular script. This problem has four clear steps:

1. Identify files in the directory
2. Subset files based on provided ID
4. Calculate and return the mean on the desired column

This problem gives us a directory of files from which we need to read in the data based on the provided IDs. For the sake of this walk through we will randomly sample 10 values within the range designated in the problem statement (332).

We will first generate random IDs, then identify all of the files within the specified directory and obtain their file paths using the `list.files()` function. After this we will subset our file list based on the IDs, then iterate over our file list and read in each file as a csv using `purrr:map_df()` combined with `readr::read_csv()`. Fortunately `map_df()` returns a nice and pretty data frame which lets us avoid having to explicitly bind each unique data frame.

#### Identify Files

Here we create 10 random IDs and store them in the `ids` variable. Next we use `list.files()` to look within the `specdata` folder that we downloaded above. Everyone’s path will most likely be different. Be sure to obtain the correct file path—help for Mac.

Next we identify the files we need based on the sampled `ids` and store the subset in the `files_filtered` variable. We use the values of the `ids` to locate the file paths positionally. For example, ID number 1 is the first file, number 10 is the tenth, etc.

``````# Load our handy dandy functions
library(tidyverse)

# 10 random IDs in ID range
ids <- sample(1:332, 10)

# Identify all files within the directory
files <- list.files("../../data/specdata", full.names = TRUE)

# Subset the data
files_filtered <- files[ids]

# View the files to verify
paste(ids, files_filtered)``````

Now that we have identified the files that we are going to read in, we can use `purrr:map_df()` to apply the `readr::read_csv()` function to each value of `files_filtered` and return a data frame (hence the `_df()` suffix). We supply additional arguments to `read_csv()` to ensure that every column is read in properly.

``````# Read in the subset of the data. Notice the brackets after files[]
col_types = list(
col_date(),
col_double(),
col_double(),
col_integer()
))

glimpse(specdata)``````

Next, we get to utilize some `dplyr` magic. Here we take the `specdata` object we created from reading in our files, deselct the `Date` column, then utilize `summarise_if()` to apply the `mean()` function to our data. `summarise_if()` requires that we provide a logical statement as the first argument. If (hence the `_if()` suffix) the logical statement evaluates to `TRUE` on a column then it will apply a list of functions to those columns where the statement evaluated to `TRUE`. We can also specify additional arguments to the functions. Here we specify `na.rm = TRUE` for handling missing values.

In this case, we are checking to see if our columns are of the data type `double` using the `is.double()` function. If you’re wondering why we didn’t use `is.numeric()`, it’s because the `ID` column is an integer which is considered numeric.

If we wanted to take the underlying vector of one of the columns, we can also, use `dplyr::pull(col_name)`. This will be helpful later when we want to obtain the mean of just one column.

``````# Obtain mean nitrate
specdata %>%
select(-Date) %>%
summarise_if(is.double, mean, na.rm = TRUE) %>%
# Pull just the sulfate column
pull(sulfate)

specdata %>%
select(-Date) %>%
summarise_if(is.double, mean, na.rm = TRUE) %>%
# Pull just the nitrate column
pull(nitrate)
``````

Now that we have all of the tools, we can put this together into a single function, which I will call `pollutant_mean()` to somewhat adhere—functions should take the name of verbs—to the tidyverse style guide.

Here we have three arguments:

• `directory`: Where to look for the files
• `pollutant`: Which pollutant (nitrate or sulfate) to evaluate
• This needs to be a character value unless we want to get into tidyeval, which frankly I will leave to the professionals. But I will provide an alternative solution at the end that doesn’t require quoted `pollutant` names.
• `id`: Which monitoring stations we should look at

Within the function we take everything we did in the above steps but generalize it to a function. We identify the files in the directory provided (`specdata`), subset the files positionally based on the provided `id` vector, and then iterate over the file names and read them in with `map_df()` and `read_csv()`.

Next we take our data and calculate the mean on both `sulfate` and `nitrate` columns. We then `pull()` the specified column from the `pollutant` argument and then `return()` that value.

``````
pollutant_mean <- function(directory, pollutant, id = 1:332) {
files <- list.files(directory, full.names = TRUE)
files_filtered <- files[id]
col_types = list(
col_date(),
col_double(),
col_double(),
col_integer()
))

specdata %>%
select(-Date) %>%
summarise_if(is.double, mean, na.rm = TRUE) %>%
pull(pollutant) %>%
return()

}``````

Here we can test out the function with both types of pollutants and different `id` values.

``````pollutant_mean(directory = "../../data/specdata", pollutant = "sulfate", id = sample(1:332, 20))
#pollutant_mean("../../data/specdata", "nitrate", 2)``````

# Part II:

Let us continue to the second problem in the problem set.

Problem 2:

Write a function that reads a directory full of files and reports the number of completely observed cases in each data file. The function should return a data frame where the first column is the name of the file and the second column is the number of complete cases.

The assignment provides an example function format, but I think it to be a bit misleading. So I will go about this in the way I think is best. We will work on creating a function called `complete_spec_cases()` which will take only two arguments, `directory`, and `id`. `directory` and `id` will be used in the the same way as the previous problem.

For this problem our goal is to identify how many complete cases there are by provided ID. This should be exceptionally simple. We will have to identify our files, subset them, and read them in the same way as before. Next we can identify complete cases by piping our `specdata` object to `na.omit()` which will remove any row with a missing value. Next, we have to group by the `ID` column and pipe our grouped data frame to `count()` which will count how many observations there are by group. We will then return this data frame to the user.

``````complete_spec_cases <- function(directory, id = 1:332) {

files <- list.files(directory, full.names = TRUE)

col_types = list(
col_date(),
col_double(),
col_double(),
col_integer()
))

complete_specdata <- specdata %>%
na.omit() %>%
group_by(ID) %>%
summarise(nobs = n())

return(complete_specdata)
}``````
``complete_spec_cases("../../data/specdata", id = sample(1:332, 20))``

# Part III:

This final problem is probably the most complicated, but with the method we just used above and with a bit more help from the `purrr` and `dplyr` packages, we can do this no problem.

Problem 3:

Write a function that takes a directory of data files and a threshold for complete cases and calculates the correlation between sulfate and nitrate for monitor locations where the number of completely observed cases (on all variables) is greater than the threshold. The function should return a vector of correlations for the monitors that meet the threshold requirement. If no monitors meet the threshold requirement, then the function should return a numeric vector of length 0. A prototype of this function follows:

``````Correct <- function(directory, threshold = 0){
## 'directory' is a character vector of length 1 indicating
## the location of the CSV files

## 'threshold' is a numeric vector of length 1 indicating the
## number of completely observed observations (on all
## variables) required to compute the correlation between
## nitrate and sulfate; the default is 0

## Return a numeric vector of correlations
## NOTE: Do not round the result!
}``````

Let keep this simple. The above statement essentially is asking that we find the correlation between `nitrate` and `sulfate` for each monitoring station (ID). But there is a catch! Each ID must meet a specified threshold of complete cases, and if none of the monitors meet the requirement the function must return a `numeric(0)`.

The way we will structure this function will be to first read in the data—as we have done twice now, except this time there will be no subsetting of IDs. Then we need to identify the number of complete cases by ID—as we did in problem 2—and identify the stations that meet the threshold requirement. At this point we will use an `if` statement to check if we have at least 1 monitoring station that meets our threshold, if we do not, we return the `numeric(0)`—there is most likely a more tidy way to do this, but I am not aware. If we have at least 1 monitoring station that meets the specified threshold we will use an `inner_join()` to make sure that `specdata` contains only those IDs that meet the requirement.

For the sake of this example, we will continue to use the `specdata` object we created in previous examples, and we will set our threshold to `100.` Once we identify the stations with the proper number of counts (`> 100`), we will store that data frame in an object called `id_counts`

``````
id_counts <- specdata %>%
na.omit() %>%
group_by(ID) %>%
count() %>%
filter(n > 100)

if (nrow(id_counts) < 1) {
return(numeric(0))
} else {
print("All is well.")
}

specdata <- id_counts %>%
inner_join(specdata, by = "ID") %>%
na.omit()

specdata``````

This is where it gets kind of funky. Once we have filtered down our data set, we need to calculate the correlations for each ID. The way that we do this is by nesting our data frame on the `ID` column. Calling `nest(-ID)` allows us to, for each value of ID, create a data frame for just those rows where the `ID` is the same. We will then have a new list type column where each value is actually a data frame. Let’s check out what this looks like before we hop into the function.

``````specdata %>%
nest(-ID)``````

Now that we know how to nest our data, we need to calculate the correlations for each row (ID value). We will do this by combining `mutate()` and `map()`. Here `.x` references the data that is within each nested tibble. To learn more about `purrr` I recommend the chapter on iteration from R For Data Science.

After we have done our calculations we undo our nesting using `unnest()` on the new column we created, and deselect the `data` column.

``````specdata %>%
na.omit() %>%
nest(-ID) %>%
mutate(correlation = map(data, ~cor(.x\$sulfate, .x\$nitrate))) %>%
unnest(correlation) %>%
select(-data)``````

We can now place these above examples within a new function called `pollutant_cor()`.

``````pollutant_cor <- function(directory, threshold = 0) {
files <- list.files(directory, full.names = TRUE)
col_types = list(
col_date(),
col_double(),
col_double(),
col_integer()
)) %>% na.omit()

id_counts <- specdata %>%
group_by(ID) %>%
count() %>%
filter(n > threshold)

if (nrow(id_counts) < 1) {
return(numeric(0))
}

correlations <- id_counts %>%
inner_join(specdata, by = "ID") %>%
nest(-ID) %>%
mutate(correlation = map(data, ~cor(.x\$sulfate, .x\$nitrate))) %>%
unnest(correlation)

return(correlations)

}``````

We can now test our function against two different thresholds to see how it reacts.

``pollutant_cor("../../data/specdata", 100)``

If we set the threshold to 100,000, we should expect a `numeric(0)`.

``pollutant_cor("../../data/specdata", 100000)``

It all works!