Coursera R-Programming: Week 2 Problems

The Tidy Approach

Over the past several weeks I have been helping students, career professionals, and people of other backgrounds learn R. During this time one this has become apparent, people are teaching the old paradigm of R and avoiding the tidyverse all together.

I recently had a student reach out to me in need of help with the first programming assignment from the Coursera R-Programming course (part of the Johns Hopkins Data Science Specialization). This particular student was struggling with combining the her new knowledge of R data types, conditional statements, looping, control statements, scoping, and functions to solve the assignment problem set. I provided her with a walk through of each question in base R, the style of the course. I couldn’t help but empathize with her as I too learned the long way first. However I thought that she shouldn’t be learning the hard way first (see David Robinson’s blog post, “Don’t teach students the hard way first”), she should be learning the effective way.

In my written response to her, I gave her solutions to her problems in base R and using the tidyverse. Here, I will go over the problems and adress them from a tidy perspective. This will not serve as a full introduction to the tidyverse. For an introduction and a reason why the tidyverse is superior to base R, I leave you with Stat 545: Introduction to dplyr

The assignment utilizes a directory of data called specdata which can be downloaded here, and describes it:

The zip file contains 332 comma-separated-value (CSV) files containing pollution monitoring data for fine particulate matter (PM) air pollution at 332 locations in the United States. Each file contains data from a single monitor and the ID number for each monitor is contained in the file name. For example, data for monitor 200 is contained in the file “200.csv”. Each file contains three variables:

  • Date: the date of the observation in YYYY-MM-DD format (year-month-day)
  • sulfate: the level of sulfate PM in the air on that date (measured in micrograms per cubic meter)
  • nitrate: the level of nitrate PM in the air on that date (measured in micrograms per cubic meter)

For this programming assignment you will need to unzip this file and create the directory ‘specdata’. Once you have unzipped the zip file, do not make any modifications to the files in the ‘specdata’ directory. In each file you’ll notice that there are many days where either sulfate or nitrate (or both) are missing (coded as NA). This is common with air pollution monitoring data in the United States.


Part I

Problem 1:

Write a function named ‘pollutantmean’ that calculates the mean of a pollutant (sulfate or nitrate) across a specified list of monitors. The function ‘pollutantmean’ takes three arguments: ‘directory’, ‘pollutant’, and ‘id’. Given a vector monitor ID numbers, ‘pollutantmean’ reads that monitors’ particulate matter data from the directory specified in the ‘directory’ argument and returns the mean of the pollutant across all of the monitors, ignoring any missing values coded as NA. A prototype of the function is as follows

 pollutantmean <- function(directory, pollutant, id = 1:332){
    ## 'directory' is a character vector of length 1 indicating
    ## the location of the CSV files
    
    ## 'pollutant' is a character vector of length 1 indicating
    ## the name of the pollutant for which we will calculate the 
    ## mean; either "sulfate" or "nitrate"

    ## 'id' is an integer vector indicating the monitor ID numbers
    ## to be used

    ## Return the mean of the pollutant across all monitors list
    ## in the 'id' vector (ignoring NA values)
    ## NOTE: Do not round the result!
}

Before we tackle the function, I believe the best approach is to first solve the problem in a regular script. This problem has four clear steps:

  1. Identify files in the directory
  2. Subset files based on provided ID
  3. Read the files
  4. Calculate and return the mean on the desired column

This problem gives us a directory of files from which we need to read in the data based on the provided IDs. For the sake of this walk through we will randomly sample 10 values within the range designated in the problem statement (332).

We will first generate random IDs, then identify all of the files within the specified directory and obtain their file paths using the list.files() function. After this we will subset our file list based on the IDs, then iterate over our file list and read in each file as a csv using purrr:map_df() combined with readr::read_csv(). Fortunately map_df() returns a nice and pretty data frame which lets us avoid having to explicitly bind each unique data frame.

Identify Files

Here we create 10 random IDs and store them in the ids variable. Next we use list.files() to look within the specdata folder that we downloaded above. Everyone’s path will most likely be different. Be sure to obtain the correct file path—help for Mac.

Next we identify the files we need based on the sampled ids and store the subset in the files_filtered variable. We use the values of the ids to locate the file paths positionally. For example, ID number 1 is the first file, number 10 is the tenth, etc.

# Load our handy dandy functions
library(tidyverse)

# 10 random IDs in ID range
ids <- sample(1:332, 10)

# Identify all files within the directory
files <- list.files("../../data/specdata", full.names = TRUE)

# Subset the data
files_filtered <- files[ids]

# View the files to verify
paste(ids, files_filtered)
##  [1] "302 ../../data/specdata/302.csv" "140 ../../data/specdata/140.csv"
##  [3] "326 ../../data/specdata/326.csv" "324 ../../data/specdata/324.csv"
##  [5] "59 ../../data/specdata/059.csv"  "104 ../../data/specdata/104.csv"
##  [7] "181 ../../data/specdata/181.csv" "147 ../../data/specdata/147.csv"
##  [9] "78 ../../data/specdata/078.csv"  "232 ../../data/specdata/232.csv"

Reading the Files

Now that we have identified the files that we are going to read in, we can use purrr:map_df() to apply the readr::read_csv() function to each value of files_filtered and return a data frame (hence the _df() suffix). We supply additional arguments to read_csv() to ensure that every column is read in properly.

# Read in the subset of the data. Notice the brackets after files[]
specdata <- map_df(files_filtered, read_csv, 
                   col_types = list(
                     col_date(),
                     col_double(),
                     col_double(),
                     col_integer()
                   ))

glimpse(specdata)
## Observations: 24,470
## Variables: 4
## $ Date    <date> 2001-01-01, 2001-01-02, 2001-01-03, 2001-01-04, 2001-...
## $ sulfate <dbl> NA, NA, NA, NA, NA, NA, 4.78, NA, NA, NA, NA, NA, NA, ...
## $ nitrate <dbl> NA, NA, NA, NA, NA, NA, 6.12, NA, NA, NA, NA, NA, NA, ...
## $ ID      <int> 302, 302, 302, 302, 302, 302, 302, 302, 302, 302, 302,...

Next, we get to utilize some dplyr magic. Here we take the specdata object we created from reading in our files, deselct the Date column, then utilize summarise_if() to apply the mean() function to our data. summarise_if() requires that we provide a logical statement as the first argument. If (hence the _if() suffix) the logical statement evaluates to TRUE on a column then it will apply a list of functions to those columns where the statement evaluated to TRUE. We can also specify additional arguments to the functions. Here we specify na.rm = TRUE for handling missing values.

In this case, we are checking to see if our columns are of the data type double using the is.double() function. If you’re wondering why we didn’t use is.numeric(), it’s because the ID column is an integer which is considered numeric.

If we wanted to take the underlying vector of one of the columns, we can also, use dplyr::pull(col_name). This will be helpful later when we want to obtain the mean of just one column.

# Obtain mean nitrate
specdata %>% 
  select(-Date) %>% 
  summarise_if(is.double, mean, na.rm = TRUE) %>% 
  # Pull just the sulfate column
  pull(sulfate)
## [1] 3.539072
specdata %>% 
  select(-Date) %>% 
  summarise_if(is.double, mean, na.rm = TRUE) %>% 
  # Pull just the nitrate column
  pull(nitrate)
## [1] 1.604677

Now that we have all of the tools, we can put this together into a single function, which I will call pollutant_mean() to somewhat adhere—functions should take the name of verbs—to the tidyverse style guide.

Here we have three arguments:

  • directory: Where to look for the files
  • pollutant: Which pollutant (nitrate or sulfate) to evaluate
    • This needs to be a character value unless we want to get into tidyeval, which frankly I will leave to the professionals. But I will provide an alternative solution at the end that doesn’t require quoted pollutant names.
  • id: Which monitoring stations we should look at

Within the function we take everything we did in the above steps but generalize it to a function. We identify the files in the directory provided (specdata), subset the files positionally based on the provided id vector, and then iterate over the file names and read them in with map_df() and read_csv().

Next we take our data and calculate the mean on both sulfate and nitrate columns. We then pull() the specified column from the pollutant argument and then return() that value.

pollutant_mean <- function(directory, pollutant, id = 1:332) {
  files <- list.files(directory, full.names = TRUE)
  files_filtered <- files[id]
  specdata <- map_df(files_filtered, read_csv, 
                     col_types = list(
                       col_date(),
                       col_double(),
                       col_double(),
                       col_integer()
                     ))
  
  specdata %>% 
    select(-Date) %>% 
    summarise_if(is.double, mean, na.rm = TRUE) %>% 
    pull(pollutant) %>% 
    return()

}

Here we can test out the function with both types of pollutants and different id values.

pollutant_mean(directory = "../../data/specdata", pollutant = "sulfate", id = sample(1:332, 20))
## [1] 3.300254
#pollutant_mean("../../data/specdata", "nitrate", 2)

Part II:

Let us continue to the second problem in the problem set.

Problem 2:

Write a function that reads a directory full of files and reports the number of completely observed cases in each data file. The function should return a data frame where the first column is the name of the file and the second column is the number of complete cases.

The assignment provides an example function format, but I think it to be a bit misleading. So I will go about this in the way I think is best. We will work on creating a function called complete_spec_cases() which will take only two arguments, directory, and id. directory and id will be used in the the same way as the previous problem.

For this problem our goal is to identify how many complete cases there are by provided ID. This should be exceptionally simple. We will have to identify our files, subset them, and read them in the same way as before. Next we can identify complete cases by piping our specdata object to na.omit() which will remove any row with a missing value. Next, we have to group by the ID column and pipe our grouped data frame to count() which will count how many observations there are by group. We will then return this data frame to the user.

complete_spec_cases <- function(directory, id = 1:332) {

  files <- list.files(directory, full.names = TRUE)
  
  specdata <- map_df(files[id], read_csv,
                     col_types = list(
                       col_date(),
                       col_double(),
                       col_double(),
                       col_integer()
                     ))
  
  complete_specdata <- specdata %>% 
    na.omit() %>% 
    group_by(ID) %>% 
    summarise(nobs = n())
  
  return(complete_specdata)
}
complete_spec_cases("../../data/specdata", id = sample(1:332, 20))
## # A tibble: 19 x 2
##       ID  nobs
##    <int> <int>
##  1    11   443
##  2    44   283
##  3    48    62
##  4    77   345
##  5   116   806
##  6   127   428
##  7   171   614
##  8   182   465
##  9   184   816
## 10   198   858
## 11   209   151
## 12   249   230
## 13   272   253
## 14   290    91
## 15   302   937
## 16   306   203
## 17   314   888
## 18   326   215
## 19   327   162

Part III:

This final problem is probably the most complicated, but with the method we just used above and with a bit more help from the purrr and dplyr packages, we can do this no problem.

Problem 3:

Write a function that takes a directory of data files and a threshold for complete cases and calculates the correlation between sulfate and nitrate for monitor locations where the number of completely observed cases (on all variables) is greater than the threshold. The function should return a vector of correlations for the monitors that meet the threshold requirement. If no monitors meet the threshold requirement, then the function should return a numeric vector of length 0. A prototype of this function follows:

Correct <- function(directory, threshold = 0){
    ## 'directory' is a character vector of length 1 indicating
    ## the location of the CSV files

    ## 'threshold' is a numeric vector of length 1 indicating the
    ## number of completely observed observations (on all
    ## variables) required to compute the correlation between
    ## nitrate and sulfate; the default is 0

    ## Return a numeric vector of correlations
    ## NOTE: Do not round the result!
}

Let keep this simple. The above statement essentially is asking that we find the correlation between nitrate and sulfate for each monitoring station (ID). But there is a catch! Each ID must meet a specified threshold of complete cases, and if none of the monitors meet the requirement the function must return a numeric(0).

The way we will structure this function will be to first read in the data—as we have done twice now, except this time there will be no subsetting of IDs. Then we need to identify the number of complete cases by ID—as we did in problem 2—and identify the stations that meet the threshold requirement. At this point we will use an if statement to check if we have at least 1 monitoring station that meets our threshold, if we do not, we return the numeric(0)—there is most likely a more tidy way to do this, but I am not aware. If we have at least 1 monitoring station that meets the specified threshold we will use an inner_join() to make sure that specdata contains only those IDs that meet the requirement.

For the sake of this example, we will continue to use the specdata object we created in previous examples, and we will set our threshold to 100. Once we identify the stations with the proper number of counts (> 100), we will store that data frame in an object called id_counts

id_counts <- specdata %>% 
    na.omit() %>% 
    group_by(ID) %>% 
    count() %>% 
    filter(n > 100) 
  
  if (nrow(id_counts) < 1) {
    return(numeric(0))
  } else {
    print("All is well.")
  }
## [1] "All is well."
  specdata <- id_counts %>% 
    inner_join(specdata, by = "ID") %>%
    na.omit() 
  
  specdata
## # A tibble: 4,138 x 5
## # Groups:   ID [9]
##       ID     n Date       sulfate nitrate
##    <int> <int> <date>       <dbl>   <dbl>
##  1    59   445 2004-09-09    2.40   0.383
##  2    59   445 2004-09-12    1.45   0.383
##  3    59   445 2004-09-18    4.75   0.281
##  4    59   445 2004-09-24    9.47   0.623
##  5    59   445 2004-09-30    6.67   0.381
##  6    59   445 2004-10-03    2.90   0.326
##  7    59   445 2004-10-06    4.20   0.351
##  8    59   445 2004-10-09    2.41   0.539
##  9    59   445 2004-10-12    1.61   0.491
## 10    59   445 2004-10-15    1.95   0.306
## # ... with 4,128 more rows

This is where it gets kind of funky. Once we have filtered down our data set, we need to calculate the correlations for each ID. The way that we do this is by nesting our data frame on the ID column. Calling nest(-ID) allows us to, for each value of ID, create a data frame for just those rows where the ID is the same. We will then have a new list type column where each value is actually a data frame. Let’s check out what this looks like before we hop into the function.

specdata %>% 
  nest(-ID)
## # A tibble: 9 x 2
##      ID data              
##   <int> <list>            
## 1    59 <tibble [445 × 4]>
## 2    78 <tibble [275 × 4]>
## 3   104 <tibble [385 × 4]>
## 4   140 <tibble [407 × 4]>
## 5   147 <tibble [302 × 4]>
## 6   181 <tibble [286 × 4]>
## 7   232 <tibble [886 × 4]>
## 8   302 <tibble [937 × 4]>
## 9   326 <tibble [215 × 4]>

Now that we know how to nest our data, we need to calculate the correlations for each row (ID value). We will do this by combining mutate() and map(). Here .x references the data that is within each nested tibble. To learn more about purrr I recommend the chapter on iteration from R For Data Science.

After we have done our calculations we undo our nesting using unnest() on the new column we created, and deselect the data column.

specdata %>% 
  na.omit() %>% 
    nest(-ID) %>% 
    mutate(correlation = map(data, ~cor(.x$sulfate, .x$nitrate))) %>% 
  unnest(correlation) %>% 
  select(-data)
## # A tibble: 9 x 2
##      ID correlation
##   <int>       <dbl>
## 1    59     0.0911 
## 2    78     0.00564
## 3   104    -0.129  
## 4   140     0.248  
## 5   147    -0.00120
## 6   181     0.0872 
## 7   232    -0.0745 
## 8   302     0.192  
## 9   326     0.140

We can now place these above examples within a new function called pollutant_cor().

pollutant_cor <- function(directory, threshold = 0) {
  files <- list.files(directory, full.names = TRUE)
  specdata <- map_df(files, read_csv, 
                     col_types = list(
                       col_date(),
                       col_double(),
                       col_double(),
                       col_integer()
                     )) %>% na.omit()
  
  id_counts <- specdata %>% 
    group_by(ID) %>% 
    count() %>% 
    filter(n > threshold) 
  
  if (nrow(id_counts) < 1) {
    return(numeric(0))
  }
  
  correlations <- id_counts %>% 
    inner_join(specdata, by = "ID") %>% 
    nest(-ID) %>% 
    mutate(correlation = map(data, ~cor(.x$sulfate, .x$nitrate))) %>% 
    unnest(correlation)
  
  return(correlations)

  
}

We can now test our function against two different thresholds to see how it reacts.

pollutant_cor("../../data/specdata", 100)
## # A tibble: 254 x 3
##       ID data                 correlation
##    <int> <list>                     <dbl>
##  1     1 <tibble [117 × 4]>       -0.223 
##  2     2 <tibble [1,041 × 4]>     -0.0190
##  3     3 <tibble [243 × 4]>       -0.141 
##  4     4 <tibble [474 × 4]>       -0.0439
##  5     5 <tibble [402 × 4]>       -0.0682
##  6     6 <tibble [228 × 4]>       -0.124 
##  7     7 <tibble [442 × 4]>       -0.0759
##  8     8 <tibble [192 × 4]>       -0.160 
##  9     9 <tibble [275 × 4]>       -0.0868
## 10    10 <tibble [148 × 4]>        0.161 
## # ... with 244 more rows

If we set the threshold to 100,000, we should expect a numeric(0).

pollutant_cor("../../data/specdata", 100000)
## numeric(0)

It all works!

Avatar
Josiah Parry
Social Data Scientist

Related