R Security Concerns and 2019 CRAN downloads

R
Author

Josiah Parry

Published

February 4, 2020

The number of downloads that a package has can tell us a number of things. For example, how much other packages rely on it. A great case of this is rlang. Not many of use rlang, know what it does, or even bother to dive into it. But in reality it is the backbone of so many packages that we know and love and use every day.

The other thing that we can infer from the download numbers is how trusted a package may be. Packages that have been downloaded time and time and again have likely been scoped out thoroughly for any bugs. This can be rather comforting for security conscious groups.

With a little rvest-web-scraping-magic we can obtain the name of every package published to CRAN.

library(rvest)
library(tidyverse)
url <- "https://cran.r-project.org/web/packages/available_packages_by_name.html"
cran_packages <- html_session(url) %>% 
    html_nodes("table a") %>% 
    html_text()

RStudio keeps daily logs of packages downloaded from their CRAN mirror. The package cranlogs makes the data available via and API and an R package. Below is a lightweight function which sums the total number of downloads for the entire year. furrr was used to speed up the computation a bit. This takes ~40 minutes to run, as such use the exported data in data/cran-2019-total-downloads.csv.

get_year_downloads <- function(pkg) {
  cranlogs::cran_downloads(pkg, 
                           from = "2019-01-01",
                           to = "2019-12-31") %>% 
    group_by(package) %>% 
    summarise(total_downloads = sum(count))
}
total_cran_downloads <- furrr::future_map_dfr(
  cran_packages, 
  .f = possibly(get_year_downloads, otherwise = tibble())
  )
total_cran_downloads <- read_csv("data/cran-2019-total-downloads.csv")
glimpse(total_cran_downloads)

While this is helpful, these packges also have dependencies, and those dependencies have dependencies. The R core team have built out the tools package which contains, yes, wonderful tools. The function tools::package_dependencies() provides us with the dependencies.

The below code identifies the top 500 downloaded packages and their dependencies.

top_500_cran <- top_n(total_cran_downloads, 500, total_downloads)
#pull the package dependencies
pkg_deps <- top_500_cran %>%
  pull(package) %>%
  tools::package_dependencies(recursive = TRUE)
head(pkg_deps)

Package Checks

Each package on CRAN goes through a rigorous checking process. The r cmd check is ran on each package for twelve different flavors from a combination of Linux and Windows. If you trust the checks that the R Core team do, I wouldn’t reinvent the wheel.

The data are not provided directly from CRAN though an individual has provided these data via an API. I recommend using the API as a check against packages on ingest. I’d also do this process for every time you’re syncing. Again, this doesn’t mean that there are no vulnerabilities. But if there are functions that will literally break the machine, then the checks in general shouldn’t work.

The biggest risk is really in the development and publication of applications. The greatest risk you are likely to face are going to be internal or accidental. For example using base R and Shiny, a developer can make an app unintentionally malicious—i.e. permitting system calls or creating a SQL injection. Though this would be rather difficult to build into an app, it is possible. The process here would be to institute a peer review process for the apps developed. Also, you’re going to want to sandbox the applications—which Connect does and will improve with launcher in the future.

Instituting Checks

We can use the cchecks packge to interact with R-Hub’s CRAN check API. They have done a wonderful job aggregating package check data. The data it returns, however, is in a rather deeply nested list. Below is a function defintion which can tidy up some of the important information produced from the API query.

#devtools::install_github("ropenscilabs/cchecks")
library(cchecks)
library(tidyverse)
tidy_checks <- function(checks) {
  
  check_res <- map(checks, pluck, "data", "checks")
  check_pkg <- map(checks, pluck, "data", "package")
  check_deets <- map(checks, pluck, "data", "check_details")
  check_summ <- map(checks, pluck, "data", "summary")
  
  tibble(pkg = unlist(check_pkg),
         check_results = check_res,
         check_details = check_deets,
         check_summary = check_summ) %>% 
    unnest_wider(check_summary)
  
}
# query the API for packages
checks  <- cch_pkgs(c("spotifyr", "genius"))
# tidy up the checks
clean_checks <- tidy_checks(checks)
# get check results
clean_checks %>% 
  unnest(check_results)
# get check details
clean_checks %>% 
  unnest_wider(check_details)

You can pull these checks for the top 500 packages and their dependencies in a rather straightforward manner now. You can iterate through these all. Note that this is an API and you may see some lag time. So go make some tea.

top_checks <- cch_pkgs(head(top_500_cran$package))
tidy_checks(top_checks)

Resources