Excel in prod

I recently had a conversation that touched on using plumber to automate the parsing of Excel documents for administering data science assets. This brings up some very interesting points:

  1. Excel is sometimes unavoidable and we need to be okay with that.
  2. How can we incorporate Excel into production?

Note that this is no time to 💩 on Excel. It serves very real business purposes and unfortunately not everyone can learn to program 😕. Here’s a fun one for the h8rs: almost every presidential election campaign’s data program is based on the back of Google Sheets.

In this post I set out to explore if and how one can incorporate Excel into productionized code. Please see the GitHub repository for the code used here.

What does it mean to productionize code—aka put it in prod1? There is no one definition of what this mean and each and every organization will operationalize it differently.

An operationalized definition, at least in the social science perspective, is how a thing is defined so as to have a shared understanding of said thing.

Greg Wilson has defined it as “code that isn’t going to make the operations team cry” 2 (emphasis his). This is my favorite definition because it is whimsical, sardonic, honest, and acknowledges that the code will have to leave the data science inner circle.

As I understand it, the current data science discourse emphasizes the use of RESTful APIs as the best, or at least the dominant, way of productionizing code.

An API is an application programming interface. When I was first learning what APIs were my uncle told me to think of them as “machines talking to other machines.” That’s not so far off!

A RESTful API is a special kind of API that does representational state transfer. Frankly, I do not know what that really means. As I understand it, REST is actually an opinionated way of architecting APIs. RESTful APIs use HTTP requests which makes them very easy to access. RESTful APIs are key in developing micro-services and micro-services are at the core of putting code in prod3. Within the python ecosystem Flask is one of the leading libraries for making micro-services. Within the R space a package called plumber is taking on that role.

We can envisage a hypothetical scenario in which we receive Excel files via some data collection process. Once that Excel file is received it is used to kick off some other process—e.g. report writing or data processing. Often people may create Shiny applications to provide a UI for uploading and data input. This is really great when we want to develop a user-centric interface. But what about when we want to automate the process or at least make the processing available to other tools? This is when we can turn to plumber as a way to create a micro-service to handle this.

The above graphic (made with nomnoml) illustrates two different ways we can approach this. First, we will receive the Excel file. From there we may want to upload the file into a shared drive, a database, or both. Alternatively, we may not want to store the data, but rather use it immediately.

From an API development perspective, we can imagine each process as an API endpoint. An endpoint is essentially a url which says where each application interaction will happen. In this small example, we will create two endpoints: /read_excel and /upload. The first will, you guessed it, read an Excel file that is sent with a request. The second will upload said file.

Before we can approach creating the API, we need to first figure out how we can even send a file through an API. And before we can figure that out, we need to know what type of requests we can make to an API. Since the REST API will be an HTTP API, its imperative we know what type of requests we can make with an HTTP protocol. There are 7 HTTP request types.

Frankly, I do not remember what HEAD, PATCH, and OPTIONS do—if you don’t use it you lose it, right? For super simple APIs all we need to know are GET and POST requests—catch me never DELETEing anything, I can’t be trusted.

GET is used to get data from a resource. You can pass key value pairs as parameters into the GET request. “GET requests are only used to request data (not modify).”4 You should never, ever, ever, ever, ever, ever send sensitive information through a GET request.

That brings us to the POST method. POST methods are used to send data to a server for the purpose of creating, modifying, or updating resources5. When you have a lot of parameters to send, or if they’re sensitive, or if you need to send a file via API use POST.

In approaching this API design I had three questions.

  1. How do you even send a file via HTTP request?
  2. Once we send it, how do we access the file and where does it go?
  3. How do we get the data from the API to R?

I don’t speak Linux so bless httr for making this easy(ish). httr contains two functions that will be central to POSTing an excel file. There are POST() for making the POST request and upload_file() which will be used for uploading the file in the post request.

Can we just take a moment to appreciate how perfectly named functions can be sometimes? The more self-explanatory the better.

If you don’t have much experience crafting requests with httr, I recommend you start with the quickstart vignette.

The structure of our POST request will look like

POST(end_point, 
     body = list(param = upload_file("file-path.ext"))
     )

Building the first endpoint

Now we get how the file will be sent. But the tough part is actually building the plumber API endpoint which will receive it. There is a fair amount of discourse on how files should be uploaded via plumber6. Fortunately, @krlmlr pointed out mime::parse_multipart() which can be used for handling files sent in requests7.

Note: MIME is a standard way of sending files through the internet. I know nothing about MIME types and greatly appreciate the work that Yihui Xie, Jeffrey Horner, and Bian Beilei have done with the {mime} package for abstracting all of this away.

parse_multipart() will take the incoming request and return a named list. Most importantly for us the returned object contains our POSTed file to a temporary location. Within the resultant list is a path to to the temporary file. In our plumber function definition, we parse the request, and pull out the datapath. That saved path is then fed to readxl::read_excel() which returns a tibble!

#* Read excel file 
#* @post /read_excel

function(req) {
  
  multipart <- mime::parse_multipart(req)
  
  fp <- purrr::pluck(multipart, 1, "datapath", 1)
  
  readxl::read_excel(fp)
  
}

A note on developing plumber APIs

Unfortunately it’s not easy to illustrate the development of a plumber API in blog format. My secret for developing plumber APIs is a combination of the RStudio background jobs launcher and the rstudioapi package. To figure out the structure of the named list returned from parse_multipart() I returned the multipart object from the API.

In the session I was using to develop the API I had 3 scripts. The first, plumber.R contains the plumber endpoint definitions. The second, activate.R contains the following two lines of code:

pr <- plumber::plumb("plumber.R")
pr$run(host = "127.0.0.1", port = 5846)

These two lines start the API defined in plumber.R. In my third script, where I was developing from, I had the following function call:

# start a background job using the RStudio job launcher
rstudioapi::jobRunScript(path = file.path(here::here(), "activate.R"),
                         workingDir = here::here())

This function call sources the activate.R in a background session. Having the API running in another session frees up the current session I work from to develop sample POST requests.This provides a rather fast paced iterative way of testing endpoint function definitions.

I hope you’re able to understand all that use of the word session

For example, to figure out where datapath was, my API definition was strictly

#* Read excel file 
#* @post /read_excel

function(req) {
  mime::parse_multipart(req)
}

allowing me to work with the resultant object from an active R session.

Uploading files

Disclaimer: Uploading files is inherently risky business. Every time you put a new file into your system you are creating an opportunity for vulnerabilities. I am not a security expert nor an API expert so take me with a grain of salt.

Defining an upload process is rather straight forward now that we are able to access the temporary file. We will use fs::file_copy() to copy the temporary file to a permanent location. To do this, we need to determine where it will be uploaded. For the sake of example I am hard coding the upload path to be at ./data. You could feasibly create another parameter which determines where the file will be copied to, but I didn’t 🤷🏻‍♂️.

#* Upload excel file
#* @post /upload
function(req) {
  
  multipart <- mime::parse_multipart(req)
  
  fp <- purrr::pluck(multipart, 1, "datapath", 1)
  
  f_name <- purrr::pluck(multipart, 1, "name")
  
  u_path <- file.path("data", f_name)
  
  fs::file_copy(fp, u_path)
  
}

Creating API wrappers

Creating API wrappers is one of my favorite activities because it’s rather simple and feels super empowering 💪🏼. As mentioned earlier, all we will need to do to create the POST request is to specify where to make the request (the endpoint), and provide some parameters to it.

Spin up the API in the background.

# start a background job using the RStudio job launcher
rstudioapi::jobRunScript(path = file.path(here::here(), "activate.R"),
                         workingDir = here::here())

We first define an object called b_url (base url) with our endpoint. Next we specify the path of the file we want to upload within the upload_file() command. In the repository I’ve included test.xls which contains information about top coded variables in the American Community Survey (social science, amirite?). Note that uploaded file is part of a named list within the body argument. Any parameters that need to be passed to your API need to be defined in the list provided to the body. I think the name of the uploaded file needs to match that of what is defined in the plumber API (req). I may be wrong, but for safe measures!

library(httr)
library(tidyverse)

# define the url
b_url <- "http://127.0.0.1:5846/read_excel"

# make the request!
req <- POST(b_url, body = list(req = upload_file("data/test.xls")))

We have now uploaded the file and made our request! Though the request is useless to us if we can’t access the data 😮. We can get the content of the request using httr::content(). I set type = "text/json" because I find it easier to make json into a tibble than a named list.

# get json
res_json <- content(req, type = "text/json")

# show the first 50 characters of resultant json
strtrim(res_json, 50)
## [1] "[{\"statefip\":1,\"costelec\":7440,\"costgas\":5760,\"cos"

To get this json into a tibble will use jsonlite::fromJSON() and tibble::as_tibble().

res <- content(req, type = "text/json") %>% 
  jsonlite::fromJSON() %>% 
  as_tibble() 

glimpse(res)
## Observations: 52
## Variables: 27
## $ statefip     <int> 1, 2, 4, 5, 6, 8, 9, 10, 11, 12, 13, 15, 16, 17, 18…
## $ costelec     <int> 7440, 7080, 7320, 6600, 7800, 6240, 7800, 6720, 648…
## $ costgas      <int> 5760, 6240, 4800, 5880, 5400, 5520, 6840, 6360, 528…
## $ costwatr     <int> 3100, 3400, 3800, 3100, 4100, 3300, 3100, 2800, 360…
## $ costfuel     <int> 3900, 7100, 2800, 3300, 2900, 2400, 5700, 3500, 430…
## $ condofee     <int> 1100, 690, 1000, 870, 1200, 920, 940, 950, 1400, 14…
## $ rent         <int> 2400, 2800, 2900, 2000, 3900, 3200, 3300, 2800, 390…
## $ proptx99     <int> 66, 69, 69, 67, 69, 69, 69, 69, 69, 69, 69, 69, 69,…
## $ propinsr     <int> 5400, 7500, 5000, 5200, 8100, 7200, 9800, 6000, 770…
## $ mortamt1     <int> 3400, 4000, 5000, 3600, 7300, 5400, 7400, 4000, 700…
## $ mortamt2     <int> 1600, 2900, 2400, 2200, 3900, 3600, 2900, 2700, 540…
## $ moblhome     <int> 4300, 7900, 10300, 4800, 13900, 12000, 8000, 10100,…
## $ rooms        <int> 16, 16, 15, 15, 14, 17, 17, 16, 16, 14, 18, 14, 17,…
## $ bedrooms     <int> 8, 8, 6, 6, 7, 7, 7, 7, 7, 6, 7, 7, 7, 7, 8, 8, 7, …
## $ valueh       <int> 1551000, 1945000, 2387000, 1901000, 6288000, 348100…
## $ incwage      <int> 391000, 439000, 412000, 422000, 565000, 498000, 718…
## $ incbus00     <int> 335000, 256000, 280000, 295000, 372000, 320000, 448…
## $ incinvst     <int> 299000, 96000, 306000, 263000, 353000, 290000, 3230…
## $ incretir     <int> 109000, 148000, 138000, 113000, 169000, 129000, 158…
## $ incother     <int> 64000, 33200, 73000, 67000, 78000, 76000, 79000, 72…
## $ incwelfr     <int> 10200, 11800, 12700, 11300, 18900, 14500, 19000, 75…
## $ incsupp      <int> 22700, 24000, 25800, 22700, 25400, 26200, 26600, 24…
## $ incss        <int> 34500, 32400, 34400, 35100, 35300, 35100, 37700, 36…
## $ age          <int> 93, 91, 93, 94, 94, 93, 96, 92, 94, 95, 92, 96, 94,…
## $ trantime     <int> 152, 162, 145, 129, 141, 157, 142, 157, 111, 150, 1…
## $ incbus00_min <int> -6900, -6500, -5000, -7800, -4800, -5800, -4800, -6…
## $ incincst_min <int> -1500, -800, -1600, -920, -2300, -1800, -1800, -140…

Boom!!! It worked. Now time to make it a function. To make this generalizable we need to make it so that users can specify file paths for upload_file().

# define `post_excel()`
post_excel <- function(file) {

  b_url <- "http://127.0.0.1:5846/read_excel"
  
  req <- POST(b_url, body = list(req = upload_file(file)))
  
  res <- content(req, type = "text/json") %>% 
    jsonlite::fromJSON() %>% 
    tibble::as_tibble()
  
  res
  
}

You’ve created a wrapper to your API! Now you have a micro-service running and accessible via an R wrapper.

post_excel("data/test.xls")
## # A tibble: 52 x 27
##    statefip costelec costgas costwatr costfuel condofee  rent proptx99
##       <int>    <int>   <int>    <int>    <int>    <int> <int>    <int>
##  1        1     7440    5760     3100     3900     1100  2400       66
##  2        2     7080    6240     3400     7100      690  2800       69
##  3        4     7320    4800     3800     2800     1000  2900       69
##  4        5     6600    5880     3100     3300      870  2000       67
##  5        6     7800    5400     4100     2900     1200  3900       69
##  6        8     6240    5520     3300     2400      920  3200       69
##  7        9     7800    6840     3100     5700      940  3300       69
##  8       10     6720    6360     2800     3500      950  2800       69
##  9       11     6480    5280     3600     4300     1400  3900       69
## 10       12     6720    3480     3400     3600     1400  3300       69
## # … with 42 more rows, and 19 more variables: propinsr <int>,
## #   mortamt1 <int>, mortamt2 <int>, moblhome <int>, rooms <int>,
## #   bedrooms <int>, valueh <int>, incwage <int>, incbus00 <int>,
## #   incinvst <int>, incretir <int>, incother <int>, incwelfr <int>,
## #   incsupp <int>, incss <int>, age <int>, trantime <int>,
## #   incbus00_min <int>, incincst_min <int>

We can create a similar function for the /upload endpoint.

upload <- function(file) {
  b_url <- "http://127.0.0.1:5846/upload"
  
  usethis::ui_info(glue::glue("Copying {file} to /data/{file}"))
  
  req <- POST(b_url, body = list(req = upload_file(file)))
  
  invisible(req)
}
upload("data/test.xls")
## ℹ Copying data/test.xls to /data/data/test.xls

Note: I recommend using the ui_*() functions from {usethis} to provide informative messages to the user.
Second note: If you intend on only allowing a file to be uploaded once, as this function does, you should probably actually be using a PUT request.

Badah-bing badah-boom. You now have the ability to create a micro-service with plumber that is able to handle Microsoft Excel files. That is no small feat! What’s next? You should create a nice little python wrapper for your newly created API. The python wrapper will be a great asset to your team and now your R based tools are accessible to anyone or anything that can make HTTP requests!!!