Chunking your csv

Writing data subsets

R
tutorial
Author

Josiah Parry

Published

October 27, 2018

Software limitations often mean that file uploads have a row limit. I recently ran into this while creating texting campaigns with Relay, a peer-to-peer texting platform that caps each texting campaign at 20k contacts. That cap becomes a problem when running a massive Get Out the Vote (GOTV) texting initiative.

To work around this, a large csv must be split into multiple csvs for upload. Though this could be done in Excel or Google Sheets, who wants to labor over that?

Here I will go through the methodology of splitting one csv into multiple files. I will use data from the Quantitative Social Science book by Kosuke Imai.

library(tidyverse)

social <- read_csv("https://raw.githubusercontent.com/kosukeimai/qss/master/CAUSALITY/social.csv")

dim(social)

This dataset has 305k observations and 6 columns. For this example let’s say we wanted to split this into files of 15,000 rows or fewer. We can use the following custom function:

write_csv_chunk <- function(filepath, n, output_name) {
  df <- read_csv(filepath) # 1. read original file
  
  n_files <- ceiling(nrow(df)/n) # 2. identify how many files to make
  
  chunk_starts <- seq(1, n*n_files, by = n) # 3. identify the row number to start on
  
  for (i in seq_len(n_files)) { # 4. iterate through the csv to write the files
    chunk_end <- min(n*i, nrow(df)) # 4a. last row of this chunk, capped at the final row
    df_to_write <- slice(df, chunk_starts[i]:chunk_end) # 4b. subset the rows for this chunk
    fpath <- paste0(output_name, "_", i, ".csv") # 4c. build the output file name
    write_csv(df_to_write, fpath) # 4d. write the chunk to disk
    message(paste0(fpath, " was written.")) # 4e. report progress
  }
}

The function has a few steps. Let’s walk through them. The step numbers are commented above.

  1. Read in the csv.
  2. Identify the number of files that will have to be created.
  3. Identify the row number to begin splitting the dataframe for each file.
  4. This is the fun part, writing our files. The number of iterations is the number of files.
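
To make the arithmetic in steps 2 through 4 concrete, here is a minimal sketch in base R. The sizes (10 rows, chunks of at most 4) and the name chunk_ends are made up for illustration and are not part of the function above.

```r
# Hypothetical sizes for illustration: 10 rows, at most 4 rows per file.
n_rows <- 10
n <- 4

# Step 2: how many files are needed (the last one may be partial)
n_files <- ceiling(n_rows / n)

# Step 3: the row each file starts on
chunk_starts <- seq(1, n * n_files, by = n)

# Step 4a: the row each file ends on, capped at the final row
chunk_ends <- pmin(chunk_starts + n - 1, n_rows)

# One line per file: which rows it would hold
data.frame(file = seq_len(n_files), start = chunk_starts, end = chunk_ends)
```

Here the third file would hold only rows 9 and 10, which is the "or fewer" case for the last chunk.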
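
Now we can call the function on the social data. Note that this call uses chunks of up to 25,000 rows:

soc_fpath <- "https://raw.githubusercontent.com/kosukeimai/qss/master/CAUSALITY/social.csv"
write_csv_chunk(filepath = soc_fpath, n = 25000, output_name = "../../static/data/chunk_data/social_chunked")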
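
Before uploading anything, it can be worth checking that no rows were lost and that no file exceeds the limit. The sketch below is one way to do that, under some assumptions: it uses the built-in mtcars data as a stand-in for the real file, base R's write.csv() and read.csv() instead of readr, a temporary folder, and a made-up limit of 10 rows.

```r
# Hypothetical stand-in data: the built-in mtcars data frame (32 rows).
# Files go to a fresh temporary folder so nothing on disk is overwritten.
out_dir <- tempfile("chunk_check_")
dir.create(out_dir)

n <- 10 # illustrative row limit per file
n_files <- ceiling(nrow(mtcars) / n)

# Write the chunks, mirroring the logic of the loop in write_csv_chunk()
for (i in seq_len(n_files)) {
  rows <- ((i - 1) * n + 1):min(i * n, nrow(mtcars))
  fpath <- file.path(out_dir, paste0("mtcars_", i, ".csv"))
  write.csv(mtcars[rows, ], fpath, row.names = FALSE)
}

# Read each chunk back and count its rows
chunk_files <- list.files(out_dir, pattern = "^mtcars_[0-9]+\\.csv$", full.names = TRUE)
rows_per_file <- vapply(chunk_files, function(f) nrow(read.csv(f)), integer(1))

length(chunk_files)                   # number of files written
sum(rows_per_file) == nrow(mtcars)    # TRUE if no rows were lost
all(rows_per_file <= n)               # TRUE if every file respects the limit
```

The same three checks should carry over to the real chunks by pointing list.files() at the output folder and comparing against nrow(social).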

Now that we have these files split up, it will be good to know how to get them back into one piece! Check out my blog post on reading multiple csvs in as one data frame here.