Chunking your csv

Writing data subsets

Sometimes due to limitations of software, file uploads often have a row limit. I recently encountered this while creating texting campaigns using Relay. Relay is a peer-to-peer texting platform. It has a limitation of 20k contacts per texting campaign. This is a limitation when running a massive Get Out the Vote (GOTV) texting initiative.

In order to solve this problem, a large csv must be split into multiple csv’s for upload. Though this could be solved with excel and Google Sheets, who wants to labor over that?

Here I will go through the methodology of writing a csv into multiple. I will use data from the Quantitative Social Science book by Kosuke Imai.

library(tidyverse)

social <- read_csv("https://raw.githubusercontent.com/kosukeimai/qss/master/CAUSALITY/social.csv")

dim(social)
## [1] 305866      6

This dataset has 305k observations and 6 columns. For this example let’s say we wanted to split this into files of 15,000 rows or fewer. We can use the following custom funciton:

write_csv_chunk <- function(filepath, n, output_name) {
  df <- read_csv(filepath) # 1. read original file
  
  n_files <- ceiling(nrow(df)/n) # 2. identify how many files to make
  
  chunk_starts <- seq(1, n*n_files, by = n) #  3. identify the rown number to start on
  
  for (i in 1:n_files) { # 4. iterate through the csv to write the files
    chunk_end <- n*i # 4a
    df_to_write <- slice(df, chunk_starts[i]:chunk_end) # 4b
    fpath <- paste0(output_name, "_", i, ".csv") # 4c
    write_csv(df_to_write,  fpath) # 4d
    message(paste0(fpath, " was written.")) # 4e
  }
}

The function has a few steps. Let’s walk through them. The step numbers are commented above.

  1. Read in the csv.
  2. Identify the number of files that will have to be created.
  • This will be the number of rows of the data frame divided by the number of rows we want each file to have. This number will be rounded up to handle a remainder.
  • In this case ceiling(nrow(social) / 15000) is equal to 21.
  1. Identify the row number to begin splitting the dataframe for each file.
  • This will be a factor of our n plus 1, but will never exceed the nrow(df),
  1. This is the fun part, writing our files. The number of iterations is the number of files.
  • 4a: The ending row number is the iteration number multiplied by the number of rows.
  • 4b: use slice() to cute the original data frame into the chunk beginning and chunk end
  • 4c: Creating the file paththat will be written.
  • 4d: Write the csv!
  • 4e: Print a message about the file being printed.
soc_fpath <- "https://raw.githubusercontent.com/kosukeimai/qss/master/CAUSALITY/social.csv"
write_csv_chunk(filepath = soc_fpath, n = 25000, "../../static/data/chunk_data/social_chunked")
## Parsed with column specification:
## cols(
##   sex = col_character(),
##   yearofbirth = col_double(),
##   primary2004 = col_double(),
##   messages = col_character(),
##   primary2006 = col_double(),
##   hhsize = col_double()
## )
## ../../static/data/chunk_data/social_chunked_1.csv was written.
## ../../static/data/chunk_data/social_chunked_2.csv was written.
## ../../static/data/chunk_data/social_chunked_3.csv was written.
## ../../static/data/chunk_data/social_chunked_4.csv was written.
## ../../static/data/chunk_data/social_chunked_5.csv was written.
## ../../static/data/chunk_data/social_chunked_6.csv was written.
## ../../static/data/chunk_data/social_chunked_7.csv was written.
## ../../static/data/chunk_data/social_chunked_8.csv was written.
## ../../static/data/chunk_data/social_chunked_9.csv was written.
## ../../static/data/chunk_data/social_chunked_10.csv was written.
## ../../static/data/chunk_data/social_chunked_11.csv was written.
## ../../static/data/chunk_data/social_chunked_12.csv was written.
## ../../static/data/chunk_data/social_chunked_13.csv was written.

Now that we have these files split up, it will be good to know how to get them back into one piece! Check out my blog post on reading multiple csvs in as one data frame here.

Avatar
Josiah Parry
Social Data Scientist

Related