Reading Multiple csvs as 1 data frame

Reading chunked csv files

R
tutorial
Author

Josiah Parry

Published

October 27, 2018

In an earlier posting I wrote about having to break a single csv into multiple csvs. In other scenarios one data set maybe provided as multiple a csvs.

Thankfully purrr has a beautiful function called map_df() which will make this into a two liner. This process has essentially 3 steps.

  1. Create a vector of all .csv files that should be merged together.
  2. Read each file using readr::read_csv()
  3. Combine each dataframe into one.

map_df() maps (applys) a function to each value of an object and produces a dataframe of all outputs.

For this example I will use the csvs I created in a previous tutorial utilizing a dataset from the Quantitative Social Science book.

# Get all csv file names 
file_names <- list.files("../../static/data/chunk_data", pattern = "\\.csv", full.names = TRUE)
file_names
#>  [1] "../../static/data/chunk_data/social_chunked_1.csv" 
#>  [2] "../../static/data/chunk_data/social_chunked_10.csv"
#>  [3] "../../static/data/chunk_data/social_chunked_11.csv"
#>  [4] "../../static/data/chunk_data/social_chunked_12.csv"
#>  [5] "../../static/data/chunk_data/social_chunked_13.csv"
#>  [6] "../../static/data/chunk_data/social_chunked_2.csv" 
#>  [7] "../../static/data/chunk_data/social_chunked_3.csv" 
#>  [8] "../../static/data/chunk_data/social_chunked_4.csv" 
#>  [9] "../../static/data/chunk_data/social_chunked_5.csv" 
#> [10] "../../static/data/chunk_data/social_chunked_6.csv" 
#> [11] "../../static/data/chunk_data/social_chunked_7.csv" 
#> [12] "../../static/data/chunk_data/social_chunked_8.csv" 
#> [13] "../../static/data/chunk_data/social_chunked_9.csv"
library(tidyverse)
# apply 
all_csvs <- map_df(file_names, read_csv)
#> Rows: 25000 Columns: 6
#> ── Column specification ────────────────────────────────────────────────────────────────
#> Delimiter: ","
#> chr (2): sex, messages
#> dbl (4): yearofbirth, primary2004, primary2006, hhsize
#> 
#>  Use `spec()` to retrieve the full column specification for this data.
#>  Specify the column types or set `show_col_types = FALSE` to quiet this message.
#> Rows: 25000 Columns: 6
#> ── Column specification ────────────────────────────────────────────────────────────────
#> Delimiter: ","
#> chr (2): sex, messages
#> dbl (4): yearofbirth, primary2004, primary2006, hhsize
#> 
#>  Use `spec()` to retrieve the full column specification for this data.
#>  Specify the column types or set `show_col_types = FALSE` to quiet this message.
#> Rows: 25000 Columns: 6
#> ── Column specification ────────────────────────────────────────────────────────────────
#> Delimiter: ","
#> chr (2): sex, messages
#> dbl (4): yearofbirth, primary2004, primary2006, hhsize
#> 
#>  Use `spec()` to retrieve the full column specification for this data.
#>  Specify the column types or set `show_col_types = FALSE` to quiet this message.
#> Rows: 25000 Columns: 6
#> ── Column specification ────────────────────────────────────────────────────────────────
#> Delimiter: ","
#> chr (2): sex, messages
#> dbl (4): yearofbirth, primary2004, primary2006, hhsize
#> 
#>  Use `spec()` to retrieve the full column specification for this data.
#>  Specify the column types or set `show_col_types = FALSE` to quiet this message.
#> Rows: 5866 Columns: 6
#> ── Column specification ────────────────────────────────────────────────────────────────
#> Delimiter: ","
#> chr (2): sex, messages
#> dbl (4): yearofbirth, primary2004, primary2006, hhsize
#> 
#>  Use `spec()` to retrieve the full column specification for this data.
#>  Specify the column types or set `show_col_types = FALSE` to quiet this message.
#> Rows: 25000 Columns: 6
#> ── Column specification ────────────────────────────────────────────────────────────────
#> Delimiter: ","
#> chr (2): sex, messages
#> dbl (4): yearofbirth, primary2004, primary2006, hhsize
#> 
#>  Use `spec()` to retrieve the full column specification for this data.
#>  Specify the column types or set `show_col_types = FALSE` to quiet this message.
#> Rows: 25000 Columns: 6
#> ── Column specification ────────────────────────────────────────────────────────────────
#> Delimiter: ","
#> chr (2): sex, messages
#> dbl (4): yearofbirth, primary2004, primary2006, hhsize
#> 
#>  Use `spec()` to retrieve the full column specification for this data.
#>  Specify the column types or set `show_col_types = FALSE` to quiet this message.
#> Rows: 25000 Columns: 6
#> ── Column specification ────────────────────────────────────────────────────────────────
#> Delimiter: ","
#> chr (2): sex, messages
#> dbl (4): yearofbirth, primary2004, primary2006, hhsize
#> 
#>  Use `spec()` to retrieve the full column specification for this data.
#>  Specify the column types or set `show_col_types = FALSE` to quiet this message.
#> Rows: 25000 Columns: 6
#> ── Column specification ────────────────────────────────────────────────────────────────
#> Delimiter: ","
#> chr (2): sex, messages
#> dbl (4): yearofbirth, primary2004, primary2006, hhsize
#> 
#>  Use `spec()` to retrieve the full column specification for this data.
#>  Specify the column types or set `show_col_types = FALSE` to quiet this message.
#> Rows: 25000 Columns: 6
#> ── Column specification ────────────────────────────────────────────────────────────────
#> Delimiter: ","
#> chr (2): sex, messages
#> dbl (4): yearofbirth, primary2004, primary2006, hhsize
#> 
#>  Use `spec()` to retrieve the full column specification for this data.
#>  Specify the column types or set `show_col_types = FALSE` to quiet this message.
#> Rows: 25000 Columns: 6
#> ── Column specification ────────────────────────────────────────────────────────────────
#> Delimiter: ","
#> chr (2): sex, messages
#> dbl (4): yearofbirth, primary2004, primary2006, hhsize
#> 
#>  Use `spec()` to retrieve the full column specification for this data.
#>  Specify the column types or set `show_col_types = FALSE` to quiet this message.
#> Rows: 25000 Columns: 6
#> ── Column specification ────────────────────────────────────────────────────────────────
#> Delimiter: ","
#> chr (2): sex, messages
#> dbl (4): yearofbirth, primary2004, primary2006, hhsize
#> 
#>  Use `spec()` to retrieve the full column specification for this data.
#>  Specify the column types or set `show_col_types = FALSE` to quiet this message.
#> Rows: 25000 Columns: 6
#> ── Column specification ────────────────────────────────────────────────────────────────
#> Delimiter: ","
#> chr (2): sex, messages
#> dbl (4): yearofbirth, primary2004, primary2006, hhsize
#> 
#>  Use `spec()` to retrieve the full column specification for this data.
#>  Specify the column types or set `show_col_types = FALSE` to quiet this message.

# preview the data
head(all_csvs)
#> # A tibble: 6 × 6
#>   sex    yearofbirth primary2004 messages   primary2006 hhsize
#>   <chr>        <dbl>       <dbl> <chr>            <dbl>  <dbl>
#> 1 male          1941           0 Civic Duty           0      2
#> 2 female        1947           0 Civic Duty           0      2
#> 3 male          1951           0 Hawthorne            1      3
#> 4 female        1950           0 Hawthorne            1      3
#> 5 female        1982           0 Hawthorne            1      3
#> 6 male          1981           0 Control              0      3