Reading Multiple csvs as 1 data frame
Reading chunked csv files
R
tutorial
In an earlier posting I wrote about having to break a single csv into multiple csvs. In other scenarios one data set maybe provided as multiple a csvs.
Thankfully purrr
has a beautiful function called map_df()
which will make this into a two liner. This process has essentially 3 steps.
- Create a vector of all
.csv
files that should be merged together. - Read each file using
readr::read_csv()
- Combine each dataframe into one.
map_df()
maps (applys) a function to each value of an object and produces a dataframe of all outputs.
For this example I will use the csvs I created in a previous tutorial utilizing a dataset from the Quantitative Social Science book.
# Get all csv file names
file_names <- list.files("../../static/data/chunk_data", pattern = "\\.csv", full.names = TRUE)
file_names
#> [1] "../../static/data/chunk_data/social_chunked_1.csv"
#> [2] "../../static/data/chunk_data/social_chunked_10.csv"
#> [3] "../../static/data/chunk_data/social_chunked_11.csv"
#> [4] "../../static/data/chunk_data/social_chunked_12.csv"
#> [5] "../../static/data/chunk_data/social_chunked_13.csv"
#> [6] "../../static/data/chunk_data/social_chunked_2.csv"
#> [7] "../../static/data/chunk_data/social_chunked_3.csv"
#> [8] "../../static/data/chunk_data/social_chunked_4.csv"
#> [9] "../../static/data/chunk_data/social_chunked_5.csv"
#> [10] "../../static/data/chunk_data/social_chunked_6.csv"
#> [11] "../../static/data/chunk_data/social_chunked_7.csv"
#> [12] "../../static/data/chunk_data/social_chunked_8.csv"
#> [13] "../../static/data/chunk_data/social_chunked_9.csv"
library(tidyverse)
# apply
all_csvs <- map_df(file_names, read_csv)
#> Rows: 25000 Columns: 6
#> ── Column specification ────────────────────────────────────────────────────────────────
#> Delimiter: ","
#> chr (2): sex, messages
#> dbl (4): yearofbirth, primary2004, primary2006, hhsize
#>
#> ℹ Use `spec()` to retrieve the full column specification for this data.
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
#> Rows: 25000 Columns: 6
#> ── Column specification ────────────────────────────────────────────────────────────────
#> Delimiter: ","
#> chr (2): sex, messages
#> dbl (4): yearofbirth, primary2004, primary2006, hhsize
#>
#> ℹ Use `spec()` to retrieve the full column specification for this data.
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
#> Rows: 25000 Columns: 6
#> ── Column specification ────────────────────────────────────────────────────────────────
#> Delimiter: ","
#> chr (2): sex, messages
#> dbl (4): yearofbirth, primary2004, primary2006, hhsize
#>
#> ℹ Use `spec()` to retrieve the full column specification for this data.
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
#> Rows: 25000 Columns: 6
#> ── Column specification ────────────────────────────────────────────────────────────────
#> Delimiter: ","
#> chr (2): sex, messages
#> dbl (4): yearofbirth, primary2004, primary2006, hhsize
#>
#> ℹ Use `spec()` to retrieve the full column specification for this data.
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
#> Rows: 5866 Columns: 6
#> ── Column specification ────────────────────────────────────────────────────────────────
#> Delimiter: ","
#> chr (2): sex, messages
#> dbl (4): yearofbirth, primary2004, primary2006, hhsize
#>
#> ℹ Use `spec()` to retrieve the full column specification for this data.
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
#> Rows: 25000 Columns: 6
#> ── Column specification ────────────────────────────────────────────────────────────────
#> Delimiter: ","
#> chr (2): sex, messages
#> dbl (4): yearofbirth, primary2004, primary2006, hhsize
#>
#> ℹ Use `spec()` to retrieve the full column specification for this data.
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
#> Rows: 25000 Columns: 6
#> ── Column specification ────────────────────────────────────────────────────────────────
#> Delimiter: ","
#> chr (2): sex, messages
#> dbl (4): yearofbirth, primary2004, primary2006, hhsize
#>
#> ℹ Use `spec()` to retrieve the full column specification for this data.
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
#> Rows: 25000 Columns: 6
#> ── Column specification ────────────────────────────────────────────────────────────────
#> Delimiter: ","
#> chr (2): sex, messages
#> dbl (4): yearofbirth, primary2004, primary2006, hhsize
#>
#> ℹ Use `spec()` to retrieve the full column specification for this data.
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
#> Rows: 25000 Columns: 6
#> ── Column specification ────────────────────────────────────────────────────────────────
#> Delimiter: ","
#> chr (2): sex, messages
#> dbl (4): yearofbirth, primary2004, primary2006, hhsize
#>
#> ℹ Use `spec()` to retrieve the full column specification for this data.
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
#> Rows: 25000 Columns: 6
#> ── Column specification ────────────────────────────────────────────────────────────────
#> Delimiter: ","
#> chr (2): sex, messages
#> dbl (4): yearofbirth, primary2004, primary2006, hhsize
#>
#> ℹ Use `spec()` to retrieve the full column specification for this data.
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
#> Rows: 25000 Columns: 6
#> ── Column specification ────────────────────────────────────────────────────────────────
#> Delimiter: ","
#> chr (2): sex, messages
#> dbl (4): yearofbirth, primary2004, primary2006, hhsize
#>
#> ℹ Use `spec()` to retrieve the full column specification for this data.
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
#> Rows: 25000 Columns: 6
#> ── Column specification ────────────────────────────────────────────────────────────────
#> Delimiter: ","
#> chr (2): sex, messages
#> dbl (4): yearofbirth, primary2004, primary2006, hhsize
#>
#> ℹ Use `spec()` to retrieve the full column specification for this data.
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
#> Rows: 25000 Columns: 6
#> ── Column specification ────────────────────────────────────────────────────────────────
#> Delimiter: ","
#> chr (2): sex, messages
#> dbl (4): yearofbirth, primary2004, primary2006, hhsize
#>
#> ℹ Use `spec()` to retrieve the full column specification for this data.
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# preview the data
head(all_csvs)
#> # A tibble: 6 × 6
#> sex yearofbirth primary2004 messages primary2006 hhsize
#> <chr> <dbl> <dbl> <chr> <dbl> <dbl>
#> 1 male 1941 0 Civic Duty 0 2
#> 2 female 1947 0 Civic Duty 0 2
#> 3 male 1951 0 Hawthorne 1 3
#> 4 female 1950 0 Hawthorne 1 3
#> 5 female 1982 0 Hawthorne 1 3
#> 6 male 1981 0 Control 0 3