Importing and combining multiple CSV files in R with differing numbers and names of rows

Question

I have a folder with a couple hundred .csv files that I'd like to import and merge. Each file contains two columns of data, but there are different numbers of rows, and the rows have different names. The columns don't have names (For this, let's say they're named x and y).

How can I merge these all together? I'd like to just stick the y columns together, side-by-side, rather than matching on any criteria, so that the first row is matched across all data sets and missing rows are given NA. I'd like column x to go away. The rows should stay in the order they had in the original csv, though.

Here's an example:

Data frame 112_c1.csv:

x       y
1  -0.5604
3  -0.2301
4   1.5587
5   0.0705
6   0.1292


Dataframe 112_c2.csv:

x         y
2   -0.83476
3   -0.82764
8    1.32225
9    0.36363
13   0.9373
42  -1.5567
50  -0.12237
51  -0.4837

Dataframe 113_c1.csv:

x       y
5   1.5783
6   0.7736
9   0.28273
15  1.44565
23  0.999878
29 -0.223756

Desired result

112_c1.y   112_c2.y   113_c1.y
-0.5604    -0.83476   1.5783
-0.2301    -0.82764   0.7736
1.5587     1.32225    0.28273
0.0705     0.36363    1.44565
0.1292     0.9373     0.999878
NA         -1.5567    -0.223756
NA         -0.12237   NA
NA         -0.4837    NA

I've tried a few things, and looked through many other threads. But code like the following simply produces NAs for any following columns:

df <- do.call(rbind.fill, lapply(list.files(pattern = "*.csv"), read.csv))

Plus, if I use rbind instead of rbind.fill I get the error that names do not match previous names and I'm unsure of how to circumvent this matching criteria.

Sticking them together side-by-side defies an underlying premise of a data.frame: that each row is an observation and each value on that row is fundamentally tied together. In a survey, each row is a respondent. In a data log, each row is a point in time. While not insurmountable, another issue is that since the files have different numbers of rows, some columns will be longer than others, which is not how frames work; the way around this is to lengthen the shorter ones, filling with NA.
– r2evans, Commented Mar 8, 2020 at 0:19
What is it that you ultimately need to do with this data? There might be more appropriate methods or structures to use in place of a data.frame.
– r2evans, Commented Mar 8, 2020 at 0:23
@r2evans I understand that it's weird. In this case, each column (i.e. each csv file) is a participant. The rows have different names because they are observations at different times, and the analysis selects different values based on frame rates from a video. I want to combine them for further analysis. The first thing I'll do is create a standardized average score from each column and append that to another dataset. The next thing I'll do is analyse the time-series points for each participant.
– socialresearcher, Commented Mar 8, 2020 at 0:26
Perhaps: (1) add a column to each indicating the participant ID; (2) read them in and combine them by rows, so you'll have three columns (id, x, y). From there, analyzing them can be done by-id (dplyr::group_by or data.table's x[,,by=.(id)] semantics, as well as some base-R methods) and/or other ways. (Perhaps add "row number" as a column as well, in case you need to impose that order.) This is very much a data-science-y kind of issue :-)
– r2evans, Commented Mar 8, 2020 at 0:31
@dario I'm open to other suggestions. I usually work with data frames in R and will do further analyses, so I just went to that out of habit. But, as I mentioned in my other comment, each column is a participant's time series data, with row 1 being time 0 and the last row being the last time point.
– socialresearcher, Commented Mar 8, 2020 at 0:31
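
The long-format layout suggested in the comments (one row per observation, with id, row number, x and y) might look roughly like the sketch below. It uses base R only; the helper name read_long and the temporary demo files are assumptions made for illustration, not part of the original thread.

```r
# Hypothetical helper: read one headerless two-column CSV and tag every
# row with a participant id taken from the file name.
read_long <- function(path) {
  d <- read.csv(path, header = FALSE, col.names = c("x", "y"))
  data.frame(
    id  = tools::file_path_sans_ext(basename(path)),
    row = seq_len(nrow(d)),   # original row order within the file
    x   = d$x,
    y   = d$y,
    stringsAsFactors = FALSE
  )
}

# Two small temporary files stand in for the real folder of CSVs.
demo_dir <- tempfile("long_demo")
dir.create(demo_dir)
write.table(data.frame(c(1, 3, 4), c(-0.5604, -0.2301, 1.5587)),
            file.path(demo_dir, "112_c1.csv"),
            sep = ",", row.names = FALSE, col.names = FALSE)
write.table(data.frame(c(2, 3), c(-0.83476, -0.82764)),
            file.path(demo_dir, "112_c2.csv"),
            sep = ",", row.names = FALSE, col.names = FALSE)

# Read every file and stack the results by rows.
files   <- list.files(demo_dir, pattern = "\\.csv$", full.names = TRUE)
long_df <- do.call(rbind, lapply(files, read_long))

# Per-participant summary without ever widening the data.
means <- aggregate(y ~ id, data = long_df, FUN = mean)
```

Because each row keeps its own id and row number, files of different lengths need no NA padding, and the original order can be restored at any time by sorting on id and row.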

dario · Accepted Answer · 2020-03-08 01:02:16Z

1

Suggested solution using a function to calculate summary statistics right when loading data:

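 # Read one CSV and return one row of summary statistics per column
 readCalc <- function(file_path) {
   # The files have no header row, so the columns are named "x" and "y" here
   df <- read.csv(file_path, header = FALSE, col.names = c("x", "y"))
   data.frame(file = file_path,
              column = names(df),
              averages = sapply(df, mean),
              N = sapply(df, length),
              min = sapply(df, min),
              stringsAsFactors = FALSE, row.names = NULL)
 }

 # pattern is a regular expression, not a glob, hence "\\.csv$"
 df <- do.call(rbind, lapply(list.files(pattern = "\\.csv$"), readCalc))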

If we need the first or last value we could use dplyr::first, dplyr::last. We might even want to store the whole vector in a list somewhere, but if we only need the summary stats we might not even need it.
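
As a rough sketch of the "store the whole vector in a list" idea, the summary table can carry a list-column next to the statistics. The series object below is stand-in data copied from the question's example, and base indexing is used here in place of dplyr::first and dplyr::last; both choices are assumptions for illustration.

```r
# Stand-in data: one y vector per file, of different lengths.
series <- list(
  `112_c1` = c(-0.5604, -0.2301, 1.5587, 0.0705, 0.1292),
  `113_c1` = c(1.5783, 0.7736, 0.28273, 1.44565, 0.999878, -0.223756)
)

# One row of summary statistics per file.
summary_df <- data.frame(
  file  = names(series),
  N     = lengths(series),
  mean  = vapply(series, mean, numeric(1)),
  first = vapply(series, function(v) v[1], numeric(1)),
  last  = vapply(series, function(v) v[length(v)], numeric(1)),
  row.names = NULL,
  stringsAsFactors = FALSE
)

# A list-column keeps the full time series alongside its summary row.
summary_df$y <- I(unname(series))
```

The full series for a file is then available as, for example, summary_df$y[[2]], so the time-series analysis can start from the same table as the summary scores.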

answered Mar 8, 2020 at 1:02


2 Comments

socialresearcher Over a year ago

Got it. Thanks. I see how this works. This can be easily modified to do other operations as well; very helpful. Thanks for your time!

dario Over a year ago

You are welcome! Glad I could help ;)

Caitlin · Answer · 2020-03-08 01:21:21Z

Here's a solution to read all your csv files from a folder called "data" and merge the y columns into a single dataframe. This assigns the file name as the column header.

library(tidyverse)

# store csv file paths
data_path <- "data"   # path to the data
files <- dir(data_path, pattern = "\\.csv$") # get file names (pattern is a regular expression)
files <- paste(data_path, '/', files, sep="")

# read csv files and combine into a single dataframe
compiled_data <- tibble::tibble(File = files) %>% # create a tibble with one row per file
  tidyr::extract(File, "name", "(?<=data/)(.*)(?=[.]csv)", remove = FALSE) %>% # extract the file names
  mutate(Data = lapply(File, readr::read_csv, col_names = FALSE)) %>% # create a list-column called Data that stores each file's contents
  tidyr::unnest(Data) %>% # unnest the Data column into the columns X1 and X2
  select(-File) %>% # remove the File column
  na.omit() %>% # remove the NA rows
  spread(name, X2) %>% # reshape the dataframe from long to wide, one column per file
  select(-X1) %>% # remove the x column
  mutate(across(everything(), ~ .x[order(is.na(.x))])) %>% # move each column's NAs to the bottom
  filter(if_any(everything(), ~ !is.na(.x))) # drop trailing rows that are NA in every column
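
The final reordering step relies on a small idiom that may be easier to see in isolation. Here is a minimal sketch on a made-up vector (the values are borrowed from the question; the variable names are arbitrary):

```r
# A column as it might look after reshaping: values interleaved with NA.
y <- c(NA, -0.5604, NA, -0.2301, 1.5587)

# is.na(y) is FALSE for real values and TRUE for NA. order() sorts FALSE
# before TRUE and keeps ties in their original order, so the real values
# move to the top without being reshuffled.
collapsed <- y[order(is.na(y))]
collapsed
```

Applied column by column, this pushes every column's NAs to the bottom while preserving the order of the observed values.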

ecology · Answer · 2020-03-08 00:25:43Z

0

Taken from here: cbind a dataframe with an empty dataframe - cbind.fill?

x <- c(1:6)
y <- c(1:3)
z <- c(1:10)

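# Column-bind vectors or matrices of unequal length, padding the shorter ones with NA
cbind.fill <- function(...){
  nm <- list(...)
  nm <- lapply(nm, as.matrix)   # each vector becomes a one-column matrix
  n <- max(sapply(nm, nrow))    # row count of the longest input
  do.call(cbind, lapply(nm, function (x)
    rbind(x, matrix(NA, n-nrow(x), ncol(x)))))   # append NA rows up to n
}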

df <- as.data.frame(cbind.fill(x,y,z))

colnames(df) <- c("112_c1.y", "112_c2.y", "113_c1.y")

   112_c1.y 112_c2.y 113_c1.y
1         1        1        1
2         2        2        2
3         3        3        3
4         4       NA        4
5         5       NA        5
6         6       NA        6
7        NA       NA        7
8        NA       NA        8
9        NA       NA        9
10       NA       NA       10
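
To connect this padding idea to the files in the question, one possible sketch reads the second column of each CSV, pads every vector to the longest length, and binds them side by side. The helper names read_y and pad_to_longest are hypothetical, and the temporary files simply recreate the question's example data.

```r
# Hypothetical helper: keep only the second (y) column of a headerless CSV.
read_y <- function(path) read.csv(path, header = FALSE)[[2]]

# Hypothetical helper: pad every vector in a list with NA up to the longest one.
pad_to_longest <- function(vectors) {
  n <- max(lengths(vectors))
  lapply(vectors, function(v) c(v, rep(NA, n - length(v))))
}

# Recreate the three example files from the question in a temporary folder.
demo_dir <- tempfile("wide_demo")
dir.create(demo_dir)
write_demo <- function(name, x, y) {
  write.table(data.frame(x, y), file.path(demo_dir, name),
              sep = ",", row.names = FALSE, col.names = FALSE)
}
write_demo("112_c1.csv", c(1, 3, 4, 5, 6),
           c(-0.5604, -0.2301, 1.5587, 0.0705, 0.1292))
write_demo("112_c2.csv", c(2, 3, 8, 9, 13, 42, 50, 51),
           c(-0.83476, -0.82764, 1.32225, 0.36363, 0.9373,
             -1.5567, -0.12237, -0.4837))
write_demo("113_c1.csv", c(5, 6, 9, 15, 23, 29),
           c(1.5783, 0.7736, 0.28273, 1.44565, 0.999878, -0.223756))

# Read, pad, and bind side by side; columns are named after the files.
files <- list.files(demo_dir, pattern = "\\.csv$", full.names = TRUE)
ys    <- lapply(files, read_y)
wide  <- as.data.frame(do.call(cbind, pad_to_longest(ys)))
names(wide) <- paste0(tools::file_path_sans_ext(basename(files)), ".y")
```

Row order within each column is exactly the order in the file, since nothing is matched or sorted on x.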

edited Mar 8, 2020 at 0:25

answered Mar 8, 2020 at 0:19


3 Comments

r2evans Over a year ago

Are you adding anything to the other answer?

r2evans Over a year ago

Also, as was said in that answer, this is returning a matrix and not a data.frame.

ecology Over a year ago

fixed :) and not adding anything significant to that answer but maybe the specific column names the op requested.
