I have a folder with a couple hundred .csv files that I'd like to import and merge. Each file contains two columns of data, but the files have different numbers of rows, and the rows have different names. The columns don't have names (for this question, let's say they're named x and y).

How can I merge these all together? I'd like to just stick the y columns together, side by side, rather than matching on any criteria, so that the first row is matched across all data sets and missing rows are filled with NA. I'd like column x to go away. The rows should, however, stay in the order they had in the original csv.

Here's an example:

Data frame 112_c1.csv:

x       y
1  -0.5604
3  -0.2301
4   1.5587
5   0.0705
6   0.1292


Dataframe 112_c2.csv:

x         y
2   -0.83476
3   -0.82764
8    1.32225
9    0.36363
13   0.9373
42  -1.5567
50  -0.12237
51  -0.4837

Dataframe 113_c1.csv:

x       y
5   1.5783
6   0.7736
9   0.28273
15  1.44565
23  0.999878
29 -0.223756

Desired result

112_c1.y   112_c2.y  113_c1.y
-0.5604   -0.83476   1.5783
-0.2301   -0.82764   0.7736
1.5587     1.32225   0.28273
0.0705     0.36363   1.44565
0.1292     0.9373    0.999878
NA        -1.5567    -0.223756
NA        -0.12237   NA
NA        -0.4837    NA

I've tried a few things, and looked through many other threads. But code like the following simply produces NAs for any following columns:

df <- do.call(rbind.fill, lapply(list.files(pattern = "*.csv"), read.csv))

Plus, if I use rbind instead of rbind.fill I get the error that names do not match previous names and I'm unsure of how to circumvent this matching criteria.

  • Sticking them together side-by-side defies an underlying premise of a data.frame: that each row is an observation, each value on that row is fundamentally tied together. In a survey, each row is a respondent. In a data log, each row is a point-in-time. While not insurmountable, another issue is that since they have different rows, you will have some columns with more rows than the others, which is not how frames work; the way around this is to lengthen the shorter ones, filling with NA. Commented Mar 8, 2020 at 0:19
  • What is it that you ultimately need to do with this data? There might be more appropriate methods or structures to use in place of a data.frame. Commented Mar 8, 2020 at 0:23
  • @r2evans I understand that it's weird. In this case, each column (i.e. each csv file) is a participant. The rows have different names because they are observations at different times, and the analysis selects different values based on frame rates from a video. I want to combine them for further analysis. The first thing I'll do is create a standardized average score from each column and append that to another dataset. The next thing I'll do is analyse the time-series points for each participant. Commented Mar 8, 2020 at 0:26
  • Perhaps: (1) add a column to each indicating the participant ID; (2) read them in and combine them by rows, so you'll have three columns (id, x, y). From there, analyzing them can be done by-id (dplyr::group_by or data.table's x[,,by=.(id)] semantics, as well as some base-R methods) and/or other ways. (Perhaps add "row number" as a column as well, in case you need to impose that order.) This is very much a data-science-y kind of issue :-) Commented Mar 8, 2020 at 0:31
  • @dario I'm open to other suggestions. I usually work with data frames in R and will do further analyses, so I just went to that out of habit. But, as I mentioned in my other comment, each column is a participant's time series data, with row 1 being time 0 and the last row being the last time point. Commented Mar 8, 2020 at 0:31
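
The long layout suggested in the comments (one row per observation, with columns id, x, y and a row number) could look roughly like the sketch below. It uses base R only. The folder name `long_demo`, the helper `read_one`, and the two small files written to a temporary directory are made up for illustration and assume header-less two-column csv files as described in the question.

```r
# Hypothetical demo folder with two small header-less files
demo_dir <- file.path(tempdir(), "long_demo")
dir.create(demo_dir, showWarnings = FALSE)
writeLines(c("1,-0.5604", "3,-0.2301", "4,1.5587"),
           file.path(demo_dir, "112_c1.csv"))
writeLines(c("2,-0.83476", "3,-0.82764"),
           file.path(demo_dir, "112_c2.csv"))

files <- list.files(demo_dir, pattern = "\\.csv$", full.names = TRUE)

# Read one file and tag every row with its participant and original position
read_one <- function(path) {
  d <- read.csv(path, header = FALSE, col.names = c("x", "y"))
  d$id  <- sub("\\.csv$", "", basename(path))  # participant ID from the file name
  d$row <- seq_len(nrow(d))                    # original row order within the file
  d[, c("id", "row", "x", "y")]
}

# Stack all files on top of each other
long <- do.call(rbind, lapply(files, read_one))

# Example of a by-participant analysis: mean of y per id
means <- aggregate(y ~ id, data = long, FUN = mean)
```

Because every row carries its own id and row number, files of different lengths need no NA padding in this layout.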

3 Answers

1

Suggested solution using a function to calculate summary statistics right when loading data:

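 readCalc <- function(file_path) {
   # The files have no header row, so name the columns explicitly;
   # with the default header = TRUE the first data row would be used as names
   df <- read.csv(file_path, header = FALSE, col.names = c("x", "y"))
   # One summary row per column of the file
   return(data.frame(file = file_path,
                     column = names(df),
                     averages = sapply(df, mean),
                     N = sapply(df, length),
                     min = sapply(df, min),
                     stringsAsFactors = FALSE, row.names = NULL))
 }

 # pattern is a regular expression, so match a literal ".csv" at the end of the name
 df <- do.call(rbind, lapply(list.files(pattern = "\\.csv$"), readCalc))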

If we need the first or last value we could use dplyr::first, dplyr::last. We might even want to store the whole vector in a list somewhere, but if we only need the summary stats we might not even need it.
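
As a rough sketch of that idea, the y values of each file can be kept in a named list and the first value, last value and length derived from it. This uses base R indexing in place of dplyr::first and dplyr::last; the list `ys` is typed in by hand from the question's example data rather than read from disk, and the names `ys` and `summary_df` are only illustrative.

```r
# y values per file, kept whole in a named list (values copied from the question)
ys <- list(
  "112_c1" = c(-0.5604, -0.2301, 1.5587, 0.0705, 0.1292),
  "113_c1" = c(1.5783, 0.7736, 0.28273, 1.44565, 0.999878, -0.223756)
)

# One summary row per file, computed from the stored vectors
summary_df <- data.frame(
  file  = names(ys),
  first = sapply(ys, function(v) v[1]),          # base-R equivalent of dplyr::first
  last  = sapply(ys, function(v) v[length(v)]),  # base-R equivalent of dplyr::last
  N     = lengths(ys),
  stringsAsFactors = FALSE, row.names = NULL
)
```

Keeping the list around means further statistics can be added later without reading the files again.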

2 Comments

Got it. Thanks. I see how this works. This can be easily modified to do other operations as well- very helpful. Thanks for your time!
You are welcome! Glad I could help ;)
1

Here's a solution to read all your csv files from a folder called "data" and merge the y columns into a single dataframe. This assigns the file name as the column header.

library(tidyverse)

# store csv file paths
data_path <- "data"   # path to the data
files <- dir(data_path, pattern = "\\.csv$") # get file names (pattern is a regular expression)
files <- paste(data_path, '/', files, sep="")

# read csv files and combine into a single dataframe
compiled_data <- tibble::tibble(File = files) %>% #create a tibble called compiled_data
  tidyr::extract(File, "name", "(?<=data/)(.*)(?=[.]csv)", remove = FALSE) %>% #extract the file names
  mutate(Data = lapply(File, readr::read_csv, col_names = FALSE)) %>% #create a column called Data that stores the contents of each file
  tidyr::unnest(Data) %>% #unnest the Data column into multiple columns
  select(-File) %>% #remove the File column
  na.omit() %>% #remove the NA rows
  spread(name, X2) %>% #reshape the dataframe from long to wide, one row per distinct X1 value
  select(-X1) %>% #remove the x column
  mutate_all(~ .[order(is.na(.))]) #move the NAs to the bottom of each column, keeping the order of the other values

0

Taken from here: cbind a dataframe with an empty dataframe - cbind.fill?

x <- c(1:6)
y <- c(1:3)
z <- c(1:10)

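cbind.fill <- function(...){
  nm <- list(...)
  nm <- lapply(nm, as.matrix)   # each vector becomes a one-column matrix
  n <- max(sapply(nm, nrow))    # number of rows of the longest input
  # pad every input with NA rows up to n rows, then bind them side by side
  do.call(cbind, lapply(nm, function (x)
    rbind(x, matrix(NA, n - nrow(x), ncol(x)))))
}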

df <- as.data.frame(cbind.fill(x,y,z))

colnames(df) <- c("112_c1.y", "112_c2.y", "113_c1.y")

   112_c1.y 112_c2.y 113_c1.y
1         1        1        1
2         2        2        2
3         3        3        3
4         4       NA        4
5         5       NA        5
6         6       NA        6
7        NA       NA        7
8        NA       NA        8
9        NA       NA        9
10       NA       NA       10
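
The same padding idea can be applied directly to the csv files. The following is only a sketch under the question's assumptions (header-less files, second column is y): it writes the three example files to a hypothetical temporary folder `wide_demo`, reads the second column of each, pads the shorter vectors with NA using `length<-`, and names the columns after the files.

```r
# Hypothetical demo folder holding the three example files from the question
demo_dir <- file.path(tempdir(), "wide_demo")
dir.create(demo_dir, showWarnings = FALSE)
writeLines(c("1,-0.5604", "3,-0.2301", "4,1.5587", "5,0.0705", "6,0.1292"),
           file.path(demo_dir, "112_c1.csv"))
writeLines(c("2,-0.83476", "3,-0.82764", "8,1.32225", "9,0.36363",
             "13,0.9373", "42,-1.5567", "50,-0.12237", "51,-0.4837"),
           file.path(demo_dir, "112_c2.csv"))
writeLines(c("5,1.5783", "6,0.7736", "9,0.28273", "15,1.44565",
             "23,0.999878", "29,-0.223756"),
           file.path(demo_dir, "113_c1.csv"))

files <- list.files(demo_dir, pattern = "\\.csv$", full.names = TRUE)

# Keep only the second (y) column of every file, in its original order
ys <- lapply(files, function(f) read.csv(f, header = FALSE)[[2]])
names(ys) <- paste0(sub("\\.csv$", "", basename(files)), ".y")

# Pad every vector with NA up to the length of the longest one
n <- max(lengths(ys))
padded <- lapply(ys, `length<-`, n)

# check.names = FALSE keeps names such as "112_c1.y" from being altered
wide <- do.call(data.frame, c(padded, list(check.names = FALSE)))
```

For the three example files this gives a data.frame with 8 rows and the columns 112_c1.y, 112_c2.y and 113_c1.y, with NA below the end of the shorter files.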

3 Comments

Are you adding anything to the other answer?
Also, as was said in that answer, this is returning a matrix and not a data.frame.
fixed :) and not adding anything significant to that answer but maybe the specific column names the op requested.
