Combine some csv files into one - different number of columns

Question

I have already loaded 20 csv files using:

tbl = list.files(pattern="*.csv")
for (i in 1:length(tbl)) assign(tbl[i], read.csv(tbl[i]))

or

list_of_data = lapply(tbl, read.csv)

This is what it looks like:

> head(tbl)
[1] "F1.csv"          "F10_noS3.csv"    "F11.csv"         "F12.csv"         "F12_noS7_S8.csv"
[6] "F13.csv"

I have to combine all of those files into one master file, but let's start by building a single table of all the names. Every csv file has a column called "Accession", and I would like a table of all the "names" (accessions) from all of the files. Many accessions are of course repeated across different csv files, and I would like to keep all of the data corresponding to each accession.

Some problems:

Some of those "names" are the same and I don't want to duplicate them.
Some of those "names" are ALMOST the same. The only difference is a dot and a number after the name.
The number of columns can differ between the csv files.

Here is a screenshot showing what the data looks like: http://imageshack.com/a/img811/7103/29hg.jpg

Let me show you how it looks:

AT3G26450.1 <--
AT5G44520.2
AT4G24770.1
AT2G37220.2
AT3G02520.1
AT5G05270.1
AT1G32060.1
AT3G52380.1
AT2G43910.2
AT2G19760.1
AT3G26450.2 <--

<-- = Same sample, different names. These should be treated as one, so just ignore the dot and the number after it.

Is this possible?

I couldn't post the output of dput(head) because even that is too big.

I tried the following code:

all_data = do.call(rbind, list_of_data)
Error in rbind(deparse.level, ...) : 
The number of columns is not correct.


all_data$CleanedAccession = str_extract(all_data$Accession, "^[[:alnum:]]+")
all_data = subset(all_data, !duplicated(CleanedAccession))

I have been trying to do this for almost 2 weeks without success, so please help me.

nassimhddd · Accepted Answer · 2014-02-06 16:15:18Z

3

Your question seems to contain multiple subquestions. I encourage you to separate them.

The first thing you apparently need is to combine data frames with different columns. You can use rbind.fill from the plyr package:

library(plyr)
all_data = do.call(rbind.fill, list_of_data)
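To make it clearer what "filling" means here, this is a rough base R sketch of the same idea for readers who can't install plyr. The helper name bind_fill and the toy data frames are made up for illustration; this is not plyr's actual implementation, only an approximation of its behaviour.

```r
# Hypothetical helper: row-bind data frames that have different columns,
# padding the columns a data frame lacks with NA (roughly what rbind.fill does).
bind_fill <- function(dfs) {
  # Union of all column names, in order of first appearance
  all_cols <- unique(unlist(lapply(dfs, names)))
  padded <- lapply(dfs, function(df) {
    for (col in setdiff(all_cols, names(df))) {
      df[[col]] <- rep(NA, nrow(df))  # add the missing column as NA
    }
    df[all_cols]                      # same column order in every data frame
  })
  do.call(rbind, padded)
}

# Toy stand-ins for two csv files with different columns
a <- data.frame(Accession = c("AT3G26450.1", "AT5G44520.2"), x = c(1, 2))
b <- data.frame(Accession = "AT4G24770.1", y = 5)

all_data <- bind_fill(list(a, b))
print(all_data)
```

With the real data you would call bind_fill(list_of_data) instead of passing the toy list.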

4 Comments

hadley Over a year ago

rbind.fill(list_of_data) will be faster, dplyr::rbind_all(list_of_data) will be faster yet.

Rechlay Over a year ago

Both work great and are fast enough. Any ideas on how to remove duplicates and the "same names" that differ only by the number after the dot? Thanks! Any suggestions as to what I did wrong to deserve a downvote?

Mike.Gahan Over a year ago

Very useful answer (and efficient!). Thanks so much.

jessi Over a year ago

If you try to dplyr::rbind_all(list_of_data), the R session will be aborted if the list elements are not of the same length.
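A note for later readers: rbind_all mentioned in these comments was deprecated and has since been removed from dplyr; bind_rows is its replacement and also fills missing columns with NA. A minimal sketch, assuming a current dplyr is installed and using made-up toy data in place of the real csv files (the column names score and coverage are invented):

```r
library(dplyr)

# Toy stand-in for list_of_data: two data frames with different columns
list_of_data <- list(
  data.frame(Accession = c("AT3G26450.1", "AT5G44520.2"), score = c(1.5, 2.0)),
  data.frame(Accession = "AT3G26450.2", coverage = 40)
)

# bind_rows accepts a list of data frames; columns absent from an input
# are filled with NA. .id adds a column recording which list element
# each row came from.
all_data <- bind_rows(list_of_data, .id = "source")
print(all_data)
```

Unlike rbind.fill, bind_rows raises an error when the same column has incompatible types in different files (for example character in one and numeric in another), so the columns may need converting first.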

sbha · Accepted Answer · 2018-07-15 12:54:05Z

Here's an example using some tidyverse functions and a custom function that can combine multiple csv files with missing columns into one data frame:

library(tidyverse)

# specify the target directory
dir_path <- '~/test_dir/' 

# specify the naming format of the files. 
# in this case csv files that begin with 'test' and a single digit, but it could be as simple as 'csv'
re_file <- '^test[0-9]\\.csv'

# create sample data with some missing columns 
df_mtcars <- mtcars %>% rownames_to_column('car_name')
write.csv(df_mtcars %>% select(-am), paste0(dir_path, 'test1.csv'), row.names = FALSE)
write.csv(df_mtcars %>% select(-wt, -gear), paste0(dir_path, 'test2.csv'), row.names = FALSE)
write.csv(df_mtcars %>% select(-cyl), paste0(dir_path, 'test3.csv'), row.names = FALSE)

# custom function that takes the target directory and file name pattern as arguments
read_dir <- function(dir_path, file_name){
  x <- read_csv(paste0(dir_path, file_name)) %>% 
    mutate(file_name = file_name) %>% # add the file name as a column              
    select(file_name, everything())   # reorder the columns so file name is first
  return(x)
}

# read the files from the target directory that match the naming format and combine into one data frame
df_panel <-
  list.files(dir_path, pattern = re_file) %>% 
  map_df(~ read_dir(dir_path, .))

# files with missing columns are filled with NAs.
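Neither answer covers the second half of the question, treating AT3G26450.1 and AT3G26450.2 as the same accession. Here is a minimal base R sketch, assuming the suffix is always a dot followed by digits at the end of the name. The toy data frame and the value column are made up; CleanedAccession is the column name from the question.

```r
# Toy stand-in for the combined data
all_data <- data.frame(
  Accession = c("AT3G26450.1", "AT5G44520.2", "AT3G26450.2"),
  value = c(10, 20, 30),
  stringsAsFactors = FALSE
)

# Strip a trailing ".<digits>" so AT3G26450.1 and AT3G26450.2 share one key
all_data$CleanedAccession <- sub("\\.[0-9]+$", "", all_data$Accession)

# Keep only the first row seen for each cleaned accession
unique_accessions <- all_data[!duplicated(all_data$CleanedAccession), ]
print(unique_accessions)
```

Note that !duplicated() keeps only the first row per accession and discards the rest. If all of the data for an accession has to be kept, as the question asks, it is probably better to keep every row and group or aggregate by CleanedAccession instead.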
