I am trying to read multiple files (more than 20,000) from a directory and store their data in a single data frame. They all share the same format, and the first column is common to all of them (note that the numbers in the first column match the file names, more on that later):

test = read.delim("814630", head = F)
head(test)
      V1   V2
1 814630 0.00
2 839260 1.95
3 841877 2.59
4 825359 4.95
5 834026 5.16
6 825107 6.21

Then I do this to read the files (in the example I just read 5 files):

> temp = list.files()
> length(temp)
[1] 20819
> start_time <- Sys.time()
> data = lapply(temp[1:5], read.delim, head=F)
> end_time <- Sys.time()
> end_time - start_time
Time difference of 0.1406569 secs

If I use mclapply from the parallel package I get a similar time (when I do this for all 20,000 files, it takes 15-20 minutes; any advice on how to improve that time would help too):

> library(parallel)
> numCores <- detectCores()
> cl <- makeCluster(numCores)
> data = mclapply(temp[1:5], read.delim, head=F)
Time difference of 0.1495719 secs

Then I use left_join from the dplyr package (via reduce from purrr) to merge them into a single data frame. This second part takes a short time with data from a few files, but when I try to merge all the data it takes much longer than even reading the files (it can take several hours).

> test = data %>% reduce(left_join,by="V1")
Time difference of 0.05186105 secs

I guess there is some way of making it more efficient, but I do not have much experience optimizing repetitive tasks in R, any help would be much appreciated.

Also, here is what my final data frame looks like after some formatting. Note that the data form a symmetric matrix, so maybe there is a way of reading only half of the data that could speed up the process.

> row.names(test) = test[,1]
> test[,1] = NULL
> colnames(test) = temp[1:5]
> test = test[order(as.numeric(row.names(test))), order(as.numeric(names(test)))]
> 
> head(test)
         814630   814636   814637  814638   814639
814630     0.00   318.41 13293.00 2012.21   391.97
814636   318.41     0.00  1345.84 1377.79  1889.77
814637 13293.00  1345.84     0.00 6477.10 10638.69
814638  2012.21  1377.79  6477.10    0.00  3905.41
814639   391.97  1889.77 10638.69 3905.41     0.00
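On the symmetry point: each file appears to hold one full column, so there is no obvious way to read only half of the input, but once assembled, a symmetric distance matrix can be stored as a dist object, which keeps only the lower triangle and roughly halves the memory. A small sketch with hypothetical values:

```r
# toy stand-in for the assembled table: a 3x3 symmetric distance matrix
ids <- c("814630", "814636", "814637")
m <- matrix(c(0, 318.41, 13293,
              318.41, 0, 1345.84,
              13293, 1345.84, 0),
            nrow = 3, dimnames = list(ids, ids))
stopifnot(isSymmetric(m))

d <- as.dist(m)   # stores the lower triangle only; as.matrix(d) recovers m
```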

1 Answer

A simple approach that assumes the same set of V1 values exists in every table:

library(dplyr)
V1 <- read.delim(temp[1], header = FALSE) %>% arrange(V1) %>% dplyr::select(-V2)
data <- lapply(temp[1:5], function(x) {
    read.delim(x, header = FALSE) %>% arrange(V1) %>% dplyr::select(-V1)
})
test <- cbind(V1, do.call(cbind, data))

This will be much faster than repeated left_join calls, since joining gets slower the more columns/rows you have.
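A minimal in-memory illustration of why this works (toy data frames with hypothetical values standing in for two parsed files): sorting every table by V1 guarantees the rows line up, so binding the value columns with cbind is safe without any joins.

```r
library(dplyr)

# toy stand-ins for two parsed files, with rows in different orders
a <- data.frame(V1 = c(3, 1, 2), V2 = c(30, 10, 20))
b <- data.frame(V1 = c(2, 3, 1), V2 = c(0.2, 0.3, 0.1))

key  <- a %>% arrange(V1) %>% dplyr::select(V1)     # shared first column
cols <- lapply(list(a, b), function(x) x %>% arrange(V1) %>% dplyr::select(-V1))
test <- cbind(key, do.call(cbind, cols))            # rows now aligned by V1
```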


2 Comments

I would also suggest using data.table::fread, which is faster than read.delim.
Thanks! That works perfectly, and using fread also cuts the time in half.
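Putting the answer and the fread suggestion together, a rough sketch of the reading step (the two sample files and the temp directory here are hypothetical stand-ins for the 20,000 real ones). One aside: mclapply forks its workers through the mc.cores argument and never uses a cluster created with makeCluster(), which is only needed for parLapply and friends, so that line in the question can be dropped.

```r
library(data.table)
library(parallel)

# hypothetical sample input: two small tab-delimited files in a temp directory
dir <- file.path(tempdir(), "distfiles")
dir.create(dir, showWarnings = FALSE)
writeLines(c("814636\t318.41", "814630\t0.00"), file.path(dir, "814630"))
writeLines(c("814630\t318.41", "814636\t0.00"), file.path(dir, "814636"))

files <- list.files(dir, full.names = TRUE)

# fread parses each small file much faster than read.delim; mclapply forks
# via mc.cores (use mc.cores = 1 on Windows, where forking is unavailable)
data <- mclapply(files, fread, header = FALSE,
                 mc.cores = max(1L, detectCores() - 1L, na.rm = TRUE))
```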
