I am trying to read multiple files (more than 20,000) from a directory and store their data in a single data frame. They all share the same format, and the first column is common to all of them (note that the numbers in the first column match the file names, more on that later):

test = read.delim("814630", head = F)
head(test)
      V1   V2
1 814630 0.00
2 839260 1.95
3 841877 2.59
4 825359 4.95
5 834026 5.16
6 825107 6.21

Then I do this to read the files (in the example I just read 5 files):

> temp = list.files()
> length(temp)
[1] 20819
> start_time <- Sys.time()
> data = lapply(temp[1:5], read.delim, head=F)
> end_time <- Sys.time()
> end_time - start_time
Time difference of 0.1406569 secs

If I use mclapply from the parallel package I get a similar time (when I do this for all 20,000 files, it takes 15-20 minutes; any advice on how to improve that time would help too):

> library(parallel)
> numCores <- detectCores()
> cl <- makeCluster(numCores)
> data = mclapply(temp[1:5], read.delim, head=F)
Time difference of 0.1495719 secs

Then I use left_join from the dplyr package (via reduce from purrr) to merge them into a single data frame. This second part takes a short time with data from a few files, but when I try to merge all the data it takes much longer than even reading the files (it can take several hours).

> test = data %>% reduce(left_join,by="V1")
Time difference of 0.05186105 secs

I guess there is some way of making it more efficient, but I do not have much experience optimizing repetitive tasks in R, any help would be much appreciated.

Also, here is what my final data frame looks like after some formatting. Note that the data form a symmetric matrix, so maybe there is a way of reading only half of the data that could speed up the process.

> row.names(test) = test[,1]
> test[,1] = NULL
> colnames(test) = temp[1:5]
> test = test[order(as.numeric(row.names(test))), order(as.numeric(names(test)))]
> 
> head(test)
         814630   814636   814637  814638   814639
814630     0.00   318.41 13293.00 2012.21   391.97
814636   318.41     0.00  1345.84 1377.79  1889.77
814637 13293.00  1345.84     0.00 6477.10 10638.69
814638  2012.21  1377.79  6477.10    0.00  3905.41
814639   391.97  1889.77 10638.69 3905.41     0.00
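On the symmetry point: each file appears to hold one full column, so there is no obvious way to read only half of the input, but once assembled, a symmetric distance matrix can be stored as a dist object, which keeps only the lower triangle and roughly halves the memory. A small sketch with hypothetical values:

```r
# toy stand-in for the assembled table: a 3x3 symmetric distance matrix
ids <- c("814630", "814636", "814637")
m <- matrix(c(0, 318.41, 13293,
              318.41, 0, 1345.84,
              13293, 1345.84, 0),
            nrow = 3, dimnames = list(ids, ids))
stopifnot(isSymmetric(m))

d <- as.dist(m)   # stores the lower triangle only; as.matrix(d) recovers m
```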

1 Answer

A simple approach that assumes the same set of V1 values exists in every table:

library(dplyr)
V1 <- read.delim(temp[1], header = FALSE) %>% arrange(V1) %>% dplyr::select(-V2)
data <- lapply(temp[1:5], function(x) {
    read.delim(x, header = FALSE) %>% arrange(V1) %>% dplyr::select(-V1)
})
test <- cbind(V1, do.call(cbind, data))

This will be much faster than repeated left_join calls, since joining gets slower the more columns/rows you have.
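A minimal in-memory illustration of why this works (toy data frames with hypothetical values standing in for two parsed files): sorting every table by V1 guarantees the rows line up, so binding the value columns with cbind is safe without any joins.

```r
library(dplyr)

# toy stand-ins for two parsed files, with rows in different orders
a <- data.frame(V1 = c(3, 1, 2), V2 = c(30, 10, 20))
b <- data.frame(V1 = c(2, 3, 1), V2 = c(0.2, 0.3, 0.1))

key  <- a %>% arrange(V1) %>% dplyr::select(V1)     # shared first column
cols <- lapply(list(a, b), function(x) x %>% arrange(V1) %>% dplyr::select(-V1))
test <- cbind(key, do.call(cbind, cols))            # rows now aligned by V1
```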


2 Comments

I would also suggest using data.table::fread, which is faster than read.delim.
Thanks! That works perfectly, and using fread also cuts the time in half.
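Putting the answer and the fread suggestion together, a rough sketch of the reading step (the two sample files and the temp directory here are hypothetical stand-ins for the 20,000 real ones). One aside: mclapply forks its workers through the mc.cores argument and never uses a cluster created with makeCluster(), which is only needed for parLapply and friends, so that line in the question can be dropped.

```r
library(data.table)
library(parallel)

# hypothetical sample input: two small tab-delimited files in a temp directory
dir <- file.path(tempdir(), "distfiles")
dir.create(dir, showWarnings = FALSE)
writeLines(c("814636\t318.41", "814630\t0.00"), file.path(dir, "814630"))
writeLines(c("814630\t318.41", "814636\t0.00"), file.path(dir, "814636"))

files <- list.files(dir, full.names = TRUE)

# fread parses each small file much faster than read.delim; mclapply forks
# via mc.cores (use mc.cores = 1 on Windows, where forking is unavailable)
data <- mclapply(files, fread, header = FALSE,
                 mc.cores = max(1L, detectCores() - 1L, na.rm = TRUE))
```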
