I am trying to read a large number of files (more than 20,000) from a directory and store their data in a single data frame. All files have the same two-column format and share the same first column (note that the numbers in the first column match the file names; more on that later):
test = read.delim("814630", head = F)
head(test)
      V1   V2
1 814630 0.00
2 839260 1.95
3 841877 2.59
4 825359 4.95
5 834026 5.16
6 825107 6.21
Then I read the files like this (in this example I read only 5 files):
> temp = list.files()
> length(temp)
[1] 20819
> start_time <- Sys.time()
> data = lapply(temp[1:5], read.delim, head=F)
> end_time <- Sys.time()
> end_time - start_time
Time difference of 0.1406569 secs
If I use mclapply from the parallel package, I get a similar time. (When I do this for all 20,000 files, it takes 15-20 minutes; any advice on how to improve this time would help too.)
> library(parallel)
> numCores <- detectCores()
> cl <- makeCluster(numCores)
> data = mclapply(temp[1:5], read.delim, head=F)
Time difference of 0.1495719 secs
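A note on the snippet above: mclapply() forks the R session and takes its worker count from its mc.cores argument (default getOption("mc.cores", 2L)), while a cluster created with makeCluster() is used by functions such as parLapply() instead. The sketch below passes mc.cores explicitly; the demo files and the names demo_dir, files and n_cores are hypothetical and only there to make it runnable. With just a handful of tiny files, the overhead of starting workers probably dominates, which may explain the similar timing.

```r
library(parallel)

# Hypothetical demo input: three tiny tab-separated files in a temp directory.
demo_dir <- tempfile("demo_")
dir.create(demo_dir)
for (id in c("814630", "814636", "814637")) {
  writeLines(paste(id, "0.00", sep = "\t"), file.path(demo_dir, id))
}
files <- list.files(demo_dir, full.names = TRUE)

# Forking is not available on Windows, where mc.cores must be 1.
n_cores <- if (.Platform$OS.type == "windows") 1L else 2L

# Extra arguments (header = FALSE) are passed on to read.delim().
data_fork <- mclapply(files, read.delim, header = FALSE, mc.cores = n_cores)
```

On the real data, n_cores could be set from detectCores(), keeping in mind that reading many small files is often limited by disk access rather than CPU.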
Then I use left_join from the dplyr package (together with reduce from purrr) to merge them into a single data.frame. This second part is fast with data from a few files, but when I try to merge all the data it takes much longer than reading the files (it can take several hours).
> test = data %>% reduce(left_join,by="V1")
Time difference of 0.05186105 secs
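One possible alternative to the chain of joins, sketched under the assumption that every file contains exactly the same set of IDs in V1 and that the file names are those IDs: pre-allocate a numeric matrix once and fill one column per file, aligning the rows with match(). Each left_join in the reduce() call has to copy an ever wider data frame, which is likely why the merge slows down so much as the number of files grows. The demo data and the names ids, d, demo_dir and res are hypothetical.

```r
# Hypothetical demo data: a small symmetric matrix written out as one
# two-column file per ID, each sorted by value like the sample file above.
ids <- c("814630", "814636", "814637")
d <- matrix(c(0,      318.41,  13293,
              318.41, 0,       1345.84,
              13293,  1345.84, 0),
            nrow = 3, dimnames = list(ids, ids))
demo_dir <- tempfile("dist_")
dir.create(demo_dir)
for (id in ids) {
  o <- order(d[, id])
  write.table(data.frame(ids[o], unname(d[o, id])), file.path(demo_dir, id),
              sep = "\t", quote = FALSE, row.names = FALSE, col.names = FALSE)
}

files <- list.files(demo_dir)

# Allocate the full result once instead of growing it join by join.
res <- matrix(NA_real_, nrow = length(files), ncol = length(files),
              dimnames = list(files, files))

for (f in files) {
  x <- read.delim(file.path(demo_dir, f), header = FALSE,
                  colClasses = c("character", "numeric"))
  # match() gives, for each row name of res, its position in this file.
  res[, f] <- x$V2[match(rownames(res), x$V1)]
}
```

Any ID missing from a file would show up as NA in that column, so anyNA(res) is a quick check of the "same IDs everywhere" assumption. For about 20,000 files the matrix holds roughly 4 * 10^8 doubles, on the order of 3 GB of memory.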
I guess there is some way of making this more efficient, but I do not have much experience optimizing repetitive tasks in R. Any help would be much appreciated.
Also, here is what my final data.frame would look like after some formatting. Note that the data form a symmetric matrix, so maybe there is a way of reading only half of the data that could speed up the process.
> row.names(test) = test[,1]
> test[,1] = NULL
> colnames(test) = temp[1:5]
> test = test[order(as.numeric(row.names(test))), order(as.numeric(names(test)))]
>
> head(test)
         814630  814636   814637  814638   814639
814630     0.00  318.41 13293.00 2012.21   391.97
814636   318.41    0.00  1345.84 1377.79  1889.77
814637 13293.00 1345.84     0.00 6477.10 10638.69
814638  2012.21 1377.79  6477.10    0.00  3905.41
814639   391.97 1889.77 10638.69 3905.41     0.00