0

I have a folder that has 5000 csv files, each file belonging to one location and containing daily rainfall from 1980 till 2015. Sample structure of a file is as follows:

sample.file <- data.frame(location.id = rep(1001, times = 365 * 36), 
                      year = rep(1980:2015, each = 365),
                      day = rep(1:365, times = 36),
                      rainfall = sample(1:100, replace = T, 365 * 36))

I want to read one file and calculate for each year, total rainfall and write the output again. There are multiple ways I can do this:

Method 1

for(i in seq_along(names.vec)){

  name <- namees.vec[i]
  dat <- fread(paste0(name,".csv"))

  dat <- dat %>% dplyr::group_by(year) %>% dplyr::summarise(tot.rainfall = sum(rainfall))

 fwrite(dat, paste0(name,".summary.csv"), row.names = F)
}

Method 2:

my.files <- list.files(pattern = "*.csv")
dat <- lapply(my.files, fread)
dat <- rbindlist(dat)
dat.summary <- dat %>% dplyr::group_by(location.id, year) %>% 
               dplyr::summarise(tot.rainfall = sum(rainfall))

Method 3:

I want to achieve this using foreach. How can I parallelise the above task using do parallel and for each function?

6
  • How about method4: fread files, rbind them and keep using data.table for performance (ie, allFilesBinded[, sum(rainfall), .(location.id, year)])? btw, since 1.11.0 fread is parallelized. Commented Sep 13, 2018 at 12:51
  • pbapply package provides easy paralleling Commented Sep 13, 2018 at 12:51
  • as easy as pblapply(my.files, fread, cl = mycl) Commented Sep 13, 2018 at 12:52
  • I can't test without your input, but I would go for: library(data.table); do.call(rbind, lapply(list.files(pattern = "*.csv"), fread))[, sum(rainfall), .(location.id, year)] Commented Sep 13, 2018 at 13:02
  • 1
    Learn more about parallelism with {foreach} with this guide. Commented Sep 13, 2018 at 13:22

2 Answers 2

2

Below is the skeleton for your foreach request.

require(foreach)
require(doSNOW)
cl <- makeCluster(10, # number of cores, don't use all cores your computer have
                  type="SOCK") # SOCK for Windows, FORK for linux
registerDoSNOW(cl)
clusterExport(cl, c("toto", "truc"), envir=environment()) # R object needed for each core
clusterEvalQ(cl, library(tcltk)) # libraries needed for each core
my.files <- list.files(pattern = "*.csv")
foreach(i=icount(my.files), .combine=rbind, inorder=FALSE) %dopar% {
  # read csv file
  # estimate total rain
  # write output
}
stopCluster(cl)

But the parallelization is really better when the computation time (CPU) per independant iteration is higher than the remaining operations. In your case, the improvement can be low because each core will need to have drive access for reading and for writing, and as the writing is a physical operation, it can be better to do it sequentially (safer for the hardware and eventually more efficient to have independant locations in the drive for each file compared to shared location for multiple files, needing indexes and so on to distinguish them for your OS -- the previous need confirmation, it is just a thought).

HTH

Bastien

Sign up to request clarification or add additional context in comments.

2 Comments

The parallelization should only be done over the reading of the files. The model estimation(*) and saving to file should not be part of parallelization. (*) Yes, theoretically you might be able to do it here since its a summation.
To avoid the risk of conveying that foreach() is a for loop, rather than an "apply" function, please consider adding an explicit return value, e.g. can dat.summary <- foreach(...) just like dat <- lapply(my.files, fread).
0

pbapply package is easiest paralleling approach

library (pbapply)

mycl <- makeCluster(4)
mylist <- pblapply(my.files, fread, cl = mycl)

4 Comments

pbapply doesn't do any parallelization. All it does is add a progress bar. You can replace pblapply with parallel::parLapply and it will work exactly the same.
How does this answer the question if it's about adding the progress bar?
cl = mycl enables paralleling. please try or read package reference. also have progress bar
cl : A cluster object created by makeCluster, or an integer to indicate number of child-processes (integer values are ignored on Windows) for parallel evalua- tions.

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.