Using foreach function to parallelise calculation

Question

I have a folder that has 5000 csv files, each file belonging to one location and containing daily rainfall from 1980 till 2015. Sample structure of a file is as follows:

sample.file <- data.frame(location.id = rep(1001, times = 365 * 36), 
                      year = rep(1980:2015, each = 365),
                      day = rep(1:365, times = 36),
                      rainfall = sample(1:100, replace = T, 365 * 36))

I want to read each file, calculate the total rainfall for each year, and write the result back out. There are several ways to do this:

Method 1

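library(data.table)  # fread(), fwrite()
library(dplyr)       # %>%, group_by(), summarise()

# names.vec: character vector of file names without the ".csv" extension
for(i in seq_along(names.vec)){

  name <- names.vec[i]
  dat <- fread(paste0(name,".csv"))

  dat <- dat %>% dplyr::group_by(year) %>% dplyr::summarise(tot.rainfall = sum(rainfall))

  fwrite(dat, paste0(name,".summary.csv"), row.names = FALSE)
}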

Method 2:

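library(data.table)  # fread(), rbindlist()
library(dplyr)

# pattern is a regular expression, not a glob: match names ending in ".csv"
my.files <- list.files(pattern = "\\.csv$")
dat <- lapply(my.files, fread)
dat <- rbindlist(dat)
dat.summary <- dat %>% dplyr::group_by(location.id, year) %>% 
               dplyr::summarise(tot.rainfall = sum(rainfall))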

Method 3:

I want to achieve this using foreach. How can I parallelise the above task using doParallel and the foreach function?

How about method 4: fread the files, rbind them, and keep using data.table for performance (i.e., allFilesBinded[, sum(rainfall), .(location.id, year)])? By the way, fread has been parallelized since 1.11.0.
– pogibas, Commented Sep 13, 2018 at 12:51
I can't test without your input, but I would go for: library(data.table); do.call(rbind, lapply(list.files(pattern = "*.csv"), fread))[, sum(rainfall), .(location.id, year)]
– pogibas, Commented Sep 13, 2018 at 13:02
Learn more about parallelism with {foreach} with this guide.
– F. Privé, Commented Sep 13, 2018 at 13:22
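
The data.table route suggested in the comments above could look roughly like the sketch below. It is a minimal example, not a tested solution for the real 5000 files: the temporary folder, the three small sample files, and their constant rainfall values are assumptions made only so the code runs on its own.

```r
library(data.table)

# Hypothetical sample data: three small files in a temporary folder
in.dir <- file.path(tempdir(), "rain_dt")
dir.create(in.dir, showWarnings = FALSE)
for (id in 1001:1003) {
  fwrite(data.table(location.id = id,
                    year = rep(1980:1981, each = 365),
                    day = rep(1:365, times = 2),
                    rainfall = id - 1000L),   # constant value per location
         file.path(in.dir, paste0(id, ".csv")))
}

my.files <- list.files(in.dir, pattern = "\\.csv$", full.names = TRUE)

# Read every file and stack the results into one table
all.files <- rbindlist(lapply(my.files, fread))

# Yearly total per location via data.table's grouped aggregation
dat.summary <- all.files[, .(tot.rainfall = sum(rainfall)),
                         by = .(location.id, year)]
```

No cluster is set up here; any speed-up comes from fread's own multithreading and from data.table's grouping.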

Bastien · Accepted Answer · 2018-09-13 15:13:23Z

2

Below is the skeleton for your foreach request.

require(foreach)
require(doSNOW)
cl <- makeCluster(10, # number of workers; don't use all the cores your computer has
                  type="SOCK") # SOCK works on Windows and Linux (FORK is only offered by parallel::makeCluster on Unix-alikes)
registerDoSNOW(cl)
clusterExport(cl, c("toto", "truc"), envir=environment()) # placeholder names: R objects needed on each worker
clusterEvalQ(cl, library(data.table)) # libraries needed on each worker
my.files <- list.files(pattern = "\\.csv$")
foreach(i=seq_along(my.files), .combine=rbind, .inorder=FALSE) %dopar% {
  # read csv file my.files[i]
  # estimate total rain
  # write output
}
stopCluster(cl)

Parallelisation pays off mainly when the CPU time per independent iteration is large compared with everything else the iteration does. In your case the gain may be small, because every worker needs drive access for both reading and writing. Since writing is a physical operation, it may be better to do it sequentially: that is safer for the hardware, and it may be more efficient for the OS to write each file to its own location on the drive than to interleave writes to several files, which needs extra indexing to keep them apart. That last point needs confirmation; it is just a thought.
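
As a minimal sketch of how such a skeleton might be filled in, the example below assumes the doParallel backend mentioned in the question and data.table on each worker. The temporary folder, the three small sample files, and the choice of two workers are assumptions made only to keep the example self-contained.

```r
library(foreach)
library(doParallel)
library(data.table)

# Hypothetical sample data: three small files in a temporary folder
in.dir <- file.path(tempdir(), "rain_foreach")
dir.create(in.dir, showWarnings = FALSE)
for (id in 1001:1003) {
  fwrite(data.table(location.id = id,
                    year = rep(1980:1981, each = 365),
                    day = rep(1:365, times = 2),
                    rainfall = 1L),
         file.path(in.dir, paste0(id, ".csv")))
}

my.files <- list.files(in.dir, pattern = "\\.csv$", full.names = TRUE)

cl <- makeCluster(2)     # two workers (assumption); adjust to your machine
registerDoParallel(cl)

# Each iteration reads one file and returns its yearly totals;
# .combine = rbind stacks the per-file results into one table
dat.summary <- foreach(f = my.files, .combine = rbind,
                       .packages = "data.table") %dopar% {
  dat <- fread(f)
  dat[, .(tot.rainfall = sum(rainfall)), by = .(location.id, year)]
}

stopCluster(cl)
```

The workers only return their summaries here. If one output file per location is needed, the results could be written afterwards with fwrite in an ordinary sequential loop, which keeps the disk writes out of the parallel section.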

HTH

Bastien

answered Sep 13, 2018 at 15:13


2 Comments

HenrikB Over a year ago

The parallelization should only be done over the reading of the files. The model estimation(*) and saving to file should not be part of the parallelization. (*) Yes, theoretically you might be able to do it here since it's a summation.

HenrikB Over a year ago

To avoid the risk of conveying that foreach() is a for loop, rather than an "apply" function, please consider adding an explicit return value, e.g. dat.summary <- foreach(...), just like dat <- lapply(my.files, fread).
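
The two comments above could be combined roughly as follows: only the file reading runs in parallel, foreach is used like lapply with an explicit return value, and the summation happens afterwards in the main session. This is a sketch under assumptions: the doParallel backend, the temporary folder, and the small sample files are not part of the original thread.

```r
library(foreach)
library(doParallel)
library(data.table)

# Hypothetical sample data: three small files in a temporary folder
in.dir <- file.path(tempdir(), "rain_read_only")
dir.create(in.dir, showWarnings = FALSE)
for (id in 1001:1003) {
  fwrite(data.table(location.id = id,
                    year = rep(1980:1981, each = 365),
                    day = rep(1:365, times = 2),
                    rainfall = 1L),
         file.path(in.dir, paste0(id, ".csv")))
}

my.files <- list.files(in.dir, pattern = "\\.csv$", full.names = TRUE)

cl <- makeCluster(2)     # two workers (assumption)
registerDoParallel(cl)

# Parallel part: reading only. Without .combine, foreach returns a list,
# just like dat <- lapply(my.files, fread)
dat <- foreach(f = my.files, .packages = "data.table") %dopar% {
  fread(f)
}

stopCluster(cl)

# Sequential part: stack the tables and compute the yearly totals
dat <- rbindlist(dat)
dat.summary <- dat[, .(tot.rainfall = sum(rainfall)),
                   by = .(location.id, year)]
```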

Selcuk Akbas · Accepted Answer · 2018-09-13 12:54:35Z

0

The pbapply package is the easiest way to parallelise this:

library(pbapply)
library(parallel)    # makeCluster(), stopCluster()
library(data.table)  # fread()

mycl <- makeCluster(4)
mylist <- pblapply(my.files, fread, cl = mycl)
stopCluster(mycl)

answered Sep 13, 2018 at 12:54


4 Comments

Hong Ooi Over a year ago

pbapply doesn't do any parallelization. All it does is add a progress bar. You can replace pblapply with parallel::parLapply and it will work exactly the same.

pogibas Over a year ago

How does this answer the question if it's about adding the progress bar?

Selcuk Akbas Over a year ago

cl = mycl enables parallel execution. Please try it or read the package reference. You also get the progress bar.

Selcuk Akbas Over a year ago

cl: A cluster object created by makeCluster, or an integer to indicate number of child-processes (integer values are ignored on Windows) for parallel evaluations.
