
I am currently working on a program to evaluate the out-of-sample performance of several forecasting models on simulated data. For those who are familiar with finance, it works exactly like backtesting a trading strategy, except that I would evaluate forecasts and not transactions.

Some of the objects I currently manipulate using for loops for this type of task are 7-dimensional arrays (the dimensions stand for Monte Carlo replications, data generating processes, forecast horizons, 3 dimensions for model parameter selection, and one dimension for all the periods covered in the out-of-sample analysis). Obviously, it is painfully slow, so parallel computing has become a must for me.

My problem is: how do I keep track of more than 2 dimensions in R? Let me show what I mean using 'for' loops and only 3 dimensions:

x <- array(dim = c(2, 2, 2))
for (i in 1:2) {
  for (j in 1:2) {
    for (k in 1:2) {
      x[i, j, k] <- i + j + k
    }
  }
}

If I use something like 'foreach', I am very annoyed by the fact that, to my knowledge, the available combining options return lists, matrices or vectors -- but not multidimensional arrays of arbitrary dimension. For instance:

library(doParallel)
library(foreach)

# Get the number of cores to use
no_cores <- max(1, detectCores() - 1)

# Make cluster object using no_cores
cl <- makeCluster(no_cores)

# Initialize cluster for parallel computing
registerDoParallel(cl)

x <- foreach(i = 1:2, .combine = rbind) %:%
  foreach(j = 1:2, .combine = cbind) %:%
    foreach(k = 1:2, .combine = c) %dopar% {
      i + j + k
    }

Here, I basically combine results into vectors, then matrices and, finally, pile the matrices up by rows. Another option would be to use lists, or to pile the matrices up by columns, but you can imagine the mess when you have 7 dimensions and millions of iterations to track.
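Another way to sidestep nested combines entirely (a sketch, not from the question itself) is to parallelize over a single flattened index grid and reshape the resulting vector afterwards, since an R array is just a vector with a `dim` attribute. `expand.grid` varies its first column fastest, which matches R's column-major array layout, so a plain `.combine = c` followed by `array()` gives correctly indexed results:

```r
library(doParallel)
library(foreach)

cl <- makeCluster(2)
registerDoParallel(cl)

# Enumerate every (i, j, k) combination; expand.grid varies the
# first column fastest, matching R's column-major array layout.
grid <- expand.grid(i = 1:2, j = 1:2, k = 1:2)

# One flat parallel loop; .combine = c returns a plain vector.
v <- foreach(r = 1:nrow(grid), .combine = c) %dopar% {
  with(grid[r, ], i + j + k)
}
stopCluster(cl)

# Reshape: element order matches expand.grid's, so x[i, j, k] is correct.
x <- array(v, dim = c(2, 2, 2))
x[1, 2, 2]  # 5
```

This generalizes to 7 dimensions with a single `expand.grid` call and one `array()` at the end, at the cost of a possibly large grid object.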

I suppose I could also write my own 'combine' function and get the kind of output I want, but I suspect that I am not the first person to encounter this problem. Either there is a way to do exactly what I want, or someone here can point out a way to think differently about storing my results. It wouldn't be surprising if I were taking an absurdly inefficient path toward solving this problem -- I am an economist, not a data scientist, after all!

Any help would be greatly appreciated. Thanks in advance.

2 Answers


There is one available solution that I finally stumbled upon tonight: I can create an appropriate combining function along the dimension of my choice using the 'abind' function from the 'abind' package:

library(abind)
library(doParallel)
library(foreach)

# Get the number of cores to use
no_cores <- max(1, detectCores() - 1)

# Make cluster object using no_cores
cl <- makeCluster(no_cores)

# Initialize cluster for parallel computing
registerDoParallel(cl)

# Combine 2-d slices into a 3-d array along the third dimension
mbind <- function(...) abind(..., along = 3)

x <- foreach(i = 1:2, .combine = mbind) %:%
  foreach(j = 1:2, .combine = cbind) %:%
    foreach(k = 1:2, .combine = c) %dopar% {
      i + j + k
    }

I would still like to see other ways of doing this, however. There might be many, and I am new to R, but for now this solution works.


1 Comment

This is the package and approach that I use, but you should use the .multicombine=TRUE option along with mbind; otherwise mbind will never be called with more than two arguments, which can significantly hurt your performance. See stackoverflow.com/a/17572065/2109128 for my answer to a similar question.
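A sketch of what the comment suggests: since abind accepts any number of arrays, declaring the combine as multi-argument lets foreach hand mbind many results at once instead of folding them in pairwise:

```r
library(abind)
library(doParallel)
library(foreach)

cl <- makeCluster(2)
registerDoParallel(cl)

mbind <- function(...) abind(..., along = 3)

# .multicombine = TRUE lets foreach call the combine function with
# up to .maxcombine (default 100) results at a time instead of pairwise.
x <- foreach(i = 1:2, .combine = mbind, .multicombine = TRUE) %:%
  foreach(j = 1:2, .combine = cbind, .multicombine = TRUE) %:%
    foreach(k = 1:2, .combine = c) %dopar% {
      i + j + k
    }

stopCluster(cl)
dim(x)  # 2 2 2
```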

Here is what I would do, and what I already use in one of my packages, bigstatsr.

Take only one dimension and split it into no_cores blocks. It should have enough iterations (e.g. 20 for 4 cores). For each block, construct the corresponding part of the array and store it in a temporary file. Then, use the contents of these files to fill the whole array. By doing so, you fill only preallocated objects, which should be faster and easier.

Example:

x.all <- array(dim = c(20, 2, 2))
no_cores <- 3
tmpfile <- tempfile()
range.parts <- bigstatsr:::CutBySize(nrow(x.all), nb = no_cores)

library(foreach)
cl <- parallel::makeCluster(no_cores)
doParallel::registerDoParallel(cl)

foreach(ic = 1:no_cores) %dopar% {

  ind <- bigstatsr:::seq2(range.parts[ic, ])
  x <- array(dim = c(length(ind), 2, 2))

  for (i in seq_along(ind)){
    for (j in 1:2){
      for (k in 1:2){
        x[i,j,k] <- ind[i]+j+k
      }
    }
  }

  saveRDS(x, file = paste0(tmpfile, "_", ic, ".rds"))
}
parallel::stopCluster(cl)

for (ic in 1:no_cores) {
  ind <- bigstatsr:::seq2(range.parts[ic, ])
  x.all[ind, , ] <- readRDS(paste0(tmpfile, "_", ic, ".rds"))
}

print(x.all)

Instead of writing files, you could also return the no_cores parts of the array directly from foreach and combine them with abind along the right dimension.
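For instance (a sketch along the same lines, assuming the abind package, and using base R's split/cut to chunk the first dimension instead of bigstatsr's internal helpers), each worker builds its block and foreach stitches the blocks together along the first dimension:

```r
library(abind)
library(doParallel)
library(foreach)

no_cores <- 3
cl <- makeCluster(no_cores)
registerDoParallel(cl)

# Split the first dimension (20 rows) into no_cores contiguous blocks.
blocks <- split(1:20, cut(1:20, no_cores, labels = FALSE))

x.all <- foreach(ind = blocks,
                 .combine = function(...) abind(..., along = 1),
                 .multicombine = TRUE) %dopar% {
  # Each worker fills only its own part of the array.
  x <- array(dim = c(length(ind), 2, 2))
  for (i in seq_along(ind))
    for (j in 1:2)
      for (k in 1:2)
        x[i, j, k] <- ind[i] + j + k
  x
}

stopCluster(cl)
dim(x.all)  # 20 2 2
```

This avoids the temporary files entirely, at the cost of holding all the parts in memory on the master process while combining.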

1 Comment

It is actually a very good solution. It did take me some time working with R to 'get' what you were trying to do here, but I'll keep this suggestion on the back burner until I have some new fancy work to do.
