0

I would like to run a large loop with the foreach function.This means using the %dopar% operator.

I can't find any questions already answered to this problem exactly. If this is a duplicate though, please point me in the right direction and I'll close this question.

I have been having mixed success. It works for simple examples on my machine, as per the help documentation, however I cannot seem to get good results for my own work. My example is slightly more complicated, so the devil seems to be in the detail, as always! I have also read the 'white paper' provided by the package creators Revolution Analytics (you can get it here). I don't see how best to maybe use the .combine argument to apply results to my global output list. I would like to assign claculated value to one big list as opposed to using cbind or c

My example is pretty convoluted, but if I simplify it any further then any answers might not address my issue.

I will perform a kind of moving-linear model. So fit a model using lm() over 50 obersvations [1:50], predict the 51st observation [51], saving the results to a list. Then I will shift it all one observation further. So a lm over [2:51] and predict the 52nd observation [52]. I will use a total of 100 observations, so I can make a maximum of 50 predictions.

## ============================================ ##
##  Setup the backend for the foreach function  ##
## ============================================ ##

## doMC calls upon cores on demand, uses them and closes them
library(doMC)
registerDoMC(detectCores())     #detectCores() uses all cores

## for Windows users
#library(doParallel) --> for Windows users
#registerDoParallel(detectCores())

## ======================== ##
##  Create some dummy data  ##
## ======================== ##

## three columns, one hundred observations
my_data <- data.table(outcome = runif(100), V1 = 3*runif(100), V2 = sqrt(runif(100)))

## Have a look at the data if you like - using the DT package
library(DT)
datatable(my_data, options = list(pageLength = nrow(my_data)))

## ================================= ##
##  Perform the loop the normal way  ##
## ================================= ##

## Create container (a list of lists) for results
my_results <- sapply(c(paste0("step_", seq(1:50))), function(x) NULL)
step_results <- sapply(c("coefs", "rank", "error"), function(x) NULL)
for(i in 1:length(my_results)){my_results[[i]] <- step_results}

## Use a for loop to stpe through all the 50 'slices'
for(i in 1:50) {        #max. 50 predictions possible

    ## Fit a linear model
    my_fit <- lm("outcome ~ V1 + V2", data = my_data[i:(i+49)])

    ## Predict the next step
    my_pred <- predict(my_fit, newdata = my_data[i+50, .(V1, V2)]) 

    error <- my_data$outcome[i+50] - my_pred    #simply measure the delta to the actual value

    ## Assign some results to the container created earlier
    my_results[[i]][[1]] <- my_fit$coefficients
    my_results[[i]][[2]] <- my_fit$rank
    my_results[[i]][[3]] <- error

}
str(my_results)    ## Keep this container to compare to our next one

## ============================================ ##
##  Perform the loop using foreach and %dopar%  ##
## ============================================ ##

## Create same results object for results as previously for parallel results
par_results <- sapply(c(paste0("step_", seq(1:50))), function(x) NULL)
step_results <- sapply(c("coefs", "rank", "error"), function(x) NULL)
for(i in 1:length(par_results)){par_results[[i]] <- step_results}

my_results_par <- foreach(i = 1:50) %dopar%
    {        #max. 50 predictions possible

        my_fit <- lm("outcome ~ V1 + V2", data = my_data[i:(i+49)])     
        my_pred <- predict(my_fit, newdata = my_data[i+50, .(V1, V2)]) 
        error <- my_data$outcome[i+50] - my_pred 

        ## Assign some results to the container created earlier
        par_results[[i]][[1]] <- my_fit$coefficients
        par_results[[i]][[2]] <- my_fit$rank
        par_results[[i]][[3]] <- error

        Sys.sleep(i/20)    #Allows time to see R processes spawn on your system
        return(par_results)
    }

## We can see straight away that this didn't work as I would like it to
identical(my_results, my_results_par)   #FALSE

## This shows that the output seems good on the surface
class(my_results_par)
length(my_results_par)
## This shows that it doesn't (WARNING: very long)
str(my_results_par)

You can try out the various .combine arguments in the foreach function, for example:

foreach(i = 1:50, .combine = "c") {computation}

or

foreach(i = 1:50, .combine = "cbind") {computation}

these prodice a vector and a matrix respectively, but do not contain all the results that I was trying to save in each loop.

Questions

  1. Does that structure give you a clue as to what is going on?
  2. How might I use .combine argument to create my desired output?
  3. Is what I am trying to do even possible??
  4. Do I need to put the loop with foreach at a different point in the algorithm?

I have read that you can supply a custom function to foreach... might this be the way to do it? I still don't see how I would combine the results.

1 Answer 1

1

Yes, this can easily be done. We can modify your code for the foreach-step to the following, where we export the data.table package to each worker.

my_results_par <- foreach(i = 1:50, .combine = append, .packages = c("data.table")) %dopar%
    {      
        my_fit <- lm("outcome ~ V1 + V2", data = my_data[i:(i+49)])     
        my_pred <- predict(my_fit, newdata = my_data[i+50, .(V1, V2)]) 
        error <- my_data$outcome[i+50] - my_pred 

        par_results <- list(
            coefs = my_fit$coefficients,
            rank = my_fit$rank,
            error = error
        )
        par_results <- list(par_results)
        names(par_results) <- paste0("step_", i)
        return(par_results)
    }
identical(my_results, my_results_par)   
[1] TRUE
Sign up to request clarification or add additional context in comments.

7 Comments

Thanks for your response! I'll try it. I haven't found it anywhere in the documentation, but do you know if is possible to add a longer list of packages to the .packages argument? I have seen the inclusion of .libPaths(), which should provide each worker with exaclty the same environment as the master for calculations, however my problem seemed rather to be the combination/returning of several results from each worker. I do however need to use several non-base functions. I have had several errors regarding foreach not finding the iterator, so wondered about packages, hence my question.
A question to a specific step of your suggested solution: why is the par_results <- list(par_results) step necessary?
yes, you can easily add more packages to the .packages argument, but I don't know the downside of adding many to the argument. I suspect there might be a reason why all the attached libraries are not used as default. I would go with the attached packages from search instead of .libPaths().
The par_results <- list(par_results) is just an extra wrapper to get the structure identical, eg. a list of lists. Without it, you would only get 1 list containing all the elements, coefs, rank, error. Try to outcomment it and run the snip-
Finally got round to implementing, worked like a charm, many thanks! One last question... where does your .combine = append get passed to? Also, how do you know about it? I can't find mention of it anywhere in the documentation!
|

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.