Parallel computing in R, implementing bootstrap

Question

I'm trying to compute model estimators using the BLB (Bag of Little Bootstraps) bootstrap, and I would like to do so in parallel. My code works fine when I don't run it in parallel. When I do run it in parallel, the results I get from each core contain NA values. I don't understand how I can get NA values when the iris data set contains no NA values at all. Here is the code I'm using:

library(doParallel)
library(itertools)

num_of_cores <- detectCores()
cl <- makePSOCKcluster(num_of_cores)
registerDoParallel(cl)

attach(iris)
data <- iris
coeftmp <- data.frame()
system.time(
r <- foreach(dat = isplitRows(data, chunks=num_of_cores),
             .combine = cbind) %dopar% {

                 BLBsize = round(nrow(dat)^0.6)
                 for (i in 1:400){
                         set.seed(i)

                         # sampling B(n) data points from the original data set without replacement
                         sample_BOFN <- dat[sample(nrow(dat), size = BLBsize, replace = FALSE), ]

                         # sampling from the subsample with replacement
                         sample_bootstrap <- sample_BOFN[sample(nrow(sample_BOFN), size = nrow(sample_BOFN), replace = TRUE), ]

                         bootstrapModel <- glm(sample_bootstrap$Petal.Width ~ Petal.Length + Sepal.Length + Sepal.Width, data = sample_bootstrap)
                         coeftmp <- rbind(coeftmp, bootstrapModel$coefficients)

                 }
                 # calculating the estimators of the model with mean
                 colMeans(coeftmp)

         })

Since I don't know how many cores you have, I'm not sure if this question will solve your problem. But it might: stackoverflow.com/questions/33221779
– alexwhitworth, Commented Nov 14, 2015 at 16:33
Also, it's unclear to me why you sample w/o replacement for sample_BOFN if you're bootstrapping. But it also doesn't appear that you're using sample_BOFN, so you may wish to remove this from the (example) code.
– alexwhitworth, Commented Nov 14, 2015 at 16:34
I'm trying to implement the BLB bootstrap, which requires sampling the subsamples w/o replacement, so that's why.
– navri, Commented Nov 14, 2015 at 22:34
Actually, the link did not help, because I have 4 cores and I'm splitting my data set with the iterator into 4 chunks. I would like to train a model on each core with the BLB bootstrap. I don't understand how it's possible that I get NA values. (I'm running the code on a Mac, btw.)
– navri, Commented Nov 15, 2015 at 11:08

alexwhitworth · Accepted Answer · 2015-11-16 18:08:09Z

I think you're going to have to go through a few iterations of the debugger to solve this. But you're getting NAs from this line:

bootstrapModel <- glm(sample_bootstrap$Petal.Width ~ Petal.Length + Sepal.Length + Sepal.Width, data = sample_bootstrap)

I am guessing that you get a singularity from one of your sample_bootstraps, since a singularity would give you an NA coefficient. It's possible that something else is causing the problem, but it's definitely coming from this line of code. You'll need to step through the debugger to isolate it.
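To illustrate the singularity guess in isolation, here is a minimal sketch (not the asker's exact failure case) that fits the same model on a hand-picked sample containing only three distinct rows. The row indices are an arbitrary assumption chosen for illustration. With four coefficients to estimate and only three distinct rows, the design matrix is rank deficient, and glm() reports the aliased coefficient as NA without raising an error.

```r
# Illustrative assumption: a bootstrap draw that happens to contain only
# three distinct rows of iris (one per species), each drawn twice.
toy_sample <- iris[c(1, 1, 51, 51, 101, 101), ]

fit <- glm(Petal.Width ~ Petal.Length + Sepal.Length + Sepal.Width,
           data = toy_sample)

coef(fit)         # at least one coefficient is reported as NA
fit$rank          # rank of the fit, lower than the 4 coefficients requested
anyNA(coef(fit))  # TRUE
```

A small subsample that is then resampled with replacement can easily end up with this few distinct rows, which would be consistent with NAs appearing only when the data is split into several chunks.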

In other words, this is not a complete answer, but it should allow you to solve your own problem.

You can see where the NAs appear by comparing a run on a single chunk with a run on one chunk per core:

r2 <- foreach(dat = isplitRows(data, chunks=1)) %dopar% {

     BLBsize = round(nrow(dat)^0.6)
     for (i in 1:400){
       set.seed(i)

       # sampling B(n) data points from the original data set without replacement
       sample_BOFN <- dat[sample(nrow(dat), size = BLBsize, replace = FALSE), ]

       # sampling from the subsample with replacement
       sample_bootstrap <- sample_BOFN[sample(nrow(sample_BOFN), size = nrow(sample_BOFN), replace = TRUE), ]

       bootstrapModel <- glm(sample_bootstrap$Petal.Width ~ Petal.Length + Sepal.Length + Sepal.Width, data = sample_bootstrap)
       coeftmp <- rbind(coeftmp, bootstrapModel$coefficients)

     }
     #calculating the estimators of the model with mean
     # return a list, not just the colMeans -- for debugging purposes
     return(list(coeftmp= coeftmp, result= colMeans(coeftmp)))

   }

   sum(is.na(r2[[1]][[1]])) # no missing coefficients with 1 core

r <- foreach(dat = isplitRows(data, chunks=num_of_cores)) %dopar% {

     BLBsize = round(nrow(dat)^0.6)
     for (i in 1:400){
       set.seed(i)

       # sampling B(n) data points from the original data set without replacement
       sample_BOFN <- dat[sample(nrow(dat), size = BLBsize, replace = FALSE), ]

       # sampling from the subsample with replacement
       sample_bootstrap <- sample_BOFN[sample(nrow(sample_BOFN), size = nrow(sample_BOFN), replace = TRUE), ]

       bootstrapModel <- glm(sample_bootstrap$Petal.Width ~ Petal.Length + Sepal.Length + Sepal.Width, data = sample_bootstrap)
       coeftmp <- rbind(coeftmp, bootstrapModel$coefficients)

     }
     #calculating the estimators of the model with mean
     # return a list, not just the colMeans -- for debugging purposes
     return(list(coeftmp= coeftmp, result= colMeans(coeftmp)))

   }

 # lots of missing values in your coeftmp results.
 lapply(r, function(l) {sum(is.na(l[[1]]))})
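
One possible way to follow up on this, sketched under some assumptions: with 150 rows split into 4 chunks, each chunk has about 37 or 38 rows, so the subsample size round(nrow(dat)^0.6) drops from 20 (whole data set) to 9, and rank-deficient resamples become more likely. The sketch below counts the rank-deficient replicates per chunk and averages only the complete ones. The helper name blb_chunk, the variable names, and the choice to drop incomplete replicates are illustrative assumptions, not part of the original code, and dropping replicates may not be statistically appropriate for every use. It runs sequentially with lapply() so it can be tried without a cluster; the function body could be called from inside foreach(...) %dopar% in the same way.

```r
# Hypothetical helper (not from the original post): run the BLB-style loop on
# one chunk and record how many replicates were rank deficient.
blb_chunk <- function(dat, n_boot = 400) {
  blb_size <- round(nrow(dat)^0.6)
  coef_names <- c("(Intercept)", "Petal.Length", "Sepal.Length", "Sepal.Width")

  # Preallocate a matrix instead of growing a data frame with rbind()
  coefs <- matrix(NA_real_, nrow = n_boot, ncol = length(coef_names),
                  dimnames = list(NULL, coef_names))

  for (i in seq_len(n_boot)) {
    set.seed(i)
    # Subsample without replacement, then resample it with replacement
    sub  <- dat[sample(nrow(dat), size = blb_size, replace = FALSE), ]
    boot <- sub[sample(nrow(sub), size = nrow(sub), replace = TRUE), ]
    fit  <- glm(Petal.Width ~ Petal.Length + Sepal.Length + Sepal.Width,
                data = boot)
    coefs[i, ] <- coef(fit)[coef_names]
  }

  ok <- complete.cases(coefs)  # replicates where every coefficient was estimated
  list(blb_size  = blb_size,
       n_dropped = sum(!ok),
       result    = colMeans(coefs[ok, , drop = FALSE]))
}

# Contiguous chunks of roughly equal size, approximating
# isplitRows(data, chunks = 4) from the question
chunk_id <- cut(seq_len(nrow(iris)), breaks = 4, labels = FALSE)
res <- lapply(split(iris, chunk_id), blb_chunk)

sapply(res, function(x) x$blb_size)   # subsample size per chunk
sapply(res, function(x) x$n_dropped)  # rank-deficient replicates per chunk
```

If n_dropped is large for a chunk, a larger subsample (fewer chunks, or a larger exponent than 0.6) is probably a better remedy than discarding replicates.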
