
In R, I am parallelising my simulation using the packages foreach, doFuture, and doRNG.
I have two nested foreach loops: the inner loop generates and analyses data for each iteration, and the outer loop repeats that procedure for different scenarios of parameter values. A simple example of my code structure is as follows:

library(foreach)
library(doFuture)
library(doRNG)
library(data.table)

plan(multisession, workers = 2)
registerDoRNG()
set.seed(123)
scenarios <- data.table(x = c(0.1, 0.2, 0.3), scenario = 1:3)

out_dir <- getwd()

results <- foreach(s = 1:3, .combine = rbind) %do% {
  this_x <- scenarios$x[s]
  scen_id <- scenarios$scenario[s]
  
  sim_res <- foreach(i = 1:100, .combine = rbind,
                     .options.future = list(seed = TRUE)) %dofuture% {
    y <- rnorm(1, this_x)
    data.frame(rep = i, y = y)
  }
  out_file <- file.path(out_dir, paste0("scenario_", scen_id, ".csv"))
  fwrite(sim_res, out_file)
  
  summary <- data.frame(scenario = scen_id, mean_y = mean(sim_res$y))
  summary
}

print(results)

While registerDoRNG() ensures reproducible random numbers across parallel sessions, I have two issues I would like help with:

1. The results for each scenario depend on the order in which the scenarios are run. For example, the results for scenarios 1 and 3 when I use foreach(s = 1:3, ...) will differ from those when I use foreach(s = 3:1, ...).

2. I have not figured out how to store the state of the random-number generator at each iteration. This would be helpful if, for example, my R session crashes in the middle of the simulation and I want to resume where I left off, or if I want to investigate the results of a particular iteration.

I thought of setting a different seed for each iteration, as suggested here, but I read here that this approach does not guarantee high-quality random numbers.

    I invite whoever posted a close-vote to discuss how this needs more focus. While the text of the question might suggest there are two questions (that's fine), I suggest picking one of the two: fix #1 to ensure the order of nested random pulls is maintained, or fix #2 to preserve the state of the RNG. Commented Aug 28 at 17:20

1 Answer


For the record, from the second link you posted, the "not good" use of

seeds <- sample.int(n = 10)
y <- parLapply(cl, seeds, function(seed) {
  set.seed(seed)
  runif(n = 5)
})

might be "not good" because in this case it only seeds with integers 1 to 10:

sample.int(n=10)
#  [1]  7  5 10  2  9  8  4  3  1  6

I agree with the assessment that that is not a good approach to controlling random seeds.

In my slightly different approach, we pull from a much larger set of seeds,

sample.int(.Machine$integer.max, size=10)
#  [1]  397286749 1052135137 1011271536  329404477 1906443373 1687197862 2021799661 1642042756 2121729063 1740460742

which supports a much broader space of possible seeds. Since it still relies on integers to seed the random pool, it's certainly "a little less random" than a non-seeded approach, but .Machine$integer.max is very large and covers the vast majority of what one might need.
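Worth noting: sample.int() draws without replacement by default, so seeds pulled in a single call are also guaranteed to be unique, which avoids accidentally reusing a seed across iterations:

set.seed(42)
seeds <- sample.int(.Machine$integer.max, size = 1e5)
anyDuplicated(seeds)
# [1] 0   # sample.int() samples without replacement, so no seed repeats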


I suggest this can be used as a very robust method that also lets you store the seed in your results.

The most detailed resolution would be to set a seed for each inner loop:

outerloop <- 3; innerloop <- 100 # for clarity/reference here
seeds <- replicate(outerloop,
                   sample.int(.Machine$integer.max, size = innerloop),
                   simplify = FALSE)
results <- foreach(s = 1:outerloop, .combine = rbind) %do% {
  # ...
  sim_res <- foreach(i = 1:innerloop, .combine = rbind,
                     .options.future = list(seed = TRUE)) %dofuture% {
           set.seed(seeds[[s]][i])
           # ...
  }
  # ...
}
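Filling in the # ... placeholders with the question's own toy simulation (scenarios table, rnorm() draw, per-scenario summary), a complete runnable version of this sketch might look like the following; the column names mirror the question's code:

library(foreach)
library(doFuture)
library(data.table)

plan(multisession, workers = 2)

scenarios <- data.table(x = c(0.1, 0.2, 0.3), scenario = 1:3)
outerloop <- 3; innerloop <- 100

# pre-draw one seed per (scenario, rep): results no longer depend on run order
set.seed(123)
seeds <- replicate(outerloop,
                   sample.int(.Machine$integer.max, size = innerloop),
                   simplify = FALSE)

results <- foreach(s = 1:outerloop, .combine = rbind) %do% {
  this_x <- scenarios$x[s]
  sim_res <- foreach(i = 1:innerloop, .combine = rbind,
                     .options.future = list(seed = TRUE)) %dofuture% {
    set.seed(seeds[[s]][i])
    # carry the seed along so any single rep can be replayed later
    data.frame(rep = i, y = rnorm(1, this_x), seed = seeds[[s]][i])
  }
  data.frame(scenario = s, mean_y = mean(sim_res$y))
}

Because each rep's stream is pinned by its own pre-drawn seed, running the outer loop as 3:1 instead of 1:3 yields identical per-scenario results.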

I think that's what you're targeting, but in case you only want/need to set it in the outer loop, then it can be a bit simpler:

seeds <- sample.int(.Machine$integer.max, size = outerloop)
results <- foreach(s = 1:outerloop, .combine = rbind) %do% {
  set.seed(seeds[s])
  # ...
}

You can choose how/where to embed the seeds[s] within results.
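For instance (a sketch, with a hypothetical seed column), the seed could simply ride along in each scenario's summary row:

seeds <- sample.int(.Machine$integer.max, size = outerloop)
results <- foreach(s = 1:outerloop, .combine = rbind) %do% {
  set.seed(seeds[s])
  y <- rnorm(100, scenarios$x[s])
  # keep the seed in the output so the whole scenario can be replayed
  data.frame(scenario = s, mean_y = mean(y), seed = seeds[s])
}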


Another approach is to call sample.int(.) inside each simulation and carry its value along in the output; for data frames, this might be as an attribute (perhaps fragile) or as a constant column (inefficient but rugged).

results <- foreach(s = 1:outerloop, .combine = rbind) %do% {
  # ...
  sim_res <- foreach(i = 1:innerloop, .combine = rbind,
                     .options.future = list(seed = TRUE)) %dofuture% {
    seed <- sample.int(.Machine$integer.max, size = 1)
    res <- ...
    ### store seed as an attribute of the frame ...
    attr(res, "seed") <- seed
    ### ... or as a column
    res$seed <- seed
    res
  }
  # ...
}
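The payoff of carrying the seed: to revisit one interesting iteration, restore its seed and rerun just that rep. A sketch using the question's toy rnorm() step (sim_res, this_x, and the seed column are assumed from the code above):

# suppose rep 37 of a scenario looked odd; sim_res$seed[37] holds its seed
replay_rep <- function(stored_seed, x) {
  set.seed(stored_seed)
  rnorm(1, x)   # reproduces the same draw as the original run
}
# replay_rep(sim_res$seed[37], this_x) should equal sim_res$y[37]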

This works in effectively the same way, though I often prefer the pre-pulled seeds for two reasons:

  • I find storing random seeds separately to be slightly safer: if I want to check that a change to my code fixed a bug, I cannot do so unless I control the seeds going in;

  • if you're drawing very many seeds, it can be measurably slower to draw them one at a time; in the benchmark below it's 183x slower. Generally I won't worry about this premature optimization unless (a) the inner calculation is small, and (b) I'm doing well over 1e6 iterations.

bench::mark(
  `sample-1` = replicate(1e5, sample.int(.Machine$integer.max, size=1)), 
  `sample-n` = sample.int(.Machine$integer.max, size=1e5),
  check = FALSE)
# Warning: Some expressions had a GC in every iteration; so filtering is disabled.
# # A tibble: 2 × 13
#   expression      min   median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time result memory               time             gc                
#   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl> <int> <dbl>   <bch:tm> <list> <list>               <list>           <list>            
# 1 sample-1   296.94ms 315.05ms      3.17    3.29MB     20.6     2    13      630ms <NULL> <Rprofmem [597 × 3]> <bench_tm [2]>   <tibble [2 × 3]>  
# 2 sample-n     1.77ms   1.82ms    549.      1.38MB      0     275     0      501ms <NULL> <Rprofmem [2 × 3]>   <bench_tm [275]> <tibble [275 × 3]>

The biggest risk with this approach is rerunning the outer/inner loops without updating the seeds when your goal is to assess run-to-run variability: you'll just get the same variability back.

If your problem is that once in a million sims you want to be able to rerun one single run, and if calculation time is slow (so the cost of a single seed pull is negligible), then sampling inside the inner loop is the quickest to code and mitigates the risk of forgetting to change seeds.


Another option, to just "save the state", is to capture saved <- .Random.seed inside each inner loop and return it with the data, either as an attribute or as list(.Random.seed, results). Clearly that won't work when saving to a .csv file, but if you are flexible about the output format, you can:

  • save an .rds or .rda file that includes both the results and the seed, though this is specific to R;
  • save to a .parquet file with the seed as an attribute, preserved in most cases on reading, though admittedly I don't know how easy it is to access frame attributes when reading in, e.g., Python;
  • save the data to .csv and the saved seed(s) to separate files, perhaps the latter as a per-sim, per-outer-loop, or per-inner-loop .rds file.
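A minimal sketch of the save-and-restore round trip with .Random.seed: the restore uses assign() into the global environment, since that is where .Random.seed lives.

set.seed(1)
saved <- .Random.seed          # snapshot the full RNG state (not just a seed)
x1 <- rnorm(3)

assign(".Random.seed", saved, envir = globalenv())  # restore the state
x2 <- rnorm(3)

identical(x1, x2)
# [1] TRUE

Note that .Random.seed captures the complete generator state, so this also works mid-stream, unlike set.seed(), which can only restart a stream from the beginning.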

6 Comments

For clarity and (I hardly dare suggest this is incomplete) completeness, speaking as someone who uses a global set.seed(42): it may be worth explaining why this is necessary. A global set.seed appears to do the trick but is stuck in the global environment; it does not control results inside the foreach loop's environment, and even if set at the first level of the loop it does not propagate into nested levels... I suppose it goes without saying.
@r2evans: This is a very insightful answer, thank you very much! I have some follow-up questions: 1. Could you clarify your statement some more: "The biggest risk with this is rerunning the outer/inner loops without updating seeds for the purpose of assessing consistent variance"? Do you mean the consistency of the Monte Carlo SE with different starting seeds?
@r2evans: My 2nd question: What about the strategy of setting the same seed for all scenarios (the outer loop)? Because I only pool results per scenario, I guess it does not affect the randomness of the iterations of the inner loop in each scenario? I read from the tutorial paper of Morris (Stat Med. 2019) that setting the same seed for different DGM (scenarios in my case) makes them more comparable. Do you think it is a sensible approach?
(1) "biggest risk", meaning if you want to run with different seeds and forget to update seeds, then you're not going to get anything different. It's minor and with even a small amount of attention, you'll be fine. I suggest against reusing seeds too much, relying on the stochastic nature of your process to give your research what you need. IMHO, the reason to have seeds is not to keep them locked, but to be able to replay when you find something interesting.
(2) "same seed" makes no sense to me, unless all of the other parameters of the simulation are different. In that case ... meh, I don't have a strong opinion since I don't know the rest of the story.
