I am running n instances of the same code in parallel and want each instance to use independent random numbers.

For this purpose, before I start the parallel computations I create a list of random states, like this:

import numpy.random as rand
rand_states = []
for j in range(n):
    rand.seed(rand.randint(2**32 - 1))
    rand_states.append(rand.get_state())

I then pass one element of rand_states to each parallel process, in which I basically do

rand.set_state(rand_state)
data = rand.rand(10,10)

To make things reproducible, I run np.random.seed(0) at the very beginning of everything.

Does this work like I hope it does? Is this the proper way to achieve it?

(I cannot just store the data arrays themselves beforehand, because (i) random numbers are generated in many places inside the parallel processes, (ii) that would introduce unnecessary coupling between the parallel code and the managing non-parallel code, and (iii) in reality I run M slices across N < M processors, and the data for all M slices is too big to store.)
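For concreteness, the pattern described above can be sketched like this. The use of multiprocessing.Pool and the worker count are my assumptions; the question does not say how the parallel processes are created:

```python
import numpy.random as rand
from multiprocessing import Pool


def worker(rand_state):
    # Restore the pre-generated state in this worker's copy of the
    # global NumPy RNG, then draw as usual.
    rand.set_state(rand_state)
    return rand.rand(10, 10)


if __name__ == "__main__":
    n = 4                 # example worker count
    rand.seed(0)          # make the whole run reproducible
    rand_states = []
    for j in range(n):
        rand.seed(rand.randint(2**32 - 1))
        rand_states.append(rand.get_state())
    with Pool(n) as pool:
        results = pool.map(worker, rand_states)
```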

  • Did you try it? Does every instance get an independent set of inputs? Is it reproducible? Commented May 6, 2019 at 18:28
  • @mkrieger1 Most of the time, yes. I am fairly sure I had some bugs, because it sometimes wasn't doing the same thing, but I cannot reproduce that right now. Commented May 6, 2019 at 18:31
  • Why aren't you just using RandomState objects? Commented May 6, 2019 at 18:31
  • @user2357112 Because that would be intrusive: the data is generated by other functions, not all written by me, that don't use RandomState objects. Commented May 6, 2019 at 18:33
  • That's a serious deficiency in those other functions, then. Commented May 6, 2019 at 18:35

1 Answer

numpy.random.set_state sets the state of the global instance of the NumPy generator. However, each parallel process should use its own PRNG instance instead. NumPy 1.17 and later provides the numpy.random.Generator class for this purpose. (In fact, numpy.random.set_state and the other numpy.random.* functions have been legacy functions since NumPy 1.17; NumPy's new RNG system was the result of a proposal to change the RNG policy.)

An excellent way to seed multiple processes is to make use of so-called "counter-based" PRNGs (Salmon et al., "Parallel Random Numbers: As Easy as 1, 2, 3", 2011) and other PRNGs that give each seed its own non-overlapping "stream" of random numbers. An example is the bit generator numpy.random.Philox, newly added in NumPy 1.17.
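As a sketch of that idea (not code from the answer itself; the worker count and the choice of worker index as key are assumptions):

```python
import numpy as np

# Philox is counter-based: distinct `key` values select distinct,
# non-overlapping streams, so each parallel worker can simply use
# its own worker index as the key.
rngs = [np.random.Generator(np.random.Philox(key=k)) for k in range(4)]
samples = [rng.random(3) for rng in rngs]
```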

There are several other strategies for seeding multiple processes, but almost all of them involve having each process use its own PRNG instance rather than sharing a global PRNG instance (as with the legacy numpy.random.* functions such as numpy.random.seed). These strategies are explained in my section "Seeding Multiple Processes", which is not NumPy-specific, and the page "Parallel Random Number Generation" in the NumPy documentation.
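For instance, a minimal sketch of the per-process-instance approach using NumPy 1.17+'s SeedSequence (the worker count and root seed here are example values):

```python
import numpy as np

n = 4                                # example worker count
root = np.random.SeedSequence(0)     # fixed root seed for reproducibility
child_seeds = root.spawn(n)          # independent child seeds, one per worker
rngs = [np.random.default_rng(s) for s in child_seeds]

# Each worker draws from its own Generator instead of the global state.
data = [rng.random((10, 10)) for rng in rngs]
```

Each element of `rngs` would be passed to (or recreated inside) one parallel process.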


Comments

  • And is it better to set_state on rand_state_obj, or to just precompute a random list of seeds to initialize the rand_state_obj's?
  • Each instance should be initialized with a state that is unrelated to the state used by any other instance. In this sense, generating each random state from successive runs of a linear PRNG such as Mersenne Twister (which is what NumPy uses), or from the same PRNG the parallel processes use, is not ideal because of the risk of correlated random numbers. A better choice may be to generate each state with a hash function, where each parallel process is assigned its own identifier and each state is the hash of a fixed seed plus that process's identifier.
  • And is there a canonical solution for this? Sorry for sounding lazy here, but I feel like whatever I cook up by myself might end up being flawed in some subtle way.
  • I can only give a general guideline here, since you didn't specify how each parallel process is created in your application. In general, assign each parallel process a unique identifier and pass the fixed seed to each process. Then, within the parallel process, generate the random state by hashing the unique identifier and the fixed seed with a hash function, and use set_state to set that process's RandomState object to that state.
  • Also, ensuring repeatable "randomness" with parallel processes is not necessarily trivial, especially because changing how the parallel processes interact may change the random results generated; for more information, see P. L'Ecuyer, D. Munger, et al., "Random Numbers for Parallel Computers: Requirements and Methods, With Emphasis on GPUs", April 17, 2015.
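One way to sketch the hash-based guideline from the comments above (SHA-256 and the seed-derivation details are my assumptions, and `rng_for_process` is an illustrative name, not a canonical recipe):

```python
import hashlib

import numpy as np


def rng_for_process(fixed_seed, proc_id):
    # Hash the fixed seed together with the process identifier, then use
    # the first 4 bytes of the digest as a 32-bit seed for a per-process
    # RandomState. (Illustrative helper; not from the answer.)
    digest = hashlib.sha256(f"{fixed_seed}:{proc_id}".encode()).digest()
    seed = int.from_bytes(digest[:4], "little")
    return np.random.RandomState(seed)
```

Each process then calls `rng_for_process(fixed_seed, its_id)` once and draws all of its random numbers from the returned object.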