I am running n instances of the same code in parallel and want each instance to use independent random numbers.

For this purpose, before I start the parallel computations I create a list of random states, like this:

import numpy.random as rand
rand_states = []
for j in range(n):
    rand.seed(rand.randint(2**32 - 1))
    rand_states.append(rand.get_state())

I then pass one element of rand_states to each parallel process, in which I basically do

rand.set_state(rand_state)
data = rand.rand(10,10)

To make things reproducible, I run np.random.seed(0) at the very beginning of everything.

Does this work like I hope it does? Is this the proper way to achieve it?

(I cannot just store the data arrays themselves beforehand, because (i) random numbers are generated in many places inside the parallel processes, (ii) that would introduce unnecessary coupling between the parallel code and the managing non-parallel code, and (iii) in reality I run M slices across N < M processors, and the data for all M slices is too big to store.)
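For concreteness, the pattern described above can be sketched like this. The use of multiprocessing.Pool and the worker count are my assumptions; the question does not say how the parallel processes are created:

```python
import numpy.random as rand
from multiprocessing import Pool


def worker(rand_state):
    # Restore the pre-generated state in this worker's copy of the
    # global NumPy RNG, then draw as usual.
    rand.set_state(rand_state)
    return rand.rand(10, 10)


if __name__ == "__main__":
    n = 4                 # example worker count
    rand.seed(0)          # make the whole run reproducible
    rand_states = []
    for j in range(n):
        rand.seed(rand.randint(2**32 - 1))
        rand_states.append(rand.get_state())
    with Pool(n) as pool:
        results = pool.map(worker, rand_states)
```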

  • Did you try it? Does every instance get an independent set of inputs? Is it reproducible? Commented May 6, 2019 at 18:28
  • @mkrieger1 Most of the time, yes. I am fairly sure I had some bugs, because it sometimes wasn't doing the same thing, but I cannot reproduce that right now. Commented May 6, 2019 at 18:31
  • Why aren't you just using RandomState objects? Commented May 6, 2019 at 18:31
  • @user2357112 Because that would be intrusive: the data is generated by other functions, not all written by me, that don't use RandomState objects. Commented May 6, 2019 at 18:33
  • That's a serious deficiency in those other functions, then. Commented May 6, 2019 at 18:35

1 Answer

numpy.random.set_state sets the state of the global instance of the NumPy generator. However, each parallel process should use its own PRNG instance instead. NumPy 1.17 and later provides the numpy.random.Generator class for this purpose. (In fact, numpy.random.set_state and the other numpy.random.* functions have been legacy functions since NumPy 1.17; NumPy's new RNG system was the result of a proposal to change the RNG policy.)

An excellent way to seed multiple processes is to make use of so-called "counter-based" PRNGs (Salmon et al., "Parallel Random Numbers: As Easy as 1, 2, 3", 2011) and other PRNGs that give each seed its own non-overlapping "stream" of random numbers. An example is the bit generator numpy.random.Philox, newly added in NumPy 1.17.
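As a sketch of that idea (not code from the answer itself; the worker count and the choice of worker index as key are assumptions):

```python
import numpy as np

# Philox is counter-based: distinct `key` values select distinct,
# non-overlapping streams, so each parallel worker can simply use
# its own worker index as the key.
rngs = [np.random.Generator(np.random.Philox(key=k)) for k in range(4)]
samples = [rng.random(3) for rng in rngs]
```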

There are several other strategies for seeding multiple processes, but almost all of them involve having each process use its own PRNG instance rather than sharing a global PRNG instance (as with the legacy numpy.random.* functions such as numpy.random.seed). These strategies are explained in my section "Seeding Multiple Processes", which is not NumPy-specific, and the page "Parallel Random Number Generation" in the NumPy documentation.
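For instance, a minimal sketch of the per-process-instance approach using NumPy 1.17+'s SeedSequence (the worker count and root seed here are example values):

```python
import numpy as np

n = 4                                # example worker count
root = np.random.SeedSequence(0)     # fixed root seed for reproducibility
child_seeds = root.spawn(n)          # independent child seeds, one per worker
rngs = [np.random.default_rng(s) for s in child_seeds]

# Each worker draws from its own Generator instead of the global state.
data = [rng.random((10, 10)) for rng in rngs]
```

Each element of `rngs` would be passed to (or recreated inside) one parallel process.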


Comments

  • And is it better to set_state on rand_state_obj, or to just precompute a random list of seeds to initialize the rand_state_obj's?
  • Each instance should be initialized with a state that is unrelated to the state used by any other instance. In this sense, generating each random state from successive runs of a linear PRNG such as Mersenne Twister (which is what NumPy uses), or from the same PRNG the parallel processes use, is not ideal because of the risk of correlated random numbers. A better choice may be to generate each state with a hash function, where each parallel process is assigned its own identifier and each state is the hash of a fixed seed plus that process's identifier.
  • And is there a canonical solution for this? Sorry for sounding lazy here, but I feel like whatever I cook up by myself might end up being flawed in some subtle way.
  • I can only give a general guideline here, since you didn't specify how each parallel process is created in your application. In general, assign each parallel process a unique identifier and pass the fixed seed to each process. Then, within the parallel process, generate the random state by hashing the unique identifier and the fixed seed with a hash function, and use set_state to set that process's RandomState object to that state.
  • Also, ensuring repeatable "randomness" with parallel processes is not necessarily trivial, especially because changing how the parallel processes interact may change the random results generated; for more information, see P. L'Ecuyer, D. Munger, et al., "Random Numbers for Parallel Computers: Requirements and Methods, With Emphasis on GPUs", April 17, 2015.
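One way to sketch the hash-based guideline from the comments above (SHA-256 and the seed-derivation details are my assumptions, and `rng_for_process` is an illustrative name, not a canonical recipe):

```python
import hashlib

import numpy as np


def rng_for_process(fixed_seed, proc_id):
    # Hash the fixed seed together with the process identifier, then use
    # the first 4 bytes of the digest as a 32-bit seed for a per-process
    # RandomState. (Illustrative helper; not from the answer.)
    digest = hashlib.sha256(f"{fixed_seed}:{proc_id}".encode()).digest()
    seed = int.from_bytes(digest[:4], "little")
    return np.random.RandomState(seed)
```

Each process then calls `rng_for_process(fixed_seed, its_id)` once and draws all of its random numbers from the returned object.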