
I set the NumPy random seed at the beginning of my program. During execution I run a function multiple times using multiprocessing.Process, and the function uses NumPy's random functions to draw random numbers. The problem is that each Process gets a copy of the current environment, so every child runs independently and they all start with the same random seed as the parent.

So my question is: how can I share the NumPy random state of the parent process with the child processes? Note that I need to use Process for my work, and I need a separate class that does import numpy on its own. I tried using multiprocessing.Manager to share the random state, but things do not work as expected and I always get the same results. It also does not matter whether I move the for loop inside drawNumpySamples or leave it in main.py; I still cannot get different numbers and the random state is always the same. Here's a simplified version of my code:

# randomClass.py
import numpy as np

class myClass:
    def __init__(self, randomSt):
        print('setup the object')
        np.random.set_state(randomSt)
    def drawNumpySamples(self, idx):
        return np.random.uniform()

And in the main file:

    # main.py
    import numpy as np
    from multiprocessing import Process, Manager
    from randomClass import myClass

    np.random.seed(1) # set random seed
    mng = Manager()
    randomState = mng.list(np.random.get_state())
    myC = myClass(randomSt = randomState)

    for i in range(10):
        myC.drawNumpySamples(i) # this will always return the same results

Note: I use Python 3.5. I also posted an issue on Numpy's GitHub page. Just sending the issue link here for future reference.

  • Do you actually need them all to share a single state, or do you just need distinct random numbers instead of repeating the same ones? Because I think (I’ll have to check when I’m in front of a computer) it would be a lot easier to just seed each process independently during its setup if the latter is acceptable. Commented Mar 20, 2018 at 2:38
  • @abarnert I need them to share the same random state because I need to eventually be able to replicate my results when I release my research code to public. Commented Mar 20, 2018 at 2:40
  • If you need to be repeatable between runs but non-repeating between processes within a run, you could just use the parent’s RNG to generate a seed to pass to each child to use to seed its RNG at startup. (In fact, I think this would make it easier to have repeatable runs, not harder, because non-deterministic timing of children pulling from the same RNG is no longer an issue.) Commented Mar 20, 2018 at 2:40
  • @abarnert Actually that's a good idea. I'll try it tomorrow. Commented Mar 20, 2018 at 2:45
  • One thing you need to check (assuming it works in the first place) is how much random state you need to pass to each child to get enough entropy. I don’t know that a 32-bit int is enough, but enough bits for a complete random state is probably overkill. But I need to think (or sleep) on this. Commented Mar 20, 2018 at 2:48

4 Answers


Even if you manage to get this working, I don’t think it will do what you want. As soon as you have multiple processes pulling from the same random state in parallel, it’s no longer deterministic which order they each get to the state, meaning your runs won’t actually be repeatable. There are probably ways around that, but it seems like a nontrivial problem.

Meanwhile, there is a solution that should solve both the problem you want and the nondeterminism problem:

Before spawning a child process, ask the RNG for a random number, and pass it to the child. The child can then seed with that number. Each child will then have a different random sequence from other children, but the same random sequence that the same child got if you rerun the entire app with a fixed seed.

If your main process does any other RNG work that could depend non-deterministically on the execution of the children, you'll need to pre-generate the seeds for all of your child processes, in order, before pulling any other random numbers.


As senderle pointed out in a comment: If you don't need multiple distinct runs, but just one fixed run, you don't even really need to pull a seed from your seeded RNG; just use a counter starting at 1 and increment it for each new process, and use that as a seed. I don't know if that's acceptable, but if it is, it's hard to get simpler than that.

As Amir pointed out in a comment: a better way is to draw a random integer every time you spawn a new process and pass that integer to the new process to seed NumPy's RNG with it. This integer can indeed come from np.random.randint().


7 Comments

Great answer! Rather than using random numbers as seeds, I'd suggest a deterministic hash based on the process number. (This will produce different results depending on the number of processes, but then, so will the above!)
@senderle But the PID is not going to be repeatable across runs; random numbers pulled from a seeded RNG are.
Sorry, I didn't mean the PID -- I meant the number of the process in order of creation. Pretty sure there's a way to get that from within the child processes without having to pass anything...
@senderle Ah, that makes sense. Sure, just increment a counter for each new Process statement, or enumerate the iterable of args you pass to Pool or ProcessPoolExecutor, etc. That should work, and it's nice and simple. The only problem with that is that you get repeatable random numbers whether you want them or not—child #1 always gets seed 1 no matter how you seed the RNG, and so on. So there's no way to do multiple, distinct repeatable runs, just one specific one. Which may or may not be a problem (I haven't published scientific papers using numpy, so I don't know…)
@abarnert What if I fix the random seed at the beginning of the execution of the program and draw a random integer every time I spawn a new process and pass that random integer to the new process to set numpy's random seed with that integer? But even this way, as far as I remember, the problem is not going to get resolved fully. Because once you set a random seed in the spawned process, the random state is still going to stay fixed. So drawing new numbers would not change the random state, which is very weird.

You need to update the state of the Manager each time you get a random number:

import numpy as np
from multiprocessing import Manager, Pool, Lock

np.random.seed(1)  # seed before capturing the state below

lock = Lock()
mng = Manager()
state = mng.list(np.random.get_state())

def get_random(_):
    with lock:
        # set_state expects a tuple, not a ListProxy
        np.random.set_state(tuple(state))
        result = np.random.uniform()
        state[:] = np.random.get_state()
        return result

result1 = Pool(10).map(get_random, range(10))

# Compare with non-parallel version
np.random.seed(1)
result2 = [np.random.uniform() for _ in range(10)]

# result of Pool.map may be in different order
assert sorted(result1) == sorted(result2)

3 Comments

Isn't it possible to use Process here anymore? For some reason I have to use Process in my work.
I am doing what you are suggesting here in my code. Things do not work. Could that be either because I'm using Process or because I'm using this in a class and import numpy separately in that class?
I don’t think it’s a Process vs. Pool issue (or even a bug in your translation of his code to Process, since he’s not using setup on the Pool or anything). Probably something simple. But as I said in my answer, I don’t think it’s a good idea anyway, because you need deterministic RNGs in processes, which means you can’t share a state unless you do something to sequence their use of the RNG deterministically (which still hasn’t come to me, but it probably involves waiting on barriers or passing a sync token through the children in order or something?).

Fortunately, according to the documentation, you can access the complete state of the numpy random number generator using get_state and set it again using set_state. The generator itself uses the Mersenne Twister algorithm (see the RandomState part of the documentation).

This means you can do anything you want, though whether it will be good and efficient is a different question entirely. As abarnert points out, no matter how you share the parent's state—this could use Alex Hall's method, which looks correct—your sequencing within each child will depend on the order in which each child draws random numbers from the MT state machine.

It would perhaps be better to build a large pool of pseudo-random numbers for each child, saving the start state of the entire generator once at the start. Then each child can draw a PRNG value until its particular pool runs out, after which you have the child coordinate with the parent for the next pool. The parent enumerates which children got which "pool'th" number. The code would look something like this (note that it would make sense to turn this into an infinite generator with a next method):

class PrngPool(object):
    def __init__(self, child_id, shared_state):
        self._child_id = child_id
        self._shared_state = shared_state
        self._numbers = []

    def next_number(self):
        if not self._numbers:
            self._refill()
        return self._numbers.pop(0)  # XXX inefficient

    def _refill(self):
        # ... something like Alex Hall's lock/gen/unlock,
        # but fill up self._numbers with the next 1000 (or
        # however many) numbers after adding our ID and
        # the index "n" of which n-through-n+999 numbers
        # we took here.  Any other child also doing a
        # _refill will wait for the lock and get an updated
        # index n -- e.g., if we got numbers 3000 to 3999,
        # the next child will get numbers 4000 to 4999.
        pass

This way there is not nearly as much communication through Manager items (MT state and our ID-and-index added to the "used" list). At the end of the process, it's possible to see which children used which PRNG values, and to re-generate those PRNG values if needed (remember to record the full MT internal start state!).
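A rough, runnable version of that sketch, using a Manager dict for the shared MT state plus a Manager list as the who-took-which log (the names and batch size here are hypothetical; the locking follows Alex Hall's pattern):

```python
import numpy as np
from multiprocessing import Manager, Lock

BATCH = 1000  # pool size per refill; tune to taste

manager = Manager()
lock = Lock()
np.random.seed(1)  # remember the full start state for later re-runs
shared = manager.dict(state=np.random.get_state(), index=0)
log = manager.list()  # (child_id, start_index) pairs

class PrngPool(object):
    def __init__(self, child_id, shared, log, lock):
        self._child_id = child_id
        self._shared = shared
        self._log = log
        self._lock = lock
        self._numbers = []

    def next_number(self):
        if not self._numbers:
            self._refill()
        return self._numbers.pop(0)  # XXX inefficient, but preserves order

    def _refill(self):
        with self._lock:
            # Restore the shared MT state, take the next batch, then
            # write the advanced state back and log which batch we took.
            np.random.set_state(self._shared['state'])
            self._numbers = list(np.random.uniform(size=BATCH))
            self._shared['state'] = np.random.get_state()
            self._log.append((self._child_id, self._shared['index']))
            self._shared['index'] += BATCH
```

Each child constructs its own PrngPool(child_id, shared, log, lock); at the end of the run, log records exactly which BATCH-sized slices of the seed-1 stream each child consumed, which is the "which pages of the book" bookkeeping described below.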

Edit to add: The way to think about this is like this: the MT is not actually random. It is periodic with a very long period. When you use any such RNG, your seed is simply a starting point within the period. To get repeatability you must use non-random numbers, such as a set from a book. There is a (virtual) book with every number that comes out of the MT generator. We're going to write down which page(s) of this book we used for each group of computations, so that we can re-open the book to those pages later and re-do the same computations.

10 Comments

But you don’t need to pre-pull random numbers for each child. If you just pre-pull a seed for each one, and then let them seed their own copy of the RNG with that, it will have the same effect, and much simpler. Unless the processes are interacting with each other nondeterministically in a way that will cause their pattern of RNG calls to be different in different runs, that should be sufficient.
Plus, your way doesn’t actually guarantee repeatability. If different children exhaust their initial pools in a nondeterministic order, they’ll get different random numbers beyond that initial pool in different runs.
@abarnert: the repeatability occurs by deliberately re-computing the n'th random numbers or internal states: i.e., if you want to repeat, you either record or reset. It's otherwise just a technique for doing what both of you suggested and reducing initial (but not repeat) overhead. Note that this means that you MUST remember: "group A got random numbers starting at 0, 1000, 9000, and 12000; group B got random numbers starting at 2000, 4000, ...; group C got random numbers starting at 3000, 5000, ..." so that you can, when repeating, assign those sets to each group computation.
But how do you remember that? I mean, you can print it to logs and read it off manually, but then how do you ensure that on the repeat, each process gets the same batches even if they run at different relative speeds this time? Also, how does this reduce initial overhead compared to my solution? Surely grabbing 1000 random numbers at start is more expensive than 1, and then passing every random number through IPC (even if nicely batched) is more expensive than just passing 1 at startup and using the local RNG?
The MT state is (depending on implementation, I'm using their guide for theirs) 2508 bytes of data, so that fits pretty easily in one page (assuming 4k or larger pages) and hence is pretty straightforward to copy. At that same time, we also copy (or pickle or whatever) who got which. Grabbing N random numbers is to amortize all this overhead; that part is done locally, without any state-sharing; it's only the after-state that gets shared. It's true that repeating this later is very expensive: computationally, or waiting (enforce same sequencing).

You can use np.random.SeedSequence. See https://numpy.org/doc/stable/reference/random/parallel.html:

from numpy.random import SeedSequence, default_rng

ss = SeedSequence(12345)

# Spawn off 10 child SeedSequences to pass to child processes.
child_seeds = ss.spawn(10)
streams = [default_rng(s) for s in child_seeds]

This way, each of your threads/processes will get a statistically independent random generator.

