
My Python program has multiple threads accessing the same Numpy array instance. Since Numpy releases the GIL, these threads can end up accessing the array simultaneously.

If the threads are concurrently accessing the same array element, this can clearly cause a race condition where the result depends on the specific order in which the threads happen to execute. However, in languages such as C++, concurrent conflicting memory access by multiple threads may cause a data race that results in completely undefined behaviour.

I would like to understand what semantics are guaranteed by Numpy in case of concurrent array access. Are there rules I must follow to ensure that my program has sequential consistency? What happens if I break those rules?

  1. If the threads simultaneously access the same array but never simultaneously access the same array element, is there any guarantee that this can not cause a data race?
  2. If one thread writes an array element that another thread is simultaneously reading, can this cause the write action to fail or the written data to become corrupted?
  3. Is there any guarantee that the consequences of concurrent conflicting array access will be limited to the contents of the array, or can it also lead to undefined behaviour in other parts of the program or maybe crash the Python interpreter?
  4. Do the answers to these questions depend on the underlying machine architecture, such as x86 vs arm?

I really hope to understand what the precise rules are in these cases.

I found a similar question, but the answer only confirms that the threads can cause conflicting access. No explanation of the semantics of Numpy in such cases: Is python numpy array operation += thread safe?
Another similar question without answers: Are ndarray assingments to different indexes threadsafe?

# Example of a program that performs simultaneous array access.

import threading
import numpy as np

a = np.zeros(100000, dtype=np.int16)

def countup():
    for i in range(10000):
        a[:] += 1

def countdown():
    for i in range(10000):
        a[:] -= 1

t1 = threading.Thread(target=countup)
t2 = threading.Thread(target=countdown)
t1.start()
t2.start()
t1.join()
t2.join()

# Some elements of the array will be non-zero.
print(np.amin(a), np.amax(a), np.sum(a != 0))
  • "benign race conditions" -- corrupt data/incorrect results are not what I'd call benign. Possible even worse than an outright crash, since then you've caught it right there and then and at least have a decent chance at debugging it. Commented Mar 26, 2024 at 18:31
  • @DanMašek -- I agree my mention of "benign" race conditions is unclear. I rephrased that part of the question. Commented Mar 26, 2024 at 19:33

1 Answer


Yes, race conditions can occur on Numpy arrays when the target Numpy function releases the GIL and multiple threads access the same array with at least one of them writing to it. Note that what matters is access to the internal Numpy data buffer, which can be shared by multiple array views. Besides, AFAIK most Numpy functions release the GIL.

If the threads simultaneously access the same array but never simultaneously access the same array element, is there any guarantee that this can not cause a data race?

As long as there are synchronization mechanisms (e.g. locks or atomics, or more specifically memory barriers) enforcing that, this is fine: multiple threads can access different parts of the internal buffer. The cache coherence protocol is responsible for keeping cache lines coherent between the L1 caches of the different cores, so coherence between the software threads is guaranteed.
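As an illustration (a minimal sketch of this case, not an official Numpy guarantee), two threads can write to disjoint slices of the same array; the start()/join() calls provide the synchronization points:

import threading
import numpy as np

a = np.zeros(1_000_000, dtype=np.int64)

def fill(view, value):
    # Each thread writes only to its own, non-overlapping slice of `a`
    # (ndarray += is an in-place update of the underlying buffer).
    for _ in range(100):
        view += value

t1 = threading.Thread(target=fill, args=(a[:500_000], 1))  # first half
t2 = threading.Thread(target=fill, args=(a[500_000:], 2))  # second half
t1.start(); t2.start()
t1.join(); t2.join()  # join() acts as the synchronization point

print(a[0], a[-1])  # expected: 100 200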

If one thread writes an array element that another thread is simultaneously reading, can this cause the write action to fail or the written data to become corrupted?

Technically, in C and C++ a data race is undefined behaviour, and Numpy inherits that behaviour because it is mostly written in C.

Indeed, when there is no synchronization mechanism, a thread on a core may operate on stale data that has since been invalidated, and a race condition can occur because of that. This often happens because threads hold items temporarily in (SIMD) registers while the corresponding cache line is invalidated in the meantime. Read-modify-write x86 instructions are not atomic by default unless a lock prefix is explicitly used. Numpy never uses atomic instructions for basic array operations because they are generally (far) slower (and they would not solve every kind of race condition anyway).
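For instance, here is a minimal sketch (my own, not taken from the Numpy docs) of how the countup/countdown example from the question becomes deterministic once a lock serializes each read-modify-write update:

import threading
import numpy as np

a = np.zeros(100000, dtype=np.int16)
lock = threading.Lock()  # serializes the non-atomic read-modify-write updates

def countup():
    for i in range(10000):
        with lock:
            a[:] += 1

def countdown():
    for i in range(10000):
        with lock:
            a[:] -= 1

t1 = threading.Thread(target=countup)
t2 = threading.Thread(target=countdown)
t1.start(); t2.start()
t1.join(); t2.join()

print(np.amin(a), np.amax(a), np.sum(a != 0))  # now always: 0 0 0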

AFAIK on x86, writes never fail, but threads can still operate on corrupted data (and then indirectly write corrupted data). Indeed, unaligned writes, for example, are not guaranteed to be atomic, so a thread can read a partially updated item. This happens for Numpy arrays containing strings, for example (and possibly arrays with a complex dtype, which may not be aligned internally). If you play with Numpy low-level views (ndarray.view), then I think you can get arrays whose items are not aligned. On other platforms, storing an item is not guaranteed to be atomic (e.g. a CPU can issue multiple memory requests for a single item). You should really not rely on such behaviour, for the sake of portability, especially in a Python program using Numpy.

Note that when multiple threads access distinct items of the same cache line with at least one of them writing, the cache coherence protocol ensures the accesses stay coherent, but this mechanism is particularly expensive (the internal low-level synchronization between cores significantly increases the latency of memory operations). This effect is called false sharing.
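For example (a sketch of the partitioning idea only, assuming a typical 64-byte cache line; timing this from Python is not meaningful because of per-call overheads): a contiguous block split keeps each thread on its own cache lines except around the boundary, whereas an interleaved split makes neighbouring elements written by different threads land in the same cache lines:

import numpy as np

a = np.zeros(1_000_000, dtype=np.int64)

# Block partitioning: each thread gets a contiguous half, so the two threads
# share at most the single cache line around the boundary.
block_views = [a[:500_000], a[500_000:]]

# Interleaved partitioning: even/odd elements, so items written by different
# threads constantly sit in the same cache line -> false sharing, even though
# no individual element is ever shared.
interleaved_views = [a[0::2], a[1::2]]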

Is there any guarantee that the consequences of conflicting array access will be limited to the contents of the array, or can it also lead to undefined behaviour in other parts of the program or maybe crash the Python interpreter?

Yes, as long as threads do not concurrently work on views sharing the same internal Numpy data buffer (or a part of it). The structures of the CPython interpreter are protected by the GIL. AFAIK, Numpy releases the GIL only when low-level C processing is performed and that processing does not access interpreter structures. When it does access them, it must not release the GIL; otherwise, it would be a bug.

Do the answers to these questions depend on the underlying machine architecture, such as x86 vs arm?

Overall, the observed effects can differ, but the presence of a race condition (as specified before, or as specified by the C/C++ languages) is independent of the architecture. Consequently, a program with a race condition can behave correctly on x86 but incorrectly on ARM, for example (data corruption or a crash). One reason is the atomicity of reads and stores. Another is that x86 has a stronger memory-ordering model than ARM (and most other architectures). See this article for more information.


6 Comments

Thanks for your answer! It confirms some things I suspected. One concern I have is that you seem to be reasoning from the current implementation of Numpy and the way it might interact with various low-level aspects of the machine. What I would hope to know instead is which guarantees the Numpy library wants to provide intentionally (not just by coincidence). I'd expect this must be documented somewhere, especially if a mistake can cause undefined behaviour.
AFAIK, it is not well documented and I agree it should be. That being said, the behaviour is rather intuitive IMHO because this definition of a race condition applies in most languages. Note that architecture details are never specified in languages or APIs (except low-level architecture-dependent ones). Which part of the answer do you think is dependent on the current implementation of Numpy?
To support the statement that separate-element access is safe, you provide an argument about a certain type of cache coherence protocol. I think this argument rests on an implicit assumption that Numpy will never access memory outside the selected index range. But for all I know, Numpy may "optimize" its access pattern, causing it to read or write outside the selected range in a way that does not affect a single-threaded program. This is one reason why low-level arguments about concurrency are unsound in C++. Instead, the high-level API should specify the rules and semantics.
"implicit assumption that Numpy will never access memory outside the selected index range" Yes, but when you provide a view to Numpy, it should only write in the target items. This is AFAIK always the case but I agree that it should be documented (and not yet done). If you want to be very safe, you can read array in parallel but not write in the same array in parallel. That being said, I do not expect this behaviour to change in Numpy any time soon.
Note that multithreading with Numpy is not very common because it is not very efficient. Numpy functions tend to put a lot of pressure on memory (more than needed), so a parallel implementation is often memory-bound. Thus, parallel code using Numpy tends not to scale. On top of that, the GIL is released only for a short period of time per call, so if you do many Numpy calls on small/medium-sized arrays, it does not scale at all with multiple threads.
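A minimal sketch of that read-only pattern (my own illustration, assuming the work can be split into chunks): all threads read the shared array concurrently, and each one writes its result to a separate, GIL-protected Python list slot, so no Numpy buffer is written in parallel:

import threading
import numpy as np

data = np.random.rand(1_000_000)  # shared array, only *read* by the workers
results = [None] * 4              # one slot per thread, plain Python list

def worker(idx, chunk):
    # Concurrent reads of `data` are safe; the only write is to a distinct
    # list slot, which is protected by the GIL.
    results[idx] = float(np.sum(chunk))

chunks = np.array_split(data, 4)  # views into `data`, still only read here
threads = [threading.Thread(target=worker, args=(i, c)) for i, c in enumerate(chunks)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(np.isclose(sum(results), data.sum()))  # True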