
I'm trying to figure out whether load and store operations on primitive types are atomic when we load/store from shared memory in CUDA.

On the one hand, it seems that any load/store is compiled to the PTX instruction ld.weak.shared.cta, which does not enforce atomicity. But on the other hand, the programming guide says that conflicting accesses are serialized (9.2.3.1):

However, if multiple addresses of a memory request map to the same memory bank, the accesses are serialized

which hints at load/store atomicity "by default" in shared memory. So, would the instructions ld.weak.shared.cta and ld.relaxed.shared.cta have the same effect? Or is the atomic qualifier information the compiler needs anyway, to avoid optimizing loads and stores away?

More generally, supposing variables are properly aligned, would __shared__ int and __shared__ cuda::atomic<int, cuda::thread_scope_block> provide the same guarantees (when considering only load and store operations)?

Bonus (relevant) question: with a primitive data type properly aligned, stored in global memory, accessed by threads from a single block, are __device__ int and __device__ cuda::atomic<int, cuda::thread_scope_block> equivalent in terms of atomicity of load/store operations?
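For reference, here is a minimal sketch of the two kinds of declarations being compared (assuming CUDA 11+ with libcu++; the kernel and variable names are illustrative, not from any real codebase):

```cuda
#include <cuda/atomic>

__device__ int g_plain;                                          // plain global
__device__ cuda::atomic<int, cuda::thread_scope_block> g_atomic; // block-scoped atomic

__global__ void demo()
{
    __shared__ int s_plain;                                          // plain shared
    __shared__ cuda::atomic<int, cuda::thread_scope_block> s_atomic; // atomic shared

    // Plain accesses typically lower to ld.weak.shared.cta / st.weak.shared.cta;
    // the compiler may reorder, coalesce, or hoist them across the kernel.
    s_plain = threadIdx.x;
    int a = s_plain;

    // Relaxed atomic accesses lower to ld.relaxed.shared.cta / st.relaxed.shared.cta:
    // each individual load/store is guaranteed atomic (no tearing), but relaxed
    // ordering imposes no ordering relative to other memory operations.
    s_atomic.store(threadIdx.x, cuda::std::memory_order_relaxed);
    int b = s_atomic.load(cuda::std::memory_order_relaxed);

    (void)a; (void)b; (void)g_plain; (void)g_atomic;
}
```

Whether the weak and relaxed forms happen to behave identically for aligned primitive loads/stores is exactly the open question; only the atomic version carries a documented guarantee.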

Thanks for any insight.

  • I don't have a complete answer but note that a non-atomic access allows compiler optimizations that will definitely change behavior, e.g. reordering, removing redundant loads, etc. So a fairer comparison would be with __shared__ volatile int. Commented Jun 11, 2022 at 11:58
  • Loads and stores being serialized doesn't mean atomicity. E.g. two threads load the same value from one address, both add one to it, and both write back. Even with all the accesses being serialized, this is still a race condition, resulting in +1 instead of +2. Commented Jun 11, 2022 at 18:06
  • 1
    So no you wont get atomicity without requesting it explicitly in any of these cases I would say. Commented Jun 11, 2022 at 18:09
  • 2
    Thanks but, actually, I mean load and store atomicity, not an atomicAdd. Let's suppose I'd be happy to get 1 as a result if load and store are atomic, and I avoid torn reads and writes (for instance). Commented Jun 12, 2022 at 13:07
  • 1
    Hm, okay than I just want to say that all the stuff about shared memory conflicts is normally in terms of a single warp, so I'm not sure if one can interpret anything about accesses from different warps in the same block from that line. Commented Jun 12, 2022 at 22:09

1 Answer


Serialization does not imply atomicity: thread A writes the first 2 bytes of an integer, then thread B reads the variable, and finally thread A writes the last 2 bytes. All of this happens in sequence (not in parallel), yet the access is still not atomic.
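To make the tearing scenario concrete, here is a hypothetical sketch of a split store. For a properly aligned int the hardware may well never split the access; this only illustrates the kind of half-written state a reader could observe when a type is wider than the hardware's atomic access size, or when the access is misaligned:

```cuda
// Hypothetical illustration of a torn write: an 8-byte value is written as two
// 4-byte halves, so a concurrent reader can observe half-old, half-new data.
__device__ unsigned int halves[2];   // stands in for one 8-byte variable

__global__ void writer(unsigned long long value)
{
    halves[0] = (unsigned int)(value);        // first half lands...
    // <-- a reader running here sees a mix of old and new bytes: a torn read
    halves[1] = (unsigned int)(value >> 32);  // ...before the second half
}
```

Even if every one of these stores is serialized with the reader's loads, the reader can still see a value that was never written as a whole.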

Further, serialization is not guaranteed in all cases, see:

Devices of compute capability 2.0 and higher have the additional ability to multicast shared memory accesses, meaning that multiple accesses to the same location by any number of threads within a warp are served simultaneously.

Conclusion: use atomics if you need an atomicity guarantee.
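Concretely, the suggested pattern would look something like this (a sketch assuming libcu++; kernel and variable names are illustrative):

```cuda
#include <cuda/atomic>

__global__ void publish(int *out)
{
    __shared__ cuda::atomic<int, cuda::thread_scope_block> flag;
    if (threadIdx.x == 0)
        flag.store(0, cuda::std::memory_order_relaxed);
    __syncthreads();

    if (threadIdx.x == 0)
        flag.store(1, cuda::std::memory_order_relaxed);  // atomic store: never torn

    // Every thread reads either 0 or 1 -- which one is unordered (relaxed),
    // but a torn in-between value is ruled out by the atomic guarantee.
    out[threadIdx.x] = flag.load(cuda::std::memory_order_relaxed);
}
```

Relaxed ordering is the cheapest flavor: it buys exactly the load/store atomicity asked about in the question, and nothing more.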


4 Comments

In most systems, tearing within a single byte is impossible. And if you have small integer values, e.g. 0x00000000 or 0x00000001, you don't care where the upper bytes come from, only the byte with the 1 bit you care about. But if a compiler could hoist / sink a load / store out of a loop, then you need something that will prevent that, which usually also means avoiding tearing. And if you need any ordering wrt. anything else, you need to do something about it.
Serialization is just a performance problem for atomic loads; you'd hope loads coming from processors which share a coherent cache could all simultaneously hit if the cache line was in shared state. Multicasting loads to cores waiting for them doesn't imply a lack of atomicity either, just a lack of ordering. In CPUs, even memory_order_seq_cst loads can access the same cache line in parallel from different cores (because CPUs have coherent caches.) For seq_cst, you can always invent some order that's consistent with what happened, but it doesn't mean they have to wait for each other.
So anyway, serialization in old devices doesn't imply atomicity (and lack of it in new devices doesn't imply non-atomicity), but if your values are known to always be small, or every write puts the same value in all but one byte (or even bit), then you only need atomicity of a single byte.
Interesting comments, but perhaps a bit out-of-scope here IMHO: (1) the answer is for 4-byte integers, but indeed if you only use 1 byte or less of the 4 bytes, I'd like to believe you get load/store atomicity, though I don't think it's guaranteed anywhere in the CUDA doc? (2) I was only considering read/write atomicity, so I think memory order consistency is not relevant.
