
I stumbled into an interesting issue -- when it comes to intrusive lockless stacks (singly-linked lists), there seems to be a consensus on what push() should look like. All internet/AI searches (and cppreference.com) produce code written along these lines:

struct stack
{
    struct node { node* next; };

    std::atomic<node*> head;

    void push(node* new_node)
    {
        new_node->next = head.load(std::memory_order_relaxed);

        // simplified: real code has an ABA counter in the low 12 bits of head
        while (!head.compare_exchange_weak(new_node->next, new_node,
                                           std::memory_order_release,
                                           std::memory_order_relaxed)) {}
    }
};

but when it comes to pop(), search results are inconsistent wrt memory ordering, e.g.:

    node* pop()
    {
        node* p = head.load(std::memory_order_**acquire**);

        while (p && !head.compare_exchange_weak(p, p->next,
                                                std::memory_order_**acq_rel**,
                                                std::memory_order_**relaxed**));
        return p;
    }

I've seen various permutations of the highlighted parameters (e.g. relaxed + acq_rel + acquire, or relaxed + acquire + relaxed). I understand it doesn't matter on x86, but it is nice to do things right...

What are the correct values of the three highlighted parameters?

Note: I know about the ABA problem, and in my code I solve(*) it using tagged pointers, with the low 12 bits of head being a counter. This is a page allocator.

*solve = severely reduce the probability of it happening :)
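A minimal sketch of that tagged-pointer scheme, assuming 4 KiB-aligned pages so the low 12 bits of the pointer are free for a counter (all names here are illustrative, not from the actual allocator):

```cpp
#include <cstdint>
#include <cassert>

// Pages are 4 KiB-aligned, so the low 12 bits of a page pointer are always
// zero and can carry a modification counter that mitigates (not eliminates)
// the ABA problem.
constexpr std::uintptr_t TAG_BITS = 12;
constexpr std::uintptr_t TAG_MASK = (std::uintptr_t{1} << TAG_BITS) - 1;

inline std::uintptr_t pack(void* ptr, std::uintptr_t counter)
{
    // Assumes ptr is 4 KiB-aligned; the counter is truncated to 12 bits.
    return reinterpret_cast<std::uintptr_t>(ptr) | (counter & TAG_MASK);
}

inline void* unpack_ptr(std::uintptr_t tagged)
{
    return reinterpret_cast<void*>(tagged & ~TAG_MASK);
}

inline std::uintptr_t unpack_tag(std::uintptr_t tagged)
{
    return tagged & TAG_MASK;
}
```

The `std::atomic<node*> head` then becomes a `std::atomic<std::uintptr_t>`, with every successful CAS storing an incremented tag.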

After-thoughts (see details in comments to Peter's answer):

  • It seems the only way to have a formally correct slist in C++26 is to:

    • (prevent data-race UB) make node::next atomic

    • (prevent strict-aliasing UB) do not perform non-atomic node accesses; i.e. if you use the slist for a page pool, the portion of the page containing the node can't be used for other purposes

    • use Peter's version of pop(), i.e. acquire + relaxed + acquire

  • On real ISAs with normal compilers you can use the suggested pop() and use the entire page (trampling over the non-atomic node::next); just add comments explaining why :) seq_cst for the CAS compiles to the same asm on x86 as acquire/etc., but seq_cst doesn't make it any safer on other ISAs where it does compile differently, so there's no reason to prefer it here, other than using the default to keep the source more compact.

  • @HTNW I don't see an infinite loop here -- the failed path updates p, and p->next will be (re-)loaded in the next iteration. Am I missing something?
  • I think there's another race here (I hope this one is real!). If two threads pop at the same time, it's possible that between one thread doing p = head.load(...); and evaluating p->next (which happens before the compare_exchange_weak!), the other thread can finish pop() and modify p->next (or free p, etc.). Voila: UB. I don't think any amount of memory ordering on head helps with this.
  • @HTNW pop() doesn't modify p->next... but a subsequent push() does; see my comment on Peter's answer. I.e. it seems I need to make node::next atomic, or we can't avoid concurrent non-atomic access.
  • CAS_strong is irrelevant; spurious failure of CAS_weak isn't a problem even on ISAs where it can happen. Just using my pop is safe in practice on any real ISA with mainstream compilers, not just x86, despite the data-race and/or strict-aliasing UB on next. There's no plausible way for that to be compile-time-visible and cause it to mis-compile, and it's a non-problem in asm.
  • seq_cst doesn't make it any more or less safe to use the whole page. It will compile the same on x86 as the minimum required orders I specified, but worse on other ISAs for no benefit. Yes, it's technically true that you can use seq_cst for any atomic RMW on x86 with no downside in the asm, but why do that? It seems like a weird thing to mention.

1 Answer

The pure load is correct: acquire is the minimum for correctness.
The CAS in pop is too strong in the success order (just acquire, or even relaxed, is fine), and too weak in the fail order (acquire is needed to set up for the next iteration, for the same reason acquire was needed on the initial load).

    node* pop()
    {
        node* p = head.load(std::memory_order_acquire);  // acquire is necessary

        while (p && !head.compare_exchange_weak(p, p->next,  // deref of p
                                                std::memory_order_relaxed,   // was acq_rel
                                                std::memory_order_acquire)); // was relaxed
        return p;
    }

(But unfortunately C++ requires that the CAS success order be at least as strong as the fail order, else UB, so you actually need CAS_weak(p, p->next, acquire, acquire). This prevents you from allowing a compiler to make the success path faster on ISAs like ARMv7, where acquire is done with a separate fence instruction after a load, which could be placed only on the fail/retry branch.)
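Putting that together, a sketch of a conforming pop, with the success order raised to acquire so it's not weaker than the fail order (the surrounding `slist` boilerplate and names are illustrative, added so the snippet is self-contained; ABA and node reuse are ignored here):

```cpp
#include <atomic>
#include <cassert>

struct slist
{
    struct node { node* next = nullptr; };

    std::atomic<node*> head{nullptr};

    void push(node* n)
    {
        // On CAS failure, n->next is updated to the freshly-loaded head.
        n->next = head.load(std::memory_order_relaxed);
        while (!head.compare_exchange_weak(n->next, n,
                                           std::memory_order_release,
                                           std::memory_order_relaxed)) {}
    }

    node* pop()
    {
        node* p = head.load(std::memory_order_acquire);
        // Success order is acquire only to satisfy the success >= fail rule;
        // the minimum would otherwise be relaxed, as explained above.
        while (p && !head.compare_exchange_weak(p, p->next,
                                                std::memory_order_acquire,
                                                std::memory_order_acquire)) {}
        return p;
    }
};
```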

Actually, consume would be sufficient here, since the accesses that need to happen-after stuff in the other thread are all derefs of a pointer we loaded. But consume is deprecated in ISO C++, and compilers always just promoted it to acquire. In practice, relaxed loads work like consume except on DEC Alpha, at least in cases like this where the compiler can't plausibly figure out that there's only one possible value for p and use that instead of a data dependency on the load.

See also C++11: the difference between memory_order_relaxed and memory_order_consume and especially the link to Paul E. McKenney's CppCon 2016 talk about it. In practice relaxed would be safe for all these loads because all current-mainstream ISAs do support memory dependency ordering, and the data dependency isn't going to optimize away here. The C++ standard doesn't guarantee that, allowing for stuff like Alpha, and hypothetical future CPUs which do value-prediction for loads.

In the rest of this answer, I'm going to talk about what's necessary in the C++ formalism, where acquire is the weakest option to get guaranteed happens-before even with a data dependency, unfortunately.


acquire on the initial load is necessary because we deref p (to load p->next), unlike a CAS retry loop which just does arithmetic1. We need to sync-with other threads which created a node, set its next pointer, and do a release operation on head to publish it.

With just load(relaxed), evaluating p->next would have data-race UB: potentially-concurrent write of new_node->next = head.load(); to the same node object in a thread running push. (The release on the CAS in push is of course also necessary so our acquire load can sync with it.)

(See https://preshing.com/20120913/acquire-and-release-semantics/ for a good conceptual explanation of release/acquire sync and memory ordering; Jeff Preshing's articles are an excellent intro to thinking about memory ordering -- practical, unlike ISO C++'s formal rules, which do indeed make one's head hurt, as you commented.)


For the same reason, relaxed is too weak for the CAS-fail case in pop. That reloads p from head, which might have just been updated by a new pop. This fail-case load reaches the next iteration's p->next evaluation and CAS exactly the same way as the pre-loop pure load.

The modifications to head are all atomic RMWs, so there's a release sequence, but it's not sufficient that we synced with all older head values. If the activity that made our CAS fail was a push between our load and our next CAS, writing its next pointer didn't have to happen-before writing the head value we loaded first and synced-with.
(That would be the case if we raced with a pop; our updated p will point to a node that was already in the stack, and a release sequence exists between the release-RMW that published it and our initial pure load, so the ->next assignment in that push call happens-before our access to it.)


The success order on the CAS is unnecessarily strong. It would need to be acquire so the caller can safely deref the p we return -- except we've already synced-with that value via the pure load or the CAS-fail load, since we've already derefed it. So actually I think pop's CAS-success order can be relaxed!
acquire would make it fully safe even in case of ABA success, but in this algo we have to avoid that for correctness anyway.

It doesn't need to be a release operation: nothing earlier in this thread needs to be visible to other threads which acquire the updated head value we produce.

The release-sequence rule means any RMW avoids breaking sync between earlier writers and later readers. If not for that release-sequence rule, the pop CAS would indeed need to be acq_rel, so previous writes are visible to us, and our release would make them visible to the next thread which acquires the value. But those later acquire loads already sync with all earlier release ops in the modification order of head, because the modifications are an unbroken chain of RMWs.

Perhaps the people who wrote the examples you found forgot about release sequences, or else I'm missing something? Maybe they thought they needed acquire + release to "pass along" visibility of earlier ops?

acq_rel without the release is just acquire. Without the acquire part either (since we already did that with a pure load), it's just relaxed.


node::next needs to be atomic<>, and you can't free nodes.

As you and HTNW pointed out in comments, pop can sleep after loading p, then wake up and read p->next after stuff has changed. Specifically, after another thread has popped the node p is pointing to, and pushed it again. While doing so, it writes the ->next pointer. So we have a write that's concurrent with the read by the slow-pop thread.

That's data-race UB in ISO C++. In practice it's harmless on mainstream C++ implementations, other than clang++ -fsanitize=thread, which AFAIK is still a conforming C++ implementation. std::atomic<node*> next, which you load/store with relaxed operations, would be fine. std::atomic_ref could let you update it with plain assignments before it's published, but that probably compiles to the same asm as using a temp variable and .store(curr_head, relaxed), since pointer-width relaxed atomic loads/stores don't need any special instructions on mainstream ISAs, and these loads/stores can't optimize away even if we gave the compiler the chance to do so.
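A sketch of what that might look like with `std::atomic<node*> next` and relaxed accesses to it (the `atomic_slist` name and structure are illustrative; ABA and node reuse are still ignored here):

```cpp
#include <atomic>
#include <cassert>

struct atomic_slist
{
    // next is atomic, so a slow pop racing with a re-push reads it as a
    // relaxed atomic load instead of a non-atomic data race.
    struct node { std::atomic<node*> next{nullptr}; };

    std::atomic<node*> head{nullptr};

    void push(node* n)
    {
        // A temp variable is needed: the CAS expected argument must be a
        // plain node*, not the atomic member itself.
        node* curr = head.load(std::memory_order_relaxed);
        do {
            n->next.store(curr, std::memory_order_relaxed);
        } while (!head.compare_exchange_weak(curr, n,
                                             std::memory_order_release,
                                             std::memory_order_relaxed));
    }

    node* pop()
    {
        node* p = head.load(std::memory_order_acquire);
        while (p && !head.compare_exchange_weak(
                        p, p->next.load(std::memory_order_relaxed),
                        std::memory_order_acquire,    // success >= fail rule
                        std::memory_order_acquire)) {}
        return p;
    }
};
```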

If the node gets reused, what we load for p->next doesn't matter; the CAS will fail because your tagged pointer makes sure p isn't the same as the new head.

But ISO C++ doesn't have a way to do non-atomic loads that can tear, whose result you won't use when you discover that a race may have happened. This is more of a problem for SeqLock, where the whole point is to have a payload bigger than what the machine can do atomically for free.

Worse, the thread which won the race to pop this node could delete it, and if the standard library gives this page of memory back to the OS, reading p->next will fault! So you need to never do that. The deallocation problem for lock-free data structures is difficult in languages which aren't garbage-collected.

This node-reuse race with a slow pop can happen even with sequential consistency, so no amount of memory ordering can fix it.


Footnote 1:
A CAS retry loop implementing just arithmetic on a single value, like atomic right shift, can start with a relaxed load, regardless of what memory ordering you want for the operation as a whole. Computation on the relaxed load result always safely produces something, and the eventual success CAS is exactly like fetch_add or a hypothetical fetch_shift. The work that went into preparing the value to store is irrelevant as long as it's a pure function of expected. But derefing a pointer is not pure in that sense.
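For example, a hypothetical fetch_shift (atomic right shift) built as a CAS retry loop can safely start with a relaxed load, since the value stored is a pure function of expected (a sketch; `fetch_shift_right` is an invented name, not a standard API):

```cpp
#include <atomic>
#include <cassert>

// Atomic right-shift via a CAS retry loop. The initial load can be relaxed
// because expected >> shift is a pure function of expected; the CAS itself
// carries whatever ordering the operation as a whole needs.
inline unsigned fetch_shift_right(std::atomic<unsigned>& a, int shift)
{
    unsigned expected = a.load(std::memory_order_relaxed);
    while (!a.compare_exchange_weak(expected, expected >> shift,
                                    std::memory_order_acq_rel,
                                    std::memory_order_relaxed)) {}
    return expected;  // old value, like fetch_add returns
}
```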

The CAS loop in push isn't, on failure, updating any pointers that it will deref; it only updates new_node->next to point to the current head. BTW, I find using new_node->next as the expected arg to CAS slightly obscure/tricky; I'd have used a local temporary like node* curr_head and written do { new_node->next = curr_head; } while(!CAS). But maybe I only find it tricky because I'm used to thinking in asm, where I expect the first operand of CAS to be something that can live in a register; new_node->next has to get stored to memory as well as keeping the value in a register for the next CAS. Still, I think it's nice to make the update to new_node->next explicit as an assignment statement inside the loop, even though that takes more lines of code.
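That alternative push style might be sketched like this (the `stack2` wrapper is illustrative; same non-atomic next as the question's code):

```cpp
#include <atomic>
#include <cassert>

struct stack2
{
    struct node { node* next = nullptr; };
    std::atomic<node*> head{nullptr};

    void push(node* new_node)
    {
        // An explicit local holds the current head; the assignment to
        // new_node->next is a visible statement inside the loop rather than
        // being folded into the CAS expected argument.
        node* curr_head = head.load(std::memory_order_relaxed);
        do {
            new_node->next = curr_head;  // curr_head is updated by a failed CAS
        } while (!head.compare_exchange_weak(curr_head, new_node,
                                             std::memory_order_release,
                                             std::memory_order_relaxed));
    }
};
```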

(Update: with atomic<node*> next, you need a temp variable anyway.)


23 Comments

Since your answer is more detailed, and you are dissatisfied with mine, I’ll remove it and upvote yours.
With just load(relaxed), evaluating p->next would have data-race UB: potentially-concurrent write of new_node->next = head.load(); to the same node object in a thread running push. I am trying to understand how we can have UB here... If load(relaxed) gets a new node, this means the push thread's CAS result was observed (and this means the write to p->next has happened too, due to the release in the CAS). And the p->next load can happen only after p = load(relaxed), can't it? I.e. the implicit (data?) dependency between p = ... and p->next serves as p = load(acquire), but only wrt the p->next load.
I have a feeling this "new_node->next write" concurrent with our "p->next load" is unavoidable (regardless of the memory ordering flags used) and is therefore a red herring. Situation: Thread A runs pop() and is suspended right after p = head.load(...); Thread B runs pop(), followed by push() of the same node. If Thread A wakes up while Thread B runs push(), you have a concurrent read and write of the same non-atomic variable.
Oh, yeah, that's a showstopper problem for this stack algorithm. That can happen even with sequential consistency, as HTNW and you discussed in comments under the question; use-after-free is the biggest real-world concern, in case it unmaps the page that held the node. Otherwise it's harmless on real machines to load p->next as long as p still points to valid memory. (Hypothetically, just loading (into an address register) something that's no longer a valid pointer could fault or something, even if we don't use the value other than with a CAS which fails due to tagged pointers.)
Maybe Davislor deleted their answer before you could see my reply in comments under it about memory dependency ordering not being guaranteed in the ISO C++ abstract machine, nor on DEC Alpha AXP. memory_order_consume was intended to let us take advantage of it, but the design was so hard to implement that all compilers just promoted it to acquire. Because it allowed things like preserving the data dependency even in other_var[ p-p ], requiring compiling to an actual sub instruction instead of a constant 0 without a data dependency. consume was deprecated and then removed.
I saw it, but it is unlikely I fully understood it :-/
See Paul E. McKenney's CppCon 2016 talk: C++ Atomics: The Sad Story of memory_order_consume: A Happy Ending At Last? - spoiler alert, we didn't get a happy ending in C++17/20/23/26, instead dropping the feature. But IIRC, he does a good job explaining what memory dependency ordering is, and what corner cases exist that make it hard for compilers. Indeed, in this case, relaxed would work in practice. Updated my answer with more about that.
See stackoverflow.com/questions/38280633/… Some real-world code like the Linux kernel does use memory dependency ordering, e.g. in RCU to allow read fast-paths to have zero barriers even on weakly-ordered ISAs like PowerPC. They target known compilers, and have rules like making sure the pointer math doesn't do anything that could optimize away, which isn't something ISO C++ wanted to try to formally specify.
Peter, this slist is taken from "memory page pool" code (4kb-aligned so we have 12 bits for ABA problem), where (after pop()) clients use entire (4kb) page (via reinterpret_cast<char*>) -- i.e. even if I use atomic node::next it gets mercilessly trampled over in parallel threads using non-atomic semantic. So, would it be correct to say that this (page pool) logic which is used in a lot of real-world code (and I used it first time nearly 30 years ago) still can't be expressed in "pure" C++ because it can't "formalize tearing reads"?
That's correct, that usage requires overlapping lifetimes of the next pointer and of whatever other objects another thread uses there (with placement-new I think would be formally required even without the data-race). So that's data-race UB and/or strict-aliasing UB. According to the C++ standard, any UB anywhere in your program means the standard has absolutely nothing to say about what happens in your program before or after that happens. It is of course perfectly fine on normal CPUs which don't do hardware race detection. And real compilers because they don't do inter-thread analysis.
Thank you, Peter! Your answer is amazing. Alas, I don't have enough knowledge of the formal C++ memory model to verify its every facet -- i.e. I'll have to take your word for it (wrt memory-order flags). In my code I'll go back to strong CAS and add comments about the two flavors of UB my code has and why it is ok on my platform/compiler combo. On the other hand, it looks like a formally correct slist is possible, if you: 1) use atomic node::next; 2) do not (over-)write it when using the page after pop(); 3) use your version of pop(). Thank you for that!
...and one more note (for whoever is itching to use an slist): a tagged ptr is not a bullet-proof ABA fix; it severely reduces the probability of it happening, but it still can happen. Better use an alternative solution (ring buffer of page pointers?)
compare_exchange_strong has zero advantage here over compare_exchange_weak. Stronger memory-order parameters also have zero advantage. The version of pop in my answer is safe for the way you use it on real-world CPUs with real mainstream compilers. If ABA is possible occasionally, stronger memory orders still don't help, I don't think. They'd just make pop slower, increasing the race window if anything. Memory dependency ordering makes it safe even with relaxed; even acquire is overkill in practice.
For ABA, x86-64 also has at least 7 high bits you can use, since pointers are either 57 or 48-bit. And in user-space, the high bits are known zero, so actually 8 high bits. (Gotta clear them before deref including in p->next. stackoverflow.com/questions/16198700/… . Unless you use HW features like Intel LAM (Linear Address Masking) / AMD Upper-Address-Ignore or ARM TBI (top-byte-ignore) to let you deref non-canonical pointers. lwn.net/Articles/902094 / phoronix.com/news/AMD-Linux-UAI-Zen-4-Tagging)
cppreference states: When compare_exchange_weak would require a loop and compare_exchange_strong would not, compare_exchange_strong is preferable unless .... Wait, I still need a loop, don't I? Sigh... let me update my afterthoughts in the original question.
atomic solves the memory race but there's still an ABA problem. So it's definitely needed to avoid node reuse.
The OP uses tagged pointers; since these are page-aligned, they can use the low 12 bits as a counter. (Or as a random number since I guess the old counter value gets lost when using the whole page between pop and re-push, and that would be why they say only mostly avoids ABA, not 100%.)
I use 12 bits as a counter of operations on a head (not given node) -- every successful CAS would increment that counter. In any case, my point was that it is still theoretically possible for thread A to be suspended, wake up after 2^12 ops on head and be stung with ABA. Chances are very slim, negligible, but not zero.
Oh right, yeah that should do it. But yeah it's a single counter so every page cycled through the stack increments it, not the same page 2^12 times, so it wouldn't necessarily take an extremely long sleep for this to be possible. It would need the counter to roll over just when you're re-pushing the same page. Probably not worth it to use upper bits of pointers as well, since that costs more masking with 64-bit constants that need large instructions to materialize in regs on x86-64, and you'd need to manually propagate carry between high and low chunks.
Ahh, I checked for tagged pointers in the C++ code but OP simply stated it in prose.
Peter, I am getting this warning in C++23: warning: failure memory model 'memory_order_acquire' cannot be stronger than success memory model 'memory_order_relaxed'. I wonder if it tries to tell me something (I've seen claim online that this is UB)...
Oh yeah, I thought I remembered that C++ had this requirement about the fail strength needing to be at least as strong as the success strength. It's not logically necessary, and also not necessary in asm implementations that use separate barrier instructions instead of having load-acquire instructions. (e.g. ARMv7. Branch on the result, put an ARM dmb ish only in the failure branch.) I'll update my answer, though.
Thank you! On unrelated note: there was a post few days ago about some HFT shop losing $50k -- they had no ABA mitigation (at all, lol) and it took 6 months before problem bit them in the rear. I was always skeptical about using tagged pointers for ABA, but this makes me relax a bit. :D
