The pure load is correct: acquire is the minimum for correctness.
The CAS in pop is too strong in the success order (just acquire or even relaxed is fine), and too weak in the fail order (acquire is needed to set up for the next iteration, for the same reason acquire was needed on the initial load.)
```cpp
node* pop()
{
    node* p = head_.load(std::memory_order_acquire);  // acquire is necessary
    while (p && !head_.compare_exchange_weak(p, p->next,         // deref of p
                        std::memory_order_relaxed,   // was acq_rel
                        std::memory_order_acquire)); // was relaxed
    return p;
}
```
(But unfortunately C++ requires that the CAS success order be at least as strong as the failure order (until C++17, which dropped that requirement), so you actually need CAS_weak(p, p->next, acquire, acquire). This prevents you from letting the compiler make the success path faster on ISAs like ARMv7, where acquire is done with a separate fence instruction after a load, which could be placed only on the fail/retry branch.)
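Under those pre-C++17 rules, the version you'd actually write looks like this (a sketch, assuming the same `head_` member and a `node` with a plain `next` as in the question's code):

```cpp
#include <atomic>

struct node { node* next; int data; };   // plain next, as in the question

std::atomic<node*> head_{nullptr};

node* pop()
{
    node* p = head_.load(std::memory_order_acquire);
    // Pre-C++17, the failure order may not be stronger than the success
    // order, so both must be acquire even though success alone could be
    // relaxed (we already synced-with this value when we derefed p).
    while (p && !head_.compare_exchange_weak(p, p->next,
                        std::memory_order_acquire,
                        std::memory_order_acquire));
    return p;
}
```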
Actually consume would be sufficient here since the accesses which need to happen-after stuff in the other thread are all derefs of a pointer we loaded. consume is deprecated in ISO C++, and compilers always just promoted it to acquire. In practice relaxed loads work like consume except on DEC Alpha, at least in cases like this where the compiler can't plausibly figure out that there's only one possible value for p and use that instead of a data dependency on the load.
See also C++11: the difference between memory_order_relaxed and memory_order_consume and especially the link to Paul E. McKenney's CppCon 2016 talk about it. In practice relaxed would be safe for all these loads because all current-mainstream ISAs do support memory dependency ordering, and the data dependency isn't going to optimize away here. The C++ standard doesn't guarantee that, allowing for stuff like Alpha, and hypothetical future CPUs which do value-prediction for loads.
In the rest of this answer, I'm going to talk about what's necessary in the C++ formalism, where acquire is the weakest option to get guaranteed happens-before even with a data dependency, unfortunately.
acquire on the initial load is necessary because we deref p (to load p->next), unlike a CAS retry loop which just does arithmetic¹. We need to sync-with other threads which created a node, set its next pointer, and did a release operation on head to publish it.
With just load(relaxed), evaluating p->next would have data-race UB: potentially-concurrent write of new_node->next = head.load(); to the same node object in a thread running push. (The release on the CAS in push is of course also necessary so our acquire load can sync with it.)
(See https://preshing.com/20120913/acquire-and-release-semantics/ for a good conceptual explanation of release/acquire sync and memory ordering; Jeff Preshing's articles are an excellent intro to thinking about memory ordering. Practical, rather than ISO C++'s formal rules, which (as you commented) do indeed make one's head hurt.)
For the same reason, relaxed is too weak for the CAS-fail case in pop. That reloads p from head, which might have just been updated by a new pop. This fail-case load reaches the next iteration's p->next evaluation and CAS exactly the same way as the pre-loop pure load.
The modifications to head are all atomic RMWs so there's a release-sequence, but it's not sufficient that we synced with all older heads. If the activity that made our CAS fail was a push between our load and our next CAS, writing its next pointer didn't have to happen-before writing the head value we loaded first and synced-with.
(That would be the case if we raced with a pop; our updated p will be pointing to a node that was already in the stack, and a release-sequence exists between the release-RMW that published it and our initial pure load, so the ->next assignment in that push call happens-before our access to it.)
The success order on the CAS is unnecessarily strong. It needs to be acquire so the caller can safely deref the p we return. Except we've already synced-with that value with the pure load or the CAS-fail load since we've already derefed it, so actually I think pop's CAS-success order can be relaxed!
acquire would make it fully safe even in case of ABA success, but in this algo we have to avoid that for correctness anyway.
It doesn't need to be a release operation: nothing earlier in this thread needs to be visible to other threads which acquire the updated head value we produce.
The release-sequence rule means any RMW avoids breaking sync between earlier writers and later readers. If not for that release-sequence rule, the pop CAS would indeed need to be acq_rel, so previous writes are visible to us, and our release would make them visible to the next thread which acquires the value. But those later acquire loads already sync with all earlier release ops in the modification order of head, because the modifications are an unbroken chain of RMWs.
Perhaps the people who wrote the examples you found forgot about release sequences, or else I'm missing something? Maybe they thought they needed acquire + release to "pass along" visibility of earlier ops?
acq_rel without the release is just acquire. Without the acquire part either (since we already did that with a pure load), it's just relaxed.
node::next needs to be atomic<>, and you can't free nodes.
As you and HTNW pointed out in comments, pop can sleep after loading p, then wake up and read p->next after stuff has changed. Specifically, after another thread has popped the node p is pointing to, and pushed it again. While doing so, it writes the ->next pointer. So we have a write that's concurrent with the read by the slow-pop thread.
That's data-race UB in ISO C++. In practice it's harmless on mainstream C++ implementations, other than with clang++ -fsanitize=thread, which AFAIK is still a conforming C++ implementation. A std::atomic<node*> next which you load/store with relaxed operations would be fine. std::atomic_ref could let you update it with plain assignments before it's published, but that probably compiles to the same asm as using a temp variable and .store(curr_head, relaxed), since pointer-width relaxed atomic loads/stores don't need any special instructions on mainstream ISAs, and these loads/stores can't optimize away even if we gave the compiler the chance to do so.
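A sketch of what that looks like (hypothetical `node` definition; relaxed is sufficient for the next-pointer accesses themselves, the ordering still comes from the acquire loads of `head`):

```cpp
#include <atomic>

struct node {
    std::atomic<node*> next;   // relaxed ops: no extra instructions on
                               // mainstream ISAs, but no data-race UB
    int data;
};

std::atomic<node*> head{nullptr};

node* pop()
{
    node* p = head.load(std::memory_order_acquire);
    // p->next.load(relaxed) instead of a plain non-atomic read: legal even
    // if another thread concurrently stores to this node's next.
    while (p && !head.compare_exchange_weak(
                    p, p->next.load(std::memory_order_relaxed),
                    std::memory_order_acquire,
                    std::memory_order_acquire));
    return p;
}
```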
If the node gets reused, what we load for p->next doesn't matter; the CAS will fail because your tagged pointer makes sure p isn't the same as the new head.
But ISO C++ doesn't have a way to do non-atomic loads that can tear, whose result you won't use when you discover that a race may have happened. This is more of a problem for SeqLock, where the whole point is to have a payload bigger than what the machine can do atomically for free.
Worse, the thread which won the race to pop this node could delete it, and if the standard library gives this page of memory back to the OS, reading p->next will fault! So you need to never do that. The deallocation problem for lock-free data structures is difficult in languages which aren't garbage-collected.
This node-reuse race with a slow pop can happen even with sequential consistency, so no amount of memory ordering can fix it.
Footnote 1:
A CAS retry loop implementing just arithmetic on a single value, like atomic right shift, can start with a relaxed load, regardless of what memory ordering you want for the operation as a whole. Computation on the relaxed load result always safely produces something, and the eventual success CAS is exactly like fetch_add or a hypothetical fetch_shift. The work that went into preparing the value to store is irrelevant as long as it's a pure function of expected. But derefing a pointer is not pure in that sense.
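For contrast, a sketch of such a pure-arithmetic retry loop (a hypothetical "fetch_rshift", named by analogy with fetch_add; the name is mine):

```cpp
#include <atomic>
#include <cstdint>

// Atomic right-shift via a CAS retry loop. The initial load can be relaxed
// because we never deref the value, only compute with it: a stale value
// just means one extra retry, and the CAS supplies whatever ordering the
// operation as a whole needs.
uint32_t fetch_rshift(std::atomic<uint32_t>& a, int shift)
{
    uint32_t expected = a.load(std::memory_order_relaxed);
    while (!a.compare_exchange_weak(expected, expected >> shift,
                    std::memory_order_acq_rel,
                    std::memory_order_relaxed));
    return expected;   // old value, like fetch_add returns
}
```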
The CAS loop in push doesn't, on failure, update any pointer that it then derefs. It only updates new_node->next to point to the current head. BTW, I find using new_node->next as the expected arg to CAS slightly obscure/tricky; I'd have used a local temporary variable like node *curr_head and written do{new_node->next = curr_head;}while(!CAS). But maybe I only found it tricky because I'm used to thinking in asm, so I expect the first operand of CAS to be something that can live in a register. But new_node->next has to get stored to memory as well as keeping the value in a register for the next CAS. Still, I think it's nice to make the update to new_node->next explicit as an assignment statement inside the loop, even though that requires more lines of code.
(Update: with atomic<node*> next, you need a temp variable anyway.)
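A sketch of that style of push, with `node::next` atomic and a local `curr_head` as the CAS expected value (names are mine, not from the question's code):

```cpp
#include <atomic>

struct node {
    std::atomic<node*> next;
    int data;
    explicit node(int d) : next(nullptr), data(d) {}
};

std::atomic<node*> head{nullptr};

void push(node* new_node)
{
    node* curr_head = head.load(std::memory_order_relaxed);
    do {
        // relaxed store: new_node isn't published yet, so no other thread
        // can read this next pointer until the release-CAS below succeeds.
        new_node->next.store(curr_head, std::memory_order_relaxed);
        // release on success so a popping thread's acquire load of head
        // syncs-with us and sees new_node->next.
    } while (!head.compare_exchange_weak(curr_head, new_node,
                    std::memory_order_release,
                    std::memory_order_relaxed));
}
```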
Comments (from the discussion under this answer, with code formatting restored; some joins between comments were lost):

- "`p` and `p->next` will be (re-)loaded in next iteration. Do I miss smth?"
- "…`pop` at the same time, it's possible that in between one thread doing `p = head.load(...);` and evaluating `p->next` (which happens before the `compare_exchange_weak`!), the other thread can finish `pop()` and modify `p->next` (or free `p`, etc.). Voila: UB. I don't think any amount of memory ordering on `head` helps with this."
- "`pop()` doesn't modify `p->next`... but subsequent `push()` does, see my comment to Peter's answer. I.e. it seems I need to make `node::next` atomic or we can't avoid concurrent non-atomic access."
- "`pop` is safe in practice on any real ISA with mainstream compilers, not just x86, despite the data-race and/or strict-aliasing UB on `next`. There's no plausible way for that to be compile-time-visible and cause it to mis-compile, and it's a non-problem in asm."
- "`seq_cst` doesn't make it any more or less safe to use the whole page. It will compile the same on x86 as the minimum required orders I specified, but worse on other ISAs for no benefit. I mean yes it's technically true that you can use `seq_cst` for any atomic RMWs on x86 with no downside in the asm, but why do that? Seems like a weird thing to mention."