
Here is a possibly incorrect program for communicating between a function and a signal handler that might interrupt it. Assume that HandleSignal has been arranged to run as a signal handler. The function Foo ensures that the signal handler will see a struct with some property if it interrupts Foo while it's doing work.

struct TwoInts {
    int a;
    int b;
};
static thread_local std::atomic<const TwoInts*> t_two_ints;

void Foo() {
    // Publish a TwoInts struct for use by HandleSignal while we run. Use a
    // release fence before publishing to prevent the compiler from reordering
    // the initialization code to after the pointer is published.
    std::optional s = TwoInts{17, 19};
    std::atomic_signal_fence(std::memory_order_release);
    t_two_ints.store(&*s, std::memory_order_relaxed);

    // Do some other work…

    // Unpublish the struct before we destroy it. For exposition's sake we
    // destroy it explicitly by resetting the optional.
    t_two_ints.store(nullptr, std::memory_order_relaxed);
    s.reset();
}

void HandleSignal() {
    // Load the pointer, using an acquire load to pair with the release fence
    // in Foo, so that we see the side effects of initializing the struct.
    const TwoInts* const s = t_two_ints.load(std::memory_order_acquire);
    if (s != nullptr && s->b != s->a + 2) {
        std::quick_exit(0);
    }
}

(Godbolt with #includes, for x86-64 and AArch64)


Foo uses a release fence when initially publishing the pointer to the struct, ensuring by [atomics.fences]/3 that the compiler won't lift the store of the pointer above the initialization of the struct that store publishes. This prevents HandleSignal from seeing an uninitialized struct.

But what about the other end? Do we need to do something to prevent the call to ~TwoInts from being reordered before the null store in Foo? It seems like we should have the inverse problem here, but it's not exactly clear how to solve it with fences without resorting to something like an acquire fence and a compare-and-swap operation. Is there some other reason that we can be confident the destructor call won't be lifted above that store?

Ideally I'm looking for a language lawyer answer, with citations from the standard. C++20 and C++23 are both fine.

  • Since we're only aiming for ordering of a signal handler wrt. its main thread, not other threads, the handler could use a relaxed load and signal_fence(acquire), which costs nothing in the asm. It's kind of odd to do that optimization for the store but not the load, since both are pure win for performance and pure downside for readability; might as well do both or neither. But yes it's still correct to do it this way. Commented Nov 4 at 8:30
  • And yes re: the actual question, you probably want signal_fence(release) or signal_fence(seq_cst) after the store(nullptr), before .reset. I think C++ doesn't actually guarantee anything about that since the s.reset() isn't an atomic operation, but in practice on real compilers, fences will give compile-time ordering even between non-atomic operations. This is a similar problem to a SeqLock (although that has the additional challenge of potentially-concurrent access to the payload and then ignoring possible tearing, which in ISO C++ is fully UB so nothing is guaranteed.) Commented Nov 4 at 8:33
  • You declared t_two_ints as thread_local and access it through a signal handler. Maybe you checked this, but to me it is not at all obvious that the signal handling will occur in the same thread as Foo. Commented Nov 4 at 9:20
  • @prapin that's not the point of the question, but I did say "if it interrupts Foo while it's doing work". If it runs on another thread it won't interrupt Foo. Commented Nov 4 at 9:35
  • @PeterCordes: Fences of all kinds do affect non-atomic operations, and I don’t think the fence-fence restriction applies to signal fences. Commented Nov 4 at 17:04

1 Answer


You definitely have to do something. The code as it stands is clearly incorrect: nothing ensures that the reads of your TwoInts object *s in HandleSignal happen-before its destruction in s.reset(), so this is a data race and the behavior is undefined.

At the level of the implementation, yes, the compiler is perfectly free to reorder the store to t_two_ints with s.reset().

It would be nice if we could fix this by simply putting a fence between t_two_ints.store() and s.reset(). On a typical implementation, a two-way compiler-only barrier like GCC's asm("" : : : "memory"); would suffice, because a signal handler executes in its entirety between two instructions of the interrupted code. So this would have the semantics of "the entire signal handler either happens-before or happens-after the barrier". That would solve our problem: if the signal handler happens-before the barrier, then all its accesses to *s happen-before s.reset(); and if it happens-after, then it observes the store of nullptr to t_two_ints and so does not access our *s at all.
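For concreteness, such an implementation-specific fix would look like the following. This is a sketch, not portable ISO C++: it relies on GCC/Clang extended inline asm and on the implementation guarantee that the handler runs entirely between two instructions. The barrier statement is the only change from the question's Foo.

```cpp
#include <atomic>
#include <cassert>
#include <optional>

struct TwoInts {
    int a;
    int b;
};
static thread_local std::atomic<const TwoInts*> t_two_ints;

void Foo() {
    std::optional s = TwoInts{17, 19};
    std::atomic_signal_fence(std::memory_order_release);
    t_two_ints.store(&*s, std::memory_order_relaxed);

    // Do some other work…

    t_two_ints.store(nullptr, std::memory_order_relaxed);
    // Implementation-specific: a two-way compiler-only barrier (GCC/Clang
    // syntax). The handler executes wholly before or wholly after this point,
    // so either it sees nullptr, or its reads complete before s.reset().
    asm volatile("" ::: "memory");
    s.reset();  // the destructor can no longer be hoisted above the null store
}
```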

Unfortunately, as far as I can tell, ISO C++ doesn't formally give us a signal fence with those semantics. atomic_signal_fence is simply defined as having the same semantics as atomic_thread_fence, but only between a handler and its interrupted thread [atomic.fences p6; all citations are to C++23 draft N4950]. This implies, I believe, that we can only get the necessary happens-before through synchronization, which requires an actual acquire load observing the value of a release store (or the equivalent with relaxed load/store and acquire/release fences).

So the lowest-cost correct version I could come up with is the following. Note that we prefer to use relaxed operations together with signal fences, which are more verbose but should impose no runtime cost.

static thread_local std::atomic<const TwoInts*> t_two_ints;
static thread_local std::atomic<bool> in_handler;

void Foo() {
    std::optional s = TwoInts{17, 19};                     // #1
    std::atomic_signal_fence(std::memory_order_release);   // #2
    t_two_ints.store(&*s, std::memory_order_relaxed);      // #3
    // Do some other work…
    t_two_ints.store(nullptr, std::memory_order_relaxed);  // #4
    std::atomic_signal_fence(std::memory_order_seq_cst);   // #5
    while (in_handler.load(std::memory_order_relaxed)) {   // #6
        [[unlikely]]; // should never happen
    }
    std::atomic_signal_fence(std::memory_order_acquire);   // #7
    s.reset();                                             // #8
}

void HandleSignal() {
    in_handler.store(true, std::memory_order_relaxed);     // #9
    std::atomic_signal_fence(std::memory_order_seq_cst);   // #10
    const TwoInts* const s =
        t_two_ints.load(std::memory_order_relaxed);        // #11
    std::atomic_signal_fence(std::memory_order_acquire);   // #12
    if (s != nullptr && s->b != s->a + 2) {                // #13
        std::quick_exit(0);
    }
    std::atomic_signal_fence(std::memory_order_release);   // #14
    in_handler.store(false, std::memory_order_relaxed);    // #15
}

Try on Godbolt

In the handler, this adds two plain stores, which should have minimal cost because they can be buffered. In Foo(), it adds one plain load, which (if the handler ran) should be hot in cache or store-forwarded, and a conditional branch, which is never taken and should be predicted well.

This assumes that HandleSignal isn't reentrant; that (through some implementation-defined mechanism) its signal is blocked while it's executing, so that executions of HandleSignal are totally ordered by happens-before. If it is reentrant, we would need some more changes, such as replacing the boolean in_handler with a counter which is atomically incremented and decremented on entry and exit to HandleSignal.
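A sketch of that counter-based variant, with hypothetical helper names (EnterHandler, ExitHandler, WaitForHandlers are not part of the code above); the fences mirror #5–#7, #10, and #14 from the listing above:

```cpp
#include <atomic>
#include <cassert>

// Hypothetical sketch: replace the boolean in_handler with a nesting-depth
// counter, so nested executions of HandleSignal each register themselves.
static thread_local std::atomic<int> t_handler_depth{0};

// At the top of HandleSignal, in place of store #9:
void EnterHandler() {
    t_handler_depth.fetch_add(1, std::memory_order_relaxed);
    std::atomic_signal_fence(std::memory_order_seq_cst);  // like fence #10
}

// At the bottom of HandleSignal, in place of store #15:
void ExitHandler() {
    std::atomic_signal_fence(std::memory_order_release);  // like fence #14
    t_handler_depth.fetch_sub(1, std::memory_order_relaxed);
}

// In Foo(), in place of the loop on in_handler (#5 through #7):
void WaitForHandlers() {
    std::atomic_signal_fence(std::memory_order_seq_cst);  // like fence #5
    while (t_handler_depth.load(std::memory_order_relaxed) != 0) {
        // spin; should never actually happen, as with the boolean version
    }
    std::atomic_signal_fence(std::memory_order_acquire);  // like fence #7
}
```

Since the handler and Foo run on the same thread, the relaxed fetch_add/fetch_sub need no cross-thread atomicity here; they exist so that a nested handler doesn't clear the flag set by the outer one.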

Note I've replaced the acquire load in HandleSignal() with a relaxed load and a fence, as already suggested in the comment by Peter Cordes. This is preferable as noted above. So fences #2 and #12 serve to prevent a data race between #1 and #13.

Now let's prove that it avoids data races between #13 and #8. Take an arbitrary execution F of Foo(), and an arbitrary execution H of HandleSignal(). Since seq_cst fences participate in a total order S [atomics.order p4], there are two cases: either fence #5 in Foo() precedes fence #10 in HandleSignal() or vice versa.

  1. Suppose fence #5 precedes fence #10. I claim that load #11 must either return nullptr as stored by #4, or else some value later in the modification order [intro.races p4] of t_two_ints. For if not, then by [atomics.order p3.3], taking A = #11, B = #4, and X = #3 or any earlier store, we have that #11 is coherence-ordered-before #4. Since #10 happens-before #11 and #4 happens-before #5 (by sequencing), then [atomics.order p4.4] (taking X = #10, A = #11, B = #4, Y = #5) implies that #10 precedes #5, a contradiction.

    So #11 returns either nullptr or a later value than #4. In the latter case, the value in particular cannot be that stored by #3, because #3 is sequenced-before #4, thus happens-before #4 [intro.races p10.1], thus precedes #4 in the modification order [intro.races p15]. So in either case, the value loaded by #11 does not point to our *s, and so there is no data race between #13 and #8 as they operate on different objects.

  2. Suppose instead that fence #10 precedes fence #5.

    Consider the load #6 in the last iteration of the loop (if there is more than one iteration, which should not actually happen). It returns false. I claim that this false value was stored either by #15 (case 2.1), or by some store later in the modification order of in_handler (case 2.2). If not, then since the value returned by #6 cannot be the value stored by #9 (which was true), it must necessarily be a value earlier still in the modification order. (Recall that modification order is consistent with sequencing order, by [intro.races p15].) That means, by the same argument as before, that #6 is coherence-ordered-before #9 [atomics.order p3.3 again] and so fence #5 precedes fence #10, contrary to our supposition.

    1. If #6 returns the value stored by #15, then by [atomics.fences p2] (taking A=#14, B=#7, X=#15, Y=#6) we have that #14 synchronizes with #7. Moreover, #7 is sequenced-before #8, so by [intro.races p9.3.1], #14 inter-thread-happens-before #8. And #13 is sequenced-before #14, so by [intro.races p9.3.2], #13 inter-thread-happens-before #8. By [intro.races p10.2], #13 thus happens-before #8, and there is no data race [intro.races p21].

    2. If #6 returns a value from a modification X later in the modification order than #15, then X must be some #15' from a different execution H' of HandleSignal. By our non-reentrancy assumption, one of these executions happens-before the other. But by [intro.races p15], #15' does not happen-before #15, so it must be that H happens-before H'. We also have as before that the corresponding fence #14' happens-before (indeed, inter-thread-happens-before) #8.

      We would now like to conclude using transitivity of happens-before, but unfortunately, in C++23 and earlier, happens-before is not defined as transitive due to the possibility of consume operations. So instead, note that if H happens-before H', so that in particular #13 happens-before #14', then by [intro.races p10] either #13 is sequenced-before #14' (if such a thing is possible for signal handlers) or #13 inter-thread-happens-before #14'. In either case, using [intro.races p9.3.2] or [p9.3.3] respectively, #13 inter-thread-happens-before #8, and so #13 happens-before #8. Again, there is no data race.

Q.E.D.


5 Comments

Yeah, I also don't know of anything in the standard that would require atomic_signal_fence(seq_cst) to give the ordering we need here, keeping an atomic store before a plain assignment. In practice on real implementations like GCC, atomic_signal_fence(release) is strong enough, probably the same strength as asm("" ::: "memory"). The ISO standard's memory model with full-blown UB for so many things is often pretty disappointing, but creative idea to invent an atomic bool to work around it. Being in the same cache-line as t_two_ints makes it ~free whether Foo ran recently or not.
In some ways, there should really be an entirely separate and stronger memory model for same-thread signal handlers. ISO C++ basically acts as if handlers could execute fully in parallel with the main code, rather than just interpolated within it as in reality, or as if the main thread could also interrupt the handler. The signal fences act as just cheaper versions of the thread fences, whereas they could really be providing stronger guarantees. And that model could also take advantage of things like omitting lock prefixes on x86.
That'd be valuable for writing kernel or bare-metal code for UP systems, assuming the implementation extended the same semantics to interrupt handlers.
Agreed, yes that's another gap in what ISO C/C++ provide; atomicity wrt. interrupts is often a lot cheaper than atomicity wrt. other cores. Not all ISAs guarantee that interrupts happen between two instructions (notably m68k with its pre-increment memory-indirect addressing, where apparently a context switch saves some microarchitectural internals as well as architectural state), but most do. C++ on m68k could simply use full-strength atomic RMW instructions while other ISAs could use cheaper ones.
This is amazing and awful and I love it and I hate it. Thank you very much. I think it's fair to treat this answer as saying "there is no real standard way to do it; just use the inline asm barrier", and I agree.
