
In Paul McKenney's famous paper "Memory Barriers: A Hardware View for Software Hackers":

3.3 Store Buffers and Memory Barriers

To see the second complication, a violation of global memory ordering, consider the following code sequences with variables “a” and “b” initially zero:

void foo(void)
{
    a = 1;
    b = 1;
}

void bar(void)
{
    while (b == 0) continue;
    assert(a == 1);
}

Suppose CPU 0 executes foo() and CPU 1 executes bar(). Suppose further that the cache line containing “a” resides only in CPU 1’s cache, and that the cache line containing “b” is owned by CPU 0. Then the sequence of operations might be as follows:

  1. CPU 0 executes a=1. The cache line is not in CPU 0’s cache, so CPU 0 places the new value of “a” in its store buffer and transmits a “read invalidate” message.

  2. CPU 1 executes while(b==0)continue, but the cache line containing “b” is not in its cache. It therefore transmits a “read” message.

  3. CPU 0 executes b=1. It already owns this cache line (in other words, the cache line is already in either the “modified” or the “exclusive” state), so it stores the new value of “b” in its cache line.

  4. CPU 0 receives the “read” message, and transmits the cache line containing the now-updated value of “b” to CPU 1, also marking the line as “shared” in its own cache.

  5. CPU 1 receives the cache line containing “b” and installs it in its cache.

  6. CPU 1 can now finish executing while(b==0) continue, and since it finds that the value of “b” is 1, it proceeds to the next statement.

  7. CPU 1 executes the assert(a==1), and, since CPU 1 is working with the old value of “a”, this assertion fails.

  8. CPU 1 receives the “read invalidate” message, and transmits the cache line containing “a” to CPU 0 and invalidates this cache line from its own cache. But it is too late.

  9. CPU 0 receives the cache line containing “a” and applies the buffered store just in time to fall victim to CPU 1’s failed assertion.

Step 1: CPU0 sends "read invalidate" to CPU1

Step 5: CPU1 receives value of b from CPU0 in response to CPU1's earlier (step 2) "read" message

Step 8: CPU1 receives the "read invalidate" message from step 1

How can step 8 happen after step 5?

In both steps 5 and 8, CPU1 is receiving messages from CPU0. But notice that CPU0 sends the "read invalidate" message before it ACKs CPU1's "read" message (for b).

If CPU1 has an incoming message queue that is processed in order, then CPU1 has to process CPU0's "read invalidate" message before it processes CPU0's response to the "read" of b. Doesn't it?


2 Answers


I think it's because CPU1's read has started, and it will complete (which involves actually waiting for and receiving the value requested) before going on to process any incoming message. I.e. CPU1's read is a blocking operation that needs to complete before any other cache management, such as processing CPU0's read-invalidate message, is carried out.

I'm not an expert in this, but that's an explanation that fits the steps!


2 Comments

Thanks @bazza! This makes a lot of sense. That the read is blocking and synchronous can certainly explain the behavior. That being said, won't modern CPUs proceed to speculatively execute the next instruction(s) rather than stalling on a read that could take hundreds of cycles?
@Weipeng, about the only thing going for my answer is that it makes sense of the steps; it does not mean that it's actually correct! Yes, this is the kind of reason why speculative execution is designed in, making it difficult for designers to be sure they've not made their CPU vulnerable in some way (Meltdown, Spectre). This is why some people (inc myself) think that the time of SMP / cache-coherent CPUs should pass, and return to NUMA architectures like Transputer or the Cell processor; push this kind of thing up to the software stack.

It doesn't have to be a single message queue. It could have multiple queues, one for each cache line, and the out-of-order handling of messages could be because the queues have different latencies, or because of how CPU1 prioritizes its tasks.

