
In Paul McKenney's famous paper "Memory Barriers: A Hardware View for Software Hackers":

3.3 Store Buffers and Memory Barriers

To see the second complication, a violation of global memory ordering, consider the following code sequences with variables “a” and “b” initially zero:

void foo(void)
{
    a = 1;
    b = 1;
}

void bar(void)
{
    while (b == 0) continue;
    assert(a == 1);
}

Suppose CPU 0 executes foo() and CPU 1 executes bar(). Suppose further that the cache line containing “a” resides only in CPU 1’s cache, and that the cache line containing “b” is owned by CPU 0. Then the sequence of operations might be as follows:

  1. CPU 0 executes a=1. The cache line is not in CPU 0’s cache, so CPU 0 places the new value of “a” in its store buffer and transmits a “read invalidate” message.

  2. CPU 1 executes while(b==0)continue, but the cache line containing “b” is not in its cache. It therefore transmits a “read” message.

  3. CPU 0 executes b=1. It already owns this cache line (in other words, the cache line is already in either the “modified” or the “exclusive” state), so it stores the new value of “b” in its cache line.

  4. CPU 0 receives the “read” message, and transmits the cache line containing the now-updated value of “b” to CPU 1, also marking the line as “shared” in its own cache.

  5. CPU 1 receives the cache line containing “b” and installs it in its cache.

  6. CPU 1 can now finish executing while(b==0) continue, and since it finds that the value of “b” is 1, it proceeds to the next statement.

  7. CPU 1 executes the assert(a==1), and, since CPU 1 is working with the old value of “a”, this assertion fails.

  8. CPU 1 receives the “read invalidate” message, and transmits the cache line containing “a” to CPU 0 and invalidates this cache line from its own cache. But it is too late.

  9. CPU 0 receives the cache line containing “a” and applies the buffered store just in time to fall victim to CPU 1’s failed assertion.

Step 1: CPU0 sends "read invalidate" to CPU1

Step 5: CPU1 receives value of b from CPU0 in response to CPU1's earlier (step 2) "read" message

Step 8: CPU1 receives the "read invalidate" message from step 1

How can step 8 happen after step 5?

In both steps 5 and 8, CPU1 is receiving messages from CPU0. But notice that CPU0 sends the "read invalidate" message before it ACKs CPU1's "read" message (for b).

If CPU1 has an incoming message queue that is processed in order, then CPU1 has to process CPU0's "read invalidate" message before it processes CPU0's response to the "read" of b. Doesn't it?


2 Answers


I think it's because CPU1's read has started, and it will complete (which involves actually waiting for and receiving the value requested) before going on to process any incoming message. I.e. CPU1's read is a blocking operation that needs to complete before any other cache management, such as processing CPU0's read-invalidate message, is carried out.

I'm not an expert in this, but that's an explanation that fits the steps!


2 Comments

Thanks @bazza! This makes a lot of sense. That the read is blocking and synchronous can certainly explain the behavior. That being said, won't modern CPUs proceed to speculatively execute the next instruction(s) rather than stalling on a read that could take hundreds of cycles?
@Weipeng, about the only thing going for my answer is that it makes sense of the steps; it does not mean that it's actually correct! Yes, this is the kind of reason why speculative execution is designed in, making it difficult for designers to be sure they've not made their CPU vulnerable in some way (Meltdown, Spectre). This is why some people (inc myself) think that the time of SMP / cache-coherent CPUs should pass, and return to NUMA architectures like Transputer or the Cell processor; push this kind of thing up to the software stack.

It doesn't have to be a single message queue. It could have multiple queues, one for each cache line, and the out-of-order handling of messages could be because the queues have different latencies, or because of how CPU1 prioritizes its tasks.

