
This question does not assume any specific architecture. Assume that we have a multicore processor with cache coherence, out-of-order execution, and branch prediction logic. We also assume that stores to memory are strictly in program order.

We have two threads running in parallel, each on a separate core.

Below are the threads’ pseudo-code. data and flag are initially 0.

Thread #1 code:

data=10;
flag=1;

Thread #2 code:

while(!flag);
print data;

With proper synchronization, Thread #2 would eventually print 10. However, the branch predictor could predict that the loop body is not entered, and thus perform a speculative read of data, which still contains 0 at that point (before Thread #1 sets it). Suppose the prediction turns out to be correct, i.e. flag has already been set to 1 by the time its actual value is read. In that case the print data instruction can be retired, but it prints the stale value 0.

The question is whether a memory barrier would somehow prevent the speculative read of data and cause the CPU to execute the busy-wait loop properly. An alternative solution could be to let the branch predictor do its work but snoop the writes done by the other core; if a write to data is detected, the ROB could be used to squash the premature read (and its dependent instructions) and re-execute with the correct data.

Arch-specific answers are also welcome.

1 Answer

No, branch prediction + speculative execution is fine in an ISA with memory barriers, as long as mis-speculation is killed properly.

thus perform a speculative read of data, which contains 0 at that time

When the CPU detects the misprediction, instructions from the mis-speculated path of execution are discarded, along with their effects on architectural registers.

When the correct path of execution does eventually exit the loop, then the memory barrier will run (again), then the load of data will run (again). The fact that they earlier ran in the shadow of a mis-predicted branch has no effect.

Your pseudo-code isn't very clear because it makes print data look like a single operation. In fact it will involve a load of data into a register and then a call to a print routine.

When the data load runs on the correct path, it will have to redo the work of reading a value from cache, and cache is coherent across cores. It doesn't matter if the mis-speculated load brought the cache line into this core's L1d cache; a store by another core will have to invalidate it before that store can become globally visible.

The loop exits after seeing flag != 0; the barrier after that makes sure that later loads haven't already happened, giving acquire semantics to the load of flag (assuming the barrier blocks LoadLoad reordering).

The barrier executing on the correct path makes sure that this core waits for that invalidation instead of using an early load.

A store / release barrier in the writer makes sure that the new data value is globally visible before flag = 1 becomes visible to any other thread on any core.


7 Comments

First, thank you for your amazing contributions on everything low-level on SOF. I learn a lot from your educational answers. There might be some ambiguity in my question. It’s assumed that by the time that the actual value of flag is read, it has already been set to 1 by the writer, so the original prediction is correct by not entering the loop, and no mis-prediction logic is carried out. In which case I would assume that data is 0, unless somehow reloaded. Hope I’m making myself clearer this time.
@DanielNitzan: If there's a barrier (explicit on a weakly-ordered ISA, or implicit on x86 where every load is an acquire load) between reading flag and reading data, then the CPU isn't allowed to use any value it read for data that isn't still valid by the time it actually reads the value of flag from cache. Modern x86 CPUs do speculatively do later loads, and handle this problem with a memory ordering machine clear rollback of out-of-order state if data isn't still valid by the time the architectural memory model allows them to read it.
Or on a weakly ordered ISA, a barrier instruction would probably just block exec of the later load until the flag load has actually taken a value. Speculative execution past a branch would just let the fence start waiting right away, like the case where there's no branch. This applies whether or not you're branching on it. I don't see how branch prediction could be relevant if you're only considering the correctly-predicted case, and with no data dependency between one load and the next (like for mo_consume).
@DanielNitzan: A message / directory-based MESI implementation like real-world modern CPU use doesn't have to actually snoop. Another core wanting to write will have to actively do an RFO (read for ownership) to gain exclusive ownership (Exclusive / Modified state) of the cache line before it can commit a store from its store buffer to its L1d cache. That means this core's Shared copy will be invalidated. A speculative early load could monitor the line for that. Or if you have an actual barrier instruction before the read of data, the load won't happen early in the first place.
@DanielNitzan: Right, since x86 has strong load ordering that would hurt performance in normal single-threaded code working on private data, real implementations speculatively load out of order. On weakly-ordered ISAs, they can just load out of order all the time without having to track the status of that cache line until retirement.
