
This question does not assume any specific architecture. Assume that we have a multicore processor with cache coherence, out-of-order execution, and branch prediction logic. We also assume that stores to memory are strictly in program order.

We have two threads running in parallel, each on a separate core.

Below are the threads’ pseudo-code. data and flag are initially 0.

Thread #1 code:

data=10;
flag=1;

Thread #2 code:

while(!flag);
print data;

With proper synchronization, Thread #2 would eventually print 10. However, the branch predictor could predict that the loop body is not entered, and thus perform a speculative read of data, which still contains 0 at that point (before Thread #1 sets it). Suppose the prediction turns out to be correct, i.e. flag has already been set to 1 by the time its actual value is read. In that case the print data instruction can be retired, but it prints the stale value 0.

The question is whether a memory barrier would somehow prevent the speculative read of data and cause the CPU to execute the busy-wait loop properly. An alternative solution could be to let the branch predictor do its work but snoop the writes done by the other core; if a write to data is detected, the ROB could be used to squash the premature read (and its dependent instructions) and re-execute with the correct data.

Arch-specific answers are also welcome.

1 Answer

No, branch prediction + speculative execution is fine in an ISA with memory barriers, as long as mis-speculation is killed properly.

thus perform a speculative read of data, which contains 0 at that time

When the CPU detects the misprediction, instructions from the mis-speculated path of execution are discarded, along with their effects on architectural registers.

When the correct path of execution does eventually exit the loop, then the memory barrier will run (again), then the load of data will run (again). The fact that they earlier ran in the shadow of a mis-predicted branch has no effect.

Your pseudo-code isn't very clear because it makes print data look like a single operation. In fact it will involve a load of data into a register and then a call to a print routine.

When the data load runs on the correct path, it will have to redo the work of reading a value from cache, and cache is coherent across cores. It doesn't matter if the mis-speculated load brought the cache line into this core's L1d cache; a store by another core will have to invalidate it before that store can become globally visible.

The loop exits after seeing flag != 0; the barrier after that makes sure that later loads haven't already happened, giving acquire semantics to the load of flag (assuming the barrier blocks LoadLoad reordering).

The barrier executing on the correct path makes sure that this core waits for that invalidation instead of using an early load.

A store / release barrier in the writer makes sure that the new data value is globally visible before flag = 1 becomes visible to any other thread on any core.


7 Comments

First, thank you for your amazing contributions on everything low-level on SOF. I learn a lot from your educational answers. There might be some ambiguity in my question. It’s assumed that by the time that the actual value of flag is read, it has already been set to 1 by the writer, so the original prediction is correct by not entering the loop, and no mis-prediction logic is carried out. In which case I would assume that data is 0, unless somehow reloaded. Hope I’m making myself clearer this time.
@DanielNitzan: If there's a barrier (explicit on a weakly-ordered ISA, or implicit on x86 where every load is an acquire load) between reading flag and reading data, then the CPU isn't allowed to use any value it read for data that isn't still valid by the time it actually reads the value of flag from cache. Modern x86 CPUs do speculatively do later loads, and handle this problem with a memory ordering machine clear rollback of out-of-order state if data isn't still valid by the time the architectural memory model allows them to read it.
Or on a weakly ordered ISA, a barrier instruction would probably just block exec of the later load until the flag load has actually taken a value. Speculative execution past a branch would just let the fence start waiting right away, like the case where there's no branch. This applies whether or not you're branching on it. I don't see how branch prediction could be relevant if you're only considering the correctly-predicted case, and with no data dependency between one load and the next (like for mo_consume).
@DanielNitzan: A message / directory-based MESI implementation like real-world modern CPU use doesn't have to actually snoop. Another core wanting to write will have to actively do an RFO (read for ownership) to gain exclusive ownership (Exclusive / Modified state) of the cache line before it can commit a store from its store buffer to its L1d cache. That means this core's Shared copy will be invalidated. A speculative early load could monitor the line for that. Or if you have an actual barrier instruction before the read of data, the load won't happen early in the first place.
@DanielNitzan: Right, since x86 has strong load ordering that would hurt performance in normal single-threaded code working on private data, real implementations speculatively load out of order. On weakly-ordered ISAs, they can just load out of order all the time without having to track the status of that cache line until retirement.
