I wanted a reliable benchmark with lots of cache misses, and the obvious go-to seemed to be a pointer-chasing benchmark. I ended up trying google/multichase, but the results were not what I expected.
Multichase is essentially your classic pointer-chasing benchmark, except the authors have gone to great lengths to make sure the hardware prefetcher fails. It reports the average latency of dereferencing the pointer, and has an option to interleave "work", which is basically just an empty loop. The relevant code snippet is as follows:
static void chase_work(per_thread_t *t) {
  void *p = t->x.cycle[0];
  size_t extra_work = strtoul(t->x.extra_args, 0, 0);
  size_t work = 0;
  size_t i;

  // the extra work is intended to be overlapped with a dereference,
  // but we don't want it to skip past the next dereference. so
  // we fold in the value of the pointer, and launch the deref then
  // go into a loop performing extra work, hopefully while the
  // deref occurs.
  do {
    x25(work += (uintptr_t)p; p = *(void **)p;
        // (I uncomment the following line for the second benchmark;
        // it was not a part of the original source code)
        //work += (uintptr_t)p;
        for (i = 0; i < extra_work; ++i) { work ^= i; })
  } while (__sync_add_and_fetch(&t->x.count, 25));

  // we never actually reach here, but the compiler doesn't know that
  t->x.cycle[0] = p;
  t->x.dummy = work;
}
(x25 is a macro which repeats the argument 25 times, so x25(y) is just y y y .... y 25 times)
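For concreteness, here is a minimal sketch of such a repetition macro; multichase's actual definitions may differ slightly, but the effect is the same:

```c
#include <stddef.h>

/* Sketch of a 25x repetition macro: x25(y) pastes its argument 25 times.
 * (Illustrative; not copied from the multichase source.) */
#define x5(y)  y y y y y
#define x25(y) x5(x5(y))

/* Small self-check: expanding an increment 25 times. */
static int count25(void) {
  int n = 0;
  x25(n++;)
  return n;
}
```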
extra_work is passed in as a parameter, and in an ideal scenario the processor should execute the "work" loop out of order while the value of the pointer is fetched. This also means that as extra_work increases, the runtime should initially stay roughly flat (the loop hides in the shadow of the cache miss), and then beyond some point grow linearly with extra_work, once the cache misses are no longer the bottleneck.
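The expected shape can be modeled as "per-chase latency is roughly the larger of the miss latency and the cost of the work loop". A hypothetical sketch (the numbers fed in are illustrative, not measurements):

```c
#include <stddef.h>

/* Hypothetical model of the expected curve: while the work loop fits in
 * the shadow of the cache miss, latency stays flat; beyond that point it
 * grows linearly with extra_work. miss_ns and ns_per_iter are made-up
 * illustrative parameters, not measured values. */
static double expected_latency_ns(size_t extra_work, double miss_ns,
                                  double ns_per_iter) {
  double work_ns = (double)extra_work * ns_per_iter;
  return work_ns > miss_ns ? work_ns : miss_ns;
}
```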
Unfortunately there was no such flat region at the start; the runtime was linear throughout. This made me suspect that the loop is not actually run while the processor waits for the pointer dereference.
To confirm this, I added a work += (uintptr_t)p; line just after the dereference but before the loop, as shown in the code above. This makes the loop depend on the dereference, so the loop cannot proceed until the load completes. And the code already had the property that the next dereference cannot proceed until the loop finishes.
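To make the dependency explicit, here is a standalone sketch of one chase step with the extra serializing line (a hypothetical helper; names mirror the snippet above):

```c
#include <stddef.h>
#include <stdint.h>

/* One chase step, with the added line that serializes the work loop
 * behind the load (illustrative standalone version, not multichase code). */
static void *chase_step(void *p, size_t extra_work, size_t *workp) {
  size_t work = *workp;
  size_t i;
  work += (uintptr_t)p;   /* fold in the current pointer */
  p = *(void **)p;        /* launch the dereference (the cache miss) */
  work += (uintptr_t)p;   /* the added line: work now depends on the
                             loaded value, so the xor chain below cannot
                             start until the load completes */
  for (i = 0; i < extra_work; ++i) work ^= i;
  *workp = work;
  return p;
}
```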
So, if the loop was executing out of order with the dereference before, one would expect a sharp rise in the runtime with this change. However, that is not observed; the rise is barely a few ns. I have attached the graph below, where I'm varying extra_work from 1 to 500.
My processor is an 11th Gen Intel(R) Core(TM) i5-11400H and I'm on Ubuntu 22.04. It seems obvious to me that the loop should be executed out of order with the dereference, but that is clearly not happening, hence my questions are:
(1) Why isn't it happening? (2) How do I fix this so the benchmark works as expected?
(To reproduce: clone the multichase GitHub repo, make, and run ./multichase -c work:N.)


Comments:

- cc (so GCC on Ubuntu), -O3 -fno-unroll-loops. Can you confirm you enabled optimization for your testing?
- __sync_add_and_fetch is a full memory barrier on x86 due to lock add, so store/reload of local vars in the extra_work loop couldn't overlap across iterations. Although it should still run in the shadow of each pointer-chasing load. Oh, but there's only one such barrier for every 25 pointer derefs, so IDK. Still, extra stores/reloads could matter when trying to figure out what's going on.
- Try changing x25 to x200 (and the value in the add-and-fetch) and see if similar results persist.
- With -c work:0 I get about 62.3 cycles. With work:20 I still get 62.3 cycles, fully in the shadow of the load. work:50: 68 to 69 cycles (at 2.8 to 3.2 GHz). work:70: about 77 cycles. (SKL ROB capacity is 224 uops; the inner loops are 4 or 3 uops per iteration (the x25 manually unrolled loops aren't all the same), and there are 2 uops of NOPs to align the top of the loop, plus an xor-zero.)
- [...] work by x25 or x200 in terms of how many instructions the ROB has to hide for it to run in the shadow of a cache-miss load. You could post that as an answer. An asm("" : "+r"(work) :: "memory") statement could probably fix that.
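For reference, a minimal sketch of the empty-asm statement suggested in the last comment, applied to a standalone version of the work loop (assumes GCC/Clang; __asm__ __volatile__ is the strictly-conforming spelling of asm volatile):

```c
#include <stddef.h>

/* The asm emits no instructions, but "+r"(work) tells the compiler the
 * value is read and rewritten in a register each iteration, and "memory"
 * keeps it from caching loads/stores across the statement. This is an
 * illustrative standalone loop, not a patch to multichase itself. */
static size_t spin_work(size_t extra_work) {
  size_t work = 0;
  size_t i;
  for (i = 0; i < extra_work; ++i) {
    work ^= i;
    __asm__ __volatile__("" : "+r"(work) : : "memory");
  }
  return work;
}
```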