I wanted a reliable benchmark with lots of cache misses, and the obvious go-to seemed to be a pointer-chasing benchmark. I ended up trying google/multichase, but the results were not what I expected.
Multichase is essentially your classic pointer-chasing benchmark, except the authors have gone to great lengths to make sure the hardware prefetcher fails. It reports the average latency of dereferencing the pointer, and has an option to interleave "work", which is basically just an empty loop. The relevant code snippet is as follows:
static void chase_work(per_thread_t *t) {
  void *p = t->x.cycle[0];
  size_t extra_work = strtoul(t->x.extra_args, 0, 0);
  size_t work = 0;
  size_t i;

  // the extra work is intended to be overlapped with a dereference,
  // but we don't want it to skip past the next dereference. so
  // we fold in the value of the pointer, and launch the deref then
  // go into a loop performing extra work, hopefully while the
  // deref occurs.
  do {
    x25(work += (uintptr_t)p; p = *(void **)p;
        // (I uncomment the following line for the second benchmark;
        // it was not a part of the original source code)
        //work += (uintptr_t)p;
        for (i = 0; i < extra_work; ++i) { work ^= i; })
  } while (__sync_add_and_fetch(&t->x.count, 25));

  // we never actually reach here, but the compiler doesn't know that
  t->x.cycle[0] = p;
  t->x.dummy = work;
}
(x25 is a macro which repeats the argument 25 times, so x25(y) is just y y y .... y 25 times)
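For concreteness, here is a minimal sketch of such a repetition macro; multichase's actual definitions may differ slightly, but the effect is the same:

```c
#include <stddef.h>

/* Sketch of a 25x repetition macro: x25(y) pastes its argument 25 times.
 * (Illustrative; not copied from the multichase source.) */
#define x5(y)  y y y y y
#define x25(y) x5(x5(y))

/* Small self-check: expanding an increment 25 times. */
static int count25(void) {
  int n = 0;
  x25(n++;)
  return n;
}
```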
extra_work is passed in as a parameter, and in an ideal scenario the processor should execute the "work" loop out of order while the value of the pointer is fetched. This also means that as extra_work increases, the runtime should initially stay roughly flat (the loop hides in the shadow of the cache miss), and then beyond some point grow linearly with extra_work, once the cache misses are no longer the bottleneck.
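The expected shape can be modeled as "per-chase latency is roughly the larger of the miss latency and the cost of the work loop". A hypothetical sketch (the numbers fed in are illustrative, not measurements):

```c
#include <stddef.h>

/* Hypothetical model of the expected curve: while the work loop fits in
 * the shadow of the cache miss, latency stays flat; beyond that point it
 * grows linearly with extra_work. miss_ns and ns_per_iter are made-up
 * illustrative parameters, not measured values. */
static double expected_latency_ns(size_t extra_work, double miss_ns,
                                  double ns_per_iter) {
  double work_ns = (double)extra_work * ns_per_iter;
  return work_ns > miss_ns ? work_ns : miss_ns;
}
```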
Unfortunately there was no such flat region at the start; the runtime was linear throughout. This made me suspect that the loop is not actually run while the processor waits for the pointer dereference.
To confirm this, I added a work += (uintptr_t)p; line just after the dereference but before the loop, as shown in the code above. This makes the loop depend on the dereference, so the loop cannot proceed until the load completes. And the code already had the property that the next dereference cannot proceed until the loop finishes.
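To make the dependency explicit, here is a standalone sketch of one chase step with the extra serializing line (a hypothetical helper; names mirror the snippet above):

```c
#include <stddef.h>
#include <stdint.h>

/* One chase step, with the added line that serializes the work loop
 * behind the load (illustrative standalone version, not multichase code). */
static void *chase_step(void *p, size_t extra_work, size_t *workp) {
  size_t work = *workp;
  size_t i;
  work += (uintptr_t)p;   /* fold in the current pointer */
  p = *(void **)p;        /* launch the dereference (the cache miss) */
  work += (uintptr_t)p;   /* the added line: work now depends on the
                             loaded value, so the xor chain below cannot
                             start until the load completes */
  for (i = 0; i < extra_work; ++i) work ^= i;
  *workp = work;
  return p;
}
```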
So, if the loop was executing out of order with the dereference before, one would expect a sharp rise in the runtime with this change. However, that is not observed; the rise is barely a few ns. I have attached the graph below, where I'm varying extra_work from 1 to 500.
My processor is an 11th Gen Intel(R) Core(TM) i5-11400H and I'm on Ubuntu 22.04. It seems obvious to me that the loop should be executed out of order with the dereference, but that is clearly not happening, hence my questions are:
(1) Why isn't it happening? (2) How do I fix this so the benchmark works as expected?
(To reproduce: clone the multichase GitHub repo, make, and run ./multichase -c work:N.)


Comments:

- cc (so GCC on Ubuntu), -O3 -fno-unroll-loops. Can you confirm you enabled optimization for your testing?
- __sync_add_and_fetch is a full memory barrier on x86 due to lock add, so store/reload of local vars in the extra_work loop couldn't overlap across iterations. Although it should still run in the shadow of each pointer-chasing load. Oh, but there's only one such barrier for every 25 pointer derefs, so IDK. Still, extra stores/reloads could matter when trying to figure out what's going on.
- Try changing x25 to x200 (and the value in the add-and-fetch) and see if similar results persist.
- With -c work:0 I get about 62.3 cycles. With work:20 I still get 62.3 cycles, fully in the shadow of the load. work:50: 68 to 69 cycles (at 2.8 to 3.2 GHz). work:70: about 77 cycles. (SKL ROB capacity is 224 uops; the inner loops are 4 or 3 uops per iteration (the x25 manually unrolled loops aren't all the same), and there are 2 uops of NOPs to align the top of the loop, plus an xor-zero.)
- [...] work by x25 or x200 in terms of how many instructions the ROB has to hide for it to run in the shadow of a cache-miss load. You could post that as an answer. An asm("" : "+r"(work) :: "memory") statement could probably fix that.
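For reference, a minimal sketch of the empty-asm statement suggested in the last comment, applied to a standalone version of the work loop (assumes GCC/Clang; __asm__ __volatile__ is the strictly-conforming spelling of asm volatile):

```c
#include <stddef.h>

/* The asm emits no instructions, but "+r"(work) tells the compiler the
 * value is read and rewritten in a register each iteration, and "memory"
 * keeps it from caching loads/stores across the statement. This is an
 * illustrative standalone loop, not a patch to multichase itself. */
static size_t spin_work(size_t extra_work) {
  size_t work = 0;
  size_t i;
  for (i = 0; i < extra_work; ++i) {
    work ^= i;
    __asm__ __volatile__("" : "+r"(work) : : "memory");
  }
  return work;
}
```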