
I am experimenting with a program to see if its caching behaviour is consistent with my conceptual understanding.

To do this I am using the Perf command:

perf stat -e cache-misses ./a.out

to record the cache-miss ratio of the following simple C program:

#include <stdlib.h>

int main() {
    int N = 10000;
    double *arr = malloc(sizeof(double) * N * N);

    for(int i = 0; i < N; i++) {
        for(int j = 0; j < N; j++){
            arr[i * N + j] = 10.0;
        }
    }
    return 0;
}

I get a cache-miss ratio of 50.212%. If I change the array access pattern as follows:

arr[j * N + i]

I get that the cache-miss ratio is 22.206%.

These results are surprising to me.

  1. The cache-miss ratio of 50.212% seems very high for such a simple program with a very regular memory access pattern. I would expect it to be closer to 1/(num-words-per-cache-line), which is definitely smaller than 1/2. Why is the cache-miss ratio so high?
  2. My (limited) understanding of memory suggests that iterating over the array in column-major order should result in much worse caching behaviour but the results I am getting indicate the opposite. What's going on?
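For concreteness, the estimate in point 1 can be made explicit. This sketch assumes a 64-byte cache line (typical for x86; check with `getconf LEVEL1_DCACHE_LINESIZE` on Linux), and `expected_miss_ratio` is just an illustrative helper name:

```c
#include <stdio.h>

/* Back-of-the-envelope estimate: a sequential scan misses once per
   cache line and then hits for the rest of the line, so the expected
   miss ratio is sizeof(double) divided by the line size in bytes. */
static double expected_miss_ratio(int line_bytes) {
    return (double)sizeof(double) / (double)line_bytes;
}

int show_estimate(void) {
    printf("expected miss ratio ~ %.1f%%\n",
           100.0 * expected_miss_ratio(64));
    return 0;
}
```

With 8-byte doubles and a 64-byte line this gives 8/64 = 0.125, i.e. roughly 12.5%, well below the observed 50%.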
  • What CPU do you have? cat /proc/cpuinfo will tell you the cache information too. Commented Dec 14, 2017 at 11:38
  • @JonathonReinhart it is an Intel(R) Xeon(R) CPU and the cache size is 12288 KB. Commented Dec 14, 2017 at 11:59
  • Did you generate the program with optimizations enabled? Commented Dec 14, 2017 at 12:41
  • Out of curiosity, does double (*arr)[N] = malloc( sizeof(double[N][N]) ); ... arr[i][j] = 10.0 generate faster code? Commented Dec 14, 2017 at 12:46
  • Please show the assembly generated for the two sources (GCC’s -S switch) and all the compiler switches you used. (Optimization must not have been enabled, because this program is equivalent to int main() {}.) Does perf start counting from the start of main, or does it include initialization in the C run-time start-up code? Does it include instruction cache misses? Commented Dec 14, 2017 at 13:15

1 Answer


The answer is pretty simple: the compiler optimizes out your assignments. Here is what the disassembly looks like for your code:

10  int main() {
   0x00000000004004d6 <+0>: mov    $0x2710,%edx
   0x00000000004004db <+5>: jmp    0x4004e7 <main+17>

15          for(int j = 0; j < N; j++){
   0x00000000004004dd <+7>: sub    $0x1,%eax
   0x00000000004004e0 <+10>:    jne    0x4004dd <main+7>

14      for(int i = 0; i < N; i++) {
   0x00000000004004e2 <+12>:    sub    $0x1,%edx
   0x00000000004004e5 <+15>:    je     0x4004ee <main+24>

10  int main() {
   0x00000000004004e7 <+17>:    mov    $0x2710,%eax
   0x00000000004004ec <+22>:    jmp    0x4004dd <main+7>

16              arr[i * N + j] = 10.0;
17          }
18      }
19      return 0;
20  }
   0x00000000004004ee <+24>:    mov    $0x0,%eax
   0x00000000004004f3 <+29>:    retq   

As you can see, no assembly instructions are generated for the line arr[i * N + j] = 10.0;, so the cache misses you observe with perf are unrelated to your loop.

The fix is quite easy. Simply add volatile to the pointer declaration, forcing the compiler to generate the stores:

volatile double *arr = malloc(sizeof(double) * N * N);

The disassembly now:

10  int main() {
   0x0000000000400526 <+0>: sub    $0x8,%rsp

11      int N = 10000;
12      volatile double *arr = malloc(sizeof(double) * N * N);
   0x000000000040052a <+4>: mov    $0x2faf0800,%edi
   0x000000000040052f <+9>: callq  0x400410 <malloc@plt>
   0x0000000000400534 <+14>:    mov    $0x0,%edx

16              arr[i * N + j] = 10.0;
   0x0000000000400539 <+19>:    movsd  0xc7(%rip),%xmm0        # 0x400608
   0x0000000000400541 <+27>:    jmp    0x40055f <main+57>
   0x0000000000400543 <+29>:    movslq %edx,%rcx
   0x0000000000400546 <+32>:    lea    (%rax,%rcx,8),%rcx
   0x000000000040054a <+36>:    movsd  %xmm0,(%rcx)
   0x000000000040054e <+40>:    add    $0x1,%edx

15          for(int j = 0; j < N; j++){
   0x0000000000400551 <+43>:    cmp    %esi,%edx
   0x0000000000400553 <+45>:    jne    0x400543 <main+29>
   0x0000000000400555 <+47>:    mov    %esi,%edx

14      for(int i = 0; i < N; i++) {
   0x0000000000400557 <+49>:    cmp    $0x5f5e100,%esi
   0x000000000040055d <+55>:    je     0x400567 <main+65>
   0x000000000040055f <+57>:    lea    0x2710(%rdx),%esi
   0x0000000000400565 <+63>:    jmp    0x400543 <main+29>

17          }
18      }
19      return 0;
20  }
   0x0000000000400567 <+65>:    mov    $0x0,%eax
   0x000000000040056c <+70>:    add    $0x8,%rsp
   0x0000000000400570 <+74>:    retq   

And there are many more cache misses as well.
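Putting the fix together, the loop can be written against a volatile pointer. A minimal sketch, factored into a helper so it can be tested on its own (`fill_array` is just an illustrative name, not from the question):

```c
#include <stdlib.h>

/* The volatile qualifier on the element type forces the compiler to
   emit a store for every assignment instead of eliminating the loop
   as dead code. */
static void fill_array(volatile double *arr, long n) {
    for (long i = 0; i < n; i++)
        for (long j = 0; j < n; j++)
            arr[i * n + j] = 10.0;  /* row-major: conceptually arr[i][j] */
}
```

In the question's program you would call fill_array(arr, N) after the malloc (and free((void *)arr) before returning); switching the index expression to `arr[j * n + i]` inside the helper gives the column-major variant.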

Running your test just once can give very noisy results. You should repeat the measurement a few times and take the median. There are also benchmarking frameworks, such as Google Benchmark, that are worth a look.
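As a sketch, perf itself can handle the repetition: the -r/--repeat option runs the workload several times and reports the mean with its variance, and counting cache-references alongside cache-misses makes perf print the miss ratio directly:

```shell
# Run the benchmark 10 times; perf prints averaged counts and
# cache-misses as a percentage of all cache references.
perf stat -r 10 -e cache-references,cache-misses ./a.out
```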

One last point: both of your access patterns, i * N + j and j * N + i, are simple strided accesses that the CPU's hardware prefetcher can recognize, so once the stores are actually generated, the cache-miss ratios in the two cases may be closer than you expect.
