
(I know there have been a few somewhat related questions asked in the past, but I wasn't able to find one about L1d cache misses and HyperThreading/SMT.)

After a couple of days of reading about some super interesting stuff like false sharing and the MESI/MOESI cache-coherence protocols, I decided to write a small "benchmark" in C (see below) to test false sharing in action.

I basically have an array of 8 doubles, so that it fits in one cache line, and two threads incrementing adjacent array positions.

At this point I should state that I am using a Ryzen 5 3600, whose topology can be seen here.

I create two threads, pin them to two different logical cores, and each accesses and updates its own array position, i.e. Thread A updates array[0] and Thread B updates array[1].

When I run the code using hardware threads #0 and #6, which belong to the same core and (as shown in the topology diagram) share the L1d cache, the execution time is ~5 seconds.

When I use threads #0 and #11, which don't have any caches in common, it takes ~9.5 seconds to complete. That time difference is expected, because in this case there is "cache line ping-pong" going on.

However, and this is what bugs me, when using threads #0 and #11 there are fewer L1d cache misses than when running with threads #0 and #6.

My guess is that with threads #0 and #11, which have no caches in common, when one thread updates the contents of the shared cache line, the MESI/MOESI protocol invalidates that line in the other core's cache. So even though there's ping-pong going on, there aren't that many cache misses (compared to running with threads #0 and #6), just a lot of invalidations and cache-line transfers between the cores.

So, when using threads #0 and #6, which share an L1d cache, why are there more cache misses?

(Threads #0 and #6 also share an L2 cache, but I don't think that matters here: when the cache line gets invalidated, it has to be fetched either from main memory (MESI) or from another core's cache (MOESI), so it seems impossible for the L2 to both hold the needed data and be asked for it.)

Of course, when one thread writes into the L1d cache line, the line becomes "dirty", but why does that matter? Shouldn't the other thread, residing on the same physical core, have no problem reading the new dirty value?

TL;DR: When testing false sharing, there are approximately 3x more L1d cache misses with two sibling threads (threads belonging to the same physical core) than with threads belonging to two different physical cores (2.34% vs. 0.75% miss rate; 396M vs. 118M misses in absolute numbers). Why is that happening?

(All statistics, such as L1d cache misses, are measured using the perf tool on Linux.)

Also, a minor secondary question: why are sibling threads paired with IDs 6 apart? I.e., thread 0's sibling is thread 6, and thread i's sibling is thread i+6. Does that help in any way? I've noticed this on both Intel and AMD CPUs.

I am super interested in computer architecture and I am still learning, so some of the above might be wrong; sorry for that.

So, this is my code: just creating two threads, binding them to specific logical cores, and then hitting adjacent cache-line locations.

#define _GNU_SOURCE

#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <sched.h>
#include <pthread.h>

struct timespec tstart, tend;
static cpu_set_t cpuset;

typedef struct arg_s
{
    int index;
    double *array_ptr;
} arg_t;

void *work(void *arg)
{
    // Both threads share the same two-CPU affinity set; the scheduler
    // places one thread on each of the two allowed logical cores.
    pthread_setaffinity_np(pthread_self(), sizeof(cpu_set_t), &cpuset);
    int array_index = ((arg_t*)arg)->index;
    double *ptr = ((arg_t*)arg)->array_ptr;

    for (unsigned long i = 0; i < 1000000000UL; i++)
    {
        // It doesn't matter which of these is used,
        // as long as we are hitting adjacent positions.
        ptr[array_index]++;
        // ptr[array_index] += 1.0e5 * 4;
    }
    return NULL;
}

int main(void)
{
    pthread_t tid[2];

    srand48(time(NULL)); // seed drand48(), which is used below

    static int cpu0 = 0;
    static int cpu6 = 6; // change this to, say, 11 to run with threads 0 and 11

    CPU_ZERO(&cpuset);
    CPU_SET(cpu0, &cpuset);
    CPU_SET(cpu6, &cpuset);

    double array[8]; // 8 doubles = 64 bytes = one cache line

    for (int i = 0; i < 8; i++)
        array[i] = drand48();

    arg_t *arg0 = malloc(sizeof(arg_t));
    arg_t *arg1 = malloc(sizeof(arg_t));

    arg0->index = 0; arg0->array_ptr = array;
    arg1->index = 1; arg1->array_ptr = array;

    clock_gettime(CLOCK_REALTIME, &tstart);

    pthread_create(&tid[0], NULL, work, (void*)arg0);
    pthread_create(&tid[1], NULL, work, (void*)arg1);

    pthread_join(tid[0], NULL);
    pthread_join(tid[1], NULL);

    clock_gettime(CLOCK_REALTIME, &tend);

    printf("elapsed: %.3f s\n", (tend.tv_sec - tstart.tv_sec)
                              + (tend.tv_nsec - tstart.tv_nsec) / 1e9);

    free(arg0);
    free(arg1);
    return 0;
}

I am using GCC 10.2.0, compiling with gcc -pthread p.c -o p.

Then I run perf record ./p --cpu=0,6, or the same thing with --cpu=0,11, when using threads 0,6 and 0,11 respectively.

Then I run perf stat -d ./p --cpu=0,6, or the same with --cpu=0,11 for the other case.

Running with Threads 0 and 6 :

Performance counter stats for './p --cpu=0,6':

           9437,29 msec task-clock                #    1,997 CPUs utilized          
                64      context-switches          #    0,007 K/sec                  
                 2      cpu-migrations            #    0,000 K/sec                  
               912      page-faults               #    0,097 K/sec                  
       39569031046      cycles                    #    4,193 GHz                      (75,00%)
        5925158870      stalled-cycles-frontend   #   14,97% frontend cycles idle     (75,00%)
        2300826705      stalled-cycles-backend    #    5,81% backend cycles idle      (75,00%)
       24052237511      instructions              #    0,61  insn per cycle         
                                                  #    0,25  stalled cycles per insn  (75,00%)
        2010923861      branches                  #  213,083 M/sec                    (75,00%)
            357725      branch-misses             #    0,02% of all branches          (75,03%)
       16930828846      L1-dcache-loads           # 1794,034 M/sec                    (74,99%)
         396121055      L1-dcache-load-misses     #    2,34% of all L1-dcache accesses  (74,96%)
   <not supported>     LLC-loads                                                   
   <not supported>     LLC-load-misses                                             

       4,725786281 seconds time elapsed

       9,429749000 seconds user
       0,000000000 seconds sys 

Running with Threads 0 and 11 :

 Performance counter stats for './p --cpu=0,11':

          18693,31 msec task-clock                #    1,982 CPUs utilized          
               114      context-switches          #    0,006 K/sec                  
                 1      cpu-migrations            #    0,000 K/sec                  
               903      page-faults               #    0,048 K/sec                  
       78404951347      cycles                    #    4,194 GHz                      (74,97%)
        1763001213      stalled-cycles-frontend   #    2,25% frontend cycles idle     (74,98%)
       71054052070      stalled-cycles-backend    #   90,62% backend cycles idle      (74,98%)
       24055983565      instructions              #    0,31  insn per cycle         
                                                  #    2,95  stalled cycles per insn  (74,97%)
        2012326306      branches                  #  107,650 M/sec                    (74,96%)
            553278      branch-misses             #    0,03% of all branches          (75,07%)
       15715489973      L1-dcache-loads           #  840,701 M/sec                    (75,09%)
         118455010      L1-dcache-load-misses     #    0,75% of all L1-dcache accesses  (74,98%)
   <not supported>      LLC-loads                                                   
   <not supported>      LLC-load-misses                                             

       9,430223356 seconds time elapsed

      18,675328000 seconds user
       0,000000000 seconds sys
Comments:

  • paired in IDs 6 numbers apart? - It's not a fixed 6, it's enumerating 1 core of each physical core, then coming back and doing the other. So using the low half of the core numbers is one thread per physical core. I don't know whether / how the BIOS influences how different kernels enumerate cores, but on some systems both logical cores of each physical core are enumerated by Linux before moving on, vs. this layout on other systems. Both are equally valid. Commented Apr 20, 2021 at 2:21
  • Can you include a minimal reproducible example test case to see exactly what operations you're doing? Commented Apr 20, 2021 at 2:34
  • You said there are 2% more misses but your numbers show way more: 396m vs 118m? Commented Apr 20, 2021 at 16:49
  • @PeterCordes I ran the program on another machine that has an Intel i3-6100 CPU and the results were like the ones you saw yourself. (Sorry for providing the data in images, but now I am away from that machine.) That's definitely strange. Commented Apr 21, 2021 at 11:33

1 Answer


Perf does not yet support detailed analysis of cache behavior on AMD (it partially does since kernel 6.1). You can use AMD's uProf to investigate in more detail: https://www.amd.com/en/developer/uprof.html


Comments:

Are you saying the events perf does support give wrong counts, or that the names are mapped to the wrong HW events so it's actually counting something else? Because it does report counts for L1-dcache-loads and L1-dcache-load-misses on the OP's CPU.
No, I am not saying perf doesn't count or gives wrong results; I am saying it does not support other metrics, as you can see in the OP's post ('<not supported> LLC-loads', '<not supported> LLC-load-misses'), hence I was suggesting another tool to get more insight for the analysis.
Fair enough, useful info, but that doesn't really answer this specific question, unfortunately. The OP was primarily interested in the L1d counts, and used perf stat -d to get them, which also enables LLC counts. The actual question is about L1d miss counts in false sharing between logical cores on the same or different physical cores. LLC and L2 counts could perhaps figure out where memory traffic is going, but it's still weird that there'd be L1d misses at all in the case of sharing a physical core.
Within a single multi-threaded core, each thread has to maintain coherence with the other thread(s); on this specific AMD hardware, that overhead is probably bigger than maintaining coherence with threads on other cores.
Yeah, on Intel CPUs you get lots of machine_clears.memory_ordering, like 360M vs. 40M, as discussed in comments on another Q&A: What are the latency and throughput costs of producer-consumer sharing of a memory location between hyper-siblings versus non-hyper siblings? So it's probably slower on AMD as well, but why would that involve any L1d misses?
