clflush to invalidate cache line via C function

Question

I am trying to use clflush to manually evicts a cache line in order to determine cache and line sizes. I didn't find any guide on how to use that instruction. All I see, are some codes that use higher level functions for that purpose.

There is a kernel function void clflush_cache_range(void *vaddr, unsigned int size), but still I don't know what to include in my code and how to use that. I don't know what is the size in that function.

More than that, how can I be sure that the line is evicted in order to verify the correctness of my code?

UPDATE:

Here is a initial code for what I am trying to do.

#include <immintrin.h>
#include <stdint.h>
#include <x86intrin.h>
#include <stdio.h>
int main()
{
  int array[ 100 ];
  /* will bring array in the cache */
  for ( int i = 0; i < 100; i++ )
    array[ i ] = i;

  /* FLUSH A LINE */
  /* each element is 4 bytes */
  /* assuming that cache line size is 64 bytes */
  /* array[0] till array[15] is flushed */
  /* even if line size is less than 64 bytes */
  /* we are sure that array[0] has been flushed */
  _mm_clflush( &array[ 0 ] );



  int tm = 0;
  register uint64_t time1, time2, time3;


  time1 = __rdtscp( &tm ); /* set timer */
  time2 = __rdtscp( &array[ 0 ] ) - time1; /* array[0] is a cache miss */
  printf( "miss latency = %lu \n", time2 );

  time3 = __rdtscp( &array[ 0 ] ) - time2; /* array[0] is a cache hit */
  printf( "hit latency = %lu \n", time3 );
  return 0;
}

Before running the code, I would like to manually verify that it is a correct code. Am I in the correct path? Did I use _mm_clflush correctly?

UPDATE:

Thanks to Peter's comment, I fixed the the code as follows

  time1 = __rdtscp( &tm ); /* set timer */
  time2 = __rdtscp( &array[ 0 ] ) - time1; /* array[0] is a cache miss */
  printf( "miss latency = %lu \n", time2 );
  time1 = __rdtscp( &tm ); /* set timer */
  time2 = __rdtscp( &array[ 0 ] ) - time1; /* array[0] is a cache hit */
  printf( "hit latency = %lu \n", time1 );

By running the code multiple times, I get the following output

$ ./flush
miss latency = 238
hit latency = 168
$ ./flush
miss latency = 154
hit latency = 140
$ ./flush
miss latency = 252
hit latency = 140
$ ./flush
miss latency = 266
hit latency = 252

The first run seems to be reasonable. But the second run looks odd. By running the code from the command line, every time the array is initialized with the values and then I explicitly evict the first line.

UPDATE4:

I tried Hadi-Brais code and here are the outputs

naderan@webshub:~$ ./flush3
address = 0x7ffec7a92220
array[ 0 ] = 0
miss section latency = 378
array[ 0 ] = 0
hit section latency = 175
overhead latency = 161
Measured L1 hit latency = 14 TSC cycles
Measured main memory latency = 217 TSC cycles
naderan@webshub:~$ ./flush3
address = 0x7ffedbe0af40
array[ 0 ] = 0
miss section latency = 392
array[ 0 ] = 0
hit section latency = 231
overhead latency = 168
Measured L1 hit latency = 63 TSC cycles
Measured main memory latency = 224 TSC cycles
naderan@webshub:~$ ./flush3
address = 0x7ffead7fdc90
array[ 0 ] = 0
miss section latency = 399
array[ 0 ] = 0
hit section latency = 161
overhead latency = 147
Measured L1 hit latency = 14 TSC cycles
Measured main memory latency = 252 TSC cycles
naderan@webshub:~$ ./flush3
address = 0x7ffe51a77310
array[ 0 ] = 0
miss section latency = 364
array[ 0 ] = 0
hit section latency = 182
overhead latency = 161
Measured L1 hit latency = 21 TSC cycles
Measured main memory latency = 203 TSC cycles

Slightly different latencies are acceptable. However hit latency of 63 compared to 21 and 14 is also observable.

UPDATE5:

As I checked the Ubuntu, there is no power saving feature enabled. Maybe the frequency change is disabled in the bios, or there is a miss configuration

$ cat /proc/cpuinfo  | grep -E "(model|MHz)"
model           : 79
model name      : Intel(R) Xeon(R) CPU E5-2620 v4 @ 2.10GHz
cpu MHz         : 2097.571
model           : 79
model name      : Intel(R) Xeon(R) CPU E5-2620 v4 @ 2.10GHz  
cpu MHz         : 2097.571
$ lscpu | grep MHz
CPU MHz:             2097.571

Anyway, that means the frequency is set to its maximum value which is what I have to care. By running multiple times, I see some different values. Are these normal?

$ taskset -c 0 ./flush3
address = 0x7ffe30c57dd0
array[ 0 ] = 0
miss section latency = 602
array[ 0 ] = 0
hit section latency = 161
overhead latency = 147
Measured L1 hit latency = 14 TSC cycles
Measured main memory latency = 455 TSC cycles
$ taskset -c 0 ./flush3
address = 0x7ffd16932fd0
array[ 0 ] = 0
miss section latency = 399
array[ 0 ] = 0
hit section latency = 168
overhead latency = 147
Measured L1 hit latency = 21 TSC cycles
Measured main memory latency = 252 TSC cycles
$ taskset -c 0 ./flush3
address = 0x7ffeafb96580
array[ 0 ] = 0
miss section latency = 364
array[ 0 ] = 0
hit section latency = 161
overhead latency = 140
Measured L1 hit latency = 21 TSC cycles
Measured main memory latency = 224 TSC cycles
$ taskset -c 0 ./flush3
address = 0x7ffe58291de0
array[ 0 ] = 0
miss section latency = 357
array[ 0 ] = 0
hit section latency = 168
overhead latency = 140
Measured L1 hit latency = 28 TSC cycles
Measured main memory latency = 217 TSC cycles
$ taskset -c 0 ./flush3
address = 0x7fffa76d20b0
array[ 0 ] = 0
miss section latency = 371
array[ 0 ] = 0
hit section latency = 161
overhead latency = 147
Measured L1 hit latency = 14 TSC cycles
Measured main memory latency = 224 TSC cycles
$ taskset -c 0 ./flush3
address = 0x7ffdec791580
array[ 0 ] = 0
miss section latency = 357
array[ 0 ] = 0
hit section latency = 189
overhead latency = 147
Measured L1 hit latency = 42 TSC cycles
Measured main memory latency = 210 TSC cycles

Your GDB output from disas /m has giant gaps, like from 0x69e to 0x6cd (or about 50 bytes of machine code). According to help disas: Only the main source file is displayed, not those of, e.g., any inlined functions. This modifier hasn't proved useful in practice and is deprecated in favor of /s. _mm_clflush is an inline function. Also you forgot to compile with optimization enabled, so your function is full of wasted instructions. And you're still using the useless _rdtscp( &array[ 0 ] ) thing that does a store to the array after reading the clock. — Peter Cordes
– Peter Cordes, Commented Aug 16, 2018 at 9:35
@PeterCordes: I wrote UPDATE4. Regarding _rdtscp( &array[ 0 ] ), you say that it is not good for my purpose. I read the manual and accept that. However, I didn't find any alternative for that. Do you mean that __rdtsc which Hadi-Brais used in his code is the right choice? I understand that from your comment about that. — mahmood
– mahmood, Commented Aug 17, 2018 at 5:49
Hadi's answer explains why and how he's using a read inside the timed region, with temp = array[0]. It compiles to asm that does what we want (if you use gcc -O3.) — Peter Cordes
– Peter Cordes, Commented Aug 17, 2018 at 5:50
When you ran Hadi's code, you probably didn't control for CPU frequency scaling. RDTSC counts at a fixed frequency, regardless of the core clock speed. So it's perfectly reasonable to see variations up to a factor of 5 on a 4GHz CPU (rated frequency = reference frequency) that idles at 0.8GHz (actually frequency when the program first starts). That's why I ran an infinite loop in the background to get my CPU to ramp up to max before running Hadi's code, see my comments under his answer. If you have a Skylake, maybe sometimes your CPU ramped up fast enough to see a lower time. — Peter Cordes
– Peter Cordes, Commented Aug 17, 2018 at 5:55
What Peter has said is critically important and you should understand it very well. TSC cycles have fixed periods, and so they measure wall clock time. In contrast, core cycles do NOT measure wall clock time under frequency scaling because different cycles have different periods. If the whole program fully runs within the core frequency domain, the core cycle count will be the same each run irrespective of frequency changes. However, the TSC cycle count will be different depending on frequency, because it directly translates into execution time. — Hadi Brais
– Hadi Brais, Commented Aug 17, 2018 at 8:02

Hadi Brais · Accepted Answer · 2020-01-24 10:19:23Z

14

You have multiple errors in the code that may lead the nonsensical measurements that you're seeing. I've fixed the errors and you can find the explanation in the comments below.

/* compile with gcc at optimization level -O3 */
/* set the minimum and maximum CPU frequency for all cores using cpupower to get meaningful results */ 
/* run using "sudo nice -n -20 ./a.out" to minimize possible context switches, or at least use "taskset -c 0 ./a.out" */
/* you can optionally use a p-state scaling driver other than intel_pstate to get more reproducable results */
/* This code still needs improvement to obtain more accurate measurements,
   and a lot of effort is required to do that—argh! */
/* Specifically, there is no single constant latency for the L1 because of
   the way it's designed, and more so for main memory. */
/* Things such as virtual addresses, physical addresses, TLB contents,
   code addresses, and interrupts may have an impact that needs to be
   investigated */
/* The instructions that GCC puts unnecessarily in the timed section are annoying AF */
/* This code is written to run on Intel processors! */

#include <stdint.h>
#include <x86intrin.h>
#include <stdio.h>
int main()
{
  int array[ 100 ];

  /* this is optional */
  /* will bring array in the cache */
  for ( int i = 0; i < 100; i++ )
    array[ i ] = i;

  printf( "address = %p \n", &array[ 0 ] ); /* guaranteed to be aligned within a single cache line */

  _mm_mfence();                      /* prevent clflush from being reordered by the CPU or the compiler in this direction */

  /* flush the line containing the element */
  _mm_clflush( &array[ 0 ] );

  //unsigned int aux;
  uint64_t time1, time2, msl, hsl, osl; /* initial values don't matter */

  /* You can generally use rdtsc or rdtscp.
     See: https://stackoverflow.com/questions/59759596/is-there-any-difference-in-between-rdtsc-lfence-rdtsc-and-rdtsc-rdtscp
     I AM NOT SURE THOUGH THAT THE SERIALIZATION PROERTIES OF
     RDTSCP ARE APPLICABLE AT THE COMPILER LEVEL WHEN USING THE
     __RDTSCP INTRINSIC. THIS IS TRUE FOR PURE FENCES SUCH AS LFENCE. */

  _mm_mfence();                      /* this properly orders both clflush and rdtsc*/
  _mm_lfence();                      /* mfence and lfence must be in this order + compiler barrier for rdtsc */
  time1 = __rdtsc();                 /* set timer */
  _mm_lfence();                      /* serialize __rdtsc with respect to trailing instructions + compiler barrier for rdtsc and the load */
  int temp = array[ 0 ];             /* array[0] is a cache miss */
  /* measring the write miss latency to array is not meaningful because it's an implementation detail and the next write may also miss */
  /* no need for mfence because there are no stores in between */
  _mm_lfence();                      /* mfence and lfence must be in this order + compiler barrier for rdtsc and the load*/
  time2 = __rdtsc();
  _mm_lfence();                      /* serialize __rdtsc with respect to trailing instructions */
  msl = time2 - time1;

  printf( "array[ 0 ] = %i \n", temp );             /* prevent the compiler from optimizing the load */
  printf( "miss section latency = %lu \n", msl );   /* the latency of everything in between the two rdtsc */

  _mm_mfence();                      /* this properly orders both clflush and rdtsc*/
  _mm_lfence();                      /* mfence and lfence must be in this order + compiler barrier for rdtsc */
  time1 = __rdtsc();                 /* set timer */
  _mm_lfence();                      /* serialize __rdtsc with respect to trailing instructions + compiler barrier for rdtsc and the load */
  temp = array[ 0 ];                 /* array[0] is a cache hit as long as the OS, a hardware prefetcher, or a speculative accesses to the L1D or lower level inclusive caches don't evict it */
  /* measring the write miss latency to array is not meaningful because it's an implementation detail and the next write may also miss */
  /* no need for mfence because there are no stores in between */
  _mm_lfence();                      /* mfence and lfence must be in this order + compiler barrier for rdtsc and the load */
  time2 = __rdtsc();
  _mm_lfence();                      /* serialize __rdtsc with respect to trailing instructions */
  hsl = time2 - time1;

  printf( "array[ 0 ] = %i \n", temp );            /* prevent the compiler from optimizing the load */
  printf( "hit section latency = %lu \n", hsl );   /* the latency of everything in between the two rdtsc */


  _mm_mfence();                      /* this properly orders both clflush and rdtsc */
  _mm_lfence();                      /* mfence and lfence must be in this order + compiler barrier for rdtsc */
  time1 = __rdtsc();                 /* set timer */
  _mm_lfence();                      /* serialize __rdtsc with respect to trailing instructions + compiler barrier for rdtsc */
  /* no need for mfence because there are no stores in between */
  _mm_lfence();                      /* mfence and lfence must be in this order + compiler barrier for rdtsc */
  time2 = __rdtsc();
  _mm_lfence();                      /* serialize __rdtsc with respect to trailing instructions */
  osl = time2 - time1;

  printf( "overhead latency = %lu \n", osl ); /* the latency of everything in between the two rdtsc */


  printf( "Measured L1 hit latency = %lu TSC cycles\n", hsl - osl ); /* hsl is always larger than osl */
  printf( "Measured main memory latency = %lu TSC cycles\n", msl - osl ); /* msl is always larger than osl and hsl */

  return 0;
}

Highly recommended: Memory latency measurement with time stamp counter.

Related: How can I create a spectre gadget in practice?.

edited Jan 24, 2020 at 10:19

answered Aug 13, 2018 at 21:41

Hadi Brais

24k3 gold badges65 silver badges110 bronze badges

Sign up to request clarification or add additional context in comments.

27 Comments

Peter Cordes Over a year ago

rdtscp doesn't need a preceding lfence, that's why the OP was using it instead of rdtsc. All previous instructions have to execute before it samples the time. (But it doesn't necessarily make later instructions wait for that to happen.)

Hadi Brais Over a year ago

@mahmood Using -O3 helps reduce the amount of noise inside the timed section of the code by removing expensive instructions. You can emit the binary using -O3 and -O0 and compare the assembly code and see the difference. Fences are required not just for the compiler (when optimizations are used), but also for the CPU itself. You cannot turn off the optimizations that the CPU itself performs. So the fences are critical to obtain a reliable measurement. You can do slightly better if you write the whole code in assembly instead of C, because there you have absolute control over the timed section.

Hadi Brais Over a year ago

Each fence has a purpose as explained in the comments in the code.

Hadi Brais Over a year ago

@mahmood Both mfence and lfence order clflush on Intel processors. However, mfence flushes the store buffer as well. But here we have no writes to array[0], so I don't think that would make a difference. However, on most Intel processors, mfence is cheaper than lfence. But mfence is used only once before clflush, so that wouldn't matter that much. I think it's OK to replace the mfence before clflush with lfence in this code.

Peter Cordes Over a year ago

@zgc: en.wikipedia.org/wiki/Common_subexpression_elimination

|

Peter Cordes · Accepted Answer · 2018-08-13 12:52:42Z

5

You know you can query the line size with cpuid, right? Do that if you actually want to find it programmatically. (Otherwise, assume it's 64 bytes, because it is on everything after PIII.)

But sure if want to use clflush or clflushopt from C for whatever reason, use void _mm_clflush(void const *p) or void _mm_clflushopt(void const *p), from #include <immintrin.h>. (See Intel's insn set ref manual entry for clflush or clflushopt).

GCC, clang, ICC, and MSVC all support Intel's <immintrin.h> intrinsics.

You could also have found this by searching Intel's intrinsics guide for clflush to find definitions for the intrinsics for that instruction.

see also https://stackoverflow.com/tags/x86/info for more links to guides, docs, and reference manuals.

More than that, how can I be sure that the line is evicted in order to verify the correctness of my code?

Look at the compiler's asm output, or single-step it in a debugger. If/when clflush executes, that cache line is evicted at that point in your program.

edited Aug 13, 2018 at 12:52

answered Aug 13, 2018 at 9:06

Peter Cordes

377k50 gold badges742 silver badges1k bronze badges

14 Comments

mahmood Over a year ago

Are these valid functions in gcc? Or they are specific for intel compiler?

Peter Cordes Over a year ago

@mahmood. All 4 mainstream x86 compilers support Intel's intrinsics in <immintrin.h>. gcc, clang, ICC, and MSVC.

mahmood Over a year ago

I think I had some progresses. Please see the updated post.

Peter Cordes Over a year ago

@onlycparra: clflush has existed since about SSE2, but has its own CPUID feature flag. So does clflushopt. en.wikichip.org/wiki/amd/microarchitectures/zen_2 confirms that it has the CLFLUSHOPT feature, or you could look at CPUID dumps on instlatx64.atw.hu for any particular Zen2 CPU.

Peter Cordes Over a year ago

@onlycparra: clflushopt in a loop. (With one SFENCE after, if you care about it being ordered wrt. later stores). (e.g. the Linux kernel function clflush_cache_range. See also Is there a way to flush the entire CPU cache related to a program?)

|

Collectives™ on Stack Overflow

clflush to invalidate cache line via C function

2 Answers 2

27 Comments

14 Comments

Your Answer

Linked

Hot Network Questions

Collectives™ on Stack Overflow

2 Answers 2

27 Comments

14 Comments

Your Answer

Sign up or log in

Post as a guest

Linked

Related