I am trying to reproduce https://igoro.com/archive/gallery-of-processor-cache-effects/ in python using numpy.
However, it does not work as expected and I don't quite understand why.
numpy has fixed-size dtypes, such as np.int64, which takes up 8 bytes.
So with a cache line of 64 bytes, 8 array values fit into one line.
Thus, when doing the timing, I should not see a notable change in the total time for strides within a cache line (k <= 8), because the same number of cache lines has to be transferred either way.
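If that reasoning is right, the number of transferred cache lines stays constant for strides 1 through 8 and only starts halving beyond that. A quick back-of-the-envelope check (my own arithmetic, exact for power-of-two strides into a contiguous int64 array):

# Distinct 64-byte cache lines touched when writing every k-th int64
n = 8 * 1024 * 1024          # array length
line_elems = 64 // 8         # 8 int64 values per cache line
for k in [2**i for i in range(11)]:
    writes = n // k                    # elements written (k divides n here)
    lines = n // max(k, line_elems)    # all lines are touched while k <= 8
    print(f"stride {k:4}: {writes:7} writes touch {lines:7} cache lines")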
Based on this SO answer, I also tried disabling the garbage collection, which didn't change anything.
# import gc
import time
import numpy as np

def update_kth_entries(arr, k):
    arr[k] = 0  # touch the array once before timing
    start = time.perf_counter_ns()
    for idx in range(0, len(arr), k):
        arr[idx] = 0
    end = time.perf_counter_ns()
    print(f"Updated every {k:4} th entry ({len(arr)//k:7} elements) in {(end - start)*1e-9:.5f}s")
    return arr
# gc.disable()
arr = np.arange(8*1024*1024, dtype=np.int64)
print(
    f"(Data) size of array: {arr.nbytes/1024/1024:.2f} MiB "
    f"(based on {arr.dtype})"
)
for k in np.power(2, np.arange(0, 11)):
    update_kth_entries(arr, k)
# gc.enable()
This gives something like
(Data) size of array: 64.00 MiB (based on int64)
Updated every 1 th entry (8388608 elements) in 0.72061s
Updated every 2 th entry (4194304 elements) in 0.32783s
Updated every 4 th entry (2097152 elements) in 0.14810s
Updated every 8 th entry (1048576 elements) in 0.07622s
Updated every 16 th entry ( 524288 elements) in 0.04409s
Updated every 32 th entry ( 262144 elements) in 0.01891s
Updated every 64 th entry ( 131072 elements) in 0.00930s
Updated every 128 th entry ( 65536 elements) in 0.00434s
Updated every 256 th entry ( 32768 elements) in 0.00234s
Updated every 512 th entry ( 16384 elements) in 0.00129s
Updated every 1024 th entry ( 8192 elements) in 0.00057s
Here is the output of lscpu -C
NAME ONE-SIZE ALL-SIZE WAYS TYPE LEVEL SETS PHY-LINE COHERENCY-SIZE
L1d 32K 384K 8 Data 1 64 1 64
L1i 32K 384K 8 Instruction 1 64 1 64
L2 256K 3M 4 Unified 2 1024 1 64
L3 16M 16M 16 Unified 3 16384 1 64
At this point I am quite confused about what I am observing.
- On the one hand, I fail to see the cache-line effect with the above code.
- On the other hand, I can show some sort of CPU caching effect using something like in this answer with a large enough 2D array (a sketch follows below).
I ran the above tests in a container on a Mac; a quick test natively on the Mac shows the same behavior.
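The kind of 2D test I mean is, for example, copying a large matrix in its native layout vs. via its transpose (not necessarily the exact test from the linked answer; a minimal sketch, numbers will vary by machine):

import time
import numpy as np

# Cache effect that is visible even through numpy: copying a large C-order
# matrix contiguously vs. via its transpose (strided reads)
a = np.zeros((4096, 4096), dtype=np.int64)   # 128 MiB, much larger than L3

start = time.perf_counter()
_ = a.copy()       # contiguous reads and writes
print(f"contiguous copy: {time.perf_counter() - start:.3f}s")

start = time.perf_counter()
_ = a.T.copy()     # reads jump one full row per element: cache-hostile
print(f"transposed copy: {time.perf_counter() - start:.3f}s")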
Is this odd behavior due to the python interpreter?
What am I missing here?
EDIT
Based on the answer of @jérôme-richard, I did some more timing using timeit, based on the following functions:
from numba import jit
import numpy as np

def for_loop(arr, k, arr_cnt):
    for idx in range(0, arr_cnt, k):
        arr[idx] = 0
    return arr

def vectorize(arr, k, arr_cnt):
    arr[0:arr_cnt:k] = 0
    return arr

@jit
def for_loop_numba(arr, k, arr_cnt):
    for idx in range(0, arr_cnt, k):
        arr[idx] = 0
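One caveat with the numba version: the first call triggers the JIT compilation, so it should be warmed up before timing; otherwise the compile time ends up in the first measurement. A minimal sketch:

# Warm-up call so timeit does not measure numba's compilation
warmup = np.zeros(16, dtype=np.int64)
for_loop_numba(warmup, 1, len(warmup))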
Using the same array from above, with some more information:
dtype_size_bytes = 8
arr = np.arange(dtype_size_bytes * 1024 * 1024, dtype=np.int64)
print(
    f"(Data) size of array: {arr.nbytes/1024/1024:.2f} MiB "
    f"(based on {arr.dtype})"
)
cacheline_size_bytes = 64
l1_size_bytes = 32*1024
l2_size_bytes = 256*1024
l3_size_bytes = 3*1024*1024
print(f"Elements in cacheline: {cacheline_size_bytes//dtype_size_bytes}")
print(f"Elements in L1: {l1_size_bytes//dtype_size_bytes}")
print(f"Elements in L2: {l2_size_bytes//dtype_size_bytes}")
print(f"Elements in L3: {l3_size_bytes//dtype_size_bytes}")
which gives
(Data) size of array: 64.00 MiB (based on int64)
Elements in cacheline: 8
Elements in L1: 4096
Elements in L2: 32768
Elements in L3: 393216
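For reference, the harness producing the numbers below looks roughly like this (a sketch using the stdlib timeit; the "mean ± std. dev. of 7 runs" lines are the output of IPython's %timeit):

import timeit

arr_cnt = len(arr)
for k in [2**i for i in range(11)]:
    n_elems = len(range(0, arr_cnt, k))
    # 7 repeats like %timeit; number=1 means one full traversal per measurement
    times = timeit.repeat(lambda: for_loop(arr, k, arr_cnt), repeat=7, number=1)
    best = min(times)
    print(f"stride={k:4}: {best*1e3:8.2f} ms total, "
          f"{best / n_elems * 1e9:7.3f} nsec per element")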
If I now run timeit on the above functions for various k (stride lengths), I get:
for loop
stride= 1: total time to traverse = 598 ms ± 18.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
stride= 1: time per element = 71.261 nsec +/- 2.208 nsec
stride= 2: total time to traverse = 294 ms ± 3.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
stride= 2: time per element = 70.197 nsec +/- 0.859 nsec
stride= 4: total time to traverse = 151 ms ± 1.4 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
stride= 4: time per element = 72.178 nsec +/- 0.666 nsec
stride= 8: total time to traverse = 77.2 ms ± 1.55 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
stride= 8: time per element = 73.579 nsec +/- 1.476 nsec
stride= 16: total time to traverse = 37.6 ms ± 684 μs per loop (mean ± std. dev. of 7 runs, 10 loops each)
stride= 16: time per element = 71.730 nsec +/- 1.305 nsec
stride= 32: total time to traverse = 20 ms ± 1.39 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)
stride= 32: time per element = 76.468 nsec +/- 5.304 nsec
stride= 64: total time to traverse = 10.8 ms ± 707 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)
stride= 64: time per element = 82.099 nsec +/- 5.393 nsec
stride= 128: total time to traverse = 5.16 ms ± 225 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)
stride= 128: time per element = 78.777 nsec +/- 3.426 nsec
stride= 256: total time to traverse = 2.5 ms ± 114 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)
stride= 256: time per element = 76.383 nsec +/- 3.487 nsec
stride= 512: total time to traverse = 1.31 ms ± 38.7 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
stride= 512: time per element = 80.239 nsec +/- 2.361 nsec
stride=1024: total time to traverse = 678 μs ± 36.3 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
stride=1024: time per element = 82.716 nsec +/- 4.429 nsec
Vectorized
stride= 1: total time to traverse = 6.12 ms ± 708 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)
stride= 1: time per element = 0.729 nsec +/- 0.084 nsec
stride= 2: total time to traverse = 5.5 ms ± 862 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)
stride= 2: time per element = 1.311 nsec +/- 0.206 nsec
stride= 4: total time to traverse = 5.73 ms ± 871 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)
stride= 4: time per element = 2.732 nsec +/- 0.415 nsec
stride= 8: total time to traverse = 5.73 ms ± 401 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)
stride= 8: time per element = 5.468 nsec +/- 0.382 nsec
stride= 16: total time to traverse = 4.01 ms ± 107 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)
stride= 16: time per element = 7.644 nsec +/- 0.205 nsec
stride= 32: total time to traverse = 2.35 ms ± 178 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)
stride= 32: time per element = 8.948 nsec +/- 0.680 nsec
stride= 64: total time to traverse = 1.42 ms ± 74.7 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
stride= 64: time per element = 10.809 nsec +/- 0.570 nsec
stride= 128: total time to traverse = 792 μs ± 100 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
stride= 128: time per element = 12.089 nsec +/- 1.530 nsec
stride= 256: total time to traverse = 300 μs ± 19.2 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
stride= 256: time per element = 9.153 nsec +/- 0.587 nsec
stride= 512: total time to traverse = 144 μs ± 7.38 μs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
stride= 512: time per element = 8.780 nsec +/- 0.451 nsec
stride=1024: total time to traverse = 67.8 μs ± 5.67 μs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
stride=1024: time per element = 8.274 nsec +/- 0.692 nsec
for loop numba
stride= 1: total time to traverse = 6.11 ms ± 316 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)
stride= 1: time per element = 0.729 nsec +/- 0.038 nsec
stride= 2: total time to traverse = 5.02 ms ± 246 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)
stride= 2: time per element = 1.197 nsec +/- 0.059 nsec
stride= 4: total time to traverse = 4.93 ms ± 366 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)
stride= 4: time per element = 2.350 nsec +/- 0.175 nsec
stride= 8: total time to traverse = 5.55 ms ± 500 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)
stride= 8: time per element = 5.292 nsec +/- 0.476 nsec
stride= 16: total time to traverse = 3.65 ms ± 228 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)
stride= 16: time per element = 6.969 nsec +/- 0.434 nsec
stride= 32: total time to traverse = 2.13 ms ± 48.8 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)
stride= 32: time per element = 8.133 nsec +/- 0.186 nsec
stride= 64: total time to traverse = 1.48 ms ± 75.2 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
stride= 64: time per element = 11.322 nsec +/- 0.574 nsec
stride= 128: total time to traverse = 813 μs ± 84.1 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
stride= 128: time per element = 12.404 nsec +/- 1.283 nsec
stride= 256: total time to traverse = 311 μs ± 14.1 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
stride= 256: time per element = 9.477 nsec +/- 0.430 nsec
stride= 512: total time to traverse = 138 μs ± 7.46 μs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
stride= 512: time per element = 8.394 nsec +/- 0.455 nsec
stride=1024: total time to traverse = 67.6 μs ± 6.14 μs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
stride=1024: time per element = 8.253 nsec +/- 0.750 nsec
As already mentioned by @jérôme-richard, the python overhead of the plain for loop compared to the numpy vectorization or the numba case is huge, ranging from a factor of 10 to 100. The numpy vectorization and numba cases are comparable.