I am trying to reproduce https://igoro.com/archive/gallery-of-processor-cache-effects/ in python using numpy.
However, it does not work as expected and I don't quite understand why.
numpy has fixed-size dtypes, such as np.int64, which takes up 8 bytes.
So with a cache line of 64 bytes, 8 array values fit into one line.
Thus, when doing the timing, I should not see a notable change in the total time for strides within a cache line (k <= 8), because the same number of cache lines has to be transferred either way.
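If that reasoning is right, the number of transferred cache lines stays constant for strides 1 through 8 and only starts halving beyond that. A quick back-of-the-envelope check (my own arithmetic, exact for power-of-two strides into a contiguous int64 array):

# Distinct 64-byte cache lines touched when writing every k-th int64
n = 8 * 1024 * 1024          # array length
line_elems = 64 // 8         # 8 int64 values per cache line
for k in [2**i for i in range(11)]:
    writes = n // k                    # elements written (k divides n here)
    lines = n // max(k, line_elems)    # all lines are touched while k <= 8
    print(f"stride {k:4}: {writes:7} writes touch {lines:7} cache lines")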
Based on this SO answer, I also tried disabling the garbage collection, which didn't change anything.
# import gc
import time
import numpy as np

def update_kth_entries(arr, k):
    arr[k] = 0  # touch the array once before timing
    start = time.perf_counter_ns()
    for idx in range(0, len(arr), k):
        arr[idx] = 0
    end = time.perf_counter_ns()
    print(f"Updated every {k:4} th entry ({len(arr)//k:7} elements) in {(end - start)*1e-9:.5f}s")
    return arr
# gc.disable()
arr = np.arange(8*1024*1024, dtype=np.int64)
print(
    f"(Data) size of array: {arr.nbytes/1024/1024:.2f} MiB "
    f"(based on {arr.dtype})"
)
for k in np.power(2, np.arange(0, 11)):
    update_kth_entries(arr, k)
# gc.enable()
This gives something like
(Data) size of array: 64.00 MiB (based on int64)
Updated every 1 th entry (8388608 elements) in 0.72061s
Updated every 2 th entry (4194304 elements) in 0.32783s
Updated every 4 th entry (2097152 elements) in 0.14810s
Updated every 8 th entry (1048576 elements) in 0.07622s
Updated every 16 th entry ( 524288 elements) in 0.04409s
Updated every 32 th entry ( 262144 elements) in 0.01891s
Updated every 64 th entry ( 131072 elements) in 0.00930s
Updated every 128 th entry ( 65536 elements) in 0.00434s
Updated every 256 th entry ( 32768 elements) in 0.00234s
Updated every 512 th entry ( 16384 elements) in 0.00129s
Updated every 1024 th entry ( 8192 elements) in 0.00057s
Here is the output of lscpu -C
NAME ONE-SIZE ALL-SIZE WAYS TYPE LEVEL SETS PHY-LINE COHERENCY-SIZE
L1d 32K 384K 8 Data 1 64 1 64
L1i 32K 384K 8 Instruction 1 64 1 64
L2 256K 3M 4 Unified 2 1024 1 64
L3 16M 16M 16 Unified 3 16384 1 64
At this point I am quite confused about what I am observing.
- On the one hand, I fail to see the cache-line effect with the above code.
- On the other hand, I can show some sort of CPU caching effect using something like in this answer with a large enough 2D array (a sketch follows below).
I ran the above tests in a container on a Mac; a quick test natively on the Mac shows the same behavior.
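The kind of 2D test I mean is, for example, copying a large matrix in its native layout vs. via its transpose (not necessarily the exact test from the linked answer; a minimal sketch, numbers will vary by machine):

import time
import numpy as np

# Cache effect that is visible even through numpy: copying a large C-order
# matrix contiguously vs. via its transpose (strided reads)
a = np.zeros((4096, 4096), dtype=np.int64)   # 128 MiB, much larger than L3

start = time.perf_counter()
_ = a.copy()       # contiguous reads and writes
print(f"contiguous copy: {time.perf_counter() - start:.3f}s")

start = time.perf_counter()
_ = a.T.copy()     # reads jump one full row per element: cache-hostile
print(f"transposed copy: {time.perf_counter() - start:.3f}s")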
Is this odd behavior due to the python interpreter?
What am I missing here?
EDIT
Based on the answer of @jérôme-richard, I did some more timing using timeit, based on the following functions:
from numba import jit
import numpy as np

def for_loop(arr, k, arr_cnt):
    for idx in range(0, arr_cnt, k):
        arr[idx] = 0
    return arr

def vectorize(arr, k, arr_cnt):
    arr[0:arr_cnt:k] = 0
    return arr

@jit
def for_loop_numba(arr, k, arr_cnt):
    for idx in range(0, arr_cnt, k):
        arr[idx] = 0
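One caveat with the numba version: the first call triggers the JIT compilation, so it should be warmed up before timing; otherwise the compile time ends up in the first measurement. A minimal sketch:

# Warm-up call so timeit does not measure numba's compilation
warmup = np.zeros(16, dtype=np.int64)
for_loop_numba(warmup, 1, len(warmup))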
Using the same array from above, with some more information:
dtype_size_bytes = 8
arr = np.arange(dtype_size_bytes * 1024 * 1024, dtype=np.int64)
print(
    f"(Data) size of array: {arr.nbytes/1024/1024:.2f} MiB "
    f"(based on {arr.dtype})"
)
cacheline_size_bytes = 64
l1_size_bytes = 32*1024
l2_size_bytes = 256*1024
l3_size_bytes = 3*1024*1024
print(f"Elements in cacheline: {cacheline_size_bytes//dtype_size_bytes}")
print(f"Elements in L1: {l1_size_bytes//dtype_size_bytes}")
print(f"Elements in L2: {l2_size_bytes//dtype_size_bytes}")
print(f"Elements in L3: {l3_size_bytes//dtype_size_bytes}")
which gives
(Data) size of array: 64.00 MiB (based on int64)
Elements in cacheline: 8
Elements in L1: 4096
Elements in L2: 32768
Elements in L3: 393216
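For reference, the harness producing the numbers below looks roughly like this (a sketch using the stdlib timeit; the "mean ± std. dev. of 7 runs" lines are the output of IPython's %timeit):

import timeit

arr_cnt = len(arr)
for k in [2**i for i in range(11)]:
    n_elems = len(range(0, arr_cnt, k))
    # 7 repeats like %timeit; number=1 means one full traversal per measurement
    times = timeit.repeat(lambda: for_loop(arr, k, arr_cnt), repeat=7, number=1)
    best = min(times)
    print(f"stride={k:4}: {best*1e3:8.2f} ms total, "
          f"{best / n_elems * 1e9:7.3f} nsec per element")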
If I now run timeit on the above functions for various k (stride lengths), I get:
for loop
stride= 1: total time to traverse = 598 ms ± 18.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
stride= 1: time per element = 71.261 nsec +/- 2.208 nsec
stride= 2: total time to traverse = 294 ms ± 3.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
stride= 2: time per element = 70.197 nsec +/- 0.859 nsec
stride= 4: total time to traverse = 151 ms ± 1.4 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
stride= 4: time per element = 72.178 nsec +/- 0.666 nsec
stride= 8: total time to traverse = 77.2 ms ± 1.55 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
stride= 8: time per element = 73.579 nsec +/- 1.476 nsec
stride= 16: total time to traverse = 37.6 ms ± 684 μs per loop (mean ± std. dev. of 7 runs, 10 loops each)
stride= 16: time per element = 71.730 nsec +/- 1.305 nsec
stride= 32: total time to traverse = 20 ms ± 1.39 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)
stride= 32: time per element = 76.468 nsec +/- 5.304 nsec
stride= 64: total time to traverse = 10.8 ms ± 707 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)
stride= 64: time per element = 82.099 nsec +/- 5.393 nsec
stride= 128: total time to traverse = 5.16 ms ± 225 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)
stride= 128: time per element = 78.777 nsec +/- 3.426 nsec
stride= 256: total time to traverse = 2.5 ms ± 114 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)
stride= 256: time per element = 76.383 nsec +/- 3.487 nsec
stride= 512: total time to traverse = 1.31 ms ± 38.7 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
stride= 512: time per element = 80.239 nsec +/- 2.361 nsec
stride=1024: total time to traverse = 678 μs ± 36.3 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
stride=1024: time per element = 82.716 nsec +/- 4.429 nsec
Vectorized
stride= 1: total time to traverse = 6.12 ms ± 708 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)
stride= 1: time per element = 0.729 nsec +/- 0.084 nsec
stride= 2: total time to traverse = 5.5 ms ± 862 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)
stride= 2: time per element = 1.311 nsec +/- 0.206 nsec
stride= 4: total time to traverse = 5.73 ms ± 871 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)
stride= 4: time per element = 2.732 nsec +/- 0.415 nsec
stride= 8: total time to traverse = 5.73 ms ± 401 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)
stride= 8: time per element = 5.468 nsec +/- 0.382 nsec
stride= 16: total time to traverse = 4.01 ms ± 107 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)
stride= 16: time per element = 7.644 nsec +/- 0.205 nsec
stride= 32: total time to traverse = 2.35 ms ± 178 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)
stride= 32: time per element = 8.948 nsec +/- 0.680 nsec
stride= 64: total time to traverse = 1.42 ms ± 74.7 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
stride= 64: time per element = 10.809 nsec +/- 0.570 nsec
stride= 128: total time to traverse = 792 μs ± 100 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
stride= 128: time per element = 12.089 nsec +/- 1.530 nsec
stride= 256: total time to traverse = 300 μs ± 19.2 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
stride= 256: time per element = 9.153 nsec +/- 0.587 nsec
stride= 512: total time to traverse = 144 μs ± 7.38 μs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
stride= 512: time per element = 8.780 nsec +/- 0.451 nsec
stride=1024: total time to traverse = 67.8 μs ± 5.67 μs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
stride=1024: time per element = 8.274 nsec +/- 0.692 nsec
for loop numba
stride= 1: total time to traverse = 6.11 ms ± 316 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)
stride= 1: time per element = 0.729 nsec +/- 0.038 nsec
stride= 2: total time to traverse = 5.02 ms ± 246 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)
stride= 2: time per element = 1.197 nsec +/- 0.059 nsec
stride= 4: total time to traverse = 4.93 ms ± 366 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)
stride= 4: time per element = 2.350 nsec +/- 0.175 nsec
stride= 8: total time to traverse = 5.55 ms ± 500 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)
stride= 8: time per element = 5.292 nsec +/- 0.476 nsec
stride= 16: total time to traverse = 3.65 ms ± 228 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)
stride= 16: time per element = 6.969 nsec +/- 0.434 nsec
stride= 32: total time to traverse = 2.13 ms ± 48.8 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)
stride= 32: time per element = 8.133 nsec +/- 0.186 nsec
stride= 64: total time to traverse = 1.48 ms ± 75.2 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
stride= 64: time per element = 11.322 nsec +/- 0.574 nsec
stride= 128: total time to traverse = 813 μs ± 84.1 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
stride= 128: time per element = 12.404 nsec +/- 1.283 nsec
stride= 256: total time to traverse = 311 μs ± 14.1 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
stride= 256: time per element = 9.477 nsec +/- 0.430 nsec
stride= 512: total time to traverse = 138 μs ± 7.46 μs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
stride= 512: time per element = 8.394 nsec +/- 0.455 nsec
stride=1024: total time to traverse = 67.6 μs ± 6.14 μs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
stride=1024: time per element = 8.253 nsec +/- 0.750 nsec
As already mentioned by @jérôme-richard, the python overhead of the plain for loop compared to the numpy vectorization or the numba case is huge, ranging from a factor of 10 to 100. The numpy vectorization and numba cases are comparable.