
I've been working on optimizing the calculation of differences between elements of NumPy arrays. I have been using Numba for performance improvements, but I see a 100-microsecond jump in execution time when the array size surpasses 1 MB. I assume this is due to my CPU's (a Ryzen 7950X) 1 MB L1 cache size.

Here is an example code:

import numpy as np
from numba import jit

@jit(nopython=True)
def extract_difference_1(random_array):
    shape0, shape1 = random_array.shape
    difference_arr = np.empty((shape0, shape1), dtype=np.float64)
    for i in range(shape0):
        difference_arr[i] = (random_array[i,0] - random_array[i,1],
                             random_array[i,1] - random_array[i,2],
                             random_array[i,2] - random_array[i,3],
                             random_array[i,3] - random_array[i,4],
                             random_array[i,4] - random_array[i,5],
                             random_array[i,5] - random_array[i,6],
                             random_array[i,6] - random_array[i,0])

    return difference_arr

@jit(nopython=True)
def extract_difference_2(random_array):
    shape0, shape1 = random_array.shape
    split_index = shape0 // 2
    part_1 = extract_difference_1(random_array[:split_index])
    part_2 = extract_difference_1(random_array[split_index:])

    return part_1 , part_2

x_list = [18500, 18700, 18900]
y = 7
for x in x_list:
    random_array = np.random.rand(x, y)
    print(f"\nFor (x,y) = ({x}, {y}), random_array size is {array_size_string(random_array)}:\n")
    for func in [extract_difference_1, extract_difference_2]:
        func(random_array) # compile the function
        timing_result = %timeit -q -o func(random_array)
        print(f"{func.__name__}:\t {timing_result_message(timing_result)}")

The timing results are:

For (x,y) = (18500, 7), random_array size is 0.988 MB, 1011.72 KB:

extract_difference_1:    32.4 µs ± 832 ns,   b: 31.5 µs,    w: 34.3 µs,     (l: 7, r: 10000),
extract_difference_2:    33.8 µs ± 279 ns,   b: 33.5 µs,    w: 34.3 µs,     (l: 7, r: 10000),

For (x,y) = (18700, 7), random_array size is 0.999 MB, 1022.66 KB:

extract_difference_1:    184 µs ± 2.15 µs,   b: 181 µs,     w: 188 µs,  (l: 7, r: 10000),
extract_difference_2:    34.4 µs ± 51.2 ns,  b: 34.3 µs,    w: 34.5 µs,     (l: 7, r: 10000),

For (x,y) = (18900, 7), random_array size is 1.009 MB, 1033.59 KB:

extract_difference_1:    201 µs ± 3.3 µs,    b: 196 µs,     w: 205 µs,  (l: 7, r: 10000),
extract_difference_2:    34.5 µs ± 75.2 ns,  b: 34.4 µs,    w: 34.6 µs,     (l: 7, r: 10000),

Splitting the resulting difference_arr in two does the trick, but I would prefer the result to be a single array, especially since later I will increase y to 10, 50, 100, 1000 and x to 20000. When I combine the split arrays part_1 and part_2 back into a single difference_arr, it is slower than extract_difference_1. I think the slowdown is due to the combined array being larger than 1 MB, so the L1 cache is not used.

Is there a way to maintain the performance while having the result be a single array with Python, Numba or any other package? Or is there a way that will allow me to recombine these arrays without a performance penalty for the resulting array exceeding the L1 cache size?

  • What OS are you using? I was able to reproduce it on Windows 10, but not on Ubuntu 22 (even with various other sizes). Commented Jun 21, 2024 at 8:52
  • @ken Windows 11 Commented Jun 21, 2024 at 8:54
  • I am not sure what is going on here, but one thing is for sure: At least for this example, if you change both functions to take an output buffer array as an argument instead of generating it on each call, the performance should be almost the same. So there is no need for chunked execution. You might also want to check if your actual problem is caused by the L1 cache. Commented Jun 21, 2024 at 9:30

1 Answer


TL;DR: The performance issue is not caused by your CPU cache. It comes from the behaviour of the allocator on your target platform, which is most likely Windows.


Analysis

I assume this is due to my CPU's Ryzen 7950X 1 MB L1 cache size.

First of all, the AMD Ryzen 7950X is a Zen 4 CPU. This architecture has L1D caches of 32 KiB, not 1 MiB. That being said, the L2 cache is 1 MiB on this architecture.

While the cache-size hypothesis is tempting at first glance, there are two major issues with it.

First, the same amount of data is read and written by the two functions; splitting the array in two parts does not change that. Thus, if cache misses happened in the first function due to the L2 capacity, they should also happen in the second one. Regarding memory accesses, the only major difference between the two functions is the order of the accesses, which should not have a significant performance impact anyway (the array is large enough that latency issues are mitigated).

Moreover, the L3 cache on Zen 4 is not that much slower than the L2: it should be no more than about twice as slow, while the experimental results show a >5x larger execution time.

I can reproduce this on a Cascade Lake CPU (which also has a 1 MiB L2 cache) on Windows. Here are the results:

For (x,y) = (18500, 7), random_array size is 0.988006591796875:

extract_difference_1:    68.6 µs ± 3.63 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
extract_difference_2:    70.8 µs ± 5.2 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

For (x,y) = (18700, 7), random_array size is 0.998687744140625:

extract_difference_1:    342 µs ± 8.31 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
extract_difference_2:    69.7 µs ± 2.67 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

For (x,y) = (18900, 7), random_array size is 1.009368896484375:

extract_difference_1:    386 µs ± 7.34 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
extract_difference_2:    67 µs ± 4.51 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

New hypothesis: allocation overheads

Splitting the resulting difference_arr into two does it

The main difference between the two functions is that one performs two small allocations rather than one big one. This raises a new hypothesis: can the allocation timings explain the issue?

We can easily answer this question based on this previous post: Why is allocation using np.empty not O(1). It shows a big performance gap between an allocation of 0.76 MiB (np.empty(10**5), i.e. 10**5 float64 items) and the next bigger one, >1 MiB. Here are the results provided in that answer:

np.empty(10**5)   # 620 ns ± 2.83 ns per loop    (on 7 runs, 1000000 loops each)
np.empty(10**6)   # 9.61 µs ± 34.2 ns per loop   (on 7 runs, 100000 loops each)

More precisely, here are new benchmarks on my current machine:

%timeit -n 10_000 np.empty(1000*1024, np.uint8)
793 ns ± 18.8 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

%timeit -n 10_000 np.empty(1024*1024, np.uint8)
6.6 µs ± 173 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

We can see that the gap is close to 1 MiB. Note that the timings between 1000 KiB and 1024 KiB are not very stable (showing that the result depends on hidden low-level parameters -- possibly packing/alignment issues).
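For reference, here is a small sweep script (mine, not from the linked answer) to locate the threshold on your own machine; the exact timings and the exact crossover point are platform-dependent:

```python
import timeit

import numpy as np

# Time np.empty for buffer sizes around the suspected ~1 MiB threshold.
for kib in (256, 512, 992, 1024, 1056, 2048):
    t = timeit.timeit(lambda: np.empty(kib * 1024, np.uint8), number=10_000)
    print(f"{kib:5d} KiB: {t / 10_000 * 1e6:.2f} µs per allocation")
```

On a Windows machine you should see a jump of roughly an order of magnitude near the 1 MiB mark; on Linux the curve is typically much smoother.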

This NumPy allocation behaviour is AFAIK specific to Windows and, as far as I remember, not visible on Linux (gaps might be seen, but not as big and not at the same threshold).

An explanation is provided in the linked answer: beyond a threshold, the allocator performs expensive kernel calls (huge pages might also play a role).


Solutions

Is there a way to maintain the performance while having the result be a single array with Python

Yes. You can preallocate the output array so as not to pay the expensive allocation overhead on each call. An alternative solution is to use another allocator (e.g. jemalloc or tcmalloc).

Here is a modified code preallocating memory:

import numpy as np
import numba as nb

@nb.jit(nopython=True)
def extract_difference_1(random_array, scratchMem):
    shape0, shape1 = random_array.shape
    # Reuse a slice of the preallocated scratch buffer instead of calling np.empty
    difference_arr = scratchMem[:shape0*shape1].reshape((shape0, shape1))
    for i in range(shape0):
        difference_arr[i] = (random_array[i,0] - random_array[i,1],
                             random_array[i,1] - random_array[i,2],
                             random_array[i,2] - random_array[i,3],
                             random_array[i,3] - random_array[i,4],
                             random_array[i,4] - random_array[i,5],
                             random_array[i,5] - random_array[i,6],
                             random_array[i,6] - random_array[i,0])

    return difference_arr

@nb.jit(nopython=True)
def extract_difference_2(random_array, scratchMem):
    shape0, shape1 = random_array.shape
    split_index = shape0 // 2
    # Hand each half a disjoint region of the scratch buffer
    part_1 = extract_difference_1(random_array[:split_index], scratchMem[:split_index*shape1])
    part_2 = extract_difference_1(random_array[split_index:], scratchMem[split_index*shape1:])

    return part_1 , part_2

x_list = [18500, 18700, 18900]
y = 7
scratchMem = np.empty(16*1024*1024)
for x in x_list:
    random_array = np.random.rand(x, y)
    print(f"\nFor (x,y) = ({x}, {y}), random_array size is {x*y*8/1024/1024}:\n")
    for func in [extract_difference_1, extract_difference_2]:
        func(random_array, scratchMem) # compile the function
        timing_result = %timeit -q -o func(random_array, scratchMem)
        print(f"{func.__name__}:\t {timing_result}")

Here is the result:

For (x,y) = (18500, 7), random_array size is 0.988006591796875:

extract_difference_1:    65.1 µs ± 2.48 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
extract_difference_2:    71 µs ± 2.36 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

For (x,y) = (18700, 7), random_array size is 0.998687744140625:

extract_difference_1:    69.3 µs ± 4.05 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
extract_difference_2:    68.3 µs ± 3.06 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

For (x,y) = (18900, 7), random_array size is 1.009368896484375:

extract_difference_1:    68.5 µs ± 1.98 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
extract_difference_2:    68.7 µs ± 3.14 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

We can see that the problem is now gone! This confirms the hypothesis that allocations were the main source of the performance issue.
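As a side note (my own addition, not part of the solution above), the same preallocation trick works in plain NumPy without Numba: np.subtract can write the differences directly into a caller-provided out array, so the hot path performs no large allocation at all. The function name extract_difference_out is hypothetical:

```python
import numpy as np

def extract_difference_out(random_array, out):
    # Each column minus its right neighbour...
    np.subtract(random_array[:, :-1], random_array[:, 1:], out=out[:, :-1])
    # ...with the last column wrapping around to the first.
    np.subtract(random_array[:, -1], random_array[:, 0], out=out[:, -1])
    return out

# Preallocate once, then reuse the buffer across calls.
a = np.random.rand(18700, 7)
out = np.empty_like(a)
extract_difference_out(a, out)
```

This returns a single array, as requested in the question, while keeping the expensive >1 MiB allocation out of the timed loop.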
