
I'm using Python 3.12.11 with Numba 0.61.2 on Ubuntu 22.04.5 with an AMD Ryzen 7 3800X CPU. This benchmark:

import numpy as np, timeit as ti, numba as nb

@nb.njit(fastmath=True)
def f1d(img, w):
  fl = np.zeros_like(img)

  for i in range(w, len(img)-w):
    p = img[i]
    fl[i] = max(abs(p-img[i+1]), abs(p-img[i+w]), abs(p-img[i-w]), abs(p-img[i-1]))

  return fl

@nb.njit(fastmath=True)
def f2d(img):
  fl = np.zeros_like(img)

  for y in range(1, img.shape[0]-1):
    for x in range(1, img.shape[1]-1):
      p = img[y, x]
      fl[y, x] = max(abs(p-img[y, x+1]), abs(p-img[y, x-1]), abs(p-img[y+1, x]), abs(p-img[y-1, x]))

  return fl

img2d = np.random.randint(256, size=(500, 500)).astype('i2')
img1d = img2d.ravel().astype('i2')
w = img2d.shape[1]
print(np.array_equal(f2d(img2d)[:, 1:-1], f1d(img1d, w).reshape((w, -1))[:, 1:-1]))
print(f'Minimum, median and maximum execution time in us:')

for fun in ('f2d(img2d)', 'f1d(img1d, w)'):  
  t = 10**6 * np.array(ti.repeat(stmt=fun, setup=fun, globals=globals(), number=1, repeat=99))
  print(f'{fun:20}  {np.amin(t):8,.3f}  {np.median(t):8,.3f}  {np.amax(t):8,.3f}')  

produces:

True
Minimum, median and maximum execution time in us:
f2d(img2d)             170.701   172.806   180.690
f1d(img1d, w)          853.687   864.637   873.464

The 2D version is over 5x faster than the 1D version, thanks to heavy use of SIMD: ymm registers appear 559 times in the f2d assembly vs. 6 times in f1d. Why is essentially the same loop vectorized for f2d, but not for f1d?


In response to Homer512, I tried a loopless version of f1d:

@nb.njit(fastmath=True)
def f1da(img, w):
  p = img[w:-w]
  l = img[w-1:-w-1]
  r = img[w+1:-w+1]
  d = img[:-2*w]
  u = img[2*w:]
  fl = np.maximum(np.maximum(np.abs(p-l), np.abs(p-r)), np.maximum(np.abs(p-u), np.abs(p-d)))

  return fl

which ends up only slightly faster at:

f1da(img1d, w)         713.653   714.284   727.849

These examples are greatly simplifying my use case, which can't be expressed in a loopless fashion with NumPy array operations.

  • 1
    Probably same as last time: if vectorization fails, it's usually because LLVM cannot eliminate range checks. Have you tried the same workaround as last time? Commented Jul 25 at 8:00
  • @Homer512 I tried that, see question edit, but it's still much slower than 2D. Commented Jul 25 at 8:46
  • 1
    @PaulJurczak Regarding the compile-time constant. I agree with you (I got the same result on my machine when I tried), but my point was to know if this is fine to you having w to be a constant. I asked this because I was thinking about using the @stencil feature apparently requiring w to be a constant. Commented Jul 25 at 19:28
  • 1
    @PaulJurczak For your information: @stencil is a waste of time. I tried it in 2D and it is about as slow as f1da so it is pointless... The code can be trivially parallelised but loops can too. IMHO, the only benefit is the compilation time. I also tried it in 1D following the doc and it was bogus (the code seems to never end on my machine)... Commented Jul 25 at 20:30
  • 1
    This is a bit off topic, but I found that explicitly casting after subtracting drastically improved performance. For example: abs(np.int16(p - img[y, x + 1])). Commented Jul 26 at 11:56

2 Answers


Here is a fix based on the good advice of Homer512 and his related past answer. I corrected the indices so as to avoid any relative offset in the memory accesses (even having fl[i+w] = ... prevents Numba from optimising the code).

@nb.njit
def f1d(img, w):
  fl = np.zeros_like(img)
  pr = img[w+1:]
  pl = img[w-1:]
  pc = img[w:]
  pt = img[0:]
  pb = img[w+w:]
  out = fl[w:]
  size = len(img) - 2 * w

  for i in range(size):
    p = pc[i]
    out[i] = max(abs(p-pr[i]), abs(p-pl[i]), abs(p-pt[i]), abs(p-pb[i]))

  return fl

I think Numba fails to optimise the initial code because it does not see that there is no overlap between the input and output arrays (a pointer-aliasing issue). AFAIK, LLVM can add runtime checks to generate a more specialised/optimised variant of the loop, but this kind of optimisation is very brittle (and requires aggressive optimisations to be enabled internally). Numba could theoretically do this at a higher level (it is obvious here, since fl is created inside the function), but AFAIK it is not done yet.

The above code is about as fast as the 2D version. As stated in the comments, both implementations are far from optimal compared to a naive SIMD-optimised native code (which is still about 6 times faster with a single thread).

By the way, please note I removed the fastmath flag, because it only applies to floating-point operations, not integer ones.


Update: faster implementation

As pointed out by max9111 in the comments, the code can be optimised further. Here is a much faster implementation (about as fast as an optimised native code and about 7 times faster than the above one):

@nb.njit
def f1d_fastest(img, w):
  assert w > 0
  fl = np.zeros_like(img)
  pr = img[w+1:]
  pl = img[w-1:]
  pc = img[w:]
  pt = img[0:]
  pb = img[w+w:]
  out = fl[w:]
  size = len(img) - 2 * w

  for i in range(np.uint64(size)):
    p = pc[i]
    out[i] = max(abs(np.int16(p-pr[i])), abs(np.int16(p-pl[i])), abs(np.int16(p-pt[i])), abs(np.int16(p-pb[i])))

  return fl

This code assumes the subtractions do not overflow. If they can, then you can replace the explicit 16-bit conversions with 32-bit ones. With the provided random input array (values in 0..255), 16-bit is totally fine.
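As a quick illustration of why the no-overflow assumption matters: int16 arithmetic wraps around modulo 2**16, so a subtraction that leaves the int16 range silently produces a wrong magnitude. A minimal NumPy sketch (unrelated to the benchmark data, which stays in range):

```python
import numpy as np

a = np.array([-30000], dtype=np.int16)
b = np.array([30000], dtype=np.int16)
print(np.abs(a - b))                   # wraps: -60000 mod 2**16 = 5536
print(np.abs(a.astype(np.int32) - b)) # correct: 60000
```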

The same optimisation applies to the 2D code.

Please note that, once optimised, the 2D code tends to be slower than the 1D one on machines without the AVX-512 instruction set (e.g. AMD Zen 2 CPUs like the one used here, or most Intel client CPUs). This is because the SSE/AVX/AVX2 and Neon instruction sets do not provide masked loads/stores of 16-bit integers, which are critical for computing the borders efficiently (as opposed to AVX-512). The 1D version is not limited by that.

Actually, the 1D version computes the borders differently: unlike the 2D version, it fills them. This means the 1D and 2D versions are not strictly equivalent. This is not a problem if you do not care about the borders. If you want to fill them with zeros, then you should include this step in the 1D version so as to perform a fair comparison (filling the borders is not that fast, because it can only be done efficiently with scalar instructions).
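If you do want matching borders, one option is a post-processing step that zeros the one-pixel frame on the reshaped 1D result. A sketch (zero_borders_1d is a hypothetical helper, not part of the code above):

```python
import numpy as np

def zero_borders_1d(fl, w):
    # View the flat result as 2D and clear the one-pixel frame in place
    fl2 = fl.reshape(-1, w)
    fl2[0, :] = 0
    fl2[-1, :] = 0
    fl2[:, 0] = 0
    fl2[:, -1] = 0
    return fl

fl = np.ones(25, dtype=np.int16)
zero_borders_1d(fl, 5)
print(fl.reshape(5, 5))
```

This keeps the hot loop untouched, at the cost of one extra (mostly scalar) pass over the border elements.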


20 Comments

Maybe you can add an optimized version of your code to your answer and confirm that this works as intended. int16 - int16 results by definition in int64 in Numba, if not declared otherwise. pastebin.com/HGZ4W09B
@max9111 Thank you. I provided a faster 1D implementation based on yours. I also checked the correctness and mention some information about this. This code is now competitive with native codes. By the way, the int16-int16->int64 typing rule is a bit surprising (especially for a C/C++ developer). 32-bit is enough even to avoid overflow issues.
Image borders are disregarded for performance reasons, see my test of results: print(np.array_equal(f2d(img2d)[:, 1:-1], f1d(img1d, w).reshape((w, -1))[:, 1:-1]))
Here is a script which should compare all functions so far: pastebin.com/rBr8mhV4 . The fastest one on my setup (Windows) is clearly f1d_fastest_with_empty_init (by a large margin). Then comes f2d_alloc due to the empty_like, and then f1d_fastest. Here are the results on my setup (5 runs): pastebin.com/HMfQQupF. Can you run it and report results like I did?
Here are results on my Desktop Linux (in performance mode, so disabled CStates, no turbo, a fixed stable frequency and nearly no background activity). Results are clearly more stable though I still see some (~10%) fluctuation in timings on the fastest SIMD-optimised implementations. Results are similar overall and f1d_fastest_with_empty_init is clearly the fastest.
Thank you for doing all this work. I was trying to get a Mojo version, hoping for excellent performance, but got stuck somewhere on the learning curve and had to drop it for now. My code and results are available at github.com/pauljurczak/python-benchmarks/tree/main.

Memory Access Patterns

f2d (2D Array): When you iterate with img[y, x], img[y, x+1], img[y, x-1], these accesses are largely contiguous in memory (or at least within the same cache line). Even img[y+1, x] and img[y-1, x] access adjacent rows, but within the inner x loop, the data is processed sequentially. Modern CPUs and SIMD instructions (like AVX, using ymm registers) are highly optimized for loading and processing data in contiguous blocks. This allows Numba's LLVM backend to generate highly efficient vectorized code.

f1d (1D Flattened Array with w stride): The issue here is the w (width) parameter, which is 500 in your example. When you access img[i+w] or img[i-w], you jump 500 elements (1000 bytes for the i2 dtype) away from img[i]. These are large, non-contiguous memory strides.
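The stride arithmetic can be checked directly with NumPy. A quick sketch (note that the row stride of the 2D array is the same 1000-byte jump as img1d[i+w], since the 2D array is row-major in memory):

```python
import numpy as np

img2d = np.zeros((500, 500), dtype='i2')
print(img2d.strides)  # (1000, 2): one row down = 1000 bytes, one column = 2 bytes

img1d = img2d.ravel()
w = img2d.shape[1]
# img1d[i + w] jumps w * itemsize = 1000 bytes, exactly like img2d[y + 1, x]
print(w * img1d.itemsize)
```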

SIMD units work by loading a block of data (e.g., 256 bits or 32 bytes for ymm registers) into a register and then performing operations on all elements in that block simultaneously.

If the data needed for an operation (p, img[i+1], img[i-1], img[i+w], img[i-w]) is scattered across memory, the CPU has to perform multiple, less efficient memory loads (potentially cache misses) to bring the data into the registers. This overhead quickly negates the benefits of vectorization.

Compiler (Numba/LLVM) Optimizations

Numba leverages LLVM, which is a powerful optimizing compiler. LLVM is excellent at vectorizing "embarrassingly parallel" loops where memory accesses are predictable and contiguous. However, loops with complex or large strided memory access patterns are notoriously difficult for compilers to vectorize automatically. The compiler might fall back to scalar (element-by-element) operations for f1d because it cannot efficiently map the strided loads to SIMD instructions. The ymm count difference you observed strongly supports this: f2d is heavily vectorized, f1d is not.

When you perform np.abs(p-u) or np.abs(p-d), NumPy's internal C code will be highly optimized for these element-wise operations. However, the initial creation of the d and u arrays (which involve large strides relative to the original img1d's conceptual 2D structure) still involves non-local memory accesses. While NumPy's C routines are better at handling this than a Numba-compiled Python loop for some cases, the underlying memory fetching bottleneck remains. The operations like np.maximum and np.abs themselves are vectorized, but the overall efficiency is still limited by how the data is laid out and accessed.

3 Comments

There are many issues in this answer. First of all, the memory access pattern is overall the same in both codes: img[y+1, x] also performs a "non-contiguous memory stride", since memory is flattened internally anyway. Moreover, both codes can be vectorized (which makes sense, since it is logically the same operation). It is just that the compiler certainly fails to auto-vectorize one of them; that does not mean it cannot be vectorized. Auto-vectorization is very brittle, so no, we can hardly state that LLVM is "excellent at vectorizing [...]". In fact, it actually fails in this simple stencil case...
"the data is processed sequentially" This is ambiguous: the codes technically use a single thread, but the main question is whether the SIMD units are properly used or not. A single-threaded code using SIMD instructions is generally not called a parallel one. It is better to talk about scalar and SIMD code (possibly "vectorized" too, but that term unfortunately has another meaning when it comes to NumPy/Python codes).
The last paragraph is the relevant part. However, please note Numba can sometimes optimize operations so as to avoid temporary arrays (I am not sure it does so here, so one needs to check carefully).
