I'm using Python 3.12.11 with Numba 0.61.2 on Ubuntu 22.04.5 and AMD Ryzen 7 3800X CPU. This benchmark:
import numpy as np, timeit as ti, numba as nb

@nb.njit(fastmath=True)
def f1d(img, w):
    fl = np.zeros_like(img)
    for i in range(w, len(img)-w):
        p = img[i]
        fl[i] = max(abs(p-img[i+1]), abs(p-img[i+w]), abs(p-img[i-w]), abs(p-img[i-1]))
    return fl

@nb.njit(fastmath=True)
def f2d(img):
    fl = np.zeros_like(img)
    for y in range(1, img.shape[0]-1):
        for x in range(1, img.shape[1]-1):
            p = img[y, x]
            fl[y, x] = max(abs(p-img[y, x+1]), abs(p-img[y, x-1]), abs(p-img[y+1, x]), abs(p-img[y-1, x]))
    return fl

img2d = np.random.randint(256, size=(500, 500)).astype('i2')
img1d = img2d.ravel().astype('i2')
w = img2d.shape[1]
print(np.array_equal(f2d(img2d)[:, 1:-1], f1d(img1d, w).reshape((w, -1))[:, 1:-1]))
print('Minimum, median and maximum execution time in us:')
for fun in ('f2d(img2d)', 'f1d(img1d, w)'):
    t = 10**6 * np.array(ti.repeat(stmt=fun, setup=fun, globals=globals(), number=1, repeat=99))
    print(f'{fun:20} {np.amin(t):8,.3f} {np.median(t):8,.3f} {np.amax(t):8,.3f}')
produces:
True
Minimum, median and maximum execution time in us:
f2d(img2d) 170.701 172.806 180.690
f1d(img1d, w) 853.687 864.637 873.464
The 2D version is over 5x faster than the 1D version, apparently due to heavy use of SIMD: ymm appears 559 times in the f2d assembly vs. 6 times in f1d. Why is essentially the same loop vectorized for f2d, but not for f1d?
In response to Homer512, I tried a loopless version of f1d:
@nb.njit(fastmath=True)
def f1da(img, w):
    p = img[w:-w]
    l = img[w-1:-w-1]
    r = img[w+1:-w+1]
    d = img[:-2*w]
    u = img[2*w:]
    fl = np.maximum(np.maximum(np.abs(p-l), np.abs(p-r)), np.maximum(np.abs(p-u), np.abs(p-d)))
    return fl
which ends up only slightly faster at:
f1da(img1d, w) 713.653 714.284 727.849
These examples greatly simplify my real use case, which can't be expressed in a loopless fashion with NumPy array operations.
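For these particular examples (though possibly not for the real use case), one zero-cost workaround worth noting is that reshaping a contiguous array returns a view, so the flat buffer can be handed to the fast 2D kernel without any copy. A sketch, assuming w equals the row width:

```python
import numpy as np

w = 500
img1d = np.random.randint(256, size=w * w).astype('i2')

# reshape of a contiguous array is a zero-copy view, so the
# vectorized 2D kernel can run directly on the 1D buffer
view2d = img1d.reshape(-1, w)
assert view2d.base is img1d   # no data was copied
view2d[0, 0] = 7
assert img1d[0] == 7          # writes go through to the flat array
```

This sidesteps the vectorization question rather than answering it, and only helps when the stencil really is a fixed 2D neighbourhood.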
I asked this because I was considering the @stencil feature, which apparently requires w to be a constant. @stencil turned out to be a waste of time: I tried it in 2D and it is about as slow as f1da, so it is pointless. The code can be trivially parallelised, but so can plain loops; IMHO, the only benefit is the compilation time. I also tried it in 1D following the docs and it was bogus (the code seems to never terminate on my machine).