
For my study, I have to implement a pairwise L1-distance calculation between two sets of vectors, each represented as a NumPy matrix (vectors are rows). It has to be done three ways: with two loops, with one loop, and with no loops. I expected that, since NumPy is so good at vectorization, the implementations would rank two-loops slower than one-loop slower than no-loops.

I've written the functions:

import numpy as np

def f_cdist_2(X1, X2):
    res = np.zeros(shape=(X1.shape[0], X2.shape[0]), dtype=np.float64)

    for ix1 in range(X1.shape[0]):
        for ix2 in range(X2.shape[0]):
            res[ix1, ix2] = np.abs(X1[ix1, :] - X2[ix2, :]).sum()

    return res


def f_cdist_1(X1, X2):
    res = np.zeros(shape=(X1.shape[0], X2.shape[0]), dtype=np.float64)

    for ix1 in range(X1.shape[0]):
        res[ix1, :] = np.abs(np.tile(X1[ix1, :], (X2.shape[0], 1)) - X2).sum(axis=1)

    return res


def f_cdist_0(X1, X2):
    # Tile X1 up to shape (n1, k, n2) and X2.T up to (n1, k, n2), then
    # reduce over the feature axis; both intermediates are fully materialized
    res = np.abs(
            np.tile(X1[:, :, np.newaxis], (1, 1, X2.shape[0])) -
            np.tile(X2.T[np.newaxis, :, :], (X1.shape[0], 1, 1))
    ).sum(axis=1)

    return res
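
A quick sanity check that the three functions agree, on small random inputs (the shapes here are arbitrary, just for checking):

X1 = np.random.rand(8, 5)
X2 = np.random.rand(6, 5)

assert np.allclose(f_cdist_2(X1, X2), f_cdist_1(X1, X2))
assert np.allclose(f_cdist_2(X1, X2), f_cdist_0(X1, X2))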

Then I tested the performance with two random matrices of shapes 128 x 512 and 256 x 512. Based on 100 runs, I got these results (a sketch of the timing harness follows the list):

  1. Two loops: 156 msec

  2. One loop: 32 msec

  3. No loops: 135 msec
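
The exact timing code isn't shown in the question, so this harness is an assumption: 100 calls per function on fixed random inputs, using the three functions above.

import timeit

X1 = np.random.rand(128, 512)
X2 = np.random.rand(256, 512)

for f in (f_cdist_2, f_cdist_1, f_cdist_0):
    t = timeit.timeit(lambda: f(X1, X2), number=100)
    print(f'{f.__name__}: {t / 100 * 1e3:.1f} msec per call')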

I also tried cdist from scipy.spatial.distance, and got the best performance of all: 9 msec.
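
For the record, that call is a one-liner; cdist names the L1 metric 'cityblock':

from scipy.spatial.distance import cdist

# X1: (128, 512), X2: (256, 512) as above
res = cdist(X1, X2, metric='cityblock')  # pairwise L1 (Manhattan) distances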

Now, is there a better way to implement the no-loops function? I hoped it would perform at least as well as the one-loop version, but for now I'm at a loss as to how to improve it.

UPDATE

Using kwinkunks's implementation of the no-loops approach, I re-ran the tests (again 100 trials) on 1024 x 1024 matrices; the results are below:

  1. Two loops: 5.7 sec

  2. One loop: 6.6 sec

  3. No loops: 3.9 sec

  4. scipy.spatial.distance.cdist: 0.6 sec

So on larger matrices, the no-loops implementation does indeed work better. SciPy works wonders, but if I understand correctly, it is written in C, hence the great performance.

UPDATE

Tried with 4096 x 1024 matrices of np.float64, same setup:

  1. Two loops: 88 sec

  2. One loop: 66 sec

  3. No loops: ran out of memory (had ~18 GB of free RAM at the moment; see the note after this list)

  4. scipy.spatial.distance.cdist: 13 sec
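
A quick back-of-the-envelope check explains the no-loops failure: broadcasting (or tiling) 4096 x 1024 against 4096 x 1024 materializes an intermediate array of shape (4096, 4096, 1024) in np.float64, i.e. 4096 · 4096 · 1024 · 8 bytes = 128 GiB, far beyond the ~18 GB available.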

6 Comments
  • Interesting study! I'm curious whether the result will be any different if you use much bigger matrices. Have you tried with much bigger shapes? Commented Apr 25, 2019 at 18:08
  • That is a good insight. I think the memory of the broadcast of the large arrays is creating considerable overhead. This is a surprising result! Commented Apr 25, 2019 at 18:20
  • Tried just now, updated post with results. No-loops now is the best of three. Will try with even larger shapes, but that's going to take significant time. I'll leave it overnight. Commented Apr 25, 2019 at 20:01
  • You can do this a lot faster than cdist, e.g. stackoverflow.com/a/42994680/4045774 or stackoverflow.com/a/53380192/4045774 Commented Apr 26, 2019 at 12:41
  • @max9111, a lot faster than cdist from scipy? Could you please say how? I browsed through the questions you suggested, and implemented the distance calculation with np.einsum like this: np.einsum('ijk -> ij', np.abs(X1[:, None, :] - X2[None, :, :])) (written out as a standalone function below), but it performs pretty much like the no-loops implementation above. Is there a better way? Commented Apr 26, 2019 at 16:12
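
For reference, the einsum variant from that last comment as a standalone function; like the broadcasting version, it still materializes the full (n1, n2, k) intermediate, which is why it performs about the same:

def f_cdist_einsum(X1, X2):
    # einsum only takes over the reduction; the (n1, n2, k) temporary
    # from broadcasting is still allocated, so no memory is saved
    return np.einsum('ijk -> ij', np.abs(X1[:, None, :] - X2[None, :, :]))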

3 Answers


You can get an extra speedup from the vectorized version using Pythran.

f_dist.py:

import numpy as np
#pythran export f_dist(float64[:,:], float64[:,:])
def f_dist(X1, X2):
    return np.sum(np.abs(X1[:, None, :] - X2), axis=-1)

On my laptop, the original version runs at:

> python -m timeit -s 'from f_dist import f_dist; from numpy.random import random; x = random((100,100)); y = random((100,100))' 'f_dist(x, y)'
100 loops, best of 3: 7.05 msec per loop

Once you compile the kernel:

> pythran f_dist.py

You can benchmark it:

> python -m timeit -s 'from f_dist import f_dist; from numpy.random import random; x = random((100,100)); y = random((100,100))' 'f_dist(x, y)'
1000 loops, best of 3: 1.21 msec per loop

Using SIMD instructions further speeds up the computation:

> pythran f_dist.py -DUSE_XSIMD -march=native
> python -m timeit -s 'from f_dist import f_dist; from numpy.random import random; x = random((100,100)); y = random((100,100))' 'f_dist(x, y)'
1000 loops, best of 3: 774 usec per loop

Disclaimer: I'm the core dev of the pythran project.


3 Comments

  • Could you also benchmark with #pythran export f_dist(float64[:,::1], float64[:,::1])? At least in Numba or Cython, declaring non-contiguous arrays often hurts SIMD-vectorization.
  • It triggers a compilation error :-) Thanks for the extra test case :-) In theory, a non-contiguous array would prevent vectorization.
  • Bug fixed locally, and as expected, there's no SIMD processing of the main loop when using a sliced array.

You can avoid the tiling etc with NumPy's broadcasting:

def f_dist(X1, X2):
    return np.sum(np.abs(X1[:, None, :] - X2), axis=-1)

But, surprisingly (to me anyway) it's not faster than your loop (ca. 90 ms on my machine, compared to 24 ms for your f_cdist_1() function).

That broadcasting trick is often useful. It means you can do things like this:

>>> np.array([1,2,3]) * np.array([10, 20, 30])[:, None]
array([[10, 20, 30],
       [20, 40, 60],
       [30, 60, 90]])
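
That same rule is what makes the subtraction in f_dist work: X1[:, None, :] has shape (n1, 1, k), which broadcasts against X2's (n2, k). With the question's 128 x 512 and 256 x 512 inputs:

>>> X1 = np.random.rand(128, 512)
>>> X2 = np.random.rand(256, 512)
>>> (X1[:, None, :] - X2).shape  # (128, 1, 512) broadcast with (256, 512)
(128, 256, 512)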

2 Comments

  • I knew there was a smarter way than tiling! With your code, no-loops takes 50 ms vs. 25 ms for one-loop. Still slower, unfortunately, but much closer now.
  • I've a feeling it's slower because it has to allocate a large array of shape (128, 256, 512) before summing it (a chunked workaround is sketched below).
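
One way to keep the no-loops style while bounding that temporary (a sketch, not from the answers above; the block size of 32 is an arbitrary assumption) is to broadcast only a block of X1's rows at a time:

def f_cdist_blocked(X1, X2, block=32):
    # Temporary is (block, n2, k) instead of (n1, n2, k)
    res = np.empty((X1.shape[0], X2.shape[0]), dtype=np.float64)
    for start in range(0, X1.shape[0], block):
        stop = start + block
        res[start:stop] = np.abs(X1[start:stop, None, :] - X2).sum(axis=-1)
    return res

This trades one big allocation for many small ones, landing somewhere between the one-loop and no-loops versions in memory use.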

A solution using Numba

  • Parallelized (on very small samples, e.g. 24x24, the parallelized version will be slower due to the overhead of creating threads)
  • The inner loop is SIMD-vectorized

Example

import numpy as np
import numba as nb

#Debug output for SIMD-vectorization
import llvmlite.binding as llvm
llvm.set_option('', '--debug-only=loop-vectorize')
########################################

#Your solution
#You can also use Numba on this, but apart from parallelization
#it is often better to write out the inner loop
def f_cdist(X1, X2):
    res = np.zeros(shape=(X1.shape[0], X2.shape[0]), dtype=np.float64)

    for ix1 in range(X1.shape[0]):
        for ix2 in range(X2.shape[0]):
            res[ix1, ix2] = np.abs(X1[ix1, :] - X2[ix2, :]).sum()

    return res

@nb.njit(fastmath=True,parallel=True)
def f_cdist_nb(X1, X2):
    #Some safety, because there is no bounds-checking
    assert X1.shape[1]==X2.shape[1]
    res = np.empty(shape=(X1.shape[0], X2.shape[0]), dtype=X1.dtype)

    for ix1 in nb.prange(X1.shape[0]):
        for ix2 in range(X2.shape[0]):
            #Writing out the inner loop often leads to better performance
            sum=0.
            for i in range(X1.shape[1]):
                sum+=np.abs(X1[ix1, i] - X2[ix2, i])
            res[ix1, ix2] = sum

    return res

Performance

from scipy import spatial
#4096x1024    
X1=np.random.rand(4096,1024)
X2=np.random.rand(4096,1024)

res1=f_cdist_nb(X1,X2)
res2=f_cdist(X1,X2)
res3=spatial.distance.cdist(X1, X2, 'cityblock')

#Check the results
np.allclose(res1,res2)
True
np.allclose(res1,res3)
True

%timeit res1=f_cdist_nb(X1,X2)
1.38 s ± 64.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit res2=f_cdist(X1,X2)
1min 25s ± 483 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit res3=spatial.distance.cdist(X1, X2, 'cityblock')
17.6 s ± 18.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

#1024x1024
X1=np.random.rand(1024,1024)
X2=np.random.rand(1024,1024)

%timeit res1=f_cdist_nb(X1,X2)
63.5 ms ± 3 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit res3=spatial.distance.cdist(X1, X2, 'cityblock')
1.09 s ± 3.04 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

#512x512
X1=np.random.rand(512,512)
X2=np.random.rand(512,512)

%timeit res1=f_cdist_nb(X1,X2)
4.91 ms ± 280 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit res3=spatial.distance.cdist(X1, X2, 'cityblock')
130 ms ± 150 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

Edit: A hand optimized Numba version

#Unroll and jam: compute a 4x4 block of the output per pass, so each
#element loaded from X1 and X2 is reused four times
@nb.njit(fastmath=True,parallel=True)
def f_cdist_nb_3(X1, X2):
    assert X1.shape[1]==X2.shape[1]
    res = np.empty(shape=(X1.shape[0], X2.shape[0]), dtype=X1.dtype)

    for ix1 in nb.prange(X1.shape[0]//4):
        for ix2 in range(X2.shape[0]//4):
            sum_1,sum_2,sum_3,sum_4,sum_5,sum_6   =0.,0.,0.,0.,0.,0.
            sum_7,sum_8,sum_9,sum_10,sum_11,sum_12=0.,0.,0.,0.,0.,0.
            sum_13,sum_14,sum_15,sum_16=0.,0.,0.,0.

            for i in range(X1.shape[1]):
                sum_1+=np.abs(X1[ix1*4+0, i] - X2[ix2*4+0, i])
                sum_2+=np.abs(X1[ix1*4+0, i] - X2[ix2*4+1, i])
                sum_3+=np.abs(X1[ix1*4+0, i] - X2[ix2*4+2, i])
                sum_4+=np.abs(X1[ix1*4+0, i] - X2[ix2*4+3, i])
                sum_5+=np.abs(X1[ix1*4+1, i] - X2[ix2*4+0, i])
                sum_6+=np.abs(X1[ix1*4+1, i] - X2[ix2*4+1, i])
                sum_7+=np.abs(X1[ix1*4+1, i] - X2[ix2*4+2, i])
                sum_8+=np.abs(X1[ix1*4+1, i] - X2[ix2*4+3, i])
                sum_9+=np.abs(X1[ix1*4+2, i] - X2[ix2*4+0, i])
                sum_10+=np.abs(X1[ix1*4+2, i] - X2[ix2*4+1, i])
                sum_11+=np.abs(X1[ix1*4+2, i] - X2[ix2*4+2, i])
                sum_12+=np.abs(X1[ix1*4+2, i] - X2[ix2*4+3, i])
                sum_13+=np.abs(X1[ix1*4+3, i] - X2[ix2*4+0, i])
                sum_14+=np.abs(X1[ix1*4+3, i] - X2[ix2*4+1, i])
                sum_15+=np.abs(X1[ix1*4+3, i] - X2[ix2*4+2, i])
                sum_16+=np.abs(X1[ix1*4+3, i] - X2[ix2*4+3, i])

            res[ix1*4+0, ix2*4+0] = sum_1
            res[ix1*4+0, ix2*4+1] = sum_2
            res[ix1*4+0, ix2*4+2] = sum_3
            res[ix1*4+0, ix2*4+3] = sum_4
            res[ix1*4+1, ix2*4+0] = sum_5
            res[ix1*4+1, ix2*4+1] = sum_6
            res[ix1*4+1, ix2*4+2] = sum_7
            res[ix1*4+1, ix2*4+3] = sum_8
            res[ix1*4+2, ix2*4+0] = sum_9
            res[ix1*4+2, ix2*4+1] = sum_10
            res[ix1*4+2, ix2*4+2] = sum_11
            res[ix1*4+2, ix2*4+3] = sum_12
            res[ix1*4+3, ix2*4+0] = sum_13
            res[ix1*4+3, ix2*4+1] = sum_14
            res[ix1*4+3, ix2*4+2] = sum_15
            res[ix1*4+3, ix2*4+3] = sum_16

    #Rest of the loop
    for ix1 in range(X1.shape[0]//4*4,X1.shape[0]):
        for ix2 in range(X2.shape[0]):
            sum_1=0.
            for i in range(X1.shape[1]):
                sum_1+=np.abs(X1[ix1, i] - X2[ix2, i])
            res[ix1, ix2] = sum_1

    for ix1 in range(X1.shape[0]):
        for ix2 in range(X2.shape[0]//4*4,X2.shape[0]):
            sum_1=0.
            for i in range(X1.shape[1]):
                sum_1+=np.abs(X1[ix1, i] - X2[ix2, i])
            res[ix1, ix2] = sum_1
    return res

Timings

#4096x1024    
X1=np.random.rand(4096,1024)
X2=np.random.rand(4096,1024)

res1=f_cdist_nb(X1,X2)
res2=f_cdist_nb_3(X1,X2)
res3=spatial.distance.cdist(X1, X2, 'cityblock')

#Check the results
print(np.allclose(res1,res2))
print(np.allclose(res1,res3))

%timeit res1=f_cdist_nb(X1,X2)
1.6 s ± 199 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit res2=f_cdist_nb_3(X1,X2)
497 ms ± 50.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit res3=spatial.distance.cdist(X1, X2, 'cityblock')
17.7 s ± 118 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

