16

The normal ways to map a function over a numpy.ndarray, such as np.array(list(map(some_func, x))) or np.vectorize(f)(x), can't provide an index. The following code is a simple example of the kind of loop that shows up in many applications.

dis_mat = np.zeros([feature_mat.shape[0], feature_mat.shape[0]])

for i in range(feature_mat.shape[0]):
    for j in range(i, feature_mat.shape[0]):
        dis_mat[i, j] = np.linalg.norm(
            feature_mat[i, :] - feature_mat[j, :]
        )
        dis_mat[j, i] = dis_mat[i, j]

Is there a way to speed it up?


Thank you for your help! The quickest way to speed up this code turned out to be the function @user2357112 pointed out in the comments:

    from scipy.spatial.distance import pdist, squareform
    dis_mat = squareform(pdist(feature_mat))

@Julien's method is also good if feature_mat is small, but when feature_mat is 1000 by 2000 it needs nearly 40 GB of memory.
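
(Rough arithmetic, assuming float64: the broadcast difference feature_mat[:, None] - feature_mat has shape (1000, 1000, 2000), about 2 × 10^9 values ≈ 16 GB, and the squaring inside norm needs a temporary of roughly the same size, which is how the total climbs toward 40 GB.)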

10
  • 2
    Are you looking for this? Commented Nov 30, 2017 at 4:52
  • 1
    @Marco. numba sounds like overkill. This looks like it can be vectorized directly, or at least that's my hunch. My gut says to look at ufunc.outer. Commented Nov 30, 2017 at 4:53
  • 2
    As an aside, the "normal way" to apply a function over an array should not be map or vectorize; those should be reserved for cases where it's just not possible to apply more efficient methods, or where you have to get something written fast and the performance hit won't be a problem. Commented Nov 30, 2017 at 5:09
  • 3
    user167122, the etiquette here is that you don't put answers into your question. Would you consider reverting your last edit to the question (revision 5)? Commented Nov 30, 2017 at 9:59
  • 1
    If you consider user2357112's solution best, why did you accept Julien's answer? Commented Nov 30, 2017 at 20:31

6 Answers

19

SciPy comes with a function specifically to compute the kind of pairwise distances you're computing. It's scipy.spatial.distance.pdist, and it produces the distances in a condensed format that basically only stores the upper triangle of the distance matrix, but you can convert the result to square form with scipy.spatial.distance.squareform:

from scipy.spatial.distance import pdist, squareform

distance_matrix = squareform(pdist(feature_mat))

This has the benefit of avoiding the giant intermediate arrays a direct broadcasting solution requires, so it's faster and works on larger inputs. It does lose the timing race to the approach below that uses algebraic manipulation to let dot handle the heavy lifting, though.

pdist also supports a wide variety of alternate distance metrics, if you decide you want something other than Euclidean distance.

# Manhattan distance!
distance_matrix = squareform(pdist(feature_mat, 'cityblock'))

# Cosine distance!
distance_matrix = squareform(pdist(feature_mat, 'cosine'))

# Correlation distance!
distance_matrix = squareform(pdist(feature_mat, 'correlation'))

# And more! Check out the docs.
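
As an aside (not from the original answer), the condensed output of pdist is itself often all you need; it is a 1-D array holding the n*(n-1)/2 upper-triangle entries, which squareform then expands:

condensed = pdist(feature_mat)   # shape: (n*(n-1)//2,)
n = feature_mat.shape[0]
assert condensed.shape[0] == n * (n - 1) // 2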


14

You can create a new axis and broadcast:

dis_mat = np.linalg.norm(feature_mat[:,None] - feature_mat, axis=-1)

Timing:

feature_mat = np.random.rand(100,200)

def a():
    dis_mat = np.zeros([feature_mat.shape[0], feature_mat.shape[0]])
    for i in range(feature_mat.shape[0]):
        for j in range(i, feature_mat.shape[0]):
            dis_mat[i, j] = np.linalg.norm(
                feature_mat[i, :] - feature_mat[j, :]
            )
            dis_mat[j, i] = dis_mat[i, j]

def b():
    dis_mat = np.linalg.norm(feature_mat[:,None] - feature_mat, axis=-1)



%timeit a()
100 loops, best of 3: 20.5 ms per loop

%timeit b()
100 loops, best of 3: 11.8 ms per loop

3 Comments

feature_mat[:,None] - feature_mat can be huge, though. You're likely to blow your RAM on reasonably large inputs.
The usual dilemma between speed and memory... :)
Thanks for your advice. The total time dropped from 6363 ms to 967 ms, and it's really easy to understand and read.
8

Factor out what can be precomputed, and let np.dot do the heavy lifting on a k x k matrix, using only O(k^2) extra memory:

def c(m):
    xy = np.dot(m, m.T)                  # O(k^3): Gram matrix, handled by BLAS
    x2 = y2 = (m * m).sum(1)             # O(k^2): squared row norms
    d2 = np.add.outer(x2, y2) - 2 * xy   # O(k^2): ||x_i||^2 + ||x_j||^2 - 2 x_i.x_j
    d2.flat[::len(m) + 1] = 0            # Zero the diagonal (rounding issues)
    return np.sqrt(d2)                   # O(k^2)

And for comparison:

from scipy.spatial.distance import pdist, squareform

def d(m):
    return squareform(pdist(m))

Here are the timeit results for k x k initial matrices:

[Plot: runtime of c(m) versus squareform(pdist(m)) as a function of k]
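
The plot itself isn't reproduced here, but a minimal sketch of how such timings could be collected (the sizes below are illustrative, not the original ones):

import numpy as np
from timeit import timeit

for k in (100, 500, 1000, 2000):
    m = np.random.rand(k, k)
    t_c = timeit(lambda: c(m), number=3) / 3
    t_d = timeit(lambda: d(m), number=3) / 3
    print(f"k = {k}: c(m) {t_c:.3f} s, squareform(pdist(m)) {t_d:.3f} s")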

Both algorithms are O(k^3), but c(m) pushes the O(k^3) part of the work into np.dot, the core linear-algebra routine that benefits from every available optimization (multithreading, vectorized BLAS, and so on). pdist, as its source shows, is essentially plain loops.

This explains the 15x factor for big arrays, even though pdist exploits the symmetry of the matrix and computes only half of the terms.

3 Comments

Oh yeah, this solution. I'm pretty sure I've seen it in another answer before, but I didn't remember it. No giant intermediates, and you get a BLAS matrix multiply for a good chunk of the work (most of it?). Performance should be pretty good. I don't remember whether it beat pdist, but I wouldn't be surprised either way; last time I checked, pdist wasn't anywhere near as optimized as a BLAS matrix multiply.
I might be remembering it from this answer, but I feel like I saw a NumPy version. Anyway, my timings say it's competitive with pdist, winning on larger arrays but losing on smaller arrays. With a bit more optimization, it could probably beat pdist on smaller inputs.
(squareforming the pdist output takes long enough to tie with this answer on the small inputs, though.)
2

One way I thought of to avoid mixing NumPy and for loops would be to create an index array using a version of this index creator that allows for replacement:

import numpy as np
from itertools import combinations_with_replacement, chain
from scipy.special import comb

def comb_index(n, k):
    count = comb(n, k, exact=True, repetition=True)
    index = np.fromiter(chain.from_iterable(combinations_with_replacement(range(n), k)),
                        int, count=count*k)
    return index.reshape(-1, k)
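
For example (not part of the original answer), comb_index(3, 2) enumerates every index pair (i, j) with i <= j:

print(comb_index(3, 2))
# [[0 0]
#  [0 1]
#  [0 2]
#  [1 1]
#  [1 2]
#  [2 2]]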

Then, we simply take each of those array couples, compute the difference between them, reshape the resulting array, and take the norm of each of the rows of the array:

reshape_mat = np.diff(feature_mat[comb_index(feature_mat.shape[0], 2), :], axis=1).reshape(-1, feature_mat.shape[1])
dis_list = np.linalg.norm(reshape_mat, axis=-1)

Note that dis_list is simply a list of all of the n*(n+1)/2 possible norms. This runs at close to the same speed as the other answer for the feature_mat used there, and when comparing the byte sizes of our largest intermediates,

(feature_mat[:,None] - feature_mat).nbytes == 16000000

while

np.diff(feature_mat[comb_index(feature_mat.shape[0], 2), :], axis=1).reshape(-1, feature_mat.shape[1]).nbytes == 8080000

For most inputs, mine uses only half the storage: still suboptimal, but a marginal improvement.
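
If you need the full square matrix rather than the flat list, the same index pairs can be scattered back into place (a small sketch, not part of the original answer, reusing comb_index from above):

n = feature_mat.shape[0]
idx = comb_index(n, 2)
dis_mat = np.zeros((n, n))
dis_mat[idx[:, 0], idx[:, 1]] = dis_list
dis_mat[idx[:, 1], idx[:, 0]] = dis_list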


2

Based on np.triu_indices, in case you really want to do this with pure NumPy:

s = feature_mat.shape[0]
i, j = np.triu_indices(s, 1)         # All possible combinations of indices
dist_mat = np.empty((s, s))          # Don't waste time filling with zeros
np.einsum('ii->i', dist_mat)[:] = 0  # When you can just fill the diagonal
dist_mat[i, j] = dist_mat[j, i] = np.linalg.norm(feature_mat[i] - feature_mat[j], axis=-1)
                                     # Vectorized version of your original process

The benefit of this method over broadcasting is that you can do it in chunks:

n = 10000000   # Chunk size, based on your available RAM
for k in range(0, i.size, n):
    i_ = i[k: k + n]
    j_ = j[k: k + n]
    dist_mat[i_, j_] = dist_mat[j_, i_] = \
                     np.linalg.norm(feature_mat[i_] - feature_mat[j_], axis=-1)


-1

Let's begin by rewriting this in terms of a function:

def dist(mat, i, j):
    return np.linalg.norm(mat[i, :] - mat[j, :])

size = feature_mat.shape[0]
dis_mat = np.zeros((size, size))

for i in range(size):
    for j in range(size):
        dis_mat[i, j] = dist(feature_mat, i, j)

This can be rewritten in (a slightly more) vectorized form as:

v = [dist(feature_mat, i, j) for i in range(size) for j in range(size)]
dist_mat = np.array(v).reshape(size, size)

Notice that we're still relying on Python rather than NumPy for some of the computation, but it's a step towards vectorization. Also notice that dist(mat, i, j) is symmetric in i and j, so we could cut the computation roughly in half. Perhaps something like:

v = [dist(feature_mat, i, j) for i in range(size) for j in range(i + 1)]

Now the tricky bit is assigning these computed values to the correct elements of dist_mat; one way is sketched below.
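
One option (a sketch, not part of the original answer) is to generate the lower-triangle indices in the same order as the comprehension above and mirror them:

i_idx, j_idx = np.tril_indices(size)   # lower triangle incl. diagonal, row-major order
v = np.array([dist(feature_mat, i, j) for i, j in zip(i_idx, j_idx)])
dist_mat = np.zeros((size, size))
dist_mat[i_idx, j_idx] = v             # fill the lower triangle
dist_mat[j_idx, i_idx] = v             # mirror into the upper triangle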

How fast this performs depends on the cost of computing dist(i, j). For small feature_mats, the cost of recomputing is not high enough to worry about this. But for large matrices, you definitely do not want to recompute.

1 Comment

Looking at the result of your code, isn't v just a Python list?
