Memory Efficient L2 norm using Python broadcasting

Question

I am trying to implement a way to cluster points in a test dataset based on their similarity to a sample dataset, using Euclidean distance. The test dataset has 500 points, each point is a N dimensional vector (N=1024). The training dataset has around 10000 points and each point is also a 1024- dim vector. The goal is to find the L2-distance between each test point and all the sample points to find the closest sample (without using any python distance functions). Since the test array and training array have different sizes, I tried using broadcasting:

    import numpy as np
    dist = np.sqrt(np.sum( (test[:,np.newaxis] - train)**2, axis=2))

where test is an array of shape (500,1024) and train is an array of shape (10000,1024). I am getting a MemoryError. However, the same code works for smaller arrays. For example:

     test= np.array([[1,2],[3,4]])
     train=np.array([[1,0],[0,1],[1,1]])

Is there a more memory efficient way to do the above computation without loops? Based on the posts online, we can implement L2- norm using matrix multiplication sqrt(X * X-2*X * Y+Y * Y). So I tried the following:

    x2 = np.dot(test, test.T)
    y2 = np.dot(train,train.T)
    xy = 2* np.dot(test,train.T)

    dist = np.sqrt(x2 - xy + y2)

Since the matrices have different shapes, when I tried to broadcast, there is a dimension mismatch and I am not sure what is the right way to broadcast (dont have much experience with Python broadcasting). I would like to know what is the right way to implement the L2 distance computation as a matrix multiplication in Python, where the matrices have different shapes. The resultant distance matrix should have dist[i,j] = Euclidean distance between test point i and sample point j.

thanks

So, you are looking for a total of 5E6 distances for vectors of length 1024? Your final shape would be (500, 10000) or (10000, 500)? — wwii
– wwii, Commented Sep 30, 2015 at 3:22
It would be (500, 10000). The test points are rows, sample points are columns of the distance matrix. — user1462351
– user1462351, Commented Sep 30, 2015 at 5:56

spiky · Accepted Answer · 2016-01-19 15:20:29Z

17

Here is broadcasting with shapes of the intermediates made explicit:

m = x.shape[0] # x has shape (m, d)
n = y.shape[0] # y has shape (n, d)
x2 = np.sum(x**2, axis=1).reshape((m, 1))
y2 = np.sum(y**2, axis=1).reshape((1, n))
xy = x.dot(y.T) # shape is (m, n)
dists = np.sqrt(x2 + y2 - 2*xy) # shape is (m, n)

The documentation on broadcasting has some pretty good examples.

edited Jan 19, 2016 at 15:20

answered Jan 19, 2016 at 14:33

spiky

4234 silver badges13 bronze badges

Sign up to request clarification or add additional context in comments.

1 Comment

Akash Over a year ago

just a little correction in the last line dists = np.sqrt(x2 + y2 - 2*x(y.T))

John Greenall · Accepted Answer · 2016-01-19 14:51:30Z

3

I think what you are asking for already exists in scipy in the form of the cdist function.

from scipy.spatial.distance import cdist
res = cdist(test, train, metric='euclidean')

answered Jan 19, 2016 at 14:51

John Greenall

1,69013 silver badges17 bronze badges

Comments

The_Coder · Accepted Answer · 2018-06-25 09:11:56Z

3

Simplified and working version from this answer:

x, y = test, train

x2 = np.sum(x**2, axis=1, keepdims=True)
y2 = np.sum(y**2, axis=1)
xy = np.dot(x, y.T)
dist = np.sqrt(x2 - 2*xy + y2)

So the approach you have in mind is correct, but you need to be careful how you apply it.

To make your life easier, consider using the tested and proven functions from scipy or scikit-learn.

edited Jun 25, 2018 at 9:11

The_Coder

3215 silver badges18 bronze badges

answered Nov 19, 2015 at 14:27

user2379410

Collectives™ on Stack Overflow

Memory Efficient L2 norm using Python broadcasting

3 Answers 3

1 Comment

Comments

Comments

Your Answer

Linked

Hot Network Questions

Collectives™ on Stack Overflow

3 Answers 3

1 Comment

Comments

Comments

Your Answer

Sign up or log in

Post as a guest

Linked

Related