2

I have two 3000x3 arrays and I'd like to compute the 1-to-1 (row-by-row) Euclidean distance between them. For example, vec1 is

1 1 1  
2 2 2    
3 3 3  
4 4 4  
...

and vec2 is

2 2 2
3 3 3  
4 4 4  
5 5 5  
...

I'd like to get the results as

1.73205081  
1.73205081
1.73205081
1.73205081
...

I tried scipy.spatial.distance.cdist(vec1, vec2), but it returns a 3000x3000 matrix, whereas I only need the main diagonal. I also tried np.sqrt(np.sum((vec1-vec2)**2 for vec1,vec2 in zip(vec1,vec2))), and that didn't work for my purpose either. Is there a way to compute just these distances? I'd appreciate any comments.

2
  • Are you storing your vectors in a list? Commented Aug 21, 2015 at 17:41
  • Yes, in 2 different files. The following posts answered my question. Thanks anyway. Commented Aug 21, 2015 at 18:41

2 Answers

3

cdist gives you back a 3000 x 3000 array because it computes the distance between every pair of row vectors in your two input arrays.
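To see the difference concretely, here is a minimal sketch on a hypothetical 4-row sample. It builds the full pairwise matrix with plain NumPy broadcasting (standing in for cdist with its default Euclidean metric) and shows that the 1-to-1 distances are its main diagonal. The names a, b, pairwise, and diagonal are illustrative only.

```python
import numpy as np

# Hypothetical 4-row sample in the shape of the question's data
a = np.array([[1, 1, 1], [2, 2, 2], [3, 3, 3], [4, 4, 4]], dtype=float)
b = a + 1

# All-pairs distances: entry (i, j) is the distance between a[i] and b[j]
pairwise = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=2)
print(pairwise.shape)  # (4, 4)

# The 1-to-1 distances sit on the main diagonal
diagonal = np.diag(pairwise)
print(np.allclose(diagonal, np.sqrt(3)))  # True
```

For 3000 rows, the all-pairs approach computes 9,000,000 distances in order to keep 3000 of them.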

To compute only the distances between corresponding row indices, you could use np.linalg.norm:

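import numpy as np

# Example data: row i of a is [i+1, i+1, i+1]; b is a shifted by 1
a = np.repeat((np.arange(3000) + 1)[:, None], 3, 1)
b = a + 1

# Norm of each row of the difference: one distance per pair of rows
dist = np.linalg.norm(a - b, axis=1)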

Or using standard vectorized array operations:

dist = np.sqrt(((a - b) ** 2).sum(1))
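As a quick check, both expressions can be run on the four sample rows from the question. This is only a sketch: the small hand-written arrays and the names diff, dist_norm, and dist_manual are assumptions for illustration.

```python
import numpy as np

# The four sample rows shown in the question
vec1 = np.array([[1, 1, 1], [2, 2, 2], [3, 3, 3], [4, 4, 4]], dtype=float)
vec2 = np.array([[2, 2, 2], [3, 3, 3], [4, 4, 4], [5, 5, 5]], dtype=float)

diff = vec1 - vec2

# Approach 1: row-wise norm
dist_norm = np.linalg.norm(diff, axis=1)

# Approach 2: square, sum within each row, take the square root
dist_manual = np.sqrt((diff ** 2).sum(axis=1))

print(dist_norm)  # four values, each about 1.732 (the square root of 3)
print(np.allclose(dist_norm, dist_manual))  # True
```

In both cases axis=1 is what matters: it reduces over the three coordinates of each row, so the result has one value per row.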

0

Here's another way. It still uses np.linalg.norm, but it also parses the raw text into arrays first, in case that is something you need.

import numpy as np

vec1 = '''1 1 1
    2 2 2
    3 3 3
    4 4 4'''
vec2 = '''2 2 2
    3 3 3
    4 4 4
    5 5 5'''

# Parse each line of text into a row of floats.
# splitlines() iterates over lines; iterating the string directly would go character by character.
process_vec1 = np.array([list(map(float, line.split())) for line in vec1.splitlines()])
process_vec2 = np.array([list(map(float, line.split())) for line in vec2.splitlines()])

# Euclidean distance between corresponding rows
dist = np.linalg.norm(process_vec1 - process_vec2, axis=1)

print(dist)

[1.73205081 1.73205081 1.73205081 1.73205081]
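Since the asker mentions that the data lives in two files, one possible alternative to hand-parsing is np.loadtxt, which reads whitespace-separated numeric text into a 2-D array. This sketch uses io.StringIO objects in place of real files; the contents and the names file1, file2, arr1, and arr2 are made up for illustration.

```python
import io
import numpy as np

# Stand-ins for the two files; with real files, pass a path to np.loadtxt instead
file1 = io.StringIO("1 1 1\n2 2 2\n3 3 3\n4 4 4\n")
file2 = io.StringIO("2 2 2\n3 3 3\n4 4 4\n5 5 5\n")

arr1 = np.loadtxt(file1)  # 2-D float array, one row per line of text
arr2 = np.loadtxt(file2)

# Euclidean distance between corresponding rows
dist = np.linalg.norm(arr1 - arr2, axis=1)
print(dist.shape)  # (4,)
```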

1 Comment

In general it's going to be a lot faster to use vectorization to process multiple rows (e.g. np.linalg.norm(process_vec1 - process_vec2, axis=1)) rather than using map, which implicitly iterates over the rows in Python rather than C.
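A rough way to check this point: the sketch below computes the same distances with a Python-level loop and with one vectorized call, then times both with timeit. The random data, array sizes, and function names are assumptions for illustration, and timings vary by machine, so none are quoted here.

```python
import timeit
import numpy as np

rng = np.random.default_rng(0)
a = rng.random((3000, 3))
b = rng.random((3000, 3))

def dist_loop(x, y):
    # One np.linalg.norm call per pair of rows, driven by a Python loop
    return np.array([np.linalg.norm(r1 - r2) for r1, r2 in zip(x, y)])

def dist_vectorized(x, y):
    # A single call that handles every row at once
    return np.linalg.norm(x - y, axis=1)

# Both approaches agree on the result
same = np.allclose(dist_loop(a, b), dist_vectorized(a, b))
print(same)  # True

t_loop = timeit.timeit(lambda: dist_loop(a, b), number=10)
t_vec = timeit.timeit(lambda: dist_vectorized(a, b), number=10)
print(f"loop: {t_loop:.4f}s  vectorized: {t_vec:.4f}s")
```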
