Vectorizing euclidean distance computation - NumPy

Question

my question regards the vectorization of my code. I have one array that holds 3D-coordinates and one array that holds the information of edges that connect the coordinates:

In [8]:coords
Out[8]: 
array([[ 11.22727013,  24.72620964,   2.02986932],
       [ 11.23895836,  24.67577744,   2.04130101],
       [ 11.23624039,  24.63677788,   2.04096866],
       [ 11.22516632,  24.5986824 ,   2.04045677],
       [ 11.21166992,  24.56095695,   2.03898215],
       [ 11.20334721,  24.5227356 ,   2.03556442],
       [ 11.2064085 ,  24.48479462,   2.03098583],
       [ 11.22059727,  24.44837189,   2.02649784],
       [ 11.24213409,  24.41513252,   2.01979685]])

In [13]:edges
Out[13]: 
array([[0, 1],
       [1, 2],
       [2, 3],
       [3, 4],
       [4, 5],
       [5, 6],
       [6, 7],
       [7, 8],], dtype=int32)

Now, I would like to calculate the sum of the euclidian distance between the coordinates in the edges array. E.g. Distance from coords[0] to coords[1] + distance from coords[1] to coords[2] .....

I have the following code, which does the job:

def networkLength(coords, edges):

   from scipy.spatial import distance 
   distancesNetwork = np.array([])    

   for i in range(edges.shape[0]):
        distancesNetwork = np.append(distancesNetwork, distance.euclidean(coords[edges[i, 0]], coords[edges[i, 1]]))

   return sum(distancesNetwork)

I was wondering whether it is possible to vectorize the code, rather than doing a loop. What is the pythonian way to do it? Thanks a lot!!

1. It's considered a very bad practice to put import statements anywhere other than the beginning of any module. Importing from within a function is especially strange. 2. Appending numpy arrays in a loop is highly inefficient, because it leads to many reallocation, making any algorithm effectively O(n^2) in the best case. — Eli Korvigo
– Eli Korvigo, Commented Dec 18, 2016 at 19:06
@EliKorvigo 1. I beg to differ: importing within a function is useful for breaking circular dependencies, or pulling in large packages that are only needed for optional functionality, so that someone who never uses those functions can use the module without needing the large package. It's unusual, sure, but not very bad practice. Just know why you're doing it. I do agree that it's probably unnecessary here. — David Z
– David Z, Commented Dec 18, 2016 at 19:26
@DavidZ in my practice, when you want to avoid importing something heavy, you can simply put the function into a separate module in your package. This is a common way to fix the circular import problem, too. — Eli Korvigo
– Eli Korvigo, Commented Dec 18, 2016 at 19:49

Divakar · Accepted Answer · 2016-12-18 19:46:02Z

2

Approach #1

We could slice out the first and second columns altogether for indexing into coords instead of iterating for each element along them and perform the euclidean distance computations that involves element-wise squaring and summing along each row and then getting the element-wise square-root. Finally, we need to sum all those values for one scalar as shown in the original code.

Thus, one vectorized implementation would be -

np.sqrt(((coords[edges[:, 0]] - coords[edges[:, 1]])**2).sum(1)).sum()

There's a built-in in NumPy to do those distance computing operations as np.linalg.norm. In terms of performance, I would think it would be comparable to what we have just listed earlier. For the sake of completeness, the implementation would be -

np.linalg.norm(coords[edges[:, 0]] - coords[edges[:, 1]],axis=1).sum()

Approach #2

Tweaking the earlier approach, we could use np.einsum that in one step would perform both squaring and summing along each row and as such would be a bit more efficient.

The implementation would look something like this -

s = coords[edges[:, 0]] - coords[edges[:, 1]]
out = np.sqrt(np.einsum('ij,ij->i',s,s)).sum()

Runtime test

Function definitions -

def networkLength(coords, edges): # Original code from question
   distancesNetwork = np.array([])    
   for i in range(edges.shape[0]):
        distancesNetwork = np.append(distancesNetwork, \
        distance.euclidean(coords[edges[i, 0]], coords[edges[i, 1]]))
   return sum(distancesNetwork)

def vectorized_app1(coords, edges):
    return np.sqrt(((coords[edges[:, 0]] - coords[edges[:, 1]])**2).sum(1)).sum()

def vectorized_app2(coords, edges):
    s = coords[edges[:, 0]] - coords[edges[:, 1]]
    return np.sqrt(np.einsum('ij,ij->i',s,s)).sum()

Verification and Timings -

In [114]: # Setup bigger inputs
     ...: coords = np.random.rand(100,3)
     ...: edges = np.random.randint(0,100,(10000,2))

# Verify results across all approaches
In [115]: networkLength(coords, edges)
Out[115]: 6607.8829431403547

In [116]: vectorized_app1(coords, edges)
Out[116]: 6607.8829431403337

In [117]: vectorized_app2(coords, edges)
Out[117]: 6607.8829431403337

In [118]: %timeit networkLength(coords, edges)
     ...: %timeit vectorized_app1(coords, edges)
     ...: %timeit vectorized_app2(coords, edges)
     ...: 
1 loops, best of 3: 519 ms per loop
1000 loops, best of 3: 822 µs per loop
1000 loops, best of 3: 668 µs per loop

edited Dec 18, 2016 at 19:46

answered Dec 18, 2016 at 18:53

Divakar

222k19 gold badges273 silver badges374 bronze badges

Sign up to request clarification or add additional context in comments.

2 Comments

Ozzycalifornia Over a year ago

Wow, thank you so much for taking the time. That helps a lot for my understanding!! It will likely take some time for me to understand what is happening in the np.einsum approach :)

Ozzycalifornia Over a year ago

fyi - Just compared the 'np.linalg.norm' and the 'np.sqrt(((coords[edges[:, 0]] - coords[edges[:, 1]])**2).sum(1)).sum()' approach. The latter is slightly faster.

Collectives™ on Stack Overflow

Vectorizing euclidean distance computation - NumPy

1 Answer 1

2 Comments

Your Answer

Linked

Hot Network Questions

Collectives™ on Stack Overflow

1 Answer 1

2 Comments

Your Answer

Sign up or log in

Post as a guest

Linked

Related