2

my question regards the vectorization of my code. I have one array that holds 3D-coordinates and one array that holds the information of edges that connect the coordinates:

In [8]:coords
Out[8]: 
array([[ 11.22727013,  24.72620964,   2.02986932],
       [ 11.23895836,  24.67577744,   2.04130101],
       [ 11.23624039,  24.63677788,   2.04096866],
       [ 11.22516632,  24.5986824 ,   2.04045677],
       [ 11.21166992,  24.56095695,   2.03898215],
       [ 11.20334721,  24.5227356 ,   2.03556442],
       [ 11.2064085 ,  24.48479462,   2.03098583],
       [ 11.22059727,  24.44837189,   2.02649784],
       [ 11.24213409,  24.41513252,   2.01979685]])

In [13]:edges
Out[13]: 
array([[0, 1],
       [1, 2],
       [2, 3],
       [3, 4],
       [4, 5],
       [5, 6],
       [6, 7],
       [7, 8],], dtype=int32)

Now, I would like to calculate the sum of the euclidian distance between the coordinates in the edges array. E.g. Distance from coords[0] to coords[1] + distance from coords[1] to coords[2] .....

I have the following code, which does the job:

def networkLength(coords, edges):

   from scipy.spatial import distance 
   distancesNetwork = np.array([])    

   for i in range(edges.shape[0]):
        distancesNetwork = np.append(distancesNetwork, distance.euclidean(coords[edges[i, 0]], coords[edges[i, 1]]))

   return sum(distancesNetwork)

I was wondering whether it is possible to vectorize the code, rather than doing a loop. What is the pythonian way to do it? Thanks a lot!!

4
  • 1. It's considered a very bad practice to put import statements anywhere other than the beginning of any module. Importing from within a function is especially strange. 2. Appending numpy arrays in a loop is highly inefficient, because it leads to many reallocation, making any algorithm effectively O(n^2) in the best case. Commented Dec 18, 2016 at 19:06
  • @EliKorvigo 1. I beg to differ: importing within a function is useful for breaking circular dependencies, or pulling in large packages that are only needed for optional functionality, so that someone who never uses those functions can use the module without needing the large package. It's unusual, sure, but not very bad practice. Just know why you're doing it. I do agree that it's probably unnecessary here. Commented Dec 18, 2016 at 19:26
  • @DavidZ in my practice, when you want to avoid importing something heavy, you can simply put the function into a separate module in your package. This is a common way to fix the circular import problem, too. Commented Dec 18, 2016 at 19:49
  • Thanks guys! The feedback helps a lot! Commented Dec 18, 2016 at 20:31

1 Answer 1

2

Approach #1

We could slice out the first and second columns altogether for indexing into coords instead of iterating for each element along them and perform the euclidean distance computations that involves element-wise squaring and summing along each row and then getting the element-wise square-root. Finally, we need to sum all those values for one scalar as shown in the original code.

Thus, one vectorized implementation would be -

np.sqrt(((coords[edges[:, 0]] - coords[edges[:, 1]])**2).sum(1)).sum()

There's a built-in in NumPy to do those distance computing operations as np.linalg.norm. In terms of performance, I would think it would be comparable to what we have just listed earlier. For the sake of completeness, the implementation would be -

np.linalg.norm(coords[edges[:, 0]] - coords[edges[:, 1]],axis=1).sum()

Approach #2

Tweaking the earlier approach, we could use np.einsum that in one step would perform both squaring and summing along each row and as such would be a bit more efficient.

The implementation would look something like this -

s = coords[edges[:, 0]] - coords[edges[:, 1]]
out = np.sqrt(np.einsum('ij,ij->i',s,s)).sum()

Runtime test

Function definitions -

def networkLength(coords, edges): # Original code from question
   distancesNetwork = np.array([])    
   for i in range(edges.shape[0]):
        distancesNetwork = np.append(distancesNetwork, \
        distance.euclidean(coords[edges[i, 0]], coords[edges[i, 1]]))
   return sum(distancesNetwork)

def vectorized_app1(coords, edges):
    return np.sqrt(((coords[edges[:, 0]] - coords[edges[:, 1]])**2).sum(1)).sum()

def vectorized_app2(coords, edges):
    s = coords[edges[:, 0]] - coords[edges[:, 1]]
    return np.sqrt(np.einsum('ij,ij->i',s,s)).sum()

Verification and Timings -

In [114]: # Setup bigger inputs
     ...: coords = np.random.rand(100,3)
     ...: edges = np.random.randint(0,100,(10000,2))

# Verify results across all approaches
In [115]: networkLength(coords, edges)
Out[115]: 6607.8829431403547

In [116]: vectorized_app1(coords, edges)
Out[116]: 6607.8829431403337

In [117]: vectorized_app2(coords, edges)
Out[117]: 6607.8829431403337

In [118]: %timeit networkLength(coords, edges)
     ...: %timeit vectorized_app1(coords, edges)
     ...: %timeit vectorized_app2(coords, edges)
     ...: 
1 loops, best of 3: 519 ms per loop
1000 loops, best of 3: 822 µs per loop
1000 loops, best of 3: 668 µs per loop
Sign up to request clarification or add additional context in comments.

2 Comments

Wow, thank you so much for taking the time. That helps a lot for my understanding!! It will likely take some time for me to understand what is happening in the np.einsum approach :)
fyi - Just compared the 'np.linalg.norm' and the 'np.sqrt(((coords[edges[:, 0]] - coords[edges[:, 1]])**2).sum(1)).sum()' approach. The latter is slightly faster.

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.