
I have a pandas.DataFrame containing many object coordinates, arranged into identity, x, and y columns.

I'm trying to find the closest objects across the two identities. To clarify what I mean, take this code:

import numpy as np
import pandas as pd

# Generate random data
df_identity_1 = pd.DataFrame({'identity':1, 'x':np.random.randn(10000), 'y':np.random.randn(10000)})
df_identity_2 = pd.DataFrame({'identity':2, 'x':np.random.randn(10000), 'y':np.random.randn(10000)})
df = pd.concat([df_identity_1, df_identity_2])

>>> df
      identity         x         y
0            1 -1.784748  2.085517
1            1  0.324645 -1.584790
2            1 -0.044623 -0.348576
3            1  0.802035  1.362336
4            1 -0.091508 -0.655114
...        ...       ...       ...
9995         2  0.939491  0.304964
9996         2 -0.233707 -0.135265
9997         2  0.792494  1.157236
9998         2 -0.385080 -0.021226
9999         2  0.105970 -0.042135

Currently, I have to go through each row and iterate through the entire DataFrame again to find the closest coordinate.

# Function to find the absolute / Euclidean distance between two coordinates
def euclidean(x1, y1, x2, y2):
    a = np.array((int(x1), int(y1)))
    b = np.array((int(x2), int(y2)))
    return np.linalg.norm(a-b)

# Function to find the closest coordinate with a different identity
def find_closest_coord(row, df):
    d = df[(df['identity'] != int(row['identity']))]
    if d.empty:
        return None
    return min(euclidean(row.x, row.y, r.x, r.y) for r in df.itertuples(index=False))

df['closest_coord'] = df.apply(lambda row: find_closest_coord(row, df), axis=1)

This code is fully functional, but when I have a large dataset (100k+ coordinates) this "nested" for-loop is extremely time-consuming.

Is there some functionality that could speed up this concept or a faster approach altogether?

1 Comment
  • For efficient math operations between arrays, you probably want to use NumPy instead of writing pure Python code. Commented Dec 22, 2019 at 19:58

1 Answer


The best way to solve this problem is to use a spatial data structure. These data structures allow you to dramatically reduce the size of the search space when you need to do these kinds of queries. SciPy provides a KD-tree for nearest neighbor queries, but it would be a bit of a hassle to scale this to multiple machines (if the size of your data requires that).
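
For a single machine, a minimal sketch of that KD-tree approach (using scipy.spatial.cKDTree on the two frames from your example) might look like this:

from scipy.spatial import cKDTree

# Build one tree per identity, then query each point against the *other* tree
tree_1 = cKDTree(df_identity_1[['x', 'y']].values)
tree_2 = cKDTree(df_identity_2[['x', 'y']].values)

# query() returns (distances, indices) of the nearest neighbor for each query point
nearest_dist_1, _ = tree_2.query(df_identity_1[['x', 'y']].values, k=1)
nearest_dist_2, _ = tree_1.query(df_identity_2[['x', 'y']].values, k=1)

Each query is roughly logarithmic in the number of points, so this stays fast well past 100k coordinates.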

If you need to scale beyond that, you'd probably want to use dedicated geospatial analytics tools.

In general, if you want to speed up something like this, you need to trade off iterative approaches against memory-intensive ones.

However, in this case, your core bottlenecks are:

  • Iterating row by row
  • Calling euclidean once for every pair of rows, rather than once for the whole dataset.

NumPy functions like norm are columnar in nature, and you should take advantage of that by calling it on the entire array of data. If each of your dataframes is 10,000 rows, you're calling norm 100 million times. Just tweaking this a bit to make that change should help you a lot.
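
As a toy sketch of that difference (small slices here, because the full pairwise distance matrix gets memory-hungry quickly):

a = df_identity_1[['x', 'y']].values[:1000]   # identity-1 coordinates
b = df_identity_2[['x', 'y']].values[:1000]   # identity-2 coordinates

# Row by row: one tiny norm call per pair -> 1,000,000 calls for 1,000 x 1,000 points
# dists = [[np.linalg.norm(p - q) for q in b] for p in a]

# Columnar: a single norm call over the whole (1000, 1000, 2) difference array
diff = a[:, None, :] - b[None, :, :]
dists = np.linalg.norm(diff, axis=2)
closest = dists.min(axis=1)   # distance to the nearest identity-2 point, per identity-1 point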

If you want to do this in Python at scale and couldn't use a spatial data structure effectively (and don't want to use heuristics to reduce the search space), something like the following would probably work: cross-product join the two tables, calculate the euclidean distance once with a single columnar operation, and use a groupby-aggregation (min) to get the closest points.

This would be much faster and much more memory intensive than iterating row by row like you are doing, but could easily be scaled with something like Dask (or Spark).
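
For reference, a rough sketch of what that pipeline might look like in Dask (npartitions here is a placeholder to tune; the steps mirror the pandas walkthrough below):

import dask.dataframe as dd

# Give each frame an explicit row index and a dummy join key, as in the pandas version below
left_pdf = df_identity_1.reset_index().assign(key=1)
right_pdf = df_identity_2.reset_index().assign(key=1)

left = dd.from_pandas(left_pdf, npartitions=8)
right = dd.from_pandas(right_pdf, npartitions=8)

# Cross-product join on the dummy key, evaluated lazily across partitions.
# With a constant key the join is effectively a broadcast; at real scale you would
# want a smarter join key (e.g. spatial bins) so the work spreads across partitions.
crossprod = left.merge(right, on='key', suffixes=('_x', '_y')).drop(columns='key')

# Element-wise Euclidean distance as a single columnar expression
crossprod['dist'] = ((crossprod['x_x'] - crossprod['x_y']) ** 2 +
                     (crossprod['y_x'] - crossprod['y_y']) ** 2) ** 0.5

# Closest identity-2 point for each identity-1 point
closest_per_identity1 = crossprod.groupby('index_x')['dist'].min().compute()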

I'm going to use only a few rows to illustrate the logic.

import numpy as np
import pandas as pd

# Generate random data
nrows = 3
df_identity_1 = pd.DataFrame({'identity':1, 'x':np.random.randn(nrows), 'y':np.random.randn(nrows)})
df_identity_2 = pd.DataFrame({'identity':2, 'x':np.random.randn(nrows), 'y':np.random.randn(nrows)})
df_identity_1.reset_index(drop=False, inplace=True)
df_identity_2.reset_index(drop=False, inplace=True)

Notice how I'm creating a unique index in addition to the identity flag for each dataframe. This will come in handy later for the groupby. Next, I can do the cross-product join. This would be cleaner if we used different column names, but I'll keep it consistent with your example. This join will quickly go out of memory in pure Pandas as the dataset grows, but Dask (https://dask.org/) would be able to handle it quite well.

def cross_product(left, right):
    return left.assign(key=1).merge(right.assign(key=1), on='key').drop(columns='key')

crossprod = cross_product(df_identity_1, df_identity_2)
crossprod
   index_x  identity_x       x_x       y_x  index_y  identity_y       x_y       y_y
0        0           1  1.660468 -1.954339        0           2 -0.431543  0.500864
1        0           1  1.660468 -1.954339        1           2 -0.607647 -0.436480
2        0           1  1.660468 -1.954339        2           2  1.613126 -0.696860
3        1           1  0.153419  0.619493        0           2 -0.431543  0.500864
4        1           1  0.153419  0.619493        1           2 -0.607647 -0.436480
5        1           1  0.153419  0.619493        2           2  1.613126 -0.696860
6        2           1 -0.592440 -0.299046        0           2 -0.431543  0.500864
7        2           1 -0.592440 -0.299046        1           2 -0.607647 -0.436480
8        2           1 -0.592440 -0.299046        2           2  1.613126 -0.696860

Next, we just need to calculate the minimum distance for each row and then group by each index_x and index_y (respectively) and get the minimum distance value. Notice how we can do this with a single call to norm, rather than one call per row.

crossprod['dist'] = np.linalg.norm(crossprod[['x_x', 'y_x']].values - crossprod[['x_y', 'y_y']].values, axis=1)
closest_per_identity1 = crossprod.groupby(['index_x']).agg({'dist':'min'})
closest_per_identity2 = crossprod.groupby(['index_y']).agg({'dist':'min'})
closest_per_identity1
             dist
index_x
0        1.258370
1        0.596869
2        0.138273

closest_per_identity2
             dist
index_y
0        0.596869
1        0.138273
2        1.258370

For comparison, here is your original approach run on the same data. Note that I changed your int calls to floats and your itertuples to iterate through d, rather than df (as otherwise you're comparing each point to itself).

df = pd.concat([df_identity_1, df_identity_2])

def euclidean(x1, y1, x2, y2):
    a = np.array((float(x1), float(y1)))
    b = np.array((float(x2), float(y2)))
    return np.linalg.norm(a - b)

# Function to find the closest coordinate with a different identity
def find_closest_coord(row, df):
    d = df[df['identity'] != int(row['identity'])]
    if d.empty:
        return None
    return min(euclidean(row.x, row.y, r.x, r.y) for r in d.itertuples(index=False))

df['closest_coord'] = df.apply(lambda row: find_closest_coord(row, df), axis=1)
df
   index  identity          x          y  closest_coord
0      0         1   1.660468  -1.954339       1.258370
1      1         1   0.153419   0.619493       0.596869
2      2         1  -0.592440  -0.299046       0.138273
0      0         2  -0.431543   0.500864       0.596869
1      1         2  -0.607647  -0.436480       0.138273
2      2         2   1.613126  -0.696860       1.258370

2 Comments

Wow! Thank you so much for the detailed analysis and response.
You're welcome! As a note, I don't recommend trying the naive cross-join if you have some heuristics you could use to reduce the search space first. If you can reduce the search space (e.g., by creating "bins" using something like cut, rounding, etc.), you will be much better off doing a join on that kind of key rather than the pure key=1.
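
For example, a rough sketch of that binning idea (the 0.5 bin width is arbitrary, and a complete solution also has to check neighbouring bins, since a point's true nearest neighbour can sit just across a bin boundary):

# Assign each point to a coarse grid cell; only points sharing a cell get compared
bin_width = 0.5
for d in (df_identity_1, df_identity_2):
    d['x_bin'] = (d['x'] // bin_width).astype(int)
    d['y_bin'] = (d['y'] // bin_width).astype(int)

# Join on the bin key instead of the pure key=1 cross product
candidates = df_identity_1.merge(df_identity_2, on=['x_bin', 'y_bin'], suffixes=('_1', '_2'))
candidates['dist'] = np.linalg.norm(
    candidates[['x_1', 'y_1']].values - candidates[['x_2', 'y_2']].values, axis=1)
closest = candidates.groupby('index_1')['dist'].min()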
