The best way to solve this problem is to use a spatial data structure. These structures let you dramatically reduce the search space for these kinds of queries. SciPy provides a KD-tree (scipy.spatial.cKDTree) for nearest-neighbor queries, but it would be a bit of a hassle to scale this to multiple machines (if the size of your data requires that).
If you need to scale beyond that, you'd probably want to use dedicated geospatial analytics tools.
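To make the KD-tree route concrete, here's a minimal sketch (the array names are just for illustration):

import numpy as np
from scipy.spatial import cKDTree

points_a = np.random.randn(10_000, 2)
points_b = np.random.randn(10_000, 2)

# Build the tree once over one set, then query it with the other set;
# each query is O(log n) on average rather than O(n).
tree = cKDTree(points_b)
dist, idx = tree.query(points_a, k=1)  # nearest point in points_b for each point in points_a

That's 10,000 log-time lookups instead of 100 million pairwise distance computations.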
In general, speeding up something like this means trading memory for time: vectorized approaches are much faster than row-by-row iteration, but they hold much more in memory at once.
However, in this case, your core bottlenecks are:
- Iterating row by row
- Calling euclidean once for every pair of rows, rather than once per dataset
NumPy functions like np.linalg.norm are columnar in nature, and you should take advantage of that by calling them on the entire array at once. If each of your dataframes is 10,000 rows, you're calling norm 100 million times. Tweaking your code to make just that one change should help a lot.
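To see what that buys, here's a toy comparison (the arrays a and b are illustrative): both versions compute all pairwise distances between two small point sets, but the second makes a single columnar call via broadcasting:

import numpy as np

a = np.random.randn(5, 2)
b = np.random.randn(5, 2)

# One norm call per pair: 25 separate Python-level calls
slow = [np.linalg.norm(a[i] - b[j]) for i in range(len(a)) for j in range(len(b))]

# One norm call total: broadcast to a (5, 5, 2) difference array, reduce over the last axis
fast = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=2).ravel()

assert np.allclose(slow, fast)

Note the broadcast version builds an n-by-m-by-2 intermediate array, which is exactly the memory-for-time tradeoff described above.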
If you want to do this in Python at scale and can't use a spatial data structure effectively (and don't want to use heuristics to shrink the search space), something like the following would work: cross-product join the two tables, compute the Euclidean distance once with a single columnar operation, and use a groupby aggregation (min) to get the closest points.
This is much faster, but much more memory-intensive, than iterating row by row as you're doing, and it scales easily with something like Dask (or Spark).
I'm going to use only a few rows to illustrate the logic.
import numpy as np
import pandas as pd
# Generate random data
nrows = 3
df_identity_1 = pd.DataFrame({'identity':1, 'x':np.random.randn(nrows), 'y':np.random.randn(nrows)})
df_identity_2 = pd.DataFrame({'identity':2, 'x':np.random.randn(nrows), 'y':np.random.randn(nrows)})
df_identity_1.reset_index(drop=False, inplace=True)
df_identity_2.reset_index(drop=False, inplace=True)
Notice how I'm creating a unique index in addition to the identity flag for each dataframe. This will come in handy later for the groupby. Next, I can do the cross-product join. This would be cleaner if we used different column names, but I'll keep it consistent with your example. This join will quickly run out of memory in pure pandas as the dataset grows, but Dask (https://dask.org/) handles it quite well.
def cross_product(left, right):
    # A temporary constant key produces the full cross join; drop it afterwards
    return left.assign(key=1).merge(right.assign(key=1), on='key').drop(columns='key')
crossprod = cross_product(df_identity_1, df_identity_2)
crossprod
index_x identity_x x_x y_x index_y identity_y x_y y_y
0 0 1 1.660468 -1.954339 0 2 -0.431543 0.500864
1 0 1 1.660468 -1.954339 1 2 -0.607647 -0.436480
2 0 1 1.660468 -1.954339 2 2 1.613126 -0.696860
3 1 1 0.153419 0.619493 0 2 -0.431543 0.500864
4 1 1 0.153419 0.619493 1 2 -0.607647 -0.436480
5 1 1 0.153419 0.619493 2 2 1.613126 -0.696860
6 2 1 -0.592440 -0.299046 0 2 -0.431543 0.500864
7 2 1 -0.592440 -0.299046 1 2 -0.607647 -0.436480
8 2 1 -0.592440 -0.299046 2 2 1.613126 -0.696860
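As an aside, if your pandas is 1.2 or newer (which added the how='cross' merge option), you don't need the key-column trick at all:

crossprod = df_identity_1.merge(df_identity_2, how='cross')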
Next, we calculate the distance for each row, then group by index_x and index_y (respectively) and take the minimum distance. Notice that this is a single call to norm, rather than one call per row.
crossprod['dist'] = np.linalg.norm(crossprod[['x_x', 'y_x']].values - crossprod[['x_y', 'y_y']].values, axis=1)
closest_per_identity1 = crossprod.groupby(['index_x']).agg({'dist':'min'})
closest_per_identity2 = crossprod.groupby(['index_y']).agg({'dist':'min'})
closest_per_identity1
dist
index_x
0 1.258370
1 0.596869
2 0.138273
closest_per_identity2
dist
index_y
0 0.596869
1 0.138273
2 1.258370
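If you also need which point is closest, not just how far away it is, idxmin on the grouped distances recovers the full matching rows; a quick sketch on the same crossprod frame:

closest_rows = crossprod.loc[crossprod.groupby('index_x')['dist'].idxmin()]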
For comparison, here's your original approach on the same data. Note that I changed your int calls to floats, and your itertuples to iterate through d rather than df (as otherwise you're comparing each point to itself).
df = pd.concat([df_identity_1, df_identity_2])
def euclidean(x1, y1, x2, y2):
    a = np.array((float(x1), float(y1)))
    b = np.array((float(x2), float(y2)))
    return np.linalg.norm(a - b)
# Function to find the closest coordinate with a different identity
def find_closest_coord(row, df):
    d = df[df['identity'] != int(row['identity'])]
    if d.empty:
        return None
    return min(euclidean(row.x, row.y, p.x, p.y) for p in d.itertuples(index=False))
df['closest_coord'] = df.apply(lambda row: find_closest_coord(row, df), axis=1)
df
index identity x y closest_coord
0 0 1 1.660468 -1.954339 1.258370
1 1 1 0.153419 0.619493 0.596869
2 2 1 -0.592440 -0.299046 0.138273
0 0 2 -0.431543 0.500864 0.596869
1 1 2 -0.607647 -0.436480 0.138273
2 2 2 1.613126 -0.696860 1.258370
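If the cross join outgrows one machine, the same pipeline translates to Dask with only small changes. A rough sketch, assuming the same dataframes as above (npartitions is arbitrary here, and the distance is spelled out arithmetically because np.linalg.norm doesn't operate on Dask dataframes directly):

import dask.dataframe as dd

left = dd.from_pandas(df_identity_1.assign(key=1), npartitions=4)
right = dd.from_pandas(df_identity_2.assign(key=1), npartitions=4)

# Same key-column cross join, but evaluated lazily across partitions
crossprod = left.merge(right, on='key').drop(columns='key')
crossprod['dist'] = ((crossprod['x_x'] - crossprod['x_y']) ** 2
                     + (crossprod['y_x'] - crossprod['y_y']) ** 2) ** 0.5
closest_per_identity1 = crossprod.groupby('index_x')['dist'].min().compute()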
Either way, the takeaway is the same: lean on numpy's columnar operations instead of making pure Python code do the work row by row.