I have two data frames that look like the following:

df_A:

ID    x     y
a     0     0
c     3     2
b     2     5

df_B:

ID    x     y
a     2     1
c     3     5
b     1     2

I want to add a column to df_B that holds the Euclidean distance between the x,y coordinates in df_B and those in df_A for each identifier. The desired result would be:

ID    x     y    dist
a     2     1    2.236
c     3     5    3
b     1     2    3.162

The identifiers are not necessarily going to be in the same order. I know how to do this by looping through the rows of df_A and finding the matching ID in df_B, but I was hoping to avoid using a for loop since this will be used on data with tens of millions of rows. Is there some way to use apply but condition it on matching IDs?

1
  • Did either of the posted solutions work for you? Commented Jan 14, 2017 at 10:52

3 Answers

If ID isn't the index, make it so.

df_B.set_index('ID', inplace=True)
df_A.set_index('ID', inplace=True)

df_B['dist'] = ((df_A - df_B) ** 2).sum(1) ** .5

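Because pandas aligns on index and column labels during both the arithmetic and the assignment, rows are matched by ID regardless of their order, so simply doing the math should just work.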
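To illustrate the label alignment, here is a minimal sketch using the question's sample data, with the rows of df_B deliberately shuffled. It assumes both frames contain exactly the same set of IDs; the intermediate name `diff` is only for illustration.

```python
import pandas as pd

# Sample frames from the question; df_B's rows are deliberately out of order.
df_A = pd.DataFrame(
    {"ID": ["a", "c", "b"], "x": [0, 3, 2], "y": [0, 2, 5]}
).set_index("ID")
df_B = pd.DataFrame(
    {"ID": ["c", "b", "a"], "x": [3, 1, 2], "y": [5, 2, 1]}
).set_index("ID")

# Subtraction matches rows by index label and columns by name, not by position.
diff = df_A - df_B

# Row-wise Euclidean norm; assigning back to df_B aligns on the index again.
df_B["dist"] = (diff ** 2).sum(axis=1) ** 0.5
print(df_B)
```

If an ID is missing from either frame, the differences for that ID are NaN, and `sum` skips NaN by default, so it may be worth checking that both indexes hold the same IDs before relying on the result.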

1 Comment

Nice solution !

Solution which uses sklearn.metrics.pairwise.paired_distances method:

In [73]: A
Out[73]:
    x  y
ID
a   0  0
c   3  2
b   2  5

In [74]: B
Out[74]:
    x  y
ID
a   2  1
c   3  5
b   1  2

In [75]: from sklearn.metrics.pairwise import paired_distances

In [76]: B['dist'] = paired_distances(B, A)

In [77]: B
Out[77]:
    x  y      dist
ID
a   2  1  2.236068
c   3  5  3.000000
b   1  2  3.162278
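Note that paired_distances pairs rows by position rather than by label, so the result above is only correct because A and B list the IDs in the same order. A minimal sketch of one way to align the frames first, assuming both are indexed by ID and every ID in B also exists in A; it uses np.linalg.norm for the row-wise distance to keep the example dependency-light, and the name `A_aligned` is hypothetical:

```python
import numpy as np
import pandas as pd

A = pd.DataFrame(
    {"x": [0, 3, 2], "y": [0, 2, 5]},
    index=pd.Index(["a", "c", "b"], name="ID"),
)
# B lists the same IDs in a different order.
B = pd.DataFrame(
    {"x": [3, 2, 1], "y": [5, 1, 2]},
    index=pd.Index(["c", "a", "b"], name="ID"),
)

# Reorder A's rows so that they follow B's ID order.
A_aligned = A.reindex(B.index)

# Row-wise Euclidean distance between positionally matched rows.
delta = B[["x", "y"]].to_numpy() - A_aligned[["x", "y"]].to_numpy()
B["dist"] = np.linalg.norm(delta, axis=1)
print(B)
```

Once aligned this way, `A_aligned` could be passed to paired_distances in place of A.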

For performance, you might want to work with NumPy arrays. For Euclidean distance computations between corresponding rows, np.einsum does the job pretty efficiently.

Here's an implementation that also reorders the rows of df_A so that they line up with df_B -

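import numpy as np

# Get the row order that would sort dataframe-A's index
sidx = df_A.index.argsort()
# Look up each ID of dataframe-B in that sorted order and map it back
# to the matching row position in dataframe-A
idx = sidx[df_A.index.searchsorted(df_B.index,sorter=sidx)]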

# Sort A rows accordingly and get the elementwise differences against B
s = df_A.values[idx] - df_B.values

# Use einsum to square and sum each row and finally sqrt for distances
df_B['dist'] = np.sqrt(np.einsum('ij,ij->i',s,s))

Sample input and output -

In [121]: df_A
Out[121]: 
   0  1
a  0  0
c  3  2
b  2  5

In [122]: df_B
Out[122]: 
   0  1
c  3  5
a  2  1
b  1  2

In [124]: df_B  # After code run
Out[124]: 
   0  1      dist
c  3  5  3.000000
a  2  1  2.236068
b  1  2  3.162278
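The alignment and einsum steps can be traced on plain NumPy arrays. This is only a sketch under the assumption that every ID in B also occurs in A (otherwise searchsorted returns an insertion point rather than a match); the array names `ids_A`, `ids_B`, `xy_A` and `xy_B` are hypothetical stand-ins for the index and values of the two frames.

```python
import numpy as np

ids_A = np.array(["a", "c", "b"])
ids_B = np.array(["c", "a", "b"])
xy_A = np.array([[0, 0], [3, 2], [2, 5]], dtype=float)
xy_B = np.array([[3, 5], [2, 1], [1, 2]], dtype=float)

# Order that would sort A's IDs: a, b, c -> positions [0, 2, 1]
sidx = np.argsort(ids_A)

# Position of each B ID within the sorted A IDs, mapped back to A's rows
idx = sidx[np.searchsorted(ids_A, ids_B, sorter=sidx)]

# Differences between matched rows
s = xy_A[idx] - xy_B

# 'ij,ij->i' multiplies s by itself elementwise and sums over the columns
sq = np.einsum("ij,ij->i", s, s)
dist = np.sqrt(sq)
print(dist)
```

The einsum call computes the same row-wise sum of squares as `(s ** 2).sum(axis=1)`, but without materializing the squared array as a separate intermediate.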

Here's a runtime test comparing einsum against a few other counterparts.
