Matching IDs Between Pandas DataFrames and Applying Function

Question

I have two data frames that look like the following:

df_A:

ID    x     y
a     0     0
c     3     2
b     2     5

df_B:

ID    x     y
a     2     1
c     3     5
b     1     2

I want to add a column to df_B that holds the Euclidean distance between the x, y coordinates in df_B and those in df_A for each identifier. The desired result would be:

ID    x     y    dist
a     2     1    2.236
c     3     5    3
b     1     2    3.162

The identifiers are not necessarily going to be in the same order. I know how to do this by looping through the rows of df_A and finding the matching ID in df_B, but I was hoping to avoid using a for loop since this will be used on data with tens of millions of rows. Is there some way to use apply but condition it on matching IDs?

Did either of the posted solutions work for you?

– Divakar, Commented Jan 14, 2017 at 10:52

piRSquared · Accepted Answer · 2016-12-24 23:52:01Z

4

If ID isn't the index, make it so.

df_B.set_index('ID', inplace=True)
df_A.set_index('ID', inplace=True)

df_B['dist'] = ((df_A - df_B) ** 2).sum(1) ** .5

Pandas aligns on index and column labels when doing arithmetic, so once ID is the index, simply doing the math should just work, even if the rows are in a different order.
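To illustrate that the subtraction pairs rows by ID rather than by position, here is a minimal sketch using the question's data with df_B deliberately shuffled. It assumes IDs are unique in both frames; the frames are rebuilt here only for illustration.

```python
import pandas as pd

df_A = pd.DataFrame({"ID": ["a", "c", "b"], "x": [0, 3, 2], "y": [0, 2, 5]}).set_index("ID")
# df_B is deliberately in a different row order than df_A
df_B = pd.DataFrame({"ID": ["c", "a", "b"], "x": [3, 2, 1], "y": [5, 1, 2]}).set_index("ID")

# Arithmetic aligns on index and column labels, so rows pair up by ID
dist = ((df_A - df_B) ** 2).sum(axis=1) ** 0.5

# Column assignment also aligns on the index, so each distance lands on its own ID
df_B["dist"] = dist
print(df_B)
```

The row order of df_B is left untouched; only the new column is added.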

edited Dec 24, 2016 at 23:52

answered Dec 24, 2016 at 22:48

1 Comment

MaxU - stand with Ukraine Over a year ago

Nice solution !

MaxU - stand with Ukraine · Accepted Answer · 2016-12-24 23:13:54Z

3

A solution that uses the sklearn.metrics.pairwise.paired_distances function:

In [73]: A
Out[73]:
    x  y
ID
a   0  0
c   3  2
b   2  5

In [74]: B
Out[74]:
    x  y
ID
a   2  1
c   3  5
b   1  2

In [75]: from sklearn.metrics.pairwise import paired_distances

In [76]: B['dist'] = paired_distances(B, A)

In [77]: B
Out[77]:
    x  y      dist
ID
a   2  1  2.236068
c   3  5  3.000000
b   1  2  3.162278
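Note that paired_distances pairs rows by position, not by label, so the result above relies on A and B listing the IDs in the same order. A minimal sketch of one way to guard against mismatched order is to reorder A with reindex first, assuming IDs are unique and every ID in B also exists in A. The distance here is computed with np.linalg.norm only to keep the example free of the scikit-learn dependency; the reordered frame could be passed to paired_distances in the same way.

```python
import numpy as np
import pandas as pd

A = pd.DataFrame({"x": [0, 3, 2], "y": [0, 2, 5]},
                 index=pd.Index(["a", "c", "b"], name="ID"))
# B lists the same IDs in a different order
B = pd.DataFrame({"x": [3, 2, 1], "y": [5, 1, 2]},
                 index=pd.Index(["c", "a", "b"], name="ID"))

# Reorder A's rows to follow B's index; IDs missing from A would become NaN rows
A_aligned = A.reindex(B.index)

# Row-wise Euclidean norm of the coordinate differences
diff = B[["x", "y"]].to_numpy() - A_aligned[["x", "y"]].to_numpy()
B["dist"] = np.linalg.norm(diff, axis=1)
print(B)
```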

answered Dec 24, 2016 at 23:13

Community · Accepted Answer · 2017-05-23 12:01:52Z

For performance, you might want to work with NumPy arrays. For Euclidean distance computations between corresponding rows, np.einsum does the job pretty efficiently.

Here's an implementation that also reorders the rows of df_A so they line up with df_B:

import numpy as np

# Get sorted row indices for dataframe-A
sidx = df_A.index.argsort()
idx = sidx[df_A.index.searchsorted(df_B.index, sorter=sidx)]

# Reorder A rows to match B and get the elementwise differences against B
s = df_A.values[idx] - df_B.values

# Use einsum to square and sum each row and finally sqrt for distances
df_B['dist'] = np.sqrt(np.einsum('ij,ij->i', s, s))

Sample input and output:

In [121]: df_A
Out[121]: 
   0  1
a  0  0
c  3  2
b  2  5

In [122]: df_B
Out[122]: 
   0  1
c  3  5
a  2  1
b  1  2

In [124]: df_B  # After code run
Out[124]: 
   0  1      dist
c  3  5  3.000000
a  2  1  2.236068
b  1  2  3.162278
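The argsort/searchsorted step finds, for each row of df_B, the position of the matching ID in df_A. As a hedged alternative (not part of the original answer), pandas' Index.get_indexer returns the same positions and marks IDs that are missing from df_A with -1, which makes a sanity check easy. The sketch below assumes unique IDs in df_A and frames that hold only the numeric coordinate columns.

```python
import numpy as np
import pandas as pd

df_A = pd.DataFrame([[0, 0], [3, 2], [2, 5]], index=["a", "c", "b"])
df_B = pd.DataFrame([[3, 5], [2, 1], [1, 2]], index=["c", "a", "b"])

# Position in df_A of each df_B label; -1 means the label is absent from df_A
idx = df_A.index.get_indexer(df_B.index)
if (idx == -1).any():
    raise KeyError("some IDs in df_B are missing from df_A")

# Same computation as in the answer: row-wise differences, then einsum
# to sum the squared differences per row, then the square root
s = df_A.to_numpy()[idx] - df_B.to_numpy()
df_B["dist"] = np.sqrt(np.einsum("ij,ij->i", s, s))
print(df_B)
```

Whether this is faster or slower than the searchsorted version on very large frames has not been measured here.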

Here's a runtime test comparing einsum against a few other counterparts.
