Improve efficiency of looping through pandas dataframe

Question

I'm guessing there is a more efficient way to make the comparisons that I am making.

Currently, I have two pandas DataFrames.

DataFrame A looks like this:

    Location    Tier    Other
0   100         1       'Blah'
1   200         1       'Blah'
2   10          1       'Blah'
3   30          1       'Blah'
4   500         1       'Blah'

DataFrame B looks like this:

    Start   Stop    Tier    Other
0   400     600     1       'Blah'
1   5       20      2       'Blah'

I would like to find all rows whose Location > Start and Location < Stop and whose Tier matches. So, in the above example, row 4 from DataFrame A has a Location that is greater than 400 but less than 600, and the Tier is 1 in both DataFrames, so it should be returned somehow, for example by appending it to a final DataFrame.

This is how I am making the comparisons now:

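# (excerpt from inside a function; matching is assumed to start as an empty DataFrame)
for i in A.itertuples():
    # note: DataFrame.append was removed in pandas 2.0; pd.concat replaces it there
    matching = matching.append(B[(B.Tier == i.Tier) & (B.Start < i.Location) & (B.Stop > i.Location)], ignore_index=True)
return matching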

Is there a faster way of accomplishing this, as my code runs quite slowly?

Are we assuming that each Tier has only one row in B? If that's the case, I think it's better to store B as a dictionary with Tier as keys. If not, please define the behavior when there are multiple rows with the same Tier in B.
– Pinyi Wang, Commented Oct 31, 2018 at 0:58

BENY · Accepted Answer · 2018-10-31 01:02:14Z

1

Using NumPy broadcasting (here df is DataFrame A and df2 is DataFrame B):

s1=df2.Start.values<df.Location.values[:,None]
s2=df2.Stop.values>df.Location.values[:,None]
s1&s2
Out[110]: 
array([[False, False],
       [False, False],
       [False,  True],
       [False, False],
       [ True, False]])
df[(s1&s2).any(1)]
Out[111]: 
   Location  Tier   Other
2        10     1  'Blah'
4       500     1  'Blah'
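
The masks above compare only Location against Start and Stop, which is why row 2 (Location 10, Tier 1) is returned even though the interval it falls in belongs to Tier 2. A minimal sketch of adding the Tier condition to the same broadcasting idea follows, using the sample data from the question. The names in_range, same_tier and matches are illustrative and not part of the original answer.

```python
import numpy as np
import pandas as pd

# Sample data copied from the question (df is DataFrame A, df2 is DataFrame B).
df = pd.DataFrame({
    "Location": [100, 200, 10, 30, 500],
    "Tier": [1, 1, 1, 1, 1],
    "Other": ["Blah"] * 5,
})
df2 = pd.DataFrame({
    "Start": [400, 5],
    "Stop": [600, 20],
    "Tier": [1, 2],
    "Other": ["Blah"] * 2,
})

# Column vectors of A values broadcast against row vectors of B values,
# giving boolean masks of shape (len(df), len(df2)).
loc = df["Location"].to_numpy()[:, None]
tier = df["Tier"].to_numpy()[:, None]

in_range = (df2["Start"].to_numpy() < loc) & (loc < df2["Stop"].to_numpy())
same_tier = df2["Tier"].to_numpy() == tier

# Keep rows of A that lie strictly inside at least one B interval of the same Tier.
matches = df[(in_range & same_tier).any(axis=1)]
print(matches)
```

The intermediate masks hold len(df) × len(df2) booleans, so this approach trades memory for speed and may need chunking when both frames are large.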


Schematic · Accepted Answer · 2018-10-31 01:00:33Z

1

The first thing that comes to my mind is sorting your data sets (for example, sort by Location and then by Tier for set A). Then you can use a binary search algorithm to vastly improve your search time.
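
One way the sort-then-binary-search idea could look in practice is sketched below. It assumes NumPy's searchsorted as the binary search and handles each Tier separately so the Tier match holds by construction. The sample data is from the question, and the names keep, sorted_locs, lo and hi are illustrative.

```python
import numpy as np
import pandas as pd

# Sample data copied from the question.
A = pd.DataFrame({
    "Location": [100, 200, 10, 30, 500],
    "Tier": [1, 1, 1, 1, 1],
    "Other": ["Blah"] * 5,
})
B = pd.DataFrame({
    "Start": [400, 5],
    "Stop": [600, 20],
    "Tier": [1, 2],
    "Other": ["Blah"] * 2,
})

keep = np.zeros(len(A), dtype=bool)
all_locs = A["Location"].to_numpy()

# GroupBy.indices maps each Tier to the integer positions of its rows in A.
for tier, a_pos in A.groupby("Tier").indices.items():
    b_rows = B[B["Tier"] == tier]
    if b_rows.empty:
        continue
    # Sort this Tier's Locations once, remembering the original positions.
    locs = all_locs[a_pos]
    order = np.argsort(locs, kind="stable")
    sorted_locs = locs[order]
    # Binary search per interval: side="right"/"left" keep both bounds strict.
    lo = np.searchsorted(sorted_locs, b_rows["Start"].to_numpy(), side="right")
    hi = np.searchsorted(sorted_locs, b_rows["Stop"].to_numpy(), side="left")
    for l, h in zip(lo, hi):
        keep[a_pos[order[l:h]]] = True

matching = A[keep]
print(matching)
```

Each interval lookup costs O(log n) after the one-time sort, which is where the saving over a row-by-row scan would come from.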


Pinyi Wang · Accepted Answer · 2018-10-31 01:08:27Z

1

You can decide how Start and Stop are defined when there are multiple rows with the same Tier in B. For example, use the first matching row:

def matching(row):
    # use the first row of B with the same Tier
    # (raises IndexError if B has no row with this Tier)
    cur_row = B[B.Tier == row.Tier].iloc[0]
    Start = cur_row.Start
    Stop = cur_row.Stop
    return row.Location > Start and row.Location < Stop

A[A.apply(matching, axis=1)]

Another example:

def matching(row):
    # use the widest range over all rows of B with the same Tier
    cur_rows = B[B.Tier == row.Tier]
    Start = cur_rows.Start.min()
    Stop = cur_rows.Stop.max()
    return row.Location > Start and row.Location < Stop

A[A.apply(matching, axis=1)]
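
The second rule (smallest Start and largest Stop per Tier) can also be computed once up front instead of re-filtering B for every row of A. The following is a sketch under that assumption, using the sample data from the question. The names bounds, start, stop and result are illustrative.

```python
import pandas as pd

# Sample data copied from the question.
A = pd.DataFrame({
    "Location": [100, 200, 10, 30, 500],
    "Tier": [1, 1, 1, 1, 1],
    "Other": ["Blah"] * 5,
})
B = pd.DataFrame({
    "Start": [400, 5],
    "Stop": [600, 20],
    "Tier": [1, 2],
    "Other": ["Blah"] * 2,
})

# One pass over B: smallest Start and largest Stop for each Tier.
bounds = B.groupby("Tier").agg(Start=("Start", "min"), Stop=("Stop", "max"))

# Look up the bounds for every row of A by Tier; Tiers absent from B give NaN.
start = A["Tier"].map(bounds["Start"])
stop = A["Tier"].map(bounds["Stop"])

# Comparisons against NaN are False, so rows with an unknown Tier drop out.
result = A[(A["Location"] > start) & (A["Location"] < stop)]
print(result)
```

As with the apply version, the min/max envelope treats everything between the lowest Start and highest Stop of a Tier as a match, so it can over-select when a Tier has several disjoint intervals.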
