0

I'm guessing there is a more efficient way to make the comparisons that I am making.

Currently, I have two pandas DataFrames.

DataFrame A looks like this:

    Location    Tier    Other
0   100         1       'Blah'
1   200         1       'Blah'
2   10          1       'Blah'
3   30          1       'Blah'
4   500         1       'Blah'

DataFrame B looks like this:

    Start   Stop    Tier    Other
0   400     600     1       'Blah'
1   5       20      2       'Blah'

I would like to find all rows whose Location > Start and Location < End and Tier matches. So, in the above example, row 4 from DataFrame A has a Location that is greater than 400 but less than 600 and the Tier is ` in both DataFrames, so it should get returned somehow, like appending to a final DataFrame.

This is how I am making the comparisons now:

for i in A():
    matching = matching.append(B[(B.Tier == i.Tier) & (B.Start < i.Location) & (B.Stop > i.Location)], ignore_index=True)
return matching

Is there a faster way of accomplishing this, as my code runs quite slowly?

1
  • Are we assuming that each Tier only have one row in B? If that's the case, I think it's better to store B as a dictionary with Tier as keys. If not, please define the behavior when there are multiple rows with same Tier in B. Commented Oct 31, 2018 at 0:58

3 Answers 3

1

Using numpy broadcast

s1=df2.Start.values<df.Location.values[:,None]
s2=df2.Stop.values>df.Location.values[:,None]
s1&s2
Out[110]: 
array([[False, False],
       [False, False],
       [False,  True],
       [False, False],
       [ True, False]])
df[(s1&s2).any(1)]
Out[111]: 
   Location  Tier   Other
2        10     1  'Blah'
4       500     1  'Blah'
Sign up to request clarification or add additional context in comments.

Comments

1

The first thing that comes to my mind is sorting your data sets (for example, sort by Location and then by Tier for set A). Then you can use a binary search algorithm to vastly improve your search time.

Comments

1

You can define Start and Stop when there are multiple rows with the same Tier in B.

def matching(row):
    # use the first one
    cur_row = B[B.Tier == row.Tier].iloc[0]
    Start = cur_row.Start
    Stop = cur_row.Stop
    return row.Location > Start and row.Location < End

A[A.apply(matching, axis=1)]

Other example:

def matching(row):
    # other example
    cur_rows = B[B.Tier == row.Tier]
    Start = cur_rows.Start.min()
    Stop = cur_rows.Stop.max()
    return row.Location > Start and row.Location < End

A[A.apply(matching, axis=1)]

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.