
I have two data frames. One dataframe (A) looks like:

Name     begin    stop    ID      
Peter     30       150    1      
Hugo     4500     6000    2      
Jennie    300      700    3   

The other dataframe (B) looks like

entry     string      
89         aa      
568        bb     
938437     cc

I want to accomplish two tasks here:

  1. I want to get a list of the row indices of dataframe B whose entry value falls within any interval (given by the begin and stop columns) of dataframe A. For the data above, the result is:
lst = [0, 1]  # row 0 of B (entry 89) falls in Peter's interval (ID 1), and row 1 of B (entry 568) falls in Jennie's interval (ID 3)

  2. I then want to remove the rows with those indices from dataframe B to create a new dataframe, which will look like:
entry     string          
938437     cc

How can I accomplish these two tasks?

  • how big are your two dataframes? Commented Apr 6, 2021 at 13:09
  • Dataframe B is of shape (14011296, 132) and dataframe A is of shape (63275, 12) Commented Apr 6, 2021 at 13:10
  • in dataframe A, do you have overlaps in the intervals? Commented Apr 6, 2021 at 13:15
  • No there are no overlaps. Commented Apr 6, 2021 at 13:16

2 Answers

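You can use pd.merge_asof, which matches each entry to the nearest begin at or below it. A row of B lies inside an interval when the stop of that matched interval is at or above the entry: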

import pandas as pd

l = (pd.merge_asof(dfB['entry'].reset_index() # keep the original index after the merge
                      .sort_values('entry'), # merge_asof requires sorted keys
                   dfA[['begin','stop']].sort_values('begin'),
                   left_on='entry', right_on='begin',
                   direction='backward') # match the last begin <= entry
       .query('stop >= entry') # keep only rows where entry <= stop
       ['index'].tolist()
    )
print(l)
# [0, 1]

new_df = dfB.loc[dfB.index.difference(l)]
print(new_df)
#     entry string
# 2  938437     cc
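
Since the comments state that the intervals in A do not overlap, the same lookup can also be sketched with numpy.searchsorted, which avoids building a merged frame for the 14-million-row B. This is only a sketch under that no-overlap assumption, with both endpoints treated as inclusive to match the query above; the names begins, stops, pos and inside are illustrative and not part of the original answer.

```python
import numpy as np
import pandas as pd

# Sample data from the question.
dfA = pd.DataFrame({"Name": ["Peter", "Hugo", "Jennie"],
                    "begin": [30, 4500, 300],
                    "stop": [150, 6000, 700],
                    "ID": [1, 2, 3]})
dfB = pd.DataFrame({"entry": [89, 568, 938437],
                    "string": ["aa", "bb", "cc"]})

# Sort the intervals by begin (assumes they do not overlap).
a_sorted = dfA.sort_values("begin")
begins = a_sorted["begin"].to_numpy()
stops = a_sorted["stop"].to_numpy()
entries = dfB["entry"].to_numpy()

# Position of the last interval whose begin <= entry (-1 if there is none).
pos = np.searchsorted(begins, entries, side="right") - 1

# An entry is inside when a candidate interval exists and entry <= its stop.
inside = (pos >= 0) & (entries <= stops[np.clip(pos, 0, None)])

lst = dfB.index[inside].tolist()   # task 1: indices of rows inside an interval
new_df = dfB.loc[~inside]          # task 2: rows outside every interval
print(lst)
print(new_df)
```

The boolean mask inside serves both tasks, so the list of indices never has to be turned back into a filter.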

If you don't need the list of indices and your real goal is new_df, you can build it directly:

new_df = (pd.merge_asof(dfB.sort_values('entry'), 
                        dfA[['begin','stop']].sort_values('begin'),
                        left_on='entry', right_on='begin',
                        direction='backward')
            # inverted condition; written as a negation so that rows with no
            # matching begin (NaN stop) are kept as well
            .query('~(stop >= entry)')
            .drop(['begin','stop'], axis=1) # clean the result
            .reset_index(drop=True)
         )
print(new_df)
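
One edge case worth checking with this approach: with direction='backward', merge_asof leaves begin and stop as NaN for any entry smaller than every begin, and a comparison against NaN is always False. The sketch below adds a hypothetical row (entry 5, string 'zz', not in the original data) to show how a plain 'stop < entry' filter and a negated 'stop >= entry' filter differ on such a row.

```python
import pandas as pd

dfA = pd.DataFrame({"Name": ["Peter", "Hugo", "Jennie"],
                    "begin": [30, 4500, 300],
                    "stop": [150, 6000, 700],
                    "ID": [1, 2, 3]})
# The first row is hypothetical: its entry lies below every begin in dfA.
dfB = pd.DataFrame({"entry": [5, 89, 568, 938437],
                    "string": ["zz", "aa", "bb", "cc"]})

merged = pd.merge_asof(dfB.sort_values("entry"),
                       dfA[["begin", "stop"]].sort_values("begin"),
                       left_on="entry", right_on="begin",
                       direction="backward")

# NaN < entry is False, so the unmatched row (entry 5) is dropped here.
strict = merged.query("stop < entry")

# Negating the "inside" test keeps unmatched rows, because NaN >= entry is False.
safe = merged[~(merged["stop"] >= merged["entry"])]

print(strict["entry"].tolist())
print(safe["entry"].tolist())
```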

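Use the between() and tolist() methods to get the list of indices. Note that this checks a single range, from the first begin to the last stop in A, so it matches the sample data but does not test each interval separately: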

lst=B[B['entry'].between(A.loc[0,'begin'],A.loc[len(A)-1,'stop'])].index.tolist()

Finally, use the isin() method with a negated boolean mask to drop those rows:

result=B[~B.index.isin(lst)]
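
If each interval in A should be tested on its own, one possible sketch uses pd.IntervalIndex. It assumes the intervals do not overlap (as stated in the comments on the question) and that both endpoints are inclusive; the names intervals and hit are illustrative.

```python
import pandas as pd

A = pd.DataFrame({"Name": ["Peter", "Hugo", "Jennie"],
                  "begin": [30, 4500, 300],
                  "stop": [150, 6000, 700],
                  "ID": [1, 2, 3]})
B = pd.DataFrame({"entry": [89, 568, 938437],
                  "string": ["aa", "bb", "cc"]})

# One closed interval [begin, stop] per row of A.
intervals = pd.IntervalIndex.from_arrays(A["begin"], A["stop"], closed="both")

# get_indexer gives the position of the containing interval, or -1 if none.
hit = intervals.get_indexer(B["entry"]) != -1

lst = B.index[hit].tolist()
result = B[~hit]
print(lst)
print(result)
```

get_indexer raises an error when the intervals overlap, so this only applies while the no-overlap assumption holds.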
