Extracting rows from a pandas dataframe based on values in another dataframe

Question

I have two data frames. One dataframe (A) looks like:

Name     begin    stop    ID      
Peter     30       150    1      
Hugo     4500     6000    2      
Jennie    300      700    3

The other dataframe (B) looks like

entry     string      
89         aa      
568        bb     
938437     cc

I want to accomplish two tasks here:

I want to get a list of indices for rows (from dataframe B) for which entry column falls in the interval (specified by begin and stop column) in dataframe A. The result for this task will be:

lst = [0,1]. ### because row 0 of B falls in interval of row 1 in A and row 1 of B falls in interval of row 3 of A.

The indices that I get from task 1, I want to remove it from dataframe B to create a new dataframe. Thus, the new dataframe will look like:

entry     string          
938437     cc

How can I accomplish these two tasks?

Dataframe B is of shape (14011296, 132) and dataframe A is of shape (63275, 12) — John
– John, Commented Apr 6, 2021 at 13:10

Ben.T · Accepted Answer · 2021-04-06 13:53:52Z

You can use merge_asof

l = (pd.merge_asof(dfB['entry'].reset_index() #to keep original index after merge
                      .sort_values('entry'), #mandatory to use this merge_asof
                   dfA[['begin','stop']].sort_values('begin'),
                   left_on='entry', right_on='begin',
                   direction='backward') # begin lower than entry
       .query('stop >= entry') # keep only where entry lower than stop
       ['index'].tolist()
    )
print(l)
# Int64Index([0, 1], dtype='int64')

new_df = dfB.loc[dfB.index.difference(l)]
print(new_df)
#     entry string
# 2  938437     cc

Now if you don't need the list onf index and that you real goal is the new_df, then you can do directly

new_df = (pd.merge_asof(dfB.sort_values('entry'), 
                        dfA[['begin','stop']].sort_values('begin'),
                        left_on='entry', right_on='begin',
                        direction='backward')
            .query('stop < entry') #here different inegality
            .drop(['begin','stop'], axis=1) #clean the result
            .reset_index(drop=True)
         )
print(new_df)

Anurag Dabas · Accepted Answer · 2021-04-06 14:07:27Z

1

Make use of between() method and tolist() method to get list of indexes:

lst=B[B['entry'].between(A.loc[0,'begin'],A.loc[len(A)-1,'stop'])].index.tolist()

Finally make use of isin() method and boolean masking:

result=B[~B.index.isin(lst)]

edited Apr 6, 2021 at 14:07

answered Apr 6, 2021 at 13:49

Anurag Dabas

24.3k9 gold badges25 silver badges41 bronze badges

Collectives™ on Stack Overflow

Extracting rows from a pandas dataframe based on values in another dataframe

2 Answers 2

Comments

Comments

Your Answer

Hot Network Questions

Collectives™ on Stack Overflow

2 Answers 2

Comments

Comments

Your Answer

Sign up or log in

Post as a guest

Related