How to find overlapping rows between two dataframes based on start and end columns?

Question

I have two pandas dataframes df1 and df2 of the form:

df1

start   end   text   source
1       5     abc     1  
8       10    def     1
15      20    ghi     1
25      30    xxx     1
42      45    zzz     1

df2

start   end   text   source
1       6     jkl     2  
7       9     mno     2
11      13    pqr     2
16      17    stu     2
18      19    vwx     2
32      37    yyy     2
40      47    rrr     2

I want to return the intersections of the two dataframes, based on the start and end columns, in the following format:

out_df

start_1   end_1   start_2   end_2  text_1   text_2
1         5       1         6      abc      jkl        
8         10      7         9      def      mno
15        20      16        17     ghi      stu 
15        20      18        19     ghi      vwx
42        45      40        47     zzz      rrr

What is the best method to achieve this?

I would create lists, explode them, merge, then drop duplicates, if you have the memory to do it.
– ifly6, Commented Aug 4, 2022 at 13:24
@RVA92 I think that is similar, but still too different from my use case.
– Melsauce, Commented Aug 4, 2022 at 13:48
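
The approach from ifly6's comment could look roughly like the sketch below. It is only an illustration, not code from the commenter: it assumes integer endpoints and closed intervals, rebuilds the sample frames without the source column, and uses a hypothetical helper named explode_points. Memory use grows with the total length of all intervals.

```python
import pandas as pd

# Sample data from the question (source column omitted for brevity).
df1 = pd.DataFrame({"start": [1, 8, 15, 25, 42],
                    "end": [5, 10, 20, 30, 45],
                    "text": ["abc", "def", "ghi", "xxx", "zzz"]})
df2 = pd.DataFrame({"start": [1, 7, 11, 16, 18, 32, 40],
                    "end": [6, 9, 13, 17, 19, 37, 47],
                    "text": ["jkl", "mno", "pqr", "stu", "vwx", "yyy", "rrr"]})

def explode_points(df, suffix):
    """Hypothetical helper: one row per integer covered by each interval."""
    out = df.add_suffix(suffix)
    out["point"] = [list(range(s, e + 1)) for s, e in zip(df["start"], df["end"])]
    return out.explode("point")

# Two intervals overlap exactly when they share at least one integer point.
merged = explode_points(df1, "_1").merge(explode_points(df2, "_2"), on="point")

# Each overlapping pair appears once per shared point, so deduplicate.
out_df = merged.drop(columns="point").drop_duplicates().reset_index(drop=True)
print(out_df)
```

For the sample data this yields the five pairs listed in the question, though the column order differs from the requested layout.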

sammywemmy · Accepted Answer · 2022-08-04 13:53:07Z

One option is with conditional_join from pyjanitor:

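# pip install pyjanitor
import pandas as pd
import janitor  # importing janitor registers conditional_join on DataFrame

# keep pairs where df1.start <= df2.end and df1.end >= df2.start
df1.conditional_join(
    df2,
    ('start', 'end', '<='),
    ('end', 'start', '>='))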

   left                 right
  start end text source start end text source
0     1   5  abc      1     1   6  jkl      2
1     8  10  def      1     7   9  mno      2
2    15  20  ghi      1    16  17  stu      2
3    15  20  ghi      1    18  19  vwx      2
4    42  45  zzz      1    40  47  rrr      2

In the dev version, you can rename the columns, and avoid the MultiIndex (the MultiIndex occurs because the column names are not unique):

# pip install git+https://github.com/pyjanitor-devs/pyjanitor.git

df1.conditional_join(
        df2, 
        ('start', 'end', '<='), 
        ('end', 'start', '>='), 
        df_columns = {'start':'start_1', 
                      'end':'end_1', 
                      'text':'text_1'},
        right_columns = {'start':'start_2', 
                         'end':'end_2', 
                         'text':'text_2'})

   start_1  end_1 text_1  start_2  end_2 text_2
0        1      5    abc        1      6    jkl
1        8     10    def        7      9    mno
2       15     20    ghi       16     17    stu
3       15     20    ghi       18     19    vwx
4       42     45    zzz       40     47    rrr

The idea for overlaps is that the start of interval one should be less than or equal to the end of interval two, while the end of interval one should be greater than or equal to the start of interval two; that way overlap is assured. I pulled that idea from pd.Interval.overlaps.
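
For readers without pyjanitor, the same overlap condition can be applied in plain pandas with a cross join followed by a filter. This is a minimal sketch rather than part of the original answer: it assumes pandas 1.2 or later (for how="cross"), rebuilds the sample frames from the question, and builds all len(df1) * len(df2) pairs in memory, so it only suits modest sizes.

```python
import pandas as pd

# Sample data from the question.
df1 = pd.DataFrame({"start": [1, 8, 15, 25, 42],
                    "end": [5, 10, 20, 30, 45],
                    "text": ["abc", "def", "ghi", "xxx", "zzz"],
                    "source": 1})
df2 = pd.DataFrame({"start": [1, 7, 11, 16, 18, 32, 40],
                    "end": [6, 9, 13, 17, 19, 37, 47],
                    "text": ["jkl", "mno", "pqr", "stu", "vwx", "yyy", "rrr"],
                    "source": 2})

# Every row of df1 paired with every row of df2, with suffixed column names.
pairs = (df1.drop(columns="source").add_suffix("_1")
            .merge(df2.drop(columns="source").add_suffix("_2"), how="cross"))

# Overlap: interval one starts no later than interval two ends,
# and interval one ends no earlier than interval two starts.
overlaps = (pairs["start_1"] <= pairs["end_2"]) & (pairs["end_1"] >= pairs["start_2"])

out_df = pairs.loc[overlaps].reset_index(drop=True)
print(out_df)
```

The filter mirrors the two tuples passed to conditional_join, so the result should contain the same five pairs, here with columns ordered start_1, end_1, text_1, start_2, end_2, text_2.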

Another option is with the piso library; the answer here might point you in the right direction

edited Aug 4, 2022 at 13:53

answered Aug 4, 2022 at 13:47

sammywemmy


11 Comments

Melsauce Over a year ago

Thanks. Is there an alternative method in vanilla/pandas?

Melsauce Over a year ago

I'm running into errors (updating pandas does not seem to help): pyjanitor ImportError: cannot import name 'apply_if_callable' from 'pandas.core.common'

sammywemmy Over a year ago

Still having the errors? Is the error when you run the code? At what point does the error occur?

Melsauce Over a year ago

It turns out it's in the import statement itself: import pyjanitor

sammywemmy Over a year ago

@abokey, column selection is in the dev version. Packages usually have a dev version and a release version. The dev version has the latest changes but has not been released yet. You have to install it with this: pip install git+https://github.com/pyjanitor-devs/pyjanitor.git. Uninstall the current version before installing. The dev version also supports numba for more performance.
