1

I have two pandas dataframes df1 and df2 of the form:

df1

start   end   text   source
1       5     abc     1  
8       10    def     1
15      20    ghi     1
25      30    xxx     1
42      45    zzz     1 

df2

start   end   text   source
1       6     jkl     2  
7       9     mno     2
11      13    pqr     2
16      17    stu     2
18      19    vwx     2
32      37    yyy     2
40      47    rrr     2

I want to return the intersections of the two dataframes based on the start and end columns in following format:

out_df

start_1   end_1   start_2   end_2  text_1   text_2
1         5       1         6      abc      jkl        
8         10      7         9      def      mno
15        20      16        17     ghi      stu 
15        20      18        19     ghi      vwx
42        45      40        47     zzz      rrr

What is the best method to achieve this?

3
  • I would create lists, explode them, merge, then drop duplicates, if you have the memory to do it. Commented Aug 4, 2022 at 13:24
  • 1
    Have you tried the solution here? Commented Aug 4, 2022 at 13:30
  • @RVA92 I think that is similar, but still too different from my use case. Commented Aug 4, 2022 at 13:48

1 Answer 1

3

One option is with conditional_join from pyjanitor:

# pip install pyjanitor
import pandas as pd
import janitor

df1.conditional_join(
      df2, 
      ('start', 'end', '<='), 
      ('end', 'start', '>='))

   left                 right
  start end text source start end text source
0     1   5  abc      1     1   6  jkl      2
1     8  10  def      1     7   9  mno      2
2    15  20  ghi      1    16  17  stu      2
3    15  20  ghi      1    18  19  vwx      2
4    42  45  zzz      1    40  47  rrr      2

In the dev version, you can rename the columns, and avoid the MultiIndex (the MultiIndex occurs because the column names are not unique):

# pip install git+https://github.com/pyjanitor-devs/pyjanitor.git

df1.conditional_join(
        df2, 
        ('start', 'end', '<='), 
        ('end', 'start', '>='), 
        df_columns = {'start':'start_1', 
                      'end':'end_1', 
                      'text':'text_1'},
        right_columns = {'start':'start_2', 
                         'end':'end_2', 
                         'text':'text_2'})

   start_1  end_1 text_1  start_2  end_2 text_2
0        1      5    abc        1      6    jkl
1        8     10    def        7      9    mno
2       15     20    ghi       16     17    stu
3       15     20    ghi       18     19    vwx
4       42     45    zzz       40     47    rrr

The idea for overlaps is the start of interval one should be less than the end of interval 2, while the end of interval two should be less than the start of interval one, that way overlap is assured. I pulled that idea from pd.Interval.overlaps here

Another option is with the piso library; the answer here might point you in the right direction

Sign up to request clarification or add additional context in comments.

11 Comments

Thanks. Is there an alternative method in vanilla/pandas?
I'm running into errors (updating pandas does not seem to help): pyjanitor ImportError: cannot import name 'apply_if_callable' from 'pandas.core.common'
Still having the errors? Is the error when you run the code? At what point does the error occur?
it turns out its in the import statement itself. import pyjanitor
@abokey, column selection is in the dev version. Packages usually have a Dev version and a release version. The Dev version has the latest but has not been released yet. You have to install it iwth this :pip install git+https://github.com/pyjanitor-devs/pyjanitor.git uninstall the current version before installing. The Dev version also supports numba for more peerformance
|

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.