Finding intersection of two Data Frames based on columns

Question

Suppose I have the following two data frames:

df1:
    Composite   Beta_value  Chromosome  Start       End     Gene_Symbol
0   cg00000029  0.297449111 chr16       53434200    53434201    RBL2
1   cg00000108  0.660066803 chr3        37417715    37417716    C3orf35
2   cg00000109  0.660066803 chr3        172198247   172198248   FNDC3B
3   cg00000165  0.660066803 chr1        90729117    90729118    C3orf35
4   cg00000236  0.905679244 chr8        42405776    42405777    VDAC3



df2:     
    Composite   Beta_value  Chromosome  Start       End     Gene_Symbol
2   cg00000109  0.660066803 chr3        172198247   172198248   FNDC3B
3   cg00000165  0.660066803 chr1        90729117    90729118    C3orf35
4   cg00000236  0.905679244 chr8        42405776    42405777    VDAC3
46  cg00002116  0.017114732 chr17       81703380    81703381    MRPL12
47  cg00002145  0.780230816 chr2        237340893   237340894   COL6A3
48  cg00002190  0.781140134 chr8        19697522    19697523    CSGALNACT1
49  cg00002224  0.220786047 chr8        143038982   143038983   C8orf31

I want to find the intersection of these two data frames based on the "Start" and "Gene_Symbol" columns, keeping only the rows of df1 whose "Start" and "Gene_Symbol" values match a row in df2. For example, I want my result to look like this:

    Composite   Beta_value  Chromosome  Start       End     Gene_Symbol
2   cg00000109  0.660066803 chr3        172198247   172198248   FNDC3B
3   cg00000165  0.660066803 chr1        90729117    90729118    C3orf35
4   cg00000236  0.905679244 chr8        42405776    42405777    VDAC3

By intersection I do not mean merging both data frames and ending up with 12 columns, as I did with:

intersection = pd.merge(df1, df2, how='inner', on=['Start','Gene_Symbol'])
intersection.dropna(inplace=True)

That merged the columns from both of my data frames, e.g.:

intersection.columns
Index(['Composite Element REF_x', 'Beta_value_x', 'Chromosome_x', 'Start',
       'End_x', 'Gene_Symbol', 'Gene_Type_x', 'Transcript_ID_x',
       'Position_to_TSS_x', 'CGI_Coordinate_x', 'Feature_Type_x',
       'Composite Element REF_y', 'Beta_value_y', 'Chromosome_y', 'End_y',
       'Gene_Type_y', 'Transcript_ID_y', 'Position_to_TSS_y',
       'CGI_Coordinate_y', 'Feature_Type_y'],
      dtype='object')

Erfan · Accepted Answer · 2019-12-12 09:13:19Z

Select only the key columns from df2 before calling DataFrame.merge; that way none of df2's other columns are carried into the result:

keys = ['Start', 'Gene_Symbol']
intersection = df1.merge(df2[keys], on=keys)

    Composite  Beta_value Chromosome      Start        End Gene_Symbol
0  cg00000109    0.660067       chr3  172198247  172198248      FNDC3B
1  cg00000165    0.660067       chr1   90729117   90729118     C3orf35
2  cg00000236    0.905679       chr8   42405776   42405777       VDAC3
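
Note that the merged result above is renumbered 0..2, while the desired output in the question keeps df1's original labels 2, 3, 4. If the original index matters, one possible alternative (not part of this answer, only a sketch) is to build a boolean mask from the key pairs and filter df1 with it. The df2 below is an abbreviated stand-in that holds only the key columns, since those are all the comparison needs.

```python
import pandas as pd

df1 = pd.DataFrame(
    {
        "Composite": ["cg00000029", "cg00000108", "cg00000109", "cg00000165", "cg00000236"],
        "Beta_value": [0.297449111, 0.660066803, 0.660066803, 0.660066803, 0.905679244],
        "Chromosome": ["chr16", "chr3", "chr3", "chr1", "chr8"],
        "Start": [53434200, 37417715, 172198247, 90729117, 42405776],
        "End": [53434201, 37417716, 172198248, 90729118, 42405777],
        "Gene_Symbol": ["RBL2", "C3orf35", "FNDC3B", "C3orf35", "VDAC3"],
    }
)
# Abbreviated stand-in for df2: only the key columns are needed here.
df2 = pd.DataFrame(
    {
        "Start": [172198247, 90729117, 42405776, 81703380],
        "Gene_Symbol": ["FNDC3B", "C3orf35", "VDAC3", "MRPL12"],
    },
    index=[2, 3, 4, 46],
)

keys = ["Start", "Gene_Symbol"]

# Treat each row's (Start, Gene_Symbol) pair as one tuple and test membership in df2's pairs.
mask = pd.MultiIndex.from_frame(df1[keys]).isin(pd.MultiIndex.from_frame(df2[keys]))

# Boolean filtering keeps df1's columns and its original index labels.
filtered = df1[mask]
print(filtered)
```

Unlike a merge, this filter never repeats a df1 row, even when df2 contains the same key pair more than once.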

davidbilla · Accepted Answer · 2019-12-12 10:37:20Z

Use only the required columns from df2.

pd.merge(df1, df2[['Start','Gene_Symbol']], on=['Start','Gene_Symbol'])
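
One caveat worth checking on real data: an inner merge emits one output row per matching pair, so if df2 holds the same (Start, Gene_Symbol) combination more than once, the matching df1 row is repeated. The sketch below uses small made-up frames (the values are hypothetical, not from the question) to show the effect and how dropping duplicate key pairs first avoids it.

```python
import pandas as pd

# Hypothetical toy data, chosen only to illustrate the duplicate-key effect.
df1 = pd.DataFrame(
    {"Start": [100, 200, 300], "Gene_Symbol": ["A", "B", "C"], "Beta_value": [0.1, 0.2, 0.3]}
)
# The pair (200, "B") occurs twice in this df2.
df2 = pd.DataFrame(
    {"Start": [200, 200, 300], "Gene_Symbol": ["B", "B", "C"], "Beta_value": [0.5, 0.6, 0.7]}
)

keys = ["Start", "Gene_Symbol"]

# Plain merge: the df1 row with (200, "B") appears once per match in df2.
plain = pd.merge(df1, df2[keys], on=keys)

# De-duplicating the key pairs first yields at most one output row per df1 row.
deduped = pd.merge(df1, df2[keys].drop_duplicates(), on=keys)

print(len(plain))    # 3
print(len(deduped))  # 2
```

Whether the repeated rows are wanted depends on the analysis; for a pure "keep the df1 rows that also occur in df2" filter, the de-duplicated form is probably the safer default.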
