Suppose I have the following two data frames:
df1:
    Composite   Beta_value  Chromosome      Start        End  Gene_Symbol
0  cg00000029  0.297449111       chr16   53434200   53434201         RBL2
1  cg00000108  0.660066803        chr3   37417715   37417716      C3orf35
2  cg00000109  0.660066803        chr3  172198247  172198248       FNDC3B
3  cg00000165  0.660066803        chr1   90729117   90729118      C3orf35
4  cg00000236  0.905679244        chr8   42405776   42405777        VDAC3
df2:
     Composite   Beta_value  Chromosome      Start        End  Gene_Symbol
 2  cg00000109  0.660066803        chr3  172198247  172198248       FNDC3B
 3  cg00000165  0.660066803        chr1   90729117   90729118      C3orf35
 4  cg00000236  0.905679244        chr8   42405776   42405777        VDAC3
46  cg00002116  0.017114732       chr17   81703380   81703381       MRPL12
47  cg00002145  0.780230816        chr2  237340893  237340894       COL6A3
48  cg00002190  0.781140134        chr8   19697522   19697523   CSGALNACT1
49  cg00002224  0.220786047        chr8  143038982  143038983      C8orf31
I want to find the intersection of these two data frames based on the "Start" and "Gene_Symbol" columns, keeping only the rows of df1 whose "Start" and "Gene_Symbol" values both match a row in df2. For example, I want my result to look like this:
    Composite   Beta_value  Chromosome      Start        End  Gene_Symbol
2  cg00000109  0.660066803        chr3  172198247  172198248       FNDC3B
3  cg00000165  0.660066803        chr1   90729117   90729118      C3orf35
4  cg00000236  0.905679244        chr8   42405776   42405777        VDAC3
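A minimal sketch of one way to get that result, assuming the sample rows above are loaded as df1 and df2: build a boolean mask from the (Start, Gene_Symbol) pairs and use it to filter df1. The names keys, mask and result are only illustrative, and this is one possible approach rather than the only one.

```python
import pandas as pd

# Sample rows copied from the tables above.
df1 = pd.DataFrame(
    {
        "Composite": ["cg00000029", "cg00000108", "cg00000109", "cg00000165", "cg00000236"],
        "Beta_value": [0.297449111, 0.660066803, 0.660066803, 0.660066803, 0.905679244],
        "Chromosome": ["chr16", "chr3", "chr3", "chr1", "chr8"],
        "Start": [53434200, 37417715, 172198247, 90729117, 42405776],
        "End": [53434201, 37417716, 172198248, 90729118, 42405777],
        "Gene_Symbol": ["RBL2", "C3orf35", "FNDC3B", "C3orf35", "VDAC3"],
    }
)
df2 = pd.DataFrame(
    {
        "Composite": ["cg00000109", "cg00000165", "cg00000236", "cg00002116",
                      "cg00002145", "cg00002190", "cg00002224"],
        "Beta_value": [0.660066803, 0.660066803, 0.905679244, 0.017114732,
                       0.780230816, 0.781140134, 0.220786047],
        "Chromosome": ["chr3", "chr1", "chr8", "chr17", "chr2", "chr8", "chr8"],
        "Start": [172198247, 90729117, 42405776, 81703380, 237340893, 19697522, 143038982],
        "End": [172198248, 90729118, 42405777, 81703381, 237340894, 19697523, 143038983],
        "Gene_Symbol": ["FNDC3B", "C3orf35", "VDAC3", "MRPL12", "COL6A3", "CSGALNACT1", "C8orf31"],
    },
    index=[2, 3, 4, 46, 47, 48, 49],
)

keys = ["Start", "Gene_Symbol"]

# Turn the key columns of each frame into a MultiIndex of (Start, Gene_Symbol) pairs.
df1_pairs = pd.MultiIndex.from_frame(df1[keys])
df2_pairs = pd.MultiIndex.from_frame(df2[keys])

# True for every df1 row whose pair also occurs in df2.
mask = df1_pairs.isin(df2_pairs)

# Filtering keeps df1's six columns and its original index labels (2, 3, 4 here).
result = df1[mask]
print(result)
```

Because this only filters df1, no columns from df2 are brought in and no suffixes are added.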
By intersection I do not mean merging the two data frames and ending up with every other column duplicated, which is what happened when I used:
import pandas as pd
intersection = pd.merge(df1, df2, how='inner', on=['Start','Gene_Symbol'])
intersection.dropna(inplace=True)
This merged the columns from both of my data frames (the full data has more columns than the sample above), e.g.:
intersection.columns
Index(['Composite Element REF_x', 'Beta_value_x', 'Chromosome_x', 'Start',
       'End_x', 'Gene_Symbol', 'Gene_Type_x', 'Transcript_ID_x',
       'Position_to_TSS_x', 'CGI_Coordinate_x', 'Feature_Type_x',
       'Composite Element REF_y', 'Beta_value_y', 'Chromosome_y', 'End_y',
       'Gene_Type_y', 'Transcript_ID_y', 'Position_to_TSS_y',
       'CGI_Coordinate_y', 'Feature_Type_y'],
      dtype='object')
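The _x/_y suffixes show up because every non-key column exists in both frames, so merge has to rename them. A hedged sketch of a merge that avoids this, using a trimmed, illustrative version of the sample data (fewer rows and columns): pass only the key columns of df2 to the merge, so there is nothing left to duplicate. The name df2_keys is my own.

```python
import pandas as pd

# Trimmed, illustrative version of the sample data (fewer columns and rows).
df1 = pd.DataFrame(
    {
        "Composite": ["cg00000029", "cg00000109", "cg00000236"],
        "Start": [53434200, 172198247, 42405776],
        "Gene_Symbol": ["RBL2", "FNDC3B", "VDAC3"],
    }
)
df2 = pd.DataFrame(
    {
        "Composite": ["cg00000109", "cg00000236", "cg00002116"],
        "Start": [172198247, 42405776, 81703380],
        "Gene_Symbol": ["FNDC3B", "VDAC3", "MRPL12"],
    }
)

keys = ["Start", "Gene_Symbol"]

# Restrict df2 to the key columns; drop_duplicates() guards against df1 rows
# being repeated when df2 contains the same key pair more than once.
df2_keys = df2[keys].drop_duplicates()

# Inner merge: only df1's columns come back, with no _x/_y suffixes.
# Note: merge returns a fresh 0..n-1 index rather than df1's original labels.
intersection = df1.merge(df2_keys, how="inner", on=keys)
print(intersection.columns.tolist())
```

If the original index labels of df1 matter, a boolean mask on the key pairs may be a better fit than a merge, since filtering leaves the index untouched.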