How to combine dataframes with redundant rows in pandas

Question

I want to combine two data frames with different and overlapping columns:

df1
    X   a   Y   b     c
A   P   1   Q   21    1.135899
B   P   2   Q   22    1.093204
C   P   3   Q   23    2.035373
D   P   4   Q   24    0.350060
E   P   5   Q   25   -0.939962

df2
    a    b     d
A   1    21    5.5
A   1    21    3.3
A   1    21    2.1
B   2    22    0.8
B   2    22    0.5
C   3    23    1.3
C   3    23    6.5
C   3    23    7.1

I would like to combine both data frames in this way:

df3
    a    b   c          d
A   1    21  1.135899   5.5
A   1    21  1.135899   3.3
A   1    21  1.135899   2.1
B   2    22  1.093204   0.8
B   2    22  1.093204   0.5
C   3    23  2.035373   1.3
C   3    23  2.035373   6.5
C   3    23  2.035373   7.1

How can I achieve this?

jpp · Accepted Answer · 2018-04-19 11:25:42Z


Try a left merge. To preserve the index, you will need to use reset_index before the merge and set_index after it.

res = df2.reset_index()\
         .merge(df1, how='left')\
         .set_index('index')\
         .loc[:, ['a', 'b', 'c', 'd']]

print(res)

#        a   b         c    d
# index                      
# A      1  21  1.135899  5.5
# A      1  21  1.135899  3.3
# A      1  21  1.135899  2.1
# B      2  22  1.093204  0.8
# B      2  22  1.093204  0.5
# C      3  23  2.035373  1.3
# C      3  23  2.035373  6.5
# C      3  23  2.035373  7.1
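
For anyone who wants to run this end to end, here is a minimal self-contained sketch that rebuilds the two frames from the question and applies the same left merge. The explicit on=['a', 'b'] and the final rename_axis(None) are additions for illustration, not part of the answer above: without on, merge falls back to the columns the two frames share, which are also a and b here.

```python
import pandas as pd

# Frames rebuilt from the question's example data.
df1 = pd.DataFrame(
    {
        "X": ["P"] * 5,
        "a": [1, 2, 3, 4, 5],
        "Y": ["Q"] * 5,
        "b": [21, 22, 23, 24, 25],
        "c": [1.135899, 1.093204, 2.035373, 0.350060, -0.939962],
    },
    index=list("ABCDE"),
)
df2 = pd.DataFrame(
    {
        "a": [1, 1, 1, 2, 2, 3, 3, 3],
        "b": [21, 21, 21, 22, 22, 23, 23, 23],
        "d": [5.5, 3.3, 2.1, 0.8, 0.5, 1.3, 6.5, 7.1],
    },
    index=list("AAABBCCC"),
)

res = (
    df2.reset_index()                       # move df2's index into a column named "index"
    .merge(df1, how="left", on=["a", "b"])  # match rows on the values of a and b
    .set_index("index")                     # restore df2's original labels
    .loc[:, ["a", "b", "c", "d"]]           # keep only the wanted columns, in this order
    .rename_axis(None)                      # assumption: drop the leftover "index" axis name
)

print(res)
```

Note that this matches rows on the values in a and b, not on the index labels, so it would still work if df1 and df2 used unrelated indexes.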

edited Apr 19, 2018 at 11:25

answered Apr 19, 2018 at 10:58


3 Comments

honeymoon Over a year ago

I forgot to mention that the columns of df1 are not ordered: some other columns sit between columns a, b and c. I updated the example.

honeymoon Over a year ago

Thank you, this works, but I am not sure what .loc[:, ['a', 'b', 'c', 'd']] is doing. Could you give me some explanation?

jpp Over a year ago

Sure. pd.DataFrame.loc indexes by label, rows first and then columns. The : before the comma selects all rows, and ['a', 'b', 'c', 'd'] after the comma selects those columns by label.
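
A small sketch of that label-based selection, using a made-up two-row frame (the frame and variable names here are only for illustration):

```python
import pandas as pd

# Hypothetical frame, just to show how .loc treats rows and columns.
demo = pd.DataFrame(
    {"a": [1, 2], "b": [21, 22], "c": [1.1, 1.0], "d": [5.5, 0.8]},
    index=["A", "B"],
)

# ":" keeps every row; the list picks columns by label, in the order given.
all_rows = demo.loc[:, ["d", "a"]]

# A list of row labels before the comma restricts the rows as well.
only_a = demo.loc[["A"], ["a", "c"]]

print(all_rows)
print(only_a)
```

Because the column list also sets the output order, the same call can reorder columns as well as drop them.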

jezrael · Accepted Answer · 2018-04-19 11:38:22Z

To append all columns from df1 that do not already exist in df2, select them with columns.difference and use join, which performs a left join on the index by default:

df = df2.join(df1[df1.columns.difference(df2.columns)])
print(df)
   a   b    d  X  Y         c
A  1  21  5.5  P  Q  1.135899
A  1  21  3.3  P  Q  1.135899
A  1  21  2.1  P  Q  1.135899
B  2  22  0.8  P  Q  1.093204
B  2  22  0.5  P  Q  1.093204
C  3  23  1.3  P  Q  2.035373
C  3  23  6.5  P  Q  2.035373
C  3  23  7.1  P  Q  2.035373

If you need only some of the columns, select them with a list afterwards:

df = df2.join(df1[df1.columns.difference(df2.columns)])[['a','b','c','d']]
print(df)
   a   b         c    d
A  1  21  1.135899  5.5
A  1  21  1.135899  3.3
A  1  21  1.135899  2.1
B  2  22  1.093204  0.8
B  2  22  1.093204  0.5
C  3  23  2.035373  1.3
C  3  23  2.035373  6.5
C  3  23  2.035373  7.1

Detail (the columns of df1 that are not in df2):

print(df1[df1.columns.difference(df2.columns)])
   X  Y         c
A  P  Q  1.135899
B  P  Q  1.093204
C  P  Q  2.035373
D  P  Q  0.350060
E  P  Q -0.939962
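
As a runnable sketch of the join approach, the snippet below rebuilds the question's frames and repeats the steps above. The names extra and out are introduced here for illustration only. Unlike a merge on a and b, join aligns on the index labels, so this relies on df1 and df2 sharing the labels A, B, C as in the question.

```python
import pandas as pd

# Frames rebuilt from the question's example data.
df1 = pd.DataFrame(
    {
        "X": ["P"] * 5,
        "a": [1, 2, 3, 4, 5],
        "Y": ["Q"] * 5,
        "b": [21, 22, 23, 24, 25],
        "c": [1.135899, 1.093204, 2.035373, 0.350060, -0.939962],
    },
    index=list("ABCDE"),
)
df2 = pd.DataFrame(
    {
        "a": [1, 1, 1, 2, 2, 3, 3, 3],
        "b": [21, 21, 21, 22, 22, 23, 23, 23],
        "d": [5.5, 3.3, 2.1, 0.8, 0.5, 1.3, 6.5, 7.1],
    },
    index=list("AAABBCCC"),
)

# Columns present in df1 but not in df2 (returned as a sorted Index).
extra = df1.columns.difference(df2.columns)

# join is a left join on the index by default, so each df2 row picks up
# the df1 row that carries the same index label.
out = df2.join(df1[extra])[["a", "b", "c", "d"]]

print(out)
```

Dropping the overlapping columns first also avoids the error join raises when both frames contain columns with the same name and no suffix is given.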
