1

I want to combine two data frames with different and overlapping columns:

df1
    X   a   Y   b     c
A   P   1   Q   21    1.135899
B   P   2   Q   22    1.093204
C   P   3   Q   23    2.035373
D   P   4   Q   24    0.350060
E   P   5   Q   25   -0.939962

df2
    a    b     d
A   1    21    5.5
A   1    21    3.3
A   1    21    2.1
B   2    22    0.8
B   2    22    0.5
C   3    23    1.3
C   3    23    6.5
C   3    23    7.1

I would like to combine both data frames in this way:

df3
    a    b   c          d
A   1    21  1.135899   5.5
A   1    21  1.135899   3.3
A   1    21  1.135899   2.1
B   2    22  1.093204   0.8
B   2    22  1.093204   0.5
C   3    23  2.035373   1.3
C   3    23  2.035373   6.5
C   3    23  2.035373   7.1

How can I achieve this?

2 Answers 2

2

Try a left merge. To maintain index, you will need use reset_index before and set_index after the marge.

res = df2.reset_index()\
         .merge(df1, how='left')\
         .set_index('index')\
         .loc[:, ['a', 'b', 'c', 'd']]

print(res)

#        a   b         c    d
# index                      
# A      1  21  1.135899  5.5
# A      1  21  1.135899  3.3
# A      1  21  1.135899  2.1
# B      2  22  1.093204  0.8
# B      2  22  1.093204  0.5
# C      3  23  2.035373  1.3
# C      3  23  2.035373  6.5
# C      3  23  2.035373  7.1
Sign up to request clarification or add additional context in comments.

3 Comments

I forgot to mention that the columns of df1 are not ordered, some other columns are between column a, b and c. I updated the example.
Thank you this works, but I am not sure what .loc[:, ['a', 'b', 'c', 'd'] is doing...Could you give me some explanation?
Sure, pd.DataFrame.loc indexes by label: rows & columns respectively. : represents all rows, ['a', 'b', 'c', 'd'] after the comma represents columns selected by label.
0

For append all columns from df1 if not exist in df2 filter by difference and join by default left join:

df = df2.join(df1[df1.columns.difference(df2.columns)])
print (df)
   a   b    d  X  Y         c
A  1  21  5.5  P  Q  1.135899
A  1  21  3.3  P  Q  1.135899
A  1  21  2.1  P  Q  1.135899
B  2  22  0.8  P  Q  1.093204
B  2  22  0.5  P  Q  1.093204
C  3  23  1.3  P  Q  2.035373
C  3  23  6.5  P  Q  2.035373
C  3  23  7.1  P  Q  2.035373

And if need only some columns add subset by list:

df = df2.join(df1[df1.columns.difference(df2.columns)])[['a','b','c','d']]
print (df)
   a   b         c    d
A  1  21  1.135899  5.5
A  1  21  1.135899  3.3
A  1  21  1.135899  2.1
B  2  22  1.093204  0.8
B  2  22  1.093204  0.5
C  3  23  2.035373  1.3
C  3  23  2.035373  6.5
C  3  23  2.035373  7.1

Detail:

print (df1[df1.columns.difference(df2.columns)])
   X  Y         c
A  P  Q  1.135899
B  P  Q  1.093204
C  P  Q  2.035373
D  P  Q  0.350060
E  P  Q -0.939962

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.