pandas: Better way to update and merge dataframes

Question

Consider the two dataframes df_a and df_b:

>>> import pandas as pd
>>> df_a = pd.DataFrame.from_dict({1: [1,2,3], 2: ["a", "b", "c"], 3:[4,5,6]})
>>> df_a.index = pd.Index([0,1,3])
>>> print(df_a)

   1  2  3
0  1  a  4
1  2  b  5
3  3  c  6

>>> df_b = pd.DataFrame.from_dict({2: ["d", "e", "f", "g"]})
>>> print(df_b)

   2
0  d
1  e
2  f
3  g

And the following code:

>>> df_a = pd.concat([df_a, df_b])
>>> df_c = df_a.loc[~df_a.index.duplicated(keep='last'),df_b.columns]
>>> df_d = df_a.loc[~df_a.index.duplicated(keep='first'), ~df_a.columns.isin(df_b.columns)]
>>> df_e = df_d.merge(df_c, "outer", left_index=True, right_index=True)
>>> df_e.sort_index(axis=1, inplace=True)

Which produces the desired dataframe (df_e):

>>> print(df_e)
     1  2    3
0  1.0  d  4.0
1  2.0  e  5.0
2  NaN  f  NaN
3  3.0  g  6.0

Is there a more efficient way to get to df_e? I have tried various combinations of pd.concat, pd.merge and DataFrame.update, but my efforts have resulted in one or more of these undesirable consequences:

It disrupts the index of df_a (i.e. the values do not have the same index - some sort of index creation happens 'under the hood').
Columns get renamed.
NaNs appear in places where df_a values should be.

Basically, the operation I want to perform is:

Update df_a with values of df_b.
If values exist in df_b that do not have corresponding index/columns, expand df_a appropriately to include these values (keeping the index/columns in the appropriate order).

EDIT: Provided better example that isn't naturally sorted.

DSM · Accepted Answer · 2018-02-14 14:01:56Z

4

I can think of two straightforward-ish ways to obtain your df_e; I'm not going to think much about column order, though. Adding an extra column 4 to df_b, just to show the behaviour for columns not present in df_a:

In [63]: m = df_b.combine_first(df_a)

In [64]: m
Out[64]: 
     1  2    3   4
0  1.0  d  4.0  10
1  2.0  e  5.0  11
2  NaN  f  NaN  12
3  3.0  g  6.0  13

or

In [65]: a,b = df_a.align(df_b)

In [66]: a.update(b)

In [67]: a
Out[67]: 
     1  2    3     4
0  1.0  d  4.0  10.0
1  2.0  e  5.0  11.0
2  NaN  f  NaN  12.0
3  3.0  g  6.0  13.0

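Note the slight difference in dtype introduced by the alignment: column 4 stays integer with combine_first, but comes out as float after align and update, because align first adds column 4 to a filled with NaN.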
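
For anyone who wants to reproduce the two approaches above, here is a minimal self-contained sketch. The values of the extra column 4 (10 to 13) are inferred from the printed output rather than stated in the answer, and df_a is the original frame from the question, before it is reassigned by pd.concat.

```python
import pandas as pd

# Frames from the question (df_a as defined before the pd.concat reassignment).
df_a = pd.DataFrame({1: [1, 2, 3], 2: ["a", "b", "c"], 3: [4, 5, 6]},
                    index=[0, 1, 3])
df_b = pd.DataFrame({2: ["d", "e", "f", "g"]})

# Assumption: the extra column 4, with values read off the printed output.
df_b[4] = [10, 11, 12, 13]

# Approach 1: take df_b's values where present, fall back to df_a elsewhere.
# The result covers the union of both indexes and both sets of columns.
m = df_b.combine_first(df_a)

# Approach 2: reindex both frames to the union of labels, then overwrite
# a's cells in place with every non-NaN cell of b.
a, b = df_a.align(df_b)
a.update(b)

print(m)
print(a)
```

Neither approach touches df_a or df_b: combine_first returns a new frame, and update works in place on the aligned copy a.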

edited Feb 14, 2018 at 14:01

answered Feb 13, 2018 at 23:48


1 Comment

Charlie Over a year ago

Could you please explain what was the problem you were solving by the align and then update operations? It looks to me as if you solved the problem just with combine_first... Was that in response to my comment about ordering the index/columns...?

Kevin · Accepted Answer · 2018-02-13 23:45:34Z

2

Reading through pandas join and blogs here and here should help you.

From the blogs:

“Left outer join produces a complete set of records from Table A, with the matching records (where available) in Table B. If there is no match, the right side will contain null.”

df_b.join(df_a, how='left', lsuffix='_b').drop('2', axis=1).rename(columns={'2_b': 2})

   2    1    3
0  d  1.0  4.0
1  e  2.0  5.0
2  f  NaN  NaN
3  g  3.0  6.0

answered Feb 13, 2018 at 23:45


jpp · Accepted Answer · 2018-02-13 23:59:33Z

0

This is one way:

df_b[[1, 3]] = df_a[[1, 3]]

Result:

print(df_b)

   2    1    3
0  d  1.0  4.0
1  e  2.0  5.0
2  f  NaN  NaN
3  g  3.0  6.0
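
If the column names are not known in advance, one possible variation is to derive them from the two frames. This is only a sketch: the name out is hypothetical, it works on a copy so that df_b is left untouched, and it still assumes that df_b's index already covers every row of df_a, since rows of df_a missing from df_b would be dropped.

```python
import pandas as pd

df_a = pd.DataFrame({1: [1, 2, 3], 2: ["a", "b", "c"], 3: [4, 5, 6]},
                    index=[0, 1, 3])
df_b = pd.DataFrame({2: ["d", "e", "f", "g"]})

# Work on a copy so the original df_b is not modified.
out = df_b.copy()

# Columns that exist only in df_a; each assignment aligns on the index,
# so rows absent from df_a (index 2 here) become NaN.
for col in df_a.columns.difference(df_b.columns):
    out[col] = df_a[col]

print(out)
```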

answered Feb 13, 2018 at 23:59


2 Comments

Charlie Over a year ago

This answer relies on knowing the affected columns, and I don't think it would produce the right answer if the index of df_b was shorter than df_a. My use case is that it is not guaranteed that the index/columns of df_a or df_b is a superset/subset of the respective index/columns of the other.

jpp Over a year ago

@Charlie, Of course, that is true. It wasn't clear to me from your question what the restrictions were. I'll leave this here, nevertheless, for others who happen to know column names in advance.
