Consider the two dataframes df_a and df_b:
>>> import pandas as pd
>>> df_a = pd.DataFrame.from_dict({1: [1,2,3], 2: ["a", "b", "c"], 3:[4,5,6]})
>>> df_a.index = pd.Index([0,1,3])
>>> print(df_a)
   1  2  3
0  1  a  4
1  2  b  5
3  3  c  6
>>> df_b = pd.DataFrame.from_dict({2: ["d", "e", "f", "g"]})
>>> print(df_b)
   2
0  d
1  e
2  f
3  g
And the following code:
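>>> # Stack df_b under df_a; overlapping index labels now appear twice.
>>> df_a = pd.concat([df_a, df_b])
>>> # For df_b's columns, keep the last occurrence of each label (the df_b row).
>>> df_c = df_a.loc[~df_a.index.duplicated(keep='last'),df_b.columns]
>>> # For the remaining columns, keep the first occurrence (the original df_a row).
>>> df_d = df_a.loc[~df_a.index.duplicated(keep='first'), ~df_a.columns.isin(df_b.columns)]
>>> # Rejoin the two pieces on the index, then restore the column order.
>>> df_e = df_d.merge(df_c, "outer", left_index=True, right_index=True)
>>> df_e.sort_index(axis=1, inplace=True)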
Which produces the desired dataframe (df_e):
>>> print(df_e)
     1  2    3
0  1.0  d  4.0
1  2.0  e  5.0
2  NaN  f  NaN
3  3.0  g  6.0
Is there a more efficient way to get to df_e? I have tried various approaches using pd.concat, pd.merge and DataFrame.update, but each attempt has produced one or more of these undesirable consequences:
- It disrupts the index of df_a (i.e. the values do not keep the same index; some sort of index creation happens 'under the hood').
- Columns get renamed.
- NaNs appear in places where df_a values should be.
Basically, the operation I want to perform is:
- Update df_a with values of df_b.
- If values exist in df_b that do not have corresponding index/columns, expand df_a appropriately to include these values (keeping the index/columns in the appropriate order).
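For illustration, the two steps above look close to what DataFrame.combine_first does: it takes the union of both indexes and columns, prefers the caller's non-null values, and falls back to the other frame elsewhere. The following is only a sketch under that assumption, using the original df_a and df_b (before the concat above); the name `result` is hypothetical, and the union of labels comes back sorted, which may or may not be the "appropriate order" in every case.

```python
import pandas as pd

# Rebuild the two example frames from the question.
df_a = pd.DataFrame.from_dict({1: [1, 2, 3], 2: ["a", "b", "c"], 3: [4, 5, 6]})
df_a.index = pd.Index([0, 1, 3])
df_b = pd.DataFrame.from_dict({2: ["d", "e", "f", "g"]})

# Call combine_first on df_b so that df_b's values take priority;
# cells that df_b does not cover are filled from df_a, and labels
# present in only one frame are kept (missing cells become NaN).
result = df_b.combine_first(df_a)  # `result` is a hypothetical name

print(result)
```

Calling it on df_b rather than df_a is deliberate: combine_first keeps the caller's non-null values, so the frame carrying the updates has to be the caller.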
EDIT: Provided a better example that isn't naturally sorted.