Consider the two dataframes df_a and df_b:

>>> df_a = pd.DataFrame.from_dict({1: [1,2,3], 2: ["a", "b", "c"], 3:[4,5,6]})
>>> df_a.index = pd.Index([0,1,3])
>>> print(df_a)

   1  2  3
0  1  a  4
1  2  b  5
3  3  c  6

>>> df_b = pd.DataFrame.from_dict({2: ["d", "e", "f", "g"]})
>>> print(df_b)

   2
0  d
1  e
2  f
3  g

And the following code:

>>> df_a = pd.concat([df_a, df_b])
>>> df_c = df_a.loc[~df_a.index.duplicated(keep='last'),df_b.columns]
>>> df_d = df_a.loc[~df_a.index.duplicated(keep='first'), ~df_a.columns.isin(df_b.columns)]
>>> df_e = df_d.merge(df_c, "outer", left_index=True, right_index=True)
>>> df_e.sort_index(axis=1, inplace=True)

Which produces the desired dataframe (df_e):

>>> print(df_e)
     1  2    3
0  1.0  d  4.0
1  2.0  e  5.0
2  NaN  f  NaN
3  3.0  g  6.0

Is there a more efficient way to get to df_e? I have tried various combinations of pd.concat, pd.merge and DataFrame.update, but each attempt has had one or more of these undesirable consequences:

  1. It disrupts the index of df_a (i.e. the values do not have the same index - some sort of index creation happens 'under the hood').
  2. Columns get renamed.
  3. NaNs appear in places where df_a values should be.

Basically, the operation I want to perform is:

  1. Update df_a with values of df_b.
  2. If values exist in df_b that do not have corresponding index/columns, expand df_a appropriately to include these values (keeping the index/columns in the appropriate order).

EDIT: Provided better example that isn't naturally sorted.

3 Answers

I can think of two straightforward-ish ways to obtain your df_e; I'm not going to think much about column order, though. I've added an extra column 4 to df_b, just to show the behaviour for columns not present in df_a:

In [63]: m = df_b.combine_first(df_a)

In [64]: m
Out[64]: 
     1  2    3   4
0  1.0  d  4.0  10
1  2.0  e  5.0  11
2  NaN  f  NaN  12
3  3.0  g  6.0  13

or

In [65]: a,b = df_a.align(df_b)

In [66]: a.update(b)

In [67]: a
Out[67]: 
     1  2    3     4
0  1.0  d  4.0  10.0
1  2.0  e  5.0  11.0
2  NaN  f  NaN  12.0
3  3.0  g  6.0  13.0

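Note the slight difference in dtype introduced by the alignment: column 4 stays integer with combine_first, but becomes float after align and update, because align first creates that column in a filled with NaN.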
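For anyone who wants to reproduce both approaches, here is a minimal self-contained sketch. It assumes df_a is the original frame from the question (before it is overwritten by pd.concat), and the values 10 to 13 for the extra column 4 are read off the output above.

```python
import pandas as pd

# Frames from the question; df_a is the version before pd.concat overwrites it.
df_a = pd.DataFrame({1: [1, 2, 3], 2: ["a", "b", "c"], 3: [4, 5, 6]},
                    index=[0, 1, 3])
# Column 4 is the illustrative extra column mentioned above.
df_b = pd.DataFrame({2: ["d", "e", "f", "g"], 4: [10, 11, 12, 13]})

# Approach 1: values from df_b win; anything missing is filled from df_a.
m = df_b.combine_first(df_a)

# Approach 2: expand both frames to the union of labels, then overwrite in place.
a, b = df_a.align(df_b)  # default join="outer" on both axes
a.update(b)              # non-NaN values of b replace those in a

print(m)
print(a)
```

Both results carry the union of the two indexes and the union of the two column sets, with df_b's values taking priority wherever both frames have a value.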

1 Comment

Could you please explain what was the problem you were solving by the align and then update operations? It looks to me as if you solved the problem just with combine_first... Was that in response to my comment about ordering the index/columns...?

Reading through the pandas join documentation and the blogs here and here should help you.

From the blogs:

“Left outer join produces a complete set of records from Table A, with the matching records (where available) in Table B. If there is no match, the right side will contain null.”

df_b.join(df_a, how='left', lsuffix='_b').drop('2', axis=1).rename(columns={'2_b': 2})

   2    1    3
0  d  1.0  4.0
1  e  2.0  5.0
2  f  NaN  NaN
3  g  3.0  6.0
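A possible variant of the same left join that avoids the suffix-and-rename step, sketched under the assumption that df_a is the original frame from the question: drop the overlapping columns from df_a before joining, so nothing needs renaming afterwards. The names shared and joined are only illustrative.

```python
import pandas as pd

df_a = pd.DataFrame({1: [1, 2, 3], 2: ["a", "b", "c"], 3: [4, 5, 6]},
                    index=[0, 1, 3])
df_b = pd.DataFrame({2: ["d", "e", "f", "g"]})

# Columns present in both frames; df_b's version of these should win.
shared = df_a.columns.intersection(df_b.columns)

# The left join keeps every row of df_b and attaches the remaining df_a columns.
joined = df_b.join(df_a.drop(columns=shared), how="left")

# Optional: put the columns back in sorted order.
joined = joined.sort_index(axis=1)

print(joined)
```

Like any left join, this keeps only the index labels of df_b, so rows that exist in df_a alone would be dropped.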

This is one way:

df_b[[1, 3]] = df_a[[1, 3]]

Result:

print(df_b)

   2    1    3
0  d  1.0  4.0
1  e  2.0  5.0
2  f  NaN  NaN
3  g  3.0  6.0

2 Comments

This answer relies on knowing the affected columns, and I don't think it would produce the right answer if the index of df_b was shorter than df_a. My use case is that it is not guaranteed that the index/columns of df_a or df_b is a superset/subset of the respective index/columns of the other.
@Charlie, Of course, that is true. It wasn't clear to me from your question what the restrictions were. I'll leave this here, nevertheless, for others who happen to know column names in advance.
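
Following up on the comments above, a minimal sketch of how the column names could be derived rather than hard-coded. It assumes the original df_a and df_b from the question; only_in_a and out are illustrative names.

```python
import pandas as pd

df_a = pd.DataFrame({1: [1, 2, 3], 2: ["a", "b", "c"], 3: [4, 5, 6]},
                    index=[0, 1, 3])
df_b = pd.DataFrame({2: ["d", "e", "f", "g"]})

# Columns that exist only in df_a, computed instead of typed out.
only_in_a = list(df_a.columns.difference(df_b.columns))

# Work on a copy so the original df_b is left untouched.
out = df_b.copy()
out[only_in_a] = df_a[only_in_a]  # rows are matched on index labels

print(out)
```

This removes the need to know the column names, but not the other limitation raised in the comments: rows of df_a whose index labels are missing from df_b are still lost.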
