
I'm having a hard time merging and updating Pandas DataFrames right now.

I have a bunch of CSV files that I'm parsing with pandas (which is not a problem). In a few rare cases I have multiple files that contain some of the same columns.

So, for example, let's say I have:

import pandas as pd

a = pd.DataFrame({"A": [0, 1, 2, 3], "B": [4, 5, 6, 7]}, index=[0,1,2,3])
b = pd.DataFrame({"A": [11, 12, 13, 14]}, index=[41,51,61,71])
c = pd.DataFrame({"A": [110, 111, 113]}, index=[0,1,3])

What I want is this dataframe:

      A    B
0   110    4
1   111    5
2     2    6
3   113    7
41   11  NaN
51   12  NaN
61   13  NaN
71   14  NaN

Pandas has this nice guide: Merge, join, concatenate and compare. But I can't find a solution there for what I want to achieve.

For example, a.join(b, how="outer") raises ValueError: columns overlap but no suffix specified: Index(['A'], dtype='object'). Passing rsuffix="R" is not an option either, because the end result is:

      A    B    AR
0   0.0  4.0   NaN
1   1.0  5.0   NaN
2   2.0  6.0   NaN
3   3.0  7.0   NaN
41  NaN  NaN  11.0
51  NaN  NaN  12.0
61  NaN  NaN  13.0
71  NaN  NaN  14.0

Not quite what I want.

pd.merge(a, b, how="outer") looks promising, but it is not quite right either, because the indices are ignored:

    A    B
0   0  4.0
1   1  5.0
2   2  6.0
3   3  7.0
4  11  NaN
5  12  NaN
6  13  NaN
7  14  NaN

Passing left_index=True and right_index=True yields a dataframe similar to .join(..., rsuffix="_x", lsuffix="_y"), so not what I want.
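That call, for reference; merge's default suffixes "_x"/"_y" get applied to the overlapping A column:

pd.merge(a, b, how="outer", left_index=True, right_index=True)
# -> columns A_x, B, A_y, analogous to the suffixed join output above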

Using update is almost what I want: a.update(c) modifies a in place to

       A  B
0  110.0  4
1  111.0  5
2    2.0  6
3  113.0  7

but a.update(b) does nothing (I assume because the indices of a and b are disjoint).
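A quick check confirms the suspicion: update aligns on index labels, and a and b share none:

print(a.index.intersection(b.index))  # Index([], dtype='int64') -> nothing for update to touch
print(a.index.intersection(c.index))  # Index([0, 1, 3], dtype='int64') -> these rows get overwritten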

So, is what I want even possible with a single line of code?

EDIT

I came up with this one:

lll = pd.concat([a, b, c]).sort_index()
pd.concat([a, b, c]).sort_index().drop_duplicates().groupby(lll.index).last()

      A    B
0   110  4.0
1   111  5.0
2     2  6.0
3   113  7.0
41   11  NaN
51   12  NaN
61   13  NaN
71   14  NaN

This is what I want. The question is: is this correct, or is it just a coincidence that it yields the result I wanted?


How are you determining which 'A' column has priority?

In the order I'm reading the files. The files are generated by a device (which is kind of a "black box" to me) and have a date in their filenames. So I do:


import asyncio
import glob

tasks = [parse_csv_file(fn) for fn in sorted(glob.glob("*.csv"))]
results = await asyncio.gather(*tasks)

And I would like to do (no error checking as this is an example):

results = iter(results)
merged_df = next(results)

for df in results:
    merged_df = the_magic_function_Im_looking_for(merged_df, df)
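In other words, I'm looking for the two-argument merge function to fold over the list; the loop above is just functools.reduce:

from functools import reduce

# the_magic_function_Im_looking_for takes (merged_so_far, next_df);
# results here is the original list, not the consumed iterator
merged_df = reduce(the_magic_function_Im_looking_for, results)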

2 Answers


reduce-ing with combine_first:

from functools import reduce

to_merge = [c, b, a]  # highest-priority frame first: c's values override b's, which override a's

result = reduce(pd.DataFrame.combine_first, to_merge)

which successively applies combine_first to the entries of the list, ending up with the all-combined, i.e. reduced, dataframe at the end.

(We can pass reversed(to_merge) to reduce if to_merge comes in the opposite order.)

to get

>>> result

        A    B
0   110.0  4.0
1   111.0  5.0
2     2.0  6.0
3   113.0  7.0
41   11.0  NaN
51   12.0  NaN
61   13.0  NaN
71   14.0  NaN
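Spelled out as a loop, the reduce call does:

result = to_merge[0]                       # start from the highest-priority frame, c
for df in to_merge[1:]:
    result = result.combine_first(df)      # keep existing values, fill the gaps from df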



Try concat + groupby last on axis=1 to merge the DataFrames, then take the "last" valid value per column group:

df = pd.concat([a, b, c], axis=1).groupby(level=0, axis=1).last()

df:

        A    B
0   110.0  4.0
1   111.0  5.0
2     2.0  6.0
3   113.0  7.0
41   11.0  NaN
51   12.0  NaN
61   13.0  NaN
71   14.0  NaN
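To see what this operates on: the axis=1 concat builds a wide frame with three A columns (one per input, in concat order), and the column-wise groupby keeps, per row, the last non-NaN value within each column name:

wide = pd.concat([a, b, c], axis=1)        # columns: A, B, A, A over the union of all indices
df = wide.groupby(level=0, axis=1).last()  # per row, last valid value per column name

(Note that the axis=1 argument to groupby is deprecated in newer pandas releases, so the long variant below may be preferable there.)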

Or, concatenating long and taking the last valid row per index label (thanks to @anky):

df = pd.concat([a, b, c]).groupby(level=0).last()

df:

      A    B
0   110  4.0
1   111  5.0
2     2  6.0
3   113  7.0
41   11  NaN
51   12  NaN
61   13  NaN
71   14  NaN

5 Comments

This is almost the same as what I came up with (I edited my original post), but I didn't use the level=0, axis=1 arguments. What would be the difference between my call and yours?
I'm concating on axis=1, which builds wide, then grouping by the columns. Or, as anky noted, concating long and getting the last valid value per index. Either way, the difference is that I'm not sorting or dropping duplicates, and I don't know what lll is, so I can't give more information about how that's built. @Pablo
Ah, I was playing around in IPython and did lll = pd.concat([a, b, c]).sort_index(), so this was pure luck, it seems.
No @Pablo, it's not a coincidence that this yields what you want. .sort_index().drop_duplicates() doesn't make progress towards the goal (in this case), since there are no duplicate rows in your data (drop_duplicates doesn't consider the index), and groupby(lll.index) is the same as groupby(level=0). It is equivalent to pd.concat([a, b, c]).groupby(level=0).last(), except with the additional sorting and dropping steps (and creating the concatenated data twice).
Thanks a lot, now I understand it better. A shame that I cannot accept both answers, but I find the solution with combine_first easier to understand and visualize.
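For the record, the equivalence can be checked directly; assert_frame_equal raises if the two frames differ:

from pandas.testing import assert_frame_equal

lll = pd.concat([a, b, c]).sort_index()
left = lll.drop_duplicates().groupby(lll.index).last()
right = pd.concat([a, b, c]).groupby(level=0).last()
assert_frame_equal(left, right)  # passes for this example data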
