
I'm having a hard time merging and updating Pandas DataFrames right now.

I have a bunch of CSV files that I'm parsing with pandas (which is not a problem). In a few rare cases I have multiple files that contain some of the same columns.

So, for example, let's say I have:

import pandas as pd

a = pd.DataFrame({"A": [0, 1, 2, 3], "B": [4, 5, 6, 7]}, index=[0,1,2,3])
b = pd.DataFrame({"A": [11, 12, 13, 14]}, index=[41,51,61,71])
c = pd.DataFrame({"A": [110, 111, 113]}, index=[0,1,3])

What I want is this dataframe:

      A    B
0   110    4
1   111    5
2     2    6
3   113    7
41   11  NaN
51   12  NaN
61   13  NaN
71   14  NaN

Pandas has this nice guide: Merge, join, concatenate and compare. But I can't find a solution there for what I want to achieve.

For example, a.join(b, how="outer") raises ValueError: columns overlap but no suffix specified: Index(['A'], dtype='object'). Passing rsuffix="R" is not an option either, because the end result is:

      A    B    AR
0   0.0  4.0   NaN
1   1.0  5.0   NaN
2   2.0  6.0   NaN
3   3.0  7.0   NaN
41  NaN  NaN  11.0
51  NaN  NaN  12.0
61  NaN  NaN  13.0
71  NaN  NaN  14.0

Not quite what I want.

pd.merge(a, b, how="outer") looks promising, but it is not quite right either, because the indices are ignored:

    A    B
0   0  4.0
1   1  5.0
2   2  6.0
3   3  7.0
4  11  NaN
5  12  NaN
6  13  NaN
7  14  NaN

Passing left_index=True and right_index=True yields a dataframe similar to .join(..., rsuffix="_x", lsuffix="_y"), so not what I want.
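That call, for reference; merge's default suffixes "_x"/"_y" get applied to the overlapping A column:

pd.merge(a, b, how="outer", left_index=True, right_index=True)
# -> columns A_x, B, A_y, analogous to the suffixed join output above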

Using update is almost what I want: a.update(c) modifies a in place to

       A  B
0  110.0  4
1  111.0  5
2    2.0  6
3  113.0  7

but a.update(b) does nothing (I assume because the indices of a and b are disjoint).
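A quick check confirms the suspicion: update aligns on index labels, and a and b share none:

print(a.index.intersection(b.index))  # Index([], dtype='int64') -> nothing for update to touch
print(a.index.intersection(c.index))  # Index([0, 1, 3], dtype='int64') -> these rows get overwritten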

So, is what I want even possible with a single line of code?

EDIT

I came up with this one:

lll = pd.concat([a, b, c]).sort_index()
pd.concat([a, b, c]).sort_index().drop_duplicates().groupby(lll.index).last()

      A    B
0   110  4.0
1   111  5.0
2     2  6.0
3   113  7.0
41   11  NaN
51   12  NaN
61   13  NaN
71   14  NaN

This is what I want. The question is: is this correct, or is it just a coincidence that it yields the result I wanted?


How are you determining which 'A' column has priority?

In the order I'm reading the files. The files are generated by a device (which is kind of a "black box" to me) and have a date in their filenames. So I do:


import asyncio
import glob

tasks = [parse_csv_file(fn) for fn in sorted(glob.glob("*.csv"))]
results = await asyncio.gather(*tasks)

And I would like to do (no error checking as this is an example):

results = iter(results)
merged_df = next(results)

for df in results:
    merged_df = the_magic_function_Im_looking_for(merged_df, df)
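In other words, I'm looking for the two-argument merge function to fold over the list; the loop above is just functools.reduce:

from functools import reduce

# the_magic_function_Im_looking_for takes (merged_so_far, next_df);
# results here is the original list, not the consumed iterator
merged_df = reduce(the_magic_function_Im_looking_for, results)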

2 Answers


reduce-ing with combine_first:

from functools import reduce

to_merge = [c, b, a]  # highest-priority frame first: c's values override b's, which override a's

result = reduce(pd.DataFrame.combine_first, to_merge)

which successively applies combine_first to the entries of the list, ending up with the all-combined, i.e. reduced, dataframe at the end.

(We can pass reversed(to_merge) to reduce if to_merge comes in the opposite order.)

to get

>>> result

        A    B
0   110.0  4.0
1   111.0  5.0
2     2.0  6.0
3   113.0  7.0
41   11.0  NaN
51   12.0  NaN
61   13.0  NaN
71   14.0  NaN
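Spelled out as a loop, the reduce call does:

result = to_merge[0]                       # start from the highest-priority frame, c
for df in to_merge[1:]:
    result = result.combine_first(df)      # keep existing values, fill the gaps from df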



Try concat + groupby last on axis=1 to merge the DataFrames, then take the "last" valid value per column group:

df = pd.concat([a, b, c], axis=1).groupby(level=0, axis=1).last()

df:

        A    B
0   110.0  4.0
1   111.0  5.0
2     2.0  6.0
3   113.0  7.0
41   11.0  NaN
51   12.0  NaN
61   13.0  NaN
71   14.0  NaN
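To see what this operates on: the axis=1 concat builds a wide frame with three A columns (one per input, in concat order), and the column-wise groupby keeps, per row, the last non-NaN value within each column name:

wide = pd.concat([a, b, c], axis=1)        # columns: A, B, A, A over the union of all indices
df = wide.groupby(level=0, axis=1).last()  # per row, last valid value per column name

(Note that the axis=1 argument to groupby is deprecated in newer pandas releases, so the long variant below may be preferable there.)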

Or, concatenating long and taking the last valid row per index label (thanks to @anky):

df = pd.concat([a, b, c]).groupby(level=0).last()

df:

      A    B
0   110  4.0
1   111  5.0
2     2  6.0
3   113  7.0
41   11  NaN
51   12  NaN
61   13  NaN
71   14  NaN

5 Comments

This is almost the same as what I came up with (I edited my original post), but I didn't use the level=0, axis=1 arguments. What would be the difference between my call and yours?
I'm concating on axis=1, which builds wide, then grouping by the columns. Or, as anky noted, concating long and getting the last valid value per index. Either way, the difference is that I'm not sorting or dropping duplicates, and I don't know what lll is, so I can't give more information about how that's built. @Pablo
Ah, I was playing around in IPython and did lll = pd.concat([a, b, c]).sort_index(), so this was pure luck, it seems.
No @Pablo, it's not a coincidence that this yields what you want. .sort_index().drop_duplicates() doesn't make progress towards the goal (in this case), since there are no duplicate rows in your data (drop_duplicates doesn't consider the index), and groupby(lll.index) is the same as groupby(level=0). It is equivalent to pd.concat([a, b, c]).groupby(level=0).last(), except with the additional sorting and dropping steps (and creating the concatenated data twice).
Thanks a lot, now I understand it better. A shame that I cannot accept both answers, but I find the solution with combine_first easier to understand and visualize.
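For the record, the equivalence can be checked directly; assert_frame_equal raises if the two frames differ:

from pandas.testing import assert_frame_equal

lll = pd.concat([a, b, c]).sort_index()
left = lll.drop_duplicates().groupby(lll.index).last()
right = pd.concat([a, b, c]).groupby(level=0).last()
assert_frame_equal(left, right)  # passes for this example data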
