pandas - merge multiple dataframe with reduce

Question

I saw many post talk about use reduce to merge multiple dataframe. I try below code, it works with three dataframes, if it greater than three, tan, error reported:

pandas.errors.MergeError: Passing 'suffixes' which cause duplicate columns {'value_x'} is not allowed.

code:

import pandas as pd
from functools import reduce

data1 = [
    ['A',1],
    ['B',2],
    ['C',3]]
data2 = [
    ['A',4],
    ['B',5],
    ['C',6]]
data3 = [
    ['A',7],
    ['B',8],
    ['C',9]]
data4 = [
    ['A',10],
    ['B',11],
    ['C',12]]

dfs = []
for data in [data1, data2, data3, data4]:
    df = pd.DataFrame(data, columns=['name', 'value'])
    dfs.append(df)

dfm = reduce(
    lambda left, right: pd.merge(
        left, right,
        on=['name'],
        how='outer',
        suffixes=('_x', '_y')),
    dfs)
print(dfm)

What's proper way to fix this error?

I'm not sure what you were expecting to happen here. What's your desired output? For reference, see How to make good reproducible pandas examples or minimal reproducible example — wjandrea
– wjandrea, Commented Oct 2, 2024 at 18:47

mozway · Accepted Answer · 2024-10-02 18:42:27Z

The problem is that after the third iteration, the left dataframe contains value_x and value as columns, then right one contains value, but adding the _x suffix would cause value to become value_x that is already existing.

You should rather use concat here:

out = (pd.concat([d.set_index('name').add_suffix(f'_{i}')
                  for i, d in enumerate(dfs, start=1)],
                 axis=1)
         .reset_index()
      )

Output:

  name  value_1  value_2  value_3  value_4
0    A        1        4        7       10
1    B        2        5        8       11
2    C        3        6        9       12

To achieve the same with reduce would be a bit convoluted:

dfm = reduce(lambda left, right:
             pd.merge(left, right[1], on=['name'],
                      how='outer', suffixes=(None, f'_{right[0]}')),
             list(enumerate(dfs[1:], start=2)),
             dfs[0])

Output:

  name  value  value_2  value_3  value_4
0    A      1        4        7       10
1    B      2        5        8       11
2    C      3        6        9       12

PaulS · Accepted Answer · 2024-10-02 19:01:22Z

Another approach, which renames the value column as the list of dfs is being created. Afterwards, just use reduce:

# creating the list of dfs and renaming the 'value' column
dfs = []
for idx, data in enumerate([data1,data2,data3,data4]):
    df = pd.DataFrame(data,columns=['name',f'value_{idx+1}'])
    dfs.append(df)

reduce(lambda left, right: pd.merge(left, right, on='name', how='outer'), dfs)

Alternatively, we can rename column value inside reduce, by using list comprehension:

reduce(
    lambda left, right: pd.merge(left, right, on='name', how='outer'),
    # renaming columns of dfs
    [x.rename(columns={'value': f'value_{idx+1}'}) for idx, x in enumerate(dfs)])

Output:

  name  value_1  value_2  value_3  value_4
0    A        1        4        7       10
1    B        2        5        8       11
2    C        3        6        9       12

Collectives™ on Stack Overflow

pandas - merge multiple dataframe with reduce

2 Answers 2

Comments

Comments

Your Answer

Linked

Hot Network Questions

Collectives™ on Stack Overflow

2 Answers 2

Comments

Comments

Your Answer

Sign up or log in

Post as a guest

Linked

Related