0

I saw many post talk about use reduce to merge multiple dataframe. I try below code, it works with three dataframes, if it greater than three, tan, error reported:

pandas.errors.MergeError: Passing 'suffixes' which cause duplicate columns {'value_x'} is not allowed.

code:

import pandas as pd
from functools import reduce

data1 = [
    ['A',1],
    ['B',2],
    ['C',3]]
data2 = [
    ['A',4],
    ['B',5],
    ['C',6]]
data3 = [
    ['A',7],
    ['B',8],
    ['C',9]]
data4 = [
    ['A',10],
    ['B',11],
    ['C',12]]

dfs = []
for data in [data1, data2, data3, data4]:
    df = pd.DataFrame(data, columns=['name', 'value'])
    dfs.append(df)

dfm = reduce(
    lambda left, right: pd.merge(
        left, right,
        on=['name'],
        how='outer',
        suffixes=('_x', '_y')),
    dfs)
print(dfm)

What's proper way to fix this error?

2

2 Answers 2

0

The problem is that after the third iteration, the left dataframe contains value_x and value as columns, then right one contains value, but adding the _x suffix would cause value to become value_x that is already existing.

You should rather use concat here:

out = (pd.concat([d.set_index('name').add_suffix(f'_{i}')
                  for i, d in enumerate(dfs, start=1)],
                 axis=1)
         .reset_index()
      )

Output:

  name  value_1  value_2  value_3  value_4
0    A        1        4        7       10
1    B        2        5        8       11
2    C        3        6        9       12

To achieve the same with reduce would be a bit convoluted:

dfm = reduce(lambda left, right:
             pd.merge(left, right[1], on=['name'],
                      how='outer', suffixes=(None, f'_{right[0]}')),
             list(enumerate(dfs[1:], start=2)),
             dfs[0])

Output:

  name  value  value_2  value_3  value_4
0    A      1        4        7       10
1    B      2        5        8       11
2    C      3        6        9       12
Sign up to request clarification or add additional context in comments.

Comments

0

Another approach, which renames the value column as the list of dfs is being created. Afterwards, just use reduce:

# creating the list of dfs and renaming the 'value' column
dfs = []
for idx, data in enumerate([data1,data2,data3,data4]):
    df = pd.DataFrame(data,columns=['name',f'value_{idx+1}'])
    dfs.append(df)

reduce(lambda left, right: pd.merge(left, right, on='name', how='outer'), dfs)

Alternatively, we can rename column value inside reduce, by using list comprehension:

reduce(
    lambda left, right: pd.merge(left, right, on='name', how='outer'),
    # renaming columns of dfs
    [x.rename(columns={'value': f'value_{idx+1}'}) for idx, x in enumerate(dfs)])

Output:

  name  value_1  value_2  value_3  value_4
0    A        1        4        7       10
1    B        2        5        8       11
2    C        3        6        9       12

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.