Avoid duplicate columns while merging with pandas

Question

I have dozens of dataframes I would like to merge with a "reference" dataframe. I want to merge the columns when they exist in both dataframes, or conversely, create a new one when they don't already exist. I have the feeling that this is closely related to this topic but I cannot figure out out to make it work in my case. Also, note that the key used for merging never contains duplicates.

# Reference dataframe
df = pd.DataFrame({'date_time':['2018-06-01 00:00:00','2018-06-01 00:30:00','2018-06-01 01:00:00','2018-06-01 01:30:00']})

# Dataframes to merge to reference dataframe
df1 = pd.DataFrame({'date_time':['2018-06-01 00:30:00','2018-06-01 01:00:00'],
                'potato':[13,21]})

df2 = pd.DataFrame({'date_time':['2018-06-01 01:30:00','2018-06-01 02:00:00','2018-06-01 02:30:00'],
                'carrot':[14,8,32]})

df3 = pd.DataFrame({'date_time':['2018-06-01 01:30:00','2018-06-01 02:00:00'],
                'potato':[27,31]})


df = df.merge(df1, how='left', on='date_time')
df = df.merge(df2, how='left', on='date_time')
df = df.merge(df3, how='left', on='date_time')

The result is :

              date_time  potato_x  carrot  potato_y
0  2018-06-01 00:00:00       NaN     NaN       NaN
1  2018-06-01 00:30:00      13.0     NaN       NaN
2  2018-06-01 01:00:00      21.0     NaN       NaN
3  2018-06-01 01:30:00       NaN    14.0      27.0

While I would like :

              date_time  potato  carrot 
0  2018-06-01 00:00:00       NaN     NaN  
1  2018-06-01 00:30:00      13.0     NaN   
2  2018-06-01 01:00:00      21.0     NaN 
3  2018-06-01 01:30:00      27.0    14.0

Edit (following @sammywemmy's answer): I have no idea what will be the dataframe columns name before importing them (in a loop). Usually, the dataframes that are merged with my reference dataframe contain about 100 columns, from which 90%-95% are common with the other dataframes.

Every new dataframe to be merge contains about 100 columns. Among these 100 columns, there might be 10 columns that have a name that is not present in the previous dataframes. So, assuming that I want to merge 15 dataframes, I will have at the end 100 columns + 15*10 = 250 columns — Antoine T
– Antoine T, Commented Feb 26, 2020 at 2:33
it seems the other columns are food names (potato, carrot,...) and the common key is date_time. 100 columns is a lot and i dont see how you can keep track of that. I suggest you write code that melts every dataframe, using date_time as your index_var, then perform the merge. — sammywemmy
– sammywemmy, Commented Feb 26, 2020 at 3:20

Scott Boston · Accepted Answer · 2020-02-26 14:07:48Z

1

I would pd.concat similar structured dataframes then merge the others like this:

df.merge(pd.concat([df1, df3]), on='date_time', how='left')\
  .merge(df2, on='date_time', how='left')

Output:

             date_time  potato  carrot
0  2018-06-01 00:00:00     NaN     NaN
1  2018-06-01 00:30:00    13.0     NaN
2  2018-06-01 01:00:00    21.0     NaN
3  2018-06-01 01:30:00    27.0    14.0

Per comments below:

df = pd.DataFrame({'date_time':['2018-06-01 00:00:00','2018-06-01 00:30:00','2018-06-01 01:00:00','2018-06-01 01:30:00']})

# Dataframes to merge to reference dataframe
df1 = pd.DataFrame({'date_time':['2018-06-01 00:30:00','2018-06-01 01:00:00'],
                'potato':[13,21]})

df2 = pd.DataFrame({'date_time':['2018-06-01 01:30:00','2018-06-01 02:00:00','2018-06-01 02:30:00'],
                'carrot':[14,8,32]})

df3 = pd.DataFrame({'date_time':['2018-06-01 01:30:00', '2018-06-01 02:00:00'],'potato':[27,31], 'zucchini':[11,1]})

df.merge(pd.concat([df1, df3]), on='date_time', how='left').merge(df2, on='date_time', how='left')

Output:

             date_time  potato  zucchini  carrot
0  2018-06-01 00:00:00     NaN       NaN     NaN
1  2018-06-01 00:30:00    13.0       NaN     NaN
2  2018-06-01 01:00:00    21.0       NaN     NaN
3  2018-06-01 01:30:00    27.0      11.0    14.0

edited Feb 26, 2020 at 14:07

answered Feb 26, 2020 at 1:22

Scott Boston

154k15 gold badges160 silver badges207 bronze badges

Sign up to request clarification or add additional context in comments.

1 Comment

Antoine T Over a year ago

I think your solution works only if the merged/concatenated dataframe is either fully similar or diffirent from df. For example, I wouldn't know how to deal with a dataframe such as : df3 = pd.DataFrame({'date_time':['2018-06-01 01:30:00', '2018-06-01 02:00:00'],'potato':[27,31], 'zucchini':[11,1]})

sammywemmy · Accepted Answer · 2020-02-26 00:36:12Z

0

Continuing from your code,
use the filter method to pull out potato related columns,
sum them along the columns axis,
and remove columns that contain potato_...

df['potato'] = df.filter(like='potato').fillna(0).sum(axis=1)

exclude_columns = df.columns.str.contains('potato_[a-z]')
df = df.loc[:,~exclude_columns]

    date_time         carrot    potato
0   2018-06-01 00:00:00 NaN     0.0
1   2018-06-01 00:30:00 NaN     13.0
2   2018-06-01 01:00:00 NaN     21.0
3   2018-06-01 01:30:00 14.0    27.0

answered Feb 26, 2020 at 0:36

sammywemmy

28.9k4 gold badges21 silver badges35 bronze badges

1 Comment

Antoine T Over a year ago

I have no idea what will be the columns name before importing them. To be more precise, every new dataframe contains about 100 columns, from which 90%-95% are common with the other dataframes. I edited my question to add these information.

Collectives™ on Stack Overflow

Avoid duplicate columns while merging with pandas

2 Answers 2

1 Comment

1 Comment

Your Answer

Linked

Hot Network Questions

Collectives™ on Stack Overflow

2 Answers 2

1 Comment

1 Comment

Your Answer

Sign up or log in

Post as a guest

Linked

Related