Pandas Merge - How to avoid duplicating columns

Question

I am trying to merge two data frames. Each data frame has two index levels (date, cusip). Some columns appear in both data frames, for example currency and adj_date.

What is the best way to merge these by index without ending up with two copies of currency and adj_date?

Each data frame has 90 columns, so I am trying to avoid writing everything out by hand.

df:                 currency  adj_date   data_col1 ...
date        cusip
2012-01-01  XSDP      USD      2012-01-03   0.45
...

df2:                currency  adj_date   data_col2 ...
date        cusip
2012-01-01  XSDP      USD      2012-01-03   0.45
...

If I do:

dfNew = merge(df, df2, left_index=True, right_index=True, how='outer')

I get

dfNew:              currency_x  adj_date_x   data_col2 ... currency_y adj_date_y
date        cusip
2012-01-01  XSDP      USD      2012-01-03   0.45             USD         2012-01-03

Thank you! ...

Ninjakannon · Accepted Answer · 2019-12-02 14:24:51Z

239

You can work out the columns that are only in one DataFrame and use this to select a subset of columns in the merge.

cols_to_use = df2.columns.difference(df.columns)

Then perform the merge (note that cols_to_use is an Index object, which can select columns directly; it also has a handy tolist() method).

dfNew = merge(df, df2[cols_to_use], left_index=True, right_index=True, how='outer')

This will avoid any columns clashing in the merge.
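As a minimal sketch of the approach above, assuming two small frames that share a (date, cusip) index and the currency and adj_date columns (the sample values are made up for illustration):

```python
import pandas as pd

# Hypothetical sample data shaped like the frames in the question.
idx = pd.MultiIndex.from_tuples(
    [("2012-01-01", "XSDP"), ("2012-01-02", "XSDP")],
    names=["date", "cusip"],
)
df = pd.DataFrame(
    {"currency": ["USD", "USD"],
     "adj_date": ["2012-01-03", "2012-01-04"],
     "data_col1": [0.45, 0.50]},
    index=idx,
)
df2 = pd.DataFrame(
    {"currency": ["USD", "USD"],
     "adj_date": ["2012-01-03", "2012-01-04"],
     "data_col2": [0.45, 0.55]},
    index=idx,
)

# Columns that exist only in df2; the shared ones are left out.
cols_to_use = df2.columns.difference(df.columns)

dfNew = pd.merge(df, df2[cols_to_use], left_index=True, right_index=True, how="outer")
print(dfNew.columns.tolist())  # ['currency', 'adj_date', 'data_col1', 'data_col2']
```

Note that Index.difference returns its result sorted, so the columns taken from df2 may not keep their original order.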

edited Dec 2, 2019 at 14:24

Ninjakannon

answered Oct 1, 2013 at 20:43

EdChum

3 Comments

jimmy Over a year ago

What if the key is a column and it's called the same? It would be dropped with the first step.

DaReal Over a year ago

Due to the constraint in the above comment, I went with @rprog's answer below.

golmschenk Over a year ago

If the key is a column, to use this answer, convert the columns to use to a list (cols_to_use = cols_to_use.tolist()) and append name of your key column to this list (cols_to_use.append('key_column_name')). You also need to change the merge from using left_index and right_index to using on='key_column_name'.
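The key-column variant described in this comment might look like the following sketch. The frames and the column name "key" are hypothetical and only serve as an illustration:

```python
import pandas as pd

# Hypothetical frames that share a key column named "key" and a "currency" column.
left = pd.DataFrame(
    {"key": [1, 2, 3], "currency": ["USD", "EUR", "USD"], "data_col1": [0.1, 0.2, 0.3]}
)
right = pd.DataFrame(
    {"key": [1, 2, 4], "currency": ["USD", "EUR", "GBP"], "data_col2": [1.0, 2.0, 4.0]}
)

# Columns only present in the right frame, plus the key needed for the merge.
cols_to_use = right.columns.difference(left.columns).tolist()
cols_to_use.append("key")

merged = pd.merge(left, right[cols_to_use], on="key", how="outer")
print(merged.columns.tolist())  # ['key', 'currency', 'data_col1', 'data_col2']
```

One consequence worth knowing: with how='outer', rows that exist only in the right frame (key 4 here) get NaN in the shared columns, because the right frame's copies of those columns were discarded before the merge.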

Kyle F. Hartzenberg · Accepted Answer · 2023-06-01 06:14:34Z

199

I use the suffixes option in .merge() followed by drop():

dfNew = df.merge(df2, left_index=True, right_index=True,
                 how='outer', suffixes=('', '_y'))

dfNew.drop(dfNew.filter(regex='_y$').columns, axis=1, inplace=True)

Thanks @ijoseph

edited Jun 1, 2023 at 6:14

Kyle F. Hartzenberg

answered Jun 26, 2016 at 0:13

rprog

4 Comments

ijoseph Over a year ago

Would be a more helpful answer if it included the code for filtering (which is fairly straightforward, yet still time-consuming to look up/ error-prone to remember). i.e. dfNew.drop(list(dfNew.filter(regex='_y$')), axis=1, inplace=True)

MLLDantas Over a year ago

I find this solution better because I can still merge using one of the columns as a reference instead of the index. Then, I can only remove the duplicates. Thank you!

Johan Over a year ago

Best solution. You may consider including elaborate regex as suggested below

axolotl Over a year ago

I'm going to be that person flagging a case when an original column name ends in _y so that it would be dropped as well:)
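One way to sidestep the case raised in this comment is to compute the overlapping column names up front and drop exactly those suffixed names, rather than matching on a regex. This is a sketch with made-up frames, not part of the original answer; it assumes no existing column already carries the name of an overlapping column plus the suffix:

```python
import pandas as pd

# Hypothetical frames; "energy_y" is an original column that happens to end in "_y".
left = pd.DataFrame({"currency": ["USD"], "energy_y": [1.0]}, index=["a"])
right = pd.DataFrame({"currency": ["USD"], "data_col2": [0.45]}, index=["a"])

suffix = "_y"
# Names present in both frames; only these receive the suffix during the merge.
overlap = left.columns.intersection(right.columns)

merged = left.merge(right, left_index=True, right_index=True,
                    how="outer", suffixes=("", suffix))
# Drop only the suffixed copies that the merge created.
merged = merged.drop(columns=[name + suffix for name in overlap])
print(merged.columns.tolist())  # ['currency', 'energy_y', 'data_col2']
```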

Chrisji · Accepted Answer · 2020-11-20 09:00:13Z

28

Building on @rprog's answer, you can combine the various pieces of the suffix & filter step into one line using a negative regex:

dfNew = df.merge(df2, left_index=True, right_index=True,
             how='outer', suffixes=('', '_DROP')).filter(regex='^(?!.*_DROP)')

Or using df.join:

dfNew = df.join(df2, lsuffix="DROP").filter(regex="^(?!.*DROP)")

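The regex here keeps every column whose name does not contain the suffix ("_DROP" in the merge version, "DROP" in the join version), so make sure to use a suffix that doesn't already appear in any column name. Note that the join version uses lsuffix, so it is the left frame's copies of the shared columns that get discarded.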
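To see how a negative lookahead behaves in DataFrame.filter, here is a small sketch on an empty frame with hypothetical column names. It compares the pattern from the answer with an anchored variant that only rejects names ending in the suffix:

```python
import pandas as pd

# Hypothetical column names; "_DROP_rate" contains the suffix but does not end with it.
frame = pd.DataFrame(columns=["currency", "currency_DROP", "_DROP_rate", "data_col2"])

# Pattern from the answer: rejects any name that contains "_DROP" anywhere.
contains = frame.filter(regex=r"^(?!.*_DROP)").columns.tolist()

# Anchored variant: rejects only names that end with "_DROP".
ends_with = frame.filter(regex=r"^(?!.*_DROP$)").columns.tolist()

print(contains)   # ['currency', 'data_col2']
print(ends_with)  # ['currency', '_DROP_rate', 'data_col2']
```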

edited Nov 20, 2020 at 9:00

Chrisji

answered Jun 25, 2020 at 6:53

Elliott Collins

JulienD · Accepted Answer · 2017-11-22 15:56:23Z

I'm new to pandas, but I wanted to achieve the same thing: automatically avoiding column names with _x or _y and removing duplicate data. I finally did it by combining two other answers from Stack Overflow.

sales.csv

    city;state;units
    Mendocino;CA;1
    Denver;CO;4
    Austin;TX;2

revenue.csv

    branch_id;city;revenue;state_id
    10;Austin;100;TX
    20;Austin;83;TX
    30;Austin;4;TX
    47;Austin;200;TX
    20;Denver;83;CO
    30;Springfield;4;I

merge.py

import pandas

def drop_y(df):
    # list comprehension of the cols that end with '_y'
    to_drop = [x for x in df if x.endswith('_y')]
    df.drop(to_drop, axis=1, inplace=True)


sales = pandas.read_csv('data/sales.csv', delimiter=';')
revenue = pandas.read_csv('data/revenue.csv', delimiter=';')

result = pandas.merge(sales, revenue,  how='inner', left_on=['state'], right_on=['state_id'], suffixes=('', '_y'))
drop_y(result)
result.to_csv('results/output.csv', index=True, index_label='id', sep=';')

When executing the merge command I replace the _x suffix with an empty string, and then I can remove the columns ending with _y.

output.csv

    id;city;state;units;branch_id;revenue;state_id
    0;Denver;CO;4;20;83;CO
    1;Austin;TX;2;10;100;TX
    2;Austin;TX;2;20;83;TX
    3;Austin;TX;2;30;4;TX
    4;Austin;TX;2;47;200;TX

sophocles · Accepted Answer · 2020-11-20 15:55:50Z

3

This works around the problem rather than avoiding it, but I have written a function that deals with the extra columns after the merge:

import pandas as pd

def merge_fix_cols(df_company, df_product, uniqueID):
    df_merged = pd.merge(df_company, df_product, how='left', on=uniqueID)
    # Drop the right-hand copies of the overlapping columns.
    to_drop = [col for col in df_merged if col.endswith('_y')]
    df_merged = df_merged.drop(columns=to_drop)
    # Remove the '_x' suffix by slicing; col.rstrip('_x') would strip every
    # trailing '_' and 'x' character (e.g. 'tax_x' would become 'ta').
    return df_merged.rename(
        columns=lambda col: col[:-2] if col.endswith('_x') else col)

Seems to work well with my merges!

answered Nov 20, 2020 at 15:55

sophocles

william_grisaitis · Accepted Answer · 2023-03-28 20:34:11Z

3

If the indexes are the same (big if true!) you can do:

df = df1.copy()
df[df2.columns] = df2

This is similar to merge

pd.merge(df1, df2, left_index=True, right_index=True)

but with no duplicate columns
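A small sketch of the assignment approach, using made-up frames whose indexes hold the same labels in a different order. It illustrates two things that are easy to miss: the assignment aligns on index labels rather than position, and columns that exist in both frames end up with df2's values:

```python
import pandas as pd

# Hypothetical frames: same index labels, different row order, different currency values.
df1 = pd.DataFrame({"currency": ["USD", "EUR"], "data_col1": [0.1, 0.2]},
                   index=["a", "b"])
df2 = pd.DataFrame({"currency": ["GBP", "JPY"], "data_col2": [1.0, 2.0]},
                   index=["b", "a"])

df = df1.copy()
df[df2.columns] = df2  # aligned on index labels, not on row position

print(df.columns.tolist())      # ['currency', 'data_col1', 'data_col2']
print(df.loc["a", "currency"])  # JPY (df2's value replaced df1's)
```

If the shared columns are not guaranteed to be identical in both frames, assigning only df2.columns.difference(df1.columns) would leave df1's values untouched.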

answered Mar 28, 2023 at 20:34

william_grisaitis

1 Comment

Irene Over a year ago

As per 2023, this should be the accepted answer, because it is concise and expressive.

Tom Fink · Accepted Answer · 2022-04-23 21:17:47Z

2

You can remove the duplicate y columns you don't want after the join:

# Join df and df2
dfNew = merge(df, df2, left_index=True, right_index=True, how='inner')

# Remove the y columns by selecting the columns you want to keep
dfNew = dfNew.loc[:, ["currency_x", "adj_date_x", "data_col1", "data_col2"]]

Output: currency_x | adj_date_x | data_col1 | data_col2

answered Apr 23, 2022 at 21:17

Tom Fink

sophocles · Accepted Answer · 2021-02-03 18:28:37Z

0

Can't you just subset the columns in either df first?

[i for i in df.columns if i not in df2.columns]

dfNew = merge(df[[i for i in df.columns if i not in df2.columns]], df2, left_index=True, right_index=True, how='outer')

edited Feb 3, 2021 at 18:28

sophocles

answered Jan 8, 2021 at 16:20

user6046760

Abimael Domínguez · Accepted Answer · 2021-05-28 18:09:19Z

0

When the number of columns you want to exclude is smaller than the number you want to keep, you can use this kind of filtering:

df.loc[:, ~df.columns.isin(['currency', 'adj_date'])]

This selects every column in the dataframe except 'currency' and 'adj_date'. To use it in the merge, build the mask from the columns of the frame being filtered (df2 here):

    dfNew = merge(df, 
                  df2.loc[:, ~df2.columns.isin(['currency', 'adj_date'])], 
                  left_index=True,
                  right_index=True,
                  how='outer')

Note the "~", it means "not".

answered May 28, 2021 at 18:09

Abimael Domínguez

Till Hoffmann · Accepted Answer · 2021-07-28 14:09:00Z

You can include duplicate columns in the key to merge on to ensure only a single copy appears in the result.

import numpy as np
import pandas as pd

# Generate some dummy data.
shared = pd.DataFrame({'key': range(5), 'name': list('abcde')})
a = shared.copy()
a['value_a'] = np.random.normal(0, 1, 5)
b = shared.copy()
b['value_b'] = np.random.normal(0, 1, 5)

# Standard merge.
merged = pd.merge(a, b, on='key')
print(merged.columns)  # Index(['key', 'name_x', 'value_a', 'name_y', 'value_b'], dtype='object')

# Merge with both keys.
merged = pd.merge(a, b, on=['key', 'name'])
print(merged.columns)  # Index(['key', 'name', 'value_a', 'value_b'], dtype='object')

This method also ensures that values in columns that appear in both data frames are consistent (e.g. that the currency in both columns is the same). If they are not, the corresponding row will be dropped (if how = 'inner') or occur with missing values (if how = 'outer').
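The consistency behaviour described above can be checked with a short sketch. The frames are hypothetical; the currency for key 2 deliberately disagrees between them:

```python
import pandas as pd

# Hypothetical frames: the shared "currency" column disagrees for key 2.
a = pd.DataFrame({"key": [1, 2], "currency": ["USD", "EUR"], "value_a": [0.1, 0.2]})
b = pd.DataFrame({"key": [1, 2], "currency": ["USD", "GBP"], "value_b": [1.0, 2.0]})

inner = pd.merge(a, b, on=["key", "currency"], how="inner")
outer = pd.merge(a, b, on=["key", "currency"], how="outer")

print(len(inner))  # 1: only the row where key and currency both agree survives
print(len(outer))  # 3: the disagreeing rows appear separately, with missing values
```

In the outer result, the (2, 'EUR') row has no value_b and the (2, 'GBP') row has no value_a, which makes the inconsistency visible instead of silently picking one side.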

Wim Yedema · Accepted Answer · 2022-06-10 11:58:51Z

0

If you're merging on arbitrary columns and don't want to keep the right key this will do the trick:

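mrg = pd.merge(a, b, how="left", left_on="A_KEY", right_on="B_KEY")
# cols_to_use is the list of columns from b that you want to keep;
# drop() returns a new DataFrame, so assign the result.
mrg = mrg.drop(columns=b.columns.difference(cols_to_use))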

answered Jun 10, 2022 at 11:58

Wim Yedema
