
I have two DataFrames in pandas: customers and flights. Both contain duplicates on the join key (Loyalty#). I am not sure whether the correct workflow is to remove duplicates before the merge or merge first and then dedupe.

Example excerpt from flights:

Loyalty#  Year  Month  YearMonthDate  NumFlights  ...
101902    2019     1   2019-01-01     0.0         ...
101902    2019     1   2019-01-01     0.0         ...   <- duplicate
101902    2019     2   2019-02-01     0.0         ...
101902    2019     2   2019-02-01     0.0         ...   <- duplicate

Example excerpt from customers:

Loyalty#  FirstName  LastName     City         LoyaltyStatus  ...
101902    Hans       Schlottmann  London       Aurora         ...
101902    Yi         Nesti        Toronto      Aurora         ...   <- duplicate
106001    Maudie     Hyland       Fredericton  Star           ...
106001    Ivette     Peifer       Montreal     Star           ...   <- duplicate

I will aggregate and do feature engineering on the flights DataFrame (groupby on Loyalty#) first, and then merge the result with customers:

df = flights_agg.merge(customers, on="Loyalty#", how="inner")

Given that both DataFrames have duplicate Loyalty#, and that I will aggregate the flights table first, is it better practice to:

1. drop duplicates in customers before the merge, or
2. merge first and deduplicate afterward?

In which situations does the order matter for correctness?

  • Why do two different customers share a loyalty number? Commented Oct 28 at 13:19
  • Assuming you're removing duplicates on the same criteria, the order doesn't matter for correctness. But for compute and clarity, you should definitely deduplicate before merging: if you have two datasets of 1M rows each and they all share the same join key, a merge without deduplication yields 1T rows (1M × 1M). See the sketch after these comments. Commented Oct 28 at 14:03
  • @JonSG I assume it's human error; honestly, I don't know why. This is just the dataset I have. Commented Oct 28 at 19:48
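To make the scaling point in the second comment concrete, here is a minimal sketch (toy data; the x and y columns are hypothetical) of how a merge on duplicated keys fans out to m × n rows per key:

import pandas as pd

# Toy frames where every row shares one join key, mirroring the
# worst case described in the comment above.
left = pd.DataFrame({'Loyalty#': [101902] * 3, 'x': range(3)})
right = pd.DataFrame({'Loyalty#': [101902] * 4, 'y': range(4)})

merged = left.merge(right, on='Loyalty#', how='inner')
print(len(merged))  # 12: each key match yields m * n rows (3 * 4 here)

With 1M rows per side on a single shared key, m × n is 10^12, which is the 1T-row blowup the comment describes.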

1 Answer

# flights -> one row per Loyalty#; drop exact duplicate rows first
# so the sum doesn't count repeated rows twice
flights_agg = (flights.drop_duplicates()
                      .groupby('Loyalty#', as_index=False)
                      .agg(NumFlights=('NumFlights', 'sum')))

# customers -> one row per Loyalty# (choose an explicit rule; here: keep last)
customers_1 = customers.drop_duplicates('Loyalty#', keep='last')

# strict one-to-one merge; raises if either side still has dup keys
df = flights_agg.merge(customers_1, on='Loyalty#', how='inner', validate='one_to_one')

drop_duplicates on flights removes the exactly repeated rows before aggregation, so the sum isn't inflated, and groupby then collapses flights to one row per Loyalty#. drop_duplicates on customers fixes that table's grain so the join isn't many-to-many. validate='one_to_one' guards correctness by raising a MergeError if duplicate keys slip through. If you track recency, replace the dedupe with customers.sort_values(['Loyalty#', 'updated_at']).drop_duplicates('Loyalty#', keep='last') to keep the latest record per key.
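As a quick check of the validate guard, here is a minimal sketch (toy frames; customers_dup is a hypothetical name for a customers table that still carries a duplicated key) showing the merge failing loudly instead of silently fanning out:

import pandas as pd

# Hypothetical minimal frames: customers_dup has a duplicated key,
# so the guarded merge refuses to run.
flights_agg = pd.DataFrame({'Loyalty#': [101902, 106001],
                            'NumFlights': [5, 2]})
customers_dup = pd.DataFrame({'Loyalty#': [101902, 101902],
                              'City': ['London', 'Toronto']})

try:
    flights_agg.merge(customers_dup, on='Loyalty#', how='inner',
                      validate='one_to_one')
except pd.errors.MergeError as exc:
    print(exc)  # pandas reports that the merge keys are not unique on the right side

Catching pd.errors.MergeError this way turns a silent correctness bug into an immediate, debuggable failure.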


1 Comment

  • That's clear enough. Thank you 🙏
