I have two DataFrames in pandas: customers and flights. Both contain duplicates on the join key (Loyalty#). I am not sure whether the correct workflow is to remove duplicates before the merge or merge first and then dedupe.
Example excerpt from flights:
Loyalty#  Year  Month  YearMonthDate  NumFlights  ...
101902    2019  1      2019-01-01     0.0         ...
101902    2019  1      2019-01-01     0.0         ...  <- duplicate
101902    2019  2      2019-02-01     0.0         ...
101902    2019  2      2019-02-01     0.0         ...  <- duplicate
Example excerpt from customers:
Loyalty#  FirstName  LastName     City         LoyaltyStatus  ...
101902    Hans       Schlottmann  London       Aurora         ...
101902    Yi         Nesti        Toronto      Aurora         ...  <- duplicate key, different attributes
106001    Maudie     Hyland       Fredericton  Star           ...
106001    Ivette     Peifer       Montreal     Star           ...  <- duplicate key, different attributes
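For context, here is how I check the duplicate keys. This is a minimal sketch on toy data mirroring the excerpt above (column values are made up); note that the rows sharing a Loyalty# disagree on the other columns, so any drop_duplicates(subset="Loyalty#") silently picks one version of the customer:

```python
import pandas as pd

# Toy customers table matching the excerpt above (values are illustrative).
customers = pd.DataFrame({
    "Loyalty#":  [101902, 101902, 106001, 106001],
    "FirstName": ["Hans", "Yi", "Maudie", "Ivette"],
    "City":      ["London", "Toronto", "Fredericton", "Montreal"],
})

# Rows whose Loyalty# already appeared earlier in the table.
dup_mask = customers["Loyalty#"].duplicated()
print("duplicate keys:", dup_mask.sum())

# keep=False flags every row involved in a key collision, which makes
# the attribute conflicts visible before deciding how to dedupe.
print(customers[customers["Loyalty#"].duplicated(keep=False)])
```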
I will aggregate and do feature engineering on the flights DataFrame (a groupby on Loyalty#) before merging. After that, I will merge the aggregated result with customers:
df = flights_agg.merge(customers, on="Loyalty#", how="inner")
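To make the two orderings concrete, here is a minimal sketch on toy data shaped like the excerpts above (column names and values are assumptions, and TotalFlights is a made-up aggregate). The groupby already collapses the flights duplicates as a side effect, but exact duplicate rows still need dropping first or they get double-counted in the sum:

```python
import pandas as pd

flights = pd.DataFrame({
    "Loyalty#":   [101902, 101902, 101902, 101902],
    "Year":       [2019, 2019, 2019, 2019],
    "Month":      [1, 1, 2, 2],
    "NumFlights": [0.0, 0.0, 0.0, 0.0],
})
customers = pd.DataFrame({
    "Loyalty#":  [101902, 101902, 106001, 106001],
    "FirstName": ["Hans", "Yi", "Maudie", "Ivette"],
})

# Drop exact duplicate rows BEFORE aggregating, otherwise they are
# counted twice in the sum; the groupby then yields one row per key.
flights_agg = (
    flights
    .drop_duplicates()
    .groupby("Loyalty#", as_index=False)
    .agg(TotalFlights=("NumFlights", "sum"))
)

# Option 1: dedupe customers before the merge (one row per key).
customers_unique = customers.drop_duplicates(subset="Loyalty#", keep="first")
before = flights_agg.merge(customers_unique, on="Loyalty#", how="inner")

# Option 2: merge first, dedupe after. Each duplicate customer row
# multiplies the matching aggregated rows (a one-to-many fan-out),
# and keep="first" then decides which customer version survives.
after = flights_agg.merge(customers, on="Loyalty#", how="inner")
fanned_out_rows = len(after)            # 2 rows for one key here
after = after.drop_duplicates(subset="Loyalty#", keep="first")

# merge(..., validate="many_to_one") would raise MergeError here,
# because customers has repeated Loyalty# values.
print(len(before), fanned_out_rows, len(after))
```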
Given that both DataFrames have duplicate Loyalty#, and that I will aggregate the flights table first, is it better practice to:
drop duplicates in customers before the merge, or
merge first and deduplicate afterward?
In which situations does the order matter for correctness?