
I have two DataFrames in pandas: customers and flights. Both contain duplicates on the join key (Loyalty#). I am not sure whether the correct workflow is to remove duplicates before the merge or merge first and then dedupe.

Example excerpt from flights:

Loyalty#  Year  Month  YearMonthDate  NumFlights  ...
101902    2019     1   2019-01-01     0.0         ...
101902    2019     1   2019-01-01     0.0         ...   <- duplicate
101902    2019     2   2019-02-01     0.0         ...
101902    2019     2   2019-02-01     0.0         ...   <- duplicate

Example excerpt from customers:

Loyalty#  FirstName  LastName     City         LoyaltyStatus  ...
101902    Hans       Schlottmann  London       Aurora         ...
101902    Yi         Nesti        Toronto      Aurora         ...   <- duplicate
106001    Maudie     Hyland       Fredericton  Star           ...
106001    Ivette     Peifer       Montreal     Star           ...   <- duplicate

I will aggregate and do feature engineering on the flights DataFrame (groupby on Loyalty#) first, and then merge the result with customers:

df = flights_agg.merge(customers, on="Loyalty#", how="inner")

Given that both DataFrames have duplicate Loyalty#, and that I will aggregate the flights table first, is it better practice to:

1. drop duplicates in customers before the merge, or
2. merge first and deduplicate afterward?

In which situations does the order matter for correctness?

  • Why do two different customers share a loyalty number? Commented Oct 28 at 13:19
  • Assuming you're removing duplicates on the same criteria, the order doesn't matter for correctness. But for compute and clarity, you should definitely deduplicate before merging: if you have two datasets of 1M rows each and they all share the same join key, a merge without deduplication yields 1T rows (1M × 1M). See the sketch after these comments. Commented Oct 28 at 14:03
  • @JonSG I assume it's human error; honestly, I don't know why. This is just the dataset I have. Commented Oct 28 at 19:48
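To make the scaling point in the second comment concrete, here is a minimal sketch (toy data; the x and y columns are hypothetical) of how a merge on duplicated keys fans out to m × n rows per key:

import pandas as pd

# Toy frames where every row shares one join key, mirroring the
# worst case described in the comment above.
left = pd.DataFrame({'Loyalty#': [101902] * 3, 'x': range(3)})
right = pd.DataFrame({'Loyalty#': [101902] * 4, 'y': range(4)})

merged = left.merge(right, on='Loyalty#', how='inner')
print(len(merged))  # 12: each key match yields m * n rows (3 * 4 here)

With 1M rows per side on a single shared key, m × n is 10^12, which is the 1T-row blowup the comment describes.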

1 Answer

# flights -> one row per Loyalty#; drop exact duplicate rows first
# so the sum doesn't count repeated rows twice
flights_agg = (flights.drop_duplicates()
                      .groupby('Loyalty#', as_index=False)
                      .agg(NumFlights=('NumFlights', 'sum')))

# customers -> one row per Loyalty# (choose an explicit rule; here: keep last)
customers_1 = customers.drop_duplicates('Loyalty#', keep='last')

# strict one-to-one merge; raises if either side still has dup keys
df = flights_agg.merge(customers_1, on='Loyalty#', how='inner', validate='one_to_one')

drop_duplicates on flights removes the exactly repeated rows before aggregation, so the sum isn't inflated, and groupby then collapses flights to one row per Loyalty#. drop_duplicates on customers fixes that table's grain so the join isn't many-to-many. validate='one_to_one' guards correctness by raising a MergeError if duplicate keys slip through. If you track recency, replace the dedupe with customers.sort_values(['Loyalty#', 'updated_at']).drop_duplicates('Loyalty#', keep='last') to keep the latest record per key.
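As a quick check of the validate guard, here is a minimal sketch (toy frames; customers_dup is a hypothetical name for a customers table that still carries a duplicated key) showing the merge failing loudly instead of silently fanning out:

import pandas as pd

# Hypothetical minimal frames: customers_dup has a duplicated key,
# so the guarded merge refuses to run.
flights_agg = pd.DataFrame({'Loyalty#': [101902, 106001],
                            'NumFlights': [5, 2]})
customers_dup = pd.DataFrame({'Loyalty#': [101902, 101902],
                              'City': ['London', 'Toronto']})

try:
    flights_agg.merge(customers_dup, on='Loyalty#', how='inner',
                      validate='one_to_one')
except pd.errors.MergeError as exc:
    print(exc)  # pandas reports that the merge keys are not unique on the right side

Catching pd.errors.MergeError this way turns a silent correctness bug into an immediate, debuggable failure.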


1 Comment

  • That's clear enough. Thank you 🙏
