
I have a data frame (dfCust) like so:

| cust_key | first_name | last_name | address         |
|----------|------------|-----------|-----------------|
| 12345    | John       | Doe       | 123 Some street |
| 12345    | John       | Doe       | 123 Some st     |
| 67890    | Jane       | Doe       | 456 Some street |

and I would like to remove duplicate records so that the cust_key field is unique. I don't care which record is dropped: by the point this runs, the addresses have already been deduplicated, so the only duplicates that trickle through are spelling variants. I would like the following resulting dataframe:

| cust_key | first_name | last_name | address         |
|----------|------------|-----------|-----------------|
| 12345    | John       | Doe       | 123 Some street |
| 67890    | Jane       | Doe       | 456 Some street |

In R this would be done like this:

dfCust <- unique(setDT(dfCust), by = "cust_key")

but I need a way to do this in pandas.

  • df.drop_duplicates('cust_key') for dropping duplicates based on a single column: cust_key Commented Jan 8, 2020 at 16:51
  • perfect, thank you. I knew it was something small I was missing. If you put this into an answer I'll upvote and accept! Commented Jan 8, 2020 at 16:52
  • That's okay, it's a dupe: check this: stackoverflow.com/questions/50885093/… Commented Jan 8, 2020 at 16:54

1 Answer

df.drop_duplicates(subset='cust_key')
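A minimal runnable sketch using the sample data from the question. By default drop_duplicates keeps the first occurrence of each cust_key; pass keep='last' if you'd rather keep the other one:

```python
import pandas as pd

# Sample data matching the question's dfCust
dfCust = pd.DataFrame({
    'cust_key':   [12345, 12345, 67890],
    'first_name': ['John', 'John', 'Jane'],
    'last_name':  ['Doe',  'Doe',  'Doe'],
    'address':    ['123 Some street', '123 Some st', '456 Some street'],
})

# Keep one row per cust_key; which duplicate survives doesn't matter here
dfCust = dfCust.drop_duplicates(subset='cust_key')
```

This is the pandas analogue of `unique(setDT(dfCust), by = "cust_key")` in data.table.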
Sign up to request clarification or add additional context in comments.

1 Comment

If the DataFrames are separate, they need to be concatenated first.
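A sketch of the concatenate-then-dedupe pattern the comment describes, assuming two hypothetical frames `dfA` and `dfB` with the same columns:

```python
import pandas as pd

dfA = pd.DataFrame({'cust_key': [12345],
                    'address':  ['123 Some street']})
dfB = pd.DataFrame({'cust_key': [12345, 67890],
                    'address':  ['123 Some st', '456 Some street']})

# Stack the frames, then keep one row per cust_key
combined = (pd.concat([dfA, dfB], ignore_index=True)
              .drop_duplicates(subset='cust_key'))
```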
