How to perform drop_duplicates with multiple conditions in a pandas DataFrame

Question

I have a df,

    Sr.No   Name    Class   Data
0   1       Sri     1       sri is a good player
1   ''      Sri     2       sri is good in cricket
2   ''      Sri     3       sri went out
3   2       Ram     1       Ram is a good player
4   ''      Ram     2       sri is good in cricket
5   ''      Ram     3       Ram went out
6   3       Sri     1       sri is a good player
7   ''      Sri     2       sri is good in cricket
8   ''      Sri     3       sri went out
9   4       Sri     1       sri is a good player
10  ''      Sri     2       sri is good in cricket
11  ''      Sri     3       sri went out
12  ''      Sri     4       sri came back

I am trying to drop duplicates based on ["Name", "Class", "Data"], but per Sr.No: a block of rows belonging to one Sr.No should be dropped only when all of its sentences duplicate an earlier block.

My expected output is,

out_df


    Sr.No   Name    Class   Data
0   1       Sri     1       sri is a good player
1           Sri     2       sri is good in cricket
2           Sri     3       sri went out
3   2       Ram     1       Ram is a good player
4           Ram     2       sri is good in cricket
5           Ram     3       Ram went out
9   4       Sri     1       sri is a good player
10          Sri     2       sri is good in cricket
11          Sri     3       sri went out
12          Sri     4       sri came back

Can you please print df.to_dict() and paste the output in your question? Your dataframes are so difficult to copy.
– cs95, Commented Jan 19, 2018 at 6:26
Your to_dict output is different from what your posted dataframe is. Please do make it consistent so your expected output is clear ;)
– cs95, Commented Jan 19, 2018 at 6:38
@cᴏʟᴅsᴘᴇᴇᴅ, I edited my question with the proper df.to_dict(), please check.
– Pyd, Commented Jan 19, 2018 at 6:46
I don't get you; when I do pd.DataFrame(my_dict) it gives my actual df properly.
– Pyd, Commented Jan 19, 2018 at 6:55

cs95 · Accepted Answer · 2018-01-19 07:36:03Z


Create a dummy column with a groupby + transform operation.

v = df.groupby(df['Class'].diff().le(0).cumsum())['Data'].transform(' '.join)

Or,

v = df['Data'].groupby(df['Class'].diff().le(0).cumsum()).transform(' '.join)

This dummy column becomes a factor when deciding what rows are to be dropped.

m = df.assign(Foo=v).duplicated(["Name", "Class", "Data", "Foo"])    
df[~m]

    Class                    Data Name Sr.No
0       1   sri is  a good player  Sri     1
1       2  sri is good in cricket  Sri      
2       3            sri went out  Sri      
3       1    Ram is a good player  Ram     2
4       2  sri is good in cricket  Ram      
5       3            Ram went out  Ram      
9       1   sri is  a good player  Sri     4
10      2  sri is good in cricket  Sri      
11      3            sri went out  Sri      
12      4           sri came back  Sri
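For anyone who wants to run this end to end, here is a minimal self-contained sketch. The DataFrame is rebuilt by hand from the table in the question (the original to_dict output is not shown here), assuming blank Sr.No cells are empty strings. The names block_id, block_text and out_df are illustrative only. It uses drop_duplicates directly, which should give the same rows as filtering with ~duplicated(...).

```python
import pandas as pd

# Data reconstructed from the table in the question; blank Sr.No cells
# are assumed to be empty strings.
df = pd.DataFrame({
    "Sr.No": [1, "", "", 2, "", "", 3, "", "", 4, "", "", ""],
    "Name": ["Sri"] * 3 + ["Ram"] * 3 + ["Sri"] * 7,
    "Class": [1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 4],
    "Data": [
        "sri is a good player", "sri is good in cricket", "sri went out",
        "Ram is a good player", "sri is good in cricket", "Ram went out",
        "sri is a good player", "sri is good in cricket", "sri went out",
        "sri is a good player", "sri is good in cricket", "sri went out",
        "sri came back",
    ],
})

# A new block starts wherever Class fails to increase.
block_id = df["Class"].diff().le(0).cumsum()

# Every row carries the joined text of its whole block.
block_text = df.groupby(block_id)["Data"].transform(" ".join)

# Drop rows that repeat an earlier row AND belong to an identical block,
# then discard the helper column.
out_df = (
    df.assign(Foo=block_text)
      .drop_duplicates(["Name", "Class", "Data", "Foo"])
      .drop(columns="Foo")
)
print(out_df.index.tolist())
```

Rows 6 to 8 are dropped because their whole block matches the first one; rows 9 to 11 survive because their block also contains "sri came back", so the joined text differs.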

Details

Form groups from the runs of increasing Class values (a new group starts wherever Class does not increase from the previous row) -

i = df['Class'].diff().le(0).cumsum()
i

0     0
1     0
2     0
3     1
4     1
5     1
6     2
7     2
8     2
9     3
10    3
11    3
12    3
Name: Class, dtype: int64
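To see why this chain yields group labels, it may help to look at each step separately. The sketch below uses a small hand-made Series (not the question's data) and illustrative variable names.

```python
import pandas as pd

# Three blocks: 1-2-3, 1-2-3, 1-2-3-4 (made-up values for illustration).
classes = pd.Series([1, 2, 3, 1, 2, 3, 1, 2, 3, 4], name="Class")

step = classes.diff()         # NaN for the first row, then the row-to-row change
is_start = step.le(0)         # True where Class did not increase; NaN compares as False
block_id = is_start.cumsum()  # running count of block starts = block label

print(pd.DataFrame({"Class": classes, "diff": step,
                    "start": is_start, "block": block_id}))
```

Note the assumption this relies on: each new block has to begin with a Class value that is less than or equal to the last value of the previous block. If one block ended at 2 and the next began at 3, the two would be merged into one group.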

Use this to group, and transform Data with a str.join operation -

v = df.groupby(i)['Data'].transform(' '.join)

The result is simply a column of joined strings, with every row holding the joined text of its whole group. Finally, assign the dummy column and call duplicated -

m = df.assign(Foo=v).duplicated(["Name", "Class", "Data", "Foo"]) 
m

0     False
1     False
2     False
3     False
4     False
5     False
6      True
7      True
8      True
9     False
10    False
11    False
12    False
dtype: bool
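As a possible variation (not part of the approach above), the groups could also be derived from Sr.No itself, since the question describes the blocks as "per Sr No". This sketch assumes blank Sr.No cells are empty strings, and uses a small made-up frame with illustrative names.

```python
import pandas as pd

# Small made-up frame: blocks 1 and 3 are identical, block 2 differs.
df = pd.DataFrame({
    "Sr.No": [1, "", 2, "", 3, ""],
    "Name": ["Sri", "Sri", "Ram", "Ram", "Sri", "Sri"],
    "Class": [1, 2, 1, 2, 1, 2],
    "Data": ["a", "b", "c", "b", "a", "b"],
})

# A block starts at every non-blank Sr.No (assumption: blanks are '').
block_id = df["Sr.No"].ne("").cumsum()

# Same dummy-column idea as above, only the grouping key changes.
v = df.groupby(block_id)["Data"].transform(" ".join)
m = df.assign(Foo=v).duplicated(["Name", "Class", "Data", "Foo"])

print(df[~m])
```

This does not depend on Class restarting at a lower value, but it does depend on every block having its Sr.No filled in on the first row.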

edited Jan 19, 2018 at 7:36

answered Jan 19, 2018 at 6:55

cs95


18 Comments

Pyd Over a year ago

Works very well, thank you coldspeed and @jezrael. coldspeed, may I know how many years you've been in data science/pandas?

cs95 Over a year ago

@pyd You're welcome. As for your question, I've been working with pandas for around 5 and a half months.

Pyd Over a year ago

What is "Data" in this line: df.groupby(df.Class.diff().le(0).cumsum()).Data.transform(' '.join)? Is it my column or a keyword?

cs95 Over a year ago

@pyd A column. I've edited for clarity. Also, you unmarked. Did it not work?

cs95 Over a year ago

@pyd Unless I'm missing something, you can also do this with: df['Data'].groupby(df['Class'].diff().le(0).cumsum()).transform(' '.join)
