
I have a DataFrame, df:

    Sr.No   Name    Class   Data
0   1       Sri     1       sri is a good player
1   ''      Sri     2       sri is good in cricket
2   ''      Sri     3       sri went out
3   2       Ram     1       Ram is a good player
4   ''      Ram     2       sri is good in cricket
5   ''      Ram     3       Ram went out
6   3       Sri     1       sri is a good player
7   ''      Sri     2       sri is good in cricket
8   ''      Sri     3       sri went out
9   4       Sri     1       sri is a good player
10  ''      Sri     2       sri is good in cricket
11  ''      Sri     3       sri went out
12  ''      Sri     4       sri came back

I am trying to drop duplicates based on ["Name", "Class", "Data"], but per block rather than per row: the rows belonging to one Sr.No should be dropped only if all of their sentences duplicate those of an earlier Sr.No.

My expected output, out_df, is:

    Sr.No   Name    Class   Data
0   1       Sri     1       sri is a good player
1           Sri     2       sri is good in cricket
2           Sri     3       sri went out
3   2       Ram     1       Ram is a good player
4           Ram     2       sri is good in cricket
5           Ram     3       Ram went out
9   4       Sri     1       sri is a good player
10          Sri     2       sri is good in cricket
11          Sri     3       sri went out
12          Sri     4       sri came back
  • Can you please print df.to_dict() and paste the output in your question? Your dataframes are so difficult to copy. Commented Jan 19, 2018 at 6:26
  • Your to_dict output is different from what your posted dataframe is. Please do make it consistent so your expected output is clear ;) Commented Jan 19, 2018 at 6:38
  • @cᴏʟᴅsᴘᴇᴇᴅ , I edited my question with the proper df.to_dict() pls check Commented Jan 19, 2018 at 6:46
  • I don't follow; when I do pd.DataFrame(my_dict), it gives my actual df properly. Commented Jan 19, 2018 at 6:55
  • Nevermind, I misunderstood the question initially. Commented Jan 19, 2018 at 6:55

1 Answer

Create a dummy column with a groupby + transform operation.

v = df.groupby(df['Class'].diff().le(0).cumsum())['Data'].transform(' '.join)

Or,

v = df['Data'].groupby(df['Class'].diff().le(0).cumsum()).transform(' '.join) 

This dummy column then acts as an extra key when deciding which rows to drop: a row counts as a duplicate only if the joined text of its whole group has also been seen before.

m = df.assign(Foo=v).duplicated(["Name", "Class", "Data", "Foo"])    
df[~m]

    Class                    Data Name Sr.No
0       1   sri is  a good player  Sri     1
1       2  sri is good in cricket  Sri      
2       3            sri went out  Sri      
3       1    Ram is a good player  Ram     2
4       2  sri is good in cricket  Ram      
5       3            Ram went out  Ram      
9       1   sri is  a good player  Sri     4
10      2  sri is good in cricket  Sri      
11      3            sri went out  Sri      
12      4           sri came back  Sri      

Details

Form groups from each run of increasing Class values; a new group starts wherever Class fails to increase -

i = df['Class'].diff().le(0).cumsum()
i

0     0
1     0
2     0
3     1
4     1
5     1
6     2
7     2
8     2
9     3
10    3
11    3
12    3
Name: Class, dtype: int64
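
To see why this chain produces group labels, it may help to look at each intermediate step. The sketch below uses only the Class values from the question; the names `cls`, `step` and `starts_new_group` are illustrative and do not appear in the original answer.

```python
import pandas as pd

# Class values copied from the question's DataFrame
cls = pd.Series([1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 4], name="Class")

step = cls.diff()              # NaN for the first row, then the change from the previous row
starts_new_group = step.le(0)  # True where Class did not increase; the NaN compares as False
i = starts_new_group.cumsum()  # running count of True values, used as the group label

# Show the steps side by side
print(pd.DataFrame({"Class": cls, "diff": step,
                    "new_group": starts_new_group, "group": i}))
```

Because the first `diff` value is NaN and `NaN <= 0` evaluates to False, the first row always lands in group 0.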

Use this to group, and transform Data with a str.join operation -

v = df.groupby(i)['Data'].transform(' '.join)

The result is simply a column of joined strings, with each group's full text repeated on every row of that group. Finally, assign the dummy column and call duplicated -

m = df.assign(Foo=v).duplicated(["Name", "Class", "Data", "Foo"]) 
m

0     False
1     False
2     False
3     False
4     False
5     False
6      True
7      True
8      True
9     False
10    False
11    False
12    False
dtype: bool
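
For convenience, the steps above can be put together into one runnable sketch. The DataFrame is rebuilt by hand from the table in the question (assuming single spaces in the sentences and empty strings for the blank Sr.No cells), and `drop_duplicate_blocks` is a hypothetical helper name, not something from pandas.

```python
import pandas as pd

def drop_duplicate_blocks(df):
    """Hypothetical helper: drop every block of rows whose full text repeats an earlier block."""
    # A new block starts wherever Class fails to increase
    group_id = df["Class"].diff().le(0).cumsum()
    # Join all sentences of a block and broadcast the result to each of its rows
    joined = df.groupby(group_id)["Data"].transform(" ".join)
    # A row is a duplicate only if its own values and its block's joined text were seen before
    is_dup = df.assign(Foo=joined).duplicated(["Name", "Class", "Data", "Foo"])
    return df[~is_dup]

# Sample data reconstructed from the question (an assumption about the exact values)
block = ["sri is a good player", "sri is good in cricket", "sri went out"]
df = pd.DataFrame({
    "Sr.No": [1, "", "", 2, "", "", 3, "", "", 4, "", "", ""],
    "Name": ["Sri"] * 3 + ["Ram"] * 3 + ["Sri"] * 7,
    "Class": [1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 4],
    "Data": block
            + ["Ram is a good player", "sri is good in cricket", "Ram went out"]
            + block
            + block + ["sri came back"],
})

out_df = drop_duplicate_blocks(df)
print(out_df)
```

The block for Sr.No 4 survives even though its first three rows repeat Sr.No 1, because the extra "sri came back" sentence makes its joined text different.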

18 Comments

Works very well, thank you coldspeed and @jezrael. coldspeed, may I know how many years you've been working in data science/pandas?
@pyd You're welcome. As for your question, I've been working with pandas for around 5 and a half months.
What is "Data" in this line: df.groupby(df.Class.diff().le(0).cumsum()).Data.transform(' '.join)? Is it my column or a keyword?
@pyd A column. I've edited for clarity. Also, you unmarked. Did it not work?
@pyd Unless I'm missing something, you can also do this with: df['Data'].groupby(df['Class'].diff().le(0).cumsum()).transform(' '.join)