I'm trying to write a small piece of code that drops duplicate rows per unique value of a column. What I'm trying to accomplish is getting all the unique values from user_id and, for each of those values, dropping that user's duplicate rows with drop_duplicates while keeping the last occurrence. Keep in mind the column I want to drop duplicates on is date_time.

code:

for i in recommender_train_df['user_id'].unique():
    recommender_train_df.loc[recommender_train_df['user_id'] == i].drop_duplicates(subset='date_time', keep="last", inplace=True)

The problem with this code is that it literally does nothing: I tried it over and over, and the dataframe stays unchanged.

Quick note: I have 100k different (unique) user_id values, so I need a solution that runs as fast as possible.

2 Comments
  • Your requirement is weird; what's the difference from just dropping duplicates whilst keeping the last occurrence? Commented Apr 10, 2022 at 6:28
  • @Ynjxsjmh There might be multiple users who share the same date, and dropping those rows would remove the wrong users, so I want to drop based on user_id, not on the whole dataframe. Say user 1 and user 2 both have the date 2018-04-03: dropping duplicates normally would drop either user 1's or user 2's row, depending on which occurred last. Going through each user_id separately only drops duplicate dates within that user's rows. Hope this clarifies my idea (see the sketch just after these comments). Commented Apr 10, 2022 at 6:37
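
To make the distinction concrete, here is a minimal sketch with made-up data (the column names come from the question; the values are invented for illustration):

import pandas as pd

df = pd.DataFrame({
    'user_id':   [1, 1, 2, 2],
    'date_time': ['2018-04-03', '2018-04-03', '2018-04-03', '2018-04-05'],
})

# Deduplicating on date_time alone drops rows across users: user 2's
# 2018-04-03 row survives, but user 1 disappears entirely.
print(df.drop_duplicates(subset='date_time', keep='last'))

# Deduplicating on the (user_id, date_time) pair only drops a row when the
# SAME user repeats a date, which matches the behaviour described above.
print(df.drop_duplicates(subset=['user_id', 'date_time'], keep='last'))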

1 Answer


The problem is that recommender_train_df.loc[...] with a boolean mask returns a copy of the original dataframe, so the inplace drop_duplicates modifies that copy and your change never reaches the original dataframe. See python - What rules does Pandas use to generate a view vs a copy? - Stack Overflow for more detail.
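
A minimal sketch of the pitfall, again with invented data; the inplace drop runs against the temporary copy that .loc returns, so the original frame never changes:

import pandas as pd

df = pd.DataFrame({'user_id': [1, 1],
                   'date_time': ['2018-04-03', '2018-04-03']})

# The boolean .loc selection materialises a new DataFrame, so inplace=True
# removes the duplicate from that copy, not from df itself.
df.loc[df['user_id'] == 1].drop_duplicates(subset='date_time', keep='last', inplace=True)

print(len(df))  # still 2: df was never modified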

If you want to drop duplicates within part of the dataframe, you can collect the indices of the duplicated rows and drop them from the original:

for i in recommender_train_df['user_id'].unique():
    # Flag this user's duplicated date_time rows (all but the last occurrence)
    mask = recommender_train_df.loc[recommender_train_df['user_id'] == i].duplicated(subset='date_time', keep="last")
    # Indices of the rows to remove
    indices = mask[mask].index
    # Drop them from the original dataframe
    recommender_train_df.drop(indices, inplace=True)

3 Comments

I applied the solution, but my data is quite big (5 million rows); it's been 30 minutes and it's still running. Once it finishes I will let you know. Thanks for the help.
Considering I have 100k different user_id values, this method would take at least 11 hours to complete (at roughly 0.4 seconds per iteration), so I need a faster method (a vectorized sketch follows these comments).
@Al-MeqdadJabi I have no idea.
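
For what it's worth, a vectorized alternative (not part of the original thread; sketched here on the assumption that the goal is per-user deduplication on date_time, as described in the question) computes the duplicate mask over the whole frame in one pass, avoiding the 100k-iteration Python loop:

# duplicated() flags every row that repeats an earlier (user_id, date_time)
# pair; keep='last' retains each pair's final occurrence, matching the loop.
dup_mask = recommender_train_df.duplicated(subset=['user_id', 'date_time'], keep='last')
recommender_train_df = recommender_train_df[~dup_mask]

This is equivalent to recommender_train_df.drop_duplicates(subset=['user_id', 'date_time'], keep='last') and should take seconds rather than hours on 5 million rows.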
