I'm trying to write a small piece of code that drops duplicate rows per unique value of a column. What I'm trying to accomplish is getting all the unique values from user_id and, for each of those values, dropping that user's duplicate rows with drop_duplicates while keeping the last occurrence. Keep in mind the column I want to drop duplicates on is date_time.

code:

for i in recommender_train_df['user_id'].unique():
    recommender_train_df.loc[recommender_train_df['user_id'] == i].drop_duplicates(subset='date_time', keep="last", inplace=True)

The problem with this code is that it literally does nothing: I tried it over and over, and the dataframe stays unchanged.

Quick note: I have 100k different (unique) user_id values, so I need a solution that runs as fast as possible.

2 Comments
  • Your requirement is weird; what's the difference from just dropping duplicates whilst keeping the last occurrence? Commented Apr 10, 2022 at 6:28
  • @Ynjxsjmh There might be multiple users who share the same date, and dropping those rows would remove the wrong users, so I want to drop based on user_id, not on the whole dataframe. Say user 1 and user 2 both have the date 2018-04-03: dropping duplicates normally would drop either user 1's or user 2's row, depending on which occurred last. Going through each user_id separately only drops duplicate dates within that user's rows. Hope this clarifies my idea (see the sketch just after these comments). Commented Apr 10, 2022 at 6:37
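
To make the distinction concrete, here is a minimal sketch with made-up data (the column names come from the question; the values are invented for illustration):

import pandas as pd

df = pd.DataFrame({
    'user_id':   [1, 1, 2, 2],
    'date_time': ['2018-04-03', '2018-04-03', '2018-04-03', '2018-04-05'],
})

# Deduplicating on date_time alone drops rows across users: user 2's
# 2018-04-03 row survives, but user 1 disappears entirely.
print(df.drop_duplicates(subset='date_time', keep='last'))

# Deduplicating on the (user_id, date_time) pair only drops a row when the
# SAME user repeats a date, which matches the behaviour described above.
print(df.drop_duplicates(subset=['user_id', 'date_time'], keep='last'))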

1 Answer


The problem is that recommender_train_df.loc[...] with a boolean mask returns a copy of the original dataframe, so the inplace drop_duplicates modifies that copy and your change never reaches the original dataframe. See python - What rules does Pandas use to generate a view vs a copy? - Stack Overflow for more detail.
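
A minimal sketch of the pitfall, again with invented data; the inplace drop runs against the temporary copy that .loc returns, so the original frame never changes:

import pandas as pd

df = pd.DataFrame({'user_id': [1, 1],
                   'date_time': ['2018-04-03', '2018-04-03']})

# The boolean .loc selection materialises a new DataFrame, so inplace=True
# removes the duplicate from that copy, not from df itself.
df.loc[df['user_id'] == 1].drop_duplicates(subset='date_time', keep='last', inplace=True)

print(len(df))  # still 2: df was never modified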

If you want to drop duplicates within part of the dataframe, you can collect the indices of the duplicated rows and drop them from the original:

for i in recommender_train_df['user_id'].unique():
    # Flag this user's duplicated date_time rows (all but the last occurrence)
    mask = recommender_train_df.loc[recommender_train_df['user_id'] == i].duplicated(subset='date_time', keep="last")
    # Indices of the rows to remove
    indices = mask[mask].index
    # Drop them from the original dataframe
    recommender_train_df.drop(indices, inplace=True)

3 Comments

I applied the solution, but my data is quite big (5 million rows); it's been 30 minutes and it's still running. Once it finishes I will let you know. Thanks for the help.
Considering I have 100k different user_id values, this method would take at least 11 hours to complete (at roughly 0.4 seconds per iteration), so I need a faster method (a vectorized sketch follows these comments).
@Al-MeqdadJabi I have no idea.
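
For what it's worth, a vectorized alternative (not part of the original thread; sketched here on the assumption that the goal is per-user deduplication on date_time, as described in the question) computes the duplicate mask over the whole frame in one pass, avoiding the 100k-iteration Python loop:

# duplicated() flags every row that repeats an earlier (user_id, date_time)
# pair; keep='last' retains each pair's final occurrence, matching the loop.
dup_mask = recommender_train_df.duplicated(subset=['user_id', 'date_time'], keep='last')
recommender_train_df = recommender_train_df[~dup_mask]

This is equivalent to recommender_train_df.drop_duplicates(subset=['user_id', 'date_time'], keep='last') and should take seconds rather than hours on 5 million rows.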
