
Say I have a dataframe of the following form, where r1 to r6 are the row index labels:

       A1  |  A2 |  A3 
      -----------------
r1     x   |  0  |  t
r2     y   |  1  |  u
r3     z   |  1  |  v
r4     x   |  2  |  w
r5     z   |  2  |  v
r6     x   |  2  |  w

If I wanted to subset this dataframe so that column A2 has only unique values, I'd use df.drop_duplicates('A2'). However, that keeps only the first row for each unique value and deletes the rest. For this example, only r1, r2 and r4 would be in the subset.
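To make the default behaviour concrete, here is a minimal sketch on the example frame. The construction of df is a reconstruction of the table above, and the variable name first_only is just for illustration.

```python
import pandas as pd

# Reconstruction of the example table; r1..r6 are the index labels.
df = pd.DataFrame(
    {
        "A1": ["x", "y", "z", "x", "z", "x"],
        "A2": [0, 1, 1, 2, 2, 2],
        "A3": ["t", "u", "v", "w", "v", "w"],
    },
    index=["r1", "r2", "r3", "r4", "r5", "r6"],
)

# keep="first" is the default, so the first row seen for each A2 value survives.
first_only = df.drop_duplicates(subset="A2")
print(first_only.index.tolist())  # ['r1', 'r2', 'r4']
```

r1 survives as well because A2 == 0 occurs only once.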

What I want is for one of the rows sharing a duplicate value to be selected at random, rather than always the first one. So for this example, for A2 == 1, either r2 or r3 is selected at random, and for A2 == 2, any one of r4, r5 or r6 is selected at random. How would I go about implementing this?

1 Answer


Shuffle the DataFrame first and then drop the duplicates:

df.sample(frac=1).drop_duplicates(subset='A2')

If the original order of the rows is important, you can restore it with sort_index, as @cᴏʟᴅsᴘᴇᴇᴅ suggested:

df.sample(frac=1).drop_duplicates(subset='A2').sort_index()
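For anyone who wants to try it, here is a self-contained sketch on the frame from the question. The construction of df is a reconstruction of the table, the name picked is only for illustration, and random_state is an optional addition to make the run repeatable; leave it out to get a different pick each time.

```python
import pandas as pd

# Reconstruction of the example table; r1..r6 are the index labels.
df = pd.DataFrame(
    {
        "A1": ["x", "y", "z", "x", "z", "x"],
        "A2": [0, 1, 1, 2, 2, 2],
        "A3": ["t", "u", "v", "w", "v", "w"],
    },
    index=["r1", "r2", "r3", "r4", "r5", "r6"],
)

picked = (
    df.sample(frac=1, random_state=0)  # shuffle all rows (seed is optional)
      .drop_duplicates(subset="A2")    # keep the first row per A2 after shuffling
      .sort_index()                    # restore the original row order
)
print(picked)
```

Because the rows are shuffled before drop_duplicates runs, the "first" row for each A2 value is a random one from its group. The result always has one row per distinct A2 value: r1, then r2 or r3, then one of r4, r5 or r6.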

3 Comments

You'd typically want to sort the index after if you want to retain order.
@cᴏʟᴅsᴘᴇᴇᴅ Sure. Let me add that.
Wow that was embarrassingly simple. Thanks.
