Randomly select unique row from dataframe in Pandas

Question

Say I have a dataframe of the following form, where rn is the row index:

       A1  |  A2 |  A3 
      -----------------
r1     x   |  0  |  t
r2     y   |  1  |  u
r3     z   |  1  |  v
r4     x   |  2  |  w
r5     z   |  2  |  v
r6     x   |  2  |  w

If I wanted to subset this dataframe so that column A2 has only unique values, I'd use df.drop_duplicates('A2'). However, that keeps only the first row for each unique value and deletes the rest. For this example, only r1, r2 and r4 will be in the subset.
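To make the default behaviour concrete, here is a minimal sketch that rebuilds the example frame and applies drop_duplicates. The variable names (df, first_rows) are illustrative assumptions, not part of the original question.

```python
import pandas as pd

# Rebuild the example frame from the question (r1..r6 as the row index).
df = pd.DataFrame(
    {
        "A1": ["x", "y", "z", "x", "z", "x"],
        "A2": [0, 1, 1, 2, 2, 2],
        "A3": ["t", "u", "v", "w", "v", "w"],
    },
    index=["r1", "r2", "r3", "r4", "r5", "r6"],
)

# keep="first" is the default, so the first row seen for each A2 value survives.
first_rows = df.drop_duplicates(subset="A2")

print(first_rows.index.tolist())  # ['r1', 'r2', 'r4']
```

The result is deterministic: for a given row order, the same rows are kept on every run.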

What I want is for one of the rows sharing a duplicate value to be selected at random, rather than always the first. So for this example, for A2 == 1, either r2 or r3 is selected randomly, and for A2 == 2, any one of r4, r5 or r6 is selected randomly. How would I go about implementing this?

score 9 · Accepted Answer · 2017-11-13 19:29:23Z


Shuffle the DataFrame first and then drop the duplicates:

df.sample(frac=1).drop_duplicates(subset='A2')

If the order of the rows is important, you can restore it with sort_index, as @cᴏʟᴅsᴘᴇᴇᴅ suggested:

df.sample(frac=1).drop_duplicates(subset='A2').sort_index()
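As a runnable sketch of the shuffle-then-drop approach on the example frame from the question: the random_state argument is an assumption added here only to make the run repeatable, and the names shuffled and picked are illustrative.

```python
import pandas as pd

# Rebuild the example frame from the question (r1..r6 as the row index).
df = pd.DataFrame(
    {
        "A1": ["x", "y", "z", "x", "z", "x"],
        "A2": [0, 1, 1, 2, 2, 2],
        "A3": ["t", "u", "v", "w", "v", "w"],
    },
    index=["r1", "r2", "r3", "r4", "r5", "r6"],
)

# frac=1 returns every row in random order.
# random_state=0 is an assumption for reproducibility; omit it for a fresh draw each run.
shuffled = df.sample(frac=1, random_state=0)

# After shuffling, the "first" row for each A2 value is a random one.
# sort_index() then puts the surviving rows back in their original order.
picked = shuffled.drop_duplicates(subset="A2").sort_index()

print(picked)
```

Each A2 value appears exactly once in picked. Which of r2/r3 and r4/r5/r6 survives depends on the shuffle.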

edited Nov 13, 2017 at 19:29

answered Nov 13, 2017 at 19:25

user2285236


3 Comments

cs95 Over a year ago

You'd typically want to sort the index after if you want to retain order.

user2285236 Over a year ago

@cᴏʟᴅsᴘᴇᴇᴅ Sure. Let me add that.

HMK Over a year ago

Wow that was embarrassingly simple. Thanks.
