Randomly selecting rows from dataframe column

Question

For a given dataframe column, I would like to randomly select roughly 60% of the rows and put them in a new column, put the remaining 40% in another column, multiply the 40% column by -1, and create a new column that merges the two back together, like so:

import numpy as np
import pandas as pd

# Starting data
dict0 = {'x1': [1,2,3,4,5,6]}
data = pd.DataFrame(dict0)

# Roughly 60% of the rows go to x2, the remaining 40% to x3
dict1 = {'x1': [1,2,3,4,5,6],'x2': [1,np.nan,3,np.nan,5,6],'x3': [np.nan,2,np.nan,4,np.nan,np.nan]}
data = pd.DataFrame(dict1)

# x3 multiplied by -1
dict2 = {'x1': [1,2,3,4,5,6],'x2': [1,np.nan,3,np.nan,5,6],'x3': [np.nan,-2,np.nan,-4,np.nan,np.nan]}
data = pd.DataFrame(dict2)

# x4 merges x2 and x3 back together
dict3 = {'x1': [1,2,3,4,5,6],'x2': [1,np.nan,3,np.nan,5,6],'x3': [np.nan,-2,np.nan,-4,np.nan,np.nan],'x4': [1,-2,3,-4,5,6]}
data = pd.DataFrame(dict3)

Quang Hoang · Accepted Answer · 2020-04-27 18:00:48Z

2

If you don't need the intermediate columns:

# +1 with probability 0.6, -1 with probability 0.4, drawn independently for each row
mask = np.random.choice([1,-1], p=[0.6,0.4], size=len(data))

data['x4'] = data['x1']*mask

Of course the intermediate columns are easy as well:

data['x2'] = data['x1'].where(mask==1)

data['x3'] = -data['x1'].mask(mask==1)  # negated, as the question asks
# or data['x3'] = -data['x1'].where(mask==-1)
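
Put together, the snippets above can be run as one script. This is only a sketch: the seeded `rng` from `np.random.default_rng` is an addition for repeatability and is not part of the original snippets, and `x3` is negated here to match the question's expected output.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame({'x1': [1, 2, 3, 4, 5, 6]})

# Assumption: a seeded Generator, used only so repeated runs give the same draw.
rng = np.random.default_rng(0)

# Each row independently gets +1 with probability 0.6 and -1 with probability 0.4.
mask = rng.choice([1, -1], p=[0.6, 0.4], size=len(data))

data['x2'] = data['x1'].where(mask == 1)     # kept rows, NaN elsewhere
data['x3'] = -data['x1'].where(mask == -1)   # negated rows, NaN elsewhere
data['x4'] = data['x1'] * mask               # merged result

print(data)
```

Because every row is drawn independently, the number of negated rows changes from seed to seed.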



4 Comments

piRSquared Over a year ago

Great answer (-:

Tartaglia Over a year ago

Wow, great answer indeed. So much more elegant than what I had in mind (multiple nested loops etc.) Thank you so much!!!

Tartaglia Over a year ago

I have added an additional specification, that the 60/40 ratio be maintained on a daily basis (after adding a date column). I created a new post here: stackoverflow.com/questions/61467220/…

amain Over a year ago

This solution uses probabilities and therefore doesn't guarantee a 60/40 ratio; see my answer for details.

amain · Accepted Answer · 2020-04-30 10:32:36Z

1

While the first answer proposes an elegant solution, it stretches the stated requirement to select roughly 60% of the rows, because it doesn't guarantee a 60/40 distribution. Each row is drawn independently, so the sampled values could by chance be all 1 or all -1, in effect selecting all or none of the rows rather than roughly 60%.

The chance of this occurring decreases with larger dataframes, but it is never zero, and the run-to-run variation is immediately visible when you try it with the provided example data.
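
To put rough numbers on this for the six-row example, here is a small calculation. It assumes the per-row probabilities 0.6 and 0.4 from the first answer and independent draws; the variable names are just for illustration.

```python
from math import comb

n = 6          # rows in the example data
p_neg = 0.4    # per-row chance of drawing -1

p_all_kept = (1 - p_neg) ** n    # every row drawn as +1
p_all_negated = p_neg ** n       # every row drawn as -1
# chance of negating exactly 2 of the 6 rows
p_two_negated = comb(n, 2) * p_neg**2 * (1 - p_neg)**4

print(f"all kept:       {p_all_kept:.4f}")
print(f"all negated:    {p_all_negated:.4f}")
print(f"exactly 2 of 6: {p_two_negated:.4f}")
```

That works out to about 4.7% for no rows negated, about 0.4% for all rows negated, and only about 31% for exactly two negated rows.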

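If this matters to you, take a look at this code. It negates a fixed number of rows, int(0.4 * len(data)), so the split no longer depends on chance (for the six example rows, 2 are negated and 4 are kept):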

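# Draw int(0.4 * len(data)) distinct row positions to negate
indices = np.random.choice(len(data), size=int(0.4 * len(data)), replace=False)
# Compare positions rather than index labels, so this also works without a default RangeIndex
data['x4'] = np.where(np.isin(np.arange(len(data)), indices), -data['x1'], data['x1'])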
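
As a quick sanity check, this sketch repeats the draw with 100 different seeds and records how many rows end up negated. The seeded generators and the boolean `flip` array are my own scaffolding for the check, not part of the code above.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame({'x1': [1, 2, 3, 4, 5, 6]})
n_neg = int(0.4 * len(data))   # int() truncates, so 2 rows for 6 rows of data

counts = []
for seed in range(100):
    rng = np.random.default_rng(seed)   # assumption: seeded only for repeatability
    positions = rng.choice(len(data), size=n_neg, replace=False)
    flip = np.zeros(len(data), dtype=bool)
    flip[positions] = True              # mark the sampled row positions
    x4 = np.where(flip, -data['x1'], data['x1'])
    counts.append(int((x4 < 0).sum()))

print(set(counts))
```

Every run negates the same number of rows. Note that `int()` rounds down, so the realized share is 2 of 6 here rather than exactly 40%.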

Update: One answer to your follow-up question proposes df.sample. Indeed, it lets you express the above much more elegantly:

indices = data.sample(frac=0.4).index
data['x4'] = np.where(data.index.isin(indices), -data['x1'], data['x1'])
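
A small sketch of the `sample` variant on a frame with string index labels, to show that it matches on labels rather than positions. The letter index and `random_state=0` are assumptions added for illustration, and unique index labels are assumed.

```python
import numpy as np
import pandas as pd

# Hypothetical non-default index, to show that label matching is what matters here.
data = pd.DataFrame({'x1': [1, 2, 3, 4, 5, 6]}, index=list('abcdef'))

# random_state makes the draw repeatable; drop it for a fresh sample each run.
picked = data.sample(frac=0.4, random_state=0).index

# Keep x1 where the label was not picked, negate it where it was.
data['x4'] = data['x1'].where(~data.index.isin(picked), -data['x1'])

print(data)
```

For six rows, `frac=0.4` samples two rows, so two values in `x4` are negated.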

edited Apr 30, 2020 at 10:32

answered Apr 27, 2020 at 19:01


2 Comments

Tartaglia Over a year ago

That is actually very important to me and I highly appreciate you pointing this out.

Tartaglia Over a year ago

Related to your point, I have added an additional specification: the 60/40 ratio should be maintained on a daily basis (after adding a date column). I created a new post here: stackoverflow.com/questions/61467220/…
