
For a given dataframe column, I would like to randomly select roughly 60% of the values and put them in a new column, put the remaining 40% in another column, multiply the 40% column by -1, and create a new column that merges the two back together, like so:

import pandas as pd

# Starting point
dict0 = {'x1': [1,2,3,4,5,6]}
data = pd.DataFrame(dict0)

# Step 1: roughly 60% of x1 goes to x2, the remaining 40% to x3
dict1 = {'x1': [1,2,3,4,5,6],'x2': [1,'nan',3,'nan',5,6],'x3': ['nan',2,'nan',4,'nan','nan']}
data = pd.DataFrame(dict1)

# Step 2: multiply x3 by -1
dict2 = {'x1': [1,2,3,4,5,6],'x2': [1,'nan',3,'nan',5,6],'x3': ['nan',-2,'nan',-4,'nan','nan']}
data = pd.DataFrame(dict2)

# Step 3: merge x2 and x3 back together into x4
dict3 = {'x1': [1,2,3,4,5,6],'x2': [1,'nan',3,'nan',5,6],'x3': ['nan',-2,'nan',-4,'nan','nan'],'x4': [1,-2,3,-4,5,6]}
data = pd.DataFrame(dict3)

2 Answers


If you don't need the intermediate columns:

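import numpy as np

# Draw +1 with probability 0.6 and -1 with probability 0.4, independently for each row
mask = np.random.choice([1,-1], p=[0.6,0.4], size=len(data))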

data['x4'] = data['x1']*mask

Of course the intermediate columns are easy as well:

data['x2'] = data['x1'].where(mask==1)

data['x3'] = data['x1'].mask(mask==1)
# or data['x3'] = data['x1'].where(mask==-1)
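
Put together, the snippets above can be run end to end as in the following sketch. It assumes the six-row example frame from the question; the seed and the use of `np.random.default_rng` are choices made here for reproducibility, not part of the original answer. Unlike the `x3` line above, this sketch also negates `x3`, as the question asks, and rebuilds `x4` from the two intermediate columns.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # arbitrary seed, only for reproducibility
data = pd.DataFrame({'x1': [1, 2, 3, 4, 5, 6]})

# +1 with probability 0.6, -1 with probability 0.4, drawn per row
mask = rng.choice([1, -1], p=[0.6, 0.4], size=len(data))

data['x2'] = data['x1'].where(mask == 1)    # kept values, NaN elsewhere
data['x3'] = -data['x1'].where(mask == -1)  # negated values, NaN elsewhere
data['x4'] = data['x2'].fillna(data['x3'])  # merge the two columns

print(data)
```

Because `x2` and `x3` contain NaN, they (and `x4` built this way) come out as float columns; `data['x1'] * mask` keeps the integer dtype.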

4 Comments

Great answer (-:
Wow, great answer indeed. So much more elegant than what I had in mind (multiple nested loops etc.) Thank you so much!!!
I have added an additional specification, that the 60/40 ratio be maintained on a daily basis (after adding a date column). I created a new post here: stackoverflow.com/questions/61467220/…
This solution uses probabilities and therefore doesn't guarantee a 60/40 ratio, see my answer above for details.

While the first answer proposes an elegant solution, it stretches the stated requirement to select roughly 60% of the rows. The problem is that it doesn't guarantee a 60/40 distribution. Because each row's sign is drawn independently, the selected samples could by chance be all 1 or all -1, in effect selecting all or no rows, not roughly 60%.

The chance of this occurring decreases with larger dataframes, but it is never zero, and it shows up quickly when you try it repeatedly with the provided example data.
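
To put rough numbers on this for the six-row example, here is a small calculation. It assumes each row's sign is drawn independently with probabilities 0.6 and 0.4, as in the first answer; the variable names are made up for illustration.

```python
from math import comb

n_rows = 6
p_keep, p_flip = 0.6, 0.4

# Probability that every row gets the same sign (all kept or all flipped)
p_degenerate = p_keep ** n_rows + p_flip ** n_rows

# Probability of hitting exactly 2 flipped rows out of 6 (binomial)
p_exact_split = comb(n_rows, 2) * p_flip ** 2 * p_keep ** 4

print(f"{p_degenerate:.4f}")   # 0.0508
print(f"{p_exact_split:.4f}")  # 0.3110
```

So under these assumptions about 1 run in 20 selects all or none of the six rows, and fewer than a third of runs land on exactly 4 kept and 2 flipped.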

If this is relevant to you, take a look at this code, which does guarantee a 60/40 ratio of rows.

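# Pick exactly 40% of the row positions (rounded down), without replacement
indices = np.random.choice(len(data), size=int(0.4 * len(data)), replace=False)
# isin compares index labels, so this assumes the default 0..n-1 index
data['x4'] = np.where(data.index.isin(indices), -1 * data['x1'], data['x1'])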

Update: One answer to your follow-up question proposes df.sample. Indeed, it lets you express the above much more elegantly:

indices = data.sample(frac=0.4).index
data['x4'] = np.where(data.index.isin(indices), -data['x1'], data['x1'])
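
As a quick check of the fixed-count behaviour, the sketch below runs the `sample` version on the six-row example. The letter index and `random_state=0` are assumptions added here, to show that label-based selection works with a non-default index and to make the run repeatable.

```python
import numpy as np
import pandas as pd

# Same values as the question, but with a non-default (label) index
data = pd.DataFrame({'x1': [1, 2, 3, 4, 5, 6]}, index=list('abcdef'))

# sample returns rows, so .index gives the labels of the chosen 40%
indices = data.sample(frac=0.4, random_state=0).index
data['x4'] = np.where(data.index.isin(indices), -data['x1'], data['x1'])

n_flipped = int((data['x4'] < 0).sum())
print(n_flipped)  # 2
```

With six rows, `frac=0.4` selects two rows on every run; only which two rows are chosen varies with the seed.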

2 Comments

That is actually very important to me and I highly appreciate you pointing this out.
Related to your point, I have added an additional specification: that the 60/40 ratio be maintained on a daily basis (after adding a date column). I created a new post here: stackoverflow.com/questions/61467220/…
