Okay this is somewhat tricky. I have a DataFrame of people and I want to randomly select 27% of them. I want to create a new Boolean column in that DataFrame that shows if that person was randomly selected.
Anyone have any idea how to do this?
The built-in DataFrame.sample method takes a frac argument, which sets the fraction of rows to include in the sample.
If your DataFrame of people is people_df:
percent_sampled = 27
sample_df = people_df.sample(frac = percent_sampled/100)
people_df['is_selected'] = people_df.index.isin(sample_df.index)
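For anyone who wants to run this end to end, here is a minimal sketch of the same idea. The contents of people_df are made up for illustration, and random_state is only there so the same rows come back on every run. It assumes the index labels are unique, since isin matches on labels.

```python
import pandas as pd

# Hypothetical stand-in for the asker's data: 100 people, default index 0..99.
people_df = pd.DataFrame({"name": [f"person_{i}" for i in range(100)]})

percent_sampled = 27
# random_state fixes the seed so the selection is reproducible.
sample_df = people_df.sample(frac=percent_sampled / 100, random_state=0)

# True for rows whose index label appears in the sample, False otherwise.
people_df["is_selected"] = people_df.index.isin(sample_df.index)

print(people_df["is_selected"].sum())  # number of selected rows
```

If the index has duplicate labels, calling people_df.reset_index(drop=True) first should avoid marking extra rows.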
That is a warning pandas incessantly throws, for very little good reason. Please read the discussion here: stackoverflow.com/questions/20625582/… I would recommend simply suppressing the warning by using pd.options.mode.chained_assignment = None after you import pandas.

Defining a dataframe with the numbers 0 to 99 in random order in column 0:
import random
import pandas as pd
import numpy as np
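# Shuffling a[0] in place with random.shuffle is unreliable on a pandas Series,
# so build the column already shuffled: the values 0..99 in random order.
a = pd.DataFrame(np.random.permutation(100))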
Using random.sample to choose 27 random numbers from the list, WITHOUT repetition (replace 27 with int(0.27 * len(a[0])) if you want to define this as a percentage):
choices = random.sample(list(a[0]),27)
Using np.where to assign boolean values to a new column in the dataframe:
a['Bool'] = np.where(a[0].isin(choices), True, False)  # a[0].isin(choices) alone gives the same booleans
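One caveat with matching on values: if column 0 contains repeated values, isin will mark every row that shares a chosen value, so more than 27 rows can end up True. A possible workaround, sketched below, is to draw row positions instead of values. The seed, the variable names, and the use of round instead of int are my own choices here, not part of the answer above.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # seeded only to make the sketch reproducible

# Column 0 deliberately contains repeated values (integers from 0 to 9).
a = pd.DataFrame(rng.integers(0, 10, size=100))

# round() rather than int() so floating-point error cannot drop a row.
n_selected = round(0.27 * len(a))

# Draw distinct row positions, then turn them into a boolean mask.
positions = rng.choice(len(a), size=n_selected, replace=False)
mask = np.zeros(len(a), dtype=bool)
mask[positions] = True
a["Bool"] = mask
```

Because the mask is built by position, the number of True rows is n_selected regardless of what the column holds.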
df.sample(frac=0.27) or df['selected'] = np.random.choice([0,1], size=len(df), p=[0.73,0.27])?
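Note that those two suggestions are not quite equivalent: sample(frac=0.27) picks a fixed share of the rows, while drawing each row independently with probability 0.27 only gives roughly 27% on average. A rough sketch of the difference, with a made-up df and column names, and with NumPy's Generator API standing in for np.random.choice:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # seeded only for reproducibility
df = pd.DataFrame({"person_id": range(1000)})  # hypothetical data

# Option 1: a fixed-size sample, 27% of the rows.
sampled_index = df.sample(frac=0.27, random_state=0).index
df["selected_exact"] = df.index.isin(sampled_index)

# Option 2: each row is selected independently with probability 0.27,
# so the number of selected rows varies from run to run.
df["selected_indep"] = rng.choice([False, True], size=len(df), p=[0.73, 0.27])

print(df["selected_exact"].sum(), df["selected_indep"].sum())
```

Which one is appropriate depends on whether the question needs exactly 27% of people or just a 27% chance per person.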