Random sampling from a dataframe

Question

I want to generate a 2x6 dataframe that represents a rack. Half of its cells hold storage items and the other half hold retrieval items. I want to randomly choose half of the 12 items and label them as storage, with the rest as retrieval. How can I choose them randomly?

I tried random.sample, but it picks random columns. I want to pick individual items at random.

Providing text instead of images helps to get faster recommendations from the community.
– RF1991, Commented Mar 29, 2022 at 19:46

mozway · Accepted Answer · 2022-03-29 20:16:59Z

Assuming this input:

   0  1  2  3   4   5
0  0  1  2  3   4   5
1  6  7  8  9  10  11

You can craft a random numpy array to select/mask half of the values:

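import numpy as np

# half True, half False: one flag per cell of df
a = np.repeat([True, False], df.size//2)
np.random.shuffle(a)      # shuffle the flags in place
a = a.reshape(df.shape)   # same shape as df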

Then select your two groups:

df.mask(a)
     0   1    2    3   4     5
0  NaN NaN  NaN  3.0   4   NaN
1  6.0 NaN  8.0  NaN  10  11.0

df.where(a)
     0  1    2    3   4    5
0  0.0  1  2.0  NaN NaN  5.0
1  NaN  7  NaN  9.0 NaN  NaN
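For reference, the steps above can be put together as one self-contained sketch. It assumes the 2x6 frame simply holds 0..11, uses a seeded `numpy.random.default_rng` generator instead of `np.random.shuffle` so the split is repeatable, and the names `storage` and `retrieval` are only illustrative labels for the two groups.

```python
import numpy as np
import pandas as pd

# Rebuild the 2x6 example frame holding the values 0..11.
df = pd.DataFrame(np.arange(12).reshape(2, 6))

# Seeded generator so the split is reproducible (the seed is arbitrary).
rng = np.random.default_rng(0)

# Half True, half False, shuffled in place, then shaped like the frame.
a = np.repeat([True, False], df.size // 2)
rng.shuffle(a)
a = a.reshape(df.shape)

storage = df.where(a)    # keeps cells where a is True, NaN elsewhere
retrieval = df.mask(a)   # keeps cells where a is False, NaN elsewhere

print(storage)
print(retrieval)
```

Every cell ends up in exactly one of the two frames, so each frame has 6 non-NaN values.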

If you simply want 6 random elements, use numpy.random.choice:

np.random.choice(df.to_numpy().ravel(), 6, replace=False)

Example:

array([ 4,  5, 11,  7,  8,  3])
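If the positions of the sampled items matter as well as their values, one possible variation is to sample flat cell positions and convert them back to (row, column) pairs. This is a sketch, not part of the original answer: it assumes the same 0..11 frame and uses `Generator.choice` with `np.unravel_index`; the seed and variable names are arbitrary.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(12).reshape(2, 6))
rng = np.random.default_rng(0)  # arbitrary seed for repeatability

# Draw 6 distinct flat positions out of the 12 cells.
flat = rng.choice(df.size, size=6, replace=False)

# Convert the flat positions to (row, column) positions.
rows, cols = np.unravel_index(flat, df.shape)

# Values at those positions, in the order they were drawn.
values = df.to_numpy()[rows, cols]

for r, c, v in zip(rows, cols, values):
    print(f"row {r}, col {c}: {v}")
```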


6 Comments

GTek Over a year ago

df.mask and df.where are the exact solutions I've been looking for, thanks a lot. But for the dataframe operations how can I get rid of NaN values?

mozway Over a year ago

@GTek what would you want as output? You can pick any fill value in mask/where, for example df.mask(a, -999)
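A small sketch of the fill-value idea: the mask below is written by hand to match the example output in the answer (it is not random), and -999 is just the placeholder value from the comment.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(12).reshape(2, 6))

# Hand-written mask for illustration only (True marks the first group).
a = np.array([[True, True, True, False, False, True],
              [False, True, False, True, False, False]])

# Cells where a is True are replaced by -999 instead of NaN.
filled = df.mask(a, -999)
print(filled)
```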

GTek Over a year ago

I want to calculate the index and column distances for each pair of items in the df.mask(a) and df.where(a) dataframes. For example, 4 is at (0,4) in df.mask(a) and 1 is at (0,1) in df.where(a), so the index distance is 0-0=0 and the column distance is 4-1=3. I want to do this kind of calculation for each pair.

mozway Over a year ago

Ok, then you can stack, this will get rid of the NaNs and you'll get the row/col as MultiIndex
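A minimal sketch of the stacking step, using a hand-written mask that matches the example output in the answer. `stack` has historically dropped NaN by default; the explicit `dropna()` is added here as a safeguard in case that default differs in your pandas version.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(12).reshape(2, 6))

# Hand-written mask for illustration only.
a = np.array([[True, True, True, False, False, True],
              [False, True, False, True, False, False]])

# Long format: one entry per kept cell, indexed by (row, column).
stored = df.where(a).stack().dropna()

for (row, col), value in stored.items():
    print(row, col, value)
```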

GTek Over a year ago

When I stack, I get an error like 'Series' object has no attribute 'DataFrame' when I try an operation like:

for r in df.mask(a).index:
    for c in df.mask(a).columns:
        g = dfmask.at[r,c]
        #dict3 = { i : [r,c] }
        for r2 in df.where(a).index:
            for c2 in df.where(a).columns:
                t = dfwhere.at[r2,c2]
                #dict2 = { j : [int(r2),int(c2)] }
                if r>=r2:
                    VD.at[g,t]= int(r)-int(r2)
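One possible way to get the pairwise distances without nested loops is to stack both frames into long tables and cross-join them. This is only a sketch under assumptions: the mask is hand-written to match the example in the answer, the helper `to_long` and the column names (`row_g`, `col_t`, and so on) are hypothetical, and `how="cross"` requires pandas 1.2 or later. It does not reproduce the `r>=r2` filter from the loop above.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(12).reshape(2, 6))

# Hand-written mask for illustration only.
a = np.array([[True, True, True, False, False, True],
              [False, True, False, True, False, False]])

def to_long(frame):
    """Return one row per non-NaN cell with columns row, col, item."""
    s = frame.stack().dropna().astype(int)
    return s.rename_axis(["row", "col"]).rename("item").reset_index()

g = to_long(df.mask(a))    # items from df.mask(a)
t = to_long(df.where(a))   # items from df.where(a)

# Every (g, t) combination: 6 x 6 = 36 pairs.
pairs = g.merge(t, how="cross", suffixes=("_g", "_t"))
pairs["row_dist"] = pairs["row_g"] - pairs["row_t"]
pairs["col_dist"] = pairs["col_g"] - pairs["col_t"]

# The pair from the comment: item 4 at (0,4) and item 1 at (0,1).
print(pairs[(pairs["item_g"] == 4) & (pairs["item_t"] == 1)])
```

If a lookup table indexed by item pairs is needed, `pairs.pivot(index="item_g", columns="item_t", values="col_dist")` is one way to build it.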
