I am trying to shuffle each column of a pandas DataFrame separately. Here are the functions I wrote:
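import numpy as np
import pandas as pd

def shuffle_x(x):
    # Shuffle a copy so the input array is left untouched.
    x = x.copy()
    np.random.shuffle(x)
    return x

def shuffle_table(df):
    # raw=True passes each column to shuffle_x as a NumPy array.
    df_shuffled = df.apply(shuffle_x, raw=True, axis=0)
    return df_shuffled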
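As a quick sanity check of what these functions are meant to do, here is a minimal sketch on a small frame. The frame `small` and its column names are made up for illustration only, and the two functions are repeated so the snippet runs on its own.

```python
import numpy as np
import pandas as pd

def shuffle_x(x):
    x = x.copy()
    np.random.shuffle(x)
    return x

def shuffle_table(df):
    return df.apply(shuffle_x, raw=True, axis=0)

# Hypothetical small frame, used only to check behavior (not the real data).
np.random.seed(0)
small = pd.DataFrame(np.arange(20).reshape(5, 4), columns=list("abcd"))
shuffled = shuffle_table(small)

# Each column should keep its own values, only in a different order.
same_values = all(
    sorted(shuffled[c]) == sorted(small[c]) for c in small.columns
)
print(same_values)
```

Each column is permuted independently, so rows of the result no longer line up with rows of the input.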
Now I am testing on a pandas DataFrame df with 30000 rows and 1000 columns. Calling shuffle_table(df) directly is really slow: it takes more than 1500 seconds. However, if I do something like this:
df_split = np.split(df, 100, axis = 1)
df_shuffled = pd.concat([shuffle_table(x) for x in df_split], axis = 1)
This is much faster and only takes 60 seconds.
My best guess is that this is related to the way pandas allocates space when generating a new DataFrame.
Besides, the fastest way I have come up with is:
tmp_d = {}
for col in df.columns:
    tmp_val = df[col].values
    np.random.shuffle(tmp_val)
    tmp_d[col] = tmp_val
df_shuffled = pd.DataFrame(tmp_d)
df_shuffled = df_shuffled[df.columns]
This takes approximately 15 seconds.
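For comparison, the same per-column shuffle can probably be done on the whole underlying array in one call instead of column by column. The sketch below is not timed. It assumes NumPy 1.20 or newer (for `Generator.permuted`) and that all columns share a single numeric dtype. The function name `shuffle_columns` and the `seed` parameter are hypothetical, introduced only for this example.

```python
import numpy as np
import pandas as pd

def shuffle_columns(df, seed=None):
    """Return a new DataFrame with every column shuffled independently."""
    rng = np.random.default_rng(seed)
    # Assumption: one dtype for all columns, so this is a plain 2-D array.
    values = df.to_numpy()
    # permuted(axis=0) shuffles along the rows separately for each column
    # and returns a new array, leaving the input unchanged.
    shuffled = rng.permuted(values, axis=0)
    return pd.DataFrame(shuffled, index=df.index, columns=df.columns)

# Hypothetical small frame for illustration.
demo = pd.DataFrame(np.arange(12).reshape(4, 3), columns=list("xyz"))
out = shuffle_columns(demo, seed=0)
print(out)
```

Unlike `np.random.shuffle` on a 2-D array, which moves whole rows together, `permuted` with `axis=0` draws a separate permutation for each column, which is the behavior the loop above is after.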