I am trying to shuffle each column in a pandas DataFrame separately. Here are the functions I wrote:

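import numpy as np
import pandas as pd


def shuffle_x(x):
    # Shuffle a copy so the caller's array is left untouched.
    x = x.copy()
    np.random.shuffle(x)
    return x


def shuffle_table(df):
    # raw=True passes each column to shuffle_x as a NumPy array.
    return df.apply(shuffle_x, raw=True, axis=0)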

Now I am testing on a pandas DataFrame df with 30000 rows and 1000 columns. If I call shuffle_table(df) directly, it is really slow and takes more than 1500 seconds. However, if I do something like this:

df_split = np.split(df, 100, axis = 1)
df_shuffled = pd.concat([shuffle_table(x) for x in df_split], axis = 1)

This is much faster and takes only 60 seconds.
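As a sanity check on the claim that the block-wise version does the same work, here is a minimal sketch on a tiny frame. It assumes the blocks are groups of columns (as with axis = 1 above). The names df_small and shuffle_block are made up for illustration, and it uses iloc slicing rather than np.split to build the blocks.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df_small = pd.DataFrame(np.arange(24).reshape(6, 4), columns=list("abcd"))


def shuffle_block(block):
    # raw=True hands each column to rng.permutation as a NumPy array;
    # permutation returns a shuffled copy of that column.
    return block.apply(rng.permutation, raw=True, axis=0)


# Two blocks of two columns each, shuffled separately and glued back together.
blocks = [df_small.iloc[:, i:i + 2] for i in range(0, df_small.shape[1], 2)]
by_blocks = pd.concat([shuffle_block(b) for b in blocks], axis=1)

# Every column is still a permutation of all of its original values.
same_values = all(
    sorted(by_blocks[c]) == sorted(df_small[c]) for c in df_small.columns
)
```

Because each block keeps all rows, every column is shuffled over its full length, so splitting by columns only changes how many columns each apply call handles.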

My best guess is that this is related to the way pandas allocates space when generating a new DataFrame.

Besides, the fastest way I can come up with is:

tmp_d = {}
for col in df.columns:
    tmp_val = df[col].values.copy()  # copy so df itself is not shuffled in place
    np.random.shuffle(tmp_val)
    tmp_d[col] = tmp_val

df_shuffled = pd.DataFrame(tmp_d)
df_shuffled = df_shuffled[df.columns]

This takes approximately 15 seconds.
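For comparison, a possible alternative that avoids the Python-level loop, sketched under two assumptions: all columns share one dtype (so the frame converts to a single 2-D array), and NumPy 1.20 or later is available for Generator.permuted. This is not the code timed above, and shuffle_columns and its seed parameter are hypothetical names.

```python
import numpy as np
import pandas as pd


def shuffle_columns(df, seed=None):
    """Return a new DataFrame with every column shuffled independently."""
    rng = np.random.default_rng(seed)
    # Assumes a homogeneous dtype; mixed dtypes would be upcast to object.
    values = df.to_numpy()
    # permuted(axis=0) shuffles along the rows separately for each column
    # and returns a new array, leaving `values` unchanged.
    shuffled = rng.permuted(values, axis=0)
    return pd.DataFrame(shuffled, index=df.index, columns=df.columns)


demo = pd.DataFrame(np.arange(20).reshape(5, 4), columns=list("abcd"))
demo_shuffled = shuffle_columns(demo, seed=0)
```

Whether this beats the dictionary loop on a 30000 x 1000 frame would need to be measured; no timing is claimed here.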

  • That is why we need chunks. Also, if you plot the timing against chunk size, you will see that the two are correlated, so you can look for the chunk size that makes the whole procedure fastest. Commented Aug 6, 2018 at 19:30

1 Answer

It's faster because it's not doing the same thing.

Fully shuffling a sequence of n elements, so that every permutation is possible, takes at least O(n) time. So the bigger your DataFrame, the longer it will take to shuffle.

Your second example is not equivalent, because it's not fully random. It only shuffles individual chunks. If there is a column like [1, 2, 3, ..., 29999, 30000], your second method will never, for instance, generate a result like [1, 30000, 2, 29999, ...], because it will never shuffle together the beginning of the sequence with the end. There are many possible shuffles that can't be achieved with the chunk-based shuffling.
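The point about unreachable shuffles can be seen in a small sketch. It assumes the chunks are taken along the rows of a single column, which is the case this paragraph describes. A 12-element array stands in for [1, 2, ..., 30000], and the variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
column = np.arange(1, 13)        # stand-in for [1, 2, ..., 30000]
chunks = np.split(column, 3)     # three row-wise chunks of four values each

# Shuffle each chunk on its own, then put the chunks back in order.
chunked = np.concatenate([rng.permutation(c) for c in chunks])

# Whatever the seed, the first four positions can only hold 1..4,
# so a result that starts [1, 12, ...] is impossible.
first_chunk_values = set(chunked[:4].tolist())
```

A full shuffle of the same column, by contrast, can place any value in any position.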

In theory if you split your DataFrame into 100 equal-sized chunks, you would expect each one to shuffle 100 times faster than the whole. Based on your timings it looks like it's actually taking longer than this for the sub-shuffles, which I would guess is at least partly due to the overhead of creating the subtables in the first place.

1 Comment

Hi BrenBarn, I made a mistake in my original post and have just updated it. The two approaches should do the same thing, and the time difference still stands.