Why does my function run faster after I split a pandas DataFrame into chunks, compared to simply calling apply()?

Question

I am trying to shuffle each column in a pandas DataFrame separately. Here are the functions I wrote:

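import numpy as np
import pandas as pd


def shuffle_x(x):
    # Shuffle a copy so the original column is left untouched.
    x = x.copy()
    np.random.shuffle(x)
    return x


def shuffle_table(df):
    # raw=True passes each column to shuffle_x as a NumPy array.
    df_shuffled = df.apply(shuffle_x, raw=True, axis=0)
    return df_shuffled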

I am testing this on a pandas DataFrame df with 30000 rows and 1000 columns. Calling shuffle_table(df) directly is really slow: it takes more than 1500 seconds. However, if I do something like this:

df_split = np.split(df, 100, axis = 1)
df_shuffled = pd.concat([shuffle_table(x) for x in df_split], axis = 1)

This is much faster and only takes 60 seconds.

My best guess is that this is related to the way pandas allocates space when generating a new DataFrame.

Besides, the fastest way that I can come up with is:

tmp_d = {}
for col in df.columns:
    # Copy first: .values can return a view, and shuffling it in place would also modify df.
    tmp_val = df[col].values.copy()
    np.random.shuffle(tmp_val)
    tmp_d[col] = tmp_val

df_shuffled = pd.DataFrame(tmp_d)
df_shuffled = df_shuffled[df.columns]

This takes approximately 15 seconds.
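
For comparison, here is a sketch of a vectorized alternative that does not appear in the original post. It assumes NumPy 1.20 or later (for Generator.permuted) and a DataFrame whose columns all share one dtype. The helper name shuffle_columns_permuted and the seed parameter are hypothetical, and no timing is claimed for it.

```python
import numpy as np
import pandas as pd


def shuffle_columns_permuted(df, seed=None):
    """Shuffle every column independently (hypothetical helper, not from the post)."""
    rng = np.random.default_rng(seed)
    # permuted(..., axis=0) shuffles along the row axis separately for each
    # column and returns a new array, so df itself is not modified.
    shuffled = rng.permuted(df.to_numpy(), axis=0)
    return pd.DataFrame(shuffled, index=df.index, columns=df.columns)


# Small illustrative frame; the sizes here are arbitrary.
df_small = pd.DataFrame(np.arange(20.0).reshape(5, 4), columns=list("abcd"))
df_before = df_small.copy()
shuffled_small = shuffle_columns_permuted(df_small, seed=0)
```

Each column of shuffled_small holds the same values as the matching column of df_small, in an independently shuffled order.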

That is why we need chunks. Also, if you plot the timing against chunk size, you will see that the time is correlated with chunk size, so you can find the optimal chunk size to speed up the whole procedure.
– BENY, commented Aug 6, 2018 at 19:30

BrenBarn · Accepted Answer · 2018-08-06 19:32:22Z

It's faster because it's not doing the same thing.

Fully shuffling a sequence, so that every ordering is possible, requires at least O(n) time. So the bigger your DataFrame, the longer it will take to shuffle.

Your second example is not equivalent, because it's not fully random. It only shuffles individual chunks. If there is a column like [1, 2, 3, ..., 29999, 30000], your second method will never, for instance, generate a result like [1, 30000, 2, 29999, ...], because it will never shuffle together the beginning of the sequence with the end. There are many possible shuffles that can't be achieved with the chunk-based shuffling.
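
The point about chunk-wise shuffling can be illustrated with a minimal sketch. It assumes the chunks are contiguous blocks of rows within one column. The helper shuffle_within_chunks and the tiny 12-element column are hypothetical, chosen only to keep the example readable.

```python
import numpy as np


def shuffle_within_chunks(values, n_chunks, rng):
    """Shuffle each contiguous chunk of values separately (hypothetical helper)."""
    chunks = np.array_split(np.asarray(values), n_chunks)
    return np.concatenate([rng.permutation(chunk) for chunk in chunks])


rng = np.random.default_rng(0)
column = np.arange(1, 13)          # stand-in for [1, 2, ..., 30000]

chunked = shuffle_within_chunks(column, 3, rng)
full = rng.permutation(column)     # full shuffle: any value can land anywhere

# After chunk-wise shuffling, every value is still inside its original block of 4.
for i in range(3):
    block = slice(4 * i, 4 * i + 4)
    assert set(chunked[block].tolist()) == set(column[block].tolist())
```

The assertions hold for any seed, which is exactly why a result such as [1, 12, 2, 11, ...] can never come out of the chunk-wise version.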

In theory if you split your DataFrame into 100 equal-sized chunks, you would expect each one to shuffle 100 times faster than the whole. Based on your timings it looks like it's actually taking longer than this for the sub-shuffles, which I would guess is at least partly due to the overhead of creating the subtables in the first place.

1 Comment

Eric He Over a year ago

Hi BrenBarn, I made a mistake in my original post and have just updated it. The two approaches should do the same thing, and the time difference still stands.
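
The claim in this comment can be checked on a small frame. The sketch below assumes the split is along columns (axis=1), as in the updated question. The helpers shuffle_table_chunked and columns_are_permutations are hypothetical, and positional column slicing with iloc stands in for np.split on the DataFrame.

```python
import numpy as np
import pandas as pd


def shuffle_x(x):
    x = x.copy()
    np.random.shuffle(x)
    return x


def shuffle_table(df):
    return df.apply(shuffle_x, raw=True, axis=0)


def shuffle_table_chunked(df, n_chunks):
    """Shuffle groups of columns separately, then reassemble (hypothetical helper)."""
    col_groups = np.array_split(np.arange(df.shape[1]), n_chunks)
    parts = [shuffle_table(df.iloc[:, idx]) for idx in col_groups]
    return pd.concat(parts, axis=1)


def columns_are_permutations(original, shuffled):
    """True if every shuffled column holds exactly the values of the original column."""
    return all(
        np.array_equal(np.sort(original[c].to_numpy()), np.sort(shuffled[c].to_numpy()))
        for c in original.columns
    )


np.random.seed(0)
df = pd.DataFrame(np.random.rand(50, 12), columns=[f"c{i}" for i in range(12)])

whole = shuffle_table(df)
chunked = shuffle_table_chunked(df, 4)
```

Because each column still sees all of its rows in both versions, columns_are_permutations(df, whole) and columns_are_permutations(df, chunked) should both be True. Splitting by columns changes how much work each apply() call does, not which shuffles are possible.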
