Numpy shuffle multidimensional array by row only, keep column order unchanged

Question

How can I shuffle a multidimensional array by row only in Python (that is, without shuffling the columns)?

I am looking for the most efficient solution, because my matrix is very large. Is it also possible to do this efficiently in place on the original array (to save memory)?

Example:

import numpy as np
X = np.random.random((6, 2))
print(X)
Y = ???  # shuffle by row only, not columns
print(Y)

Original matrix:

[[ 0.48252164  0.12013048]
 [ 0.77254355  0.74382174]
 [ 0.45174186  0.8782033 ]
 [ 0.75623083  0.71763107]
 [ 0.26809253  0.75144034]
 [ 0.23442518  0.39031414]]

Expected output, with the rows shuffled but not the columns, e.g.:

[[ 0.45174186  0.8782033 ]
 [ 0.48252164  0.12013048]
 [ 0.77254355  0.74382174]
 [ 0.75623083  0.71763107]
 [ 0.23442518  0.39031414]
 [ 0.26809253  0.75144034]]

Option 1: a shuffled view onto the array. I guess that would mean a custom implementation. (Almost) no impact on memory usage, obviously some impact at runtime. It really depends on how you intend to use this matrix.
– Dima Tisnek, Commented Feb 26, 2016 at 9:19
Option 2: shuffle the array in place with np.random.shuffle(x). The docs state that "this function only shuffles the array along the first index of a multi-dimensional array", which is good enough for you, right? Obviously some time is taken at startup, but from that point on it is as fast as the original matrix.
– Dima Tisnek, Commented Feb 26, 2016 at 9:21
Compared to np.random.shuffle(x), shuffling an index array for the nd-array and then fetching the data through the shuffled index is a more efficient way to solve this problem. For a more detailed comparison, refer to my answer below.
– John, Commented May 1, 2017 at 8:20

Kasravnd · Accepted Answer · 2021-01-31 08:48:05Z

88

You can use numpy.random.shuffle().

This function only shuffles the array along the first axis of a multi-dimensional array. The order of sub-arrays is changed but their contents remain the same.

In [2]: import numpy as np

In [3]: X = np.random.random((6, 2))

In [4]: X
Out[4]: 
array([[0.71935047, 0.25796155],
       [0.4621708 , 0.55140423],
       [0.22605866, 0.61581771],
       [0.47264172, 0.79307633],
       [0.22701656, 0.11927993],
       [0.20117207, 0.2754544 ]])

In [5]: np.random.shuffle(X)

In [6]: X
Out[6]: 
array([[0.71935047, 0.25796155],
       [0.47264172, 0.79307633],
       [0.4621708 , 0.55140423],
       [0.22701656, 0.11927993],
       [0.20117207, 0.2754544 ],
       [0.22605866, 0.61581771]])
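
In newer code, the same in-place row shuffle can also be written with the Generator API. This is a minimal sketch, assuming NumPy 1.17 or later; the seed and the small integer array are only illustrative and are not part of the original answer.

```python
import numpy as np

rng = np.random.default_rng(seed=0)  # seed chosen only for reproducibility
X = np.arange(12).reshape(6, 2)      # row i holds [2*i, 2*i + 1]
original = X.copy()

rng.shuffle(X)  # shuffles along axis 0 in place and returns None

# Every row is still an intact (even, odd) pair; only the row order may differ.
rows_intact = bool(np.all(X[:, 1] == X[:, 0] + 1))
same_rows = sorted(map(tuple, X)) == sorted(map(tuple, original))
print(X)
print(rows_intact, same_rows)
```

Because the shuffle happens in place, no second copy of the full array is kept around afterwards.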

For other functionality, you can also check out the following function:

random.Generator.permuted was introduced in NumPy 1.20.0.

The new function differs from shuffle and permutation in that the subarrays indexed by an axis are permuted rather than the axis being treated as a separate 1-D array for every combination of the other indexes. For example, it is now possible to permute the rows or columns of a 2-D array.
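
To make the distinction concrete, here is a small sketch contrasting Generator.permutation, which moves whole rows, with Generator.permuted, which shuffles each slice independently. It assumes NumPy 1.20 or later; the seed and array are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(seed=1)  # illustrative seed
X = np.arange(12).reshape(4, 3)      # each row is already sorted ascending

# Reorders whole rows and returns a shuffled copy; X itself is untouched.
row_shuffled = rng.permutation(X, axis=0)

# Shuffles the entries of each row independently,
# so the column order is NOT preserved.
within_rows = rng.permuted(X, axis=1)

rows_kept = sorted(map(tuple, row_shuffled)) == sorted(map(tuple, X))
same_values_per_row = bool(np.all(np.sort(within_rows, axis=1) == X))
print(row_shuffled)
print(within_rows)
```

For the question as asked (rows shuffled, columns untouched), shuffle or permutation along axis 0 is the matching behaviour; permuted answers a different need.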

edited Jan 31, 2021 at 8:48

answered Feb 26, 2016 at 8:33

Kasravnd


7 Comments

Georg Schölly Over a year ago

I wonder if this could be sped up by numpy, maybe taking advantage of concurrency.

Kasravnd Over a year ago

@GeorgSchölly I think this is the most optimized approach available in Python. If you want to speed it up, you need to change the algorithm.

Georg Schölly Over a year ago

I completely agree. I just realized that you are using np.random instead of the Python random module which also contains a shuffle function. I'm sorry for causing confusion.

robert Over a year ago

This shuffle does not always work; see my new answer below. Why does it not always work?

MJimitater Over a year ago

This method returns a NoneType object - any solution for keeping the object a numpy array? EDIT: sorry all good: I had X = np.random.shuffle(X), which returns a NoneType object, but the key was just np.random.shuffle(X), since it is shuffled in place.
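
The pitfall described in the comment above can be reproduced in a few lines. This is a small sketch with an illustrative array; it also shows np.random.permutation as one way to get a shuffled copy when a return value is wanted.

```python
import numpy as np

X = np.arange(6).reshape(3, 2)

result = np.random.shuffle(X)  # shuffles X in place
print(result)                  # None, so do not write X = np.random.shuffle(X)

# When a new array is wanted instead, np.random.permutation returns a
# row-shuffled copy and leaves its input unchanged.
Y = np.random.permutation(X)
print(type(Y), Y.shape)
```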


Community · Accepted Answer · 2017-05-23 10:31:19Z

30

You can also use np.random.permutation to generate a random permutation of row indices and then index into the rows of X using np.take with axis=0. np.take also allows writing the result back into the input array X itself via the out= option, which would save us memory. Thus, the implementation would look like this -

np.take(X,np.random.permutation(X.shape[0]),axis=0,out=X)

Sample run -

In [23]: X
Out[23]: 
array([[ 0.60511059,  0.75001599],
       [ 0.30968339,  0.09162172],
       [ 0.14673218,  0.09089028],
       [ 0.31663128,  0.10000309],
       [ 0.0957233 ,  0.96210485],
       [ 0.56843186,  0.36654023]])

In [24]: np.take(X,np.random.permutation(X.shape[0]),axis=0,out=X);

In [25]: X
Out[25]: 
array([[ 0.14673218,  0.09089028],
       [ 0.31663128,  0.10000309],
       [ 0.30968339,  0.09162172],
       [ 0.56843186,  0.36654023],
       [ 0.0957233 ,  0.96210485],
       [ 0.60511059,  0.75001599]])
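
A self-contained version of the sample run, with a check that only the row order changes, might look like the sketch below. The array contents are illustrative. One caveat worth keeping in mind: the NumPy documentation for np.take notes that out is always buffered when mode='raise' (the default), so a temporary may still be allocated internally.

```python
import numpy as np

X = np.arange(12, dtype=float).reshape(6, 2)
expected_rows = sorted(map(tuple, X))

perm = np.random.permutation(X.shape[0])  # random ordering of the row indices
np.take(X, perm, axis=0, out=X)           # gather rows in that order, write back into X

rows_preserved = sorted(map(tuple, X)) == expected_rows
print(X)
print(rows_preserved)
```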

Additional performance boost

Here's a trick to speed up np.random.permutation(X.shape[0]) with np.argsort() -

np.random.rand(X.shape[0]).argsort()

Speedup results -

In [32]: X = np.random.random((6000, 2000))

In [33]: %timeit np.random.permutation(X.shape[0])
1000 loops, best of 3: 510 µs per loop

In [34]: %timeit np.random.rand(X.shape[0]).argsort()
1000 loops, best of 3: 297 µs per loop

Thus, the shuffling solution could be modified to -

np.take(X,np.random.rand(X.shape[0]).argsort(),axis=0,out=X)
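
The reason the argsort trick yields a valid shuffle order is that sorting a vector of random keys and keeping the sorting indices gives each index exactly once. A brief sketch with an illustrative size n:

```python
import numpy as np

n = 8
keys = np.random.rand(n)   # one random sort key per row
perm = keys.argsort()      # indices that would sort the keys

# Every index from 0 to n-1 appears exactly once, so perm is a permutation.
is_permutation = sorted(perm.tolist()) == list(range(n))
print(perm, is_permutation)
```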

Runtime tests -

These tests include the two approaches listed in this post and the np.random.shuffle based one from @Kasramvd's solution.

In [40]: X = np.random.random((6000, 2000))

In [41]: %timeit np.random.shuffle(X)
10 loops, best of 3: 25.2 ms per loop

In [42]: %timeit np.take(X,np.random.permutation(X.shape[0]),axis=0,out=X)
10 loops, best of 3: 53.3 ms per loop

In [43]: %timeit np.take(X,np.random.rand(X.shape[0]).argsort(),axis=0,out=X)
10 loops, best of 3: 53.2 ms per loop

So it seems these np.take based approaches are worth using only if memory is a concern; otherwise the np.random.shuffle based solution looks like the way to go.

edited May 23, 2017 at 10:31

CommunityBot


answered Feb 26, 2016 at 8:37

Divakar


2 Comments

robert Over a year ago

This sounds nice. Can you add timing information to your post comparing your np.take with the standard shuffle? np.random.shuffle on my system is faster (27.9 ms) than your take (62.9 ms), but as I read in your post, there is a memory advantage?

Divakar Over a year ago

@robert Just added, check it out!

John · Accepted Answer · 2021-07-02 10:52:12Z

After a bit of experimenting, I found the most memory- and time-efficient way to shuffle data (row-wise) in an nD array: first shuffle the index of the array, then use the shuffled index to get the data, e.g.

rand_num2 = np.random.randint(5, size=(6000, 2000))
perm = np.arange(rand_num2.shape[0])
np.random.shuffle(perm)
rand_num2 = rand_num2[perm]
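
A smaller sketch of the same idea, with a check on what the indexing step does. The array is illustrative. Note that indexing with an integer array builds a new array rather than a view, so the shuffled data lives in separate memory from the original until the old array is released.

```python
import numpy as np

data = np.arange(12).reshape(6, 2)
perm = np.arange(data.shape[0])
np.random.shuffle(perm)          # shuffle only the small 1-D index array

shuffled = data[perm]            # integer-array indexing builds a new array

is_copy = not np.shares_memory(shuffled, data)
rows_preserved = sorted(map(tuple, shuffled)) == sorted(map(tuple, data))
print(shuffled)
print(is_copy, rows_preserved)
```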

In more detail: here I am using memory_profiler to measure memory usage and Python's built-in time module to record time, comparing all previous answers.

import time

import numpy as np

def main():
    # shuffle data itself
    rand_num = np.random.randint(5, size=(6000, 2000))
    start = time.time()
    np.random.shuffle(rand_num)
    print('Time for direct shuffle: {0}'.format((time.time() - start)))
    
    # Shuffle index and get data from shuffled index
    rand_num2 = np.random.randint(5, size=(6000, 2000))
    start = time.time()
    perm = np.arange(rand_num2.shape[0])
    np.random.shuffle(perm)
    rand_num2 = rand_num2[perm]
    print('Time for shuffling index: {0}'.format((time.time() - start)))
    
    # using np.take()
    rand_num3 = np.random.randint(5, size=(6000, 2000))
    start = time.time()
    np.take(rand_num3, np.random.rand(rand_num3.shape[0]).argsort(), axis=0, out=rand_num3)
    print("Time taken by np.take, {0}".format((time.time() - start)))

Result for Time

Time for direct shuffle: 0.03345608711242676   # 33.4msec
Time for shuffling index: 0.019818782806396484 # 19.8msec
Time taken by np.take, 0.06726956367492676     # 67.2msec

Memory profiler Result

Line #    Mem usage    Increment   Line Contents
================================================
    39  117.422 MiB    0.000 MiB   @profile
    40                             def main():
    41                                 # shuffle data itself
    42  208.977 MiB   91.555 MiB       rand_num = np.random.randint(5, size=(6000, 2000))
    43  208.977 MiB    0.000 MiB       start = time.time()
    44  208.977 MiB    0.000 MiB       np.random.shuffle(rand_num)
    45  208.977 MiB    0.000 MiB       print('Time for direct shuffle: {0}'.format((time.time() - start)))
    46                             
    47                                 # Shuffle index and get data from shuffled index
    48  300.531 MiB   91.555 MiB       rand_num2 = np.random.randint(5, size=(6000, 2000))
    49  300.531 MiB    0.000 MiB       start = time.time()
    50  300.535 MiB    0.004 MiB       perm = np.arange(rand_num2.shape[0])
    51  300.539 MiB    0.004 MiB       np.random.shuffle(perm)
    52  300.539 MiB    0.000 MiB       rand_num2 = rand_num2[perm]
    53  300.539 MiB    0.000 MiB       print('Time for shuffling index: {0}'.format((time.time() - start)))
    54                             
    55                                 # using np.take()
    56  392.094 MiB   91.555 MiB       rand_num3 = np.random.randint(5, size=(6000, 2000))
    57  392.094 MiB    0.000 MiB       start = time.time()
    58  392.242 MiB    0.148 MiB       np.take(rand_num3, np.random.rand(rand_num3.shape[0]).argsort(), axis=0, out=rand_num3)
    59  392.242 MiB    0.000 MiB       print("Time taken by np.take, {0}".format((time.time() - start)))

I lost the code that produced the memory_profiler output, but it can be reproduced very easily by following the steps in the given link.
What I like about this answer is that if I have two matched arrays (which coincidentally I do) then I can shuffle both of them and ensure that data in corresponding positions still match. This is useful for randomising the order of my training set.
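
The matched-arrays use case can be sketched as follows: one permutation is drawn and applied to both arrays so that corresponding rows stay aligned. The names features and labels, and the rule tying them together, are hypothetical and chosen only so the alignment can be checked.

```python
import numpy as np

features = np.arange(10).reshape(5, 2)   # hypothetical training inputs
labels = features[:, 0] // 2             # hypothetical labels tied to each row

perm = np.random.permutation(len(features))  # one shared row order
features_shuffled = features[perm]
labels_shuffled = labels[perm]

# Each label still belongs to the row it was derived from.
still_matched = bool(np.all(labels_shuffled == features_shuffled[:, 0] // 2))
print(still_matched)
```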

Minions · Accepted Answer · 2019-11-04 19:11:39Z

8

I tried many solutions, and in the end I used this simple one:

import numpy as np
from sklearn.utils import shuffle
x = np.array([[1, 2],
              [3, 4],
              [5, 6]])
print(shuffle(x, random_state=0))

Output:

[[5 6]
 [3 4]
 [1 2]]

If you have a 3D array, loop over the first axis (axis=0) and apply this function to each item, like:

np.array([shuffle(item) for item in array_3d])

answered Nov 4, 2019 at 19:11

Minions


Ben-Hur Cardoso · Accepted Answer · 2019-03-19 18:03:05Z

3

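You can shuffle the entries within each row of a two-dimensional array A using the np.vectorize() function. Note that this permutes each row's elements independently, so it changes the column order inside every row rather than reordering the rows themselves: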

shuffle = np.vectorize(np.random.permutation, signature='(n)->(n)')

A_shuffled = shuffle(A)
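
It may help to check on a small array what this wrapper actually does. With the signature '(n)->(n)', np.vectorize applies the function to each 1-D slice along the last axis. The sketch below uses an illustrative array and a renamed wrapper (shuffle_within_rows is a name introduced here, not from the answer).

```python
import numpy as np

shuffle_within_rows = np.vectorize(np.random.permutation, signature='(n)->(n)')

A = np.arange(12).reshape(4, 3)   # each row is already sorted ascending
B = shuffle_within_rows(A)

# Each output row holds the same values as the matching input row,
# possibly in a different column order; the rows themselves stay in place.
same_values_per_row = bool(np.all(np.sort(B, axis=1) == A))
print(B)
print(same_values_per_row)
```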

edited Mar 19, 2019 at 18:03

answered Dec 16, 2018 at 14:00

Ben-Hur Cardoso
