
I have two different numpy arrays and I would like to shuffle them in a synchronized way.

The current solution is taken from https://www.tensorflow.org/versions/r0.8/tutorials/mnist/pros/index.html and proceeds as follows:

perm = np.arange(self.no_images_train)
np.random.shuffle(perm)
self.images_train = self.images_train[perm]
self.labels_train = self.labels_train[perm]

The problem is that it doubles memory each time I do it. Somehow the old arrays are not getting deleted, probably because the slicing operator creates views I guess. I tried the following change, out of pure desperation:

perm = np.arange(self.no_images_train)
np.random.shuffle(perm)

n_images_train = self.images_train[perm]
n_labels_train = self.labels_train[perm]            

del self.images_train
del self.labels_train
gc.collect()

self.images_train = n_images_train
self.labels_train = n_labels_train

Still the same, memory leaks and I am running out of memory after a couple of operations.

Btw, the two arrays are of rank 100000,224,244,1 and 100000,1.

I know that this has been dealt with here (Better way to shuffle two numpy arrays in unison), but the answer didn't help me, as the provided solution needs slicing again.

Thanks for any help.

  • Those aren't views. You may have other references to the original arrays somewhere. Commented Jun 14, 2016 at 19:26
  • "...because the slicing operator creates views I guess." Slicing does create views, but the code that you show is not slicing. When you write a[perm], a copy is made. "Slicing" refers to the operation using a colon: start:end:step, e.g. 0:4, 4:, etc. Commented Jun 14, 2016 at 19:27
  • "... in asynchronized way." I think you are missing a space. Based on what follows, I think you meant "in a synchronized way." Commented Jun 14, 2016 at 19:28
  • "...rank 100000,224,244,1..." That's almost 5.5 gigabytes (assuming the data type is 8 bit). Even in your "desperation" code, there is a time when self.images_train and `n_images_train" will both exist, which will require 11 gigabytes. This is not a memory "leak". Commented Jun 14, 2016 at 19:40
  • I think a better title for this question is "How do I apply the same random permutation to two arrays without making temporary copies of the arrays?" Commented Jun 14, 2016 at 20:16
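The memory figure quoted in the comments is easy to verify (a quick sanity calculation, assuming uint8, i.e. one byte per element):

```python
# Size of a (100000, 224, 244, 1) array with one-byte elements.
n_bytes = 100000 * 224 * 244 * 1
print(n_bytes / 1e9)  # ~5.47 GB -- two live copies therefore need roughly 11 GB
```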

2 Answers


One way to permute two large arrays in-place in a synchronized way is to save the state of the random number generator and then shuffle the first array. Then restore the state and shuffle the second array.

For example, here are my two arrays:

In [48]: a
Out[48]: array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15])

In [49]: b
Out[49]: array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15])

Save the current internal state of the random number generator:

In [50]: state = np.random.get_state()

Shuffle a in-place:

In [51]: np.random.shuffle(a)

Restore the internal state of the random number generator:

In [52]: np.random.set_state(state)

Shuffle b in-place:

In [53]: np.random.shuffle(b)

Check that the permutations are the same:

In [54]: a
Out[54]: array([13, 12, 11, 15, 10,  5,  1,  6, 14,  3,  9,  7,  0,  8,  4,  2])

In [55]: b
Out[55]: array([13, 12, 11, 15, 10,  5,  1,  6, 14,  3,  9,  7,  0,  8,  4,  2])

For your code, this would look like:

state = np.random.get_state()
np.random.shuffle(self.images_train)
np.random.set_state(state)
np.random.shuffle(self.labels_train)
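On newer NumPy versions (1.17+), the same trick can be written with two Generator instances built from the same seed, instead of saving and restoring the global state. This is a sketch, not part of the original answer:

```python
import numpy as np

a = np.arange(16)
b = np.arange(16)

# Two generators seeded identically produce identical random streams,
# and therefore identical shuffle orders.
rng_a = np.random.default_rng(12345)
rng_b = np.random.default_rng(12345)

rng_a.shuffle(a)  # in-place, no full copy of a
rng_b.shuffle(b)  # applies the same permutation to b
```

This variant avoids touching the global RNG state, which matters if other code draws random numbers between the two shuffles.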

4 Comments

It does help, thank you. However, I actually found a better way to circumvent the problem: I decided not to shuffle the data periodically, but only to recreate a permutation vector and to sample using it. I'd still like to know why the original solution fails.
However, this solution needs two calls to the random number generator, which may become a performance bottleneck. You could use a different random number generator to reduce this effect.
@Guillaum Yes, the two calls to the random number generator (to generate the same sequence!) might be an issue, so some performance testing is recommended. How would using a different generator help?
@WarrenWeckesser As far as I know, numpy's random generator is a Mersenne Twister. There exist random number generators with different trade-offs in quality and speed. For example, using C++ std::mt19937_64 (Mersenne Twister) versus std::minstd_rand (a simpler approach) to generate 10 million random numbers runs in 8.3s versus 1.0s on my computer. However, I thought numpy came with different generators, but I was wrong.
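The workaround mentioned in the first comment above, keeping the data in place and only permuting an index vector, could look like this (array sizes and names are illustrative stand-ins, not the OP's actual attributes):

```python
import numpy as np

images = np.zeros((10, 4, 4, 1))       # stand-in for images_train
labels = np.arange(10).reshape(10, 1)  # stand-in for labels_train

# Only the index vector is shuffled; the big arrays are never copied whole.
perm = np.random.permutation(len(images))

batch_size = 4
batches = []
for start in range(0, len(images), batch_size):
    idx = perm[start:start + batch_size]
    batches.append((images[idx], labels[idx]))  # small per-batch copies only
```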

Actually I don't think there is any issue with numpy or python. Numpy uses the system malloc/free to allocate arrays, and this leads to memory fragmentation (see Memory Fragmentation on SO).

So I guess your memory profile may increase and then suddenly drop when the system is able to reduce fragmentation, if possible.

1 Comment

Memory increases in steps of 6GB, and at 230GB I killed the process on my machine with 64GB of physical memory. I am not sure this can be entirely attributed to fragmentation, especially since there is no real reason why more than 6GB should be in use over longer periods of time (apart from temporary allocations for copying etc.).
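If the temporary copies themselves are the problem, one zero-extra-copy option (a sketch, not from either answer) is a Fisher-Yates pass that applies identical swaps to both arrays in place:

```python
import numpy as np

def shuffle_in_unison_inplace(a, b, rng=None):
    """Apply the same Fisher-Yates shuffle to the rows of a and b, in place."""
    if rng is None:
        rng = np.random.default_rng()
    for i in range(len(a) - 1, 0, -1):
        j = int(rng.integers(0, i + 1))  # pick a partner index j <= i
        a[[i, j]] = a[[j, i]]            # swap rows i and j of a
        b[[i, j]] = b[[j, i]]            # mirror the same swap in b

# Small demonstration: rows of a and entries of b stay paired.
a = np.arange(20).reshape(10, 2)
b = np.arange(10)
shuffle_in_unison_inplace(a, b, rng=np.random.default_rng(0))
```

The Python-level loop is much slower than np.random.shuffle, so this trades speed for a per-swap temporary the size of two rows instead of a full-array copy.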
