
I have two different numpy arrays and I would like to shuffle them in a synchronized way.

The current solution is taken from https://www.tensorflow.org/versions/r0.8/tutorials/mnist/pros/index.html and proceeds as follows:

perm = np.arange(self.no_images_train)
np.random.shuffle(perm)
self.images_train = self.images_train[perm]
self.labels_train = self.labels_train[perm]

The problem is that it doubles memory each time I do it. Somehow the old arrays are not getting deleted, probably because the slicing operator creates views I guess. I tried the following change, out of pure desperation:

perm = np.arange(self.no_images_train)
np.random.shuffle(perm)

n_images_train = self.images_train[perm]
n_labels_train = self.labels_train[perm]            

del self.images_train
del self.labels_train
gc.collect()

self.images_train = n_images_train
self.labels_train = n_labels_train

Still the same, memory leaks and I am running out of memory after a couple of operations.

Btw, the two arrays are of rank 100000,224,244,1 and 100000,1.

I know that this has been dealt with here (Better way to shuffle two numpy arrays in unison), but the answer didn't help me, as the provided solution needs slicing again.

Thanks for any help.

  • Those aren't views. You may have other references to the original arrays somewhere. Commented Jun 14, 2016 at 19:26
  • "...because the slicing operator creates views I guess." Slicing does create views, but the code that you show is not slicing. When you write a[perm], a copy is made. "Slicing" refers to the operation using a colon: start:end:step, e.g. 0:4, 4:, etc. Commented Jun 14, 2016 at 19:27
  • "... in asynchronized way." I think you are missing a space. Based on what follows, I think you meant "in a synchronized way." Commented Jun 14, 2016 at 19:28
  • "...rank 100000,224,244,1..." That's almost 5.5 gigabytes (assuming the data type is 8 bit). Even in your "desperation" code, there is a time when self.images_train and `n_images_train" will both exist, which will require 11 gigabytes. This is not a memory "leak". Commented Jun 14, 2016 at 19:40
  • I think a better title for this question is "How do I apply the same random permutation to two arrays without making temporary copies of the arrays?" Commented Jun 14, 2016 at 20:16
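The memory figure quoted in the comments is easy to verify (a quick sanity calculation, assuming uint8, i.e. one byte per element):

```python
# Size of a (100000, 224, 244, 1) array with one-byte elements.
n_bytes = 100000 * 224 * 244 * 1
print(n_bytes / 1e9)  # ~5.47 GB -- two live copies therefore need roughly 11 GB
```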

2 Answers


One way to permute two large arrays in-place in a synchronized way is to save the state of the random number generator and then shuffle the first array. Then restore the state and shuffle the second array.

For example, here are my two arrays:

In [48]: a
Out[48]: array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15])

In [49]: b
Out[49]: array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15])

Save the current internal state of the random number generator:

In [50]: state = np.random.get_state()

Shuffle a in-place:

In [51]: np.random.shuffle(a)

Restore the internal state of the random number generator:

In [52]: np.random.set_state(state)

Shuffle b in-place:

In [53]: np.random.shuffle(b)

Check that the permutations are the same:

In [54]: a
Out[54]: array([13, 12, 11, 15, 10,  5,  1,  6, 14,  3,  9,  7,  0,  8,  4,  2])

In [55]: b
Out[55]: array([13, 12, 11, 15, 10,  5,  1,  6, 14,  3,  9,  7,  0,  8,  4,  2])

For your code, this would look like:

state = np.random.get_state()
np.random.shuffle(self.images_train)
np.random.set_state(state)
np.random.shuffle(self.labels_train)
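On newer NumPy versions (1.17+), the same trick can be written with two Generator instances built from the same seed, instead of saving and restoring the global state. This is a sketch, not part of the original answer:

```python
import numpy as np

a = np.arange(16)
b = np.arange(16)

# Two generators seeded identically produce identical random streams,
# and therefore identical shuffle orders.
rng_a = np.random.default_rng(12345)
rng_b = np.random.default_rng(12345)

rng_a.shuffle(a)  # in-place, no full copy of a
rng_b.shuffle(b)  # applies the same permutation to b
```

This variant avoids touching the global RNG state, which matters if other code draws random numbers between the two shuffles.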

4 Comments

It does help, thank you. However, I actually found a better way to circumvent the problem: I decided not to shuffle the data periodically, but only to recreate a permutation vector and to sample using it. I'd still like to know why the original solution fails.
However, this solution needs two calls to the random number generator, which may become a performance bottleneck. You could use a different random number generator to reduce this effect.
@Guillaum Yes, the two calls to the random number generator (to generate the same sequence!) might be an issue, so some performance testing is recommended. How would using a different generator help?
@WarrenWeckesser As far as I know, numpy's random generator is a Mersenne Twister. There exist random number generators with different trade-offs in quality and speed. For example, using C++ std::mt19937_64 (Mersenne Twister) versus std::minstd_rand (a simpler approach) to generate 10 million random numbers runs in 8.3s versus 1.0s on my computer. However, I thought numpy came with different generators, but I was wrong.
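The workaround mentioned in the first comment above, keeping the data in place and only permuting an index vector, could look like this (array sizes and names are illustrative stand-ins, not the OP's actual attributes):

```python
import numpy as np

images = np.zeros((10, 4, 4, 1))       # stand-in for images_train
labels = np.arange(10).reshape(10, 1)  # stand-in for labels_train

# Only the index vector is shuffled; the big arrays are never copied whole.
perm = np.random.permutation(len(images))

batch_size = 4
batches = []
for start in range(0, len(images), batch_size):
    idx = perm[start:start + batch_size]
    batches.append((images[idx], labels[idx]))  # small per-batch copies only
```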

Actually I don't think there is any issue with numpy or python. Numpy uses the system malloc/free to allocate arrays, and this leads to memory fragmentation (see Memory Fragmentation on SO).

So I guess your memory profile may increase and then suddenly drop when the system is able to reduce fragmentation, if possible.

1 Comment

Memory increases in steps of 6GB, and at 230GB I killed the process on my machine with 64GB of physical memory. I am not sure this can be entirely attributed to fragmentation, especially since there is no real reason why more than 6GB should be in use over longer periods of time (apart from temporary allocations for copying etc.).
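If the temporary copies themselves are the problem, one zero-extra-copy option (a sketch, not from either answer) is a Fisher-Yates pass that applies identical swaps to both arrays in place:

```python
import numpy as np

def shuffle_in_unison_inplace(a, b, rng=None):
    """Apply the same Fisher-Yates shuffle to the rows of a and b, in place."""
    if rng is None:
        rng = np.random.default_rng()
    for i in range(len(a) - 1, 0, -1):
        j = int(rng.integers(0, i + 1))  # pick a partner index j <= i
        a[[i, j]] = a[[j, i]]            # swap rows i and j of a
        b[[i, j]] = b[[j, i]]            # mirror the same swap in b

# Small demonstration: rows of a and entries of b stay paired.
a = np.arange(20).reshape(10, 2)
b = np.arange(10)
shuffle_in_unison_inplace(a, b, rng=np.random.default_rng(0))
```

The Python-level loop is much slower than np.random.shuffle, so this trades speed for a per-swap temporary the size of two rows instead of a full-array copy.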
