So I have a large 3D array (~2000 x 1000 x 1000). I want to update each value in the array to a random integer between 1 and the current max, such that all cells equal to some value x are updated to the same random integer. I want to keep zeros unchanged. Also there can't be any repeats, i.e. different values in the original array can't be mapped to the same random integer. The values currently form a continuous range between 0 and 9000, and there are quite a lot of them:

np.amax(arr) #output = 9000
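To make the requirement concrete, here's a tiny hypothetical example (the actual replacement values would be random; only the properties matter):

```python
import numpy as np

# Hypothetical 2x3 input with values 0..4
arr = np.array([[0, 1, 2],
                [3, 4, 1]])

# One possible valid outcome: 1->3, 2->1, 3->4, 4->2, zeros untouched
out = np.array([[0, 3, 1],
                [4, 2, 3]])

# Properties the mapping must satisfy:
assert (out[arr == 0] == 0).all()                  # zeros unchanged
mapping = {a: b for a, b in zip(arr.ravel(), out.ravel())}
assert len(set(mapping.values())) == len(mapping)  # no two values collide
```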

So I tried the method below...

import random
import numpy as np
from tqdm import tqdm

max_v = np.amax(arr)
vlist = list(range(1, max_v + 1))
for l in tqdm(range(1, max_v + 1)):
    n = random.choice(vlist)  # avoids the off-by-one of indexing with randint(1, len(vlist))
    # note: if n > l, the newly written values can be re-replaced on a later iteration
    arr = np.where(arr == l, n, arr)
    vlist.remove(n)

My current code takes about 13 s per iteration with 9000 iterations (for the first few iterations at least), which is far too slow. I've thought about parallelisation with concurrent.futures, but I'm sure I've missed something obvious here XD

Comments
  • Example input/output might help clarify. Commented Nov 7, 2022 at 21:54
  • Can you write this in a form with minimal mutations (e.g. no vlist.remove(n))? Commented Nov 7, 2022 at 21:56
  • I think you're overcomplicating this. The values of the array are indices into a simple permutation. The entire thing can be done with a shuffle and index. Commented Nov 7, 2022 at 22:16

2 Answers

If your current values are in a continuous range, and you want another continuous range, you're in luck! At that point, you aren't really generating 2 billion random numbers: you're just permuting 9000 or so integers. For example:

arr = np.random.randint(9001, size=(10, 20, 20))
p = np.arange(arr.max(None) + 1)
np.random.shuffle(p[1:])  # shuffle a view so p[0] stays 0 and zeros are left unchanged
arr = p[arr]

The replacement values do not have to start with zero, but if you plan on doing this iteratively, you will have to subtract off the offset before using arr as an index into p.
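For instance, if the values spanned some range lo..hi rather than starting at zero, a sketch of the offset handling might look like this (the variable names are illustrative):

```python
import numpy as np

arr = np.random.randint(100, 200, size=(5, 5))  # values in 100..199
lo = arr.min()

# permutation of the offsets 0..(hi-lo), shifted back into the original range
p = lo + np.random.permutation(arr.max() - lo + 1)
arr = p[arr - lo]  # subtract the offset before indexing into p
```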

4 Comments

np.random.permutation(n) might be easier than shuffle(arange(n))
@SamMason. If this is done iteratively, you only allocate the arange once and keep shuffling it. permutation will keep returning copies.
yup, I was writing an answer doing basically the same as your answer. Using the OP's full array takes my computer ~10 secs. Also, the new-style RNG is ~3x faster at generating ints for me
@SamMason. Since OP is concerned with speed, and you did a bunch of benchmarks in addition to using the new API, I would recommend posting another answer.
As suggested by Mad Physicist, here's my almost identical solution:

from sys import getsizeof
import numpy as np

# create a new-style random generator
rng = np.random.default_rng()

# takes ~20 seconds, ~60 secs with legacy generator
X = rng.integers(9001, size=(2000, 1000, 1000), dtype=np.uint16)

# output: 3.73 GiB, uint16 takes 1/4 space of the default int64
print(f"{getsizeof(X) / 2**30:.2f} GiB")

# generate a permutation of 1..max, keeping index 0 fixed so zeros stay zero;
# converting to the same datatype makes indexing slightly faster
p = np.concatenate(([0], 1 + rng.permutation(np.max(X)))).astype(X.dtype)

# iterate applying permutation, takes ~10 seconds in total
for i in range(len(X)):
    X[i] = p[X[i]]

I'm iterating while applying the permutation to reduce transient memory demands: it only needs one slice of the first dimension at a time (~2 MiB) rather than allocating a complete new copy of the array.

MadPhysicist asked why I'm doing the for loop at the end rather than just directly executing X[:] = p[X]. This is about reducing the memory demands of the program. Under Linux, I'd use something like:

from resource import getrusage, RUSAGE_SELF

print(getrusage(RUSAGE_SELF).ru_maxrss)

to tell me the most RAM that had been allocated to the Python process (in KiB). If I run that after running the above code I get 3938904 printed, so 3.76 GiB. If I don't use the for loop, this goes up to 7.48 GiB. If I don't ensure the permutation is also of type uint16 (i.e. with .astype(X.dtype)), then my laptop would start swapping, as it would require more than 16 GiB of RAM.

2 Comments

Not sure why the loop at the end. Did you mean X[:] = p[X]?
@MadPhysicist have added more explanation, hope that makes sense
