The problem is simple: I have a vector of indices, and I want to extract a randomly chosen subset together with its complement. I wrote the following code:

import numpy as np    
vec = np.arange(0,25000)
idx = np.random.choice(vec,5000)
idx_r = np.delete(vec,idx)

However, when I print the lengths of vec, idx, and idx_r, they do not add up: len(idx) + len(idx_r) is larger than len(vec). For example, the following code:

print(len(idx))
print(len(idx_r))
print(len(idx_r)+len(idx))
print(len(vec))

returns:

5000
20462
25462
25000

Python version is 3.8.1 and GCC is 9.2.0.

1 Answer

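np.random.choice has a keyword argument replace, which defaults to True. The sample is therefore drawn with replacement, so idx can contain the same index more than once. np.delete removes each distinct index only once, which is why idx_r ends up longer than expected. If you set replace=False, I think you will get the desired result: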

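import numpy as np

vec = np.arange(0, 25000)
# Sample 5000 distinct values; replace=False rules out duplicates
idx = np.random.choice(vec, 5000, replace=False)
# np.delete treats idx as positions, which equal the values here
idx_r = np.delete(vec, idx)

print([len(item) for item in (vec, idx, idx_r)])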

Out:

[25000, 5000, 20000]
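
To see where the extra 462 entries in the question come from, you can count the distinct indices in a sample drawn with replacement. This is only a minimal sketch; the fixed seed is my own assumption, added for repeatability, so the exact counts will differ from the ones in the question:

```python
import numpy as np

np.random.seed(0)  # assumption: fixed seed, only so the run is repeatable
vec = np.arange(0, 25000)

# Default replace=True: the same index can be drawn more than once
idx = np.random.choice(vec, 5000)
idx_r = np.delete(vec, idx)

n_unique = len(np.unique(idx))   # distinct indices actually drawn
n_dupes = len(idx) - n_unique    # draws that repeated an earlier index

# np.delete drops each distinct position once, so these two always add up
print(n_unique, n_dupes, len(idx_r))
print(n_unique + len(idx_r) == len(vec))
```

With 5000 draws from 25000 values, about 25000 * (1 - exp(-0.2)), or roughly 4532, distinct indices are expected. That is consistent with the question's output, where 25000 - 20462 = 4538 indices were removed.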

However, numpy.random.choice with replace=False is inefficient: for backward-compatibility reasons it generates a permutation of the whole input just to take a small sample. You should use the newer Generator API instead, which doesn't have this issue:

rng = np.random.default_rng()

idx = rng.choice(vec, 5000, replace=False)
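
For completeness, here is a sketch of the whole split using the Generator API. It samples positions rather than values, because np.delete interprets its second argument as positions, and those only coincide with the values when vec is an np.arange starting at 0. The seed and the names pos and mask are my own assumptions, not part of the original code:

```python
import numpy as np

rng = np.random.default_rng(12345)  # assumption: seeded only for repeatability
vec = np.arange(0, 25000)

# Draw 5000 distinct positions into vec (no duplicates with replace=False)
pos = rng.choice(len(vec), size=5000, replace=False)

# Boolean mask: True for the positions that were NOT sampled
mask = np.ones(len(vec), dtype=bool)
mask[pos] = False

idx = vec[pos]     # random subset
idx_r = vec[mask]  # complement, in the original order

print(len(idx), len(idx_r), len(idx) + len(idx_r) == len(vec))
```

If the order within each part does not matter, an alternative is to shuffle once with rng.permutation(vec) and slice the result at 5000.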

1 Comment

You're welcome. I'm just learning Numpy myself. Thanks for posting this. I didn't know about the methods you are using until now. Please mark it as the correct answer if it solved your issue.
