
If you don't care about the details of what I'm trying to implement, just skip past the lower horizontal line

I am trying to do a bootstrap error estimation on some statistic with NumPy. I have an array x and wish to compute the error on the statistic f(x), for which the usual Gaussian assumptions of error analysis do not hold. x is very large.

To do this, I resample x using numpy.random.choice(), where the size of my resample is the size of the original array, with replacement:

resample = np.random.choice(x, size=len(x), replace=True)

This gives me a new realization of x. This operation must now be repeated ~1,000 times to give an accurate error estimate. If I generate 1,000 resamples of this nature:

resamples = [np.random.choice(x, size=len(x), replace=True) for i in range(1000)]

and then compute the statistic f(x) on each realization:

results = [f(arr) for arr in resamples]

then I have inferred the error of f(x) to be something like

np.std(results)

the idea being that even though f(x) itself cannot be described using Gaussian error analysis, a distribution of f(x) measurements subject to random error can be.
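The steps above can be collected into one helper. This is only a sketch; `bootstrap_error` and its arguments are illustrative names, not from the question:

```python
import numpy as np

def bootstrap_error(x, f, num_samples=1000, seed=None):
    """Estimate the standard error of the statistic f(x) by bootstrapping.

    Sketch only: the function name and parameters are illustrative.
    """
    rng = np.random.default_rng(seed)
    results = np.empty(num_samples)
    for i in range(num_samples):
        # Resample x with replacement, same size as the original array.
        resample = rng.choice(x, size=len(x), replace=True)
        results[i] = f(resample)
    # The spread of f over the resamples estimates the error on f(x).
    return np.std(results)
```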


---

Okay, so that's a bootstrap. Now, my problem is that the line

resamples = [np.random.choice(x, size=len(x), replace=True) for i in range(1000)]

is very slow for large arrays. Is there a smarter way to do this without a list comprehension? The second list comprehension

results = [f(arr) for arr in resamples]

can be pretty slow too, depending on the details of the function f(x).

3 Comments

  • An approach that I've used, to avoid constructing the subsamples, is to instead create a flag array which is 1 if the corresponding datum is included in the sample and 0 if it is not. Then creating a subsample means permuting the flags. If allocating and filling the subsample array is a substantial part of the computation time, then working with flags is a win. Commented Oct 24, 2017 at 17:40
  • @RobertDodier Thanks! This is very similar to Divakar's answer below, though you use array masking rather than indexing. I'm not sure which of those would be faster - it's likely the same, I would think. Commented Oct 24, 2017 at 18:06
  • Yes, just using a list of indices is equivalent and maybe simpler if you just need to include/exclude data. I have often wanted weights other than 1 or 0, e.g. weight > 1 to represent sampling with replacement or emphasizing some data (e.g. a class which has few examples in the data) more than others. If you don't need non-0/1 weights, you can just use a list of indices. Commented Oct 24, 2017 at 18:35
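The weights idea from the comments can be sketched for the with-replacement case: rather than materializing a resampled copy of x, count how many times each index was drawn with np.bincount and feed those counts to a weighted statistic. A sketch, with illustrative variable names:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=5000)

# Draw indices with replacement, then record each datum's multiplicity
# instead of building the resampled array itself.
idx = rng.integers(0, len(x), size=len(x))
weights = np.bincount(idx, minlength=len(x))

# The weighted mean of x equals the plain mean of the materialized resample.
weighted_mean = np.average(x, weights=weights)
materialized_mean = x[idx].mean()
```

This only helps when the statistic has a weighted form; for a plain mean or sum it avoids allocating the resample entirely.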

2 Answers


Since we are allowing repetitions, we can generate all the indices in one go with np.random.randint and then simply index into x to get the equivalent of the resamples, like so -

num_samples = 1000
idx = np.random.randint(0,len(x),size=(num_samples,len(x)))
resamples_arr = x[idx]

One more approach would be to generate random numbers from a uniform distribution with numpy.random.rand and scale them by the length of the array, like so -

resamples_arr = x[(np.random.rand(num_samples,len(x))*len(x)).astype(int)]

Runtime test with x of 5000 elems -

In [221]: x = np.random.randint(0,10000,(5000))

# Original soln
In [222]: %timeit [np.random.choice(x, size=len(x), replace=True) for i in range(1000)]
10 loops, best of 3: 84 ms per loop

# Proposed soln-1
In [223]: %timeit x[np.random.randint(0,len(x),size=(1000,len(x)))]
10 loops, best of 3: 76.2 ms per loop

# Proposed soln-2
In [224]: %timeit x[(np.random.rand(1000,len(x))*len(x)).astype(int)]
10 loops, best of 3: 59.7 ms per loop

For very large x

With a very large array x of 600,000 elements, you might not want to create all those indices for 1000 samples at once. In that case, the per-sample timings would look something like this -

In [234]: x = np.random.randint(0,10000,(600000))

# Original soln
In [235]: %timeit np.random.choice(x, size=len(x), replace=True)
100 loops, best of 3: 13 ms per loop

# Proposed soln-1
In [238]: %timeit x[np.random.randint(0,len(x),len(x))]
100 loops, best of 3: 12.5 ms per loop

# Proposed soln-2
In [239]: %timeit x[(np.random.rand(len(x))*len(x)).astype(int)]
100 loops, best of 3: 9.81 ms per loop
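A middle ground between the fully vectorized and the per-sample versions is to generate the indices in chunks, so that only one chunk of indices lives in memory at a time. A sketch; the chunk size and the reduced sample count are illustrative choices, and f is a stand-in statistic that accepts an axis argument:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.integers(0, 10000, size=600_000)
num_samples, chunk = 100, 25   # reduced from 1000 samples for illustration
f = np.mean                     # stand-in statistic supporting axis=

results = np.empty(num_samples)
for start in range(0, num_samples, chunk):
    stop = min(start + chunk, num_samples)
    # Only (chunk, len(x)) indices are allocated at a time.
    idx = rng.integers(0, len(x), size=(stop - start, len(x)))
    results[start:stop] = f(x[idx], axis=1)

error = results.std()
```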

3 Comments

Ah, of course, thanks so much. I will likely implement the approach with rand. Also, I said x was "very big"; it is over 600,000 elements long. So, by extrapolating your timing results, I could save around 4 hours of run time.
@jphollowed Wow with such a large x, you might not get to create all the indices as proposed in this post. But, give it a try I guess if you have enough system memory.
@jphollowed Also check out For very large x section.

As alluded to by @Divakar, you can pass a tuple to size to get a 2D array of resamples rather than using a list comprehension.

Here, assume for a second that f is just np.sum rather than some other function. Then:

x = np.random.randn(100000)
f = np.sum  # stand-in for your statistic
resamples = np.random.choice(x, size=(1000, x.shape[0]), replace=True)
# resamples.shape = (1000, 100000)
results = np.apply_along_axis(f, axis=1, arr=resamples)
print(results.shape)
# (1000,)

Here np.apply_along_axis is admittedly just a glorified for-loop, equivalent to [f(arr) for arr in resamples]. But I am not exactly sure whether you need to index x here, based on your question.
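On NumPy 1.17+ the same idea works with the newer Generator API, whose choice method also accepts a tuple size. A sketch with smaller sizes for illustration:

```python
import numpy as np

rng = np.random.default_rng(42)
x = rng.standard_normal(10_000)

# One call yields all realizations: each row is one resample of x.
resamples = rng.choice(x, size=(1000, x.shape[0]), replace=True)
results = resamples.sum(axis=1)   # f = sum, applied along each row
```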

4 Comments

This is very simple and substantially faster, about 4 times. My original resampling code takes around 16 ms per loop, and specifying the size as a tuple takes less than 4 ms.
The only problem with this is that it does not work in general: if f(x) happens to be a second-order statistic, I want to sample without replacement. The desired behavior would be for each row specified by the size tuple to forbid replacement internally, while not minding whether it shares samples with other rows. This prevents me from using this solution with replace=False, since I get ValueError: Cannot take a larger sample than population when 'replace=False'
Simply put, this method cannot be used to take N resample realizations of a dataset where each realization is sampled without replacement.
@jphollowed For replace=False , a vectorized approach would be - stackoverflow.com/a/45438143
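For reference, one common vectorized trick for without-replacement realizations, in the spirit of the linked answer, is to argsort a random matrix: each row of the argsort is then an independent permutation of the indices. A sketch; num_samples and k are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.arange(20)
num_samples, k = 5, 10   # k draws without replacement per realization

# Each row of the argsort is an independent random permutation of 0..len(x)-1.
perm = rng.random((num_samples, len(x))).argsort(axis=1)
resamples = x[perm[:, :k]]   # each row holds k distinct elements of x
```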
