
If you don't care about the details of what I'm trying to implement, just skip past the lower horizontal line

I am trying to do a bootstrap error estimation on some statistic with NumPy. I have an array x and wish to compute the error on the statistic f(x), for which the usual Gaussian assumptions of error analysis do not hold. x is very large.

To do this, I resample x using numpy.random.choice(), where the size of my resample is the size of the original array, with replacement:

resample = np.random.choice(x, size=len(x), replace=True)

This gives me a new realization of x. This operation must now be repeated ~1,000 times to give an accurate error estimate. If I generate 1,000 resamples of this nature:

resamples = [np.random.choice(x, size=len(x), replace=True) for i in range(1000)]

and then compute the statistic f(x) on each realization:

results = [f(arr) for arr in resamples]

then I have inferred the error of f(x) to be something like

np.std(results)

the idea being that even though f(x) itself cannot be described using Gaussian error analysis, a distribution of f(x) measurements subject to random error can be.
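The steps above can be collected into one helper. This is only a sketch; `bootstrap_error` and its arguments are illustrative names, not from the question:

```python
import numpy as np

def bootstrap_error(x, f, num_samples=1000, seed=None):
    """Estimate the standard error of the statistic f(x) by bootstrapping.

    Sketch only: the function name and parameters are illustrative.
    """
    rng = np.random.default_rng(seed)
    results = np.empty(num_samples)
    for i in range(num_samples):
        # Resample x with replacement, same size as the original array.
        resample = rng.choice(x, size=len(x), replace=True)
        results[i] = f(resample)
    # The spread of f over the resamples estimates the error on f(x).
    return np.std(results)
```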


---

Okay, so that's a bootstrap. Now, my problem is that the line

resamples = [np.random.choice(x, size=len(x), replace=True) for i in range(1000)]

is very slow for large arrays. Is there a smarter way to do this without a list comprehension? The second list comprehension

results = [f(arr) for arr in resamples]

can be pretty slow too, depending on the details of the function f(x).

3 Comments

  • An approach that I've used, to avoid constructing the subsamples, is to instead create a flag array which is 1 if the corresponding datum is included in the sample and 0 if it is not. Then creating a subsample means permuting the flags. If allocating and filling the subsample array is a substantial part of the computation time, then working with flags is a win. Commented Oct 24, 2017 at 17:40
  • @RobertDodier Thanks! This is very similar to Divakar's answer below, though you use array masking rather than indexing. I'm not sure which of those would be faster - it's likely the same, I would think. Commented Oct 24, 2017 at 18:06
  • Yes, just using a list of indices is equivalent and maybe simpler if you just need to include/exclude data. I have often wanted weights other than 1 or 0, e.g. weight > 1 to represent sampling with replacement or emphasizing some data (e.g. a class which has few examples in the data) more than others. If you don't need non-0/1 weights, you can just use a list of indices. Commented Oct 24, 2017 at 18:35
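The weights idea from the comments can be sketched for the with-replacement case: rather than materializing a resampled copy of x, count how many times each index was drawn with np.bincount and feed those counts to a weighted statistic. A sketch, with illustrative variable names:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=5000)

# Draw indices with replacement, then record each datum's multiplicity
# instead of building the resampled array itself.
idx = rng.integers(0, len(x), size=len(x))
weights = np.bincount(idx, minlength=len(x))

# The weighted mean of x equals the plain mean of the materialized resample.
weighted_mean = np.average(x, weights=weights)
materialized_mean = x[idx].mean()
```

This only helps when the statistic has a weighted form; for a plain mean or sum it avoids allocating the resample entirely.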

2 Answers


Since we are allowing repetitions, we can generate all the indices in one go with np.random.randint and then simply index into x to get the equivalent of the resamples, like so -

num_samples = 1000
idx = np.random.randint(0,len(x),size=(num_samples,len(x)))
resamples_arr = x[idx]

One more approach would be to generate random numbers from a uniform distribution with numpy.random.rand and scale them by the length of the array, like so -

resamples_arr = x[(np.random.rand(num_samples,len(x))*len(x)).astype(int)]

Runtime test with x of 5000 elems -

In [221]: x = np.random.randint(0,10000,(5000))

# Original soln
In [222]: %timeit [np.random.choice(x, size=len(x), replace=True) for i in range(1000)]
10 loops, best of 3: 84 ms per loop

# Proposed soln-1
In [223]: %timeit x[np.random.randint(0,len(x),size=(1000,len(x)))]
10 loops, best of 3: 76.2 ms per loop

# Proposed soln-2
In [224]: %timeit x[(np.random.rand(1000,len(x))*len(x)).astype(int)]
10 loops, best of 3: 59.7 ms per loop

For very large x

With a very large array x of 600,000 elements, you might not want to create all those indices for 1000 samples at once. In that case, the per-sample timings would look something like this -

In [234]: x = np.random.randint(0,10000,(600000))

# Original soln
In [235]: %timeit np.random.choice(x, size=len(x), replace=True)
100 loops, best of 3: 13 ms per loop

# Proposed soln-1
In [238]: %timeit x[np.random.randint(0,len(x),len(x))]
100 loops, best of 3: 12.5 ms per loop

# Proposed soln-2
In [239]: %timeit x[(np.random.rand(len(x))*len(x)).astype(int)]
100 loops, best of 3: 9.81 ms per loop
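A middle ground between the fully vectorized and the per-sample versions is to generate the indices in chunks, so that only one chunk of indices lives in memory at a time. A sketch; the chunk size and the reduced sample count are illustrative choices, and f is a stand-in statistic that accepts an axis argument:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.integers(0, 10000, size=600_000)
num_samples, chunk = 100, 25   # reduced from 1000 samples for illustration
f = np.mean                     # stand-in statistic supporting axis=

results = np.empty(num_samples)
for start in range(0, num_samples, chunk):
    stop = min(start + chunk, num_samples)
    # Only (chunk, len(x)) indices are allocated at a time.
    idx = rng.integers(0, len(x), size=(stop - start, len(x)))
    results[start:stop] = f(x[idx], axis=1)

error = results.std()
```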

3 Comments

Ah, of course, thanks so much. I will likely implement the approach with rand. Also, I said x was "very big"; it is over 600,000 elements long. So, by extrapolating your timing results, I could save around 4 hours of run time.
@jphollowed Wow with such a large x, you might not get to create all the indices as proposed in this post. But, give it a try I guess if you have enough system memory.
@jphollowed Also check out For very large x section.

As alluded to by @Divakar, you can pass a tuple to size to get a 2D array of resamples rather than using a list comprehension.

Here, assume for a second that f is just np.sum rather than some other function. Then:

x = np.random.randn(100000)
f = np.sum  # stand-in for your statistic
resamples = np.random.choice(x, size=(1000, x.shape[0]), replace=True)
# resamples.shape = (1000, 100000)
results = np.apply_along_axis(f, axis=1, arr=resamples)
print(results.shape)
# (1000,)

Here np.apply_along_axis is admittedly just a glorified for-loop, equivalent to [f(arr) for arr in resamples]. But I am not exactly sure whether you need to index x here, based on your question.
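On NumPy 1.17+ the same idea works with the newer Generator API, whose choice method also accepts a tuple size. A sketch with smaller sizes for illustration:

```python
import numpy as np

rng = np.random.default_rng(42)
x = rng.standard_normal(10_000)

# One call yields all realizations: each row is one resample of x.
resamples = rng.choice(x, size=(1000, x.shape[0]), replace=True)
results = resamples.sum(axis=1)   # f = sum, applied along each row
```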

4 Comments

This is very simple and substantially faster, about 4 times. My original resampling code takes around 16 ms per loop, and specifying the size as a tuple takes less than 4 ms.
The only problem with this is that it does not work in general: if f(x) happens to be a second-order statistic, I want to sample without replacement. The desired behavior would be for each row specified by the size tuple to forbid replacement internally, while not minding whether it shares samples with other rows. This prevents me from using this solution with replace=False, since I get ValueError: Cannot take a larger sample than population when 'replace=False'
Simply put, this method cannot be used to take N resample realizations of a dataset where each realization is sampled without replacement.
@jphollowed For replace=False , a vectorized approach would be - stackoverflow.com/a/45438143
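For reference, one common vectorized trick for without-replacement realizations, in the spirit of the linked answer, is to argsort a random matrix: each row of the argsort is then an independent permutation of the indices. A sketch; num_samples and k are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.arange(20)
num_samples, k = 5, 10   # k draws without replacement per realization

# Each row of the argsort is an independent random permutation of 0..len(x)-1.
perm = rng.random((num_samples, len(x))).argsort(axis=1)
resamples = x[perm[:, :k]]   # each row holds k distinct elements of x
```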
