
I am trying to sample data from a list of integers. The tricky part is that each sample should have a different size, to emulate some other data I have. A for loop does the job, but I was wondering whether there are faster ways I am not aware of.

Since I think random.sample is supposed to be fast, I am doing:

import random

result = []
for i in range(100000):
    size = list_of_sizes[i]
    result.append(random.sample(data, size))

So the result I get is something like:

>>> list_of_sizes
    [3, 4, 1, 2,...]

>>> result
    [[1, 2, 3],
     [3, 6, 2, 8],
     [9],
     [10, 100],
     ...]

I have tried np.random.choice(data, size, replace=False) and random.sample(data, k=size), but neither accepts an array of different sizes, so the operation cannot be vectorized this way (when np.random.choice is given an array for its size parameter, it treats it as the output shape and returns a tensor of that shape, not a list of variable-length samples). Ideally, I would expect something like:

>>> np.random.choice(data, list_of_sizes, replace=False)
    [[1, 2, 3],
     [3, 6, 2, 8],
     [9],
     [10, 100],
     ...]
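
To illustrate the shape behaviour mentioned above, here is a minimal sketch (data is a stand-in array): passing a list as size to the legacy np.random.choice is interpreted as an output shape, not as per-sample sizes.

import numpy as np

data = np.arange(1_000_000)

# `size` is treated as an output *shape*, not as a list of per-sample sizes:
out = np.random.choice(data, size=[3, 4], replace=False)
print(out.shape)  # (3, 4) -- a single 3x4 array, not two samples of sizes 3 and 4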
  • Do you need different samples to be independent? Commented Feb 26, 2020 at 16:53
  • Also, do you expect the list of sizes to have many repeated values? Commented Feb 26, 2020 at 16:55
  • @hilberts_drinking_problem I would expect each sample to be independent and represent a bigger population. The sizes are randomly generated following a distribution of the real data, so I do expect repetitions. But the sizes are bounded between [0, ~70000] and for the moment I generate an array of 100k, so there shouldn't be many repetitions Commented Feb 26, 2020 at 17:09

2 Answers


It seems that np.random.choice is indeed not optimized for sampling without replacement. However, you can get better performance by using Generator.choice, as discussed here.

I see a 14x speedup for your parameters:

import numpy as np

data = np.arange(10**6)
sample_sizes = np.random.randint(1, 70_000, 100)

def f(data, sample_sizes):
    # legacy RandomState-based API, appending to a list
    result = []
    for s in sample_sizes:
        result.append(np.random.choice(data, s, replace=False))
    return result

def f2(data, sample_sizes):
    # new Generator-based API: sample indices, then index into data
    g = np.random.Generator(np.random.PCG64())
    n = data.shape[0]
    return [data[g.choice(n, k, replace=False)] for k in sample_sizes]

%timeit f(data, sample_sizes)
%timeit f2(data, sample_sizes)
1 loop, best of 3: 5.18 s per loop
1 loop, best of 3: 375 ms per loop

3 Comments

I get a better result using random.sample than np.random.Generator(np.random.PCG64()), but it was useful to learn about the newer generators, and it also reminded me of the list comprehension, which helps a lot. Thank you
The list comprehension vs. append-in-a-for-loop also adds significantly to the speed-up, since the comprehension can allocate all the memory at once, using the size hint from sample_sizes, while the loop version has to resize and copy the underlying list multiple times as it grows (and 70,000 items means quite a number of reallocations). So for a fair comparison both functions should use the same code except for the RNG. But anyway, most of the speed-up does indeed seem to come from Generator.
Right, I realize that my benchmark is not entirely fair. My lesson for today is that np.random.choice has pretty disappointing performance for sparse sampling without replacement.
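
For context, below is a sketch of the fairer comparison these comments describe: every variant uses the same list comprehension and only the sampler changes. The function names and the plain-list copy for random.sample are illustrative, not from the answer.

import random
import numpy as np

data = np.arange(10**6)
data_list = data.tolist()  # random.sample expects a plain Python sequence
sample_sizes = np.random.randint(1, 70_000, 100).tolist()

def legacy_choice(data, sample_sizes):
    # legacy RandomState-based API
    return [np.random.choice(data, k, replace=False) for k in sample_sizes]

def generator_choice(data, sample_sizes):
    # new Generator API; default_rng() also uses the PCG64 bit generator
    g = np.random.default_rng()
    n = data.shape[0]
    return [data[g.choice(n, k, replace=False)] for k in sample_sizes]

def stdlib_sample(data_list, sample_sizes):
    # pure-Python sampling, as mentioned in the first comment
    return [random.sample(data_list, k) for k in sample_sizes]

Timing these three on the same inputs (e.g. with %timeit) isolates the effect of the RNG from the effect of the list-building strategy.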

Depending on your hardware as well as data sizes, using multiprocessing can give a sizeable speed-up. This needs to be estimated for your specific problem setup, however. For example using multiprocessing.pool.Pool:

from functools import partial
from multiprocessing.pool import Pool
from random import sample  # `data` and `sizes` are as in the benchmark below

with Pool() as pool:
    result = pool.map(partial(sample, data), sizes)

Performance comparison

Here are some example results (using 4 CPU cores):

from functools import partial
from multiprocessing.pool import Pool
from random import choices, sample
from statistics import mean, stdev
import time


def baseline(data, sizes):
    return [sample(data, k) for k in sizes]


def multiprocessing(data, sizes):
    with Pool(4) as pool:
        return pool.map(partial(sample, data), sizes)


def timeit(f, *args, n=7):
    timings = []
    for __ in range(n):
        t_start = time.time()  # using time because of multiprocessing
        f(*args)
        t_stop = time.time()
        timings.append(t_stop - t_start)
    print(f'[{f.__name__}] {mean(timings):.2f} +/- {stdev(timings):.2f} s')


data = list(range(1_000_000))
sizes = choices(range(max(data) // 100), k=1_000)

timeit(baseline, data, sizes)
timeit(multiprocessing, data, sizes)

which gives:

[baseline] 3.19 +/- 0.07 s
[multiprocessing] 2.10 +/- 0.02 s

But again, this depends on hardware and data, so it needs to be checked on each individual system.

1 Comment

I also get a significant speedup by dropping partial and defining the sampling function so that it uses data from the global scope.
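
A minimal sketch of what that comment might look like in code; the function name is hypothetical, and it assumes a fork-based start method so that workers inherit the module-level data instead of having it pickled with every task.

from multiprocessing.pool import Pool
from random import choices, sample

# module-level data: with the fork start method (the default on Linux),
# worker processes inherit this instead of receiving it with each task
data = list(range(1_000_000))
sizes = choices(range(10_000), k=1_000)

def sample_from_global(k):
    # hypothetical helper: reads `data` from the global scope
    # instead of receiving it through functools.partial
    return sample(data, k)

if __name__ == '__main__':
    with Pool(4) as pool:
        result = pool.map(sample_from_global, sizes)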
