
I am trying to sample data from a list of integers. The tricky part is that each sample should have a different size, to emulate some other data I have. A for loop does the job, but I was wondering whether there are faster ways I am not aware of.

Since I think random.sample is supposed to be fast, I am doing:

import random

result = []
for i in range(100000):
    size = list_of_sizes[i]
    result.append(random.sample(data, size))

So the result I get is something like:

>>> list_of_sizes
    [3, 4, 1, 2,...]

>>> result
    [[1, 2, 3],
     [3, 6, 2, 8],
     [9],
     [10, 100],
     ...]

I have tried np.random.choice(data, size, replace=False) and random.sample(data, k=size), but neither accepts an array of different sizes, so the operation cannot be vectorized this way (when np.random.choice is given an array for its size parameter, it treats it as the output shape and returns a tensor of that shape, not a list of variable-length samples). Ideally, I would expect something like:

>>> np.random.choice(data, list_of_sizes, replace=False)
    [[1, 2, 3],
     [3, 6, 2, 8],
     [9],
     [10, 100],
     ...]
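
To illustrate the shape behaviour mentioned above, here is a minimal sketch (data is a stand-in array): passing a list as size to the legacy np.random.choice is interpreted as an output shape, not as per-sample sizes.

import numpy as np

data = np.arange(1_000_000)

# `size` is treated as an output *shape*, not as a list of per-sample sizes:
out = np.random.choice(data, size=[3, 4], replace=False)
print(out.shape)  # (3, 4) -- a single 3x4 array, not two samples of sizes 3 and 4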
  • Do you need different samples to be independent? Commented Feb 26, 2020 at 16:53
  • Also, do you expect the list of sizes to have many repeated values? Commented Feb 26, 2020 at 16:55
  • @hilberts_drinking_problem I would expect each sample to be independent and represent a bigger population. The sizes are randomly generated following a distribution of the real data, so I do expect repetitions. But the sizes are bounded between [0, ~70000] and for the moment I generate an array of 100k, so there shouldn't be many repetitions Commented Feb 26, 2020 at 17:09

2 Answers


It seems that np.random.choice is indeed not optimized for sampling without replacement. However, you can get better performance by using Generator.choice, as discussed here.

I see a 14x speedup for your parameters:

import numpy as np

data = np.arange(10**6)
sample_sizes = np.random.randint(1, 70_000, 100)

def f(data, sample_sizes):
    # legacy RandomState-based API, appending to a list
    result = []
    for s in sample_sizes:
        result.append(np.random.choice(data, s, replace=False))
    return result

def f2(data, sample_sizes):
    # new Generator-based API: sample indices, then index into data
    g = np.random.Generator(np.random.PCG64())
    n = data.shape[0]
    return [data[g.choice(n, k, replace=False)] for k in sample_sizes]

%timeit f(data, sample_sizes)
%timeit f2(data, sample_sizes)
1 loop, best of 3: 5.18 s per loop
1 loop, best of 3: 375 ms per loop

3 Comments

I get a better result using random.sample than np.random.Generator(np.random.PCG64()), but it was useful to learn about the newer generators, and it also reminded me of the list comprehension, which helps a lot. Thank you
The list comprehension vs. append-in-a-for-loop also adds significantly to the speed-up, since the comprehension can allocate all the memory at once, using the size hint from sample_sizes, while the loop version has to resize and copy the underlying list multiple times as it grows (and 70,000 items means quite a number of reallocations). So for a fair comparison both functions should use the same code except for the RNG. But anyway, most of the speed-up does indeed seem to come from Generator.
Right, I realize that my benchmark is not entirely fair. My lesson for today is that np.random.choice has pretty disappointing performance for sparse sampling without replacement.
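
For context, below is a sketch of the fairer comparison these comments describe: every variant uses the same list comprehension and only the sampler changes. The function names and the plain-list copy for random.sample are illustrative, not from the answer.

import random
import numpy as np

data = np.arange(10**6)
data_list = data.tolist()  # random.sample expects a plain Python sequence
sample_sizes = np.random.randint(1, 70_000, 100).tolist()

def legacy_choice(data, sample_sizes):
    # legacy RandomState-based API
    return [np.random.choice(data, k, replace=False) for k in sample_sizes]

def generator_choice(data, sample_sizes):
    # new Generator API; default_rng() also uses the PCG64 bit generator
    g = np.random.default_rng()
    n = data.shape[0]
    return [data[g.choice(n, k, replace=False)] for k in sample_sizes]

def stdlib_sample(data_list, sample_sizes):
    # pure-Python sampling, as mentioned in the first comment
    return [random.sample(data_list, k) for k in sample_sizes]

Timing these three on the same inputs (e.g. with %timeit) isolates the effect of the RNG from the effect of the list-building strategy.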

Depending on your hardware as well as data sizes, using multiprocessing can give a sizeable speed-up. This needs to be estimated for your specific problem setup, however. For example using multiprocessing.pool.Pool:

from functools import partial
from multiprocessing.pool import Pool
from random import sample  # `data` and `sizes` are as in the benchmark below

with Pool() as pool:
    result = pool.map(partial(sample, data), sizes)

Performance comparison

Here are some example results (using 4 CPU cores):

from functools import partial
from multiprocessing.pool import Pool
from random import choices, sample
from statistics import mean, stdev
import time


def baseline(data, sizes):
    return [sample(data, k) for k in sizes]


def multiprocessing(data, sizes):
    with Pool(4) as pool:
        return pool.map(partial(sample, data), sizes)


def timeit(f, *args, n=7):
    timings = []
    for __ in range(n):
        t_start = time.time()  # using time because of multiprocessing
        f(*args)
        t_stop = time.time()
        timings.append(t_stop - t_start)
    print(f'[{f.__name__}] {mean(timings):.2f} +/- {stdev(timings):.2f} s')


data = list(range(1_000_000))
sizes = choices(range(max(data) // 100), k=1_000)

timeit(baseline, data, sizes)
timeit(multiprocessing, data, sizes)

which gives:

[baseline] 3.19 +/- 0.07 s
[multiprocessing] 2.10 +/- 0.02 s

But again, this depends on hardware and data, so it needs to be checked on each individual system.

1 Comment

I also get a significant speedup by dropping partial and defining the sampling function so that it uses data from the global scope.
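
A minimal sketch of what that comment might look like in code; the function name is hypothetical, and it assumes a fork-based start method so that workers inherit the module-level data instead of having it pickled with every task.

from multiprocessing.pool import Pool
from random import choices, sample

# module-level data: with the fork start method (the default on Linux),
# worker processes inherit this instead of receiving it with each task
data = list(range(1_000_000))
sizes = choices(range(10_000), k=1_000)

def sample_from_global(k):
    # hypothetical helper: reads `data` from the global scope
    # instead of receiving it through functools.partial
    return sample(data, k)

if __name__ == '__main__':
    with Pool(4) as pool:
        result = pool.map(sample_from_global, sizes)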
