Python: downsample a population using frequency data

Question

Given a data series representing frequencies of elements in a population, what would be the easiest way to downsample it ?

The following population: pop = ['a', 'b', 'a', 'c', 'c', 'd', 'c', 'a', 'a', 'b', 'a']

Can be summeriezed as: freq = {'a': 5, 'c': 3, 'b': 2, 'd': 1}

Using the simple: from collections import Counter; Counter(pop)

To randomly downsample that population to 5 individuals I can do:

>>> from random import sample
>>> from collections import Counter
>>> pop = ['a', 'b', 'a', 'c', 'c', 'd', 'c', 'a', 'a', 'b', 'a']
>>> smaller_pop = sample(pop, 5)
>>> smaller_freq = Counter(smaller_pop)
>>> print smaller_freq
Counter({'a': 3, 'c': 1, 'b': 1})

But I'm searching for a way to do this directly from the freq information without having to build the pop list. You will agree that proceeding like this should not be necessary:

>>> from random import sample
>>> from collections import Counter
>>> flatten = lambda x: [item for sublist in x for item in sublist]
>>> freq = {'a': 5, 'c': 3, 'b': 2, 'd': 1}
>>> pop = flatten([[k]*v for k,v in freq.items()])
>>> smaller_pop = sample(pop, 5)
>>> smaller_freq = Counter(smaller_pop)
>>> print smaller_freq
Counter({'a': 2, 'c': 2, 'd': 1})

For memory considerations and speed requirements, I would like to avoid placing in memory the pop list. This can surely be done using some type of weighted random generator.

Would you be interested in an approximate solution? Since you are interested in downsampling, presumably the values in freq are very large. If so, you could floordiv (//) them all by some factor, then enumerate the smaller freq dict, and take a sample of that. — unutbu
– unutbu, Commented Jun 13, 2013 at 14:20

unutbu · Accepted Answer · 2013-06-14 10:22:57Z

1

Here is a basic algorithm that downsamples frequencies:

import random
import bisect
import collections

def downsample(freq, n):
    cumsums = []
    total = 0
    choices, weights = zip(*freq.items())
    for weight in weights:
        total += weight
        cumsums.append(total)
    assert 0 <= n <= total
    result = collections.Counter()
    for _ in range(n):
        rnd = random.uniform(0, total)
        i = bisect.bisect(cumsums, rnd)
        result[choices[i]] += 1
        cumsums = [c if idx<i else c-1 for idx, c in enumerate(cumsums)]
        total -= 1
    return result

freq = {'a': 5, 'c': 3, 'b': 2, 'd': 1}
print(downsample(freq, 5))

which prints results like

Counter({'c': 2, 'a': 1, 'b': 1, 'd': 1})

edited Jun 14, 2013 at 10:22

answered Jun 13, 2013 at 14:31

unutbu

886k197 gold badges1.9k silver badges1.7k bronze badges

Sign up to request clarification or add additional context in comments.

1 Comment

xApple Over a year ago

That's it, using the bisect. That seems logical now that I see it. Thanks.

Collectives™ on Stack Overflow

Python: downsample a population using frequency data

1 Answer 1

1 Comment

Your Answer

Hot Network Questions

Collectives™ on Stack Overflow

1 Answer 1

1 Comment

Your Answer

Sign up or log in

Post as a guest

Related