1

Given a data series representing frequencies of elements in a population, what would be the easiest way to downsample it ?

The following population: pop = ['a', 'b', 'a', 'c', 'c', 'd', 'c', 'a', 'a', 'b', 'a']

Can be summeriezed as: freq = {'a': 5, 'c': 3, 'b': 2, 'd': 1}

Using the simple: from collections import Counter; Counter(pop)

To randomly downsample that population to 5 individuals I can do:

>>> from random import sample
>>> from collections import Counter
>>> pop = ['a', 'b', 'a', 'c', 'c', 'd', 'c', 'a', 'a', 'b', 'a']
>>> smaller_pop = sample(pop, 5)
>>> smaller_freq = Counter(smaller_pop)
>>> print smaller_freq
Counter({'a': 3, 'c': 1, 'b': 1})

But I'm searching for a way to do this directly from the freq information without having to build the pop list. You will agree that proceeding like this should not be necessary:

>>> from random import sample
>>> from collections import Counter
>>> flatten = lambda x: [item for sublist in x for item in sublist]
>>> freq = {'a': 5, 'c': 3, 'b': 2, 'd': 1}
>>> pop = flatten([[k]*v for k,v in freq.items()])
>>> smaller_pop = sample(pop, 5)
>>> smaller_freq = Counter(smaller_pop)
>>> print smaller_freq
Counter({'a': 2, 'c': 2, 'd': 1})

For memory considerations and speed requirements, I would like to avoid placing in memory the pop list. This can surely be done using some type of weighted random generator.

2
  • Would you be interested in an approximate solution? Since you are interested in downsampling, presumably the values in freq are very large. If so, you could floordiv (//) them all by some factor, then enumerate the smaller freq dict, and take a sample of that. Commented Jun 13, 2013 at 14:20
  • An exact algorithm is more what I'm looking for. Commented Jun 13, 2013 at 14:22

1 Answer 1

1

Here is a basic algorithm that downsamples frequencies:

import random
import bisect
import collections

def downsample(freq, n):
    cumsums = []
    total = 0
    choices, weights = zip(*freq.items())
    for weight in weights:
        total += weight
        cumsums.append(total)
    assert 0 <= n <= total
    result = collections.Counter()
    for _ in range(n):
        rnd = random.uniform(0, total)
        i = bisect.bisect(cumsums, rnd)
        result[choices[i]] += 1
        cumsums = [c if idx<i else c-1 for idx, c in enumerate(cumsums)]
        total -= 1
    return result

freq = {'a': 5, 'c': 3, 'b': 2, 'd': 1}
print(downsample(freq, 5))

which prints results like

Counter({'c': 2, 'a': 1, 'b': 1, 'd': 1})
Sign up to request clarification or add additional context in comments.

1 Comment

That's it, using the bisect. That seems logical now that I see it. Thanks.

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.