Python balancing items in a list/numpy array

Question

I have an array of tokens, and each token corresponds to a class from 1 to n. I need to balance the tokens array/list so that there is an equal number of tokens for each class. I want to do this by removing elements from tokens.

In the example below the class with the lowest number of tokens is class 2 which has only 2 tokens. So I want to remove elements from other classes until their count is also 2.

e.g.

tokens  = array(['a','b','c','d','e','f','g','h','l'])

classes = array([ 1 , 1 , 1 , 1 , 2 , 2 , 3 , 3 , 3])

In this example, the classes are listed in ascending order (for clarity of task) but in reality, the classes are in no particular order.

e.g.

sol = array(['c','d','e','f','g','h'])

or

sol = array(['a','b','e','f','g','h'])

etc.

Obviously because you have a choice of elements to remove in an excess class, you can have different solutions (like above). I need a function that can take the tokens and classes and output a sol.

Do you want to get one random solution or a deterministic one?
– javidcf, Commented Aug 13, 2019 at 9:30
Unless this is the one and only time you will ever touch Python, I suggest you try solving it yourself first; otherwise you will learn very little, if anything.
– IcedLance, Commented Aug 13, 2019 at 9:49

Andrej Kesely · Accepted Answer · 2019-08-13 09:55:56Z

A solution with Counter:

tokens = ['a','b','c','d','e','f','g','h','l']
lst    = [ 1 , 1 , 1 , 1 , 2 , 2 , 3 , 3 , 3]

from collections import Counter

c = Counter(lst)
min_cnt = min(c.values())
new_lst = list( zip(tokens, lst) )

while True:
    tmp = []
    should_break = True
    for t, i in new_lst:
        if c[i] > min_cnt:
            c[i] -= 1
            should_break = False
        else:
            tmp.append( (t, i) )

    new_lst = tmp

    if should_break:
        break

print([t for t, _ in new_lst])

Prints:

['c', 'd', 'e', 'f', 'h', 'l']
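
For comparison, here is a minimal single-pass sketch of the same counting idea. It is not the answer's code: the function name `balance_first` is made up for illustration, and it keeps the first `min_cnt` tokens of each class rather than the last ones, so the selection differs from the output above.

```python
from collections import Counter

def balance_first(tokens, classes):
    """Keep the first min-count tokens of every class, preserving order."""
    # Size of the smallest class: every class is cut down to this many tokens
    min_cnt = min(Counter(classes).values())
    kept = Counter()  # how many tokens of each class have been kept so far
    out = []
    for tok, cls in zip(tokens, classes):
        if kept[cls] < min_cnt:
            kept[cls] += 1
            out.append(tok)
    return out

tokens = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'l']
classes = [1, 1, 1, 1, 2, 2, 3, 3, 3]
print(balance_first(tokens, classes))
# ['a', 'b', 'e', 'f', 'g', 'h']
```

Because it counts what has been kept instead of what remains to be dropped, one pass over the data is enough, and it does not require the classes to be sorted.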

Another possible solution, with groupby:

tokens = ['a','b','c','d','e','f','g','h','l']
lst    = [ 1 , 1 , 1 , 1 , 2 , 2 , 3 , 3 , 3]

from collections import Counter
from itertools import groupby, islice

c = Counter(lst)
min_cnt = min(c.values())

out = []
for v, g in groupby(sorted(enumerate(zip(tokens, lst)), key=lambda k: k[1][1]), lambda k: k[1][1]):
    out.extend(islice(g, 0, min_cnt))

print( [val for _, (val, _) in sorted(out, key=lambda k: k[0])] )

Prints:

['a', 'b', 'e', 'f', 'g', 'h']

javidcf · Accepted Answer · 2019-08-13 09:53:03Z

Here is a way to do that with NumPy. This will always select the first appearances of each class.

import numpy as np

def balance(tokens, classes):
    # Count appearances of each class
    c = np.bincount(classes - 1)
    n = c.min()
    # Accumulated counts for each class shifted one position
    cs = np.roll(np.cumsum(c), 1)
    cs[0] = 0
    # Stable sort by class, so this also works when classes are not in order
    s = np.argsort(classes, kind='stable')
    # Compute appearance index for each class
    i = np.arange(len(classes)) - cs[classes[s] - 1]
    # Mask excessive appearances and map back to original positions
    m = np.sort(s[i < n])
    # Return corresponding tokens
    return tokens[m]

tokens  = np.array(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'l'])
classes = np.array([  1,   1,   1,   1,   2,   2,   3,   3,   3])
print(balance(tokens, classes))
# ['a' 'b' 'e' 'f' 'g' 'h']

As it stands, the function returns an empty array when some class in 1 to n is completely missing (the minimum number of appearances would then be zero, so no class would appear in the solution), but you can adapt that if needed.
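
One possible adaptation, as a sketch only: `np.unique` counts just the labels that actually occur, so an absent class no longer forces the minimum to zero, and the labels do not have to be the integers 1 to n. The name `balance_any_labels` is hypothetical, and the function assumes a non-empty 1-D input.

```python
import numpy as np

def balance_any_labels(tokens, classes):
    """Keep the first n tokens of every class that occurs in `classes`."""
    tokens = np.asarray(tokens)
    classes = np.asarray(classes)
    # inverse[k] is the index of classes[k] among the sorted unique labels
    _, inverse, counts = np.unique(classes, return_inverse=True,
                                   return_counts=True)
    n = counts.min()
    # Stable sort groups equal labels while keeping their original order
    order = np.argsort(inverse, kind='stable')
    # Position in the sorted array where each class starts
    starts = np.cumsum(counts) - counts
    # Appearance index of each element within its own class
    rank = np.arange(len(classes)) - starts[inverse[order]]
    # Keep the first n of each class and restore the original order
    keep = np.sort(order[rank < n])
    return tokens[keep]

tokens = np.array(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'l'])
classes = np.array([3, 1, 1, 5, 5, 1, 3, 5, 1])  # unsorted, labels 2 and 4 unused
print(balance_any_labels(tokens, classes))
# ['a' 'b' 'c' 'd' 'e' 'g']
```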

Sayandip Dutta · Accepted Answer · 2019-08-13 09:53:19Z

Another solution with Counter:

import random
from collections import Counter
import numpy as np

tokens  = np.array(['a','b','c','d','e','f','g','h','l'])
classes = np.array([ 1 , 1 , 1 , 1 , 2 , 2 , 3 , 3 , 3])

def sampling(tokens, classes):
    dc = {}
    sol = []
    for i in range(len(classes)):
        if classes[i] in dc:
            dc[classes[i]].append(tokens[i])
        else:
            dc[classes[i]] = [tokens[i]]
    sample_counts = Counter(classes)
    min_sample = min(sample_counts.values())
    for i in dc:
        sol += (random.sample(dc[i],min_sample))
    return sol

print(sampling(tokens, classes))

>>> ['d', 'a', 'f', 'e', 'g', 'h']
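
If a random but reproducible selection that keeps the original token order is wanted, something along these lines might do. This is a sketch, not part of the answer: the function name `balanced_sample` and the `seed` parameter are assumptions made for illustration.

```python
import random
from collections import defaultdict

def balanced_sample(tokens, classes, seed=None):
    """Randomly keep the same number of tokens per class, in original order."""
    rng = random.Random(seed)  # private generator, so a seed gives repeatable picks
    positions = defaultdict(list)
    for idx, cls in enumerate(classes):
        positions[cls].append(idx)
    # Size of the smallest class
    n = min(len(p) for p in positions.values())
    # Sample n positions per class, then sort to restore the original order
    keep = sorted(i for p in positions.values() for i in rng.sample(p, n))
    return [tokens[i] for i in keep]

tokens = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'l']
classes = [1, 1, 1, 1, 2, 2, 3, 3, 3]
print(balanced_sample(tokens, classes, seed=0))
```

Sampling positions rather than tokens is what allows the original order to be restored afterwards, and it also copes with duplicate tokens.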

Drey · Accepted Answer · 2019-08-14 00:31:59Z

Yet another short solution:

import random
from itertools import chain
from operator import itemgetter
import toolz

tokens  = ['a','b','c','d','e','f','g','h','l']
classes = [ 1 , 1 , 1 , 1 , 2 , 2 , 3 , 3 , 3]

groups = toolz.groupby(itemgetter(1), zip(tokens, classes))
max_size = len(min(groups.values(), key=len))
random_samples = chain.from_iterable(map(lambda x: random.sample(x, k=max_size), list(groups.values())))

chosen_tokens, corresponding_classes = list(zip(*random_samples))

Or, alternatively, using only standard-library modules:

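import random
from itertools import chain, groupby
from operator import itemgetter

tokens = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'l']
classes = [1, 1, 1, 1, 2, 2, 3, 3, 3]

# groupby only merges adjacent items, so sort the (token, class) pairs by class first
pairs = sorted(zip(tokens, classes), key=itemgetter(1))
# Materialise each group as a list: groupby's sub-iterators are invalidated
# as soon as groupby advances, so they cannot be measured first and read later
groups = [list(g) for _, g in groupby(pairs, itemgetter(1))]
max_size = min(map(len, groups))  # size of the smallest class

random_samples = chain.from_iterable(random.sample(g, k=max_size) for g in groups)
chosen_tokens, corresponding_classes = zip(*random_samples)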

Edit: I think there is an even shorter solution:

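import random
from itertools import chain, groupby
from operator import itemgetter

# groupby only merges adjacent items, so sort the (token, class) pairs by class first
pairs = sorted(zip(tokens, classes), key=itemgetter(1))
# Inside the generator, `tokens` is the current group of (token, class) pairs
groups = (sorted(tokens, key=lambda x: random.random())
          for _, tokens in groupby(pairs, itemgetter(1)))
chosen_tokens, corresponding_classes = zip(*chain.from_iterable(zip(*groups)))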

There are just two steps. 1. Make sure the list for each group is randomized; this happens in sorted(tokens, key=lambda x: random.random()), because every element gets a random sort key. 2. zip stops as soon as the shortest iterable is exhausted, which is what makes this solution so short. zip(*groups) is an iterator that yields one triplet per iteration (one element per class, since there are 3 classes). Because the lists were shuffled beforehand, the elements are sampled randomly. To separate tokens and classes again, we concatenate the triplets and unzip them. Note that itertools.groupby only groups adjacent items, so the pairs must be sorted by class when the classes are not already in order.
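
The truncating behaviour of zip can be checked in isolation. The snippet below is a small illustration with hand-written groups, not part of the solution itself; the names `shuffled`, `rows` and `balanced` are made up for the demo.

```python
import random

# One list of tokens per class (sizes 4, 2 and 3)
groups = [['a', 'b', 'c', 'd'], ['e', 'f'], ['g', 'h', 'l']]

# Shuffle each group independently; random.sample returns a new list
shuffled = [random.sample(g, k=len(g)) for g in groups]

# zip stops as soon as the shortest group runs out, so it yields exactly
# min(len(g)) rows, and each row holds one token from every class
rows = list(zip(*shuffled))
print(len(rows))     # 2
print(len(rows[0]))  # 3

# Flattening the rows gives a balanced selection: 2 tokens per class
balanced = [tok for row in rows for tok in row]
print(len(balanced))  # 6
```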
