I have an array of tokens, and each token belongs to one of n classes, labelled 1 to n. I need to balance the tokens array/list so that every class has an equal number of tokens. I want to do this by removing elements from tokens.

In the example below the class with the lowest number of tokens is class 2 which has only 2 tokens. So I want to remove elements from other classes until their count is also 2.

e.g.

tokens  = array(['a','b','c','d','e','f','g','h','l'])

classes = array([ 1 , 1 , 1 , 1 , 2 , 2 , 3 , 3 , 3])

In this example, the classes are listed in ascending order (for clarity of task) but in reality, the classes are in no particular order.

Possible solutions:

sol = array(['c','d','e','f','g','h'])

or

sol = array(['a','b','e','f','g','h'])

etc.

Obviously, because there is a choice of which elements to remove from an over-represented class, different solutions are possible (as above). I need a function that takes the tokens and classes and outputs a sol.

  • Do you want to get one random solution or a deterministic one? Commented Aug 13, 2019 at 9:30
  • Unless this is the one and only instance you will ever touch Python, I suggest you try solving it yourself first; otherwise you will learn very little, if anything. Commented Aug 13, 2019 at 9:49

4 Answers

A solution with Counter:

tokens = ['a','b','c','d','e','f','g','h','l']
lst    = [ 1 , 1 , 1 , 1 , 2 , 2 , 3 , 3 , 3]

from collections import Counter

c = Counter(lst)
min_cnt = min(c.values())
new_lst = list( zip(tokens, lst) )

while True:
    tmp = []
    should_break = True
    for t, i in new_lst:
        if c[i] > min_cnt:
            c[i] -= 1
            should_break = False
        else:
            tmp.append( (t, i) )

    new_lst = tmp

    if should_break:
        break

print([t for t, _ in new_lst])

Prints:

['c', 'd', 'e', 'f', 'h', 'l']

Another possible solution with groupby:

tokens = ['a','b','c','d','e','f','g','h','l']
lst    = [ 1 , 1 , 1 , 1 , 2 , 2 , 3 , 3 , 3]

from collections import Counter
from itertools import groupby, islice

c = Counter(lst)
min_cnt = min(c.values())

out = []
for v, g in groupby(sorted(enumerate(zip(tokens, lst)), key=lambda k: k[1][1]), lambda k: k[1][1]):
    out.extend(islice(g, 0, min_cnt))

print( [val for _, (val, _) in sorted(out, key=lambda k: k[0])] )

Prints:

['a', 'b', 'e', 'f', 'g', 'h']
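
Since the question asks for a function, the idea can also be condensed into a single pass. The following is a minimal sketch rather than part of the original answer; the name `balance_first` is hypothetical. It keeps the first `min_cnt` tokens of every class, preserves the input order, and does not require the classes to be sorted:

```python
from collections import Counter

def balance_first(tokens, classes):
    """Keep the first min_cnt tokens of each class, preserving input order."""
    min_cnt = min(Counter(classes).values())
    seen = Counter()  # how many tokens of each class have been kept so far
    out = []
    for t, c in zip(tokens, classes):
        if seen[c] < min_cnt:
            seen[c] += 1
            out.append(t)
    return out

tokens = ['a','b','c','d','e','f','g','h','l']
lst    = [ 1 , 1 , 1 , 1 , 2 , 2 , 3 , 3 , 3]
print(balance_first(tokens, lst))
# ['a', 'b', 'e', 'f', 'g', 'h']
```

Like the snippets above, this raises a ValueError on empty input, because `min` has nothing to compare.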

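Here is a way to do that with NumPy. This will always select the first appearances of each class. Note that the appearance index is computed from cumulative counts, so the function assumes that classes is sorted, as in the example.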

import numpy as np

def balance(tokens, classes):
    # Count appearances of each class
    c = np.bincount(classes - 1)
    n = c.min()
    # Accumulated counts for each class shifted one position
    cs = np.roll(np.cumsum(c), 1)
    cs[0] = 0
    # Compute appearance index for each class
    i = np.arange(len(classes)) - cs[classes - 1]
    # Mask excessive appearances
    m = i < n
    # Return corresponding tokens
    return tokens[m]

tokens  = np.array(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'l'])
classes = np.array([  1,   1,   1,   1,   2,   2,   3,   3,   3])
print(balance(tokens, classes))
# ['a' 'b' 'e' 'f' 'g' 'h']

As it stands, the function returns an empty array when some class is completely missing (the minimum number of appearances would then be zero, so no class would appear in the solution), but you can adapt that if needed.
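
The appearance index in `balance` is obtained by subtracting cumulative counts from positions, which only works when `classes` is sorted. One possible adaptation for unordered labels is sketched below, still assuming the labels are integers from 1 to n; the name `balance_unsorted` is hypothetical. It sorts the labels stably and then maps the mask back to the original positions:

```python
import numpy as np

def balance_unsorted(tokens, classes):
    # Count appearances of each class (labels assumed to be 1..n)
    c = np.bincount(classes - 1)
    n = c.min()
    # Stable sort keeps equal labels in their original relative order
    order = np.argsort(classes, kind='stable')
    # Start offset of each class within the sorted labels
    starts = np.cumsum(c) - c
    # Appearance index of each element within its class, in sorted order
    i = np.arange(len(classes)) - starts[classes[order] - 1]
    # Mark the first n appearances of each class at their original positions
    m = np.zeros(len(classes), dtype=bool)
    m[order[i < n]] = True
    return tokens[m]

tokens  = np.array(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'l'])
classes = np.array([  3,   1,   2,   1,   3,   1,   2,   3,   1])
print(balance_unsorted(tokens, classes))
# ['a' 'b' 'c' 'd' 'e' 'g']
```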


Another solution with Counter:

import random
from collections import Counter

import numpy as np  # needed for np.array below

tokens  = np.array(['a','b','c','d','e','f','g','h','l'])
classes = np.array([ 1 , 1 , 1 , 1 , 2 , 2 , 3 , 3 , 3])

def sampling(tokens, classes):
    dc = {}
    sol = []
    for i in range(len(classes)):
        if classes[i] in dc:
            dc[classes[i]].append(tokens[i])
        else:
            dc[classes[i]] = [tokens[i]]
    sample_counts = Counter(classes)
    min_sample = min(sample_counts.values())
    for i in dc:
        sol += (random.sample(dc[i],min_sample))
    return sol

print(sampling(tokens, classes))

Prints, for example (the sample is random, so the result varies between runs):

['d', 'a', 'f', 'e', 'g', 'h']
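
`sampling` returns the tokens grouped by class rather than in their input order. If the original order matters, one option is to sample positions instead of tokens and sort them afterwards. A minimal sketch; the function name `sampling_ordered` and the optional `seed` parameter are assumptions added for illustration:

```python
import random
from collections import defaultdict

def sampling_ordered(tokens, classes, seed=None):
    rng = random.Random(seed)  # private generator, so a seed gives repeatable results
    # Collect the positions of every class
    positions = defaultdict(list)
    for idx, c in enumerate(classes):
        positions[c].append(idx)
    min_sample = min(len(p) for p in positions.values())
    # Draw min_sample random positions per class, then restore input order
    keep = sorted(i for p in positions.values() for i in rng.sample(p, min_sample))
    return [tokens[i] for i in keep]

tokens  = ['a','b','c','d','e','f','g','h','l']
classes = [ 1 , 1 , 1 , 1 , 2 , 2 , 3 , 3 , 3]
print(sampling_ordered(tokens, classes, seed=0))
```

Passing the same `seed` twice yields the same selection, which can help when the balanced set has to be reproducible.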


Yet another short solution:

import random
from itertools import chain
from operator import itemgetter
import toolz

tokens  = ['a','b','c','d','e','f','g','h','l']
classes = [ 1 , 1 , 1 , 1 , 2 , 2 , 3 , 3 , 3]

groups = toolz.groupby(itemgetter(1), zip(tokens, classes))
max_size = len(min(groups.values(), key=len))
random_samples = chain.from_iterable(map(lambda x: random.sample(x, k=max_size), list(groups.values())))

chosen_tokens, corresponding_classes = list(zip(*random_samples))

Or, alternatively, using only built-in modules:

import random
from itertools import chain, groupby
from operator import itemgetter

tokens = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'l']
classes = [1, 1, 1, 1, 2, 2, 3, 3, 3]

# groupby only merges adjacent items, so sort by class first
pairs = sorted(zip(tokens, classes), key=itemgetter(1))
# Materialize each group: groupby's sub-iterators become invalid once it advances
groups = [list(g) for _, g in groupby(pairs, itemgetter(1))]
max_size = min(map(len, groups))  # size of the smallest class

random_samples = chain.from_iterable(map(lambda g: random.sample(g, k=max_size), groups))
chosen_tokens, corresponding_classes = list(zip(*random_samples))

Edit: I think there is an even shorter solution:

import random
from itertools import chain, groupby
from operator import itemgetter

# groupby only merges adjacent items, so sort by class first
pairs = sorted(zip(tokens, classes), key=itemgetter(1))
groups = (sorted(group, key=lambda x: random.random())
          for _, group in groupby(pairs, itemgetter(1)))
chosen_tokens, corresponding_classes = zip(*chain.from_iterable(zip(*groups)))

There are just two steps:

1. Make sure the list for each group is shuffled. This happens in sorted(group, key=lambda x: random.random()), because every element gets a random sort key.
2. zip stops as soon as its shortest input is exhausted, which is what makes this solution so short. zip(*groups) yields one triplet per iteration (one element from each of the 3 classes). Because the lists were shuffled beforehand, the elements are sampled randomly.

To separate tokens and classes again, we concatenate the triplets and unzip them.
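
The truncating behaviour of `zip` that this solution relies on can be seen in isolation. A small illustration with hand-written, unshuffled groups (the lists below are made up for the example, not produced by the code above):

```python
# Three groups of different sizes, as (token, class) pairs
groups = [
    [('a', 1), ('b', 1), ('c', 1), ('d', 1)],
    [('e', 2), ('f', 2)],
    [('g', 3), ('h', 3), ('l', 3)],
]

# zip stops after the shortest group (2 pairs), so every class contributes 2
triplets = list(zip(*groups))
print(triplets)
# [(('a', 1), ('e', 2), ('g', 3)), (('b', 1), ('f', 2), ('h', 3))]

# Flatten the triplets and split the pairs back into tokens and classes
flat = [pair for triplet in triplets for pair in triplet]
chosen_tokens, corresponding_classes = zip(*flat)
print(chosen_tokens)          # ('a', 'e', 'g', 'b', 'f', 'h')
print(corresponding_classes)  # (1, 2, 3, 1, 2, 3)
```

Shuffling each group before the `zip` step is what turns this deterministic truncation into a random sample.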
