balance numpy array with over-sampling

Question

Please help me find a clean way to create a new array from an existing one. Any class that has fewer examples than the largest class should be over-sampled until it has the same number of examples. The added samples should be taken from the original array (it makes no difference whether randomly or sequentially).

Let's say the initial array is this:

[  2,  29,  30,   1]
[  5,  50,  46,   0]
[  1,   7,  89,   1]
[  0,  10,  92,   9]
[  4,  11,   8,   1]
[  3,  92,   1,   0]

The last column contains the classes:

classes = [ 0,  1,  9]

The distribution of the classes is the following:

distrib = [2, 3, 1]

What I need is to create a new array with an equal number of samples of every class, taken randomly from the original array, e.g.

[  5,  50,  46,   0]
[  3,  92,   1,   0]
[  5,  50,  46,   0] # one example added
[  2,  29,  30,   1]
[  1,   7,  89,   1]
[  4,  11,   8,   1]
[  0,  10,  92,   9]
[  0,  10,  92,   9] # two examples
[  0,  10,  92,   9] # added

Jaime · Accepted Answer · 2014-04-30 15:52:30Z

The following code does what you are after:

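import numpy as np

a = np.array([[  2,  29,  30,   1],
              [  5,  50,  46,   0],
              [  1,   7,  89,   1],
              [  0,  10,  92,   9],
              [  4,  11,   8,   1],
              [  3,  92,   1,   0]])

# unq: sorted class labels; unq_idx: position in unq of each row's label
unq, unq_idx = np.unique(a[:, -1], return_inverse=True)
unq_cnt = np.bincount(unq_idx)  # number of rows per class
cnt = np.max(unq_cnt)           # size of the largest class
out = np.empty((cnt*len(unq),) + a.shape[1:], a.dtype)
for j in range(len(unq)):
    # draw cnt row indices (with replacement) from the rows of class j
    indices = np.random.choice(np.where(unq_idx==j)[0], cnt)
    out[j*cnt:(j+1)*cnt] = a[indices]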

>>> out
array([[ 5, 50, 46,  0],
       [ 5, 50, 46,  0],
       [ 5, 50, 46,  0],
       [ 1,  7, 89,  1],
       [ 4, 11,  8,  1],
       [ 2, 29, 30,  1],
       [ 0, 10, 92,  9],
       [ 0, 10, 92,  9],
       [ 0, 10, 92,  9]])

In NumPy 1.9 and later, the np.unique and np.bincount calls can be condensed into a single call:

unq, unq_idx, unq_cnt = np.unique(a[:, -1], return_inverse=True,
                                  return_counts=True)
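To make the three returned arrays concrete, here is a small sketch (not part of the original answer) that runs the condensed call on the example array and prints each result. It assumes NumPy 1.9 or later.

```python
import numpy as np

a = np.array([[  2,  29,  30,   1],
              [  5,  50,  46,   0],
              [  1,   7,  89,   1],
              [  0,  10,  92,   9],
              [  4,  11,   8,   1],
              [  3,  92,   1,   0]])

unq, unq_idx, unq_cnt = np.unique(a[:, -1], return_inverse=True,
                                  return_counts=True)

print(unq)      # sorted class labels: [0 1 9]
print(unq_idx)  # position in unq of each row's label: [1 0 1 2 1 0]
print(unq_cnt)  # number of rows per class: [2 3 1]
```

unq and unq_cnt correspond to the classes and distrib lists in the question.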

Note that, because np.random.choice samples with replacement, there is no guarantee that every row of the original array will be present in the output, as the first output above shows (the row [3, 92, 1, 0] is missing). If that is needed, you could do something like:

unq, unq_idx = np.unique(a[:, -1], return_inverse=True)
unq_cnt = np.bincount(unq_idx)
cnt = np.max(unq_cnt)
# only the missing rows are sampled: cnt - unq_cnt[j] extra rows for class j
out = np.empty((cnt*len(unq) - len(a),) + a.shape[1:], a.dtype)
slices = np.concatenate(([0], np.cumsum(cnt - unq_cnt)))
for j in range(len(unq)):
    indices = np.random.choice(np.where(unq_idx==j)[0], cnt - unq_cnt[j])
    out[slices[j]:slices[j+1]] = a[indices]
# keep every original row and append the extra samples
out = np.vstack((a, out))

>>> out
array([[ 2, 29, 30,  1],
       [ 5, 50, 46,  0],
       [ 1,  7, 89,  1],
       [ 0, 10, 92,  9],
       [ 4, 11,  8,  1],
       [ 3, 92,  1,  0],
       [ 5, 50, 46,  0],
       [ 0, 10, 92,  9],
       [ 0, 10, 92,  9]])
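Since the question allows sequential sampling, a deterministic variant is also possible. The following is only a sketch and not part of the answer above: oversample_sequential is a hypothetical helper name, and the code assumes NumPy 1.9 or later (for return_counts) with the class label in the last column. It cycles through each class's rows with np.resize until the class reaches the size of the largest one, so every original row appears at least once.

```python
import numpy as np

def oversample_sequential(a):
    """Balance classes (last column) by cycling through each class's rows."""
    labels = a[:, -1]
    unq, unq_cnt = np.unique(labels, return_counts=True)
    cnt = unq_cnt.max()  # target size: the largest class
    parts = []
    for label in unq:
        idx = np.flatnonzero(labels == label)
        # np.resize repeats idx cyclically until it has cnt entries
        parts.append(a[np.resize(idx, cnt)])
    return np.vstack(parts)

a = np.array([[  2,  29,  30,   1],
              [  5,  50,  46,   0],
              [  1,   7,  89,   1],
              [  0,  10,  92,   9],
              [  4,  11,   8,   1],
              [  3,  92,   1,   0]])

balanced = oversample_sequential(a)
print(balanced)
# check the balance: every class should now have the same count
print(np.unique(balanced[:, -1], return_counts=True))
```

For the example array this reproduces the layout sketched in the question, with each class appearing three times.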

askewchan · Accepted Answer · 2014-04-30 16:00:35Z

This gives a random distribution with equal probability for each class:

distrib = np.bincount(a[:,-1])
prob = 1/distrib[a[:, -1]].astype(float)
prob /= prob.sum()

In [38]: a[np.random.choice(np.arange(len(a)), size=np.count_nonzero(distrib)*distrib.max(), p=prob)]
Out[38]: 
array([[ 5, 50, 46,  0],
       [ 4, 11,  8,  1],
       [ 0, 10, 92,  9],
       [ 0, 10, 92,  9],
       [ 2, 29, 30,  1],
       [ 0, 10, 92,  9],
       [ 3, 92,  1,  0],
       [ 1,  7, 89,  1],
       [ 1,  7, 89,  1]])

Each class has an equal probability of being drawn, but equal incidence in the output is not guaranteed.
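The difference between equal probability and equal incidence can be checked directly. The sketch below is an illustration, not a fix: it sums the per-row weights by class and then counts the classes in a few draws. The seed, the number of draws, and the use of np.random.default_rng (NumPy 1.17 or later) are my own choices.

```python
import numpy as np

a = np.array([[  2,  29,  30,   1],
              [  5,  50,  46,   0],
              [  1,   7,  89,   1],
              [  0,  10,  92,   9],
              [  4,  11,   8,   1],
              [  3,  92,   1,   0]])

labels = a[:, -1]
distrib = np.bincount(labels)
prob = 1.0 / distrib[labels]
prob /= prob.sum()

classes = np.unique(labels)
# total probability mass per class: 1/3 each for this array
class_prob = np.bincount(labels, weights=prob)[classes]
print(class_prob)

rng = np.random.default_rng(0)  # arbitrary seed, for repeatability only
draw_counts = []
for _ in range(5):
    idx = rng.choice(len(a), size=9, p=prob)
    # how often each class (0, 1, 9) occurs in this draw
    counts = np.bincount(a[idx, -1], minlength=distrib.size)[classes]
    draw_counts.append(counts)
    print(counts)
```

Each printed count vector sums to 9, but the individual entries are free to differ from 3.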

1 Comment

funkifunki Over a year ago

While this is a really cool piece of code, it doesn't actually solve the problem, since equal presence of all classes is not guaranteed: you can get [0 0 0 1 1 1 9 9 9], but it is also possible to get [9 0 0 9 9 9 0 1 9]. Thanks a lot, though, cool example!

ThisSuitIsBlackNot · Accepted Answer · 2020-04-26 14:38:58Z

You can use the imbalanced-learn package:

import numpy as np
from imblearn.over_sampling import RandomOverSampler

data = np.array([
    [  2,  29,  30,   1],
    [  5,  50,  46,   0],
    [  1,   7,  89,   1],
    [  0,  10,  92,   9],
    [  4,  11,   8,   1],
    [  3,  92,   1,   0]
])

ros = RandomOverSampler()

# fit_resample expects two arguments: a matrix of sample data and a vector of
# sample labels. In this case, the sample data is in the first three columns of 
# our array and the labels are in the last column
X_resampled, y_resampled = ros.fit_resample(data[:, :-1], data[:, -1])

# fit_resample returns a matrix of resampled data and a vector with the 
# corresponding labels. Combine them into a single matrix
resampled = np.column_stack((X_resampled, y_resampled))

print(resampled)

Output:

[[ 2 29 30  1]
 [ 5 50 46  0]
 [ 1  7 89  1]
 [ 0 10 92  9]
 [ 4 11  8  1]
 [ 3 92  1  0]
 [ 3 92  1  0]
 [ 0 10 92  9]
 [ 0 10 92  9]]

The RandomOverSampler offers different sampling strategies, but the default is to resample all classes except the majority class.
