10

please help me finding a clean way to create a new array out of existing. it should be over-sampled, if the number of example of any class is smaller than the maximum number of examples in the class. samples should be taken from the original array (makes no difference, whether randomly or sequentially)

let's say, initial array is this:

[  2,  29,  30,   1]
[  5,  50,  46,   0]
[  1,   7,  89,   1]
[  0,  10,  92,   9]
[  4,  11,   8,   1]
[  3,  92,   1,   0]

the last column contains classes:

classes = [ 0,  1,  9]

the distribution of the classes is the following:

distrib = [2, 3, 1]

what i need is to create a new array with equal number of samples of all classes, taken randomly from the original array, e.g.

[  5,  50,  46,   0]
[  3,  92,   1,   0]
[  5,  50,  46,   0] # one example added
[  2,  29,  30,   1]
[  1,   7,  89,   1]
[  4,  11,   8,   1]
[  0,  10,  92,   9]
[  0,  10,  92,   9] # two examples
[  0,  10,  92,   9] # added

3 Answers 3

11

The following code does what you are after:

a = np.array([[  2,  29,  30,   1],
              [  5,  50,  46,   0],
              [  1,   7,  89,   1],
              [  0,  10,  92,   9],
              [  4,  11,   8,   1],
              [  3,  92,   1,   0]])

unq, unq_idx = np.unique(a[:, -1], return_inverse=True)
unq_cnt = np.bincount(unq_idx)
cnt = np.max(unq_cnt)
out = np.empty((cnt*len(unq),) + a.shape[1:], a.dtype)
for j in xrange(len(unq)):
    indices = np.random.choice(np.where(unq_idx==j)[0], cnt)
    out[j*cnt:(j+1)*cnt] = a[indices]

>>> out
array([[ 5, 50, 46,  0],
       [ 5, 50, 46,  0],
       [ 5, 50, 46,  0],
       [ 1,  7, 89,  1],
       [ 4, 11,  8,  1],
       [ 2, 29, 30,  1],
       [ 0, 10, 92,  9],
       [ 0, 10, 92,  9],
       [ 0, 10, 92,  9]])

When numpy 1.9 is released, or if you compile from the development branch, then the first two lines can be condensed into:

unq, unq_idx, unq_cnt = np.unique(a[:, -1], return_inverse=True,
                                  return_counts=True)

Note that, the way np.random.choice works, there is no guarantee that all rows of the original array will be present in the output one, as the example above shows. If that is needed, you could do something like:

unq, unq_idx = np.unique(a[:, -1], return_inverse=True)
unq_cnt = np.bincount(unq_idx)
cnt = np.max(unq_cnt)
out = np.empty((cnt*len(unq) - len(a),) + a.shape[1:], a.dtype)
slices = np.concatenate(([0], np.cumsum(cnt - unq_cnt)))
for j in xrange(len(unq)):
    indices = np.random.choice(np.where(unq_idx==j)[0], cnt - unq_cnt[j])
    out[slices[j]:slices[j+1]] = a[indices]
out = np.vstack((a, out))

>>> out
array([[ 2, 29, 30,  1],
       [ 5, 50, 46,  0],
       [ 1,  7, 89,  1],
       [ 0, 10, 92,  9],
       [ 4, 11,  8,  1],
       [ 3, 92,  1,  0],
       [ 5, 50, 46,  0],
       [ 0, 10, 92,  9],
       [ 0, 10, 92,  9]])
Sign up to request clarification or add additional context in comments.

Comments

5

This gives a random distribution with equal probability for each class:

distrib = np.bincount(a[:,-1])
prob = 1/distrib[a[:, -1]].astype(float)
prob /= prob.sum()

In [38]: a[np.random.choice(np.arange(len(a)), size=np.count_nonzero(distrib)*distrib.max(), p=prob)]
Out[38]: 
array([[ 5, 50, 46,  0],
       [ 4, 11,  8,  1],
       [ 0, 10, 92,  9],
       [ 0, 10, 92,  9],
       [ 2, 29, 30,  1],
       [ 0, 10, 92,  9],
       [ 3, 92,  1,  0],
       [ 1,  7, 89,  1],
       [ 1,  7, 89,  1]])

Each class has equal probability, not guaranteed equal incidence.

1 Comment

while being a really cool piece of code it doesn't actually solve the problem, since the equal presence of all classes not guaranteed: you can get [0 0 0 1 1 1 9 9 9], but also it is possible to catch [9 0 0 9 9 9 0 1 9]. thanks a lot, though, cool example!
1

You can use the imbalanced-learn package:

import numpy as np
from imblearn.over_sampling import RandomOverSampler

data = np.array([
    [  2,  29,  30,   1],
    [  5,  50,  46,   0],
    [  1,   7,  89,   1],
    [  0,  10,  92,   9],
    [  4,  11,   8,   1],
    [  3,  92,   1,   0]
])

ros = RandomOverSampler()

# fit_resample expects two arguments: a matrix of sample data and a vector of
# sample labels. In this case, the sample data is in the first three columns of 
# our array and the labels are in the last column
X_resampled, y_resampled = ros.fit_resample(data[:, :-1], data[:, -1])

# fit_resample returns a matrix of resampled data and a vector with the 
# corresponding labels. Combine them into a single matrix
resampled = np.column_stack((X_resampled, y_resampled))

print(resampled)

Output:

[[ 2 29 30  1]
 [ 5 50 46  0]
 [ 1  7 89  1]
 [ 0 10 92  9]
 [ 4 11  8  1]
 [ 3 92  1  0]
 [ 3 92  1  0]
 [ 0 10 92  9]
 [ 0 10 92  9]]

The RandomOverSampler offers different sampling strategies, but the default is to resample all classes except the majority class.

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.