Undersampling numpy array

Question

I have a training set with 10192 samples of class '0' and 2512 samples of class '1'.
I've applied PCA to the set to reduce its dimensionality.
I want to undersample the resulting NumPy array.
Here's my code:

import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split

df = pd.read_csv("train.csv")
X = df.drop(['label'], axis=1)
y = df['label']

X_train, X_validation, y_train, y_validation = train_test_split(X, y, test_size=0.2, random_state=42)

# fit PCA on the training split only, then project both splits
model = PCA(n_components=19)
model.fit(X_train)
X_train_pca = model.transform(X_train)
X_validation_pca = model.transform(X_validation)

X_train = np.array(X_train_pca)
X_validation = np.array(X_validation_pca)
y_train = np.array(y_train)
y_validation = np.array(y_validation)

How can I undersample the '0' class in the X_train NumPy array?

Are you referring to subsampling the entire data set, or to balancing the classes by undersampling just the '0' class?
– ipj, Commented Jul 13, 2020 at 9:13
I would suggest taking a look at the imbalanced-learn package rather than doing this yourself.
– Dan, Commented Jul 13, 2020 at 10:18
A NumPy question is certainly expected to be tagged as numpy (added), and not as machine-learning (removed).
– desertnaut, Commented Jul 13, 2020 at 15:24

ipj · Accepted Answer · 2020-07-13 10:41:47Z

2

After loading the CSV into df, try:

import pandas as pd

# class counts, looked up by label rather than by position
counts = df['label'].value_counts()
count_class_0, count_class_1 = counts.loc[0], counts.loc[1]

# separate according to `label`
df_class_0 = df[df['label'] == 0]
df_class_1 = df[df['label'] == 1]

# sample as many class-0 rows as there are class-1 rows
df_class_0_under = df_class_0.sample(count_class_1)
df_test_under = pd.concat([df_class_0_under, df_class_1], axis=0)

Then perform all subsequent calculations on the df_test_under DataFrame.
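The same idea can be written more generally, so that it does not depend on knowing which label is the majority. What follows is only a sketch: the helper name undersample_to_minority and the toy DataFrame are made up for illustration, and it assumes pandas 1.1 or newer, where groupby objects have a sample method.

```python
import pandas as pd

def undersample_to_minority(df, label_col="label", random_state=0):
    """Randomly downsample every class to the size of the smallest class."""
    n_min = int(df[label_col].value_counts().min())
    # DataFrameGroupBy.sample draws n rows from each group (pandas >= 1.1).
    balanced = df.groupby(label_col).sample(n=n_min, random_state=random_state)
    # Shuffle so the classes are not stored in contiguous blocks.
    return balanced.sample(frac=1, random_state=random_state).reset_index(drop=True)

# Toy data standing in for train.csv: 10 rows of class 0, 3 rows of class 1.
toy = pd.DataFrame({"feature": range(13), "label": [0] * 10 + [1] * 3})
toy_balanced = undersample_to_minority(toy)
print(toy_balanced["label"].value_counts().to_dict())
```

Passing a fixed random_state makes the draw reproducible; drop it if a different subsample on every run is preferred.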

Alternatively, use RandomUnderSampler from the imbalanced-learn package:

from imblearn.under_sampling import RandomUnderSampler
rus = RandomUnderSampler(random_state=0)
X_resampled, y_resampled = rus.fit_resample(X, y)
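Both snippets above resample the full data set. If the goal is to undersample only the NumPy arrays obtained after the split and PCA, as in the question, index arithmetic on the label array is enough. This is a minimal sketch under assumptions: the function name undersample_arrays is hypothetical, the toy arrays merely stand in for the question's X_train and y_train, and class 0 is assumed to be the majority.

```python
import numpy as np

def undersample_arrays(X, y, majority_label=0, random_state=0):
    """Drop random majority-class rows until they match the remaining rows in number."""
    rng = np.random.default_rng(random_state)
    majority_idx = np.flatnonzero(y == majority_label)
    other_idx = np.flatnonzero(y != majority_label)
    # Keep as many majority rows as there are rows of the other class.
    keep_majority = rng.choice(majority_idx, size=len(other_idx), replace=False)
    # Shuffle the kept indices so the classes are interleaved.
    keep = rng.permutation(np.concatenate([keep_majority, other_idx]))
    return X[keep], y[keep]

# Toy stand-ins for the PCA-transformed training arrays: 10 zeros, 3 ones.
X_train = np.arange(26, dtype=float).reshape(13, 2)
y_train = np.array([0] * 10 + [1] * 3)

X_train_under, y_train_under = undersample_arrays(X_train, y_train)
print(np.bincount(y_train_under))
```

Only the training arrays are resampled here; the validation arrays are left untouched so that they still reflect the original class distribution.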

edited Jul 13, 2020 at 10:41

answered Jul 13, 2020 at 10:10

ipj
