
I built a neural network and it worked fine with a small dataset of around 300,000 rows with 2 categorical variables and 1 dependent variable, but I ran into memory errors when I increased it to 6.5 million rows. So I decided to modify the code, and I am getting closer, but now I am running into fit errors. I have 2 categorical variables and one column for the dependent variable of 1's and 0's (suspicious or not suspicious). To start off, the dataset looks like this:

DBF2
   ParentProcess                   ChildProcess               Suspicious
0  C:\Program Files (x86)\Wireless AutoSwitch\wrl...    ...            0
1  C:\Program Files (x86)\Wireless AutoSwitch\wrl...    ...            0
2  C:\Windows\System32\svchost.exe                      ...            1
3  C:\Program Files (x86)\Wireless AutoSwitch\wrl...    ...            0
4  C:\Program Files (x86)\Wireless AutoSwitch\wrl...    ...            0
5  C:\Program Files (x86)\Wireless AutoSwitch\wrl...    ...            0
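
For reference, a minimal stand-in for DBF2 can be built like this (a sketch only: the ParentProcess paths are truncated exactly as in the preview above, and the ChildProcess values are hypothetical placeholders, since the preview elides them):

import pandas as pd

# Hypothetical stand-in for DBF2, mirroring the three columns shown above
DBF2 = pd.DataFrame({
    'ParentProcess': [r'C:\Program Files (x86)\Wireless AutoSwitch\wrl...',
                      r'C:\Windows\System32\svchost.exe',
                      r'C:\Program Files (x86)\Wireless AutoSwitch\wrl...'],
    'ChildProcess': ['child_a.exe', 'child_b.exe', 'child_a.exe'],
    'Suspicious': [0, 1, 0],
})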

My code follows, with the errors it produces:

import pandas as pd
import numpy as np
import hashlib
import matplotlib.pyplot as plt
import timeit

X = DBF2.iloc[:, 0:2].values  # features: ParentProcess, ChildProcess
y = DBF2.iloc[:, 2].values    # target: Suspicious (already 1-D, so .ravel() is unnecessary)

from sklearn.preprocessing import LabelEncoder, OneHotEncoder
labelencoder_X_1 = LabelEncoder()
X[:, 0] = labelencoder_X_1.fit_transform(X[:, 0])
labelencoder_X_2 = LabelEncoder()
X[:, 1] = labelencoder_X_2.fit_transform(X[:, 1])

onehotencoder = OneHotEncoder(categorical_features = [0,1])
X = onehotencoder.fit_transform(X)  # returns a SciPy sparse matrix by default

# Drop one dummy column per encoded feature to avoid the dummy variable trap
index_to_drop = [0, 2039]
to_keep = list(set(xrange(X.shape[1]))-set(index_to_drop))
X = X[:,to_keep]

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)

#ERROR
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python2.7/dist-packages/sklearn/base.py", line 517, in fit_transform
    return self.fit(X, **fit_params).transform(X)
  File "/usr/local/lib/python2.7/dist-packages/sklearn/preprocessing/data.py", line 590, in fit
    return self.partial_fit(X, y)
  File "/usr/local/lib/python2.7/dist-packages/sklearn/preprocessing/data.py", line 621, in partial_fit
    "Cannot center sparse matrices: pass `with_mean=False` "
ValueError: Cannot center sparse matrices: pass `with_mean=False` instead. See docstring for motivation and alternatives.

X_test = sc.transform(X_test)

#ERROR
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python2.7/dist-packages/sklearn/preprocessing/data.py", line 677, in transform
    check_is_fitted(self, 'scale_')
  File "/usr/local/lib/python2.7/dist-packages/sklearn/utils/validation.py", line 768, in check_is_fitted
    raise NotFittedError(msg % {'name': type(estimator).__name__})
sklearn.exceptions.NotFittedError: This StandardScaler instance is not fitted yet. Call 'fit' with appropriate arguments before using this method.

If this helps, I printed X_train and y_train:

X_train
<5621203x7043 sparse matrix of type '<type 'numpy.float64'>'
with 11242334 stored elements in Compressed Sparse Row format>

y_train
array([0, 0, 0, ..., 0, 0, 0])
Comment: Have you tried the fix that the error message suggests? sc = StandardScaler(with_mean=False)? (Aug 24, 2018 at 17:10)

1 Answer


X_train is a sparse matrix, which is great when you're working with a large dataset like yours. The problem is that, as the documentation explains:

with_mean : boolean, True by default

If True, center the data before scaling. This does not work (and will raise an exception) when attempted on sparse matrices, because centering them entails building a dense matrix which in common use cases is likely to be too large to fit in memory.
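
To put that in perspective using the X_train printed in the question: densifying the 5621203x7043 matrix would take roughly 5,621,203 × 7,043 × 8 bytes ≈ 317 GB as float64, while the sparse format stores only its ~11.2 million nonzero elements.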

You can try passing with_mean=False:

sc = StandardScaler(with_mean=False)
X_train = sc.fit_transform(X_train)
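
With with_mean=False, the scaler only divides each feature by its standard deviation; zeros stay zero, so the result is still a sparse matrix and nothing has to be densified.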

Your second error, the NotFittedError, happened because the earlier fit_transform call raised an exception, so sc was never fitted; the following line therefore fails on an unfitted StandardScaler object.

X_test = sc.transform(X_test)

To be able to use the transform method, you first have to fit the StandardScaler on some data. If your intention was to fit the scaler on your training set and then use it to transform both the training set and the test set into the same space, you can do that as follows:

sc = StandardScaler(with_mean=False)
sc.fit(X_train)                    # learn the scaling from the training set only
X_train = sc.transform(X_train)    # apply the same scaling to both sets
X_test = sc.transform(X_test)
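
As an aside, the "alternatives" the error message points to include scalers designed for sparse input; a minimal sketch using MaxAbsScaler, which scales each feature by its maximum absolute value and never centers, so sparsity is preserved:

from sklearn.preprocessing import MaxAbsScaler

mas = MaxAbsScaler()                  # no centering, so sparse input stays sparse
X_train = mas.fit_transform(X_train)  # fit on the training set only
X_test = mas.transform(X_test)        # reuse the same scaling for the test set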