When I run my first attempt at a random forest classifier, I receive several different tracebacks.
The data I am using comprises 27 parameters and one 'results' column on which to train the model. My first attempt used 11,000 rows of data, which I intended as a test dataset; in reality I was hoping to work with datasets closer to 1,2 million rows. However, I received the following error:
  File "C:\Python27\lib\site-packages\scikit_learn-0.14.1-py2.7-win32.egg\sklearn\gaussian_process\gaussian_process.py", line 53, in l1_cross_distances
    D = np.zeros((n_nonzero_cross_dist, n_features))
ValueError: array is too big.
So I reduced the data file to 5k rows and received a different error:
  File "C:\Python27\lib\site-packages\scikit_learn-0.14.1-py2.7-win32.egg\sklearn\gaussian_process\gaussian_process.py", line 53, in l1_cross_distances
    D = np.zeros((n_nonzero_cross_dist, n_features))
MemoryError
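To put these two memory errors in perspective, here is a rough back-of-the-envelope estimate of the array that `np.zeros((n_nonzero_cross_dist, n_features))` tries to allocate. It is only a sketch under assumptions I have not confirmed: that `n_nonzero_cross_dist` is the number of distinct row pairs, n * (n - 1) / 2, that the array holds 8-byte floats, and that 26 feature columns are passed in. `estimate_cross_distance_bytes` is a hypothetical helper name, not part of sklearn.

```python
def estimate_cross_distance_bytes(n_rows, n_features, bytes_per_value=8):
    """Estimate the bytes needed for an array with one row per distinct pair of samples."""
    # Assumption: one entry per unordered pair of rows, i.e. n * (n - 1) / 2 pairs.
    n_pairs = n_rows * (n_rows - 1) // 2
    return n_pairs * n_features * bytes_per_value


for n_rows in (11000, 5000, 1000):
    size = estimate_cross_distance_bytes(n_rows, n_features=26)
    print("%6d rows: about %.2f GiB" % (n_rows, size / float(2 ** 30)))
```

Under these assumptions the estimates come to roughly 11.7 GiB, 2.4 GiB and 0.1 GiB. The first two would be more than a 32-bit Python build (the win32 egg in the traceback) can normally allocate, which might explain why only the 1k-row run gets past this line.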
Finally, I reduced the data file to 1k rows and still received an error, again different from the previous ones:
  File "C:\Python27\lib\site-packages\scikit_learn-0.14.1-py2.7-win32.egg\sklearn\gaussian_process\gaussian_process.py", line 309, in fit
    raise Exception("Multiple input features cannot have the same"
Exception: Multiple input features cannot have the same value.
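One thing I could check for this last error is whether my input contains identical rows. The sketch below counts repeated rows in plain Python. It assumes the exception is triggered by two samples having exactly the same feature values, which I have not verified, and `find_duplicate_rows` is a hypothetical helper rather than anything from sklearn.

```python
from collections import Counter


def find_duplicate_rows(rows):
    """Return a dict mapping each repeated row (as a tuple) to how often it occurs."""
    counts = Counter(tuple(row) for row in rows)
    return dict((row, c) for row, c in counts.items() if c > 1)


# Tiny made-up example: the first and third rows are identical.
sample = [
    [1.0, 2.0, 3.0],
    [4.0, 5.0, 6.0],
    [1.0, 2.0, 3.0],
]
duplicates = find_duplicate_rows(sample)
print(duplicates)
```

The same function should also accept a 2-D NumPy array such as `X_train`, since iterating over it yields its rows.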
I think it has something to do with the cross-validation function I am using:
# Using a custom cross-validation function because
# the one in sklearn 0.14.1 has a bug. Otherwise I would have used
# sklearn.cross_validation.cross_val_score.
from random import shuffle

import numpy as np
from sklearn import metrics


def crossValidation(model, X, Y, nfolds=10):
    """
    Performs k-fold cross-validation. Takes as arguments an arbitrary
    sklearn model, a training dataset (X, Y) and the number of folds.
    """
    n = X.shape[0]  # use the X passed in, not the global 'data'
    r = range(n)
    shuffle(r)
    scores = list()
    X_folds = np.array_split(X[r], nfolds)
    Y_folds = np.array_split(Y[r], nfolds)
    for k in range(nfolds):
        # We use 'list' to copy, in order to 'pop' later on
        X_train = list(X_folds)
        X_test = X_train.pop(k)
        X_train = np.concatenate(X_train)
        Y_train = list(Y_folds)
        Y_test = Y_train.pop(k)
        Y_train = np.concatenate(Y_train)
        model.fit(X_train, Y_train)
        y = model.predict(X_test)
        score = metrics.mean_squared_error(y, Y_test)
        scores.append(score)
    return np.mean(scores)
Any thoughts or advice would be appreciated. Please note this is my first attempt at running random forest classifiers, so I may have made some rookie mistakes.
Edit in response to comments:
Unfortunately I cannot supply the full code for confidentiality reasons. However, I hope the following snippets show the shape of X_train and Y_train when they are passed to model.fit.
Snippet 1
import csv

import numpy as np


def readCSV(path):
    """
    Read a CSV file of floats, with no header
    """
    data = []
    mycsv = csv.reader(open(path), delimiter="|")
    for counter, row in enumerate(mycsv):
        # Skip the first row of the file
        if counter != 0:
            data.append(row)
    return np.asarray(data, dtype=np.float32)

print np.asarray
Snippet 2
data = readCSV("FullUnMergedDataWSPSR14TEST4RFDO.csv")
X = data[0:, :26]
Y = data[:, 27]
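To double-check which columns those two slices select, here is a small sketch on a plain list of column indices. It assumes the file has 28 columns in total (27 parameters plus the 'results' column), which is my reading of the description above rather than something checked against the file itself.

```python
# Column indices 0..27, assuming 28 columns in total.
columns = list(range(28))

feature_columns = columns[:26]  # same slice as data[0:, :26]
target_column = columns[27]     # same index as data[:, 27]

# Columns that end up in neither X nor Y.
unused = [c for c in columns
          if c not in feature_columns and c != target_column]

print("features: %d columns (%d to %d)" % (
    len(feature_columns), feature_columns[0], feature_columns[-1]))
print("target: column %d" % target_column)
print("unused: %s" % unused)
```

Under that assumption, X receives only 26 of the 27 parameters and column 26 is not used at all; a slice of `data[:, :27]` would include it.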
np.random.rand), add the import statements, add the code to call your functions so that it produces the error? It will greatly help people trying to answer.