
When I run my first attempt at a random forest classifier, I receive several different tracebacks.

The data that I am using comprises 27 parameters and one 'results' column on which to train the model. My first attempt used 11,000 rows of data. I treated this as a test dataset, because in reality I was hoping to look at datasets closer to 1,2 million rows. But I received the following error:

File "C:\Python27\lib\site-packages\scikit_learn-0.14.1-py2.7-win32.egg\sklearn\gaussian_process\gaussian_process.py", line 53, in l1_cross_distances
D = np.zeros((n_nonzero_cross_dist, n_features))
ValueError: array is too big.

So I reduced the data file to 5k rows and received the following error:

File "C:\Python27\lib\site-packages\scikit_learn-0.14.1-py2.7-win32.egg\sklearn\gaussian_process\gaussian_process.py", line 53, in l1_cross_distances
D = np.zeros((n_nonzero_cross_dist, n_features))
MemoryError

Finally, I reduced the data file to 1k rows and still received an error, again different from the previous ones:

File "C:\Python27\lib\site-packages\scikit_learn-0.14.1-py2.7-win32.egg\sklearn\gaussian_process\gaussian_process.py", line 309, in fit
raise Exception("Multiple input features cannot have the same"
Exception: Multiple input features cannot have the same value.

I think it has something to do with the cross-validation function I am using:

# Using a custom cross-validation function
# the one in sklearn 0.14.1 has a bug. Otherwise I would have used
# sklearn.cross_validation.cross_val_score.
def crossValidation(model, X, Y, nfolds=10):
    """
    Performs k-fold cross-validation. Takes as arguments an arbitrary
    sklearn model, a training dataset (X, Y) and the number of folds.
    """
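    n = X.shape[0]  # use the X argument, not a global 'data'
    # Shuffled row indices (np.random.permutation works on Python 2 and 3)
    r = np.random.permutation(n)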
    scores = list()
    X_folds = np.array_split(X[r], nfolds)
    Y_folds = np.array_split(Y[r], nfolds)
    for k in range(nfolds):
        # We use 'list' to copy, in order to 'pop' later on
        X_train = list(X_folds)
        X_test  = X_train.pop(k)
        X_train = np.concatenate(X_train)
        Y_train = list(Y_folds)
        Y_test  = Y_train.pop(k)
        Y_train = np.concatenate(Y_train)
        model.fit(X_train, Y_train)
        y = model.predict(X_test)
        score = metrics.mean_squared_error(y, Y_test)
        scores.append(score)
    return np.mean(scores)

Any thoughts or advice would be appreciated. Please note this is my first attempt at running random forest classifiers, so I may have made some rookie mistakes.

Edit in response to comments:

Unfortunately I cannot supply the full code for confidentiality reasons. However, I hope the following snippets show the shape of X_train and Y_train when they are passed to model.fit.

Snippet1

def readCSV(path):
    """
    Read a '|'-delimited file of floats, skipping the first (header) row
    """
    data = []
    mycsv = csv.reader(open(path), delimiter="|")
    for counter, row in enumerate(mycsv):
        if counter != 0:
            data.append(row)
    return np.asarray(data, dtype=np.float32)
print np.asarray

Snippet2

data = readCSV("FullUnMergedDataWSPSR14TEST4RFDO.csv")
X = data[0:,:26]
Y = data[:, 27]
  • What is the shape of X_train and Y_train when passing it to model.fit()? Commented Jan 8, 2014 at 11:45
  • Are you able to turn your sample code into a complete example? By which I mean, add code to generate data of the right shape and dtype (e.g. with np.random.rand), add the import statements, add the code to call your functions so that it produces the error? It will greatly help people trying to answer. Commented Jan 8, 2014 at 11:53
  • @kazemakase By shape do you simply mean the physical shape/layout of the data? Or does the term shape have an additional meaning here? In accordance with ogrisel's comments, commenting out everything to do with gaussian_process in the script enables it to complete its model. Commented Jan 8, 2014 at 15:04
  • See docs.scipy.org/doc/numpy/reference/generated/… and wiki.scipy.org/… Commented Jan 8, 2014 at 15:48
  • @MrE Added the snippets that I believe define the shape of the X_train and Y_train. Commented Jan 9, 2014 at 1:09

1 Answer

You should print and check the values of n_nonzero_cross_dist and n_features at the line shown in your first two tracebacks. The expected size of a 2D numpy array with shape (n_nonzero_cross_dist, n_features) is n_nonzero_cross_dist * n_features * 8 / 1e9 GB (assuming an 8-byte dtype such as np.float64, which is the np.zeros default). You can check for yourself how much memory the dimensions of your problem would require.
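To make that arithmetic concrete, here is a rough sketch of the estimate. It assumes that the array holds one row per unordered pair of samples, i.e. n_samples * (n_samples - 1) / 2 rows, which is my reading of l1_cross_distances and not something stated in the tracebacks. The helper name cross_distance_gb is made up for illustration, and 26 is the number of feature columns selected for X in Snippet2 of the question.

```python
def cross_distance_gb(n_samples, n_features, itemsize=8):
    """Estimate the size in GB of the (n_pairs, n_features) distance array."""
    # Assumption: one row per unordered pair of samples.
    n_pairs = n_samples * (n_samples - 1) // 2
    # itemsize=8 corresponds to np.float64.
    return n_pairs * n_features * itemsize / 1e9


# Row counts tried in the question, with 26 feature columns.
for n_samples in (1000, 5000, 11000):
    print(n_samples, round(cross_distance_gb(n_samples, 26), 2))
```

Under these assumptions the 5k-row case already needs about 2.6 GB for this one array, and the 11k-row case about 12.6 GB. A 32-bit Python process (the traceback path shows a win32 build) typically cannot address more than roughly 2 to 4 GB, which would be consistent with the first two errors.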

Furthermore, the title and description of your question are misleading or incorrect: the error you give is about a Gaussian Process model, not a Random Forest.
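As for the third error in the question ("Multiple input features cannot have the same value"), my understanding is that the Gaussian Process fit raises it when two rows of X are identical, so that some pairwise distance is zero. That is an assumption about the library's check, not something the traceback proves. A minimal sketch for counting repeated rows, with a made-up helper name and toy data:

```python
import numpy as np


def count_duplicate_rows(X):
    """Return how many rows of X repeat an earlier row."""
    X = np.asarray(X)
    # np.unique with axis=0 keeps one copy of each distinct row.
    unique_rows = np.unique(X, axis=0)
    return X.shape[0] - unique_rows.shape[0]


# Toy data for illustration only: the third row repeats the first.
X_demo = np.array([[1.0, 2.0], [3.0, 4.0], [1.0, 2.0]])
print(count_duplicate_rows(X_demo))
```

If this returns a non-zero count for your real X, dropping the repeated rows before fitting the Gaussian Process model would be the first thing to try.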


4 Comments

Many thanks for the reply. I am running a script which uses RandomForest, LinearRegression, LogisticRegression, GradientBoostingRegressor and GaussianProcess, and I was not aware which one was causing the issue.
Yep, on looking back, it is there staring at me. Facepalm.
Please edit the title and description to fix it so that other people don't get confused when googling for issues related to random forests.
Yep, done :) More characters for characters' sake.
