
I have X_train and y_train as two numpy.ndarrays of shape (32561, 108) and (32561,) respectively.

I get a MemoryError every time I call fit on my GaussianProcessClassifier.

>>> import pandas as pd
>>> import numpy as np
>>> from sklearn.gaussian_process import GaussianProcessClassifier
>>> from sklearn.gaussian_process.kernels import RBF
>>> X_train.shape
(32561, 108)
>>> y_train.shape
(32561,)
>>> gp_opt = GaussianProcessClassifier(kernel=1.0 * RBF(length_scale=1.0))
>>> gp_opt.fit(X_train,y_train)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/retsim/.local/lib/python2.7/site-packages/sklearn/gaussian_process/gpc.py", line 613, in fit
    self.base_estimator_.fit(X, y)
  File "/home/retsim/.local/lib/python2.7/site-packages/sklearn/gaussian_process/gpc.py", line 209, in fit
    self.kernel_.bounds)]
  File "/home/retsim/.local/lib/python2.7/site-packages/sklearn/gaussian_process/gpc.py", line 427, in _constrained_optimization
    fmin_l_bfgs_b(obj_func, initial_theta, bounds=bounds)
  File "/home/retsim/anaconda2/lib/python2.7/site-packages/scipy/optimize/lbfgsb.py", line 199, in fmin_l_bfgs_b
    **opts)
  File "/home/retsim/anaconda2/lib/python2.7/site-packages/scipy/optimize/lbfgsb.py", line 335, in _minimize_lbfgsb
    f, g = func_and_grad(x)
  File "/home/retsim/anaconda2/lib/python2.7/site-packages/scipy/optimize/lbfgsb.py", line 285, in func_and_grad
    f = fun(x, *args)
  File "/home/retsim/anaconda2/lib/python2.7/site-packages/scipy/optimize/optimize.py", line 292, in function_wrapper
    return function(*(wrapper_args + args))
  File "/home/retsim/anaconda2/lib/python2.7/site-packages/scipy/optimize/optimize.py", line 63, in __call__
    fg = self.fun(x, *args)
  File "/home/retsim/.local/lib/python2.7/site-packages/sklearn/gaussian_process/gpc.py", line 201, in obj_func
    theta, eval_gradient=True)
  File "/home/retsim/.local/lib/python2.7/site-packages/sklearn/gaussian_process/gpc.py", line 338, in log_marginal_likelihood
    K, K_gradient = kernel(self.X_train_, eval_gradient=True)
  File "/home/retsim/.local/lib/python2.7/site-packages/sklearn/gaussian_process/kernels.py", line 753, in __call__
    K1, K1_gradient = self.k1(X, Y, eval_gradient=True)
  File "/home/retsim/.local/lib/python2.7/site-packages/sklearn/gaussian_process/kernels.py", line 1002, in __call__
    K = self.constant_value * np.ones((X.shape[0], Y.shape[0]))
  File "/home/retsim/.local/lib/python2.7/site-packages/numpy/core/numeric.py", line 188, in ones
    a = empty(shape, dtype, order)
MemoryError
>>> 

Why am I getting this error, and how can I fix it?

2 Answers

11

According to the Scikit-Learn documentation, the estimator GaussianProcessClassifier (as well as GaussianProcessRegressor) has a parameter copy_X_train, which is set to True by default:

class sklearn.gaussian_process.GaussianProcessClassifier(kernel=None, optimizer='fmin_l_bfgs_b', n_restarts_optimizer=0, max_iter_predict=100, warm_start=False, copy_X_train=True, random_state=None, multi_class='one_vs_rest', n_jobs=1)

The description for the parameter copy_X_train notes that:

If True, a persistent copy of the training data is stored in the object. Otherwise, just a reference to the training data is stored, which might cause predictions to change if the data is modified externally.

I tried fitting the estimator on a training dataset of a similar size to the OP's, on a PC with 32 GB of RAM. With copy_X_train set to True, 'a persistent copy of the training data' was possibly eating up my RAM, resulting in a MemoryError. Setting this parameter to False fixed the issue.
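A minimal sketch of that fix, using the same kernel as in the question (X_train and y_train as defined by the OP):

from sklearn.gaussian_process import GaussianProcessClassifier
from sklearn.gaussian_process.kernels import RBF

# copy_X_train=False stores a reference to the training data in the
# fitted estimator instead of a persistent copy
gp_opt = GaussianProcessClassifier(kernel=1.0 * RBF(length_scale=1.0),
                                   copy_X_train=False)
gp_opt.fit(X_train, y_train)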

Scikit-Learn's description notes that, with this setting, 'just a reference to the training data is stored, which might cause predictions to change if the data is modified externally'. My interpretation of this statement is:

Instead of storing a copy of the whole training dataset (an n x d array of n observations and d features) in the fitted estimator, only a reference to this dataset is stored, hence avoiding the extra RAM usage. As long as the dataset stays intact externally (i.e. not within the fitted estimator), it can be reliably fetched when a prediction has to be made. Modifying the dataset, however, affects the predictions.

There may be better interpretations and theoretical explanations.
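One way to check the reference-vs-copy behavior directly is a small experiment (not from the original answer; it peeks at the fitted estimator's internals via the base_estimator_ and X_train_ attributes that appear in the OP's traceback, and assumes the input is already float64 so sklearn's input validation does not copy it):

import numpy as np
from sklearn.gaussian_process import GaussianProcessClassifier

# Hypothetical tiny binary problem; float64 input so validation should not copy it
rng = np.random.RandomState(0)
X_small = rng.randn(50, 4)
y_small = (X_small[:, 0] > 0).astype(int)

gp = GaussianProcessClassifier(copy_X_train=False).fit(X_small, y_small)
print(np.shares_memory(gp.base_estimator_.X_train_, X_small))  # True: reference

gp = GaussianProcessClassifier(copy_X_train=True).fit(X_small, y_small)
print(np.shares_memory(gp.base_estimator_.X_train_, X_small))  # False: copy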


9

On line 400 of gpc.py, the implementation for the classifier you're using, there's a matrix created that has size (N, N), where N is the number of observations. So the code is trying to create a matrix of shape (32561, 32561). That will obviously cause some problems, since that matrix has over a billion elements.
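For a sense of scale, a quick back-of-the-envelope check in the interpreter (plain arithmetic, assuming the default float64 dtype at 8 bytes per element):

>>> 32561 * 32561            # elements in the (N, N) kernel matrix
1060218721
>>> 32561 * 32561 * 8 / 1e9  # bytes as float64, in GB
8.481749768

So the single allocation in the traceback needs roughly 8.5 GB of contiguous memory, before counting the kernel gradient arrays the optimizer needs alongside it.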

As to why it's doing this, I don't really know scikit-learn's implementation, but in general, Gaussian processes require computing a covariance matrix over every pair of training samples, which is why they scale poorly to large numbers of observations. (The docs also note that they lose efficiency in high-dimensional spaces, meaning more than a few dozen features.)

My only recommendation for how to fix it is to work in batches. Scikit-learn may have some utilities to generate batches for you, or you can do it manually.
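GaussianProcessClassifier has no partial_fit, so one crude way to follow this advice is to fit on a random subsample small enough that its (n, n) kernel matrix fits in memory (a sketch; n_subset = 2000 is an arbitrary choice, not from the answer):

import numpy as np
from sklearn.gaussian_process import GaussianProcessClassifier
from sklearn.gaussian_process.kernels import RBF

n_subset = 2000  # arbitrary; a (2000, 2000) float64 kernel matrix is ~32 MB
rng = np.random.RandomState(0)
idx = rng.choice(X_train.shape[0], size=n_subset, replace=False)

gp_opt = GaussianProcessClassifier(kernel=1.0 * RBF(length_scale=1.0))
gp_opt.fit(X_train[idx], y_train[idx])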

2 Comments

+1. If you already know this and have a lot of RAM, so you expect it to work, then double-check that you are running 64-bit Python. 32-bit Python on a 64-bit OS will probably only be able to access 2GB of RAM.
The link doesn't work anymore; gpc.py was moved to _gpc.py.
