
For a research project, I am analyzing correlations using various machine learning algorithms. As such, I run the following code (simplified for demonstration):

# Make a custom scorer for Pearson's r (from scipy)
scorer = lambda regressor, X, y: pearsonr(regressor.predict(X), y)[0]

# Create a progress bar
progress_bar = tqdm(total=14400)  # 288 datasets x 50 pipelines

# Initialize a dataframe to store scores
df = pd.DataFrame(columns=["data", "pipeline", "r"])

# Loop over datasets
for data in datasets: #288 datasets
    X_train = data.X_train
    X_test = data.X_test
    y_train = data.y_train
    y_test = data.y_test
    
    # Loop over pipelines
    for pipeline in pipelines: #50 pipelines
        scores = cross_val_score(pipeline, X_train, y_train, cv=int(len(X_train)/3), scoring=scorer)
        r = scores.mean()
        # Create a new row to save data
        df.loc[(df.last_valid_index() or 0) + 1] = {"data": data.name, "pipeline": pipeline, "r": r}
        progress_bar.update(1)

progress_bar.close()
    

X_train is a pandas DataFrame with shape (20, 34)

X_test is a pandas DataFrame with shape (9, 34)

y_train is a pandas Series with length 20

y_test is a pandas Series with length 9

An example of pipeline is:

Pipeline(steps=[('scaler', StandardScaler()),
                ('poly', PolynomialFeatures(degree=9)),
                ('regressor', LinearRegression())])

However, after approximately 8,700 iterations in total, I get the following error (a ValueError raised by cross_val_score that wraps MemoryErrors from the individual fits):

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-3-9ff48105b8ff> in <module>
     40                 y = targets[label]
     41                 #Finally, we can test the correlation
---> 42                 scores = cross_val_score(regressor, X_train, y.loc[train_indices], cv=int(len(X_train)/3), scoring=lambda regressor, X, y: pearsonr(regressor.predict(X), y)[0]) #Three samples per test set, as that seems like the logical minimum for Pearson
     43                 r = scores.mean()
     44 #                     print(f"{regressor} was able to predict {label} based on the {band} band of the {network} network with a Pearson's r of {r} of the data that could be explained.\n")

C:\ProgramData\Anaconda3\lib\site-packages\sklearn\model_selection\_validation.py in cross_val_score(estimator, X, y, groups, scoring, cv, n_jobs, verbose, fit_params, pre_dispatch, error_score)
    513     scorer = check_scoring(estimator, scoring=scoring)
    514 
--> 515     cv_results = cross_validate(
    516         estimator=estimator,
    517         X=X,

C:\ProgramData\Anaconda3\lib\site-packages\sklearn\model_selection\_validation.py in cross_validate(estimator, X, y, groups, scoring, cv, n_jobs, verbose, fit_params, pre_dispatch, return_train_score, return_estimator, error_score)
    283     )
    284 
--> 285     _warn_or_raise_about_fit_failures(results, error_score)
    286 
    287     # For callabe scoring, the return type is only know after calling. If the

C:\ProgramData\Anaconda3\lib\site-packages\sklearn\model_selection\_validation.py in _warn_or_raise_about_fit_failures(results, error_score)
    365                 f"Below are more details about the failures:\n{fit_errors_summary}"
    366             )
--> 367             raise ValueError(all_fits_failed_message)
    368 
    369         else:

ValueError: 
All the 6 fits failed.
It is very likely that your model is misconfigured.
You can try to debug the error by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
2 fits failed with the following error:
Traceback (most recent call last):
  File "C:\ProgramData\Anaconda3\lib\site-packages\sklearn\model_selection\_validation.py", line 686, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "C:\ProgramData\Anaconda3\lib\site-packages\sklearn\pipeline.py", line 382, in fit
    self._final_estimator.fit(Xt, y, **fit_params_last_step)
  File "C:\ProgramData\Anaconda3\lib\site-packages\sklearn\linear_model\_base.py", line 692, in fit
    X, y, X_offset, y_offset, X_scale = _preprocess_data(
  File "C:\ProgramData\Anaconda3\lib\site-packages\sklearn\linear_model\_base.py", line 262, in _preprocess_data
    X = check_array(X, copy=copy, accept_sparse=["csr", "csc"], dtype=FLOAT_DTYPES)
  File "C:\ProgramData\Anaconda3\lib\site-packages\sklearn\utils\validation.py", line 925, in check_array
    array = np.array(array, dtype=dtype, order=order)
numpy.core._exceptions._ArrayMemoryError: Unable to allocate 41.8 GiB for an array with shape (16, 350343565) and data type float64

--------------------------------------------------------------------------------
4 fits failed with the following error:
Traceback (most recent call last):
  File "C:\ProgramData\Anaconda3\lib\site-packages\sklearn\model_selection\_validation.py", line 686, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "C:\ProgramData\Anaconda3\lib\site-packages\sklearn\pipeline.py", line 382, in fit
    self._final_estimator.fit(Xt, y, **fit_params_last_step)
  File "C:\ProgramData\Anaconda3\lib\site-packages\sklearn\linear_model\_base.py", line 692, in fit
    X, y, X_offset, y_offset, X_scale = _preprocess_data(
  File "C:\ProgramData\Anaconda3\lib\site-packages\sklearn\linear_model\_base.py", line 262, in _preprocess_data
    X = check_array(X, copy=copy, accept_sparse=["csr", "csc"], dtype=FLOAT_DTYPES)
  File "C:\ProgramData\Anaconda3\lib\site-packages\sklearn\utils\validation.py", line 925, in check_array
    array = np.array(array, dtype=dtype, order=order)
numpy.core._exceptions._ArrayMemoryError: Unable to allocate 44.4 GiB for an array with shape (17, 350343565) and data type float64

What can I do to prevent this error, and how did it originate in the first place? I tried using sklearn's clone function on the pipeline that was still in my memory, and then calling fit, but I got the same error. However, when I created a new pipeline (still in the same session), and called fit on it, it did work.

  • It will be hard to help you: we can't replicate the problem without your data or knowing your machine's memory size, and I for one don't like to hang my computer with memory errors. It also seems to be a transitory problem, occurring after x iterations (which "loop"?) and going away with some sort of restart. Is 41 GiB reasonable for your problem? For your computer? Commented Jun 8, 2022 at 14:52
  • Looks like X (or Xt), the first argument to fit, is getting too large. The error occurs when sklearn tries to validate it, converting it into a multidimensional float array. The challenge is to trace this back to your code and determine the shape and dtype of the input frame or array. Does a (16, 350343565) shape make any sense? Commented Jun 8, 2022 at 16:22

2 Answers


The problem is the ginormous basis expansion you're doing. Adding degree-9 polynomial features to 34 input features produces 563,921,995 columns (every monomial of degree 0 through 9, bias term included). Even though you only have a handful of samples, it's no wonder you're running out of memory.

Just look at what a 2nd degree PolynomialFeatures gives you for 4 features:

>>> import numpy as np
>>> from sklearn.preprocessing import PolynomialFeatures

>>> arr = np.random.random(size=(10, 4))
>>> poly = PolynomialFeatures(degree=2).fit(arr)
>>> list(poly.get_feature_names_out())

This results in:

['1',
 'x0',
 'x1',
 'x2',
 'x3',
 'x0^2',
 'x0 x1',
 'x0 x2',
 'x0 x3',
 'x1^2',
 'x1 x2',
 'x1 x3',
 'x2^2',
 'x2 x3',
 'x3^2']

If you use even 52 features on 20 instances of data, you are likely well into overfitting territory. Even degree 2 polynomials on your data will give you 630 features, which is way too many. I would use inspection (e.g. pair plots), feature importance, and maybe PCA to reduce the dimensionality, then ditch the basis expansion until you know what direction things are going.
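As a rough sketch of that direction (not a drop-in replacement for your pipelines; n_components=10 is an arbitrary placeholder you would tune, e.g. by inspecting the explained variance), a dimensionality-reducing pipeline could look like this:

# Minimal sketch: reduce the 34 features with PCA instead of expanding them.
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

pipeline = Pipeline(steps=[
    ("scaler", StandardScaler()),    # scale features before PCA
    ("pca", PCA(n_components=10)),   # placeholder value, not a recommendation
    ("regressor", LinearRegression()),
])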

With many features and a high polynomial degree it may become impractical to ask sklearn for the full list of feature names just to count them. In general, PolynomialFeatures with n input features and degree d produces binom(n + d, d) columns (bias term included), which you can compute with scipy's binomial coefficient function:

>>> from scipy.special import binom
>>> binom(34 + 9, 9)
563921995.0

If you don't want powers of individual features to be included, only products of distinct features, you can specify interaction_only=True. That produces fewer features, but at degree 9 it is still far too many.
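If you want to see how both variants scale before instantiating anything, here is a small sketch that counts the output columns with scipy's comb (the helper function is mine, not part of sklearn; both counts include the bias column):

from scipy.special import comb

def n_poly_features(n_features, degree, interaction_only=False):
    # Hypothetical helper: number of columns PolynomialFeatures would produce,
    # including the bias column.
    if interaction_only:
        # products of distinct features only: sum of C(n, k) for k = 0..degree
        return sum(comb(n_features, k, exact=True) for k in range(degree + 1))
    # all monomials of total degree 0..degree: C(n + degree, degree)
    return comb(n_features + degree, degree, exact=True)

print(n_poly_features(34, 9))                         # 563921995
print(n_poly_features(34, 9, interaction_only=True))  # 77663192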


2 Comments

Thank you, just what I was looking for! Is there any way to directly limit the number of features created by PolynomialFeatures? Also, how do you calculate the number of features you will end up with, given the polynomial degree and the number of initial features?
I updated my answer to try to address these questions.

MemoryError means that the Python interpreter has run out of RAM and swap space to allocate new memory. The usual solutions are: 1) work with a smaller dataset, 2) get a computer with more RAM, 3) check that your code does not leak memory.
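For point 3, one quick way to check whether a loop is accumulating memory is the standard library's tracemalloc module; a minimal sketch (the loop body here is just a stand-in for your real per-iteration work):

import tracemalloc

tracemalloc.start()

for i in range(5):            # stand-in for the dataset/pipeline loops
    work = [0.0] * 1_000_000  # stand-in for the real per-iteration work
    current, peak = tracemalloc.get_traced_memory()
    print(f"iteration {i}: current={current / 1e6:.1f} MB, peak={peak / 1e6:.1f} MB")

tracemalloc.stop()

If the "current" figure keeps climbing across iterations, something is holding on to old objects.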

2 Comments

My dataset is not very large (look at its size in the post). Furthermore, my computer is quite powerful, but there is no computer that has 44 GB of RAM, I think. My suspicion is that the pipelines get increasingly larger with every iteration, but, like I said, using sklearn's clone function does not help.
Or (4) Use Dask.
