For a research project, I am analyzing correlations using various machine learning algorithms. To do so, I run the following code (simplified for demonstration):
import pandas as pd
from scipy.stats import pearsonr
from sklearn.model_selection import cross_val_score
from tqdm import tqdm

# Make a custom scorer for Pearson's r (from scipy)
scorer = lambda regressor, X, y: pearsonr(regressor.predict(X), y)[0]
# Create a progress bar (288 datasets * 50 pipelines = 14400 iterations)
progress_bar = tqdm(total=14400)
# Initialize a dataframe to store scores
df = pd.DataFrame(columns=["data", "pipeline", "r"])
# Loop over datasets
for data in datasets:  # 288 datasets
    X_train = data.X_train
    X_test = data.X_test
    y_train = data.y_train
    y_test = data.y_test
    # Loop over pipelines
    for pipeline in pipelines:  # 50 pipelines
        # Three samples per test fold, hence cv = n_samples / 3
        scores = cross_val_score(pipeline, X_train, y_train, cv=int(len(X_train) / 3), scoring=scorer)
        r = scores.mean()
        # Create a new row to save the result
        df.loc[(df.last_valid_index() or 0) + 1] = {"data": data.name, "pipeline": pipeline, "r": r}
        progress_bar.update(1)
progress_bar.close()
X_train is a pandas DataFrame with shape (20, 34).
X_test is a pandas DataFrame with shape (9, 34).
y_train is a pandas Series of length 20.
y_test is a pandas Series of length 9.
An example of a pipeline is:
Pipeline(steps=[('scaler', StandardScaler()),
('poly', PolynomialFeatures(degree=9)),
('regressor', LinearRegression())])
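For anyone who wants to poke at this, here is a single dataset/pipeline combination pulled out of the loops, using random data in the shapes described above (so it is not my real data, just a stand-in); with degree=2 it runs in well under a second, and raising the degree towards 9 is what the real run does:

import numpy as np
import pandas as pd
from scipy.stats import pearsonr
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

rng = np.random.default_rng(0)
# Random stand-ins with the shapes from the question: (20, 34) features, length-20 target
X_train = pd.DataFrame(rng.normal(size=(20, 34)))
y_train = pd.Series(rng.normal(size=20))

# Same custom scorer as above
scorer = lambda regressor, X, y: pearsonr(regressor.predict(X), y)[0]
pipeline = Pipeline(steps=[("scaler", StandardScaler()),
                           ("poly", PolynomialFeatures(degree=2)),  # the real run uses degree=9
                           ("regressor", LinearRegression())])

scores = cross_val_score(pipeline, X_train, y_train, cv=int(len(X_train) / 3), scoring=scorer)
print(scores.mean())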
However, after approximately 8700 iterations (total), the run fails with the following error (a ValueError raised by cross_val_score, whose underlying cause is a MemoryError inside the individual fits):
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-3-9ff48105b8ff> in <module>
40 y = targets[label]
41 #Finally, we can test the correlation
---> 42 scores = cross_val_score(regressor, X_train, y.loc[train_indices], cv=int(len(X_train)/3), scoring=lambda regressor, X, y: pearsonr(regressor.predict(X), y)[0]) #Three samples per test set, as that seems like the logical minimum for Pearson
43 r = scores.mean()
44 # print(f"{regressor} was able to predict {label} based on the {band} band of the {network} network with a Pearson's r of {r} of the data that could be explained.\n")
C:\ProgramData\Anaconda3\lib\site-packages\sklearn\model_selection\_validation.py in cross_val_score(estimator, X, y, groups, scoring, cv, n_jobs, verbose, fit_params, pre_dispatch, error_score)
513 scorer = check_scoring(estimator, scoring=scoring)
514
--> 515 cv_results = cross_validate(
516 estimator=estimator,
517 X=X,
C:\ProgramData\Anaconda3\lib\site-packages\sklearn\model_selection\_validation.py in cross_validate(estimator, X, y, groups, scoring, cv, n_jobs, verbose, fit_params, pre_dispatch, return_train_score, return_estimator, error_score)
283 )
284
--> 285 _warn_or_raise_about_fit_failures(results, error_score)
286
287 # For callabe scoring, the return type is only know after calling. If the
C:\ProgramData\Anaconda3\lib\site-packages\sklearn\model_selection\_validation.py in _warn_or_raise_about_fit_failures(results, error_score)
365 f"Below are more details about the failures:\n{fit_errors_summary}"
366 )
--> 367 raise ValueError(all_fits_failed_message)
368
369 else:
ValueError:
All the 6 fits failed.
It is very likely that your model is misconfigured.
You can try to debug the error by setting error_score='raise'.
Below are more details about the failures:
--------------------------------------------------------------------------------
2 fits failed with the following error:
Traceback (most recent call last):
File "C:\ProgramData\Anaconda3\lib\site-packages\sklearn\model_selection\_validation.py", line 686, in _fit_and_score
estimator.fit(X_train, y_train, **fit_params)
File "C:\ProgramData\Anaconda3\lib\site-packages\sklearn\pipeline.py", line 382, in fit
self._final_estimator.fit(Xt, y, **fit_params_last_step)
File "C:\ProgramData\Anaconda3\lib\site-packages\sklearn\linear_model\_base.py", line 692, in fit
X, y, X_offset, y_offset, X_scale = _preprocess_data(
File "C:\ProgramData\Anaconda3\lib\site-packages\sklearn\linear_model\_base.py", line 262, in _preprocess_data
X = check_array(X, copy=copy, accept_sparse=["csr", "csc"], dtype=FLOAT_DTYPES)
File "C:\ProgramData\Anaconda3\lib\site-packages\sklearn\utils\validation.py", line 925, in check_array
array = np.array(array, dtype=dtype, order=order)
numpy.core._exceptions._ArrayMemoryError: Unable to allocate 41.8 GiB for an array with shape (16, 350343565) and data type float64
--------------------------------------------------------------------------------
4 fits failed with the following error:
Traceback (most recent call last):
File "C:\ProgramData\Anaconda3\lib\site-packages\sklearn\model_selection\_validation.py", line 686, in _fit_and_score
estimator.fit(X_train, y_train, **fit_params)
File "C:\ProgramData\Anaconda3\lib\site-packages\sklearn\pipeline.py", line 382, in fit
self._final_estimator.fit(Xt, y, **fit_params_last_step)
File "C:\ProgramData\Anaconda3\lib\site-packages\sklearn\linear_model\_base.py", line 692, in fit
X, y, X_offset, y_offset, X_scale = _preprocess_data(
File "C:\ProgramData\Anaconda3\lib\site-packages\sklearn\linear_model\_base.py", line 262, in _preprocess_data
X = check_array(X, copy=copy, accept_sparse=["csr", "csc"], dtype=FLOAT_DTYPES)
File "C:\ProgramData\Anaconda3\lib\site-packages\sklearn\utils\validation.py", line 925, in check_array
array = np.array(array, dtype=dtype, order=order)
numpy.core._exceptions._ArrayMemoryError: Unable to allocate 44.4 GiB for an array with shape (17, 350343565) and data type float64
What can I do to prevent this error, and how did it originate in the first place? I tried using sklearn's clone function on the pipeline that was still in memory and then calling fit, but I got the same error. However, when I created a new pipeline (still in the same session) and called fit on it, it did work.
Does the error always appear after roughly the same number of iterations (and in which "loop"?), and does it go away with some sort of restart? Is the 41 GiB reasonable for your problem? For your computer?
X (or Xt, after the transform steps), the first argument to fit, is getting too large. The error occurs when scikit-learn tries to validate it, converting it into a multidimensional numeric float array. The challenge is to trace this back to your code and determine the shape and dtype of the input frame or array. Does a (16, 350343565) shape make any sense?
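To make that number concrete: with the defaults (include_bias=True, interaction_only=False), PolynomialFeatures turns n input columns into comb(n + degree, degree) output columns, which grows combinatorially with the degree. A quick back-of-the-envelope check (the 32-column input is inferred from the traceback, not something stated in the question):

import math

def n_poly_features(n_features, degree):
    # Column count produced by PolynomialFeatures with the defaults
    # (include_bias=True, interaction_only=False)
    return math.comb(n_features + degree, degree)

print(n_poly_features(34, 9))  # 563921995 -> degree 9 on the 34-column data described above
print(n_poly_features(32, 9))  # 350343565 -> exactly the second dimension in the failing fits
# A float64 array of shape (16, 350343565) needs 16 * 350343565 * 8 bytes,
# i.e. roughly 41.8 GiB -- the exact figure in the "Unable to allocate" message.

If that column count matches whichever dataset was being processed when it failed, the allocation that blows up is simply the dense degree-9 design matrix for one training fold being materialized inside cross_val_score; anything that shrinks that expansion (a lower degree, interaction_only=True, or fewer input columns) should make the error go away.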