I would like to evaluate the performance of a model pipeline. I am not training my model on the ground-truth labels that I am evaluating the pipeline against, so a cross-validation scheme is unnecessary. However, I would still like to use the grid-search functionality provided by sklearn.

Is it possible to use sklearn.model_selection.GridSearchCV without splitting the data? In other words, I would like to run the grid search and get scores on the full dataset that I pass into the pipeline.

Here is a simple example:

I might wish to choose the optimal k for KMeans. I am actually going to be using KMeans on many datasets that are similar in some sense. It so happens that I have some ground-truth labels for a few such datasets, which I will call my "training" data. So, instead of using something like BIC, I decide to simply pick the optimal k for my training data and employ that k for future datasets. Searching over possible values of k is a grid search, and KMeans is available in the sklearn library, so I can very easily define a grid search on this model. Incidentally, KMeans accepts a y value that it simply ignores, so y passes through and can be used in a GridSearchCV scorer. However, there is no sense in doing cross-validation here, since my individual KMeans models never see the ground-truth labels and are therefore incapable of overfitting.

To be clear, the above is simply a contrived example meant to justify a possible use case, for anyone worried that I might abuse this functionality. What I am actually interested in is how to avoid splitting the data in GridSearchCV.
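For concreteness, here is a by-hand sketch of the selection I have in mind. I am assuming adjusted_rand_score as the ground-truth metric and a handful of candidate k values purely for illustration; any supervised clustering metric would do:

from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

def pick_k(X, y_true, candidates=(2, 3, 5, 10)):
    # Fit one KMeans per candidate k; the fit never sees y_true,
    # so there is nothing to overfit -- y_true is only used for scoring.
    scores = {k: adjusted_rand_score(y_true, KMeans(n_clusters=k).fit_predict(X))
              for k in candidates}
    return max(scores, key=scores.get)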

  • Doing GridSearchCV is equivalent to doing CV. Of course, from a technical standpoint, you may do that without a train/test split. But that invalidates the train/validate/test philosophy commonly accepted in ML. Commented Feb 19, 2020 at 16:28
  • @SergeyBushmanov the train/validate/test philosophy assumes that one is training one's model against the ground-truth labels (the same labels that one might be testing against). My training pipeline does not use the ground-truth labels. Therefore, cross-validation does nothing, and overfitting to the ground-truth labels is impossible. Commented Feb 19, 2020 at 16:55
  • "Of course from technical standpoint you may do that without train/test split." Of course it is. However, I am wondering if the existing gridsearch helpers in sklearn can aid in this? Commented Feb 19, 2020 at 16:56
  • In theory you can do gs=GridSearchCV(scoring=None, cv=None); gs.fit(X, None), but you should be more specific about what your problem is.... Commented Feb 19, 2020 at 17:02
  • cv=None: the docs say "None, to use the default 5-fold cross validation". This does not turn off cross-validation. Commented Feb 19, 2020 at 17:17

2 Answers

The docs state that the cv parameter in the GridSearchCV constructor can optionally accept "An iterable yielding (train, test) splits as arrays of indices." It turns out that the "arrays of indices" bit is not strictly enforced: you can pass in arbitrary objects that can be used to index arrays. If we hand in an iterable that yields the whole array for both the train and the test "split", we can circumvent the cross-validation behavior.

Here is one way to accomplish this for the example given in the question:

import sklearn.cluster
import sklearn.model_selection

grid_search = sklearn.model_selection.GridSearchCV(
    sklearn.cluster.KMeans(),
    {"n_clusters": [2, 3, 4, 5, 7, 10, 20]},  # KMeans calls its k parameter n_clusters
    cv=((slice(None), slice(None)),),  # a single "split" whose train and test are the full dataset
)

If you pass the ground-truth labels as y, together with a scorer that uses them, this will evaluate the KMeans fit for each candidate number of clusters against the entire dataset.
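For instance, assuming you have X and ground-truth labels y_true in hand, something along these lines should work (adjusted_rand_score is just one possible supervised clustering scorer, chosen here for illustration):

from sklearn.metrics import adjusted_rand_score, make_scorer

grid_search.set_params(scoring=make_scorer(adjusted_rand_score))
grid_search.fit(X, y_true)       # each candidate n_clusters is fit and scored on the full dataset
print(grid_search.best_params_)  # the candidate that best matches the ground-truth labels
print(grid_search.cv_results_["mean_test_score"])  # one score per candidate; only one "split", so no averaging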


You need to do cross-validation if you do grid search; otherwise you will overfit on the test data, because you evaluate several settings of hyper-parameters on the same data.

4 Comments

I reiterate that I am not employing the ground truth labels during training. It is impossible to overfit to data that I do not provide to the training algorithm.
@Scott then I do not understand what you are doing. Could you clarify? Do you not use labels for training? Is it some kind of unsupervised learning?
Sort of. Note that grid search is simply an optimization method. Various optimizers that don't involve cross-validation are used for model selection all the time. For example, gradient descent methods optimize over parameters when the optimization surface is differentiable. My optimization requires grid search. The 'hyperparameters' that usually go into GridSearchCV are, in fact, simply my model parameters.
I will try to contrive a simple example.
