Scikit Learn GridSearchCV without cross validation (unsupervised learning)

Question

Is it possible to use GridSearchCV without cross validation? I am trying to optimize the number of clusters in KMeans clustering via grid search, and thus I don't need or want cross validation.

The documentation is also confusing me because under the fit() method, it has an option for unsupervised learning (says to use None for unsupervised learning). But if you want to do unsupervised learning, you need to do it without cross validation and there appears to be no option to get rid of cross validation.

You can implement a custom cv which will put all data into both training and test.
– Vivek Kumar, Commented Jun 20, 2017 at 4:53

Free Palestine · Accepted Answer · 2018-08-22 18:30:13Z

51

After much searching, I was able to find this thread. It appears that you can get rid of cross validation in GridSearchCV if you use:

cv=[(slice(None), slice(None))]

I have tested this against my own coded version of grid search without cross validation and I get the same results from both methods. I am posting this answer to my own question in case others have the same issue.

Edit: to answer jjrr's question in the comments, here is an example use case:

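from sklearn.cluster import MeanShift
from sklearn.metrics import silhouette_score as sc
from sklearn.model_selection import GridSearchCV

def cv_silhouette_scorer(estimator, X):
    # Refit on the data being scored so labels_ matches X row for row.
    estimator.fit(X)
    cluster_labels = estimator.labels_
    num_labels = len(set(cluster_labels))
    num_samples = X.shape[0]  # works for DataFrames and NumPy arrays alike
    # The silhouette score is undefined for 1 label or 1 label per sample.
    if num_labels == 1 or num_labels == num_samples:
        return -1
    return sc(X, cluster_labels)

# A single "split" whose train and test sets are both the full data.
cv = [(slice(None), slice(None))]
# param_dict, df and cols_of_interest are defined elsewhere.
gs = GridSearchCV(estimator=MeanShift(), param_grid=param_dict,
                  scoring=cv_silhouette_scorer, cv=cv, n_jobs=-1)
gs.fit(df[cols_of_interest])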
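
Some commenters report an AttributeError with the slice-based split on certain versions. A variant that may be more portable is to pass explicit index arrays instead of slices. The following is a minimal sketch for the KMeans case from the question, not a tested replacement for the code above: the synthetic data, the grid values and the scorer name silhouette_scorer are assumptions made for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score
from sklearn.model_selection import GridSearchCV

# Synthetic data: three well-separated blobs (illustrative only).
X, _ = make_blobs(n_samples=150, centers=[[0, 0], [10, 10], [-10, 10]],
                  cluster_std=0.5, random_state=0)

def silhouette_scorer(estimator, X, y=None):
    # Score the already-fitted estimator by the silhouette of its labels.
    labels = estimator.predict(X)
    n_labels = len(set(labels))
    if n_labels == 1 or n_labels == X.shape[0]:
        return -1.0  # silhouette is undefined in these cases
    return silhouette_score(X, labels)

# One "split" whose train and test indices both cover every row.
all_rows = np.arange(X.shape[0])
gs = GridSearchCV(
    estimator=KMeans(n_init=10, random_state=0),
    param_grid={"n_clusters": [2, 3, 4, 5]},
    scoring=silhouette_scorer,
    cv=[(all_rows, all_rows)],
)
gs.fit(X)
print(gs.best_params_)
```

Because the train and test indices are identical, the estimator is already fitted on the rows being scored, so this scorer calls predict rather than refitting.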

edited Aug 22, 2018 at 18:30

answered Jun 21, 2017 at 17:10

Free Palestine


6 Comments

Kirill Dolmatov Over a year ago

I get the error: AttributeError: 'slice' object has no attribute 'flags'. Python 3.6, sklearn 0.20.3

Tobbey Over a year ago

AttributeError: 'slice' object has no attribute 'flags'

MehmedB Over a year ago

Python 3.8: AttributeError: 'memmap' object has no attribute 'index'

gary69 Over a year ago

why do you call fit in the scoring function?

MJM Over a year ago

I am using hdbscan and would like to implement this method, but when I run GridSearchCV in verbose =10 it says score=nan in the outputs. I have scorer= make_scorer(hdbscan.validity.validity_index,greater_is_better=True). Any help appreciated


Scratch'N'Purr · Accepted Answer · 2017-06-20 19:07:36Z

I'm going to answer your question since it still seems to be unanswered. If you write the search as a plain for loop over the candidate values, you can parallelize it with the multiprocessing module (multiprocessing.dummy offers the same Pool API backed by threads).

import functools
from multiprocessing.dummy import Pool  # thread-based Pool

from sklearn.base import clone
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

kmeans = KMeans()

# define your custom function for passing into each thread
def find_cluster(n_clusters, kmeans, X):
    # clone first: the threads share one kmeans object, so calling
    # set_params on it directly could mix up n_clusters between threads
    model = clone(kmeans).set_params(n_clusters=n_clusters)
    labels = model.fit_predict(X)  # fit & predict
    score = silhouette_score(X, labels)  # get the score

    return score

# Now's the parallel implementation (X is your feature matrix)
clusters = [3, 4, 5]
pool = Pool()
results = pool.map(functools.partial(find_cluster, kmeans=kmeans, X=X), clusters)
pool.close()
pool.join()

# print the results
print(results)  # will print a list of scores that corresponds to the clusters list

ihebiheb · Accepted Answer · 2018-05-30 16:48:58Z

7

I think that using cv=ShuffleSplit(test_size=0.20, n_splits=1) is a better solution, as this post suggested. Note that this still holds out 20% of the data for scoring, so the model is fitted on the remaining 80% only.
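
To make the difference from a full-data split concrete, here is a small sketch of what a single ShuffleSplit produces. The 50-row placeholder array and the random_state value are assumptions for illustration only.

```python
import numpy as np
from sklearn.model_selection import ShuffleSplit

X = np.arange(100).reshape(50, 2)  # 50 placeholder rows

splitter = ShuffleSplit(n_splits=1, test_size=0.20, random_state=0)
train_idx, test_idx = next(splitter.split(X))

# Sizes of the fitting subset and the held-out scoring subset.
print(len(train_idx), len(test_idx))
# The two index sets do not overlap, unlike a full-data "split".
print(set(train_idx).isdisjoint(test_idx))
```

Since fitting and scoring happen on different rows, scores from this splitter would not be expected to match those from a split that reuses all rows for both.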

answered May 30, 2018 at 16:48

ihebiheb


1 Comment

squarebrackets Over a year ago

This yields a different result than cv = [(slice(None), slice(None))] (the resulting scores are different)

PavelLes · Accepted Answer · 2019-05-17 14:18:17Z

6

I recently came up with the following custom cross-validator, based on this answer. I passed it to GridSearchCV and it properly disabled the cross-validation for me:

import numpy as np

class DisabledCV:
    def __init__(self):
        self.n_splits = 1

    def split(self, X, y=None, groups=None):
        # y is None for unsupervised estimators, so size both index arrays from X
        indices = np.arange(len(X))
        yield indices, indices

    def get_n_splits(self, X=None, y=None, groups=None):
        return self.n_splits

I hope it can help.
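
A usage sketch, under the assumption that the splitter is passed as an instance: the 'numpy.ndarray' object has no attribute 'n_splits' error reported in the comment below is consistent with passing the class itself (cv=DisabledCV) rather than DisabledCV(). The random data and the grid are placeholders, and the class is repeated here in a form that tolerates y=None so the block runs on its own.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.model_selection import GridSearchCV

class DisabledCV:
    """Single split that uses every row for both fitting and scoring."""
    n_splits = 1

    def split(self, X, y=None, groups=None):
        indices = np.arange(len(X))
        yield indices, indices

    def get_n_splits(self, X=None, y=None, groups=None):
        return self.n_splits

rng = np.random.RandomState(0)
X = rng.rand(60, 2)  # placeholder data

# Pass an instance, DisabledCV(), not the class DisabledCV.
gs = GridSearchCV(KMeans(n_init=10, random_state=0),
                  param_grid={"n_clusters": [2, 3]},
                  cv=DisabledCV())
gs.fit(X)
print(gs.n_splits_)
```

No scoring argument is given here, so GridSearchCV falls back to the estimator's own score method; for a real search you would likely supply a clustering metric instead.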

edited May 17, 2019 at 14:18

PavelLes


answered Mar 24, 2019 at 17:20

MrD


1 Comment

user9562553 Over a year ago

I test your solution, I got this error: "return self.n_splits AttributeError: 'numpy.ndarray' object has no attribute 'n_splits' ". Do you know how to fix it?

Nermin · Accepted Answer · 2023-04-25 11:02:17Z

You can create your own GridSearch using ParameterGrid.

For example:

from sklearn.model_selection import ParameterGrid

param_grid = {'a': [1, 2], 'b': [True, False]}

param_candidates = ParameterGrid(param_grid)
print(f'{len(param_candidates)} candidates')
results = []
for i, params in enumerate(param_candidates):
    model = estimator.set_params(**params)
    model.fit(X_train, y_train)
    score = model.score(X_val, y_val)
    results.append([params, score])
    print(f'{i+1}/{len(param_candidates)}: ', params, score)

print(max(results, key=lambda x: x[1]))

To increase performance I would suggest parallelizing the loop:

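from joblib import Parallel, delayed
from sklearn.model_selection import ParameterGrid

# estimator, X_train, y_train, X_val and y_val are defined elsewhere
param_grid = {'a': [1, 2], 'b': [True, False]}
param_candidates = ParameterGrid(param_grid)
print(f'{len(param_candidates)} candidates')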

def fit_model(params):
    model = estimator.set_params(**params)
    model.fit(X_train, y_train)
    score = model.score(X_val, y_val)
    return [params, score]

results = Parallel(n_jobs=-1)(delayed(fit_model)(params) for params in param_candidates)
print(max(results, key=lambda x: x[1]))
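
For the unsupervised case in the question there is no validation target, so the same loop can score each candidate on the data it was fitted on. This is a minimal sketch assuming KMeans, a silhouette criterion and synthetic blob data; none of these choices come from the answer above.

```python
from sklearn.base import clone
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score
from sklearn.model_selection import ParameterGrid

# Synthetic data: three well-separated blobs (illustrative only).
X, _ = make_blobs(n_samples=150, centers=[[0, 0], [10, 10], [-10, 10]],
                  cluster_std=0.5, random_state=0)

base = KMeans(n_init=10, random_state=0)
results = []
for params in ParameterGrid({"n_clusters": [2, 3, 4, 5]}):
    # clone so every candidate starts from a fresh, unfitted estimator
    model = clone(base).set_params(**params)
    labels = model.fit_predict(X)
    results.append((params, silhouette_score(X, labels)))

best_params, best_score = max(results, key=lambda item: item[1])
print(best_params, round(best_score, 3))
```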
