
As the title states, I'm getting a memory error when I try to use kmeans.fit().

The data set I'm using has size:

print(np.size(np_list)): 1248680000
print(np_list.shape): (31217, 40000)

The code I'm running that gives me a memory error is:

import pickle
from sklearn.cluster import KMeans

with open('np_array.pickle', 'rb') as handle:
    np_list = pickle.load(handle)

kmeans = KMeans(n_clusters=5)
kmeans.fit(np_list)

centroids = kmeans.cluster_centers_
labels = kmeans.labels_

print(centroids)
print(labels)

I'm working with a data set of roughly 31k images, each of which is black and white and was originally 200x200. I turned each 200x200 image into a single dimension of 40k values in row-major order.
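For reference, the flattening step looks roughly like this (a minimal sketch using a small stand-in array; `images` is a hypothetical variable, not part of the original script):

import numpy as np

# 'images' stands in for the grayscale pixel data, shape (n_images, 200, 200)
images = np.random.randint(0, 256, size=(100, 200, 200), dtype=np.uint8)

# flatten each 200x200 image into a single 40,000-element row (row-major order)
np_list = images.reshape(images.shape[0], -1)
print(np_list.shape)  # (100, 40000)

# with the full data set, the float64 footprint is roughly
# 31217 * 40000 * 8 bytes ≈ 9.3 GiB, and scikit-learn's KMeans copies the array again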

Traceback:

Traceback (most recent call last):
  File "C:/Project/ML_Clustering.py", line 54, in <module>
    kmeans.fit(np_list)
  File "C:\Users\me\AppData\Local\Continuum\anaconda3\lib\site-packages\sklearn\cluster\k_means_.py", line 896, in fit
    return_n_iter=True)
  File "C:\Users\me\AppData\Local\Continuum\anaconda3\lib\site-packages\sklearn\cluster\k_means_.py", line 283, in k_means
    X = as_float_array(X, copy=copy_x)
  File "C:\Users\me\AppData\Local\Continuum\anaconda3\lib\site-packages\sklearn\utils\validation.py", line 88, in as_float_array
    return X.copy('F' if X.flags['F_CONTIGUOUS'] else 'C') if copy else X
MemoryError
  • Okay, so you are getting this error because you don't have enough memory on your laptop. Commented Jul 15, 2019 at 14:25
  • You can try MiniBatchKMeans to avoid this problem. Commented Jul 15, 2019 at 14:28
  • I am not totally sure. I added the traceback to the description. How would I be able to tell if it's because I'm running out of memory on my machine? Commented Jul 15, 2019 at 14:29
  • Would I use MiniBatchKMeans the exact same way as I would use KMeans? Commented Jul 15, 2019 at 14:30
  • Yeah, that's why I told you to use MiniBatchKMeans. Commented Jul 15, 2019 at 14:30

1 Answer


The classic implementation of the KMeans clustering method is based on Lloyd's algorithm, which consumes the whole set of input data at each iteration. You can try sklearn.cluster.MiniBatchKMeans instead, which does incremental updates of the centre positions using mini-batches. For large-scale learning (say n_samples > 10k), MiniBatchKMeans is probably much faster than the default batch implementation.

import pickle
from sklearn.cluster import MiniBatchKMeans

with open('np_array.pickle', 'rb') as handle:
    np_list = pickle.load(handle)

mbk = MiniBatchKMeans(init='k-means++', n_clusters=5,
                      batch_size=200,
                      max_no_improvement=10, verbose=0)

mbk.fit(np_list)

Read more about MiniBatchKMeans in the scikit-learn documentation.
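If the conversion of the full array to float is still what blows up, you can also feed MiniBatchKMeans the data in chunks yourself via partial_fit, so only one chunk at a time gets converted and copied (a rough sketch; chunk_size is an arbitrary choice, not a recommendation):

import pickle
import numpy as np
from sklearn.cluster import MiniBatchKMeans

with open('np_array.pickle', 'rb') as handle:
    np_list = pickle.load(handle)

mbk = MiniBatchKMeans(init='k-means++', n_clusters=5)

chunk_size = 1000  # arbitrary; small enough that each float copy stays cheap
for start in range(0, np_list.shape[0], chunk_size):
    # only this chunk is converted to float internally, keeping peak memory low
    mbk.partial_fit(np_list[start:start + chunk_size])

centroids = mbk.cluster_centers_
# predict in chunks as well, to avoid one large float copy of the whole array
labels = np.concatenate([mbk.predict(np_list[s:s + chunk_size])
                         for s in range(0, np_list.shape[0], chunk_size)])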


3 Comments

Okay, thank you, I will try it out. Also, do you have any good recommendations on how to visualize the data? I'm not sure how to go about that yet.
You can't in this state, I mean when the number of features is huge. If you want to visualize this data, you have to perform dimensionality reduction and reduce that many features to a small number. Then you can visualize one principal component with respect to another. I know this is a bit confusing. Check other options from here: quora.com/Whats-the-best-way-to-visualize-high-dimensional-data
Okay thanks, I'll definitely check out the link. So what would be the best way to 'see' all the clusters and what is in each cluster, then?
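A minimal sketch of the dimensionality-reduction approach mentioned in the comments above, assuming mbk has already been fitted on np_list as in the answer (PCA with 2 components is just one common choice, not the only option):

import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

# project the 40,000-dimensional image vectors down to 2 principal components
pca = PCA(n_components=2)
points_2d = pca.fit_transform(np_list)

# colour each point by the cluster MiniBatchKMeans assigned to it
plt.scatter(points_2d[:, 0], points_2d[:, 1], c=mbk.labels_, s=2, cmap='tab10')
plt.xlabel('PC 1')
plt.ylabel('PC 2')
plt.title('Clusters projected onto the first two principal components')
plt.show()

Note that PCA also has to hold the data as floats, so if that runs into the same memory limit, sklearn's IncrementalPCA can be fitted in chunks in the same spirit as partial_fit above.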
