
As the title states, I'm getting a memory error when I try to use kmeans.fit().

The data set I'm using has size:

print(np.size(np_list)): 1248680000
print(np_list.shape): (31217, 40000)

The code I'm running that gives me a memory error is:

import pickle
from sklearn.cluster import KMeans

with open('np_array.pickle', 'rb') as handle:
    np_list = pickle.load(handle)

kmeans = KMeans(n_clusters=5)
kmeans.fit(np_list)

centroids = kmeans.cluster_centers_
labels = kmeans.labels_

print(centroids)
print(labels)

I'm working with a data set of roughly 31k images, each of which is black and white and was originally 200x200. I turned each 200x200 image into a single dimension of 40k values in row-major order.
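For reference, the flattening step looks roughly like this (a minimal sketch using a small stand-in array; `images` is a hypothetical variable, not part of the original script):

import numpy as np

# 'images' stands in for the grayscale pixel data, shape (n_images, 200, 200)
images = np.random.randint(0, 256, size=(100, 200, 200), dtype=np.uint8)

# flatten each 200x200 image into a single 40,000-element row (row-major order)
np_list = images.reshape(images.shape[0], -1)
print(np_list.shape)  # (100, 40000)

# with the full data set, the float64 footprint is roughly
# 31217 * 40000 * 8 bytes ≈ 9.3 GiB, and scikit-learn's KMeans copies the array again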

Traceback:

Traceback (most recent call last):
  File "C:/Project/ML_Clustering.py", line 54, in <module>
    kmeans.fit(np_list)
  File "C:\Users\me\AppData\Local\Continuum\anaconda3\lib\site-packages\sklearn\cluster\k_means_.py", line 896, in fit
    return_n_iter=True)
  File "C:\Users\me\AppData\Local\Continuum\anaconda3\lib\site-packages\sklearn\cluster\k_means_.py", line 283, in k_means
    X = as_float_array(X, copy=copy_x)
  File "C:\Users\me\AppData\Local\Continuum\anaconda3\lib\site-packages\sklearn\utils\validation.py", line 88, in as_float_array
    return X.copy('F' if X.flags['F_CONTIGUOUS'] else 'C') if copy else X
MemoryError
  • Okay, so you are getting this error because you don't have enough memory on your laptop. Commented Jul 15, 2019 at 14:25
  • You can try MiniBatchKMeans to avoid this problem. Commented Jul 15, 2019 at 14:28
  • I am not totally sure. I added the traceback to the description. How would I be able to tell if it's because I'm running out of memory on my machine? Commented Jul 15, 2019 at 14:29
  • Would I use MiniBatchKMeans the exact same way as I would use KMeans? Commented Jul 15, 2019 at 14:30
  • Yeah, that's why I told you to use MiniBatchKMeans. Commented Jul 15, 2019 at 14:30

1 Answer


The classic implementation of the KMeans clustering method is based on Lloyd's algorithm, which consumes the whole set of input data at each iteration. You can try sklearn.cluster.MiniBatchKMeans instead, which does incremental updates of the centre positions using mini-batches. For large-scale learning (say n_samples > 10k), MiniBatchKMeans is probably much faster than the default batch implementation.

import pickle
from sklearn.cluster import MiniBatchKMeans

with open('np_array.pickle', 'rb') as handle:
    np_list = pickle.load(handle)

mbk = MiniBatchKMeans(init='k-means++', n_clusters=5,
                      batch_size=200,
                      max_no_improvement=10, verbose=0)

mbk.fit(np_list)

Read more about MiniBatchKMeans in the scikit-learn documentation.
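If the conversion of the full array to float is still what blows up, you can also feed MiniBatchKMeans the data in chunks yourself via partial_fit, so only one chunk at a time gets converted and copied (a rough sketch; chunk_size is an arbitrary choice, not a recommendation):

import pickle
import numpy as np
from sklearn.cluster import MiniBatchKMeans

with open('np_array.pickle', 'rb') as handle:
    np_list = pickle.load(handle)

mbk = MiniBatchKMeans(init='k-means++', n_clusters=5)

chunk_size = 1000  # arbitrary; small enough that each float copy stays cheap
for start in range(0, np_list.shape[0], chunk_size):
    # only this chunk is converted to float internally, keeping peak memory low
    mbk.partial_fit(np_list[start:start + chunk_size])

centroids = mbk.cluster_centers_
# predict in chunks as well, to avoid one large float copy of the whole array
labels = np.concatenate([mbk.predict(np_list[s:s + chunk_size])
                         for s in range(0, np_list.shape[0], chunk_size)])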


3 Comments

Okay, thank you, I will try it out. Also, do you have any good recommendations on how to visualize the data? I'm not sure how to go about that yet.
You can't in this state, I mean when the number of features is huge. If you want to visualize this data, you have to perform dimensionality reduction and reduce that many features to a small number. Then you can visualize one principal component with respect to another. I know this is a bit confusing. Check other options from here: quora.com/Whats-the-best-way-to-visualize-high-dimensional-data
Okay thanks, I'll definitely check out the link. So what would be the best way to 'see' all the clusters and what is in each cluster, then?
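A minimal sketch of the dimensionality-reduction approach mentioned in the comments above, assuming mbk has already been fitted on np_list as in the answer (PCA with 2 components is just one common choice, not the only option):

import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

# project the 40,000-dimensional image vectors down to 2 principal components
pca = PCA(n_components=2)
points_2d = pca.fit_transform(np_list)

# colour each point by the cluster MiniBatchKMeans assigned to it
plt.scatter(points_2d[:, 0], points_2d[:, 1], c=mbk.labels_, s=2, cmap='tab10')
plt.xlabel('PC 1')
plt.ylabel('PC 2')
plt.title('Clusters projected onto the first two principal components')
plt.show()

Note that PCA also has to hold the data as floats, so if that runs into the same memory limit, sklearn's IncrementalPCA can be fitted in chunks in the same spirit as partial_fit above.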
