
I want to make a clustering model for my dataset with Python and the scikit-learn library. The dataset contains continuous and categorical values. I have encoded the categorical values, but when I try to scale the features I get this error:

"Cannot center sparse matrices: pass `with_mean=False` "
ValueError: Cannot center sparse matrices: pass `with_mean=False` instead. See docstring for motivation and alternatives.

I'm getting that error in this line:

features = scaler.fit_transform(features)

What am I doing wrong?

This is my code:

features = df[['InvoiceNo', 'StockCode', 'Description', 'Quantity',
               'UnitPrice', 'CustomerID', 'Country', 'Total Price']]

columns_for_scaling = ['InvoiceNo', 'StockCode', 'Description', 'Quantity', 'UnitPrice', 'CustomerID', 'Country', 'Total Price']

transformerVectoriser = ColumnTransformer(transformers=[('Encoding Invoice number', OneHotEncoder(handle_unknown = "ignore"), ['InvoiceNo']),
                                                        ('Encoding StockCode', OneHotEncoder(handle_unknown = "ignore"), ['StockCode']),
                                                        ('Encoding Description', OneHotEncoder(handle_unknown = "ignore"), ['Description']),
                                                        ('Encoding Country', OneHotEncoder(handle_unknown = "ignore"), ['Country'])],
                                          remainder='passthrough') # Default is to drop untransformed columns

features = transformerVectoriser.fit_transform(features)
print(features.shape)

scaler = StandardScaler()
features = scaler.fit_transform(features)

sum_of_squared_distances = []
for k in range(1,16):
    kmeans = KMeans(n_clusters=k)
    kmeans = kmeans.fit(features)
    sum_of_squared_distances.append(features.inertia_)

Shape of my data before preprocessing: (401604, 8)
Shape of my data after preprocessing: (401604, 29800)

  • The error message prescribes an easy solution: set with_mean=False in the scaler. Commented Oct 21, 2021 at 17:32
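As the error message says, `StandardScaler(with_mean=False)` skips the centering step, so it accepts sparse input directly. A minimal sketch on a toy sparse matrix (standing in for the one-hot encoded features, not the asker's data):

```python
import numpy as np
from scipy.sparse import csr_matrix, issparse
from sklearn.preprocessing import StandardScaler

# Toy sparse matrix standing in for the one-hot encoded feature matrix
X = csr_matrix(np.array([[1.0, 0.0],
                         [0.0, 2.0],
                         [3.0, 0.0]]))

# with_mean=False divides by the per-column standard deviation only,
# so nothing needs to be densified and the centering error goes away
scaler = StandardScaler(with_mean=False)
X_scaled = scaler.fit_transform(X)

print(issparse(X_scaled))  # the result stays sparse
```

Note that the output is still a sparse matrix, which also sidesteps the memory blow-up of converting 29,800 columns to a dense array.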

1 Answer


If you set sparse=False when instantiating the OneHotEncoder (renamed to sparse_output in scikit-learn ≥ 1.2), the encoder returns a dense array instead of a sparse matrix, and StandardScaler() will work as expected.

import pandas as pd
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.cluster import KMeans

# define the feature matrix
features = pd.DataFrame({
    'InvoiceNo': np.random.randint(1, 100, 100),
    'StockCode': np.random.randint(100, 200, 100),
    'Description': np.random.choice(['a', 'b', 'c', 'd'], 100),
    'Quantity': np.random.randint(1, 1000, 100),
    'UnitPrice': np.random.randint(5, 10, 100),
    'CustomerID': np.random.choice(['1', '2', '3', '4'], 100),
    'Country': np.random.choice(['A', 'B', 'C', 'D'], 100),
    'Total Price': np.random.randint(100, 1000, 100),
})

# encode the features (set "sparse=False") 
transformerVectoriser = ColumnTransformer(
    transformers=[
        ('Encoding Invoice number', OneHotEncoder(sparse=False, handle_unknown='ignore'), ['InvoiceNo']),
        ('Encoding StockCode', OneHotEncoder(sparse=False, handle_unknown='ignore'), ['StockCode']),
        ('Encoding Description', OneHotEncoder(sparse=False, handle_unknown='ignore'), ['Description']),
        ('Encoding Country', OneHotEncoder(sparse=False, handle_unknown='ignore'), ['Country'])
    ],
    remainder='passthrough'
)

features = transformerVectoriser.fit_transform(features)

# scale the features
scaler = StandardScaler()
features = scaler.fit_transform(features)

# run the cluster analysis
sum_of_squared_distances = []
for k in range(1, 16):
    kmeans = KMeans(n_clusters=k)
    kmeans = kmeans.fit(features)
    sum_of_squared_distances.append(kmeans.inertia_)

Alternatively, you can use features = features.toarray() to convert the sparse matrix to an array.

import pandas as pd
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.cluster import KMeans

# define the feature matrix
features = pd.DataFrame({
    'InvoiceNo': np.random.randint(1, 100, 100),
    'StockCode': np.random.randint(100, 200, 100),
    'Description': np.random.choice(['a', 'b', 'c', 'd'], 100),
    'Quantity': np.random.randint(1, 1000, 100),
    'UnitPrice': np.random.randint(5, 10, 100),
    'CustomerID': np.random.choice(['1', '2', '3', '4'], 100),
    'Country': np.random.choice(['A', 'B', 'C', 'D'], 100),
    'Total Price': np.random.randint(100, 1000, 100),
})

# encode the features
transformerVectoriser = ColumnTransformer(
    transformers=[
        ('Encoding Invoice number', OneHotEncoder(handle_unknown='ignore'), ['InvoiceNo']),
        ('Encoding StockCode', OneHotEncoder(handle_unknown='ignore'), ['StockCode']),
        ('Encoding Description', OneHotEncoder(handle_unknown='ignore'), ['Description']),
        ('Encoding Country', OneHotEncoder(handle_unknown='ignore'), ['Country'])
    ],
    remainder='passthrough'
)

features = transformerVectoriser.fit_transform(features)
features = features.toarray() # convert sparse matrix to array

# scale the features
scaler = StandardScaler()
features = scaler.fit_transform(features)

# run the cluster analysis
sum_of_squared_distances = []
for k in range(1, 16):
    kmeans = KMeans(n_clusters=k)
    kmeans = kmeans.fit(features)
    sum_of_squared_distances.append(kmeans.inertia_)
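Both approaches densify the matrix, which at the shape reported in the question costs roughly 401604 × 29800 × 8 bytes ≈ 89 GiB. If that doesn't fit in memory, a sparse-friendly sketch is to keep the encoder output sparse and use MaxAbsScaler, which only divides by each column's maximum absolute value and so never needs to center (toy data with a subset of the question's columns; KMeans accepts CSR input):

```python
import numpy as np
import pandas as pd
from scipy.sparse import issparse
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, MaxAbsScaler
from sklearn.cluster import KMeans

# Toy frame with a subset of the question's columns
features = pd.DataFrame({
    'InvoiceNo': np.random.randint(1, 100, 100),
    'Country': np.random.choice(['A', 'B', 'C'], 100),
    'Quantity': np.random.randint(1, 1000, 100),
})

encoder = ColumnTransformer(
    transformers=[
        ('Encoding Invoice number', OneHotEncoder(handle_unknown='ignore'), ['InvoiceNo']),
        ('Encoding Country', OneHotEncoder(handle_unknown='ignore'), ['Country']),
    ],
    remainder='passthrough',
)

# Leave the one-hot output sparse; MaxAbsScaler accepts sparse input
# because scaling by the max absolute value preserves sparsity
X = encoder.fit_transform(features)
X = MaxAbsScaler().fit_transform(X)

kmeans = KMeans(n_clusters=3, n_init=10).fit(X)
print(kmeans.inertia_)
```

This never materializes the dense array, at the cost of a different scaling scheme than standardization.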

3 Comments

Thanks, but now I have another problem: ArrayMemoryError: Unable to allocate 89.2 GiB for an array with shape (401604, 29800) and data type float64. Shape of my data before preprocessing: (401604, 8); shape after preprocessing: (401604, 29800).
Also, even if I reduce the size of my data, I'm getting this: AttributeError: 'numpy.ndarray' object has no attribute 'inertia_'
Indeed, it's kmeans.inertia_, not features.inertia_.
