
I want to make a clustering model for my dataset with Python and the scikit-learn library. The dataset contains continuous and categorical values. I have encoded the categorical values, but when I try to scale the features I get this error:

"Cannot center sparse matrices: pass `with_mean=False` "
ValueError: Cannot center sparse matrices: pass `with_mean=False` instead. See docstring for motivation and alternatives.

I'm getting that error in this line:

features = scaler.fit_transform(features)

What am I doing wrong?

This is my code:

features = df[['InvoiceNo', 'StockCode', 'Description', 'Quantity',
               'UnitPrice', 'CustomerID', 'Country', 'Total Price']]

columns_for_scaling = ['InvoiceNo', 'StockCode', 'Description', 'Quantity', 'UnitPrice', 'CustomerID', 'Country', 'Total Price']

transformerVectoriser = ColumnTransformer(transformers=[('Encoding Invoice number', OneHotEncoder(handle_unknown = "ignore"), ['InvoiceNo']),
                                                        ('Encoding StockCode', OneHotEncoder(handle_unknown = "ignore"), ['StockCode']),
                                                        ('Encoding Description', OneHotEncoder(handle_unknown = "ignore"), ['Description']),
                                                        ('Encoding Country', OneHotEncoder(handle_unknown = "ignore"), ['Country'])],
                                          remainder='passthrough') # Default is to drop untransformed columns

features = transformerVectoriser.fit_transform(features)
print(features.shape)

scaler = StandardScaler()
features = scaler.fit_transform(features)

sum_of_squared_distances = []
for k in range(1,16):
    kmeans = KMeans(n_clusters=k)
    kmeans = kmeans.fit(features)
    sum_of_squared_distances.append(features.inertia_)

Shape of my data before preprocessing: (401604, 8)
Shape of my data after preprocessing: (401604, 29800)

  • The error message prescribes an easy solution: set with_mean=False in the scaler. Commented Oct 21, 2021 at 17:32
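As the error message says, `StandardScaler(with_mean=False)` skips the centering step, so it accepts sparse input directly. A minimal sketch on a toy sparse matrix (standing in for the one-hot encoded features, not the asker's data):

```python
import numpy as np
from scipy.sparse import csr_matrix, issparse
from sklearn.preprocessing import StandardScaler

# Toy sparse matrix standing in for the one-hot encoded feature matrix
X = csr_matrix(np.array([[1.0, 0.0],
                         [0.0, 2.0],
                         [3.0, 0.0]]))

# with_mean=False divides by the per-column standard deviation only,
# so nothing needs to be densified and the centering error goes away
scaler = StandardScaler(with_mean=False)
X_scaled = scaler.fit_transform(X)

print(issparse(X_scaled))  # the result stays sparse
```

Note that the output is still a sparse matrix, which also sidesteps the memory blow-up of converting 29,800 columns to a dense array.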

1 Answer


If you set sparse=False when instantiating the OneHotEncoder (renamed to sparse_output in scikit-learn ≥ 1.2), the encoder returns a dense array instead of a sparse matrix, and StandardScaler() will work as expected.

import pandas as pd
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.cluster import KMeans

# define the feature matrix
features = pd.DataFrame({
    'InvoiceNo': np.random.randint(1, 100, 100),
    'StockCode': np.random.randint(100, 200, 100),
    'Description': np.random.choice(['a', 'b', 'c', 'd'], 100),
    'Quantity': np.random.randint(1, 1000, 100),
    'UnitPrice': np.random.randint(5, 10, 100),
    'CustomerID': np.random.choice(['1', '2', '3', '4'], 100),
    'Country': np.random.choice(['A', 'B', 'C', 'D'], 100),
    'Total Price': np.random.randint(100, 1000, 100),
})

# encode the features (set "sparse=False") 
transformerVectoriser = ColumnTransformer(
    transformers=[
        ('Encoding Invoice number', OneHotEncoder(sparse=False, handle_unknown='ignore'), ['InvoiceNo']),
        ('Encoding StockCode', OneHotEncoder(sparse=False, handle_unknown='ignore'), ['StockCode']),
        ('Encoding Description', OneHotEncoder(sparse=False, handle_unknown='ignore'), ['Description']),
        ('Encoding Country', OneHotEncoder(sparse=False, handle_unknown='ignore'), ['Country'])
    ],
    remainder='passthrough'
)

features = transformerVectoriser.fit_transform(features)

# scale the features
scaler = StandardScaler()
features = scaler.fit_transform(features)

# run the cluster analysis
sum_of_squared_distances = []
for k in range(1, 16):
    kmeans = KMeans(n_clusters=k)
    kmeans = kmeans.fit(features)
    sum_of_squared_distances.append(kmeans.inertia_)

Alternatively, you can use features = features.toarray() to convert the sparse matrix to an array.

import pandas as pd
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.cluster import KMeans

# define the feature matrix
features = pd.DataFrame({
    'InvoiceNo': np.random.randint(1, 100, 100),
    'StockCode': np.random.randint(100, 200, 100),
    'Description': np.random.choice(['a', 'b', 'c', 'd'], 100),
    'Quantity': np.random.randint(1, 1000, 100),
    'UnitPrice': np.random.randint(5, 10, 100),
    'CustomerID': np.random.choice(['1', '2', '3', '4'], 100),
    'Country': np.random.choice(['A', 'B', 'C', 'D'], 100),
    'Total Price': np.random.randint(100, 1000, 100),
})

# encode the features
transformerVectoriser = ColumnTransformer(
    transformers=[
        ('Encoding Invoice number', OneHotEncoder(handle_unknown='ignore'), ['InvoiceNo']),
        ('Encoding StockCode', OneHotEncoder(handle_unknown='ignore'), ['StockCode']),
        ('Encoding Description', OneHotEncoder(handle_unknown='ignore'), ['Description']),
        ('Encoding Country', OneHotEncoder(handle_unknown='ignore'), ['Country'])
    ],
    remainder='passthrough'
)

features = transformerVectoriser.fit_transform(features)
features = features.toarray() # convert sparse matrix to array

# scale the features
scaler = StandardScaler()
features = scaler.fit_transform(features)

# run the cluster analysis
sum_of_squared_distances = []
for k in range(1, 16):
    kmeans = KMeans(n_clusters=k)
    kmeans = kmeans.fit(features)
    sum_of_squared_distances.append(kmeans.inertia_)
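Both approaches densify the matrix, which at the shape reported in the question costs roughly 401604 × 29800 × 8 bytes ≈ 89 GiB. If that doesn't fit in memory, a sparse-friendly sketch is to keep the encoder output sparse and use MaxAbsScaler, which only divides by each column's maximum absolute value and so never needs to center (toy data with a subset of the question's columns; KMeans accepts CSR input):

```python
import numpy as np
import pandas as pd
from scipy.sparse import issparse
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, MaxAbsScaler
from sklearn.cluster import KMeans

# Toy frame with a subset of the question's columns
features = pd.DataFrame({
    'InvoiceNo': np.random.randint(1, 100, 100),
    'Country': np.random.choice(['A', 'B', 'C'], 100),
    'Quantity': np.random.randint(1, 1000, 100),
})

encoder = ColumnTransformer(
    transformers=[
        ('Encoding Invoice number', OneHotEncoder(handle_unknown='ignore'), ['InvoiceNo']),
        ('Encoding Country', OneHotEncoder(handle_unknown='ignore'), ['Country']),
    ],
    remainder='passthrough',
)

# Leave the one-hot output sparse; MaxAbsScaler accepts sparse input
# because scaling by the max absolute value preserves sparsity
X = encoder.fit_transform(features)
X = MaxAbsScaler().fit_transform(X)

kmeans = KMeans(n_clusters=3, n_init=10).fit(X)
print(kmeans.inertia_)
```

This never materializes the dense array, at the cost of a different scaling scheme than standardization.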

3 Comments

Thanks, but now I have another problem: ArrayMemoryError: Unable to allocate 89.2 GiB for an array with shape (401604, 29800) and data type float64. Shape of my data before preprocessing: (401604, 8); shape after preprocessing: (401604, 29800).
Also, even if I reduce the size of my data, I'm getting this: AttributeError: 'numpy.ndarray' object has no attribute 'inertia_'
Indeed, it's kmeans.inertia_, not features.inertia_.
