
For a dataset with several numeric columns, the columns usually have different ranges and distributions. As an example, I have used the Iris dataset. The distributions of its four columns are shown below:

[Histograms of petal_length, petal_width, sepal_length, and sepal_width]

My question is:

Should columns with similar distributions use the same scaler? In this case, petal length and petal width have similar distributions, and sepal length and sepal width have (approximately) similar distributions. Therefore, I have used a Min-Max scaler for petal length and petal width, and a Standard scaler for sepal length and sepal width.

The sample code for these operations is:

# According to the distribution visualizations above, appropriate scalers are used-
from sklearn.preprocessing import MinMaxScaler, StandardScaler

std_scaler = StandardScaler()
iris_data[['sepallength', 'sepalwidth']] = std_scaler.fit_transform(iris_data[['sepallength', 'sepalwidth']])

# 'StandardScaler' subtracts the mean from each feature/attribute and then
# scales to unit variance

# Sanity checks-
iris_data['sepallength'].min(), iris_data['sepallength'].max()
# (-1.870024133847019, 2.4920192021244283)

iris_data['sepalwidth'].min(), iris_data['sepalwidth'].max()
# (-2.438987252491841, 3.1146839106774356)


mm_scaler = MinMaxScaler()
iris_data[['petallength', 'petalwidth']] = mm_scaler.fit_transform(iris_data[['petallength', 'petalwidth']])

# Sanity checks-
iris_data['petallength'].min(), iris_data['petallength'].max()
# (0.0, 1.0)

iris_data['petalwidth'].min(), iris_data['petalwidth'].max()
# (0.0, 1.0)
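As a sanity check on why the sepal columns end up with different ranges, here is a minimal sketch using sklearn's bundled Iris data (column names and the `iris_data` DataFrame from the post are not used): `StandardScaler` guarantees mean ≈ 0 and unit variance per column, not a fixed range, so each standardized column can have a different min and max.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler

X = load_iris().data  # columns: sepal length, sepal width, petal length, petal width

# Standardize only the two sepal columns, mirroring the post
scaled = StandardScaler().fit_transform(X[:, :2])

print(scaled.mean(axis=0).round(6))  # each column centred at ~0
print(scaled.std(axis=0).round(6))   # each column has unit variance
print(scaled.min(axis=0), scaled.max(axis=0))  # min/max differ per column
```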

Because of the standard scaler, the ranges for sepal length and sepal width differ, while petal length and petal width share the same [0, 1] range. Is this a problem? The columns end up on different ranges, which might affect an ML model trained on them.

Is there a golden set of rules for scaling/handling different numeric columns/attributes within a given dataset?

1 Answer


It depends on the algorithm you use for your task. Tree-based algorithms (RandomForest, XGBoost), for example, tend to be less affected by scale differences: they are largely scale-invariant, although scaling can still influence performance in some cases. On the other hand, SVMs and logistic regression require scaling to prevent features with large values and high variances from dominating the model.
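An illustrative sketch of the point above (my own setup, not from the question): a decision tree's splits depend only on the ordering of feature values, so min-max scaling leaves its predictions unchanged, while the fitted coefficients of logistic regression shift with the scale of the inputs.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.preprocessing import MinMaxScaler
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)
X_scaled = MinMaxScaler().fit_transform(X)  # monotonic per-feature rescaling

# Tree predictions are identical on raw vs. scaled data
tree_raw = DecisionTreeClassifier(random_state=0).fit(X, y).predict(X)
tree_scl = DecisionTreeClassifier(random_state=0).fit(X_scaled, y).predict(X_scaled)
print((tree_raw == tree_scl).all())  # scale-invariant

# Logistic regression coefficients change with the feature scale
lr_raw = LogisticRegression(max_iter=1000).fit(X, y)
lr_scl = LogisticRegression(max_iter=1000).fit(X_scaled, y)
print(np.allclose(lr_raw.coef_, lr_scl.coef_))  # coefficients depend on scale
```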

In general, I tend to use StandardScaler(), but sometimes model performance is better with MinMaxScaler(). This is a trial-and-error approach, I suppose. I am unaware of any consensus that one form of scaling is better than the other, but I am by no means an expert. Nonetheless, I would advise using one form of scaling for all features (either StandardScaler() or MinMaxScaler(), given that all your features are continuous) for comparability, and to counter your problem of columns ending up on different ranges.
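A minimal sketch of that advice, using sklearn's bundled Iris frame rather than the post's `iris_data` (so the column names here are sklearn's, an assumption on my part): fit a single scaler on all numeric columns at once, so every feature is transformed the same way.

```python
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler

iris = load_iris(as_frame=True)
df = iris.data  # all four measurement columns as a DataFrame

# One scaler for every feature, fit and applied in a single call
scaler = StandardScaler()
df[df.columns] = scaler.fit_transform(df[df.columns])

print(df.mean().round(6).tolist())       # every column centred at ~0
print(df.std(ddof=0).round(6).tolist())  # every column with unit variance
```

Swapping in `MinMaxScaler()` for `StandardScaler()` would instead put every column on the same [0, 1] range.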

I do think this question might have been better suited for CrossValidated than StackOverflow.
