The numeric columns of a dataset usually have different ranges and distributions. As an example, I have used the Iris dataset. The distributions of its 4 columns are shown:
My question is:
Should columns having similar distributions use the same scaler? In this case, petal length & petal width have similar distributions. Also, sepal length & sepal width have (approximately) similar distributions. Therefore, I have used the Min-Max scaler for the columns petal length & petal width, and the Standard scaler for sepal length & sepal width.
The sample code for these operations is:
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# According to the distribution visualizations above, appropriate scalers are used-
std_scaler = StandardScaler()
iris_data[['sepallength', 'sepalwidth']] = std_scaler.fit_transform(iris_data[['sepallength', 'sepalwidth']])
# 'StandardScaler' subtracts the mean from each feature/attribute and then
# scales to unit variance
# Sanity checks-
iris_data['sepallength'].min(), iris_data['sepallength'].max()
# (-1.870024133847019, 2.4920192021244283)
iris_data['sepalwidth'].min(), iris_data['sepalwidth'].max()
# (-2.438987252491841, 3.1146839106774356)
mm_scaler = MinMaxScaler()
iris_data[['petallength', 'petalwidth']] = mm_scaler.fit_transform(iris_data[['petallength', 'petalwidth']])
# Sanity checks-
iris_data['petallength'].min(), iris_data['petallength'].max()
# (0.0, 1.0)
iris_data['petalwidth'].min(), iris_data['petalwidth'].max()
# (0.0, 1.0)
Because of the standard scaler, sepal length and sepal width end up with different ranges, while petal length and petal width share the same range of [0, 1]. Is this a problem, since columns on different ranges might affect the ML model trained on them?
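To make the difference concrete, here is a minimal sketch that prints summary statistics for both groups of columns after scaling. It assumes the copy of Iris bundled with scikit-learn (`load_iris`), which returns a plain array rather than my `iris_data` DataFrame, so the columns are selected by position instead of by name. The helper `summarize` is only illustrative.

```python
from sklearn.datasets import load_iris
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Assumption: scikit-learn's bundled Iris data, shape (150, 4), with columns
# ordered as sepal length, sepal width, petal length, petal width.
X = load_iris().data

# Same split as in the question: standardize the sepal columns,
# min-max scale the petal columns.
sepal_scaled = StandardScaler().fit_transform(X[:, :2])
petal_scaled = MinMaxScaler().fit_transform(X[:, 2:])


def summarize(name, column):
    """Print min, max, mean and standard deviation of one scaled column."""
    print(
        f"{name:>13}: min={column.min():.3f} max={column.max():.3f} "
        f"mean={column.mean():.3f} std={column.std():.3f}"
    )


summarize("sepal length", sepal_scaled[:, 0])
summarize("sepal width", sepal_scaled[:, 1])
summarize("petal length", petal_scaled[:, 0])
summarize("petal width", petal_scaled[:, 1])
```

The standardized columns have mean 0 and standard deviation 1 but no fixed minimum or maximum, whereas the min-max scaled columns are bounded to [0, 1] and therefore have a standard deviation of at most 0.5. So the two groups differ in spread as well as in range.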
Is there a golden set of rules for scaling/handling different numeric columns/attributes within a given dataset?



