Feature Scaling

Sugandhita_Pandey · June 22, 2020, 9:20pm

End to End ML Project - Fashion MNIST - Data Preparation

Each image (instance) in the dataset has 784 pixels (features), and the value of each feature (pixel) ranges from 0 to 255. This range is too wide, so we need feature scaling here: we apply standardization to the dataset X_train so that the values of each feature (pixel) fall in a small range (based on the standard deviation).

x_scaled = (x - x_mean) / standard deviation
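
To make the formula concrete, here is a minimal sketch that standardizes each feature (column) using only the Python standard library. The function name standardize_columns and the toy data X_toy are made up for illustration; they are not from the course. On the real X_train one would more likely use a library scaler such as scikit-learn's StandardScaler.

```python
from statistics import mean, pstdev

def standardize_columns(rows):
    """Apply x_scaled = (x - x_mean) / std to every column (feature)."""
    columns = list(zip(*rows))
    means = [mean(col) for col in columns]
    stds = [pstdev(col) for col in columns]
    # A constant column (e.g. a border pixel that is always 0) has std 0;
    # divide by 1 instead so we do not divide by zero.
    stds = [s if s > 0 else 1.0 for s in stds]
    return [
        [(value - m) / s for value, m, s in zip(row, means, stds)]
        for row in rows
    ]

# Hypothetical toy data: 4 "images" with 3 pixels each, values in 0..255.
X_toy = [
    [0, 10, 255],
    [0, 20, 200],
    [0, 30, 100],
    [0, 40, 45],
]
X_scaled = standardize_columns(X_toy)
```

After this step every non-constant column has mean 0 and standard deviation 1, whatever its original range was.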

Scaling is not needed for the Decision Tree and Random Forest algorithms.

Please explain what we are trying to attain from feature scaling in the case of images.

satyajit_das · June 23, 2020, 5:37am

Hi, Sugandhita.

Very good question!

To answer this, you need to understand the decision tree algorithm: each node splits the data space into pieces based on the value of a feature, not on the distance between values.

The rule of thumb is that any algorithm that computes distances or assumes normality needs scaled features.
For example:

The following algorithms measure distances between values or are otherwise sensitive to the magnitude of the features:

k-nearest neighbors --> Calculates the Euclidean distance between instances.
PCA --> Looks for the directions of highest variance, so it skews towards the high-magnitude features.
Gradient descent --> “θ” descends quickly on small ranges and slowly on large ranges, so features on very different scales slow down convergence.
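
As a small illustration of the k-nearest neighbors case, the sketch below finds the nearest neighbor of a query point before and after standardization. The data (age in years, income in dollars) and the helper names nearest_index and standardize are hypothetical, chosen only so that the two features have very different ranges.

```python
import math
from statistics import mean, pstdev

# Hypothetical toy data: (age in years, income in dollars).
people = [(31, 52_000), (60, 50_500), (20, 20_000), (45, 100_000), (25, 80_000)]
query = (30, 50_000)

def nearest_index(points, target):
    """Index of the point with the smallest Euclidean distance to target."""
    return min(range(len(points)), key=lambda i: math.dist(points[i], target))

# Per-feature mean and standard deviation, computed on the dataset.
columns = list(zip(*people))
means = [mean(c) for c in columns]
stds = [pstdev(c) for c in columns]

def standardize(point):
    """Apply (x - x_mean) / std to each feature of one point."""
    return tuple((v - m) / s for v, m, s in zip(point, means, stds))

raw_nearest = nearest_index(people, query)
scaled_nearest = nearest_index([standardize(p) for p in people], standardize(query))
```

On the raw data the nearest neighbor is the 60-year-old at index 1, because a 500-dollar income gap counts for far less than a 2,000-dollar one and the 30-year age gap barely registers. After standardization the 31-year-old at index 0 becomes the nearest, since both features now contribute on a comparable scale.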

But tree-based models are not distance-based models. Each split depends on how the values of a feature compare with a threshold, not on the range of the feature, which is why feature scaling has little effect on these algorithms.
The same holds for “Naive Bayes”, “Linear Discriminant Analysis” and of course “Random Forest”.
For a related reason we say that “tree-based algorithms are robust to outliers”; you must have heard this statement in interviews.

So, your answer should be:

Because decision trees split the data by comparing feature values with a threshold, not based on their absolute magnitude.
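
To see this on a toy case, the sketch below implements a single-feature split search using Gini impurity and runs it on raw and standardized pixel values. It is a simplified stand-in for what a tree node does, not the code of any particular library; the names gini and best_split and the data are assumptions made for illustration.

```python
from statistics import mean, pstdev

def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def best_split(values, labels):
    """Return (threshold, left_indices) minimizing the weighted Gini impurity."""
    order = sorted(set(values))
    best = None
    # Candidate thresholds are midpoints between consecutive distinct values.
    for lo, hi in zip(order, order[1:]):
        threshold = (lo + hi) / 2
        left = [i for i, v in enumerate(values) if v <= threshold]
        right = [i for i, v in enumerate(values) if v > threshold]
        score = (len(left) * gini([labels[i] for i in left])
                 + len(right) * gini([labels[i] for i in right])) / len(values)
        if best is None or score < best[0]:
            best = (score, threshold, left)
    return best[1], best[2]

# Hypothetical toy data: one pixel's value across six images, with class labels.
pixel = [0, 12, 40, 130, 200, 255]
labels = [0, 0, 0, 1, 1, 1]

m, s = mean(pixel), pstdev(pixel)
pixel_scaled = [(v - m) / s for v in pixel]

thr_raw, left_raw = best_split(pixel, labels)
thr_scaled, left_scaled = best_split(pixel_scaled, labels)
```

The threshold itself changes (85 on the raw values, the standardized equivalent of 85 on the scaled ones), but the same images end up on each side of the split. Standardization preserves the ordering of the values, and the ordering is all the split search uses.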

It is beautifully explained in the tutorial here, along with the time and space complexity:

https://cloudxlab.com/assessment/displayslide/1401/session-18-decision-trees-july-14-2018