Skewed data in machine learning


#1

Hi Sandeep,
I was going through the ML self study course on CloudXLab.

Can you help me understand how to handle skewed data in the below cases?

I have to perform regression on data with 9 features:

  1. One of the features is skewed, but the target variable is not skewed. Do we need to apply a (log) transformation to the skewed column only, or to all the features?
  2. One of the features is skewed and the target variable is also skewed. Do we need to apply a (log) transformation to the skewed column, or to the target variable?
  3. If skew is observed in one feature and I apply a log transformation to that feature, can I still apply standardization or normalization to the other features, given that a log transformation has been applied to one of them?

Regards,
Pradeep.


#2

Hi Pradeep,

Good questions.

One of the features is skewed, but the target variable is not skewed. Do we need to apply a (log) transformation to the skewed column only, or to all the features?

Sometimes applying min-max normalization or standardization suffices. You would use min-max normalization if the values are spread out roughly linearly; if there is a really huge variance, you would apply standardization. If, in some cases, the values seem to vary exponentially, you can apply a function like log to create a new feature and discard the old feature.

So, in your scenario, it is correct to apply the transformation only to the skewed feature and not to the target. Just keep in mind that at prediction time you have to apply the same transformation to that feature.
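
To make this concrete, here is a minimal sketch in plain Python. The feature values and the helper names (`log_transform`, `skewness`) are made up for illustration, and `log1p` (log of 1 + x) is used on the assumption that the feature is non-negative. It shows the skew dropping after the transformation, and the same function being reused on new data at prediction time.

```python
import math

def log_transform(values):
    """Apply log(1 + x) to each value; assumes every value is >= 0."""
    return [math.log1p(v) for v in values]

def skewness(values):
    """Population skewness: third central moment / (variance ** 1.5)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5

# Hypothetical right-skewed training feature.
train_feature = [1, 2, 2, 3, 4, 5, 8, 20, 150, 1000]
train_transformed = log_transform(train_feature)

print("skew before:", skewness(train_feature))
print("skew after: ", skewness(train_transformed))

# At prediction time, the incoming values go through the same function
# before they are passed to the model.
new_feature = [7, 300]
new_transformed = log_transform(new_feature)
print(new_transformed)
```

The target variable is left untouched here; only the skewed column is replaced by its transformed version.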

One of the features is skewed and the target variable is also skewed. Do we need to apply a (log) transformation to the skewed column, or to the target variable?

Generally, we don’t apply the transformation to the target variable. Instead, we design the loss function so that it takes care of the skewness. So, you can have the log function in the loss function.
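
One possible reading of "log in the loss function" is a loss such as mean squared logarithmic error (MSLE). That choice is an assumption made for illustration, not necessarily the only loss that fits. In the sketch below the targets and predictions stay on their original scale, and the log is taken inside the loss. The numbers are hypothetical and all values are assumed to be non-negative.

```python
import math

def mse(y_true, y_pred):
    """Plain mean squared error on the original scale."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def msle(y_true, y_pred):
    """Mean squared logarithmic error; assumes all values are >= 0."""
    return sum(
        (math.log1p(t) - math.log1p(p)) ** 2 for t, p in zip(y_true, y_pred)
    ) / len(y_true)

# Hypothetical skewed targets: three small values and one very large one.
y_true = [10, 12, 15, 1000]
y_pred = [11, 13, 14, 900]

# MSE is dominated by the single large target, because the error of 100
# is squared on the raw scale.
print("MSE: ", mse(y_true, y_pred))

# MSLE compares values on a log scale, so each prediction contributes
# according to its relative error rather than its absolute error.
print("MSLE:", msle(y_true, y_pred))
```

With a loss like this, the model is still trained against the untransformed target; the log only changes how errors are weighed.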

If skew is observed in one feature and I apply a log transformation to that feature, can I still apply standardization or normalization to the other features, given that a log transformation has been applied to one of them?

Yes. Just remember that at test or prediction time, you have to apply the same set of transformations.
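
As a rough sketch of what "the same set of transformations" can look like: the code below log-transforms one column and standardizes the others, learning the mean and standard deviation from the training rows only and reusing them on the test rows. The function names (`fit_scaler`, `preprocess`), the data, and the choice of column 2 as the skewed one are all hypothetical.

```python
import math
from statistics import mean, pstdev

def fit_scaler(train_rows, skewed_col):
    """Learn (mean, std) for every column except the skewed one."""
    stats = {}
    for j, col in enumerate(zip(*train_rows)):
        if j != skewed_col:
            stats[j] = (mean(col), pstdev(col))
    return stats

def preprocess(rows, skewed_col, stats):
    """Log-transform the skewed column; standardize the rest using stats.

    Assumes the skewed column is >= 0 and no column has zero variance.
    """
    result = []
    for row in rows:
        new_row = []
        for j, v in enumerate(row):
            if j == skewed_col:
                new_row.append(math.log1p(v))
            else:
                mu, sigma = stats[j]
                new_row.append((v - mu) / sigma)
        result.append(new_row)
    return result

# Hypothetical data: column 2 is the skewed feature.
train = [[1.0, 10.0, 5.0], [2.0, 20.0, 50.0], [3.0, 30.0, 5000.0]]
test = [[2.0, 40.0, 500.0]]

stats = fit_scaler(train, skewed_col=2)           # fitted on training data only
train_prepared = preprocess(train, 2, stats)
test_prepared = preprocess(test, 2, stats)        # same stats reused here
print(test_prepared)
```

The important part is that `stats` is computed once from the training set and never re-fitted on test or prediction data.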


#3

Thanks Sandeep for the clarification.

Can you please clarify what the below statement means:
“So, you can have the log function in the loss function.”
Does it mean applying the log function to a feature variable, i.e. that the ML model indirectly works on the transformed data?