Dear CloudX Team,
Greetings! I have been going through ML projects on Kaggle, GitHub, Medium and other sources. I have observed that many data scientists carry out a number of procedures on the full dataset before moving on to the train-test split, e.g. data cleansing, outlier identification and treatment, missing-value identification and treatment, data visualization, data exploration, correlation analysis and feature engineering.
I acknowledge that the train-test split is important, and I agree with your line of teaching that it should be performed immediately after importing the dataset. As I understand it, the purpose of splitting this early is to get an honest view of the bias-variance tradeoff, namely to:
a) avoid leakage of the test data into training
b) be able to detect overfitting and underfitting of the model
I have no issue with any of this; it is consistent with the direction and the goals you have set out.
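To check my own understanding, here is a minimal sketch of what I take "split first" to mean: split the rows, learn the imputation and scaling statistics from the training rows only, then apply those same statistics to the test rows. It uses only the Python standard library. The `ages` column is made up, and the helper names (`train_test_split_rows`, `fit_imputer_and_scaler`, `transform`) are my own, not from any library; in practice one would presumably use a library splitter such as scikit-learn's `train_test_split`.

```python
import random
import statistics


def train_test_split_rows(rows, test_fraction=0.25, seed=42):
    """Shuffle row indices and split them (hand-rolled stand-in for a library splitter)."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)
    n_test = int(round(len(rows) * test_fraction))
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    return [rows[i] for i in train_idx], [rows[i] for i in test_idx]


def fit_imputer_and_scaler(train_values):
    """Learn the fill value, mean and standard deviation from TRAINING values only."""
    observed = [v for v in train_values if v is not None]
    fill = statistics.median(observed)
    filled = [fill if v is None else v for v in train_values]
    mean = statistics.fmean(filled)
    std = statistics.pstdev(filled) or 1.0  # avoid dividing by zero
    return {"fill": fill, "mean": mean, "std": std}


def transform(values, params):
    """Apply already-learned statistics; nothing is re-estimated here."""
    filled = [params["fill"] if v is None else v for v in values]
    return [(v - params["mean"]) / params["std"] for v in filled]


# Hypothetical toy feature column; None marks a missing value.
ages = [22, 35, None, 41, 29, 58, None, 33, 47, 26, 39, 51]

# 1) Split immediately after "importing" the data.
train_ages, test_ages = train_test_split_rows(ages)

# 2) Fit the preprocessing on the training part only.
params = fit_imputer_and_scaler(train_ages)

# 3) Apply the same fitted statistics to both parts.
train_scaled = transform(train_ages, params)
test_scaled = transform(test_ages, params)
```

If this reading is right, the test rows never influence the median, mean or standard deviation, which is the leakage I understand the early split is meant to avoid.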
However, frankly speaking, I am still unsure how much exploratory analysis, data cleansing, data visualization and even feature engineering may be performed before the train-test split.
Could the CloudX Team shed some more light on the dos and don'ts before performing a train-test split?