Nitty-Gritties regarding Performing a Train-Test Split on the Datasets

Dear CloudX Team,

Greetings! I have been going through some of the ML projects on Kaggle, GitHub, Medium and other sources. I have observed that many data scientists perform various procedures on the dataset (data cleansing, identifying and treating outliers, identifying and treating missing values, data visualization, data exploration, correlation analysis, feature engineering, etc.) before proceeding to a train-test split.

I do acknowledge that a train-test split is important, and I agree with your line of teaching that it should be performed immediately after importing the dataset. As I understand it, the purpose of this step is to avoid a bias-variance tradeoff, namely:
a) To avoid leakage of the test data
b) To prevent overfitting and underfitting of the data

All these steps are fine and agree with the direction and the goals that are set to be delivered.

However, frankly speaking, there still remains an element of doubt regarding how much exploratory analysis, data cleansing, data visualization and even feature engineering should be performed before a train-test split.

Could the CloudX team shed some more light on the do's and don'ts before performing a train-test split?

Good observation. Here are a few pointers:

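  1. A train-dev-test split is done so that we can train the model on the train set, tune it (hyperparameters, model selection) against the dev set, and finally evaluate it once on the test set. You are correct that this guards against overfitting: a model that scores well on the train set but poorly on held-out data has overfit, and the untouched test set gives an honest estimate of how it will perform on new data.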

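  2. Data leakage, or target leakage, is the use of information in the model training process which would not be expected to be available at prediction time, causing the predictive scores (metrics) to overestimate the model's utility when run in a production environment. Target leakage is more related to the features of the data than to the train-test split itself, although fitting preprocessing steps (imputation, scaling, encoding) on the full dataset before splitting also leaks test-set information into training.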

  3. The bias–variance tradeoff is the property of a set of predictive models whereby models with a lower bias in parameter estimation have a higher variance of the parameter estimates across samples, and vice versa. It is a property of the models themselves, not something that splitting the data into train and test sets avoids.

  4. I see that you have enrolled in the MLS course. EDA is a subject in itself; we did cover it partially, along with Feature Engineering, within the End-to-End project in Machine Learning. Having said that, EDA is more a part of Data Science than of Machine Learning and requires a course of its own. We will take this as feedback and consider starting a separate course on this subject.

  5. I would suggest you go through the End-to-End project once again to understand the do's and don'ts before splitting the data.
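
To make points 1 and 2 concrete, here is a minimal Python sketch (my own illustration, not taken from the course material). It uses only the standard library, and the helper names `train_dev_test_split`, `fit_scaler` and `apply_scaler`, the 60/20/20 proportions and the toy data are all assumptions chosen for the example. The idea it shows: split first, learn any preprocessing statistics from the train set only, then reuse those statistics unchanged on the dev and test sets.

```python
import random
import statistics


def train_dev_test_split(rows, dev_frac=0.2, test_frac=0.2, seed=42):
    """Shuffle a copy of `rows` and cut it into train, dev and test lists."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_frac)
    n_dev = int(len(shuffled) * dev_frac)
    test = shuffled[:n_test]
    dev = shuffled[n_test:n_test + n_dev]
    train = shuffled[n_test + n_dev:]
    return train, dev, test


def fit_scaler(train_values):
    """Learn the scaling statistics from the training values only."""
    mean = statistics.fmean(train_values)
    std = statistics.pstdev(train_values) or 1.0  # guard against zero spread
    return mean, std


def apply_scaler(values, mean, std):
    """Standardize values with statistics that were learned elsewhere."""
    return [(v - mean) / std for v in values]


# Toy single-feature dataset, made up for illustration.
data = [float(x) for x in range(100)]

# 1) Split before computing anything from the data.
train, dev, test = train_dev_test_split(data)

# 2) Fit the preprocessing on the train set only.
mean, std = fit_scaler(train)

# 3) Apply the same train-derived statistics to every split.
train_scaled = apply_scaler(train, mean, std)
dev_scaled = apply_scaler(dev, mean, std)
test_scaled = apply_scaler(test, mean, std)
```

Computing `mean` and `std` from the full `data` before splitting would let test-set values influence the training features, which is the kind of leakage described in point 2. In a real project the same pattern is usually written with scikit-learn's `train_test_split` and a `Pipeline` that is fitted on the training data only.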


Dear Rajthilak,

Greetings!!! Appreciate your detailed reply. I shall once again go through the contents of End-to-End-Machine Learning Project.

Nowadays, in SMEs (not speaking about bigger organizations), it has become a norm that an employee should work end-to-end on a project without any dependencies at work due to various reasons:
a) Reduced earnings or revenue from the outsourced projects.
b) Running cost margins are on the increase.
c) To utilize employees/manpower more effectively and derive maximum from them.
d) To cut down on employee head-count.
e) In SMEs, there are hardly any, if any, resources on the bench.
f) An employee is expected to have additional skills while on the job, should be capable of multi-tasking, and should be willing to go in for refresher courses in order to stay relevant and in tune with the job market.
g) With economic uncertainty looming large globally and unforeseen challenges ahead, the focus is clearly on leaner teams with minimal levels of hierarchy as part of cost-cutting measures, without much in the way of "frills", and mainly on the fulfilment of organizational goals and deliverables (products or services) as per the SLAs between two collaborating partners.

In brief, an employee needs to be ready for reskilling and relearning opportunities in order to remain relevant in terms of employability.
So from this, I guess, the same applies to the role of a DS/BA/ML/AI professional, as most of these terms are used interchangeably, if I'm not mistaken.

By End-to-End project I was referring to the lecture with the same name in the course.