Why do we split the data in the first step?

sgiri · June 16, 2020, 1:31pm

In the machine learning project workflow, we learnt to split the data into train and test sets first and only then do the cleaning.

Why don’t we clean the data while it is still in raw form, before splitting? Wouldn’t that be easier for us?

– Visaal KS

sgiri · June 16, 2020, 1:32pm

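This is basically to avoid data snooping bias. If we explore and clean the full dataset first, we might end up snooping at the test data, and whatever we notice there can influence our later choices. Therefore, before looking closely at the data, we set a test set aside.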
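As a rough illustration of this order, here is a minimal sketch using only the Python standard library. The toy `income` column, the `split_train_test` and `impute` helpers, and the choice of median imputation are all assumptions made up for this example, not part of the course material. The point it tries to show is that the cleaning parameter is learned from the training set alone and then reused on the test set.

```python
import random
import statistics

def split_train_test(rows, test_ratio=0.2, seed=42):
    """Shuffle a copy of rows and hold out the first test_ratio share as the test set."""
    rng = random.Random(seed)
    shuffled = rows[:]
    rng.shuffle(shuffled)
    test_size = int(len(shuffled) * test_ratio)
    return shuffled[test_size:], shuffled[:test_size]

def impute(data, fill_value):
    """Return a cleaned copy where missing incomes are replaced by fill_value."""
    return [{"income": fill_value if r["income"] is None else r["income"]} for r in data]

# Hypothetical toy data: one numeric feature with some missing values (None).
rows = [{"income": v} for v in [3.0, None, 5.0, 4.0, None, 8.0, 2.0, 6.0, 7.0, 1.0]]

# Step 1: set the test set aside before exploring or cleaning anything.
train, test = split_train_test(rows)

# Step 2: learn the cleaning parameter (here, a median) from the training set only.
train_median = statistics.median(r["income"] for r in train if r["income"] is not None)

# Step 3: apply the same training-derived value to both sets.
train_clean = impute(train, train_median)
test_clean = impute(test, train_median)
```

Cleaning still happens on both sets. The difference is only that nothing computed from the test rows feeds back into the cleaning decisions.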

Ankur_Sinha · June 16, 2020, 1:45pm

Yes. Our brains are very quick at spotting patterns in data, even from a single look, and a pattern noticed in the test data might make us prefer one algorithm over another.

ss7dec · July 4, 2020, 6:48am

Yes, this point makes a lot of sense as a way to avoid bias in the datasets.

However, in real-world practice, I have seen a lot of people working on ML projects in the following order:

a) Data Exploration, Data Cleansing & Data Preparation

b) Splitting the dataset into train and test data
(NOTE: I have come across people who perform this step ONLY AFTER finding the correlations between the variables in the dataset.)

c) Thereafter, suitable ML algorithms are applied.

I do understand that the sequence of steps I have described is NOT an appropriate technique…

But the reasons stated by our mentor, Sandeep Giri, make a lot of sense for avoiding snooping bias. Thanks for your clarifications.
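For the correlation step mentioned in the note under (b), a minimal sketch of doing it after the split might look like the following. The synthetic `(x, y)` pairs, the 10-row test size, and the hand-written `pearson` helper are assumptions for illustration only. The correlation is computed on the training rows, so the test rows stay unseen.

```python
import math
import random

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

rng = random.Random(0)

# Hypothetical synthetic data: y depends linearly on x, plus some noise.
data = [(x, 2.0 * x + rng.gauss(0, 1.0)) for x in range(50)]

# Split first: the first 10 shuffled rows are set aside as the test set.
rng.shuffle(data)
test, train = data[:10], data[10:]

# Explore correlations on the training rows only.
train_corr = pearson([x for x, _ in train], [y for _, y in train])
```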