Query on limit categories in median income - ML


Can you please let me know why we are dividing the median income by 1.5? In doing so, are we not excluding the median-income data above 6 (i.e., $60,000)? And how are we achieving stratified data this way? Can you please help?

Thanks and Regards,

Hi @sandeep_sathyamurthy,

In our project, we want to ensure that the test set is representative of the various categories of income in the whole dataset. Since the median income is a continuous numerical attribute, we first need to create an income-category attribute. Looking at the median income histogram more closely,
most median income values are clustered around 1.5 to 6 (i.e., $15,000–$60,000), but some median incomes go far beyond 6.

It is important to have a sufficient number of instances in our dataset for each stratum, or else the estimate of the stratum’s importance may be biased. This means that we should not have too
many strata, and each stratum should be large enough.

We create bins at 0, 1.5, 3.0, 4.5, 6, and infinity, so that districts with a median income above 6 fall into the top category rather than being excluded. After that, we do stratified sampling based on the income category using Scikit-Learn’s
StratifiedShuffleSplit class.

Dividing by 1.5 (and rounding up, as in the book’s earlier code) achieves the same thing: it converts the continuous incomes into categories of width 1.5 (i.e., $15,000), and any category above 5 is then capped at 5, so no data is dropped.
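A minimal sketch of those two steps (the synthetic `median_income` values here are a stand-in for the real housing.csv data; the column names follow the book’s example):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedShuffleSplit

# Toy stand-in for the housing DataFrame (synthetic incomes;
# the real values come from housing.csv).
rng = np.random.default_rng(42)
housing = pd.DataFrame({"median_income": rng.gamma(shape=4.0, scale=1.0, size=1000)})

# Bucket the continuous incomes into 5 categories; np.inf ensures that
# incomes above 6 ($60,000) land in the top category instead of being dropped.
housing["income_cat"] = pd.cut(housing["median_income"],
                               bins=[0.0, 1.5, 3.0, 4.5, 6.0, np.inf],
                               labels=[1, 2, 3, 4, 5])

# Stratified 80/20 split on the income category.
split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
for train_idx, test_idx in split.split(housing, housing["income_cat"]):
    strat_train_set = housing.loc[train_idx]
    strat_test_set = housing.loc[test_idx]

# Each stratum's share in the test set mirrors the full dataset.
print(strat_test_set["income_cat"].value_counts(normalize=True).sort_index())
```

Comparing these test-set proportions with `housing["income_cat"].value_counts(normalize=True)` shows they match within rounding, which is exactly what stratification guarantees.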

I hope it helps.



Hi Ankur, thanks for the detailed explanation!

Hi @sandeep_sathyamurthy,

Happy to help.


Hi Sandeep,

There is one more point that needs to be added. What my peer/friend Ankur_Sinha has replied regarding your query is the apt reply from the technical perspective.

Still, this can be elaborated further from the business case-study or domain perspective (i.e., the real-estate sector). Points regarding your query are listed below:

a) The dataset housing.csv is a case study for the California region in the US, wherein data has been gathered from various sources using market-research sampling techniques.
b) housing.csv is a sample study drawn from the entire population of the California region.
c) This sample dataset, which is segregated into train and test data, is “representative of the entire population” of California.
d) The attributes/variables viz. “longitude”, “latitude”, “total_rooms”, “median_income”, etc. are the x variables, also called independent (explanatory/predictor) variables, whereas
the attribute “median_house_value” is the y variable, also known as the dependent (target/response) variable.
e) This being sample data, the variable “median_income” is an important predictor in the given dataset, and it is the one used to stratify the split.

While taking the “median_income” variable into consideration, we should understand why the median (rather than the mean) is reported.

Before analyzing any dataset, we need to understand three basic groups of measures that need to be examined during the data-exploration process:

  1. Measures of Central Tendency (Mean, Median, Mode, Quartiles, Percentiles)
  2. Measures of Dispersion (Range, IQR, Standard Deviation, Variance)
  3. Measures of Shape (Skewness, Kurtosis, Box-Whisker Plots)
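A quick illustration of these three groups of measures, using Python’s standard `statistics` module on a small hypothetical income sample (the numbers are illustrative, not from housing.csv; the skewness formula is the usual adjusted sample skewness):

```python
import statistics as stats

# Hypothetical sample of median incomes (in tens of thousands of dollars).
incomes = [1.8, 2.3, 2.9, 3.1, 3.4, 3.8, 4.2, 4.9, 5.6, 9.7]

# 1. Measures of central tendency
mean = stats.mean(incomes)
median = stats.median(incomes)
quartiles = stats.quantiles(incomes, n=4)   # Q1, Q2, Q3

# 2. Measures of dispersion
data_range = max(incomes) - min(incomes)
iqr = quartiles[2] - quartiles[0]           # interquartile range
std_dev = stats.stdev(incomes)              # sample standard deviation
variance = stats.variance(incomes)

# 3. Measure of shape: adjusted sample skewness
n = len(incomes)
skewness = (n / ((n - 1) * (n - 2))) * sum(((x - mean) / std_dev) ** 3
                                           for x in incomes)

print(f"mean={mean:.2f}, median={median:.2f}, skewness={skewness:.2f}")
```

Note that the mean (4.17) sits above the median (3.60) and the skewness is positive: the single high income of 9.7 pulls the distribution to the right, just as high incomes do in the housing data.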

In common practice, the arithmetic mean (or simply the mean) is the most commonly used measure for describing data, both in descriptive statistics generally and as a measure of central tendency.

Drawbacks of the arithmetic mean:

  • It is affected by the presence of extreme values (outliers) in the sample.
  • In skewed distributions, the arithmetic mean is not a suitable measure.
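A tiny example of this sensitivity (hypothetical numbers): a single extreme value pulls the mean far away, while the median barely moves.

```python
import statistics as stats

# Incomes without and with one extreme value (hypothetical numbers).
typical = [2.5, 3.0, 3.5, 4.0, 4.5]
with_outlier = typical + [15.0]   # one very affluent district

print(stats.mean(typical), stats.median(typical))            # 3.5 3.5
print(stats.mean(with_outlier), stats.median(with_outlier))  # ~5.42 3.75
```

The mean jumps from 3.5 to about 5.42 because of one value, whereas the median only shifts from 3.5 to 3.75.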

In order to overcome the drawbacks of the mean, the median has been used in the housing.csv data: each district’s income is reported as “median_income”, i.e. the median income of that district, precisely because the median is robust to such extremes.

The advantages of selecting the median as a central-tendency measure for this dataset are as follows:

  • It is well defined and easy to interpret.
  • It is not affected by the presence of extreme values/outliers.

Hence the median is the apt measure for summarizing income in this dataset.
Furthermore, “median_income” is a continuous numerical attribute.

Now here is where the answer given by my peer-friend [Ankur_Sinha](https://discuss.cloudxlab.com/u/Ankur_Sinha) fits into the picture: the data is binned into segments, e.g. (1.5, 3], (3, 4.5], (4.5, 6] and so on. But again, the exact cut points depend on the business requirements and vary from case to case. Thereafter, categorical labels (e.g. High, Medium, Low) can be assigned to each such stratum to make the data and the resulting outcomes more meaningful. Incidentally, this also constitutes feature engineering, as we are deriving new variable(s) from the given data.
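As a sketch of this binning-plus-labeling step (the cut points follow the income categories discussed above; the band names are illustrative, not from the book):

```python
import numpy as np
import pandas as pd

# Hypothetical median incomes (in tens of thousands of dollars).
income = pd.Series([0.9, 2.1, 3.3, 5.2, 7.8, 11.0])

# Bin the continuous values and attach business-friendly labels;
# np.inf puts everything above 6 into the top band.
labels = ["Very Low", "Low", "Medium", "High", "Very High"]
income_band = pd.cut(income, bins=[0, 1.5, 3.0, 4.5, 6.0, np.inf], labels=labels)

print(income_band.tolist())
# ['Very Low', 'Low', 'Medium', 'High', 'Very High', 'Very High']
```

The derived `income_band` column is exactly the kind of new feature described above, ready to be used for stratification or as a categorical input.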

Based on the observations drawn from data exploration of housing.csv (analyzing the graphs, interpreting the geographical maps, domain analysis of the real-estate sector, etc.), districts with a median income beyond 6 (i.e., $60,000) are not given separate strata; they are all merged into the top income category. These data points/records/instances are “outliers”, largely restricted to the coastal areas of California where the affluent population resides, and they contribute less than 5% of the data for the “median_income” feature. Splitting them into further strata would leave each stratum too small to be representative.

@sandeep_sathyamurthy, I hope this enhances your understanding of the topic. A mix of techno-functional knowledge does help with better interpretation of the results.


Hi Sameer,

Thank you so much for providing detailed information on the fundamentals; it really helps a lot. You guys are doing an awesome job providing support on technical queries.



Hi Sandeep,

It’s our duty to reach out to peers and friends alike and help solve their doubts. Sharing our knowledge and initiating discussions on these topics helps a lot with mutual understanding.

Our mentors and gurus at CloudxLab are always there to support and correct us if our understanding of the technical concepts and statistical fundamentals is incorrect. We should welcome such suggestions as constructive criticism, with an open mind. Truly, it helps a lot!