Hi Sandeep,
There is one more point that needs to be added. What my peer Ankur_Sinha has replied to your query is the correct and apt answer from the technical perspective.
Still, this can be elaborated further from the business case-study or domain (i.e. real estate sector) perspective. The points relevant to your query are listed below:
a) The housing.csv dataset is a case study for the California region in the US, wherein data has been gathered from various sources using market research sampling techniques.
b) housing.csv is a sample drawn from the entire population of the California region.
c) This sample dataset, which is divided into train and test data, is “representative of the entire population” in California.
d) The attributes/variables viz. “longitude, latitude, total_rooms” etc. are the x variables, also called independent (predictor) variables, whereas
the attribute/variable “median_income” is the y variable, also known as the dependent/target (response) variable.
e) This being sample data, the dependent variable “median_income” is an important factor in the given dataset.
While taking the “median_income” variable/attribute into consideration, we should understand why this is so.
Before analyzing any dataset, we need to understand three basic groups of measures that are computed during the data exploration process, i.e.:
- Measures of Central Tendency (Mean, Median, Mode, Quartiles, Percentiles)
- Measures of Dispersion (Range, IQR, Standard Deviation, Variance)
- Measures of Shape (Skewness, Kurtosis, Box-Whisker Plots)
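As a rough illustration of these three groups of measures, here is a minimal sketch that uses only Python's standard `statistics` module. The values in `incomes` are made up for the example (they are not taken from housing.csv), and the skewness shown is the simple moment-based formula, which may differ slightly from what a particular statistics package reports.

```python
import statistics

# Hypothetical median_income-style values (tens of thousands of USD); not real data.
incomes = [1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 15.0]

# Measures of central tendency
mean = statistics.mean(incomes)      # 4.5
median = statistics.median(incomes)  # 3.0
q1, q2, q3 = statistics.quantiles(incomes, n=4)  # quartiles (default "exclusive" method)

# Measures of dispersion
value_range = max(incomes) - min(incomes)
iqr = q3 - q1
std_dev = statistics.pstdev(incomes)      # population standard deviation
variance = statistics.pvariance(incomes)  # population variance

# Measure of shape: moment-based skewness (positive means a long right tail)
n = len(incomes)
skewness = sum((x - mean) ** 3 for x in incomes) / (n * std_dev ** 3)

print("mean:", mean, "median:", median)
print("range:", value_range, "IQR:", iqr)
print("std dev:", round(std_dev, 3), "variance:", round(variance, 3))
print("skewness:", round(skewness, 3))
```

A positive skewness together with a mean well above the median is the typical signature of a few very high values pulling the distribution to the right.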
In common practice, the arithmetic mean (or simply the mean) is the most commonly used measure for describing data (i.e. as a descriptive statistic and as a measure of central tendency).
Drawbacks of the arithmetic mean:
- It is affected by the presence of extreme values (outliers) in the sample.
- In skewed distributions, the arithmetic mean is not a suitable measure.
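To make the first drawback concrete, the sketch below adds a single extreme value to a small set of invented numbers and compares how far the mean and the median move. The values are hypothetical and chosen only to keep the arithmetic easy to follow.

```python
import statistics

# Hypothetical income values (tens of thousands of USD), invented for illustration.
typical = [2.0, 2.5, 3.0, 3.5, 4.0]
with_outlier = typical + [15.0]  # one very affluent observation added

mean_before = statistics.mean(typical)           # 3.0
median_before = statistics.median(typical)       # 3.0
mean_after = statistics.mean(with_outlier)       # 5.0
median_after = statistics.median(with_outlier)   # 3.25

print(f"mean:   {mean_before} -> {mean_after}")
print(f"median: {median_before} -> {median_after}")
```

One outlier shifts the mean by 2.0 but the median by only 0.25.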
To overcome these drawbacks of the mean, the median has been used as the measure in the housing.csv data. This appears as “median_income” in the housing.csv dataset, which is the y variable, also called the dependent variable.
The advantages of selecting the median as the measure of central tendency for this dataset are as follows:
- It is well defined: once all the observations are ordered, it is simply the middle value.
- It is not affected by the presence of extreme values/outliers.
Hence “median_income” is the apt choice for y, the target/dependent variable.
Furthermore, “median_income” is a continuous variable, so its values form a continuous frequency distribution.
Now here is where the answer given by my peer [Ankur_Sinha](https://discuss.cloudxlab.com/u/Ankur_Sinha) fits into the picture: the data is binned, i.e. grouped into class intervals such as >=1.5 to <3, >=3 to <6 and so on. But again, this depends on the business requirements and varies from case to case. Thereafter, categorical labels (e.g. High, Medium, Low etc.) can be assigned to each such stratum to make the data and the resultant outcomes more meaningful. Incidentally, this also constitutes feature engineering, as we are deriving new variable(s) from the given data.
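The binning described above can be sketched in plain Python as follows. The bin edges, the label names and the helper `income_category` are assumptions made for this illustration; in practice the edges come from the business requirement, and this is not necessarily the exact scheme used in the course material.

```python
from bisect import bisect_right
from collections import Counter

# Assumed bin edges and labels, for illustration only.
# Resulting bins: <1.5, [1.5, 3), [3, 6), >=6
EDGES = [1.5, 3.0, 6.0]
LABELS = ["Very Low", "Low", "Medium", "High"]

def income_category(median_income: float) -> str:
    """Return the label of the bin that contains the given value."""
    return LABELS[bisect_right(EDGES, median_income)]

# Hypothetical median_income values (tens of thousands of USD).
sample = [0.9, 1.5, 2.8, 3.0, 5.9, 6.0, 8.3]
categories = [income_category(x) for x in sample]
counts = Counter(categories)

for value, label in zip(sample, categories):
    print(value, "->", label)
print(dict(counts))
```

The new categorical column is the derived feature; the per-category counts show how many observations fall in each stratum. With pandas, `pd.cut` performs the same kind of binning on a whole column.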
Based on the observations drawn from data exploration of housing.csv (data analysis, interpretation of the graphs and geographical maps, and domain analysis of the real-estate sector), categories beyond 6 (i.e. $60,000) are not taken into consideration. These data points (records/rows/instances/observations) are “outliers” restricted to the coastal areas of California where the affluent section of society resides, and they contribute <5% of the data for the “median_income” variable/feature in the housing.csv dataset for the state of California (CA), USA.
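A quick way to check what share of the observations lies above such a cutoff is sketched below. The list of values is invented for the example, so the resulting percentage says nothing about the real housing.csv file; only the method carries over.

```python
# Hypothetical median_income values (tens of thousands of USD); not real data.
incomes = [
    1.2, 1.8, 2.1, 2.4, 2.6, 2.9, 3.0, 3.1, 3.3, 3.5,
    3.6, 3.8, 4.0, 4.1, 4.3, 4.5, 4.7, 4.9, 5.0, 5.2,
    5.4, 5.6, 5.8, 6.0, 9.5,
]

CUTOFF = 6.0  # i.e. $60,000

# Observations strictly above the cutoff are the candidates for "outliers".
above_cutoff = [x for x in incomes if x > CUTOFF]
share = len(above_cutoff) / len(incomes)

print(f"{len(above_cutoff)} of {len(incomes)} values exceed {CUTOFF} ({share:.0%})")
```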
sandeep_sathyamurthy, I hope this enhances your understanding of the topic. A mix of techno-functional knowledge does help in interpreting the results better.