Hi, I am working through the Jupyter notebook for the California housing prices project.
How do I set up stratified sampling in the code? Specifically in section 2.3 of the notebook.
Hi Annapurna,
If you are writing Python code with the Scikit-Learn (sklearn) library, you can get a stratified split of the dataset using the StratifiedShuffleSplit class from sklearn.model_selection.
If your question is how to choose the column for stratified sampling (e.g. median_income is used as the basis for stratified sampling in the housing dataset problem), the answer is that you pick the column deemed to be the most important one in the dataset for predictions. This information is normally provided by domain experts.
Once you have settled on the column for stratified sampling (say median_income here), look at its histogram and create a limited number of strata from it (not too many, so that each stratum still has enough instances), stored in a new category column, say income_categories or income_cat. Then create a StratifiedShuffleSplit object and pass ‘income_cat’ as the labels to its split() method to get a stratified split of the dataset.
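The steps above could look roughly like the sketch below. It is only an illustration, not the notebook's exact code: the DataFrame here is synthetic stand-in data (the real notebook loads the housing CSV instead), and the bin edges and labels passed to pd.cut are assumed values that you would pick after looking at the histogram of median_income.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedShuffleSplit

# Synthetic stand-in for the housing data (assumption: the real notebook
# loads the California housing CSV into a DataFrame called `housing`).
rng = np.random.default_rng(42)
housing = pd.DataFrame({
    "median_income": rng.lognormal(mean=1.2, sigma=0.5, size=2000),
    "median_house_value": rng.uniform(50_000, 500_000, size=2000),
})

# Step 1: bucket the continuous column into a small number of strata.
# The bin edges are illustrative assumptions; choose them from the histogram.
housing["income_cat"] = pd.cut(
    housing["median_income"],
    bins=[0.0, 1.5, 3.0, 4.5, 6.0, np.inf],
    labels=[1, 2, 3, 4, 5],
)

# Step 2: split so that each stratum keeps its share in both sets.
split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
train_idx, test_idx = next(split.split(housing, housing["income_cat"]))
strat_train_set = housing.iloc[train_idx]
strat_test_set = housing.iloc[test_idx]

# Step 3: compare category proportions in the full data and the test set.
overall = housing["income_cat"].value_counts(normalize=True).sort_index()
test_props = strat_test_set["income_cat"].value_counts(normalize=True).sort_index()
print(pd.DataFrame({"overall": overall, "stratified_test": test_props}))
```

The two columns printed at the end should be nearly identical, which is the point of stratifying. After the split you would normally drop income_cat from both sets again, since it was only needed to build the strata.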