About stratified sampling

Why are we doing stratified sampling?

If I work with another dataset, will I always use stratified sampling?

The StratifiedShuffleSplit cross-validator provides train/test indices to split data into train/test sets. This cross-validation object is a merge of StratifiedKFold and ShuffleSplit, which returns stratified randomized folds. The folds are made by preserving the percentage of samples for each class, so every class is represented in each split in roughly its original proportion.
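As a rough illustration of that behaviour, here is a minimal sketch using scikit-learn's StratifiedShuffleSplit. The labels, the placeholder feature matrix and the 80/20 class ratio are made up for the example and are not from the original question.

```python
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit

# Hypothetical imbalanced labels: 80 samples of class 0, 20 of class 1.
y = np.array([0] * 80 + [1] * 20)
X = np.arange(len(y)).reshape(-1, 1)  # placeholder features, one column

# One randomized split that keeps the class percentages in both parts.
splitter = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=0)
train_idx, test_idx = next(splitter.split(X, y))

train_counts = np.bincount(y[train_idx])  # samples per class in the train set
test_counts = np.bincount(y[test_idx])    # samples per class in the test set
print("train class counts:", train_counts)
print("test class counts:", test_counts)
```

Both parts should keep the 80:20 class ratio of the full label vector, which a plain shuffled split does not guarantee.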

Greetings! Let me try to answer your question from a statistical perspective, as follows:

a) Stratified Sampling is one of the types of Random Sampling.
b) To carry out studies in Analytics/Data Science/AI & ML, it is essential that the sample selected be a Random Sample.
(NOTE I: With a Non-Random Sample, the results are likely to be biased, and inferences drawn from them unreliable.)
(NOTE II: There are four common types of Random Sampling:

  1. Simple Random Sampling
  2. Stratified Random Sampling
  3. Systematic Sampling
  4. Cluster or Area Sampling.)
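The four schemes listed above can be sketched with the standard library alone. The population, the "A"/"B" group labels, the sample size and the cluster size below are all hypothetical, chosen only so that the numbers divide evenly.

```python
import random

random.seed(42)

# Hypothetical population: 100 unit ids, the first 60 in group "A", the rest in "B".
population = [(i, "A" if i < 60 else "B") for i in range(100)]

def simple_random(pop, n):
    """Every unit has the same chance of selection."""
    return random.sample(pop, n)

def stratified(pop, n):
    """Sample each stratum in proportion to its size (proportional allocation).
    Rounding can make the total differ from n when sizes do not divide evenly."""
    strata = {}
    for unit in pop:
        strata.setdefault(unit[1], []).append(unit)
    sample = []
    for units in strata.values():
        n_h = round(n * len(units) / len(pop))
        sample.extend(random.sample(units, n_h))
    return sample

def systematic(pop, n):
    """Take every k-th unit after a random start."""
    k = len(pop) // n
    start = random.randrange(k)
    return pop[start::k][:n]

def cluster(pop, cluster_size, n_clusters):
    """Split into contiguous clusters, then keep whole clusters chosen at random."""
    clusters = [pop[i:i + cluster_size] for i in range(0, len(pop), cluster_size)]
    chosen = random.sample(clusters, n_clusters)
    return [unit for c in chosen for unit in c]

srs = simple_random(population, 10)
strat = stratified(population, 10)
syst = systematic(population, 10)
clus = cluster(population, cluster_size=5, n_clusters=2)
print(sum(1 for _, g in strat if g == "A"), "of", len(strat), "stratified units are from group A")
```

Only the stratified draw fixes the group mix in advance; the other three leave it to chance.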

c) Stratified Random Sampling is generally used when the data contain a discrete (categorical) variable that can define the strata. (In certain cases, Cluster Sampling is used to analyze datasets of discrete values.)

d) Appropriate Sampling Techniques are used for making inferences about Population Parameters, i.e. the Mean, Proportion or Variance.

e) If you use Stratified or Cluster Sampling on your dataset, the Random Samples are collected first, Sample Statistics are then calculated from them, and these Sample Statistics are then used to make inferences about the Population Parameters.

f) Especially with Stratified Random Samples, one of the important Population Parameters to study is the Proportion.

g) The main reason for choosing Stratified Random Sampling is to reduce Sampling Error.

NOTE: A Sample should be a TRUE REPRESENTATIVE of the ENTIRE POPULATION.

h) Technically speaking, in Stratified Random Sampling the population is divided into non-overlapping sub-populations called Strata, and a random sample is drawn from each stratum.

i) With Stratified Random Sampling, the sample is more likely to match the population than with Simple Random Sampling. This is because portions of the sample are taken from each population sub-group.

j) Strata are usually chosen based on existing information, obtained through earlier primary research, questionnaires, surveys or a census.

k) Stratification is usually done on demographics of the population, e.g. gender, age category, geographic location, socio-economic category, etc.

l) Within the Stratified Random Samples thus collected, one will observe that:

  1. Internally, each stratum is Homogeneous, i.e. its members have similar properties, functions, etc.
  2. Externally, the strata contrast with each other, i.e. they are Heterogeneous, with diverse/different properties, functions, etc.

m) In addition, one should opt for Stratified Random Sampling ONLY IF the Sample Size (n) is reasonably large. If the Sample Size (n) is small, the results will NOT be meaningful and will therefore be largely inaccurate.

To illustrate this, problems arise when doing either a Train-Test Split or a k-fold Split. If the sample size (n) is small, the split will not be accurate, and so, as stated earlier, the resulting Train and Test sets will not be truly representative of the Population.
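A quick way to see the small-sample problem is to compute how often a purely random test set contains no samples at all from a minority class. The sketch below uses the hypergeometric probability; the dataset sizes, the 10% minority share and the 20% test fraction are assumptions picked for illustration.

```python
from math import comb

def prob_test_misses_class(n_total, n_minority, n_test):
    """Probability that a random test set of size n_test, drawn without
    replacement from n_total samples, contains none of the n_minority samples."""
    return comb(n_total - n_minority, n_test) / comb(n_total, n_test)

# Same 10% minority share and 20% test fraction, small versus large dataset.
small = prob_test_misses_class(n_total=20, n_minority=2, n_test=4)
large = prob_test_misses_class(n_total=2000, n_minority=200, n_test=400)

print(f"n = 20:   P(test set has no minority sample) = {small:.3f}")
print(f"n = 2000: P(test set has no minority sample) = {large:.3g}")
```

With 20 samples the test set misses the minority class in well over half of the random splits, while with 2000 samples it practically never does.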

Illustration/Example:
To make Stratified Random Sampling easier to understand, a real-world example is given:

To understand Indian Parliamentary Elections and their outcomes, we try to analyze various strata such as Age Category, Gender, Socio-Economic Category, Geographical region preferences and so on.
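To make the election example concrete, here is a small simulation of estimating a party's vote share (a proportion) from a stratified sample versus a simple random sample. Everything in it is hypothetical: the two regions, their sizes, the 90% and 10% support rates and the sample size of 100 are invented for the sketch and are not real election data.

```python
import random
import statistics

random.seed(0)

# Hypothetical electorate: 1 = supports party X, 0 = does not.
electorate = {
    "north": [1] * 900 + [0] * 100,   # 1000 voters, 90% support
    "south": [1] * 100 + [0] * 900,   # 1000 voters, 10% support
}
everyone = [v for voters in electorate.values() for v in voters]
true_share = sum(everyone) / len(everyone)

def stratified_estimate(n):
    """Sample each region in proportion to its size, then combine the
    per-region proportions using the regions' population weights."""
    estimate = 0.0
    for voters in electorate.values():
        weight = len(voters) / len(everyone)
        n_h = round(n * weight)
        sample = random.sample(voters, n_h)
        estimate += weight * sum(sample) / n_h
    return estimate

def simple_random_estimate(n):
    """Ignore regions and sample from the whole electorate."""
    return sum(random.sample(everyone, n)) / n

# Repeat each survey many times to see how much the estimates vary.
strat = [stratified_estimate(100) for _ in range(2000)]
srs = [simple_random_estimate(100) for _ in range(2000)]

print("true share:", true_share)
print("stratified: mean", statistics.mean(strat), "stdev", statistics.stdev(strat))
print("simple:     mean", statistics.mean(srs), "stdev", statistics.stdev(srs))
```

Both estimators centre on the true share, but the stratified one varies less from survey to survey because each region is internally fairly uniform while the regions differ sharply from each other.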

Hope this explanation makes the selection of Sampling Techniques easier to understand.

Hope this helps!!!
