Dataset related query

Queen_Saikia · August 14, 2020, 10:33pm

Suppose we are assigned to a project that works on a Career Guidance dataset containing various details of the students and their preferred career choices. Please give me a method for selecting the important variables.

Queen_Saikia · August 15, 2020, 7:28am

Please clear my doubt, sir.

ss7dec · August 17, 2020, 7:38am

Based on my understanding of statistical concepts and my related work experience, let me take this opportunity to share my views and thoughts:

Firstly, there are no hard and fast rules for selecting the important variables for your assigned project, “Career Guidance”. More importantly, we need to understand the scope of the project and the final outcome the client expects. It is of utmost importance to note the expectations and the final goal(s) of your project. Based on these, the Data Scientist or Business Analytics professional will choose the ML algorithms for the project and work toward the desired outcomes.

Selection of variables for the dataset or the proposed project depends on the Data Collection Process. In this process, the quality of the data is more important than the quantity. Data collection can be carried out for chiefly two purposes:
a) Population Study - covers the entire population (e.g. all the students studying at a particular school/college and its affiliated branches), or
b) Sampling Study - takes a random cross-section of students who fulfil certain criteria in a particular area/region.

Note that this is the difference between a Population Study and a Sampling Study. Both form part of the Market Research Process. Data collected via surveys, questionnaires and face-to-face interviews are examples of the Primary Research Process. (Note: the other type is Secondary Market Research, which is collected via the Internet, published articles, journals and relevant third-party sources. However, Secondary Market Research isn’t applicable to the Career Guidance project.)

Sampling Studies are generally preferred over Population Studies for the following reasons:
a) Resource constraint (i.e. availability of sufficient & qualified manpower to conduct such studies/surveys)
b) Time Constraint
c) Financial Constraint

In Sampling Studies, whatever data is gathered and collated via the Data Collection Process should fulfil the following criteria:
a) Should be taken from a cross-section of students (from a single college or multiple colleges depending on the scope of the proposed project).
b) Should be a "true representative" of the population (i.e. student population studying in 12th standard and so on)
c) Should be free from any kind of data bias
d) The quality of the data is more important than the quantity.
e) The survey questionnaire should be well balanced and should explore the student's aspirations in light of his or her capabilities, logical thinking, reasoning power, analytical abilities, craftsmanship etc.
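
One common way to keep a sample representative across colleges is stratified random sampling. The sketch below is only an illustration using the Python standard library: the `students` records, the `college` field and the 20% fraction are made-up assumptions, not part of any real Career Guidance dataset.

```python
import random
from collections import defaultdict

def stratified_sample(records, key, fraction, seed=0):
    """Draw roughly `fraction` of the records from every group defined by `key`."""
    rng = random.Random(seed)  # fixed seed so the draw is reproducible
    groups = defaultdict(list)
    for rec in records:
        groups[rec[key]].append(rec)
    sample = []
    for members in groups.values():
        k = max(1, round(len(members) * fraction))  # at least one per group
        sample.extend(rng.sample(members, k))
    return sample

# Hypothetical population: 100 students spread over three colleges.
students = (
    [{"id": i, "college": "A"} for i in range(60)]
    + [{"id": i, "college": "B"} for i in range(60, 90)]
    + [{"id": i, "college": "C"} for i in range(90, 100)]
)

sample = stratified_sample(students, key="college", fraction=0.2)

# Count how many students were drawn from each college.
counts = defaultdict(int)
for s in sample:
    counts[s["college"]] += 1
print(dict(counts))  # {'A': 12, 'B': 6, 'C': 2}
```

Each college contributes in proportion to its size, so a small college is not drowned out by a large one. Which variable to stratify on (college, stream, region) is a judgment call that depends on the scope of the project.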

Based on the Data Collection Process, the datasets or tables with relevant details are created and stored in .csv, .xlsx, .txt or similar formats.

Thereafter, during the Data Exploration phase and based on the scope of your project, the Data Scientist or Business Analytics professional decides which variables to retain and which to drop, but only after performing the series of Data Exploration steps listed below:

Data Visualization - e.g. Scatter Plots, Histograms, Box & Whisker Plots, Correlogram etc.
Understanding the Descriptive Statistics
Understanding the Correlation between the various variables within the given dataset(s), e.g. the Career Guidance dataset in this case.
Identification of Missing data & its treatment
Identification of Outliers & their treatment
Performing various statistical tests to validate the outcomes
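
A few of the steps above (descriptive statistics, missing data, outliers) can be sketched with the Python standard library alone. The column names `math_score` and `study_hours` and the values are hypothetical, and the 1.5 × IQR rule is just one common convention for flagging outliers, not the only option.

```python
import statistics

# Hypothetical student records; None marks a missing value.
rows = [
    {"math_score": 78, "study_hours": 12},
    {"math_score": 85, "study_hours": 15},
    {"math_score": None, "study_hours": 10},
    {"math_score": 62, "study_hours": 8},
    {"math_score": 90, "study_hours": 95},  # suspicious entry
    {"math_score": 71, "study_hours": 11},
    {"math_score": 80, "study_hours": None},
    {"math_score": 74, "study_hours": 9},
]

def column(rows, name):
    """Return the non-missing values of one column."""
    return [r[name] for r in rows if r[name] is not None]

def missing_count(rows, name):
    """Count the records where the column is missing."""
    return sum(1 for r in rows if r[name] is None)

def iqr_outliers(values):
    """Flag values lying more than 1.5 * IQR outside the quartiles."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    spread = q3 - q1
    low, high = q1 - 1.5 * spread, q3 + 1.5 * spread
    return [v for v in values if v < low or v > high]

for name in ("math_score", "study_hours"):
    values = column(rows, name)
    print(
        name,
        "mean:", round(statistics.mean(values), 2),
        "median:", statistics.median(values),
        "missing:", missing_count(rows, name),
        "outliers:", iqr_outliers(values),
    )
```

Here the `study_hours` value of 95 is flagged as an outlier. Whether to drop it, cap it or correct it is exactly the kind of decision that depends on the project's scope and on domain knowledge.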

Hence, as mentioned earlier, there is no hard and fast rule for selecting the important variables within a given dataset for your proposed project, i.e. the Career Guidance project.

The actual key to selecting the important variables (i.e. deciding whether to retain or drop each variable) lies in the outcomes and inferences of the Data Exploration phase, together with Data Interpretation capabilities. This is precisely where the decision-making process and analytical capabilities of a Data Scientist or Business Analytics professional come into the picture. Thus Data Interpretation is the crux of any proposed project study or project implementation.

Skill in Data Science or Business Analytics evolves only with an individual’s personal experience, latent thinking abilities and ability to think “out of the box”. This should be accompanied by relevant domain knowledge of the particular field for which solutions are sought, e.g. supply chain management, banking, insurance, healthcare, agriculture etc. These solutions may or may not be implemented, as deemed fit by the concerned Management based on their organization’s objectives. It also depends on other factors, viz.

Products & Services being offered by the organization
Customer outreach & Customer Segmentation
Company’s positioning strategy in a competitive market etc

In other professions, e.g. Medicine, Engineering, Logistics etc., there are established remedies or solutions for a particular problem; the field of Data Science is not like that. It depends on the data shared in the project and, more importantly, on how the concerned Data Scientist plans to tackle the project based on the goals/expectations shared by the client in a real-world scenario.

Whether the project is a success or a failure depends purely on the decisions made by the Data Scientist.

Hope my knowledge and experience shared helps!!!

satyajit_das · August 17, 2020, 8:52am

I would need to see the dataset! It must be numerical, I think.

You need to check how the target variable changes w.r.t. each feature or variable.

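Check the dependency using pair plots. If a feature shows a clear linear relationship with the target, whether positive or negative, keep it; a strongly negative relationship is just as informative as a strongly positive one. A feature that shows little or no relationship is a candidate for rejection.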
If you want to do feature reduction, you can use the correlation matrix to find which features are most and least correlated among themselves.
By observing these changes you can form an equation of linear, quadratic or higher order.
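
A rough sketch of this correlation-based screening in plain Python is shown below. The feature names, the values and both thresholds (0.3 and 0.9) are illustrative assumptions only; real cut-offs depend on the data and the project. Pearson correlation also captures only linear dependence, so the pair plots are still worth checking.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric lists.

    Assumes neither list is constant (otherwise the denominator is zero).
    """
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical features and a numeric target (e.g. an aptitude score).
features = {
    "study_hours": [2, 4, 6, 8, 10, 12],
    "library_hours": [1, 2, 3, 4, 5, 7],  # nearly duplicates study_hours
    "absences": [8, 3, 7, 2, 4, 1],       # negatively related to the target
    "shoe_size": [7, 9, 6, 9, 7, 8],      # unrelated to the target
}
target = [50, 58, 65, 70, 82, 88]

def select_features(features, target, min_target_corr=0.3, max_pair_corr=0.9):
    """Keep features related to the target and not redundant with each other."""
    scores = {name: pearson(vals, target) for name, vals in features.items()}
    # Strongest relationship first, using the absolute value so that
    # negative correlations count as much as positive ones.
    ranked = sorted(scores, key=lambda name: abs(scores[name]), reverse=True)
    kept = []
    for name in ranked:
        if abs(scores[name]) < min_target_corr:
            continue  # too weakly related to the target
        if any(abs(pearson(features[name], features[k])) > max_pair_corr
               for k in kept):
            continue  # redundant with a feature already kept
        kept.append(name)
    return kept, scores

kept, scores = select_features(features, target)
print(kept)  # ['study_hours', 'absences']
```

In this toy data `library_hours` is dropped because it carries almost the same information as `study_hours`, `shoe_size` is dropped because it barely moves with the target, and `absences` is kept even though its correlation is negative.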