Statistical Concepts You Should Learn To Understand Supervised Learning
Supervised machine learning is an interdisciplinary field that draws on statistics, probability, and algorithms to learn from data, and the resulting models are used to build intelligent applications. Just as with probability, a working knowledge of statistical concepts is invaluable on a machine learning project.
It is fair to say that statistical concepts are needed to work effectively through every stage of a supervised learning project. Here is a list of where they appear:
To frame a problem, you select its type and classify its structure. Once you have classified the kinds of inputs and outputs involved, statistical methods such as exploratory data analysis (EDA) and data mining can be used to extract information. EDA uses summarization and visualization to explore provisional views of the data, while data mining helps automatically discover structured relationships and patterns in it.
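As a minimal sketch of such a first exploratory pass, the snippet below uses pandas; the file name "data.csv" and the column name "target" are hypothetical placeholders, not names from this article.

```python
# A quick exploratory pass with pandas; "data.csv" and the "target"
# column are hypothetical placeholders for your own dataset.
import pandas as pd

df = pd.read_csv("data.csv")

# Classify inputs and outputs: dtypes show which variables are numeric
# and which are categorical.
print(df.dtypes)

# Provisional views of the data: summary statistics for numeric columns
# and frequency counts for the assumed target column.
print(df.describe())
print(df["target"].value_counts())
```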
To understand the data you need a good grasp of both the distributions of individual variables and the relationships between them. Summary statistics condense these into a few statistical quantities, while data visualization presents them as charts, plots, or graphs.
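For instance, a small sketch of summarising a distribution and a relationship with pandas and matplotlib might look like the following; the column names "feature_a" and "feature_b" are hypothetical.

```python
# Summaries and plots of distributions and relationships; "feature_a"
# and "feature_b" are hypothetical column names.
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("data.csv")

# Distribution of a single variable as a histogram.
df["feature_a"].hist(bins=30)
plt.title("Distribution of feature_a")
plt.show()

# Relationship between two variables: a scatter plot plus the Pearson
# correlation as a single summary quantity.
df.plot.scatter(x="feature_a", y="feature_b")
plt.show()
print(df[["feature_a", "feature_b"]].corr())
```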
Observations from a domain are rarely pristine. Even when collection is automated and controlled, the data may have passed through processes that damage its fidelity, resulting in data errors or data loss. Data cleaning is the work of identifying and repairing such problems, and two statistical tools carry most of the load: outlier detection, which identifies observations that lie far from the expected value of a distribution, and imputation, which fills in missing or corrupted values.
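As an illustration, here is a small sketch of both steps: a simple three-standard-deviation rule for outlier detection and mean imputation with scikit-learn. The synthetic data, the 3-sigma threshold, and the mean strategy are illustrative choices, not prescriptions from the article.

```python
# Outlier detection (3-sigma rule) and mean imputation on synthetic data.
import numpy as np
from sklearn.impute import SimpleImputer

rng = np.random.default_rng(0)
x = rng.normal(loc=1.0, scale=0.2, size=50)
x[10] = 15.0    # inject an extreme value (data error)
x[20] = np.nan  # inject a missing value (data loss)

# Outlier detection: flag observations far from the expected value,
# here more than 3 standard deviations from the mean.
mean, std = np.nanmean(x), np.nanstd(x)
print("outlier indices:", np.where(np.abs(x - mean) > 3 * std)[0])

# Imputation: fill the missing value with the column mean.
x_imputed = SimpleImputer(strategy="mean").fit_transform(x.reshape(-1, 1))
print("still missing:", np.isnan(x_imputed).any())
```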
Not all observations and variables are relevant when modelling. Data sampling systematically draws smaller, representative subsets from a larger data set for building and evaluating models, and feature selection identifies the variables most predictive of the expected outcome.
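A brief sketch with scikit-learn follows, using a synthetic dataset, a simple train/test split for sampling, and a univariate F-test for feature selection; all of these are assumed, illustrative choices.

```python
# Data sampling (train/test split) and feature selection on synthetic data.
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=20, n_informative=5,
                           random_state=0)

# Sampling: hold out a smaller, representative subset for evaluation.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

# Feature selection: keep the 5 features most associated with the outcome
# according to a univariate ANOVA F-test.
selector = SelectKBest(score_func=f_classif, k=5)
X_train_selected = selector.fit_transform(X_train, y_train)
print("selected feature indices:", selector.get_support(indices=True))
```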
Quite often you need to change the shape or distribution of the data to make it more suitable for learning algorithms. Data preparation relies on statistical methods such as the following (a short sketch of all three appears after the list):
- Scaling: standardization and normalization.
- Encoding: integer and one-hot encoding.
- Transforms: power transforms such as the Box-Cox method.
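The sketch below shows all three steps with scikit-learn; the tiny arrays and the choice of Box-Cox (which requires strictly positive values) are illustrative.

```python
# Scaling, encoding, and a power transform with scikit-learn.
import numpy as np
from sklearn.preprocessing import (MinMaxScaler, OneHotEncoder,
                                   OrdinalEncoder, PowerTransformer,
                                   StandardScaler)

num = np.array([[1.0], [2.0], [5.0], [10.0]])
cat = np.array([["red"], ["green"], ["blue"], ["green"]])

# Scaling: standardization (zero mean, unit variance) and
# normalization (rescaling to [0, 1]).
print(StandardScaler().fit_transform(num).ravel())
print(MinMaxScaler().fit_transform(num).ravel())

# Encoding: integer and one-hot encodings of a categorical variable.
print(OrdinalEncoder().fit_transform(cat).ravel())
print(OneHotEncoder().fit_transform(cat).toarray())

# Transforms: Box-Cox power transform (positive values only).
print(PowerTransformer(method="box-cox").fit_transform(num).ravel())
```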
Routine evaluation of a learning method is central to supervised learning, and it usually requires estimating the skill of the model when making predictions on unseen data. Experimental design is the practice of constructing systematic experiments to compare the effect of independent variables on an outcome.
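Estimating model skill is commonly done with resampling; the sketch below uses repeated k-fold cross-validation, with a logistic regression model and synthetic data chosen purely for illustration rather than taken from this article.

```python
# Estimating model skill with repeated k-fold cross-validation.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=3, random_state=0)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y,
                         scoring="accuracy", cv=cv)
print("mean accuracy: %.3f (std %.3f)" % (scores.mean(), scores.std()))
```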
Any given supervised learning algorithm has a suite of parameters that allow the method to be tailored to a specific problem. Results from different parameter configurations are interpreted and compared using statistical hypothesis tests, which quantify how likely an observed result is under a stated assumption (the null hypothesis).
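One common (assumed, not prescribed here) recipe is a paired t-test on the cross-validation scores of two configurations; the folds are not fully independent, so treat the p-value as indicative rather than exact.

```python
# Comparing two parameter configurations with a paired t-test on
# their cross-validation scores (illustrative models and data).
from scipy.stats import ttest_rel
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold, cross_val_score
from sklearn.svm import SVC

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
cv = KFold(n_splits=10, shuffle=True, random_state=0)

scores_a = cross_val_score(SVC(C=0.1), X, y, cv=cv)   # configuration A
scores_b = cross_val_score(SVC(C=10.0), X, y, cv=cv)  # configuration B

# Null hypothesis: both configurations have the same mean skill.
stat, p = ttest_rel(scores_a, scores_b)
print("t = %.3f, p = %.3f" % (stat, p))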
Once a final model is ready, it is presented to stakeholders before being deployed on real data, and part of that presentation is quantifying its expected skill. This is the job of estimation statistics.
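A typical estimation statistic is a confidence interval around the model's skill. The sketch below bootstraps a 95% interval for test-set accuracy; the model, data, and confidence level are illustrative assumptions.

```python
# A bootstrap confidence interval for final-model accuracy.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
correct = (model.predict(X_te) == y_te).astype(float)

# Resample per-example correctness to get an interval for accuracy.
rng = np.random.default_rng(0)
boot = [rng.choice(correct, size=correct.size, replace=True).mean()
        for _ in range(1000)]
low, high = np.percentile(boot, [2.5, 97.5])
print("accuracy %.3f, 95%% CI [%.3f, %.3f]" % (correct.mean(), low, high))
```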
Statistical concepts play a major role in supervised learning. From understanding, cleaning, and preparing the data for modelling, through selecting and evaluating a model, to presenting its skill and its predictions, statistics helps you through the entire process.