Machine learning is a field of artificial intelligence that uses statistical techniques to give computer systems the ability to "learn" (i.e., progressively improve performance on a specific task) from data, without being explicitly programmed.
From Arthur Samuel (source: Wikipedia)
"A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P if its performance at tasks in T, as measured by P, improves with experience E"
From Tom M. Mitchell
Supervised learning
Semi-supervised learning
Unsupervised learning
Reinforcement learning
Transfer learning
Principle:
Find the general rule that links data and labels
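A minimal sketch of this principle with a 1-nearest-neighbor classifier (data is illustrative): here the "general rule" linking data and labels is simply "copy the label of the closest training point".

```python
import numpy as np

# labeled training data: two classes in 2-D (illustrative values)
X_train = np.array([[0.0, 0.0], [0.2, 0.1], [5.0, 5.0], [5.1, 4.9]])
y_train = np.array([0, 0, 1, 1])

def predict_1nn(x):
    """Predict the label of the nearest training point (1-nearest-neighbor)."""
    d = np.linalg.norm(X_train - x, axis=1)
    return y_train[d.argmin()]

print(predict_1nn(np.array([0.1, 0.1])))  # -> 0 (close to the first class)
print(predict_1nn(np.array([4.8, 5.2])))  # -> 1 (close to the second class)
```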
Principle:
Find a structure in the data
Clustering
Estimate the distribution of data
Estimate covariance in the data
Identify outlier data
Reduce the number of variables
...
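Clustering, the first item above, can be sketched with a minimal K-means in NumPy (data and k are illustrative): alternate between assigning points to the nearest centroid and moving each centroid to the mean of its points.

```python
import numpy as np

def kmeans(X, k, n_iter=50, seed=0):
    """Minimal K-means: alternate assignment and centroid update."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # assign each point to its nearest centroid
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # move each centroid to the mean of its assigned points
        centers = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                            else centers[j] for j in range(k)])
    return labels, centers

# two well-separated blobs of unlabeled points (illustrative data)
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.3, (50, 2)), rng.normal(5, 0.3, (50, 2))])
labels, centers = kmeans(X, k=2)
```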
Clustering
Principle: find a structure in the data but with a large amount of unlabeled data
Principle:
Procedure in which the algorithm can query the supervisor to obtain new data
From scikit-learn
Principle:
An agent learns the actions to perform on its environment by maximizing a reward
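A minimal sketch of this idea with an epsilon-greedy agent on a 3-armed bandit (all values illustrative): the agent tries actions, observes rewards, and gradually concentrates on the action with the highest estimated reward.

```python
import numpy as np

rng = np.random.default_rng(0)
true_rewards = np.array([0.2, 0.5, 0.8])   # unknown to the agent
estimates = np.zeros(3)                    # agent's reward estimates
counts = np.zeros(3)

for t in range(2000):
    # epsilon-greedy: mostly exploit the best-looking action, sometimes explore
    if rng.random() < 0.1:
        a = rng.integers(3)
    else:
        a = estimates.argmax()
    reward = rng.random() < true_rewards[a]          # Bernoulli reward
    counts[a] += 1
    estimates[a] += (reward - estimates[a]) / counts[a]  # running mean

print(estimates)  # the estimate for arm 2 should end up highest
```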
Principle:
Apply knowledge learned on a previous task to new tasks
From Wikipedia
linear
logistic
polynomial
Management of missing data
Management of outliers
Normalisation
Data formatting (matrices)
Reduction of dimensionality
...
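Two of the steps above can be sketched with NumPy (the imputation and scaling strategies shown are simple illustrative choices, not the only ones):

```python
import numpy as np

X = np.array([[1.0, np.nan, 10.0],
              [2.0,  4.0,  20.0],
              [3.0,  6.0,  30.0]])

# missing data: replace NaNs by the column mean (one simple strategy)
col_mean = np.nanmean(X, axis=0)
X_filled = np.where(np.isnan(X), col_mean, X)

# normalisation: zero mean and unit variance per column
X_std = (X_filled - X_filled.mean(axis=0)) / X_filled.std(axis=0)
```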
Almost all machine learning algorithms have internal parameters to adjust in order to give good results!
"tuning"
Parameters: their values are learned. Ex: the coefficients of a regression
Hyperparameters: their values are fixed before learning. Ex: the number K of clusters in K-means
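A minimal sketch of the distinction with an illustrative polynomial fit: the degree is a hyperparameter chosen before learning, while the coefficients are parameters estimated from the data.

```python
import numpy as np

# hyperparameter: fixed BEFORE learning
degree = 2

# training data drawn from y = 1 + 2x + 3x^2 (noiseless, for illustration)
x = np.linspace(-1, 1, 50)
y = 1 + 2 * x + 3 * x**2

# parameters: LEARNED from the data (the polynomial coefficients)
coeffs = np.polyfit(x, y, deg=degree)  # highest degree first
print(coeffs)  # -> approximately [3, 2, 1]
```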
Objective: minimize a cost/error function
Some optimization methods are quite specific:
...
least squares (regression)
gradient descent (neural networks)
boosting methods
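The gradient-descent case can be sketched on a least-squares cost (illustrative data; the closed-form least-squares solution is used only as a check):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
true_w = np.array([1.0, -2.0, 0.5])
y = X @ true_w

# cost: J(w) = ||Xw - y||^2 / n ; gradient: 2 X^T (Xw - y) / n
w = np.zeros(3)
lr = 0.1
for _ in range(500):
    grad = 2 * X.T @ (X @ w - y) / len(y)
    w -= lr * grad  # step against the gradient

# compare with the closed-form least-squares solution
w_closed, *_ = np.linalg.lstsq(X, y, rcond=None)
```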
Without preconceived assumptions, use generalized methods:
Grid Search
Random Search
Bayesian Search
Evolutionary Search
Gradient-based Optimisation
...
More details here
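Grid search and random search can be sketched over a single hyperparameter (the ridge penalty alpha), scored on a held-out validation set; data and search ranges are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
w_true = rng.normal(size=5)
y = X @ w_true + rng.normal(scale=0.5, size=200)

X_tr, y_tr = X[:150], y[:150]        # training part
X_val, y_val = X[150:], y[150:]      # validation part

def ridge_fit(X, y, alpha):
    """Closed-form ridge regression: (X^T X + alpha I)^-1 X^T y."""
    return np.linalg.solve(X.T @ X + alpha * np.eye(X.shape[1]), X.T @ y)

def val_error(alpha):
    w = ridge_fit(X_tr, y_tr, alpha)
    return np.mean((X_val @ w - y_val) ** 2)

# grid search: evaluate a fixed grid of candidate values
grid = [0.01, 0.1, 1.0, 10.0, 100.0]
best_grid = min(grid, key=val_error)

# random search: sample candidates from a log-uniform distribution
candidates = 10 ** rng.uniform(-2, 2, size=20)
best_rand = min(candidates, key=val_error)
```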
Quantify the result of a model
accuracy
F1 score
precision & recall
area under the ROC curve
mean squared error
percentage of variance explained
mutual information
...
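Several of these metrics can be computed directly; a tiny worked example with hand-checkable values:

```python
import numpy as np

y_true = np.array([1, 0, 1, 1, 0, 1, 0, 0])
y_pred = np.array([1, 0, 1, 0, 0, 1, 1, 0])

tp = np.sum((y_pred == 1) & (y_true == 1))  # true positives  = 3
fp = np.sum((y_pred == 1) & (y_true == 0))  # false positives = 1
fn = np.sum((y_pred == 0) & (y_true == 1))  # false negatives = 1

accuracy = np.mean(y_true == y_pred)                 # 6/8 = 0.75
precision = tp / (tp + fp)                           # 3/4
recall = tp / (tp + fn)                              # 3/4
f1 = 2 * precision * recall / (precision + recall)   # harmonic mean = 0.75

# regression side: mean squared error on two predictions
mse = np.mean((np.array([2.5, 0.0]) - np.array([3.0, -0.5])) ** 2)  # 0.25
```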
The problem consists in "tuning" the algorithm so that it learns with sufficient accuracy while keeping good performance on data it has never encountered (its ability to generalize).
Expected error calculated on the test set = variance of the estimator of f + square of the bias of the estimator of f + irreducible error
Bias: "gap between model and data reality"
Variance: "ability to estimate f with the least variability when the dataset changes"
learning curves as a function of model complexity
In practice we often observe a double descent, with an error peak at intermediate model complexity
Problem that appears when our dataset contains many variables compared to the number of observations
(Illustration: a data matrix with many variables (columns) relative to the number of observations (rows))
"A space almost empty of points"
Under certain conditions, the performance of all algorithms is identical on average
Consequence: there is no "ultimate" algorithm that would always give the best performance for a given dataset
Problem: evaluate the generalization performance of a model while avoiding overfitting
We sample the data in three parts:
training
validation
test
(the data set is first split into "training & validation" and "test"; the training & validation part is then split again into "training" and "validation")
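A minimal sketch of such a three-way split with NumPy (the 60/20/20 proportions are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000
X = rng.normal(size=(n, 4))
y = rng.integers(0, 2, size=n)

# shuffle once, then cut 60% / 20% / 20%
idx = rng.permutation(n)
i_train, i_val = int(0.6 * n), int(0.8 * n)
train_idx, val_idx, test_idx = idx[:i_train], idx[i_train:i_val], idx[i_val:]

X_train, y_train = X[train_idx], y[train_idx]
X_val, y_val = X[val_idx], y[val_idx]
X_test, y_test = X[test_idx], y[test_idx]
```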
Principle: select from the dataset the minimum number of features that contribute the most to good performance
Threshold methods: correlations, information gain, ...
Wrapper methods: recursive feature elimination, hierarchical selection, ...
Intrinsic methods (objective: reduce model complexity): LASSO, Ridge, ...
Domain knowledge
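As an example of an intrinsic method, a sketch with scikit-learn's Lasso (data and alpha are illustrative): the L1 penalty drives the coefficients of uninformative features exactly to zero, so selection happens during training.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n = 200
X = rng.normal(size=(n, 10))
# only the first two features actually matter
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.1, size=n)

lasso = Lasso(alpha=0.1).fit(X, y)
print(lasso.coef_)  # coefficients of the 8 noise features are driven to zero
```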
Feature selection must be fitted on the training set only, never on the test set --> otherwise risk of data leakage
It must be re-applied during each fold of the cross-validation, at the same time as the model training!
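This warning can be sketched with a scikit-learn Pipeline (illustrative pure-noise data): placing the selection step inside the pipeline ensures it is re-fitted on the training part of each cross-validation fold, instead of leaking information from the held-out fold.

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 50))
y = rng.integers(0, 2, size=100)   # pure noise: no real signal

# selection sits INSIDE the pipeline, so each fold refits it on its training part
pipe = Pipeline([
    ("select", SelectKBest(f_classif, k=5)),
    ("clf", LogisticRegression()),
])
scores = cross_val_score(pipe, X, y, cv=5)
print(scores.mean())  # stays near chance level on pure-noise data
```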
Idea: constrain coefficient values to limit their variation
linear regression
polynomial regression
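A sketch of the idea on an illustrative polynomial regression: ridge adds alpha * I to the normal equations, which constrains the coefficients and shrinks them relative to ordinary least squares.

```python
import numpy as np

rng = np.random.default_rng(0)
# noisy data with nearly collinear polynomial features -> unstable OLS coefficients
x = rng.uniform(-1, 1, size=30)
X = np.column_stack([x ** d for d in range(1, 8)])
y = np.sin(2 * x) + rng.normal(scale=0.1, size=30)

# ordinary least squares vs ridge (closed form)
w_ols = np.linalg.lstsq(X, y, rcond=None)[0]
alpha = 1.0
w_ridge = np.linalg.solve(X.T @ X + alpha * np.eye(X.shape[1]), X.T @ y)

print(np.linalg.norm(w_ols), np.linalg.norm(w_ridge))  # ridge norm is smaller
```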
Transform or enrich existing data
Add more data (volume & variety)
business data, open data, scraping ...
Feature engineering:
create new variables
enrichment (annotations, metadata, ...)
transformations (image deformations, ...)
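A sketch of feature engineering on illustrative housing-style data: new variables are created from existing ones (a ratio, a transformation, an indicator).

```python
import numpy as np

# raw variables: price and surface of housing listings (illustrative values)
price = np.array([200_000.0, 350_000.0, 150_000.0])
surface = np.array([50.0, 100.0, 30.0])

# create new variables from the existing ones
price_per_m2 = price / surface          # ratio feature
log_price = np.log(price)               # transformation
is_large = (surface > 60).astype(int)   # threshold / indicator feature

X = np.column_stack([price, surface, price_per_m2, log_price, is_large])
```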