Machine learning is the study of computer algorithms that improve automatically through experience and through the use of data.
(source: https://en.wikipedia.org/wiki/Machine_learning)
Supervised learning is defined by its use of labeled datasets to train algorithms to classify data or predict outcomes accurately.
Unsupervised learning is the opposite of supervised learning: it uses no labels. Instead, the algorithm is fed a large amount of data and given the tools to understand the properties of the data on its own.
Clustering - a data mining technique that groups unlabeled data based on their similarities or differences.
Dimensionality reduction - while more data generally yields more accurate results, it can also hurt the performance of machine learning algorithms (e.g. overfitting) and make it difficult to visualize datasets. Dimensionality reduction reduces the number of features while preserving as much information as possible.
Regression is used to understand the relationship between dependent and independent variables.
Examples: predicting outputs, forecasting data, analyzing time series, and finding causal dependencies between variables.
Problem statement: We have a dataset with the price of some sold houses. For each house we also have some characteristics: number of rooms, size (m^2).
What can we do with this data?
Build a linear regression model to predict the price.
If the price is a linear function of the number of rooms and the size, then we can build a model to predict the house price. The model is simply a linear function:
Price(house) = a * number of rooms + b * size
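A quick sketch in Java, with hypothetical coefficient values (in practice a and b are learned from the data):

static double predictPrice(int numberOfRooms, double sizeM2) {
    double a = 15000.0; // hypothetical: price contribution per room
    double b = 2000.0;  // hypothetical: price contribution per m^2
    return a * numberOfRooms + b * sizeM2;
}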
Linear regression is a linear approach to modelling the relationship between a scalar response and one or more explanatory variables (also known as dependent and independent variables).
In linear regression, the relationships are modeled using linear predictor functions whose unknown model parameters are estimated from the data.
What if the response variable is not a linear combination of the predictor variables (features)?
Let's assume that the price increases quadratically with the number of rooms.
Price(house) = a * (number of rooms)^2 + b * size
Can we still use linear regression?
Yes. Instead of using the features (predictors) number of rooms and size, we can use the new features (number of rooms)^2 and size for the linear regression algorithm.
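A minimal sketch of that feature transformation; squaring the number of rooms keeps the model linear in its parameters:

// Replace the raw predictors (rooms, size) with ((rooms)^2, size);
// the regression itself stays linear.
static double[] engineerFeatures(int numberOfRooms, double sizeM2) {
    return new double[] { numberOfRooms * numberOfRooms, sizeM2 };
}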
Hint 1: Feature engineering is an important part of ML
The success of an ML algorithm very often depends on the features you choose.
We can gather useful information to build the features from a "domain expert".
Features != raw data
Linear regression can be implemented using the least-squares estimation algorithm.
Imagine you have some points and want to find the line that best fits them.
We have two vectors: one is fixed and the other is variable. We have to find the variable vector that minimizes the distance between the two vectors.
The best fit in the least-squares sense minimizes the sum of squared residuals (a residual being the difference between an observed value and the fitted value provided by the model).
https://en.wikipedia.org/wiki/Least_squares
https://www.mathsisfun.com/data/least-squares-calculator.html
https://setosa.io/ev/ordinary-least-squares-regression/
https://en.wikipedia.org/wiki/Euclidean_distance
https://www.mathsisfun.com/data/least-squares-regression.html
Short Demo: https://www.codingame.com/playgrounds/3771/machine-learning-with-java---part-1-linear-regression
Hint 2: In ML we usually try to minimize a loss function.
https://algs4.cs.princeton.edu/code/edu/princeton/cs/algs4/LinearRegression.java.html
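A minimal sketch (not the algs4 implementation linked above) of least-squares fitting for a single feature; the closed-form slope and intercept below are exactly the values that minimize the sum of squared residuals:

// Fit y = intercept + slope * x by least squares.
static double[] fitLeastSquares(double[] x, double[] y) {
    int n = x.length;
    double meanX = 0, meanY = 0;
    for (int i = 0; i < n; i++) { meanX += x[i]; meanY += y[i]; }
    meanX /= n;
    meanY /= n;
    double covXY = 0, varX = 0;
    for (int i = 0; i < n; i++) {
        covXY += (x[i] - meanX) * (y[i] - meanY);
        varX  += (x[i] - meanX) * (x[i] - meanX);
    }
    double slope = covXY / varX;
    double intercept = meanY - slope * meanX;
    return new double[] { intercept, slope };
}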
Clustering is the task of grouping a set of objects in such a way that objects in the same group (called a cluster) are more similar (in some sense) to each other than to those in other groups (clusters)
The notion of a "cluster" cannot be precisely defined, which is one of the reasons why there are so many clustering algorithms
Homework :)
https://en.wikipedia.org/wiki/Cluster_analysis
https://en.wikipedia.org/wiki/K-means_clustering
https://en.wikipedia.org/wiki/DBSCAN
f(x) = a + b*x1 + c*x2
Features: x1, x2; they need to be linearly independent of each other; y needs to depend linearly on them
Weights: a, b, c
Overfitting: learning a specific data set too well and failing to generalize
Train set = the data set on which we train the model
Cross validation set = the data set on which we test different algorithms/parameters
Test set = the final data set on which we evaluate the chosen algorithm
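A minimal sketch of such a split in Java; the 60/20/20 proportions are a common convention, not something prescribed here:

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Shuffle the data, then split it into train / cross validation / test sets.
static <T> List<List<T>> split(List<T> data) {
    List<T> shuffled = new ArrayList<>(data);
    Collections.shuffle(shuffled);
    int trainEnd = (int) (shuffled.size() * 0.6);
    int cvEnd = (int) (shuffled.size() * 0.8);
    List<List<T>> parts = new ArrayList<>();
    parts.add(shuffled.subList(0, trainEnd));            // train set
    parts.add(shuffled.subList(trainEnd, cvEnd));        // cross validation set
    parts.add(shuffled.subList(cvEnd, shuffled.size())); // test set
    return parts;
}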
Hint 3: ML is not about using a specific algorithm. It is more about modeling the problem as the right type of ML problem, e.g. for the exact same problem we can use different ML techniques, and for each ML technique we can use different algorithms.
Anomaly detection (aka outlier analysis) is a use case of ML.
It is the problem of finding the data points that deviate from a dataset's normal behavior.
It can be modelled using several ML techniques:
- Regression: you build a prediction model and you compare the predicted value with the real one.
- Clustering: you cluster the data points based on density; the data points remaining outside the clusters (outliers) are the anomalies.
- Classification: you need 2 balanced (same size) data sets (one with normal data points and one with anomalous data points) and you train a classification model to distinguish between them.
- Statistically: you infer the probability distribution of the data set and any point with a low probability is an anomaly (example: use a normal distribution)
Problem statement "Short term excessive risk":
- detect excessive betting risk/activity that Kambi takes in a short period of time
Solution: apply an aggregation sliding time window. Check if the aggregated result is greater than a pre-defined threshold.
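A minimal sketch of the sliding-window check in Java; the 10-minute window and the threshold value are hypothetical:

import java.util.ArrayDeque;
import java.util.Deque;

class RiskWindow {
    static final long WINDOW_MS = 10 * 60 * 1000; // hypothetical: 10-minute window
    static final double THRESHOLD = 100_000.0;    // hypothetical pre-defined limit

    private final Deque<double[]> events = new ArrayDeque<>(); // {timestampMs, risk}
    private double aggregatedRisk = 0;

    // Returns true when the risk taken inside the window is excessive.
    boolean onBet(long timestampMs, double risk) {
        events.addLast(new double[] { timestampMs, risk });
        aggregatedRisk += risk;
        // Evict bets that fell out of the window.
        while (!events.isEmpty() && events.peekFirst()[0] < timestampMs - WINDOW_MS) {
            aggregatedRisk -= events.removeFirst()[1];
        }
        return aggregatedRisk > THRESHOLD;
    }
}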
Let's slightly change the problem statement:
- detect suspicious betting risk/activity that Kambi takes in a short period of time.
Current solution problems:
We may get false positive "suspicious" detections.
Why feature engineering is important:
Volume derivative = Volume(t) - Volume(t-1)
Define a threshold for the derivative above which we have an anomaly.
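A minimal sketch; the threshold is a hypothetical parameter:

// Flag an anomaly when the volume jumps too much between consecutive points.
static boolean isVolumeAnomaly(double[] volume, int t, double derivativeThreshold) {
    double derivative = volume[t] - volume[t - 1]; // Volume(t) - Volume(t-1)
    return derivative > derivativeThreshold;
}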
- Divide the last hour into 6 equal time intervals: 0-10, 10-20, ...
- Aggregate the risk taken in each interval.
- Build a linear regression model to predict the "aggregated risk" in the next time interval (the next 10 minutes).
- Compare the predicted value with the real aggregated risk of the next 10 minutes.
- If the difference is too big, then we have an anomaly.
So... we have an anomaly detection problem solved with linear regression!
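A minimal sketch of these steps, reusing the fitLeastSquares sketch from the least-squares section; the tolerance is a hypothetical parameter:

// risks[i] = aggregated risk of the i-th 10-minute interval of the last hour.
static boolean isRiskAnomaly(double[] risks, double actualNextRisk, double tolerance) {
    // Regress the aggregated risk against the interval index 0..5.
    double[] t = new double[risks.length];
    for (int i = 0; i < t.length; i++) t[i] = i;
    double[] fit = fitLeastSquares(t, risks); // {intercept, slope}
    double predicted = fit[0] + fit[1] * risks.length; // extrapolate to the next interval
    return Math.abs(actualNextRisk - predicted) > tolerance;
}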
- Divide the last hour into 6 equal time intervals: 0-10, 10-20, ...
- Aggregate the risk taken in each interval.
- Calculate the derivative for each point (relative to the previous point).
- Build a linear regression model to predict the "derivative risk" in the next time interval (the next 10 minutes).
- Compare the predicted value with the real derivative risk of the next 10 minutes.
- If the difference is too big, then we have an anomaly.
Unlike the previous version, this one will not have problems when the trend changes (e.g. from ascending to descending).
Pre-requisites: for this solution we need historical data for both anomalies and normal cases, equally sized.
Offline (build model):
- Calculate the historical betting activity for a fixed time interval
- Label each data point as anomaly or normal
- Train a binary classification algorithm
Online (apply model):
- Take the current betting activity in that fixed time interval
- Predict its class using the trained model
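A minimal sketch of the offline/online steps, using a from-scratch logistic regression on a single feature (the betting activity); the learning rate and epoch count are hypothetical, and in practice the feature should be normalized first:

// Offline: train a binary classifier (labels: 1 = anomaly, 0 = normal).
static double[] trainClassifier(double[] activity, int[] label) {
    double w = 0.0, b = 0.0;
    double lr = 0.01; // hypothetical learning rate
    for (int epoch = 0; epoch < 1000; epoch++) {
        for (int i = 0; i < activity.length; i++) {
            double p = 1.0 / (1.0 + Math.exp(-(w * activity[i] + b)));
            double err = p - label[i]; // gradient of the log loss
            w -= lr * err * activity[i];
            b -= lr * err;
        }
    }
    return new double[] { w, b };
}

// Online: predict the class of the current betting activity.
static boolean isClassifiedAnomaly(double activity, double w, double b) {
    return 1.0 / (1.0 + Math.exp(-(w * activity + b))) > 0.5;
}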
Pre-requisites: for this solution we need historical data for normal cases.
Advantage vs. solution 4 (classification): we don't need many historical anomaly cases.
Offline (build model):
- Infer the probability distribution from the data (usually a Gaussian distribution), e.g. calculate the mean and standard deviation.
Online (apply model):
- Take the current betting activity in that fixed time interval
- Apply the probability density function to get the probability of the current point appearing under the distribution
- If the probability is low => we have an anomaly
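A minimal sketch of the Gaussian variant; epsilon, the probability cutoff, is a hypothetical parameter:

// Offline: estimate mean and standard deviation from normal historical data.
static double[] fitGaussian(double[] data) {
    double mean = 0;
    for (double x : data) mean += x;
    mean /= data.length;
    double variance = 0;
    for (double x : data) variance += (x - mean) * (x - mean);
    variance /= data.length;
    return new double[] { mean, Math.sqrt(variance) };
}

// Online: a point whose density under the fitted Gaussian is below epsilon
// is flagged as an anomaly.
static boolean isStatisticalAnomaly(double x, double mean, double std, double epsilon) {
    double z = (x - mean) / std;
    double density = Math.exp(-0.5 * z * z) / (std * Math.sqrt(2 * Math.PI));
    return density < epsilon;
}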