Linear and logistic regression
Our first supervised model with the hypothesis:
\( h(x) = \theta_0 + \theta_1 x \)
We have a function (the "cost function") to evaluate this model's error.
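Here it is the mean squared error (named later in this section); a common way to write it, averaging over the \( n \) training examples \( (x^{(i)}, y^{(i)}) \) and using the conventional factor \( \frac{1}{2} \) that simplifies the derivative, is:

$$ J(\theta_0, \theta_1) = \frac{1}{2n} \sum_{i=1}^{n} \left( h(x^{(i)}) - y^{(i)} \right)^2 $$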
And an algorithm (gradient descent) to minimize this function and find the best parameters \( \theta_0, \theta_1 \):
Repeat until convergence {
\( \theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta_0, \theta_1) \) (simultaneously for \( j = 0, 1 \), with learning rate \( \alpha \))
}
| Size | Price |
|---|---|
| 1060 | 244 |
| 920 | 231 |
| 250 | 76 |
| ... | ... |
| Size | # of bedrooms | # of floors | Age of home | ... | Price |
|---|---|---|---|---|---|
| 1060 | 5 | 2 | 40 | ... | 244 |
| 920 | 3 | 1 | 32 | ... | 231 |
| 750 | 4 | 2 | 15 | ... | 197 |
| ... | ... | ... | ... | ... | ... |
In the table above, the features \( x \) are every column except Price, and the target \( y \) is the Price column.
Notation:
\( n \) = Number of training examples
\( m \) = Number of features
\( x \) = matrix of "input" variables / features
\( x_i \) = \( i^{th} \) column of x
\( y \) = "output" variable / "target" variable
Training set: \( n \times m \) matrix
For convenience of notation, define \( x_0 = 1 \), so that the hypothesis can be written:
\( h(x) = \theta_0 x_0 + \theta_1 x_1 + \dots + \theta_m x_m = \theta^T x \)
Repeat until convergence {
\( \theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta) \) (simultaneously for \( j = 0, \dots, m \))
}
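As a minimal sketch of this update in code (plain NumPy, assuming the mean squared error cost above, a design matrix `X` whose first column is the constant \( x_0 = 1 \), and a hypothetical learning rate `alpha`):

```python
import numpy as np

def gradient_descent(X, y, alpha=0.01, n_iters=1000):
    """Batch gradient descent for linear regression with the MSE cost.

    X is an (n, m+1) matrix whose first column is all ones (the x_0 = 1
    convention above); y is the (n,) vector of targets.
    """
    n = X.shape[0]
    theta = np.zeros(X.shape[1])
    for _ in range(n_iters):
        predictions = X @ theta                 # h(x) = theta^T x for every example
        gradient = X.T @ (predictions - y) / n  # partial derivatives of J w.r.t. each theta_j
        theta -= alpha * gradient               # simultaneous update of all parameters
    return theta
```

In practice the learning rate and the number of iterations have to be tuned, and the features should be rescaled first (see below).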
First steps in feature engineering
Make sure features are on a similar scale
| Size | # bedrooms |
|---|---|
| 1060 | 2 |
| 920 | 4 |
| 250 | 1 |
Example:
\( x_1 \): size (0-1000)
\( x_2 \): # of bedrooms (1-4)
The same variation of \( \theta_1 \) will impact the value of \( J(\theta) \) far more than a variation of \( \theta_2 \), which makes gradient descent converge slowly
=> We want to put everything on a similar scale (i.e. within some common interval)
Simplest method: rescale to the range [0, 1] (min-max scaling):
\( x' = \frac{x - \min(x)}{\max(x) - \min(x)} \)
| Size | Size' |
|---|---|
| 1060 | 1 |
| 920 | 0.83 |
| 250 | 0 |
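As a quick check in code (this uses scikit-learn's `MinMaxScaler`, which is not part of the original example but implements exactly this rescaling):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

sizes = np.array([[1060.0], [920.0], [250.0]])
print(MinMaxScaler().fit_transform(sizes).round(2))   # -> 1.0, 0.83, 0.0 (the Size' column above)
```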
Another option (mean normalization): subtract the mean and divide by the range:
| Size | Size' |
|---|---|
| 1060 | 0.39 |
| 920 | 0.22 |
| 250 | -0.62 |
Remove the mean and scale to unit variance (standardization):
\( x' = \frac{x - \mu}{\sigma} \)
| Size | Size' |
|---|---|
| 1060 | 0.89 |
| 920 | 0.49 |
| 250 | -1.39 |
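The same transformation with scikit-learn (the `StandardScaler` here is an assumption about tooling; the slides only show the resulting values):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

sizes = np.array([[1060.0], [920.0], [250.0]])
print(StandardScaler().fit_transform(sizes).round(2))   # -> 0.9, 0.5, -1.4 (the Size' column above, up to rounding)
```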
Duplicate some features, raised to higher degrees (polynomial features)
Example:
First degree: \( \theta_0 + \theta_1 x_1 \)
Second degree: \( \theta_0 + \theta_1 x_1 + \theta_2 x_1^2 \)
Third degree: \( \theta_0 + \theta_1 x_1 + \theta_2 x_1^2 + \theta_3 x_1^3 \)
...
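These extra terms can be generated automatically; a sketch with scikit-learn's `PolynomialFeatures` (an illustration, not something shown above):

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

X = np.array([[1060.0], [920.0], [250.0]])            # a single feature x_1
poly = PolynomialFeatures(degree=3, include_bias=False)
X_poly = poly.fit_transform(X)                        # columns: x_1, x_1^2, x_1^3
```

Remember to rescale these new columns as well, since \( x_1^3 \) lives on a much larger scale than \( x_1 \).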
Logistic regression
Model
Email: Spam or not spam?
Tumor: Malignant or benign?
Sport: Will a team win its next game?
Output is:
\( y \in \{0,1\} \)
0 is the "negative" class
1 is the "positive" class
Given some data, we want to predict the probability of an output.
For example, given the score of a team, what is the probability that it will win?
\( 0 < p(\text{win} \mid \text{score}) < 1 \)
If \( p(\text{win} \mid \text{score}) > 0.5 \), predict "Win" ("1"); otherwise predict "Lose" ("0")
We want our model to reflect this and output values in \( [0, 1] \)
The hypothesis becomes:
\( h(x) = g(\theta^T x) \)
with \( g \) = the sigmoid function:
\( g(z) = \frac{1}{1 + e^{-z}} \)
\( h(x) = p(y = 1 \mid x; \theta) \): "Probability that y = 1, given x, parameterized by \( \theta \)"
Cost function & optimisation
For linear regression, we saw the cost function named "mean squared error". With the sigmoid hypothesis the squared error gives a non-convex cost, so logistic regression uses a different loss:
The loss for a predicted probability \( p \) with respect to the true label \( y \) is the log loss (cross-entropy):
\( \mathrm{cost}(p, y) = -y \log(p) - (1 - y) \log(1 - p) \)
Examples:
The first term, \( -y \log(p) \), is 0 if the current sample \( i \) has a negative output (\( y = 0 \))
The second term, \( -(1 - y) \log(1 - p) \), is 0 if the current sample \( i \) has a positive output (\( y = 1 \))
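For intuition, a small worked example (the numbers are illustrative, not from the original):

$$ y = 1,\ p = 0.9:\ -\log(0.9) \approx 0.11 \qquad\qquad y = 1,\ p = 0.1:\ -\log(0.1) \approx 2.30 $$

Confident but wrong predictions are penalized much more heavily than confident, correct ones.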
Nothing really changes!
Repeat until convergence {
\( \theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta) \) (simultaneously for all \( j \))
}
Just compute the partial derivative of our new cost function
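For reference, assuming \( J(\theta) \) is the average of the losses above over the \( n \) training examples (written \( x^{(i)}, y^{(i)} \)), the partial derivative takes the same form as in linear regression:

$$ \frac{\partial}{\partial \theta_j} J(\theta) = \frac{1}{n} \sum_{i=1}^{n} \left( h(x^{(i)}) - y^{(i)} \right) x_j^{(i)} $$

which is why the update rule is unchanged.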
For \( K \) classes, \( j = 1, \dots, K \):
Train a classifier \( h_j(x) \) for each class \( j \)
On a new input \( x \), to make a prediction, pick the class \( j \) that maximizes:
$$ \max_{j} h_j(x) $$
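A minimal one-vs-all sketch with scikit-learn's `LogisticRegression` (illustrative variable names `X`, `y`; note that scikit-learn can also handle multi-class problems directly):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def one_vs_all_fit(X, y, classes):
    """Train one binary classifier h_j per class j (class j vs. the rest)."""
    return {j: LogisticRegression().fit(X, (y == j).astype(int)) for j in classes}

def one_vs_all_predict(classifiers, X_new):
    """Pick, for each input, the class whose classifier gives the highest probability."""
    classes = list(classifiers)
    probas = np.column_stack(
        [classifiers[j].predict_proba(X_new)[:, 1] for j in classes]
    )
    return np.array(classes)[probas.argmax(axis=1)]
```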
Linear Regression
from sklearn import linear_model
ols = linear_model.LinearRegression()
ols.fit(X, y) # Train your model
ols.coef_ # Theta
ols.predict(X_new) # Predict on a new value
Logistic Regression
from sklearn import linear_model
clf = linear_model.LogisticRegression()
clf.fit(X, y) # Train your model
clf.coef_ # Theta
clf.predict(X_new) # Predict on a new value
You now know how to solve regression and classification problems with a linear model.
You also know how to rescale features.
Next session, we will learn how to train and evaluate a model properly in practice.
And how to tackle some common problems (overfitting, cross-validation, ...)
After that, for supervised learning, it will all be new models, optimisation methods, and tricks to tackle specific problems: imbalanced learning, dealing with text data, image data, ...
Kaggle Inclass
Regression: Predict NYC house price
Classification: Predict if a patient has a disease or not
Multi-class classification: Predict whether a water pump in Africa is functional, needs some repairs, or doesn't work at all