Linear Regression with one variable
Supervised learning
Given the “right answer” for each example in the data
Regression: Predict a real-valued output
Classification: predict a discrete-valued output
Housing prices:
| Size (\( x \)) | Price (\( y \)) |
|---|---|
| 920 | 223 |
| 1060 | 154 |
| 520 | 68 |
| 955 | 127 |
| 902 | 140 |
| ... | ... |
Notation:
\( n \) = Number of training examples
\( x \) = "input" variable / features
\( y \) = "output" variable / "target" variable
Examples:
\( x_1 = 920 \)
\( y_3 = 68 \)
\( (x_4, y_4) = (955, 127) \)
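To make the notation concrete, here is a minimal Python sketch (the list names are illustrative, not part of the course notation) that stores the training set above and reads back the examples just listed:

```python
# Training set from the table above; names are illustrative.
# x holds the "input" variable (size), y the "output"/"target" variable (price).
x = [920, 1060, 520, 955, 902]
y = [223, 154, 68, 127, 140]

n = len(x)  # n = number of training examples

# The course notation is 1-indexed, so x_1 is x[0], y_3 is y[2], and so on.
print(n)           # 5
print(x[0])        # x_1 = 920
print(y[2])        # y_3 = 68
print(x[3], y[3])  # (x_4, y_4) = (955, 127)
```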
The training set is fed into a learning algorithm, which outputs a hypothesis \( h \). Given the size of a house as input, \( h \) returns an estimated price.

Hypothesis: \( h \) maps \( x \) to an estimated value of \( y \).
\( h(x) = \theta_0 + \theta_1 x \)
This model is called linear regression with one variable.
\( \theta_0 \) and \( \theta_1 \) are called the parameters of the model.
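A minimal Python sketch of this hypothesis (the function name `h` mirrors the notation above; the parameter values in the example are arbitrary):

```python
def h(x, theta0, theta1):
    """Hypothesis for linear regression with one variable: h(x) = theta0 + theta1 * x."""
    return theta0 + theta1 * x

# Example: with theta0 = 0 and theta1 = 0.5, a house of size 920
# gets an estimated price of 0 + 0.5 * 920 = 460.
print(h(920, theta0=0.0, theta1=0.5))  # 460.0
```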
| x | y |
|---|---|
| 920 | 223 |
| 1060 | 154 |
| 520 | 68 |
| 955 | 127 |
| 902 | 140 |
| ... | ... |
Training set
Hypothesis:
\( h(x) = \theta_0 + \theta_1 x \)
Question:
How do we choose the parameters \( \theta_0, \theta_1 \)?
\( h(x) = \theta_0 + \theta_1 x \)

Different parameter choices give different lines:

\( \theta_0 = 1.5, \theta_1 = 0 \): \( h(x) = 1.5 + 0x = 1.5 \) (a horizontal line)
\( \theta_0 = 0, \theta_1 = 0.5 \): \( h(x) = 0.5x \) (a line through the origin)
\( \theta_0 = 1, \theta_1 = 0.5 \): \( h(x) = 1 + 0.5x \)
Plotted against the training data, one of these lines is clearly the best fit, but why?
Idea: Choose \( \theta_0, \theta_1 \) so that \( h(x) \) is close to \( y \) for the training examples \( (x, y) \).

We measure "closeness" with the squared-error cost function

$$ J(\theta_0, \theta_1) = \frac{1}{2n} \sum_{i=1}^{n} \left( h(x_i) - y_i \right)^2 $$

(the factor \( \frac{1}{2n} \) is a convention that averages over the examples and simplifies the derivative) and pick the parameters that minimize it:

$$ \min_{\theta_0, \theta_1} J(\theta_0, \theta_1) $$
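A minimal Python sketch of this cost function, assuming the squared-error form above (the function and variable names are illustrative):

```python
def h(x, theta0, theta1):
    # Hypothesis: h(x) = theta0 + theta1 * x
    return theta0 + theta1 * x

def cost(xs, ys, theta0, theta1):
    """Squared-error cost J(theta0, theta1) = 1/(2n) * sum_i (h(x_i) - y_i)^2."""
    n = len(xs)
    return sum((h(x, theta0, theta1) - y) ** 2 for x, y in zip(xs, ys)) / (2 * n)

# Housing training set from the table above.
xs = [920, 1060, 520, 955, 902]
ys = [223, 154, 68, 127, 140]

# A lower J means the line is closer to the training examples.
print(cost(xs, ys, 0.0, 0.15))  # J for h(x) = 0.15 x
print(cost(xs, ys, 50.0, 0.1))  # J for h(x) = 50 + 0.1 x
```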
Objective (simplified, with \( \theta_0 \) fixed to 0 so that \( h(x) = \theta_1 x \)):

$$ \min_{\theta_1} J(\theta_1) $$

Training set:

| x | y |
|---|---|
| 1 | 1 |
| 1.5 | 1.5 |
| 2 | 2 |

With a single parameter we can plot \( J(\theta_1) \) and simply read off the global minimum, if you are lucky; in general we need an algorithm that finds it.
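A minimal Python sketch that evaluates \( J(\theta_1) \) on a grid for this training set and reads off the minimum (the grid range and step are arbitrary choices):

```python
# Toy training set where y = x exactly, so J(theta1) is minimized at theta1 = 1.
xs = [1.0, 1.5, 2.0]
ys = [1.0, 1.5, 2.0]
n = len(xs)

def J(theta1):
    # Simplified cost with theta0 fixed to 0: h(x) = theta1 * x
    return sum((theta1 * x - y) ** 2 for x, y in zip(xs, ys)) / (2 * n)

# Evaluate J on a coarse grid of theta1 values and pick the smallest.
grid = [i / 10 for i in range(0, 21)]  # 0.0, 0.1, ..., 2.0
best = min(grid, key=J)
print(best, J(best))  # 1.0 0.0 -- the global minimum of J(theta1)
```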
Back to the full problem: we have some function \( J(\theta_0, \theta_1) \) and want

$$ \min_{\theta_0, \theta_1} J(\theta_0, \theta_1) $$

Outline: start from some initial \( \theta_0, \theta_1 \) and keep changing them to reduce \( J(\theta_0, \theta_1) \) until we (hopefully) end up at a minimum.
This is what gradient descent does:

repeat until convergence {
$$ \theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta_0, \theta_1) \qquad \text{for } j = 0 \text{ and } j = 1 $$
}

\( \partial \) = partial derivative = the derivative of a function of two or more variables with respect to one of them, the other(s) being treated as constants.

\( \alpha \) is the learning rate: it sets the "size of the step".
\( \frac{\partial}{\partial \theta_j} J(\theta_0, \theta_1) \) is the derivative: it gives the direction of the step ("one step deeper").
In our case (linear regression with one variable), working out the partial derivatives of \( J \) gives:

repeat until convergence {
$$ \theta_0 := \theta_0 - \alpha \frac{1}{n} \sum_{i=1}^{n} \left( h(x_i) - y_i \right) $$
$$ \theta_1 := \theta_1 - \alpha \frac{1}{n} \sum_{i=1}^{n} \left( h(x_i) - y_i \right) x_i $$
}

Simultaneous update: compute both new values before overwriting \( \theta_0 \) and \( \theta_1 \).
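A minimal Python sketch of these updates (the learning rate, iteration count, and test data are illustrative choices, not values from the course):

```python
def gradient_descent(xs, ys, alpha=0.3, iterations=2000):
    """Batch gradient descent for h(x) = theta0 + theta1 * x."""
    n = len(xs)
    theta0, theta1 = 0.0, 0.0
    for _ in range(iterations):
        # Prediction errors h(x_i) - y_i for the current parameters.
        errors = [theta0 + theta1 * x - y for x, y in zip(xs, ys)]
        grad0 = sum(errors) / n
        grad1 = sum(e * x for e, x in zip(errors, xs)) / n
        # Simultaneous update: both gradients are computed before either
        # parameter is changed.
        theta0 -= alpha * grad0
        theta1 -= alpha * grad1
    return theta0, theta1

# Toy training set where y = x: expect theta0 close to 0 and theta1 close to 1.
print(gradient_descent([1.0, 1.5, 2.0], [1.0, 1.5, 2.0]))
```

On data with large input values, such as the housing table above, the same code needs a much smaller learning rate (or rescaled features) for the iteration to converge.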
Intuition (one-parameter case): the derivative is the slope of the tangent to \( J(\theta_1) \) at the current point, so each update takes a step in the downhill tangent direction until the minimum is found.
If \( \alpha \) is too small, gradient descent can be slow.
If \( \alpha \) is too large, gradient descent can overshoot the minimum; it may fail to converge, or even diverge.
Gradient descent gives no assurance that we reach a global minimum.
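A minimal Python sketch of the effect of \( \alpha \), using the simplified one-parameter problem (the specific \( \alpha \) values are illustrative):

```python
# Simplified problem: theta0 fixed to 0, toy data where y = x,
# so the minimum of J(theta1) is at theta1 = 1.
xs = [1.0, 1.5, 2.0]
ys = [1.0, 1.5, 2.0]
n = len(xs)

def run(alpha, iterations=50):
    theta1 = 0.0
    for _ in range(iterations):
        grad = sum((theta1 * x - y) * x for x, y in zip(xs, ys)) / n
        theta1 -= alpha * grad
    return theta1

print(run(alpha=0.01))  # too small: after 50 steps still noticeably short of 1 (slow)
print(run(alpha=0.3))   # reasonable: essentially 1
print(run(alpha=1.0))   # too large: overshoots each step and ends up far from 1 (diverges)
```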
Our first supervised model, with the hypothesis:

\( h(x) = \theta_0 + \theta_1 x \)

We have a function (the "cost function") to evaluate the model's error:

$$ J(\theta_0, \theta_1) = \frac{1}{2n} \sum_{i=1}^{n} \left( h(x_i) - y_i \right)^2 $$

And an algorithm, gradient descent, to minimize this function in order to get the best parameters \( \theta_0, \theta_1 \):

repeat until convergence {
$$ \theta_0 := \theta_0 - \alpha \frac{1}{n} \sum_{i=1}^{n} \left( h(x_i) - y_i \right) $$
$$ \theta_1 := \theta_1 - \alpha \frac{1}{n} \sum_{i=1}^{n} \left( h(x_i) - y_i \right) x_i $$
}