- Deep Learning Crash Course -
UNIFEI - August, 2018
Prof. Luiz Eduardo
Hanneli Tavante
<3
Canada
Prof. Maurilio
<3
Deutschland
Note: this list can change
[Nested fields: AI ⊃ ML ⊃ Representation Learning ⊃ DL]
Let us use the Logistic Regression example
Single training example
We will have multiple training examples:
m training examples in a set
A training set with m examples:

{ (x^(1), y^(1)), (x^(2), y^(2)), ..., (x^(m), y^(m)) }

First x vector | First y (number)
---|---
Second x vector | Second y (number)
... | ...

m = number of training examples
n = number of features in each x vector
[salary, location, tech...] | Probability y=1
---|---
[RGB values] | 0.8353213
... | ...
In logistic regression, we want to know the probability that y=1 given an input vector x
(note: we could have more columns for extra features)
Recap: for logistic regression, what is the equation format?
Tip: Sigmoid, transpose, wx+b

ŷ = σ(z), where z = w^T x + b and σ(z) = 1 / (1 + e^(-z))
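A minimal sketch of this forward computation in NumPy (the feature values, weights, and bias below are made-up numbers, just to show the shapes):

```python
import numpy as np

def sigmoid(z):
    """Sigmoid activation: squashes z into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical single training example with n = 3 features
x = np.array([1.0, 2.0, 0.5])   # input vector
w = np.array([0.2, -0.5, 0.1])  # weights, one per feature
b = 0.3                         # bias (scalar)

z = np.dot(w.T, x) + b          # z = w^T x + b
y_hat = sigmoid(z)              # ŷ = σ(z): the probability that y = 1
```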
Recap: given a training set, you want ŷ^(i) ≈ y^(i) for every example
What do we do now?
For each training example, apply the loss function:
L(ŷ, y) = -(y log(ŷ) + (1 - y) log(1 - ŷ))
If y = 1, you want ŷ to be large (close to 1)
If y = 0, you want ŷ to be small (close to 0)
What do we do now?
Cost function: applies the loss function to the entire training set; it measures how well your parameters are doing on the training set:
J(w, b) = (1/m) Σ L(ŷ^(i), y^(i))
We want to find w and b that minimise J(w, b)
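A sketch of that cost in NumPy, assuming y_hat holds the m predictions and y the m true labels:

```python
import numpy as np

def cost(y_hat, y):
    """Cross-entropy cost J(w, b): the average loss over all m examples."""
    m = y.shape[0]
    loss = -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))
    return np.sum(loss) / m
```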
Gradient Descent: it points downhill
It's time for the partial derivatives; repeat (α is the learning rate):
w := w - α ∂J(w, b)/∂w
b := b - α ∂J(w, b)/∂b
We look at the previous step to calculate the derivative
source: https://www.wikihow.com/Take-Derivatives
Forward pass: x, w, b → z = w^T x + b → ŷ = σ(z) → L(ŷ, y)
Adjust the values of w and b in order to MINIMIZE the loss function
How do we compute the derivative?
The equations (whiteboard); for one example they come out as:
dz = ŷ - y
dw = x · dz
db = dz
Backwards step: propagate these derivatives from the loss back to w and b
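Putting the forward pass, the backwards step, and the gradient descent update together, a sketch of the whole training loop (the vectorized gradients dz = a - y, dw = Xᵀdz/m, db = mean(dz) are the standard logistic regression derivatives; alpha and the iteration count are arbitrary choices):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train(X, y, alpha=0.1, iterations=1000):
    """Logistic regression via gradient descent.
    X: (m, n) matrix of training examples; y: (m,) labels in {0, 1}."""
    m, n = X.shape
    w, b = np.zeros(n), 0.0
    for _ in range(iterations):
        # Forward pass
        a = sigmoid(X @ w + b)       # predictions for all m examples
        # Backwards step
        dz = a - y                   # dJ/dz, one value per example
        dw = X.T @ dz / m            # dJ/dw
        db = np.sum(dz) / m          # dJ/db
        # Gradient descent: step downhill
        w -= alpha * dw
        b -= alpha * db
    return w, b
```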
Good news: this looks like a Neural Network!
[Figure: input layer → hidden layers → output layer]
In each hidden unit of each layer, a computation happens. For example:
z = w^T a + b, followed by a = σ(z)
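A sketch of that computation for one hidden layer in NumPy (the layer sizes and random weights are hypothetical):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical layer: 3 inputs feeding 4 hidden units
rng = np.random.default_rng(0)
W = rng.normal(size=(4, 3))          # one row of weights per hidden unit
b = np.zeros(4)                      # one bias per hidden unit
a_prev = np.array([1.0, 2.0, 0.5])   # activations from the previous layer

z = W @ a_prev + b   # each unit computes its own w^T a + b...
a = sigmoid(z)       # ...and applies the activation function
```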
1. What was the formula for the sigmoid function?
2. Why do we need the sigmoid function? What does it look like?
3. Can we use any other type of function?
Rectified Linear Unit
(ReLU)
[Question: Why do we need to know the derivatives? :) ]
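A sketch of both activations and their derivatives (the derivatives matter because backprop multiplies them into the gradients at every layer):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_derivative(z):
    s = sigmoid(z)
    return s * (1 - s)               # σ'(z) = σ(z)(1 - σ(z))

def relu(z):
    return np.maximum(0.0, z)        # Rectified Linear Unit: max(0, z)

def relu_derivative(z):
    return (z > 0).astype(float)     # 1 where z > 0, 0 elsewhere
```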
Which parameters do we have until now?
W: this is a matrix!
Is b a vector or a matrix?
A: It is a vector
Notation: capital letters (X, W) denote matrices; lowercase (x, w, b) denote vectors
Will the gradient always work?
What's the best activation function?
How many hidden layers should I have?
Source: https://medium.freecodecamp.org/want-to-know-how-deep-learning-works-heres-a-quick-guide-for-everyone-1aedeca88076
6 layers; L = 6
n^[l] = number of units in layer l
General idea, in a layer l:
Forward: input a^[l-1], output a^[l] (z^[l] works like a cache)
Backprop: input da^[l], output da^[l-1], plus the gradients dW^[l] and db^[l]
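A sketch of one layer's forward and backprop steps with the cache, for a single example and a sigmoid activation (shapes and names are illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def layer_forward(a_prev, W, b):
    """Forward in layer l: input a^[l-1], output a^[l].
    z is returned as well: backprop will need it (the 'cache')."""
    z = W @ a_prev + b
    return sigmoid(z), z

def layer_backward(da, z, a_prev, W):
    """Backprop in layer l: input da = dJ/da^[l], output dJ/da^[l-1],
    plus the gradients for this layer's own parameters."""
    s = sigmoid(z)
    dz = da * s * (1 - s)       # chain rule through the activation
    dW = np.outer(dz, a_prev)   # dJ/dW^[l]
    db = dz                     # dJ/db^[l]
    da_prev = W.T @ dz          # dJ/da^[l-1], handed to layer l-1
    return da_prev, dW, db
```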
My predictions are awful. HELP
Select a fraction of your data set to check which model performs best.
Once you find the best model, make a final test (unbiased estimate)
Data split: Training set | C.V. | T
C.V. = Cross Validation set (select the best model)
T = Test set (unbiased estimate)
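A sketch of such a split in NumPy (the 60/20/20 proportions are just a common choice, not a rule from the slides):

```python
import numpy as np

def split(X, y, train=0.6, cv=0.2, seed=0):
    """Shuffle the m examples and split them into training / C.V. / test."""
    m = X.shape[0]
    idx = np.random.default_rng(seed).permutation(m)
    a, b = int(train * m), int((train + cv) * m)
    tr, cv_idx, te = idx[:a], idx[a:b], idx[b:]
    return (X[tr], y[tr]), (X[cv_idx], y[cv_idx]), (X[te], y[te])
```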
Does it look good? The predicted output is not even fitting the training data!
Underfitting: high bias
Does it look good? The model fits everything.
Overfitting: high variance
Capacity: ability to fit a wide variety of functions
Training set error | C.V. error | Diagnosis
---|---|---
1% | 11% | High variance problem
15% | 16% | High bias problem (not even fitting the training set properly!)
15% | 36% | High bias problem AND high variance problem
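A toy helper that encodes this diagnosis table (the error-gap threshold is an illustrative choice, not a standard value):

```python
def diagnose(train_error, cv_error, target_error=0.0, gap=0.05):
    """Rough bias/variance diagnosis from training and C.V. errors."""
    problems = []
    if train_error - target_error > gap:
        problems.append("high bias (underfitting)")
    if cv_error - train_error > gap:
        problems.append("high variance (overfitting)")
    return problems or ["looks fine"]

print(diagnose(0.01, 0.11))  # -> ['high variance (overfitting)']
print(diagnose(0.15, 0.16))  # -> ['high bias (underfitting)']
print(diagnose(0.15, 0.36))  # -> both problems
```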
Are you fitting the data of the training set?
- No → high bias → try a bigger network (more hidden layers and units), train more, or a different N.N. architecture
- Yes → is your C.V. error low?
  - No → high variance → try more data, regularization, or a different N.N. architecture
  - Yes → DONE!
[Plot: Error vs. Capacity; J(θ) train error keeps falling while J(θ) cv/test error turns back up]
Bias problem (underfitting): J(θ) train error is high and J(θ) train ≈ J(θ) cv
Variance problem (overfitting): J(θ) train error is low and J(θ) cv >> J(θ) train
Pick the capacity where J(θ) test error reaches its minimum: stop here!
Why does it happen?
You end up having a very complex N.N., with the elements of the matrix W being too large.
w: HIIIIIIIIIIIIIIIIII
w: HELOOO
w: WOAHHH
w: !!111!!!
W's with a strong presence in every layer can cause overfitting. How can we solve that?
SHHHHHHHH
Norm
L2-Norm (Euclidean): ‖w‖₂² = Σ wⱼ²
L1-Norm: ‖w‖₁ = Σ |wⱼ|
Frobenius Norm (for matrices): ‖W‖_F² = Σᵢ Σⱼ (Wᵢⱼ)²
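A sketch of L2 (Frobenius) regularization added to the cost (the penalty weight lambd is a hypothetical hyperparameter name):

```python
import numpy as np

def cost_l2(y_hat, y, W_list, lambd=0.1):
    """Cross-entropy cost plus an L2/Frobenius penalty on every weight matrix.
    lambd controls how strongly large weights get 'shushed'."""
    m = y.shape[0]
    cross_entropy = -np.sum(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat)) / m
    penalty = (lambd / (2 * m)) * sum(np.sum(W ** 2) for W in W_list)
    return cross_entropy + penalty
```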
Are there other ways to perform regularization, without calculating a norm?
https://slides.com/hannelitavante-hannelita/deep-learning-unifei-ii#/