Hui Hu Ph.D.
Department of Epidemiology
College of Public Health and Health Professions & College of Medicine
huihu@ufl.edu
April 9, 2020
Image from MNIST handwritten digit dataset
A zero that is difficult to distinguish from a six algorithmically
We don't know what program to write because we don't know how it's done by our brains
Let's define our model to be a function
A linear perceptron
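A minimal sketch of a linear perceptron with a step activation and the classic perceptron update rule (the toy data and learning rate below are assumptions for illustration):

```python
import numpy as np

def perceptron_train(X, y, lr=0.1, epochs=20):
    """Train a linear perceptron: predict 1 if w.x + b > 0, else 0."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            pred = 1 if np.dot(w, xi) + b > 0 else 0
            # Perceptron rule: nudge the boundary toward misclassified points
            w += lr * (yi - pred) * xi
            b += lr * (yi - pred)
    return w, b

# Toy linearly separable problem (AND-like)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 0, 0, 1])
print(perceptron_train(X, y))
```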
Split the data into Training (80%) and Testing (20%); use k-fold CV on the training set to tune models and evaluate performance.
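A short scikit-learn sketch of this workflow (the dataset, model, and 5 folds are assumptions; the 80/20 split follows the slide):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, KFold, cross_val_score
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)

# 80% training, 20% testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# k-fold CV on the training set to tune/compare models
model = LogisticRegression(max_iter=1000)
cv = KFold(n_splits=5, shuffle=True, random_state=0)
print("CV accuracy:", cross_val_score(model, X_train, y_train, cv=cv).mean())

# Final evaluation on the held-out test set
model.fit(X_train, y_train)
print("Test accuracy:", model.score(X_test, y_test))
```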
Raw Data → Data Engineering → Explore → Model Selection → Feature Engineering → Train Model → Evaluate Performance → Data Product
The hypothesis function: $h_\theta(x) = \theta_0 + \theta_1 x$
Cost function: $J(\theta_0, \theta_1) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2$
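A small NumPy sketch of this cost function (the toy data are assumed for illustration):

```python
import numpy as np

def cost(theta0, theta1, x, y):
    """J(theta0, theta1) = (1/2m) * sum((h(x_i) - y_i)^2)"""
    m = len(x)
    h = theta0 + theta1 * x          # hypothesis h_theta(x)
    return np.sum((h - y) ** 2) / (2 * m)

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.0, 4.1, 5.9, 8.2])
print(cost(0.0, 2.0, x, y))   # cost at theta0 = 0, theta1 = 2
```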
Now we need to estimate the parameters in the hypothesis function.
The gradient descent algorithm is:
repeat until convergence: $\theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta_0, \theta_1)$
where $j = 0, 1$ represents the feature index number.
Gradient Descent for Linear Regression:
repeat until convergence:
$\theta_0 := \theta_0 - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)$
$\theta_1 := \theta_1 - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x^{(i)}$
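A minimal NumPy implementation of these update rules (the learning rate, iteration count, and toy data are illustrative assumptions):

```python
import numpy as np

def gradient_descent(x, y, alpha=0.05, n_iter=1000):
    """Simultaneously update theta0 and theta1 until (approximate) convergence."""
    m = len(x)
    theta0, theta1 = 0.0, 0.0
    for _ in range(n_iter):
        h = theta0 + theta1 * x
        grad0 = np.sum(h - y) / m
        grad1 = np.sum((h - y) * x) / m
        theta0 -= alpha * grad0
        theta1 -= alpha * grad1
    return theta0, theta1

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.0, 4.1, 5.9, 8.2])
print(gradient_descent(x, y))   # roughly theta0 ~ 0, theta1 ~ 2
```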
Normal Equation: $\theta = (X^{T} X)^{-1} X^{T} y$
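The same fit via the normal equation, sketched with NumPy (a column of ones is added to the design matrix for $\theta_0$; the toy data are assumed):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.0, 4.1, 5.9, 8.2])

X = np.column_stack([np.ones_like(x), x])    # design matrix with intercept column
theta = np.linalg.solve(X.T @ X, X.T @ y)    # solves (X^T X) theta = X^T y without an explicit inverse
print(theta)                                 # [theta0, theta1]
```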
| Gradient Descent | Normal Equation |
|---|---|
| Need to choose alpha | No need to choose alpha |
| Needs many iterations | No need to iterate |
| Works well when n is large | Slow if n is very large |
For large datasets, we usually use stochastic gradient descent.
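A rough sketch of stochastic gradient descent for the same linear regression, updating on one randomly chosen observation at a time (learning rate, epochs, and data are assumptions):

```python
import numpy as np

def sgd_linear_regression(x, y, alpha=0.01, epochs=50, seed=0):
    """Update theta from one observation at a time instead of the full-batch sum."""
    rng = np.random.default_rng(seed)
    theta0, theta1 = 0.0, 0.0
    for _ in range(epochs):
        for i in rng.permutation(len(x)):      # shuffle the observations each epoch
            err = (theta0 + theta1 * x[i]) - y[i]
            theta0 -= alpha * err
            theta1 -= alpha * err * x[i]
    return theta0, theta1

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.0, 4.1, 5.9, 8.2])
print(sgd_linear_regression(x, y))
```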
If we have overfitting from our hypothesis function, we can reduce the weight that some of the terms in our function carry by increasing their cost.
Say we want to make the hypothesis more quadratic, i.e. eliminate the influence of the cubic and quartic terms. Without actually getting rid of these features or changing the form of our hypothesis, we can instead modify our cost function by adding large penalties on those parameters, for example:
$\min_\theta \; \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2 + 1000 \cdot \theta_3^2 + 1000 \cdot \theta_4^2$
In general: $J(\theta) = \frac{1}{2m} \left[ \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2 + \lambda \sum_{j=1}^{n} \theta_j^2 \right]$
This penalty is L2 regularization (Ridge). The common penalty terms are:
L1 regularization (Lasso): $\lambda \sum_{j} |\theta_j|$
L2 regularization (Ridge): $\lambda \sum_{j} \theta_j^2$
L1+L2 regularization (Elastic net): $\lambda_1 \sum_{j} |\theta_j| + \lambda_2 \sum_{j} \theta_j^2$
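A short scikit-learn sketch of the three penalties (the synthetic data, alpha, and l1_ratio values are illustrative assumptions):

```python
import numpy as np
from sklearn.linear_model import Ridge, Lasso, ElasticNet

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))
y = X[:, 0] * 3.0 + rng.normal(scale=0.5, size=100)   # only the first feature matters

for name, model in [("Ridge (L2)", Ridge(alpha=1.0)),
                    ("Lasso (L1)", Lasso(alpha=0.1)),
                    ("Elastic net (L1+L2)", ElasticNet(alpha=0.1, l1_ratio=0.5))]:
    model.fit(X, y)
    # Lasso and Elastic net tend to shrink irrelevant coefficients to exactly 0
    print(name, np.round(model.coef_, 2))
```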
Bootstrap Aggregation (Bagging)
AdaBoost (Adaptive Boosting)
Gradient Boosting (Stochastic Gradient Boosting)
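A minimal bootstrap-aggregation sketch with scikit-learn's BaggingClassifier, whose default base learner is a decision tree (the dataset and settings are assumptions for illustration):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)

# Bagging: fit many trees on bootstrap samples and average their votes
bagging = BaggingClassifier(n_estimators=100, random_state=0)
print(cross_val_score(bagging, X, y, cv=5).mean())
```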
Iteration 1 → Iteration 2 → Iteration 3 → ... → Final Model
Intuitive sense:
- weights are increased for incorrectly classified observations, giving them more focus in the next iteration
- weights are reduced for correctly classified observations
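A corresponding AdaBoost sketch with scikit-learn; its default weak learner is a depth-1 decision tree ("stump"), and the dataset and estimator count here are assumptions:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)

# Each round re-weights observations: misclassified ones get more weight next round
ada = AdaBoostClassifier(n_estimators=200, random_state=0)
print(cross_val_score(ada, X, y, cv=5).mean())
```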
Initial model → compute residuals → model the residuals → combine; repeat with the combined model's predictions and their residuals, ...
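A hand-rolled sketch of this residual-fitting loop (small regression trees, a fixed learning rate, and synthetic data are assumed):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 6, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=200)

lr = 0.1
pred = np.full_like(y, y.mean())      # initial model: just the mean
trees = []
for _ in range(100):                  # each iteration models the current residuals
    residuals = y - pred
    tree = DecisionTreeRegressor(max_depth=2).fit(X, residuals)
    trees.append(tree)
    pred += lr * tree.predict(X)      # combine: add a shrunken correction

print("final training MSE:", np.mean((y - pred) ** 2))
```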
Feed-forward NNs:
Logistic (sigmoid) function: $\sigma(x) = \frac{1}{1 + e^{-x}}$, output in (0, 1)
Hyperbolic tangent (tanh) function: $\tanh(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}}$, output in (-1, 1)
ReLU (rectified linear unit) function: $\mathrm{ReLU}(x) = \max(0, x)$
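These activation functions as a quick NumPy sketch:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))   # squashes inputs to (0, 1)

def tanh(x):
    return np.tanh(x)                 # squashes inputs to (-1, 1)

def relu(x):
    return np.maximum(0.0, x)         # 0 for negative inputs, identity otherwise

z = np.array([-2.0, 0.0, 2.0])
print(sigmoid(z), tanh(z), relu(z))
```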
Does color matter?
No, only the structure matters