Health Data Science Meetup

October 10, 2016

Regularization

 

Git and GitHub

 

Implementations in Python

Regularization

Linear Regression with One Variable

The hypothesis function: h(x) = θ0 + θ1·x

Cost function: J(θ0, θ1) = (1/2m) · Σ (h(x^(i)) - y^(i))^2, where the sum runs over the m training examples.

  • It measures the accuracy of our hypothesis function.
  • It takes the average of the squared differences between the hypothesis outputs for the x's and the actual y's.
  • This function is otherwise called the "squared error function" or "mean squared error".
  • The mean is halved as a convenience for the computation of gradient descent: differentiating the squared term produces a factor of 2 that cancels the 1/2.
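
A minimal NumPy sketch of this cost function (the function and variable names are illustrative, not from the talk):

import numpy as np

def compute_cost(theta0, theta1, x, y):
    """Halved mean squared error J(theta0, theta1)."""
    m = len(y)                         # number of training examples
    predictions = theta0 + theta1 * x  # hypothesis h(x) for every example
    errors = predictions - y           # compare to the actual outputs
    return np.sum(errors ** 2) / (2 * m)

# Toy usage: the line y = 2x fits this data exactly, so the cost is 0
x = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 4.0, 6.0])
print(compute_cost(0.0, 2.0, x, y))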

Gradient Descent

Now we need to estimate the parameters in the hypothesis function.

 

  • The way we do this is by taking the derivative of our cost function.
  • The slope of the tangent line is the derivative at that point, and it gives us a direction to move in.
  • We make steps down the cost function in the direction with the steepest descent, and the size of each step is determined by the parameter α, which is called the learning rate.

Gradient Descent

The gradient descent algorithm is:

          repeat until convergence:

              θj := θj - α · ∂J(θ0, θ1)/∂θj

          where j = 0, 1 represents the feature index number.

Gradient Descent for Linear Regression:

          repeat until convergence:

              θ0 := θ0 - (α/m) · Σ (h(x^(i)) - y^(i))
              θ1 := θ1 - (α/m) · Σ (h(x^(i)) - y^(i)) · x^(i)
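
A short NumPy sketch of these updates (alpha and the iteration count are illustrative choices, not values from the talk):

import numpy as np

def gradient_descent(x, y, alpha=0.1, num_iters=1000):
    """Batch gradient descent for one-variable linear regression."""
    m = len(y)
    theta0, theta1 = 0.0, 0.0
    for _ in range(num_iters):
        errors = (theta0 + theta1 * x) - y   # h(x^(i)) - y^(i) for every example
        # Simultaneous update of both parameters
        theta0 = theta0 - alpha * np.sum(errors) / m
        theta1 = theta1 - alpha * np.sum(errors * x) / m
    return theta0, theta1

# Toy usage: should print an intercept near 0 and a slope near 2
x = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 4.0, 6.0])
print(gradient_descent(x, y))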

Why Gradient Descent?

Normal Equation: θ = (X^T X)^(-1) · X^T y

Gradient Descent:
  - Need to choose alpha
  - Needs many iterations
  - Works well when n is large

Normal Equation:
  - No need to choose alpha
  - No need to iterate
  - Slow if n is very large

For large datasets, we usually use stochastic gradient descent.
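
A NumPy sketch of the normal equation for comparison; np.linalg.solve is used rather than forming the matrix inverse explicitly:

import numpy as np

# Design matrix with a leading column of ones for the intercept term
x = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 4.0, 6.0])
X = np.column_stack([np.ones_like(x), x])

# Solve (X^T X) theta = X^T y instead of inverting X^T X
theta = np.linalg.solve(X.T @ X, X.T @ y)
print(theta)  # approximately [0., 2.] for this toy data

(For the stochastic gradient descent mentioned above, scikit-learn's SGDRegressor is one common implementation.)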

Regularization

  • High bias or underfitting:
    - when the form of our hypothesis function h maps poorly to the trend of the data.
    - It is usually caused by a function that is too simple or uses too few features.
  • High variance or overfitting:
    - caused by a hypothesis function that fits the available data but does not generalize well to predict new data.
    - It is usually caused by a complicated function that creates a lot of unnecessary curves and angles unrelated to the data.
  • Two main options to address overfitting:
    - Reduce the number of features (manually select which features to keep, or use a model selection algorithm)
    - Regularization (keep all the features, but reduce the magnitude of the parameters θ)

Regularization

If we have overfitting from our hypothesis function, we can reduce the weight that some of the terms in our function carry by increasing their cost.

Suppose we want to make our hypothesis more nearly quadratic.

Then we'll want to reduce the influence of the cubic and quartic terms.

Without actually getting rid of these features or changing the form of our hypothesis, we can instead modify our cost function: for example, adding large penalty terms such as 1000·θ3^2 + 1000·θ4^2 forces θ3 and θ4 to be small, shrinking those terms toward zero.

In general: J(θ) = (1/2m) · [ Σ (h(x^(i)) - y^(i))^2 + λ · Σ θj^2 ]

L2 regularization (Ridge)
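
A sketch of this penalized cost in NumPy, under the usual convention that the intercept θ0 is not penalized (the names here are illustrative):

import numpy as np

def ridge_cost(theta, X, y, lam):
    """Squared-error cost plus an L2 penalty on every parameter except the intercept."""
    m = len(y)
    errors = X @ theta - y
    penalty = lam * np.sum(theta[1:] ** 2)   # theta[0] is the intercept and is left unpenalized
    return (np.sum(errors ** 2) + penalty) / (2 * m)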

Regularization

L1 regularization (Lasso): adds the penalty λ · Σ |θj| to the cost function

L2 regularization (Ridge): adds the penalty λ · Σ θj^2 to the cost function

L1+L2 regularizations (Elastic net): adds a weighted combination of both penalties, λ1 · Σ |θj| + λ2 · Σ θj^2
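
In scikit-learn these correspond to Ridge, Lasso, and ElasticNet; a minimal sketch with illustrative data and hyperparameter values:

import numpy as np
from sklearn.linear_model import Ridge, Lasso, ElasticNet

# Toy data: 100 samples, 5 features, two of which are irrelevant
rng = np.random.RandomState(0)
X = rng.randn(100, 5)
y = X @ np.array([1.5, 0.0, -2.0, 0.0, 0.5]) + 0.1 * rng.randn(100)

for model in (Ridge(alpha=1.0), Lasso(alpha=0.1), ElasticNet(alpha=0.1, l1_ratio=0.5)):
    model.fit(X, y)
    print(type(model).__name__, model.coef_)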

Comparisons

  • L1 regularization helps perform feature selection in sparse feature spaces
     
  • L1 rarely performs better than L2
    - when two predictors are highly correlated, the L1 regularizer will simply pick one of the two (see the sketch after this list)
    - in contrast, the L2 regularizer will keep both and jointly shrink the corresponding coefficients a little bit
     
  • Elastic net has proved to be (in theory and in practice) better than L1/Lasso
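
A small sketch of the correlated-predictors point: with two nearly identical features, Lasso will typically zero out one coefficient while Ridge shrinks both (the data and hyperparameters are illustrative):

import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.RandomState(0)
x1 = rng.randn(200)
x2 = x1 + 0.01 * rng.randn(200)          # x2 is almost a copy of x1
X = np.column_stack([x1, x2])
y = x1 + x2 + 0.1 * rng.randn(200)

print("Lasso:", Lasso(alpha=0.1).fit(X, y).coef_)   # typically one coefficient near zero
print("Ridge:", Ridge(alpha=1.0).fit(X, y).coef_)   # both coefficients shrunk, roughly equal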

Git and GitHub

What?

- Git: software which keeps track of code changes

- GitHub: a popular server for storing repositories

 

Why?

- Keeps a full history of changes

- Allows multiple programmers to work on the same codebase

- Is efficient and lightweight (stores compact snapshots of your files, reusing unchanged content)

- Public repositories on GitHub can serve as a coding resume.

Install Git

Mac: brew install git

Ubuntu: apt-get install git

Windows: http://git-scm.com/downloads

Create a GitHub Account

Create a Pair of SSH Keys
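
For example (the email address is a placeholder; accept the default file location, then add the contents of the .pub file to your GitHub SSH settings):

ssh-keygen -t rsa -b 4096 -C "user@email.com"

cat ~/.ssh/id_rsa.pub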

Basic Git Configuration

git config --global user.name "User Name"

git config --global user.email "user@email.com"

Clone the Code

git clone https://github.com/benhhu/HDSMeetup.git
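
After cloning, a typical edit-commit-push cycle looks like the following (the file name and commit message are placeholders):

git status
git add example.py
git commit -m "Describe your change"
git push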

Python
