The LASSO:
least absolute shrinkage and selection operator
Pierre Ablin
12/03/2019
Parietal Tutorial
Overview
- Linear regression
- The Lasso model
- Non-smooth optimization: proximal operators
Supervised learning
GOAL: Learn f from some realizations of (x,y)
Linear regression
GOAL: Learn f from some realizations of (x,y)
Assumption:
f is linear
GOAL: Learn β from some realizations of (x,y)
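In formulas, the assumption is presumably: $f(x) = x^\top \beta = \sum_{p=1}^{P} x_p \beta_p$ for some $\beta \in \mathbb{R}^P$, so that the model is $y \approx x^\top \beta$.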
Maximum likelihood?
For one sample, if we observe $x_{obs}$ and $y_{obs}$, we have:
i.e.:
If we observe a whole dataset of N i.i.d. samples $(x_1, y_1), \cdots, (x_N, y_N)$,
the likelihood of this observation is:
maximum likelihood estimator:
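A sketch of these quantities, assuming i.i.d. Gaussian noise $y = x^\top \beta + \varepsilon$, $\varepsilon \sim \mathcal{N}(0, \sigma^2)$:
$$p(y_{obs} \mid x_{obs}, \beta) \propto \exp\Big(-\frac{(y_{obs} - x_{obs}^\top \beta)^2}{2\sigma^2}\Big)$$
$$\mathcal{L}(\beta) = \prod_{n=1}^{N} p(y_n \mid x_n, \beta), \qquad \beta_{MLE} = \arg\max_\beta \mathcal{L}(\beta) = \arg\min_\beta \sum_{n=1}^{N} (y_n - x_n^\top \beta)^2$$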
Matrix interlude
Define the ℓ2 norm of a vector:
$$\|v\| = \sqrt{\sum_{n=1}^{N} v_n^2}$$
And write in condensed matrix form:
$$y = [y_1, \cdots, y_N]^\top \in \mathbb{R}^N$$
Then,
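presumably, with the design matrix $X = [x_1, \cdots, x_N]^\top \in \mathbb{R}^{N \times P}$:
$$\beta_{MLE} = \arg\min_\beta \frac{1}{2}\|y - X\beta\|^2 = (X^\top X)^{-1} X^\top y \quad \text{(when } X^\top X \text{ is invertible)}$$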
M.L.E.
Pros:
- Consistent: as $N \to \infty$, $\beta_{MLE} \to \beta^*$
- Comes from a clear statistical framework
Cons:
- Can behave badly when the model is misspecified (e.g. f is not really linear)
- Lots of variance in $\beta_{MLE}$ when N is not big enough
- Ill-posed problem when $N < P$!
What do we do when N < P?
Bet on sparsity / Ockham's razor: only a few coefficients in β play a role
The ℓ0 Norm:
$$\|\beta\|_0 = \#\{i : \beta_i \neq 0\} \quad \text{(the number of non-zero coefficients)}$$
Example: $\|(0, 3, 0, -1, 0)\|_0 = 2$
Subset selection/matching pursuit
Idea: select only a few active coefficients with the ℓ0 norm.
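A reconstruction of the constrained problem on the slide:
$$\min_\beta \frac{1}{2}\|y - X\beta\|^2 \quad \text{s.t.} \quad \|\beta\|_0 \le t$$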
Where t is an integer controlling the sparsity of the solution.

Problems:
- Non-convexity
- Instability: adding a sample may completely change β and its support.
- NP-hard (loooong to solve)
The Lasso
Idea: relax the ℓ0 norm to obtain a convex problem
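A reconstruction of the relaxed problem, with the ℓ1 norm in place of ℓ0:
$$\min_\beta \frac{1}{2}\|y - X\beta\|^2 \quad \text{s.t.} \quad \|\beta\|_1 \le t$$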
Where t is a level controlling the sparsity of the solution.
See notebook
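The notebook is not reproduced here; a minimal scikit-learn sketch of the same kind of experiment on synthetic data (scikit-learn solves the Lagrangian form of the next slide, with alpha playing the role of λ up to a 1/N factor):

import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.RandomState(0)
N, P = 50, 100                       # fewer samples than features
X = rng.randn(N, P)
true_beta = np.zeros(P)
true_beta[:5] = rng.randn(5)         # only 5 active coefficients
y = X @ true_beta + 0.1 * rng.randn(N)

lasso = Lasso(alpha=0.1).fit(X, y)
print((lasso.coef_ != 0).sum())      # only a handful of coefficients survive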

The Lasso: Lagrange formulation
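A reconstruction of the Lagrangian form:
$$\beta_{lasso} = \arg\min_\beta \frac{1}{2}\|y - X\beta\|^2 + \lambda \|\beta\|_1$$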
Where λ is a level controlling the amount of sparsity.
- Relationship between λ and t depends on the data
- It promotes sparsity: there is a threshold $\lambda_{max}$ such that $\lambda > \lambda_{max}$ implies $\beta_{lasso} = 0$
Derivation
β=0 is a solution to the lasso if:
Where ∂ is the subgradient.
So the condition reads:
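A reconstruction of the missing steps: the optimality condition at $\beta = 0$ is
$$0 \in -X^\top y + \lambda\, \partial\|\cdot\|_1(0) \iff \|X^\top y\|_\infty \le \lambda,$$
which gives $\lambda_{max} = \|X^\top y\|_\infty$.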
Lasso is useful because...
- It performs model selection and estimation at the same time: it tells you which coefficients in x are important.

Boston dataset:
Lasso is useful because...
- Leveraging sparsity enables fast solvers
Intuition:
If $P = 1000$ and $N = 1000$, computing $X\beta$ takes $\sim N \times P = 10^6$ operations
But if we know that there are only 10 non-zero coefficients in β, it takes only $\sim 10^4$ operations
The same reasoning applies to most quantities useful to estimate the model
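A small numpy illustration of this count, with the sizes from the slide (the support-restricted product is just for illustration, not a solver detail from the talk):

import numpy as np

N, P = 1000, 1000
X = np.random.randn(N, P)
beta = np.zeros(P)
support = np.arange(10)                      # indices of the 10 non-zero coefficients
beta[support] = np.random.randn(10)

full = X @ beta                              # ~ N * P = 1e6 multiplications
restricted = X[:, support] @ beta[support]   # ~ N * 10 = 1e4 multiplications
assert np.allclose(full, restricted)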
Lasso estimation
how do we fit the model?
Gradient descent 101
Minimize f(x) where f is differentiable.
Iterate $x_{t+1} = x_t - \eta \nabla f(x_t)$
Equivalent to minimizing a quadratic surrogate:
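A reconstruction of the surrogate:
$$x_{t+1} = \arg\min_x \; f(x_t) + \nabla f(x_t)^\top (x - x_t) + \frac{1}{2\eta}\|x - x_t\|^2,$$
whose minimizer is exactly $x_t - \eta \nabla f(x_t)$.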

The simplest lasso:
$P = 1, N = 1, X = 1$: can we minimize $\frac{1}{2}(y - \beta)^2 + \lambda |\beta|$?
Yes: proximity operator
Soft-thresholding:
- 0 if ∣y∣≤λ
- y−λ if y>λ
- y+λ if y<−λ
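A one-line numpy version of this operator, reused in the sketches below (the name soft_thresholding is mine):

import numpy as np

def soft_thresholding(x, threshold):
    # 0 inside [-threshold, threshold], shrunk towards 0 by threshold outside
    return np.sign(x) * np.maximum(np.abs(x) - threshold, 0.)

soft_thresholding(np.array([-3., -0.5, 0.5, 3.]), 1.)   # array([-2., 0., 0., 2.])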

Separability:
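Spelled out: since $\|\beta\|_1 = \sum_{i=1}^{P} |\beta_i|$, the prox of $\lambda\|\cdot\|_1$ acts coordinate-wise,
$$\big(\mathrm{prox}_{\lambda\|\cdot\|_1}(v)\big)_i = \mathrm{ST}(v_i, \lambda),$$
where ST is the soft-thresholding above.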
ISTA: Iterative soft thresholding algorithm
Gradient of the smooth term:
ISTA:
Comes from:
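A reconstruction of these three equations, with $f(\beta) = \frac{1}{2}\|y - X\beta\|^2$: the gradient of the smooth term is $\nabla f(\beta) = X^\top(X\beta - y)$, the ISTA update is
$$\beta^{t+1} = \mathrm{ST}\big(\beta^t - \eta\, X^\top(X\beta^t - y),\ \eta\lambda\big), \qquad \eta \le 1 / \|X\|_2^2,$$
and it comes from minimizing the quadratic surrogate of the smooth term plus the untouched ℓ1 term:
$$\beta^{t+1} = \arg\min_\beta \; \nabla f(\beta^t)^\top(\beta - \beta^t) + \frac{1}{2\eta}\|\beta - \beta^t\|^2 + \lambda\|\beta\|_1.$$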
ISTA: Iterative soft thresholding algorithm
- Some coefficients are 0 because of the prox
- One iteration takes O(min(N,P)×P)
Can be problematic for large P!
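A minimal numpy sketch of ISTA as described above (names and defaults are mine, not the notebook's; the step size is 1/L with L = ||X||₂²):

import numpy as np

def soft_thresholding(x, threshold):
    # prox of threshold * |.|, applied coordinate-wise
    return np.sign(x) * np.maximum(np.abs(x) - threshold, 0.)

def ista(X, y, lam, n_iter=100):
    # Minimize 0.5 * ||y - X beta||^2 + lam * ||beta||_1
    P = X.shape[1]
    beta = np.zeros(P)
    step = 1. / np.linalg.norm(X, ord=2) ** 2   # 1 / L, L = Lipschitz constant of the gradient
    for _ in range(n_iter):
        grad = X.T @ (X @ beta - y)             # gradient of the smooth term
        beta = soft_thresholding(beta - step * grad, step * lam)
    return beta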
Coordinate descent
Idea: update only one coefficient of β at each iteration.
- If we know the residual $r = X\beta^t - y$, each update is $O(N)$ :)
Residual update:
After updating the coordinate j:
$\beta_i^{t+1} = \beta_i^t$ for $i \neq j$, so $r^{t+1} = r^t + (\beta_j^{t+1} - \beta_j^t)\, X_{:, j}$
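A matching numpy sketch of coordinate descent for the same objective (not the original code; each update is the exact coordinate-wise minimizer, followed by the O(N) residual update):

import numpy as np

def soft_thresholding(x, threshold):
    return np.sign(x) * np.maximum(np.abs(x) - threshold, 0.)

def coordinate_descent(X, y, lam, n_iter=100):
    # Minimize 0.5 * ||y - X beta||^2 + lam * ||beta||_1, one coordinate at a time
    P = X.shape[1]
    beta = np.zeros(P)
    residual = X @ beta - y                  # r = X beta - y, kept up to date
    col_norms = (X ** 2).sum(axis=0)         # ||x_j||^2 for each column
    for _ in range(n_iter):
        for j in range(P):
            old = beta[j]
            # exact minimization in beta_j: soft-threshold the unregularized update
            beta[j] = soft_thresholding(old - X[:, j] @ residual / col_norms[j],
                                        lam / col_norms[j])
            residual += (beta[j] - old) * X[:, j]   # O(N) update of r
    return beta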
Thanks!