Pierre Ablin
12/03/2019
Parietal Tutorial
GOAL: Learn \( f \) from some realizations of \( (\mathbf{x}, y) \)
\(f\) is linear
GOAL: Learn \( \beta \) from some realizations of \( (\mathbf{x}, y) \)
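In other words, writing the coefficients as \( \beta \in \mathbb{R}^P \) (the standard parametrization of a linear model):
$$ f(\mathbf{x}) = \beta^{\top} \mathbf{x} = \sum_{p=1}^P \beta_p x_p $$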
For one sample, if we observe \( \mathbf{x}_{\text{obs}}\) and \( y_{\text{obs}} \), we have:
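Assuming Gaussian noise \( \varepsilon \sim \mathcal{N}(0, \sigma^2) \) (the noise level \( \sigma \) is an assumption of this sketch):
$$ y_{\text{obs}} = \beta^{\top} \mathbf{x}_{\text{obs}} + \varepsilon $$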
i.e.:
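the conditional density of \( y_{\text{obs}} \) given \( \mathbf{x}_{\text{obs}} \) is Gaussian (same noise assumption):
$$ p(y_{\text{obs}} | \mathbf{x}_{\text{obs}}) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left( -\frac{(y_{\text{obs}} - \beta^{\top} \mathbf{x}_{\text{obs}})^2}{2\sigma^2} \right) $$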
If we observe a whole dataset of \(N\) i.i.d. samples
$$ (\mathbf{x}_1,y_1), \cdots, (\mathbf{x}_N, y_N) \enspace, $$
the likelihood of this observation is:
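(by independence of the samples, still under the Gaussian noise assumption)
$$ p(y_1, \cdots, y_N | \mathbf{x}_1, \cdots, \mathbf{x}_N) = \prod_{n=1}^N \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left( -\frac{(y_n - \beta^{\top} \mathbf{x}_n)^2}{2\sigma^2} \right) $$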
Define the \(\ell_2\) norm of a vector:
$$|| \mathbf{v}|| = \sqrt{\sum_{n=1}^N v_n^2}$$
And denote, in condensed matrix form:
$$ \mathbf{y} = [y_1,\cdots, y_N]^{\top} \in \mathbb{R}^N$$
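together with the design matrix stacking the samples row-wise (notation assumed here):
$$ X = [\mathbf{x}_1, \cdots, \mathbf{x}_N]^{\top} \in \mathbb{R}^{N \times P} $$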
Then,
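maximizing the likelihood over \( \beta \) amounts to least squares (treating \( \sigma \) as fixed, an assumption of this sketch):
$$ \hat{\beta} = \arg\max_{\beta} p(y_1, \cdots, y_N | \mathbf{x}_1, \cdots, \mathbf{x}_N) = \arg\min_{\beta} \frac12 ||\mathbf{y} - X\beta||^2 $$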
Bet on sparsity / Ockham's razor: only a few coefficients in \( \beta \) play a role
Example:
Idea: select only a few active coefficients with the \( \ell_0 \) norm.
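With \( ||\beta||_0 \) the number of non-zero coefficients, one standard way to write this (keeping the least-squares fit above) is:
$$ \min_{\beta} \frac12 ||\mathbf{y} - X\beta||^2 \quad \text{s.t.} \quad ||\beta||_0 \le t $$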
Where \( t\) is an integer controlling the sparsity of the solution.
Idea: relax the \(\ell_0\) norm to obtain a convex problem
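Replacing \( ||\beta||_0 \) with \( ||\beta||_1 = \sum_{p=1}^P |\beta_p| \) gives (same least-squares fit assumed):
$$ \min_{\beta} \frac12 ||\mathbf{y} - X\beta||^2 \quad \text{s.t.} \quad ||\beta||_1 \le t $$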
Where \( t\) is a level controlling the sparsity of the solution.
See notebook
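The lasso is usually solved in its penalized form, equivalent to the constrained problem for a suitable correspondence between \( t \) and \( \lambda \):
$$ \min_{\beta} \frac12 ||\mathbf{y} - X\beta||^2 + \lambda ||\beta||_1 $$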
Where \( \lambda \) is a level controlling the amount of sparsity.
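A minimal sketch of solving this with scikit-learn (toy data; note that sklearn's `Lasso` divides the data-fit term by \( N \), so its `alpha` corresponds to \( \lambda / N \) above):

```python
import numpy as np
from sklearn.linear_model import Lasso

# Toy data (hypothetical): N = 50 samples, P = 20 features, 3 truly active coefficients
rng = np.random.RandomState(0)
X = rng.randn(50, 20)
beta_true = np.zeros(20)
beta_true[:3] = [1.0, -2.0, 3.0]
y = X @ beta_true + 0.1 * rng.randn(50)

# scikit-learn minimizes (1 / (2 N)) ||y - X beta||^2 + alpha ||beta||_1
lasso = Lasso(alpha=0.1, fit_intercept=False)
lasso.fit(X, y)
print("non-zero coefficients:", np.sum(lasso.coef_ != 0))
```

Increasing `alpha` drives more coefficients exactly to zero.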
\(\beta = 0\) is a solution to the lasso if:
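(by Fermat's rule for the non-smooth objective)
$$ 0 \in \partial \left( \frac12 ||\mathbf{y} - X\beta||^2 + \lambda ||\beta||_1 \right) \Big|_{\beta = 0} = -X^{\top}\mathbf{y} + \lambda \, \partial ||\cdot||_1(0) $$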
Where \(\partial \) is the subgradient.
So the condition reads:
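using \( \partial ||\cdot||_1(0) = [-1, 1]^P \):
$$ ||X^{\top} \mathbf{y}||_{\infty} \le \lambda $$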
Boston dataset:
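A sketch of checking this threshold numerically; synthetic data of the same shape stands in for the Boston housing data, which is not shipped with recent scikit-learn versions:

```python
import numpy as np
from sklearn.linear_model import Lasso

# Synthetic stand-in data (hypothetical), roughly Boston-sized: 506 samples, 13 features
rng = np.random.RandomState(0)
X = rng.randn(506, 13)
y = X @ rng.randn(13) + rng.randn(506)

# With sklearn's 1/N scaling, beta = 0 is optimal as soon as alpha >= ||X^T y||_inf / N
N = X.shape[0]
alpha_max = np.max(np.abs(X.T @ y)) / N

for alpha in [0.5 * alpha_max, alpha_max, 2.0 * alpha_max]:
    coef = Lasso(alpha=alpha, fit_intercept=False).fit(X, y).coef_
    print(f"alpha / alpha_max = {alpha / alpha_max:.1f}: "
          f"{np.sum(coef != 0)} non-zero coefficients")
```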
If \(P = 1000 \) and \(N = 1000 \), computing \(X\beta\) takes \(\sim N \times P = 10^6 \) operations
But if we know that there are only \(10 \) non-zero coefficients in \(\beta\), it takes only \(\sim 10^4\) operations
The same reasoning applies to most quantities useful to estimate the model
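A toy numpy illustration of this cost argument (names are hypothetical):

```python
import numpy as np

N, P = 1000, 1000
X = np.random.randn(N, P)
beta = np.zeros(P)
beta[:10] = np.random.randn(10)          # only 10 non-zero coefficients

support = np.flatnonzero(beta)           # indices of the non-zero coefficients
full = X @ beta                          # ~ N * P = 10^6 multiplications
sparse = X[:, support] @ beta[support]   # ~ N * 10 = 10^4 multiplications
print(np.allclose(full, sparse))         # same result, far fewer operations
```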
Minimize \( f(\mathbf{x})\) where \(f\) is differentiable.
Iterate \(\mathbf{x}^{t+1} = \mathbf{x}^t - \eta \nabla f(\mathbf{x}^t) \)
Equivalent to minimizing a quadratic surrogate:
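With step size \( \eta \), the surrogate is (a standard rewriting of the gradient step):
$$ \mathbf{x}^{t+1} = \arg\min_{\mathbf{x}} \; f(\mathbf{x}^t) + \nabla f(\mathbf{x}^t)^{\top} (\mathbf{x} - \mathbf{x}^t) + \frac{1}{2\eta} ||\mathbf{x} - \mathbf{x}^t||^2 $$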
\(P=1, N = 1, X = 1:\) can we minimize \(\frac12 (y - \beta)^2 + \lambda |\beta|\)?
Soft-thresholding:
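The minimizer of \( \frac12 (y - \beta)^2 + \lambda |\beta| \) has a closed form:
$$ \mathrm{ST}(y, \lambda) = \mathrm{sign}(y) \max(|y| - \lambda, 0) = \arg\min_{\beta} \frac12 (y - \beta)^2 + \lambda |\beta| $$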
Gradient of the smooth term:
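Here the smooth term is \( \frac12 ||\mathbf{y} - X\beta||^2 \), so:
$$ \nabla_{\beta} \left( \frac12 ||\mathbf{y} - X\beta||^2 \right) = -X^{\top} (\mathbf{y} - X\beta) $$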
Comes from:
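minimizing the quadratic surrogate of the smooth term while keeping the \( \ell_1 \) penalty, which gives the ISTA / proximal-gradient update, with soft-thresholding applied coordinate-wise (a standard derivation):
$$ \beta^{t+1} = \arg\min_{\beta} \; \frac{1}{2\eta} \left|\left| \beta - \left( \beta^t + \eta X^{\top}(\mathbf{y} - X\beta^t) \right) \right|\right|^2 + \lambda ||\beta||_1 = \mathrm{ST}\left( \beta^t + \eta X^{\top}(\mathbf{y} - X\beta^t), \; \eta\lambda \right) $$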
Can be problematic for large \(P\)!
Idea : update only one coefficient of \( \beta \) at each iteration.
After updating coordinate \(j\):
\( \beta^{t+1}_i = \beta^{t}_i \) for \( i \ne j \)
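The update of coordinate \( j \) is a one-dimensional lasso problem, solved by soft-thresholding (standard coordinate descent for the lasso, with \( X_{:,j} \) denoting the \( j \)-th column of \( X \)):
$$ \beta^{t+1}_j = \frac{1}{||X_{:,j}||^2} \, \mathrm{ST}\left( X_{:,j}^{\top} \Big( \mathbf{y} - \sum_{i \ne j} X_{:,i} \, \beta^t_i \Big), \; \lambda \right) $$

A minimal numpy sketch of cyclic coordinate descent for the lasso (function names are hypothetical):

```python
import numpy as np

def soft_threshold(x, level):
    # ST(x, level) = sign(x) * max(|x| - level, 0)
    return np.sign(x) * np.maximum(np.abs(x) - level, 0.0)

def lasso_cd(X, y, lam, n_iter=100):
    """Minimize 0.5 * ||y - X beta||^2 + lam * ||beta||_1 by cyclic coordinate descent."""
    N, P = X.shape
    beta = np.zeros(P)
    residual = y.copy()                 # residual = y - X beta, with beta = 0
    col_norms = (X ** 2).sum(axis=0)    # ||X[:, j]||^2 for each column
    for _ in range(n_iter):
        for j in range(P):
            if col_norms[j] == 0.0:
                continue
            old = beta[j]
            # Correlation of column j with the partial residual y - sum_{i != j} X[:, i] beta_i
            rho = X[:, j] @ residual + col_norms[j] * old
            beta[j] = soft_threshold(rho, lam) / col_norms[j]
            if beta[j] != old:          # keep the residual up to date
                residual -= X[:, j] * (beta[j] - old)
    return beta
```

Each coordinate update only touches one column of \( X \), which keeps the per-update cost at \( \sim N \) operations.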