Latent process z generates observed outputs x:
z→x
The forward operation "→" is generally known:
x=f(z)+ε
Goal of inverse problems: find a mapping
x→z
Example (image deblurring):
z : True image
x : Blurred image
f : Convolution with the point spread function K
x = K ⋆ z
Example (MEG source imaging):
z : current density in the brain
x : observed MEG signals
f : linear operator given by the physics (Maxwell's equations)
x = Dz
Linear forward model: z ∈ R^m, x ∈ R^n, D ∈ R^(n×m)
x = Dz + ε
Problem: in some applications m ≫ n, so the least-squares problem is ill-posed.
→ Bet on sparsity: only a few coefficients of z* are ≠ 0
z is sparse
Simple solution: least squares
z* ∈ argmin_z ½∥x − Dz∥²
The Lasso: add an ℓ1 penalty with regularization parameter λ > 0:
z* ∈ argmin_z F_x(z), where F_x(z) = ½∥x − Dz∥² + λ∥z∥₁
Enforces sparsity of the solution.
Easier to see on the equivalent constrained problem: z* ∈ argmin_z ½∥x − Dz∥² s.t. ∥z∥₁ ≤ C
Tibshirani, Regression shrinkage and selection via the lasso, 1996
[Figure: comparison of the constrained problem ½∥x − Dz∥² s.t. ∥z∥₁ ≤ C with plain least squares ½∥x − Dz∥²]
ISTA: simple algorithm to fit the Lasso.
F_x(z) = ½∥x − Dz∥² + λ∥z∥₁
Idea: use proximal gradient descent
→ ½∥x − Dz∥² is a smooth function:
∇_z(½∥x − Dz∥²) = D⊤(Dz − x)
→ λ∥z∥₁ has a simple proximal operator
Starting from the iterate z^(t), we majorize the quadratic term by an isotropic parabola:
½∥x − Dz∥² = ½∥x − Dz^(t)∥² + ⟨D⊤(Dz^(t) − x), z − z^(t)⟩ + ½∥D(z − z^(t))∥²
½∥D(z − z^(t))∥² ≤ (L/2)∥z − z^(t)∥²
L = max ∥Dz∥² s.t. ∥z∥ = 1
L is the Lipschitz constant of ∇_z ½∥x − Dz∥², i.e. the largest eigenvalue of D⊤D.
Daubechies et al., An iterative thresholding algorithm for linear inverse problems with a sparsity constraint, 2004
½∥x − Dz∥² ≤ ½∥x − Dz^(t)∥² + ⟨D⊤(Dz^(t) − x), z − z^(t)⟩ + (L/2)∥z − z^(t)∥²
Therefore:
F_x(z) ≤ ½∥x − Dz^(t)∥² + ⟨D⊤(Dz^(t) − x), z − z^(t)⟩ + (L/2)∥z − z^(t)∥² + λ∥z∥₁
Minimization of the R.H.S. gives the ISTA iteration:
z^(t+1) = st(z^(t) − (1/L) D⊤(Dz^(t) − x), λ/L)
ISTA:
z^(t+1) = st(z^(t) − (1/L) D⊤(Dz^(t) − x), λ/L)
st is the soft thresholding operator:
st(x, u) = argmin_z ½∥x − z∥² + u∥z∥₁
It is an element-wise non-linearity:
st(x, u) = (st(x_1, u), ⋯, st(x_n, u))
In 1D: st(x, u) = sign(x) max(|x| − u, 0)
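To make the iteration concrete, here is a minimal NumPy sketch of ISTA with the notation above (the function names, the value of λ, and the synthetic data are illustrative choices, not taken from the slides):

```python
import numpy as np

def soft_thresholding(x, u):
    """Element-wise soft thresholding: st(x, u) = sign(x) * max(|x| - u, 0)."""
    return np.sign(x) * np.maximum(np.abs(x) - u, 0.0)

def ista(x, D, lam, n_iter=100):
    """Minimize F_x(z) = 1/2 ||x - Dz||^2 + lam * ||z||_1 with ISTA, starting from z = 0."""
    n, m = D.shape
    # Lipschitz constant of the gradient of the smooth part: largest eigenvalue of D^T D
    L = np.linalg.norm(D, ord=2) ** 2
    z = np.zeros(m)
    for _ in range(n_iter):
        grad = D.T @ (D @ z - x)                       # gradient of 1/2 ||x - Dz||^2
        z = soft_thresholding(z - grad / L, lam / L)   # proximal step on lam * ||z||_1
    return z

# Toy example: a sparse z generates x = Dz + noise, ISTA recovers a sparse estimate.
rng = np.random.default_rng(0)
n, m = 20, 50
D = rng.standard_normal((n, m)) / np.sqrt(n)
z_true = np.zeros(m)
z_true[:3] = [1.0, -2.0, 1.5]
x = D @ z_true + 0.01 * rng.standard_normal(n)
z_hat = ista(x, D, lam=0.1)
```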
Assume that we want to solve the Lasso for many observations x_1, …, x_N with a fixed dictionary D
e.g. dictionary learning:
Learn D and sparse representations z_1, …, z_N such that:
x_i ≃ D z_i
min_{D, z_1, …, z_N} Σ_{i=1}^N ( ½∥x_i − D z_i∥² + λ∥z_i∥₁ )  s.t. ∥D_:j∥ = 1
Z-step: with fixed D,
z_1 = argmin F_{x_1}, …, z_N = argmin F_{x_N}
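A sketch of this Z-step, reusing the ista function from the sketch above (stacking the observations as rows of X is an assumption made for illustration):

```python
import numpy as np

def z_step(X, D, lam, n_iter=100):
    """Z-step of dictionary learning: with D fixed, solve one Lasso per observation."""
    # X: (N, n) observations, one per row -> Z: (N, m) sparse codes
    return np.stack([ista(x_i, D, lam, n_iter) for x_i in X])
```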
We want to solve the Lasso many times with the same D; can we accelerate ISTA?
→ (x_1, …, x_N) is the training set, drawn from a distribution p, and we want to accelerate ISTA on unseen data x ∼ p
ISTA:
z^(t+1) = st(z^(t) − (1/L) D⊤(Dz^(t) − x), λ/L)
Let W_1 = I_m − (1/L) D⊤D and W_2 = (1/L) D⊤:
z^(t+1) = st(W_1 z^(t) + W_2 x, λ/L)
Gregor, LeCun, Learning Fast Approximations of Sparse Coding, 2010
A T-layer LISTA network is a function Φ_Θ parametrized by the T sets of parameters Θ = (W_1^t, W_2^t, β^t)_{t=0}^{T−1}; layer t computes z^(t+1) = st(W_1^t z^(t) + W_2^t x, β^t).
The parameters of the network are learned to get better results than ISTA.
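Below is a minimal PyTorch sketch of such a network (class name, tensor shapes, and the initialization at the ISTA values W_1 = I_m − D⊤D/L, W_2 = D⊤/L, β = λ/L are illustrative assumptions; with this initialization the untrained network reproduces T ISTA iterations):

```python
import torch
import torch.nn as nn

class LISTA(nn.Module):
    """T-layer LISTA network: z^(t+1) = st(W_1^t z^(t) + W_2^t x, beta^t)."""
    def __init__(self, D, lam, T):
        super().__init__()
        n, m = D.shape
        L = float(torch.linalg.matrix_norm(D, ord=2) ** 2)  # Lipschitz constant
        self.T = T
        # Initialize every layer at the ISTA values, so the untrained network = T ISTA steps.
        self.W1 = nn.ParameterList(
            [nn.Parameter(torch.eye(m) - D.T @ D / L) for _ in range(T)])
        self.W2 = nn.ParameterList(
            [nn.Parameter(D.T / L) for _ in range(T)])
        self.beta = nn.ParameterList(
            [nn.Parameter(torch.full((1,), lam / L)) for _ in range(T)])

    def forward(self, x):
        # x: (batch, n) observations -> z: (batch, m) sparse codes
        z = torch.zeros(x.shape[0], self.W1[0].shape[0], device=x.device)
        for t in range(self.T):
            z = z @ self.W1[t].T + x @ self.W2[t].T
            z = torch.sign(z) * torch.relu(z.abs() - self.beta[t])  # soft thresholding
        return z
```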
Supervised learning
Ground truth s_1, …, s_N available (e.g. such that x_i = D s_i)
L(Θ) = Σ_{i=1}^N ∥Φ_Θ(x_i) − s_i∥²
Semi-supervised
Compute s_1, …, s_N
as s_i = argmin F_{x_i}
L(Θ) = Σ_{i=1}^N ∥Φ_Θ(x_i) − s_i∥²
Unsupervised
Learn to solve the Lasso
L(Θ) = Σ_{i=1}^N F_{x_i}(Φ_Θ(x_i))
Unsupervised
Learn to solve the Lasso:
Θ ∈ argmin_Θ L(Θ) = Σ_{i=1}^N F_{x_i}(Φ_Θ(x_i))
If we see a new sample x, we expect:
F_x(Φ_Θ(x)) ≤ F_x(ISTA(x)),
where ISTA is applied for T iterations.
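A sketch of unsupervised training under this loss, using the hypothetical LISTA class above (the optimizer, learning rate, and number of epochs are placeholder choices):

```python
import torch

def lasso_loss(z, x, D, lam):
    """F_x(z) = 1/2 ||x - Dz||^2 + lam ||z||_1, summed over the batch."""
    residual = x - z @ D.T
    return 0.5 * (residual ** 2).sum() + lam * z.abs().sum()

def train_unsupervised(model, X_train, D, lam, n_epochs=200, lr=1e-3):
    """Learn Theta by minimizing sum_i F_{x_i}(Phi_Theta(x_i)) (no ground truth needed)."""
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(n_epochs):
        optimizer.zero_grad()
        z = model(X_train)                    # (N, m): codes predicted by the unrolled network
        loss = lasso_loss(z, X_train, D, lam)
        loss.backward()
        optimizer.step()
    return model
```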
Advantages:
Drawbacks:
Consider a "deep" LISTA network: T≫1, assume that we have solved the expected optimization problem:
Θ ∈ argmin_Θ E_{x∼p}[F_x(Φ_Θ(x))]
As T → +∞, assume (W_1^t, W_2^t, β^t) → (W_1*, W_2*, β*). Call α = β*/λ. Then:
W_1* = I_m − α D⊤D
W_2* = α D⊤
β* = αλ
Ablin, Moreau et al., Learning Step Sizes for Unfolded Sparse Coding, 2019
This corresponds to ISTA with step size α instead of 1/L (recall ISTA's parameters):
W_1 = I_m − (1/L) D⊤D
W_2 = (1/L) D⊤
β = λ/L
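Indeed, plugging the limiting parameters into one layer gives
st(W_1* z^(t) + W_2* x, β*) = st(z^(t) − α D⊤(Dz^(t) − x), αλ),
which is exactly an ISTA step with step size α in place of 1/L.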
"The deep layers of LISTA only learn a better step size"
Sad result.
Learn step sizes only
Step sizes increase as the sparsity of z increases; the method learns a kind of sparse PCA:
L_S = max ∥Dz∥² s.t. ∥z∥ = 1 and Supp(z) ⊂ S
1/L_S increases as Supp(z) shrinks
(ALISTA works in the supervised framework, fails here)
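A sketch of a step-size-only unfolded network in this spirit (an illustrative reimplementation of the idea, not the authors' code; each layer learns a single scalar α^t, initialized at the ISTA value 1/L):

```python
import torch
import torch.nn as nn

class StepLISTA(nn.Module):
    """Unfolded ISTA where each layer only learns a step size alpha^t:
    z^(t+1) = st(z^(t) - alpha^t D^T (D z^(t) - x), alpha^t * lam)."""
    def __init__(self, D, lam, T):
        super().__init__()
        self.register_buffer("D", D)          # the dictionary is fixed, not learned
        self.lam, self.T = lam, T
        L = float(torch.linalg.matrix_norm(D, ord=2) ** 2)
        self.alpha = nn.Parameter(torch.full((T,), 1.0 / L))  # one step size per layer

    def forward(self, x):
        z = torch.zeros(x.shape[0], self.D.shape[1], device=x.device)
        for t in range(self.T):
            grad = (z @ self.D.T - x) @ self.D            # D^T (D z - x), batched
            z = z - self.alpha[t] * grad
            z = torch.sign(z) * torch.relu(z.abs() - self.alpha[t] * self.lam)
        return z
```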
- Far fewer parameters: easier to generalize
- Simpler to train?
- This idea can be transposed to more complicated algorithms than ISTA