Deep learning for inverse problems: solving the lasso with neural networks
Pierre Ablin, Inria
05/11/2019 - Le Palaisien
Joint work with T. Moreau, M. Massias and A. Gramfort
Inverse problems
Latent process z generates observed outputs x:
z→x
The forward operation "→" is generally known:
x=f(z)+ε
Goal of inverse problems: find a mapping
x→z
Example: Image deblurring
z : True image
x : Blurred image
f : Convolution with point spread function



Figure: x = z ⋆ K (the blurred image x is the true image z convolved with the kernel K)
Example: MEG acquisition
z : current density in the brain
x : observed MEG signals
f : linear operator given by physics (Maxwell's equations)
Figure: x = D z (the observed MEG signals are a linear operator applied to the current density)
Linear regression
Linear forward model: z ∈ ℝ^m, x ∈ ℝ^n, D ∈ ℝ^(n×m)
x=Dz+ε
Problem: in some applications m ≫ n, so the least-squares problem is ill-posed
→ bet on sparsity: only a few coefficients of z^* are ≠ 0
z is sparse
Simple solution: least squares
z^* ∈ argmin_z (1/2)∥x − Dz∥²
The Lasso
λ > 0 is a regularization parameter:
z^* ∈ argmin_z (1/2)∥x − Dz∥² + λ∥z∥_1 = F_x(z)
Enforces sparsity of the solution.
This is easier to see on the equivalent constrained problem: z^* ∈ argmin_z (1/2)∥x − Dz∥² s.t. ∥z∥_1 ≤ C
Tibshirani, Regression shrinkage and selection via the lasso, 1996
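For concreteness, here is a minimal sketch of solving this problem with scikit-learn (an illustrative choice: sklearn's Lasso minimizes (1/(2n))∥x − Dz∥² + α∥z∥_1, so α = λ/n recovers the same minimizer as F_x; the dimensions and seed below are arbitrary):

```python
# Minimal sketch (illustrative): solve F_x(z) = (1/2)||x - Dz||^2 + lam*||z||_1
# with scikit-learn. sklearn's Lasso minimizes (1/(2*n))||x - Dz||^2 + alpha*||z||_1,
# so alpha = lam / n gives the same minimizer. Sizes and seed are arbitrary.
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n, m = 10, 50                                   # n observations, m atoms (m >> n)
D = rng.standard_normal((n, m))
z_true = np.zeros(m)
z_true[rng.choice(m, 3, replace=False)] = 1.0   # sparse ground truth
x = D @ z_true + 0.01 * rng.standard_normal(n)

lam = 0.1
lasso = Lasso(alpha=lam / n, fit_intercept=False, max_iter=10_000)
lasso.fit(D, x)
z_star = lasso.coef_
print("non-zero coefficients:", np.count_nonzero(z_star))
```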
Lasso induces sparsity

Figure: z^* ∈ argmin_z (1/2)∥x − Dz∥² s.t. ∥z∥_1 ≤ C (constrained) vs. z^* ∈ argmin_z (1/2)∥x − Dz∥² (unconstrained)
Iterative shrinkage-thresholding algorithm
ISTA: simple algorithm to fit the Lasso.
F_x(z) = (1/2)∥x − Dz∥² + λ∥z∥_1
Idea: use proximal gradient descent
→ (1/2)∥x − Dz∥² is a smooth function:
∇_z((1/2)∥x − Dz∥²) = D^⊤(Dz − x)
→ λ∥z∥1 has a simple proximal operator
ISTA: derivation
ISTA: simple algorithm to fit the Lasso.
F_x(z) = (1/2)∥x − Dz∥² + λ∥z∥_1
Starting from the iterate z^(t), majorize the quadratic term by an isotropic parabola:
(1/2)∥x − Dz∥² = (1/2)∥x − Dz^(t)∥² + ⟨Dz^(t) − x, D(z − z^(t))⟩ + (1/2)∥D(z − z^(t))∥²
(1/2)∥D(z − z^(t))∥² ≤ (L/2)∥z − z^(t)∥²
L = max ∥Dz∥² s.t. ∥z∥ = 1
L is the Lipschitz constant of the gradient of the smooth term
Iterative shrinkage-thresholding algorithm
ISTA: simple algorithm to fit the Lasso.
F_x(z) = (1/2)∥x − Dz∥² + λ∥z∥_1
Daubechies et al., An iterative thresholding algorithm for linear inverse problems with a sparsity constraint, 2004
(1/2)∥x − Dz∥² ≤ (1/2)∥x − Dz^(t)∥² + ⟨Dz^(t) − x, D(z − z^(t))⟩ + (L/2)∥z − z^(t)∥²
Therefore:
F_x(z) ≤ (1/2)∥x − Dz^(t)∥² + ⟨Dz^(t) − x, D(z − z^(t))⟩ + (L/2)∥z − z^(t)∥² + λ∥z∥_1
Minimization of the R.H.S. gives the ISTA iteration:
z^(t+1) = st(z^(t) − (1/L) D^⊤(Dz^(t) − x), λ/L)
Soft thresholding
ISTA:
z^(t+1) = st(z^(t) − (1/L) D^⊤(Dz^(t) − x), λ/L)
st is the soft thresholding operator:
st(x, u) = argmin_z (1/2)∥x − z∥² + u∥z∥_1
It is an element-wise non-linearity:
st(x, u) = (st(x_1, u), ⋯, st(x_n, u))
In 1D: st(x, u) =
- 0 if |x| ≤ u
- x − u if x ≥ u
- x + u if x ≤ −u
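A minimal NumPy sketch of the soft-thresholding operator and the resulting ISTA loop (function names and the fixed iteration count are illustrative choices):

```python
# Minimal NumPy sketch of soft thresholding and ISTA (names and the fixed
# number of iterations are illustrative).
import numpy as np

def soft_thresholding(x, u):
    """Element-wise soft thresholding st(x, u)."""
    return np.sign(x) * np.maximum(np.abs(x) - u, 0.0)

def ista(D, x, lam, n_iter=100):
    """Run n_iter ISTA iterations on F_x, starting from z = 0."""
    L = np.linalg.norm(D, ord=2) ** 2      # Lipschitz constant of the gradient
    z = np.zeros(D.shape[1])
    for _ in range(n_iter):
        grad = D.T @ (D @ z - x)           # gradient of the smooth term
        z = soft_thresholding(z - grad / L, lam / L)
    return z
```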
ISTA as a Recurrent neural network
Solving the lasso many times
Assume that we want to solve the Lasso for many observations x1,⋯,xN with a fixed dictionary D
e.g. Dictionary learning :
Learn D and sparse representations z1,⋯,zN such that:
xi≃Dzi
min_{D, z_1, ⋯, z_N} Σ_{i=1}^N [ (1/2)∥x_i − Dz_i∥² + λ∥z_i∥_1 ]  s.t. ∥D_:j∥ = 1
Dictionary learning
min_{D, z_1, ⋯, z_N} Σ_{i=1}^N [ (1/2)∥x_i − Dz_i∥² + λ∥z_i∥_1 ]  s.t. ∥D_:j∥ = 1
Z-step: with fixed D,
z_1 = argmin F_{x_1}
⋯
z_N = argmin F_{x_N}
We want to solve the Lasso many times with the same D; can we accelerate ISTA?
Training / Testing
→ (x_1, ⋯, x_N) is the training set, drawn from a distribution p; we want to accelerate ISTA on unseen data x ∼ p
ISTA is a Neural Net
ISTA:
z^(t+1) = st(z^(t) − (1/L) D^⊤(Dz^(t) − x), λ/L)
Let W_1 = I_m − (1/L) D^⊤D and W_2 = (1/L) D^⊤:
z^(t+1) = st(W_1 z^(t) + W_2 x, λ/L)
3 iterations of ISTA = a 3-layer NN
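A short sketch of the same iterations written with the fixed weights W_1 and W_2, which makes the recurrent-network reading explicit (it reuses soft_thresholding from the sketch above):

```python
# Same iterations, written as a recurrent layer with fixed weights W1, W2
# (reuses soft_thresholding from the previous sketch).
import numpy as np

def ista_as_network(D, x, lam, n_layers=3):
    L = np.linalg.norm(D, ord=2) ** 2
    W1 = np.eye(D.shape[1]) - D.T @ D / L  # W1 = I_m - (1/L) D^T D
    W2 = D.T / L                           # W2 = (1/L) D^T
    z = np.zeros(D.shape[1])
    for _ in range(n_layers):              # T layers = T ISTA iterations
        z = soft_thresholding(W1 @ z + W2 @ x, lam / L)
    return z
```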
Learned-ISTA
Gregor, LeCun, Learning Fast Approximations of Sparse Coding, 2010
A T-layer LISTA network is a function Φ parametrized by the T sets of parameters Θ = (W_1^t, W_2^t, β^t)_{t=0,…,T−1}
Learned-ISTA
A T-layer LISTA network is a function Φ parametrized by the T sets of parameters Θ = (W_1^t, W_2^t, β^t)_{t=0,…,T−1}
- z^(0) = 0
- z^(t+1) = st(W_1^t z^(t) + W_2^t x, β^t)
- Return z^(T) = Φ_Θ(x)
The parameters of the network are learned to get better results than ISTA
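A minimal PyTorch sketch of such a network, with each layer's (W_1^t, W_2^t, β^t) initialized at the ISTA values; the class name, the initialization, and the (batch, n) input shape are illustrative choices, not the exact code of the original paper:

```python
# Minimal PyTorch sketch of a T-layer LISTA network (class name, initialization
# at the ISTA weights, and the (batch, n) input shape are illustrative choices).
import torch
import torch.nn as nn

class LISTA(nn.Module):
    def __init__(self, D, lam, n_layers):
        super().__init__()
        n, m = D.shape
        L = torch.linalg.matrix_norm(D, ord=2).item() ** 2
        # one (W1, W2, beta) triplet per layer, initialized at the ISTA values
        self.W1 = nn.ParameterList(
            [nn.Parameter(torch.eye(m) - D.T @ D / L) for _ in range(n_layers)])
        self.W2 = nn.ParameterList(
            [nn.Parameter(D.T / L) for _ in range(n_layers)])
        self.beta = nn.ParameterList(
            [nn.Parameter(torch.tensor(lam / L)) for _ in range(n_layers)])

    def forward(self, x):                  # x has shape (batch, n)
        z = torch.zeros(x.shape[0], self.W1[0].shape[0], device=x.device)
        for W1, W2, beta in zip(self.W1, self.W2, self.beta):
            v = z @ W1.T + x @ W2.T
            z = torch.sign(v) * torch.relu(torch.abs(v) - beta)  # soft thresholding
        return z
```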
learning parameters
A T-layer LISTA network is a function Φ parametrized by the T sets of parameters Θ = (W_1^t, W_2^t, β^t)_{t=0,…,T−1}
Supervised learning
Ground truth s_1, ⋯, s_N available (e.g. such that x_i = Ds_i)
L(Θ) = Σ_{i=1}^N ∥Φ_Θ(x_i) − s_i∥²
- z^(0) = 0
- z^(t+1) = st(W_1^t z^(t) + W_2^t x, β^t)
- Return z^(T) = Φ_Θ(x)
Semi-supervised
Compute s_1, ⋯, s_N as s_i = argmin F_{x_i}
L(Θ) = Σ_{i=1}^N ∥Φ_Θ(x_i) − s_i∥²
Unsupervised
Learn to solve the Lasso
L(Θ) = Σ_{i=1}^N F_{x_i}(Φ_Θ(x_i))
learning parameters
Unsupervised
Learn to solve the Lasso:
Θ ∈ argmin L(Θ) = Σ_{i=1}^N F_{x_i}(Φ_Θ(x_i))
If we see a new sample x, we expect:
F_x(Φ_Θ(x)) ≤ F_x(ISTA(x)),
where ISTA is applied for T iterations.
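A sketch of this unsupervised training under these assumptions, reusing the LISTA sketch above; the full-batch Adam optimizer and the hyper-parameters are illustrative choices:

```python
# Sketch of unsupervised training: minimize the Lasso cost at the network output
# (assumes the LISTA sketch above; full-batch Adam is an illustrative choice).
import torch

def lasso_cost(D, x, z, lam):
    residual = x - z @ D.T                       # shape (batch, n)
    return 0.5 * (residual ** 2).sum(dim=1) + lam * z.abs().sum(dim=1)

def train_unsupervised(model, D, X_train, lam, n_epochs=100, lr=1e-3):
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(n_epochs):
        optimizer.zero_grad()
        z = model(X_train)                       # forward pass through the T layers
        loss = lasso_cost(D, X_train, z, lam).mean()
        loss.backward()
        optimizer.step()
    return model
```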
LISTA
Advantages:
- Can handle large-scale datasets
- GPU friendly
Drawbacks:
- Learning problem is non-convex, non-differentiable: no practical guarantees, and generally hard to train
- Can fail badly on unseen data
what does the network learn?
Consider a "deep" LISTA network (T ≫ 1), and assume that we have solved the expected optimization problem:
Θ ∈ argmin E_{x∼p}[F_x(Φ_Θ(x))]
As T → +∞, assume (W_1^t, W_2^t, β^t) → (W_1^*, W_2^*, β^*). Call α = β^*/λ. Then:
W_1^* = I_m − α D^⊤D
W_2^* = α D^⊤
β^* = αλ
Ablin, Moreau et al., Learning Step Sizes for Unfolded Sparse Coding, 2019
Corresponds to ISTA with step size α instead of 1/L
W_1 = I_m − (1/L) D^⊤D
W_2 = (1/L) D^⊤
β = λ/L
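Plugging the limit weights into one layer makes this explicit:
st(W_1^* z + W_2^* x, β^*) = st((I_m − αD^⊤D)z + αD^⊤x, αλ) = st(z − αD^⊤(Dz − x), αλ),
which is exactly one ISTA step with step size α.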
what does the network learn?
"The deep layers of LISTA only learn a better step size"
Sad result.
- The first layers of the network learn something more complicated
- Idea: learn step sizes only → SLISTA
Step-LISTA
Learn step sizes only
Step sizes increase as the sparsity of z increases. The network learns a kind of sparse PCA:
L_S = max ∥Dz∥² s.t. ∥z∥ = 1 and Supp(z) ⊂ S
1/L_S increases as Supp(z) shrinks
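A sketch of this parametrization: the dictionary is kept fixed and each layer only learns a scalar step size α^t, with the threshold tied to it as α^t λ (initializing every step at 1/L is an illustrative choice):

```python
# Sketch of the step-size-only parametrization: D is fixed, each layer learns a
# scalar step alpha_t, and the threshold is tied to it as alpha_t * lam
# (initializing every step at 1/L is an illustrative choice).
import torch
import torch.nn as nn

class StepLISTA(nn.Module):
    def __init__(self, D, lam, n_layers):
        super().__init__()
        self.D = D                               # fixed dictionary, shape (n, m)
        self.lam = lam
        L = torch.linalg.matrix_norm(D, ord=2).item() ** 2
        self.alphas = nn.Parameter(torch.full((n_layers,), 1.0 / L))

    def forward(self, x):                        # x has shape (batch, n)
        z = torch.zeros(x.shape[0], self.D.shape[1], device=x.device)
        for alpha in self.alphas:
            v = z - alpha * (z @ self.D.T - x) @ self.D   # z - alpha * D^T(Dz - x)
            z = torch.sign(v) * torch.relu(torch.abs(v) - alpha * self.lam)
        return z
```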

Figure: SLISTA seems better for high sparsity. (ALISTA works in the supervised framework, fails here.)
Conclusion
- Far fewer parameters: easier to generalize
- Simpler to train?
- This idea can be transposed to more complicated algorithms than ISTA
Thanks!