A matrix $W \in \mathbb{R}^{p \times p}$ is orthogonal when:
$$WW^\top = I_p$$
Orthogonal matrices can be used to build layers $y = \sigma(Wx + b)$ in neural networks, for instance to improve robustness to adversarial examples [1], to build unitary recurrent networks [2], or Sylvester normalizing flows [3] (see the sketch below).
[1]: Cisse, Moustapha, et al. "Parseval networks: Improving robustness to adversarial examples." International Conference on Machine Learning. PMLR, 2017.
[2]: Arjovsky, Martin, Amar Shah, and Yoshua Bengio. "Unitary evolution recurrent neural networks." International Conference on Machine Learning. PMLR, 2016.
[3]: Berg, Rianne van den, et al. "Sylvester normalizing flows for variational inference." arXiv preprint arXiv:1803.05649 (2018).
[Figure: diagram relating $x$ and $z$ through the maps $f_\theta$ and $g_\theta$]
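For concreteness, a minimal NumPy sketch of such a layer, with a random orthogonal $W$ obtained from a QR factorization (all names here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
p = 4

# Random orthogonal matrix: Q factor of a Gaussian matrix.
W, _ = np.linalg.qr(rng.standard_normal((p, p)))
print(np.allclose(W @ W.T, np.eye(p)))  # True: W W^T = I_p

# Orthogonal layer y = sigma(W x + b), here with sigma = tanh.
b = np.zeros(p)
x = rng.standard_normal(p)
y = np.tanh(W @ x + b)
```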
Neural network $\phi_\theta : x \mapsto y$ with parameters $\theta$. Some parameters are orthogonal matrices.
Dataset $x_1, \dots, x_n$.
Find parameters by empirical risk minimization:
$$\min_\theta f(\theta) = \frac{1}{n}\sum_{i=1}^{n} \ell_i\big(\phi_\theta(x_i)\big)$$
$\ell_i$: individual loss function (e.g. log-likelihood, cross-entropy with targets, ...)
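As a concrete instance, a minimal sketch of this empirical risk for a one-layer network with a squared per-sample loss (the network, data and loss are illustrative choices, not taken from the text):

```python
import numpy as np

rng = np.random.default_rng(0)
p, n = 4, 32
X = rng.standard_normal((n, p))  # dataset x_1, ..., x_n
Y = rng.standard_normal((n, p))  # targets used by the per-sample losses

def phi(theta, x):
    # One-layer network phi_theta(x) = tanh(W x + b); W is meant to stay orthogonal.
    W, b = theta
    return np.tanh(W @ x + b)

def empirical_risk(theta):
    # f(theta) = (1/n) * sum_i ell_i(phi_theta(x_i)), with ell_i a squared error here.
    return np.mean([np.sum((phi(theta, x) - y) ** 2) for x, y in zip(X, Y)])

W0, _ = np.linalg.qr(rng.standard_normal((p, p)))  # orthogonal initialization
print(empirical_risk((W0, np.zeros(p))))
```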
How to do this with orthogonal weights?
$\mathcal{O}_p = \{ W \in \mathbb{R}^{p \times p} \mid W^\top W = I_p \}$ is a Riemannian manifold.
Problem:
$$\min_{W \in \mathcal{O}_p} f(W)$$
Main approach:
Extend Euclidean algorithms to the Riemannian setting (gradient descent, stochastic gradient descent...)
Classical retraction (with $A$ skew-symmetric):
$$\mathcal{R}(W, AW) = \exp(A)\, W$$
Riemannian gradient:
$$\mathrm{grad} f(W) = \mathrm{Skew}\big(\nabla f(W) W^\top\big)\, W, \quad \text{where } \mathrm{Skew}(M) = \tfrac{1}{2}(M - M^\top)$$
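A minimal NumPy/SciPy sketch of one such Riemannian gradient descent step; the quadratic objective $f$ is only an illustrative choice:

```python
import numpy as np
from scipy.linalg import expm  # matrix exponential for the retraction

p = 5
rng = np.random.default_rng(0)

# Illustrative objective f(W) = 0.5 * ||W - B||^2, with Euclidean gradient W - B.
B = rng.standard_normal((p, p))
euclid_grad = lambda W: W - B

def skew(M):
    return 0.5 * (M - M.T)

def rgd_step(W, eta=0.1):
    A = skew(euclid_grad(W) @ W.T)  # Riemannian gradient is A @ W
    return expm(-eta * A) @ W       # retraction: R(W, -eta * A W) = exp(-eta * A) W

W, _ = np.linalg.qr(rng.standard_normal((p, p)))  # start on the manifold
for _ in range(200):
    W = rgd_step(W)
print(np.linalg.norm(W @ W.T - np.eye(p)))  # ~0: iterates stay on O_p
```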
Riemannian gradient descent in a neural network: every step requires the retraction
$$\mathcal{R}(W, AW) = \exp(A)\, W,$$
i.e. a matrix exponential for each orthogonal weight matrix. In a deep learning setting, moving on the manifold this way is too costly!
Can we have a method that is free to move outside the manifold, but still converges to an orthogonal matrix, using only cheap operations?
Projection: follow the gradient of $N(M) = \frac{1}{4}\|MM^\top - I_p\|^2$,
$$\nabla N(M) = (MM^\top - I_p)\, M$$
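As a sanity check, plain gradient descent on $N$ alone pulls an arbitrary matrix onto the orthogonal manifold; a small NumPy sketch (step size and iteration count are arbitrary):

```python
import numpy as np

p = 5
rng = np.random.default_rng(0)
M = rng.standard_normal((p, p))  # arbitrary starting point, not orthogonal

def grad_N(M):
    # Gradient of N(M) = 1/4 * ||M M^T - I_p||^2 (Frobenius norm).
    return (M @ M.T - np.eye(p)) @ M

eta = 0.01  # small step for stability
for _ in range(3000):
    M -= eta * grad_N(M)

print(np.linalg.norm(M @ M.T - np.eye(p)))  # close to 0: M is now (nearly) orthogonal
```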
Optimization: follow the Riemannian gradient,
$$\mathrm{grad} f(M) = \mathrm{Skew}\big(\nabla f(M) M^\top\big)\, M$$
These two terms are orthogonal!
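A one-line justification of this orthogonality, using that the trace of a product of a skew-symmetric and a symmetric matrix vanishes: writing $S = \mathrm{Skew}(\nabla f(M) M^\top)$,
$$\langle \mathrm{grad} f(M), \nabla N(M)\rangle = \mathrm{tr}\big(M^\top S^\top (MM^\top - I_p) M\big) = \mathrm{tr}\big(S^\top (MM^\top - I_p)\, MM^\top\big) = 0,$$
since $S$ is skew-symmetric while $(MM^\top - I_p)\, MM^\top$ is symmetric.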
Combining the two terms with a parameter $\lambda > 0$:
$$\Lambda(M) = \mathrm{grad} f(M) + \lambda\, \nabla N(M) = \big(\mathrm{Skew}(\nabla f(M) M^\top) + \lambda\,(MM^\top - I_p)\big)\, M$$
Only matrix-matrix multiplications! No expensive linear algebra, and everything is parallelizable on GPUs.
Comparison to retractions:
Starting from $M_0 \in \mathcal{O}_p$, iterate
$$M_{t+1} = M_t - \eta\, \Lambda(M_t)$$
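A minimal NumPy sketch of this iteration on a toy objective (the quadratic $f$, the value $\lambda = 1$ and the step size are illustrative choices):

```python
import numpy as np

p = 5
rng = np.random.default_rng(0)

# Illustrative objective f(M) = 0.5 * ||M - B||^2, with Euclidean gradient M - B.
B = rng.standard_normal((p, p))
euclid_grad = lambda M: M - B

def skew(X):
    return 0.5 * (X - X.T)

def Lambda(M, lam=1.0):
    # Lambda(M) = (Skew(grad f(M) M^T) + lam * (M M^T - I_p)) M
    return (skew(euclid_grad(M) @ M.T) + lam * (M @ M.T - np.eye(p))) @ M

M, _ = np.linalg.qr(rng.standard_normal((p, p)))  # start from M_0 on the manifold
eta = 0.05
for _ in range(2000):
    M -= eta * Lambda(M)  # only matrix-matrix products, no matrix exponential

print(np.linalg.norm(M @ M.T - np.eye(p)))  # small: M ends up (approximately) orthogonal
```

Note that, unlike the retraction-based step, each iteration here leaves the manifold slightly; the $\lambda\,(MM^\top - I_p)M$ term is what pulls the iterates back toward $\mathcal{O}_p$.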