A matrix \(W \in\mathbb{R}^{p\times p}\) is orthogonal when:
$$WW^\top = I_p$$
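As a quick numerical illustration of this definition, here is a minimal NumPy sketch. The helper name `is_orthogonal`, the tolerance, and the use of a QR factorization to produce an orthogonal matrix are assumptions made for the example, not part of the original material.

```python
import numpy as np

rng = np.random.default_rng(0)
p = 4

# The Q factor of a QR factorization of a square matrix is orthogonal.
W, _ = np.linalg.qr(rng.standard_normal((p, p)))


def is_orthogonal(W, tol=1e-10):
    """Return True when W W^T equals the identity up to `tol` (Frobenius norm)."""
    return np.linalg.norm(W @ W.T - np.eye(W.shape[0])) < tol


print(is_orthogonal(W))        # W W^T = I_p holds up to rounding
print(is_orthogonal(2.0 * W))  # rescaling breaks the constraint
```

An orthogonal \(W\) also preserves Euclidean norms, \(\|Wx\| = \|x\|\), which is one reason such layers are attractive.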
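Orthogonal matrices can be used to build layers in neural networks \(y = \sigma(Wx + b)\), for instance for robustness to adversarial examples [1], recurrent networks [2], and normalizing flows [3].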
[1]: Cisse, Moustapha, et al. "Parseval networks: Improving robustness to adversarial examples." International Conference on Machine Learning. PMLR, 2017.
[2]: Arjovsky, Martin, Amar Shah, and Yoshua Bengio. "Unitary evolution recurrent neural networks." International Conference on Machine Learning. PMLR, 2016.
[3]: Berg, Rianne van den, et al. "Sylvester normalizing flows for variational inference." arXiv preprint arXiv:1803.05649 (2018).
Neural network \(\phi_{\theta}: x\mapsto y\) with parameters \(\theta\). Some parameters are orthogonal matrices.
Dataset \(x_1, \dots, x_n\).
Find parameters by empirical risk minimization:
$$\min_{\theta}f(\theta) = \frac1n\sum_{i=1}^n\ell_i(\phi_{\theta}(x_i))$$
\(\ell_i\): individual loss function (e.g. log-likelihood, cross-entropy with targets, ...)
How to do this with orthogonal weights?
\( \mathcal{O}_p = \{W\in\mathbb{R}^{p\times p}|\enspace W^\top W =I_p\}\) is a Riemannian manifold
Problem:
$$\min_{W\in\mathcal{O}_p}f(W)$$
Main approach:
Extend Euclidean algorithms to the Riemannian setting (gradient descent, stochastic gradient descent...)
Classical retraction:
$$\mathcal{R}(W, AW) =\exp(A)W$$
Riemannian gradient: \(\mathrm{grad} f(W) = \mathrm{Skew}(\nabla f(W)W^\top)W\)
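One Riemannian gradient step with the exponential retraction can be sketched as follows. This is a minimal illustration under stated assumptions: \(\mathrm{Skew}(X) = \frac12(X - X^\top)\) is taken to be the usual skew-symmetric part, the helper names `skew`, `expm_skew` and `riemannian_gd_step` are hypothetical, and the matrix exponential is computed through an eigendecomposition that is only valid for skew-symmetric inputs (a general-purpose routine such as `scipy.linalg.expm` would be the more common choice).

```python
import numpy as np


def skew(X):
    """Skew-symmetric part of X (assumed convention: (X - X^T) / 2)."""
    return 0.5 * (X - X.T)


def expm_skew(A):
    """Matrix exponential of a real skew-symmetric matrix A.

    1j * A is Hermitian, so eigh gives 1j * A = V diag(w) V^H with real w,
    hence exp(A) = V diag(exp(-1j * w)) V^H.
    """
    w, V = np.linalg.eigh(1j * A)
    return ((V * np.exp(-1j * w)) @ V.conj().T).real


def riemannian_gd_step(W, euclid_grad, eta):
    """One step W <- R(W, -eta * grad f(W)) = exp(-eta * A) W."""
    A = skew(euclid_grad @ W.T)  # grad f(W) = A W
    return expm_skew(-eta * A) @ W


# Toy objective (an assumption for the example): f(W) = 0.5 * ||W - B||^2.
rng = np.random.default_rng(0)
W0, _ = np.linalg.qr(rng.standard_normal((4, 4)))
B = rng.standard_normal((4, 4))
W1 = riemannian_gd_step(W0, W0 - B, eta=0.1)
print(np.linalg.norm(W1 @ W1.T - np.eye(4)))  # the iterate stays on the manifold
```

The iterate remains orthogonal after the step, but each step pays for a matrix exponential (or an eigendecomposition, as here).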
Riemannian gradient descent in a neural network:
In a deep learning setting, moving on the manifold is too costly: each step requires a retraction such as the matrix exponential above.
Can we have a method that is free to move outside the manifold?
Projection
Follow the gradient of $$\mathcal{N}(M) = \frac14\|MM^\top - I_p\|^2$$
$$\nabla \mathcal{N}(M) = (MM^\top - I_p)M$$
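The gradient formula can be checked numerically against a central finite difference. This is only a sketch: the norm is assumed to be the Frobenius norm, and the function names `distance_penalty` and `distance_penalty_grad` are hypothetical.

```python
import numpy as np


def distance_penalty(M):
    """N(M) = 1/4 * ||M M^T - I||_F^2 (Frobenius norm assumed)."""
    p = M.shape[0]
    return 0.25 * np.linalg.norm(M @ M.T - np.eye(p)) ** 2


def distance_penalty_grad(M):
    """Euclidean gradient of N: (M M^T - I) M."""
    p = M.shape[0]
    return (M @ M.T - np.eye(p)) @ M


rng = np.random.default_rng(0)
M = rng.standard_normal((3, 3))  # arbitrary point, not on the manifold
D = rng.standard_normal((3, 3))  # arbitrary direction

eps = 1e-6
finite_diff = (distance_penalty(M + eps * D) - distance_penalty(M - eps * D)) / (2 * eps)
analytic = np.sum(distance_penalty_grad(M) * D)  # <grad N(M), D>
print(finite_diff, analytic)  # the two values should agree closely
```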
Optimization
Riemannian gradient:
$$\mathrm{grad}f(M) = \mathrm{Skew}(\nabla f(M)M^\top) M$$
These two terms, \(\nabla \mathcal{N}(M)\) and \(\mathrm{grad}f(M)\), are orthogonal!
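A short derivation of this orthogonality, assuming the Frobenius inner product \(\langle X, Y\rangle = \mathrm{Tr}(X^\top Y)\) and writing \(A = \mathrm{Skew}(\nabla f(M)M^\top)\) with \(\mathrm{Skew}(X) = \frac12(X - X^\top)\):

$$\langle AM,\ (MM^\top - I_p)M\rangle = \mathrm{Tr}\left(M^\top A^\top (MM^\top - I_p)M\right) = \mathrm{Tr}\left(A^\top (MM^\top - I_p)MM^\top\right)$$

The matrix \(S = (MM^\top - I_p)MM^\top\) is symmetric, since it is a polynomial in the symmetric matrix \(MM^\top\), while \(A\) is skew-symmetric. Therefore

$$\mathrm{Tr}(A^\top S) = \mathrm{Tr}(S A) = \mathrm{Tr}(A S) = -\mathrm{Tr}(A^\top S) \quad\Longrightarrow\quad \mathrm{Tr}(A^\top S) = 0$$

This holds for any \(M\), not only for \(M\) on the manifold.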
Combine the two directions:
$$\Lambda(M) = \mathrm{grad}f(M) + \lambda \nabla \mathcal{N}(M)$$
$$\Lambda(M) = \left(\mathrm{Skew}(\nabla f(M)M^\top) + \lambda (MM^\top-I_p)\right)M$$
Only matrix-matrix multiplications! No expensive linear algebra, and parallelizable on GPUs.
Comparison to retractions:
$$\Lambda(M) = \left(\mathrm{Skew}(\nabla f(M)M^\top) + \lambda (MM^\top-I_p)\right)M$$
Starting from \(M^0\in\mathcal{O}_p\), iterate
$$M^{t+1} = M^t -\eta \Lambda(M^t)$$
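The iteration can be sketched in a few lines of NumPy. This is a minimal illustration, not a reference implementation: the function names, the toy objective \(f(M) = \frac12\|M - Q\|^2\) with a rotation target \(Q\), and the values of \(\eta\), \(\lambda\) and the iteration count are all assumptions chosen for the example.

```python
import numpy as np


def skew(X):
    """Skew-symmetric part of X (assumed convention: (X - X^T) / 2)."""
    return 0.5 * (X - X.T)


def field(M, euclid_grad, lam):
    """Lambda(M) = (Skew(grad f(M) M^T) + lam * (M M^T - I)) M, matmuls only."""
    p = M.shape[0]
    return (skew(euclid_grad @ M.T) + lam * (M @ M.T - np.eye(p))) @ M


def iterate(grad_f, M0, eta, lam, n_iters):
    """Run M <- M - eta * Lambda(M) and record ||M M^T - I|| at each step."""
    M = M0.copy()
    distances = []
    for _ in range(n_iters):
        M = M - eta * field(M, grad_f(M), lam)
        distances.append(np.linalg.norm(M @ M.T - np.eye(M.shape[0])))
    return M, distances


# Toy problem (an assumption): f(M) = 0.5 * ||M - Q||^2 with Q a rotation
# by 1 radian in the first two coordinates, so grad f(M) = M - Q.
theta = 1.0
c, s = np.cos(theta), np.sin(theta)
Q = np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])

M, distances = iterate(lambda M: M - Q, np.eye(3), eta=0.1, lam=1.0, n_iters=500)
print(distances[0], distances[-1])  # leaves the manifold, then comes back to it
print(np.linalg.norm(M - Q))        # distance to the minimizer
```

On this toy problem the first iterate is already slightly off the manifold, and the term \(\lambda (MM^\top - I_p)M\) pulls later iterates back while the skew term decreases \(f\). No retraction is computed at any point.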