Pierre Ablin
Parietal tutorials
We consider the Euclidean space \(\mathbb{R}^P\) equipped with the usual scalar product:
$$ \langle x, x' \rangle = \sum_{p=1}^P x_p x'_p = x^{\top}x' $$
It defines the Euclidean (aka \( \ell_2 \) ) norm \(\|x\| = \sqrt{\langle x,x\rangle}\).
Lemma: any other scalar product on \( \mathbb{R}^P\) can be written as:
$$( x, x' ) = \sum_{p, q=1}^P x_p x'_q S_{pq} = x^{\top} S x' \enspace, $$
with \(S\) a positive definite matrix \(\to\) all scalar products "look" the same
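One way to see why all scalar products "look" the same: factor \(S = LL^{\top}\) (Cholesky), so that \(x^{\top} S x' = \langle L^{\top}x, L^{\top}x' \rangle\), i.e. the usual scalar product after a linear change of variables. A minimal NumPy sketch, where the matrix `S` is an arbitrary random choice made for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
P = 4

# Arbitrary positive definite matrix (illustrative choice, not from the slides).
B = rng.standard_normal((P, P))
S = B @ B.T + P * np.eye(P)

x = rng.standard_normal(P)
x_prime = rng.standard_normal(P)

# Scalar product defined by S.
s_product = x @ S @ x_prime

# Cholesky factor: S = L @ L.T, so x^T S x' = <L^T x, L^T x'>.
L = np.linalg.cholesky(S)
usual_product = (L.T @ x) @ (L.T @ x_prime)

print(np.isclose(s_product, usual_product))
```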
A Euclidean space is a vector space: you can go in straight lines.
The distance between two points is given by: \(d(x, x') = \|x - x' \| \).
Most machine learning algorithms are designed for a Euclidean framework!
- Linear regression: find \(\beta \in \mathbb{R}^P\) such that \(\langle \beta, x_i \rangle \sim y_i \)
- PCA: find a direction \(\beta \in \mathbb{R}^P\) such that \(\langle \beta, x_i \rangle \) has maximal variance
- Nearest neighbor: the predicted class of \(x\) is the class of \(x_i \) that minimizes \( \|x - x_i \| \)
In many applications, the data points do not live naturally in \(\mathbb{R}^P \)!
A k-dimensional manifold \(\mathcal{M}\) is a smooth subset of \(\mathbb{R}^P\) (no edge/pointy parts). Smooth means that the neighborhood of each point in the manifold looks like a vector space of dimension k: it is flat.
- The sphere \(S^{P-1} = \{x \in \mathbb{R}^{P} \enspace | \|x\| = 1 \} \) is a \( P -1 \) dimensional manifold
- Any k dimensional vector space (or open subset of a vector space)
- The orthogonal matrix manifold \(\mathcal{O}_P = \{M \in \mathbb{R}^{P \times P} \enspace |MM^{\top} =I_P\}\) is a \(P(P-1)/2\) dimensional manifold
- The positive definite matrix set \(S^{++}_P = \{M \in \mathbb{R}^{P \times P} \enspace |M=M^{\top} \) and eig\( (M) > 0\}\) is a \(P(P+1)/2\) dimensional manifold
Consider "curves on the manifold", linking two points \(x\) and \(x'\).
It is a (differentiable) function \(\gamma\) from \([0, 1]\) to \(\mathcal{M}\):
$$ \gamma(t) \in \mathcal{M} \enspace, $$
such that \(\gamma(0) = x\) and \(\gamma(1) = x' \).
Its derivative at a point \(t\) is a vector in \(\mathbb{R}^P\):
$$\gamma'(t) \in \mathbb{R}^P$$
The derivative is given by the first-order expansion: \(\gamma(t+ dt) = \gamma(t) + \gamma'(t) dt + o(dt)\)
"Around each point \(x\) of the manifold, the manifold looks like a vector space": this is the tangent space at \(x\), \(T_x\).
It is a linear subspace of \(\mathbb{R}^P\) of dimension \(k\). It is the set of all the derivatives of the curves passing at \(x\).
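- On the sphere \(S^{P-1} = \{x \in \mathbb{R}^{P} \enspace | \|x\| = 1 \} \): \(T_x = \{\xi \in \mathbb{R}^{P} \enspace | \enspace \xi^{\top}x = 0 \}\)
- On the orthogonal manifold \( \mathcal{O}_P = \{M \in \mathbb{R}^{P \times P} | \enspace MM^{\top} = I_P\} \): \(T_M = \{\xi \in \mathbb{R}^{P \times P} \enspace | \enspace \xi M^{\top} + M\xi^{\top} = 0 \}\)
- On a vector space \(F\): \(T_x = F\)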
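A small numerical check of the tangent-space idea on the sphere: differentiating \(\|\gamma(t)\|^2 = 1\) gives \(\gamma(t)^{\top}\gamma'(t) = 0\), so the derivative of a curve is orthogonal to the current point. The sketch below uses a great circle as the curve and a finite difference for the derivative; the specific `x`, `xi` and step `dt` are illustrative choices.

```python
import numpy as np

def great_circle(x, xi, t):
    """Curve on the unit sphere with gamma(0) = x and gamma'(0) = xi.

    Assumes x and xi have unit norm and are orthogonal.
    """
    return np.cos(t) * x + np.sin(t) * xi

x = np.array([0.0, 0.0, 1.0])
xi = np.array([1.0, 0.0, 0.0])

t, dt = 0.3, 1e-6
point = great_circle(x, xi, t)
# First-order expansion: gamma(t + dt) ~ gamma(t) + gamma'(t) dt.
velocity = (great_circle(x, xi, t + dt) - point) / dt

print(np.linalg.norm(point))   # approximately 1: the curve stays on the sphere
print(point @ velocity)        # close to 0: the velocity is tangent at gamma(t)
```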
A manifold becomes Riemannian when each tangent space is equipped with a Euclidean structure: there is one scalar product for each tangent space.
$$\langle \xi, \xi' \rangle_x \text{ for } \enspace \xi, \xi' \in T_x$$
There is a positive definite matrix \(S_x \) for all \(x \in \mathcal{M} \) such that \(\langle \xi, \xi' \rangle_x = \xi^{\top} S_x \xi'\).
- The sphere is often endowed with the Euclidean scalar product \(\langle \xi, \xi' \rangle_x = \langle \xi, \xi' \rangle \)
- The set of P.D. matrices is often endowed with the geometric metric:
\( \langle U, V \rangle_M = \text{Tr}(UM^{-1}VM^{-1}) \)
There are countless possibilities !
Note: The scalar product also defines a norm : \(\|\xi\|_x = \sqrt{\langle \xi, \xi \rangle_x} \)
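As an illustration, the geometric metric on P.D. matrices can be evaluated directly from its formula. The sketch below checks on random inputs (arbitrary, chosen for illustration) that it is symmetric, positive, and reduces to the usual scalar product \(\text{Tr}(UV)\) at \(M = I_P\); the function name `geometric_inner` is hypothetical.

```python
import numpy as np

def geometric_inner(U, V, M):
    """<U, V>_M = Tr(U M^{-1} V M^{-1}) for symmetric U, V and P.D. M."""
    M_inv = np.linalg.inv(M)
    return np.trace(U @ M_inv @ V @ M_inv)

rng = np.random.default_rng(0)
P = 3
B = rng.standard_normal((P, P))
M = B @ B.T + np.eye(P)          # arbitrary P.D. base point

def random_symmetric():
    """Random tangent vector: the tangent space of the P.D. set is the symmetric matrices."""
    A = rng.standard_normal((P, P))
    return (A + A.T) / 2

U, V = random_symmetric(), random_symmetric()

print(np.isclose(geometric_inner(U, V, M), geometric_inner(V, U, M)))  # symmetric
print(geometric_inner(U, U, M) > 0)                                     # positive
# At M = I, the metric is the usual scalar product Tr(U V).
print(np.isclose(geometric_inner(U, V, np.eye(P)), np.sum(U * V)))
```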
Let \( x, x' \in \mathcal{M} \). Let \(\gamma : [0, 1] \to \mathcal{M} \) be a curve linking \(x\) to \(x'\).
The length of \(\gamma \) is:
$$\text{length}(\gamma) = \int_{0}^1 \|\gamma'(t) \|_{\gamma(t)} dt \enspace ,$$
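and the geodesic distance between \(x\) and \(x'\) is the minimal length:
$$d(x, x') = \min_{\gamma} \enspace \text{length}(\gamma) \enspace ,$$
where the minimum is taken over all curves linking \(x\) to \(x'\).
A curve \(\gamma \) that reaches this minimum is called a geodesic.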
- On the sphere, equipped with the usual metric, \(d(x, x') = \arccos(x^{\top}x') \)
- Geodesic distances are not always available in closed form :(
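A quick numerical illustration on the sphere: the discretized length of the great-circle arc between two points matches \(\arccos(x^{\top}x')\), while another curve on the sphere with the same endpoints is longer. The two endpoints and the "detour" curve are arbitrary choices for this sketch.

```python
import numpy as np

def sphere_distance(x, x_prime):
    """Geodesic distance on the unit sphere: arccos(x^T x')."""
    return np.arccos(np.clip(x @ x_prime, -1.0, 1.0))

def curve_length(points):
    """Approximate length of a discretized curve: sum of the norms of its increments."""
    return np.sum(np.linalg.norm(np.diff(points, axis=0), axis=1))

x = np.array([1.0, 0.0, 0.0])
x_prime = np.array([0.0, 1.0, 0.0])
t = np.linspace(0.0, 1.0, 2001)[:, None]

# Great-circle arc from x to x' (this formula relies on x and x' being orthogonal).
theta = sphere_distance(x, x_prime)
arc = np.cos(t * theta) * x + np.sin(t * theta) * x_prime

# Another curve on the sphere with the same endpoints: a detour toward a third axis.
z = np.array([0.0, 0.0, 1.0])
detour = (1 - t) * x + t * x_prime + 4 * t * (1 - t) * z
detour /= np.linalg.norm(detour, axis=1, keepdims=True)

print(theta, curve_length(arc))   # the arc length matches the geodesic distance
print(curve_length(detour))       # the detour is longer
```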
For a set of points \(x_1,\cdots, x_N \in \mathcal{M}\), we can define the barycenter as:
$$\bar{x} = \arg\min_{x\in \mathcal{M}} \frac1N \sum_{n=1}^Nd(x, x_n)^2$$
Note: if \(d\) is the Euclidean distance, \(\bar{x} = \frac1N \sum_{n=1}^N x_n \) (extends notion of average) :)
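A short derivation of the Euclidean case, assuming \(\mathcal{M} = \mathbb{R}^P\) and \(d(x, x_n) = \|x - x_n\|\): the objective is convex and differentiable, so its minimizer is the point where the gradient vanishes.

$$\nabla_x \left[ \frac1N \sum_{n=1}^N \|x - x_n\|^2 \right] = \frac2N \sum_{n=1}^N (x - x_n) = 2\left(x - \frac1N \sum_{n=1}^N x_n\right) = 0 \iff x = \frac1N \sum_{n=1}^N x_n$$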
Brain-Computer Interface (BCI) problem: a subject is asked to move their right or left hand. Record the EEG and compute the covariance matrix for each task.
\(\to\) dataset \(C_1, \cdots, C_N \in S_P^{++}\), and targets \(y_1, \cdots, y_N \in \{-1, +1\} \).
Simple classification pipeline:
- Compute the means \(\bar{C}^{\text{left}}, \enspace \bar{C}^{\text{right}} \) for each class.
- Compute the distances to mean for each \(n\): \(d_n^{\text{left}}= d(C_n, \bar{C}^{\text{left}})\), \(d_n^{\text{right}}= d(C_n, \bar{C}^{\text{right}})\) .
[Barachant 2012]
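The pipeline above can be sketched in NumPy under several assumptions, none of which come from the slides: the data are synthetic sample covariances, the distance is the closed form for the geometric metric, \(d(A, B) = \|\log(A^{-1/2} B A^{-1/2})\|_F\), the class means are computed with a simple fixed-point iteration, and the final decision rule assigns each point to the class with the nearest mean. All function names are hypothetical.

```python
import numpy as np

def apply_to_eigenvalues(C, fun):
    """Apply a scalar function to the eigenvalues of a symmetric matrix."""
    w, V = np.linalg.eigh(C)
    return (V * fun(w)) @ V.T

def geometric_distance(A, B):
    """Geodesic distance for the geometric metric: ||log(A^{-1/2} B A^{-1/2})||_F."""
    A_isqrt = apply_to_eigenvalues(A, lambda w: w ** -0.5)
    w = np.linalg.eigvalsh(A_isqrt @ B @ A_isqrt)
    return np.sqrt(np.sum(np.log(w) ** 2))

def geometric_mean(Cs, n_iter=20):
    """Barycenter for the geometric distance, by a fixed-point iteration."""
    M = np.mean(Cs, axis=0)  # start from the arithmetic mean
    for _ in range(n_iter):
        M_sqrt = apply_to_eigenvalues(M, np.sqrt)
        M_isqrt = apply_to_eigenvalues(M, lambda w: w ** -0.5)
        # Average, in the tangent space at M, of the points seen from M.
        tangent_mean = np.mean(
            [apply_to_eigenvalues(M_isqrt @ C @ M_isqrt, np.log) for C in Cs], axis=0
        )
        M = M_sqrt @ apply_to_eigenvalues(tangent_mean, np.exp) @ M_sqrt
    return M

# Synthetic data (illustrative): sample covariances drawn around two true covariances.
rng = np.random.default_rng(0)
P, n_samples, N = 3, 200, 40

def sample_covariance(true_cov):
    X = rng.multivariate_normal(np.zeros(P), true_cov, size=n_samples)
    return X.T @ X / n_samples

cov_left, cov_right = np.eye(P), np.diag([4.0, 1.0, 0.25])
y = np.repeat([-1, 1], N // 2)   # -1: left, +1: right
Cs = np.array([sample_covariance(cov_left if label == -1 else cov_right) for label in y])

# Class means on one half of the data, distances to the means on the other half.
train = np.arange(N) % 2 == 0
mean_left = geometric_mean(Cs[train & (y == -1)])
mean_right = geometric_mean(Cs[train & (y == 1)])
d_left = np.array([geometric_distance(C, mean_left) for C in Cs[~train]])
d_right = np.array([geometric_distance(C, mean_right) for C in Cs[~train]])

# Assumed decision rule: pick the class whose mean is closest.
y_pred = np.where(d_left < d_right, -1, 1)
accuracy = np.mean(y_pred == y[~train])
print(accuracy)
```

The two classes are well separated here by construction, so this only illustrates the mechanics, not performance on real EEG data.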
Let \(x\in \mathcal{M} \) and \(\xi \in T_x\). There is a unique geodesic \(\gamma \) such that \(\gamma(0) = x \) and \(\gamma'(0) = \xi \). The exponential map at \(x\) is a function from \(T_x\) to \(\mathcal{M}\):
$$\text{Exp}_x(\xi) = \gamma(1) \in \mathcal{M}$$
- It is sometimes available in closed form
- Extremely important property for machine learning:
The exponential map preserves distances to \(x\): \(d(x, \text{Exp}_x(\xi)) = \|\xi\|_x\) (for \(\xi\) small enough)!
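On the unit sphere with the usual metric, the exponential map has the closed form \(\text{Exp}_x(\xi) = \cos(\|\xi\|) x + \sin(\|\xi\|) \frac{\xi}{\|\xi\|}\). The sketch below checks numerically, on an arbitrary example, that the result stays on the sphere and that its geodesic distance to \(x\) equals \(\|\xi\|\) (here \(\|\xi\| < \pi\)); `sphere_exp` is a hypothetical helper name.

```python
import numpy as np

def sphere_exp(x, xi):
    """Exponential map on the unit sphere, for xi tangent at x (xi^T x = 0)."""
    norm = np.linalg.norm(xi)
    if norm == 0.0:
        return x.copy()
    return np.cos(norm) * x + np.sin(norm) * xi / norm

x = np.array([0.0, 0.0, 1.0])
xi = np.array([0.3, -0.4, 0.0])     # tangent at x, since xi^T x = 0

y = sphere_exp(x, xi)
print(np.linalg.norm(y))                       # approximately 1: still on the sphere
print(np.arccos(x @ y), np.linalg.norm(xi))    # distance to x and norm of xi agree
```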
The logarithm is the inverse operation: \(\text{Log}_x\) is a mapping from \(\mathcal{M}\) to the tangent space \(T_x \subset \mathbb{R}^P\):
$$\text{Log}_x(x') \in \mathbb{R}^P$$
- It is sometimes available in closed form
- Extremely important property for machine learning:
It maps points \(x'\) in the manifold to a Euclidean space, and preserves distances to \(x\): \(\|\text{Log}_x(x')\|_x = d(x, x')\)!
It is the natural tool to project points of \(\mathcal{M}\) on a space on which it makes sense to use machine learning algorithms
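On the unit sphere, the logarithm also has a closed form: \(\text{Log}_x(x')\) points along the component of \(x'\) orthogonal to \(x\), with length \(\arccos(x^{\top}x')\). A minimal sketch with hypothetical helper names, checking on an arbitrary pair of points that the logarithm is tangent, has the right norm, and is inverted by the exponential map:

```python
import numpy as np

def sphere_exp(x, xi):
    """Exponential map on the unit sphere, for xi tangent at x."""
    norm = np.linalg.norm(xi)
    if norm == 0.0:
        return x.copy()
    return np.cos(norm) * x + np.sin(norm) * xi / norm

def sphere_log(x, x_prime):
    """Logarithm on the unit sphere: tangent vector at x of norm d(x, x_prime).

    Not defined for antipodal points; this sketch returns 0 when x_prime = +/- x.
    """
    cos_theta = np.clip(x @ x_prime, -1.0, 1.0)
    theta = np.arccos(cos_theta)
    direction = x_prime - cos_theta * x      # component of x_prime orthogonal to x
    norm = np.linalg.norm(direction)
    if norm == 0.0:
        return np.zeros_like(x)
    return theta * direction / norm

x = np.array([0.0, 0.0, 1.0])
x_prime = np.array([1.0, 2.0, 2.0]) / 3.0    # unit norm

xi = sphere_log(x, x_prime)
print(xi @ x)                                         # close to 0: xi is tangent at x
print(np.linalg.norm(xi), np.arccos(x @ x_prime))     # norm of the log = geodesic distance
print(np.allclose(sphere_exp(x, xi), x_prime))        # Exp inverts Log
```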
Optimization on a manifold \(\mathcal{M}\):
$$ \text{minimize} \enspace f(x) \enspace \text{s.t.} \enspace x \in \mathcal{M}\enspace, $$
where \(f\) is a differentiable function.
Example: \(\mathcal{M}\) is the sphere, and \(f(x) = \frac12x^{\top} A x \) (A symmetric) :
$$\text{minimize} \enspace \frac12x^{\top} A x \enspace \text{s.t.} \enspace \|x\| = 1$$
Bonus question: what is the solution?
You often have access to the (Euclidean) projection on \(\mathcal{M}\).
Projected gradient descent:
$$x^{(t+1)} = \text{Proj}_{\mathcal{M}}(x^{(t)} - \eta \nabla f(x^{(t)}))$$
\( \nabla f\) is the usual gradient.
On the sphere, \(\text{Proj}_{\mathcal{M}}(x) = \frac{x}{\|x\|}\), so for \(f(x) = \frac12x^{\top} A x\), where \(\nabla f(x) = Ax\):
$$x^{(t+1)} = \frac{x^{(t)} - \eta A x^{(t)}}{\|x^{(t)} - \eta A x^{(t)}\|}$$
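Projected gradient descent on the sphere for \(f(x) = \frac12x^{\top} A x\) can be sketched as follows. The matrix `A` (built with eigenvalues 1 to 5), the step size `eta` and the number of iterations are illustrative assumptions; the last line compares the final objective value with half the smallest eigenvalue of `A`.

```python
import numpy as np

rng = np.random.default_rng(0)
P = 5
# Symmetric matrix with eigenvalues 1, ..., 5 (illustrative choice).
Q, _ = np.linalg.qr(rng.standard_normal((P, P)))
A = Q @ np.diag(np.arange(1.0, P + 1)) @ Q.T

def f(x):
    return 0.5 * x @ A @ x

def project(x):
    """Euclidean projection on the unit sphere."""
    return x / np.linalg.norm(x)

eta = 0.1
x = project(rng.standard_normal(P))
for _ in range(500):
    # Gradient step with the usual gradient A x, then projection.
    x = project(x - eta * (A @ x))

print(f(x), 0.5 * np.linalg.eigvalsh(A)[0])
```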
You can compute the Riemannian (a.k.a. Natural) gradient:
$$\text{grad}f(x) \in T_x\enspace, $$
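such that, for all \(\xi \in T_x\):
$$\langle \text{grad}f(x), \xi \rangle_x = \langle \nabla f(x), \xi \rangle$$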
Natural gradient descent:
$$x^{(t+1)} = \text{Exp}_{x^{(t)}}(-\eta \cdot\text{grad}f(x^{(t)}))$$
Example: on the sphere, \(\text{grad}f(x) = Ax -(x^{\top}Ax) x\)
In most optimization problems with a manifold constraint, methods using the natural gradient work best.
Can be extended to stochastic algorithms, second-order methods, etc.
See Optimization Algorithms on Matrix Manifolds by Absil, Mahony and Sepulchre.
Training set: samples \(x_1, \cdots, x_N\in \mathcal{M}\). How can we find a way to move these points to a space where it makes sense to use our machine learning artillery?
- Difficult in general
- Simple solution if they are all 'close'
'Vectorization' procedure:
- Compute the average: \(\bar{x} = \text{Mean}(x_1, \cdots, x_N)\)
- Using the Logarithm, project the points into the tangent space at \( \bar{x}\):
$$\xi_n = \text{Log}_{\bar{x}}(x_n)$$
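- Project these points in \(\mathbb{R}^k\) using the whitening operation, chosen so that \(\|\nu_n\|_2 = \|\xi_n\|_{\bar{x}}\) when \(\langle \xi, \xi' \rangle_{\bar{x}} = \xi^{\top} S_{\bar{x}} \xi'\):
$$\nu_n = S_{\bar{x}}^{\frac12} \xi_n$$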
You end up with vectors such that:
$$\|\nu_n - \nu_m\|_2 \simeq d(x_n, x_m) $$
So you can use classical algorithms on the \( (\nu_n) \) !
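The procedure can be sketched on the unit sphere with the usual metric, where the scalar product on each tangent space is the Euclidean one, so the whitening step is the identity. The data (points clustered around the north pole), the fixed-point iteration used for the mean, and all helper names are assumptions made for this illustration.

```python
import numpy as np

def sphere_distance(x, x_prime):
    """Geodesic distance on the unit sphere."""
    return np.arccos(np.clip(x @ x_prime, -1.0, 1.0))

def sphere_exp(x, xi):
    """Exponential map on the unit sphere, for xi tangent at x."""
    norm = np.linalg.norm(xi)
    if norm == 0.0:
        return x.copy()
    return np.cos(norm) * x + np.sin(norm) * xi / norm

def sphere_log(x, x_prime):
    """Logarithm on the unit sphere (not defined for antipodal points)."""
    cos_theta = np.clip(x @ x_prime, -1.0, 1.0)
    direction = x_prime - cos_theta * x
    norm = np.linalg.norm(direction)
    if norm == 0.0:
        return np.zeros_like(x)
    return np.arccos(cos_theta) * direction / norm

def sphere_mean(points, n_iter=50):
    """Barycenter on the sphere by a fixed-point iteration (assumes close points)."""
    mean = points.mean(axis=0)
    mean /= np.linalg.norm(mean)
    for _ in range(n_iter):
        step = np.mean([sphere_log(mean, p) for p in points], axis=0)
        mean = sphere_exp(mean, step)
    return mean

# Synthetic data: points clustered around the north pole.
rng = np.random.default_rng(0)
north = np.array([0.0, 0.0, 1.0])
points = north + 0.1 * rng.standard_normal((20, 3))
points /= np.linalg.norm(points, axis=1, keepdims=True)

# Steps 1 and 2: average, then logarithm at the average.
x_bar = sphere_mean(points)
nu = np.array([sphere_log(x_bar, p) for p in points])
# Step 3 (whitening) is the identity here, because the metric is the Euclidean one.

# Compare Euclidean distances between the vectors with geodesic distances.
euclidean = np.linalg.norm(nu[:, None] - nu[None, :], axis=-1)
geodesic = np.array([[sphere_distance(p, q) for q in points] for p in points])
print(np.max(np.abs(euclidean - geodesic)))   # small when the points are close
```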