A numerical analysis perspective on neural networks

Davide Murari

davide.murari@ntnu.no

In collaboration with: Elena Celledoni, James Jackaman, Brynjulf Owren, Carola-Bibiane Schönlieb, and Ferdia Sherry

\( \mathcal{N}(x) = f_{\theta_M}\circ ... \circ f_{\theta_1}(x)\)

Neural networks motivated by dynamical systems

Time grid: \(t_0 < t_1 < t_2 < \cdots < t_i < t_{i+1} < \cdots < t_M\)
\mathcal{N}(x) = \Psi_{F_M}^{\delta t_M}\circ ...\circ \Psi_{F_1}^{\delta t_1}(x)

\( \dot{x}(t) = F(x(t),\theta(t)) \)

where \(F_i(x) = F(x,\theta(t_i))\) and \( \delta t_i = t_{i}-t_{i-1}\)
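For concreteness, here is a minimal PyTorch sketch of this correspondence, with a hypothetical vector field \(F(x,\theta_i) = B_i\tanh(A_ix+a_i)\) and arbitrary dimensions; each layer is one explicit Euler step \(\Psi_{F_i}^{\delta t_i}(x) = x + \delta t_i F_i(x)\):

```python
import torch
import torch.nn as nn

class EulerBlock(nn.Module):
    """One layer f_theta_i(x) = Psi_{F_i}^{dt_i}(x) = x + dt_i * F(x, theta_i)."""
    def __init__(self, dim, hidden, dt):
        super().__init__()
        self.A = nn.Linear(dim, hidden)
        self.B = nn.Linear(hidden, dim)
        self.dt = dt

    def forward(self, x):
        # Explicit Euler step of x'(t) = F(x(t), theta(t_i))
        return x + self.dt * self.B(torch.tanh(self.A(x)))

# N(x) = Psi_{F_M}^{dt_M} o ... o Psi_{F_1}^{dt_1}(x)
net = nn.Sequential(*[EulerBlock(dim=4, hidden=16, dt=0.1) for _ in range(5)])
x = torch.randn(8, 4)
print(net(x).shape)  # torch.Size([8, 4])
```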


Outline

  • Neural networks motivated by dynamical systems
  • Universal approximation theorem based on gradient flows
  • Approximation theory for ResNets applied to space-time PDE data
  • Defining neural networks with prescribed properties

Pixel-based learning

\(U^{n+1} = \Phi(U^n)\)

GOAL: Approximate the map \(\Phi\)

Predictions at time \(40\delta t\)

Fisher's equation:

\(\partial_t u=\alpha \Delta u+u(1-u)\)
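A minimal sketch of how snapshot pairs \((U^n, U^{n+1})\) of Fisher's equation could be generated, assuming a uniform periodic grid, a 5-point Laplacian, and forward Euler in time; the grid size, \(\alpha\), \(\delta x\) and \(\delta t\) below are illustrative choices only:

```python
import numpy as np

def laplacian(U, dx):
    # 5-point finite-difference stencil with periodic boundary conditions
    return (np.roll(U, 1, 0) + np.roll(U, -1, 0)
            + np.roll(U, 1, 1) + np.roll(U, -1, 1) - 4.0 * U) / dx**2

def phi(U, alpha, dx, dt):
    # One forward Euler step of  u_t = alpha * Lap(u) + u * (1 - u)
    return U + dt * (alpha * laplacian(U, dx) + U * (1.0 - U))

rng = np.random.default_rng(0)
U = rng.uniform(0.0, 1.0, size=(64, 64))       # initial pixel snapshot U^0
alpha, dx, dt = 0.1, 1.0 / 64, 1e-4
pairs = []                                      # snapshot pairs (U^n, U^{n+1})
for _ in range(40):                             # trajectory up to time 40 * dt
    U_next = phi(U, alpha, dx, dt)
    pairs.append((U, U_next))
    U = U_next
```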

Approximation of space-time dynamics of PDEs

\partial_t u = \mathcal{L}u + \sum_{i=1}^n\beta_i\,\partial_{\alpha_{i1}}u\cdot \,...\,\cdot \partial_{\alpha_{im_i}}u,\\ \beta_i\in\mathbb{R},\,\,\alpha_{ij}\in \mathbb{N}^2
u=u(t,x,y)

\(0=t_0 < t_1<...<t_M\)
\(U_{ij}^n = u(t_n,x_i,y_j)\)

Spatial grid with mesh size \(\delta x\)

Method of lines

Initial PDE:
\partial_t u = \mathcal{L}u + \sum_{i=1}^n\beta_i\,\partial_{\alpha_{i1}}u\cdot \,...\,\cdot \partial_{\alpha_{im_i}}u

Semidiscretisation in space with finite differences, with \(U(t)_{ij} \approx u(x_i,y_j,t)\):
\dot{U}(t)=L * U(t)+\sum_{i=1}^n \beta_i\left[\left(D_{i 1} * U(t)\right) \odot \cdots \odot\left(D_{i m_i} * U(t)\right)\right] =: F(U(t))

Approximation of the time dynamics:
U^{n+1} = U^n + \delta t F(U^n)=: \Psi_{F}^{\delta t}(U^n)
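A sketch of this semidiscrete vector field and of the explicit Euler step, assuming periodic boundary conditions; the stencils \(L\), \(D_1\), \(D_2\) (Laplacian and central first derivatives) and the step sizes are placeholder choices:

```python
import numpy as np
from scipy.signal import convolve2d

dx = 0.1
L  = np.array([[0, 1, 0], [1, -4, 1], [0, 1, 0]]) / dx**2        # Laplacian stencil
D1 = np.array([[0, 0, 0], [-1, 0, 1], [0, 0, 0]]) / (2 * dx)     # d/dx stencil
D2 = D1.T                                                        # d/dy stencil

def F(U):
    conv = lambda K: convolve2d(U, K, mode="same", boundary="wrap")
    # F(U) = L * U + (D1 * U) .* (D2 * U), with * a convolution and .* the Hadamard product
    return conv(L) + conv(D1) * conv(D2)

def step(U, dt):
    # Psi_F^{dt}(U) = U + dt * F(U)
    return U + dt * F(U)

U0 = np.random.default_rng(1).random((32, 32))
U1 = step(U0, dt=1e-4)
```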


Finite differences as convolutions

Example: \(R = K*U\), e.g. for the Laplacian
\frac{1}{\delta x^2}\left[\begin{array}{ccc} 0 & 1 & 0 \\ 1 & -4 & 1 \\ 0 & 1 & 0 \end{array}\right] * U=\Delta u\left(t, x_i, y_j\right)+\mathcal{O}\left(\delta x^2\right)
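A quick numerical check of this statement, using the test function \(u(x,y)=\sin x\,\cos y\) on a periodic grid (its Laplacian is \(-2u\)); the resolution is an arbitrary choice:

```python
import numpy as np
from scipy.signal import convolve2d

n = 128
dx = 2 * np.pi / n
x = np.arange(n) * dx
X, Y = np.meshgrid(x, x, indexing="ij")
U = np.sin(X) * np.cos(Y)

K = np.array([[0, 1, 0], [1, -4, 1], [0, 1, 0]]) / dx**2
lap = convolve2d(U, K, mode="same", boundary="wrap")    # K * U, periodic boundary

err = np.max(np.abs(lap - (-2.0 * U)))
print(err)   # decreases like O(dx^2) when n is doubled
```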

Error splitting

\begin{aligned} & \left\|U^{n+1}-\Psi_{F_\theta}^{\delta t}\left(U^n\right)\right\| \\ & =\left\|\mathcal{O}\left(\delta x^2\right)+\Phi_F^{\delta t}\left(U^n\right)-\Psi_{F_\theta}^{\delta t}\left(U^n\right)\right\| \\ & \leq \mathcal{O}\left(\delta x^2\right)+\left\|\Phi_F^{\delta t}\left(U^n\right)-\Psi_F^{\delta t}\left(U^n\right)+\Psi_F^{\delta t}\left(U^n\right)-\Psi_{F_\theta}^{\delta t}\left(U^n\right)\right\| \\ & \leq \underbrace{\mathcal{O}\left(\delta x^2\right)}_{\text {spatial error }}+\underbrace{\left\|\Phi_F^{\delta t}\left(U^n\right)-\Psi_F^{\delta t}\left(U^n\right)\right\|}_{\text {classical error estimate }}+\underbrace{\left\|\Psi_F^{\delta t}\left(U^n\right)-\Psi_{F_\theta}^{\delta t}\left(U^n\right)\right\|}_{\text {neural network approximation }} \\ & \leq \mathcal{O}\left(\delta x^2\right) + \mathcal{O}\left(\delta t^{r+1}\right) + \left\|\Phi_F^{\delta t}\left(U^n\right)-\Phi_{F_\theta}^{\delta t}\left(U^n\right)\right\|\\ & \leq \mathcal{O}\left(\delta x^2\right) + \mathcal{O}\left(\delta t^{r+1}\right) + \delta t \exp (\operatorname{Lip}(F) \delta t) \sup _{\|U\| \leq c}\left\|F_\theta(U)-F(U)\right\| \end{aligned}
Layer of a ResNet we want to study: \(U^n \mapsto \Psi_{F_{\theta}}^{\delta t}(U^n)\)

Splitting:

Useful activation functions

Simple but useful property:

\(x^p = \max(0,x)^p + (-1)^p\max(0,-x)^p\)

Approximation theorem

F(U) = L*U + (D_1*U) \odot (D_2*U)

Simplified setting

2ab = (a+b)^2 - (a^2 + b^2)\\ = \sigma_2(a+b) + \sigma_2(-(a+b))- \\ \left(\sigma_2(a) +\sigma_2(-a) + \sigma_2(b)+\sigma_2(-b) \right)
a = \sigma_1(a) - \sigma_1(-a)
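A small numerical sanity check of these identities (not on the original slides), with \(\sigma_p(x)=\max(0,x)^p\):

```python
import numpy as np

sigma1 = lambda x: np.maximum(0.0, x)        # sigma_p(x) = max(0, x)^p, p = 1
sigma2 = lambda x: np.maximum(0.0, x) ** 2   # p = 2

a, b = np.random.default_rng(2).standard_normal(2)
# x^2 = sigma_2(x) + sigma_2(-x)
assert np.isclose(a**2, sigma2(a) + sigma2(-a))
# 2ab = sigma_2(a+b) + sigma_2(-(a+b)) - (sigma_2(a) + sigma_2(-a) + sigma_2(b) + sigma_2(-b))
assert np.isclose(2 * a * b,
                  sigma2(a + b) + sigma2(-(a + b))
                  - (sigma2(a) + sigma2(-a) + sigma2(b) + sigma2(-b)))
# a = sigma_1(a) - sigma_1(-a)
assert np.isclose(a, sigma1(a) - sigma1(-a))
```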


Approximation theorem

F_{\theta}(U) := \boldsymbol{B} * \sigma(\boldsymbol{A}*U)
\begin{aligned} & \left\|U^{n+1}-\Psi_{F_\theta}^{\delta t}\left(U^n\right)\right\|\\ &\leq \mathcal{O}\left(\delta x^2\right) + \mathcal{O}\left(\delta t^{r+1}\right) + \delta t \exp (\operatorname{Lip}(F) \delta t) \sup _{\|U\| \leq c}\left\|F_\theta(U)-F(U)\right\| \end{aligned}

This term can, in principle, be made zero; in practice, it depends on the optimisation error.
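A minimal PyTorch sketch of such a vector field and of the resulting one-step map \(\Psi_{F_\theta}^{\delta t}(U) = U + \delta t\,F_\theta(U)\); channel counts, kernel size and the activation \(\sigma(x)=\max(0,x)^2\) are illustrative assumptions, not the exact architecture of the result:

```python
import torch
import torch.nn as nn

class ConvVectorField(nn.Module):
    """F_theta(U) = B * sigma(A * U), with A and B convolutional kernels."""
    def __init__(self, channels=1, hidden=16, kernel=3):
        super().__init__()
        pad = kernel // 2
        self.A = nn.Conv2d(channels, hidden, kernel, padding=pad, padding_mode="circular")
        self.B = nn.Conv2d(hidden, channels, kernel, padding=pad, padding_mode="circular")

    def forward(self, U):
        return self.B(torch.relu(self.A(U)) ** 2)   # sigma(x) = max(0, x)^2

def psi(F_theta, U, dt):
    # Psi_{F_theta}^{dt}(U) = U + dt * F_theta(U), trained so that psi(U^n) ~ U^{n+1}
    return U + dt * F_theta(U)

F_theta = ConvVectorField()
U = torch.rand(8, 1, 64, 64)            # a batch of pixel snapshots U^n
U_next = psi(F_theta, U, dt=1e-3)
```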

Examples of problems with a specific structure

What if I want a network to satisfy a certain property?

GENERAL IDEA

EXAMPLE

Property \(\mathcal{P}\)

\(\mathcal{P}=\) Volume preservation

Family \(\mathcal{F}\) of vector fields that satisfy \(\mathcal{P}\)

\(F_{\theta}(x,v) = \begin{bmatrix} \Sigma(Av+a) \\ \Sigma(Bx+b)  \end{bmatrix} \)

\(\mathcal{F}=\{F_{\theta}:\,\,\theta\in\mathcal{P}\}\)

Integrator \(\Psi^{\delta t}\) that preserves \(\mathcal{P}\)

\mathcal{N}(x) = \Psi_{F_M}^{\delta t_M}\circ ...\circ \Psi_{F_1}^{\delta t_1}(x)
x_{n+1}=x_n+\delta t\Sigma(A_nv_n+a_n)\\ v_{n+1}=v_n+\delta t\Sigma(B_nx_{n+1}+b_n)\\ (x_{n+1},v_{n+1}) = \Psi^{\delta t}_{F_{\theta}}(x_n,v_n)
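A minimal sketch of one such layer; the weights are random placeholders and \(\Sigma=\tanh\) is an illustrative choice. Each half-update is a shear, so its Jacobian has unit determinant and the composed layer is exactly volume preserving:

```python
import numpy as np

def vp_layer(x, v, A, a, B, b, dt, Sigma=np.tanh):
    # x_{n+1} = x_n + dt * Sigma(A v_n + a)        (shear in x: unit Jacobian determinant)
    x_new = x + dt * Sigma(A @ v + a)
    # v_{n+1} = v_n + dt * Sigma(B x_{n+1} + b)    (shear in v: unit Jacobian determinant)
    v_new = v + dt * Sigma(B @ x_new + b)
    return x_new, v_new

d = 3
rng = np.random.default_rng(3)
A, B = rng.standard_normal((d, d)), rng.standard_normal((d, d))
a, b = rng.standard_normal(d), rng.standard_normal(d)
x, v = rng.standard_normal(d), rng.standard_normal(d)
x1, v1 = vp_layer(x, v, A, a, B, b, dt=0.1)
```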


Choice of dynamical systems

\dot{x}(t) = \left[A(\theta(t),x(t))-A(\theta(t),x(t))^T\right]\boldsymbol{1} = F(t,x(t))\\ \mathrm{vec}\left[A(\theta(t),x(t))\right] = V^T(t)\Sigma(W(t)x(t)+w(t))\in\mathbb{R}^{n^2}\\ I(x) = \sum_{i=1}^n x_i = \boldsymbol{1}^Tx\\ \implies \nabla I(x)^TF(t,x)=\boldsymbol{1}^TF(t,x) = 0

MASS-PRESERVING NEURAL NETWORKS

SYMPLECTIC NEURAL NETWORKS

\dot{x}(t) = \mathbb{J}\nabla_x H(\theta(t),x(t))=F(t,x(t))\in\mathbb{R}^{2n}
\mathbb{J} = \begin{bmatrix} 0_n & I_n \\ -I_n & 0_n\end{bmatrix}\in\mathbb{R}^{2n\times 2n}
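A small numerical check of the mass-preserving construction: with random placeholder weights, \(\boldsymbol{1}^TF(t,x)=0\) holds by skew-symmetry:

```python
import numpy as np

rng = np.random.default_rng(4)
n, hidden = 4, 8
V = rng.standard_normal((hidden, n * n))   # placeholder weights for vec(A) = V^T Sigma(W x + w)
W = rng.standard_normal((hidden, n))
w = rng.standard_normal(hidden)

def F(x):
    A = (V.T @ np.tanh(W @ x + w)).reshape(n, n)   # vec(A(theta, x)) = V^T Sigma(W x + w)
    return (A - A.T) @ np.ones(n)                  # F(x) = [A - A^T] 1

x = rng.standard_normal(n)
print(np.ones(n) @ F(x))   # = 0 up to round-off, since 1^T (A - A^T) 1 = 0
```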


Choice of numerical methods

I(x) = \boldsymbol{1}^Tx,\quad x_{n+1} = x_n + \delta tF(t_n,x_n)\\ I(x_{n+1}) = \boldsymbol{1}^Tx_{n+1} = \boldsymbol{1}^Tx_{n} + \delta t\boldsymbol{1}^TF(t_n,x_n)\\ = \boldsymbol{1}^Tx_{n} = I(x_n)

MASS-PRESERVING NEURAL NETWORKS

SYMPLECTIC NEURAL NETWORKS

Any Runge-Kutta method

Any Symplectic numerical method
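The same cancellation can be observed along a trajectory: explicit Euler (and, as stated above, any Runge-Kutta method) conserves the linear invariant \(I(x)=\boldsymbol{1}^Tx\) exactly. Below, a short check on a hand-written vector field of the skew-symmetric form (an illustrative example, not a trained network):

```python
import numpy as np

def F(y):
    # F(y) = [A(y) - A(y)^T] 1 for a hand-picked A(y), so 1^T F(y) = 0
    A = np.array([[0.0, -y[2] * y[0] ** 2, y[1] * y[2]],
                  [0.0,  0.0,             -np.sin(y[0])],
                  [0.0,  0.0,              0.0]])
    return (A - A.T) @ np.ones(3)

y = np.array([0.5, -1.0, 2.0])
mass, dt, drift = np.ones(3) @ y, 0.01, 0.0
for _ in range(50):
    y = y + dt * F(y)                               # explicit Euler step
    drift = max(drift, abs(np.ones(3) @ y - mass))
print(drift)   # of the order of machine precision: I(x_n) = I(x_0) for all n
```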


Can we still approximate many functions?

F:\Omega\subset\mathbb{R}^n\rightarrow\mathbb{R}^n\quad \text{continuous}
\forall \varepsilon>0\,\,\exists f_1,...,f_k\in\mathcal{C}^1(\mathbb{R}^n,\mathbb{R}^n)\,\,\text{s.t.}
\|F-\Phi_{f_k}^{\delta t_k}\circ ... \circ \Phi_{f_1}^{\delta t_1}\|<\varepsilon

Presnov Decomposition

f_i(x) = \nabla U_i(x) + X_S^i(x)
U_i(x) = \int_0^1 x^Tf_i(tx)dt
x^TX_S^i(x)=0\quad \forall x\in\mathbb{R}^n

Then \(F\) can be approximated arbitrarily well by composing flow maps of gradient and sphere-preserving vector fields.

Idea of the proof

\Phi_{f_i}^{\delta t} = \Phi_{f_i}^{\alpha_M \delta t} \circ ... \circ \Phi_{f_i}^{\alpha_1 \delta t}
\sum_{i=1}^M \alpha_i = 1
f = \nabla U + X_S \\ \implies \Phi_f^{\delta t}(x) = \Phi_{\nabla U}^{\delta t} \circ \Phi_{X_S}^{\delta t}(x) + \mathcal{O}(\delta t^2)
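The Presnov decomposition can also be checked numerically for a hypothetical smooth vector field \(f\), approximating \(U\) by quadrature and \(\nabla U\) by central differences:

```python
import numpy as np

def f(x):
    # Hypothetical smooth vector field to decompose
    return np.array([np.sin(x[1]), x[0] * x[2], np.cos(x[0]) + x[1] ** 2])

def U(x, m=2000):
    # U(x) = int_0^1 x^T f(t x) dt, midpoint quadrature
    t = (np.arange(m) + 0.5) / m
    return np.mean([x @ f(ti * x) for ti in t])

def grad_U(x, h=1e-5):
    g = np.zeros_like(x)
    for i in range(len(x)):
        e = np.zeros_like(x); e[i] = h
        g[i] = (U(x + e) - U(x - e)) / (2 * h)   # central finite differences
    return g

x = np.array([0.3, -0.7, 1.1])
X_S = f(x) - grad_U(x)      # sphere-preserving part
print(x @ X_S)              # ~ 0 (up to quadrature and finite-difference error): x^T X_S(x) = 0
```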

Thank you for the attention

  • Celledoni, E., Jackaman, J., Murari, D., & Owren, B. (2023). Predictions Based on Pixel Data: Insights from PDEs and Finite Differences. Preprint.
  • Celledoni, E., Murari, D., Owren, B., Schönlieb, C.-B., & Sherry, F. (2022). Dynamical Systems' Based Neural Networks. Preprint.
Mass-preserving networks

\dot{x}(t) = \{A_{\theta}(x(t))-A_{\theta}(x(t))^T\}\boldsymbol{1},\quad \mathrm{vec}(A_{\theta}(x)) = \Sigma(Ux+u)
\dot{y} = \begin{bmatrix} 0 & -y_3y_1^2 & y_2y_3 \\ y_3y_1^2 & 0 & -\sin{y_1} \\ -y_2y_3 & \sin{y_1} & 0\end{bmatrix}\boldsymbol{1}

Adversarial robustness

\(X\), label: Plane
\(X+\delta\), \(\|\delta\|_2=0.3\), label: Cat


Lipschitz-constrained networks

X_{\theta_i}(x) := - \nabla V_{\theta_i}(x) = -A_i^T\Sigma(A_ix+b_i)
\Psi^{\delta t_C}_{X_{\theta_i}}(x) = x - {\delta t_C}A_i^T\Sigma(A_ix+b_i)
Y_{\theta_i}(x) := W_i^T\Sigma(W_ix + v_i)
\|\Psi^{\delta t_C}_{X_{\theta_i}}(y) - \Psi^{\delta t_C}_{X_{\theta_i}}(x)\|\leq \sqrt{1-{\delta t_C}+{\delta t_C}^2}\|y-x\|
\Psi^{\delta t_E}_{Y_{\theta_i}}(x) = x + {\delta t_E}W_i^T\Sigma(W_ix+v_i)
\|\Psi^{\delta t_E}_{Y_{\theta_i}}(y) - \Psi^{\delta t_E}_{Y_{\theta_i}}(x)\|\leq (1+{\delta t_E})\|y-x\|
\mathcal{N}(x)=\Psi_{X_{\theta_{2k}}}^{\delta t_{2k}} \circ \Psi_{Y_{\theta_{2k-1}}}^{\delta t_{2k-1}} \circ ... \circ \Psi_{X_{\theta_2}}^{\delta t_2} \circ \Psi_{Y_{\theta_{1}}}^{\delta t_1}(x)

We impose:

\|\mathcal{N}(x)-\mathcal{N}(y)\|\leq \|x-y\|
\sqrt{1-{\delta t_C}+{\delta t_C}^2}(1+\delta t_E)\leq 1
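A minimal PyTorch sketch of the two layer types and of the step-size constraint; the weights are random placeholders, normalised so that \(\|A_i\|_2,\|W_i\|_2\le 1\) (an assumption under which bounds of this type are derived), and \(\Sigma=\mathrm{ReLU}\) is applied entrywise:

```python
import torch

def contractive(x, A, b, dt_C):
    # Psi^{dt_C}_{X_theta}(x) = x - dt_C * A^T Sigma(A x + b)   (gradient-flow step)
    return x - dt_C * torch.relu(x @ A.T + b) @ A

def expansive(x, W, v, dt_E):
    # Psi^{dt_E}_{Y_theta}(x) = x + dt_E * W^T Sigma(W x + v)
    return x + dt_E * torch.relu(x @ W.T + v) @ W

d, dt_C = 10, 0.5
# choose dt_E so that sqrt(1 - dt_C + dt_C^2) * (1 + dt_E) <= 1
dt_E = 1.0 / (1.0 - dt_C + dt_C**2) ** 0.5 - 1.0

A = torch.randn(d, d); A = A / torch.linalg.matrix_norm(A, ord=2)
W = torch.randn(d, d); W = W / torch.linalg.matrix_norm(W, ord=2)
b, v = torch.zeros(d), torch.zeros(d)

def block(x):
    # one pair Psi_X o Psi_Y from the composition defining N(x)
    return contractive(expansive(x, W, v, dt_E), A, b, dt_C)

x, y = torch.randn(d), torch.randn(d)
print(torch.norm(block(x) - block(y)) / torch.norm(x - y))  # empirical Lipschitz ratio for one pair
```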

