Dyego Araújo
Roberto Imbuzeiro Oliveira
Daniel Yukimura
Huge improvements & lots of vulnerabilities!
At the end of the talk, we will still not know the answers ...
... but we will get a glimpse of mathematical methods we should use to describe the networks.
[Diagram: a single unit with inputs \(x_1,x_2,x_3\), weights \(\theta_1,\theta_2,\theta_3\) and \(a,b\), producing the output \(a\phi(\sum_ix_i\theta_i + b)\).]
(Rosenblatt'59)
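For orientation, \(N\) such units stacked in parallel give a one-hidden-layer network (the \(1/N\) mean-field normalization below is an illustrative choice; other scalings are common):
\[\widehat{y}(x) = \frac{1}{N}\sum_{i=1}^{N} a_i\,\phi\Big(\sum_j x_j\theta_{ij} + b_i\Big),\]
with parameters \((\theta_i,a_i,b_i)_{i=1}^{N}\).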
Total # of parameters \(10^7\) to \(10^9 \).
Data: points \((X_i,Y_i)\stackrel{\rm i.i.d.}{\sim}\,P\) on \(\R^{d_X}\times \R^{d_Y}\).
Network: parameterized function \(\widehat{y}:\R^{d_X}\times \R^{D_N}\to \R^{d_Y}\).
Train: (stochastic online) gradient descent
\[\theta^{(k+1)} = \theta^{(k)} + \epsilon \alpha^{(k)}\,(Y_{k+1}-\widehat{y}(X_{k+1},\theta^{(k)}))^T \nabla_\theta \widehat{y}(X_{k+1},\theta^{(k)}).\]
In expectation, each step follows the negative gradient of the loss:
\[L(\theta):=\frac{1}{2}\mathbb{E}_{(X,Y)\sim P}\|Y - \widehat{y}(X,\theta)\|^2 .\]
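This is a one-line check: averaging over the fresh sample \((X_{k+1},Y_{k+1})\sim P\), which is independent of \(\theta^{(k)}\), and exchanging gradient and expectation,
\[\mathbb{E}\big[(Y-\widehat{y}(X,\theta))^T\nabla_\theta \widehat{y}(X,\theta)\big] = -\nabla_\theta\,\tfrac{1}{2}\,\mathbb{E}_{(X,Y)\sim P}\|Y-\widehat{y}(X,\theta)\|^2 = -\nabla_\theta L(\theta),\]
so the expected increment of \(\theta^{(k)}\) is \(-\epsilon\,\alpha^{(k)}\nabla_\theta L(\theta^{(k)})\).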
Source: https://arxiv.org/abs/1812.11118
Belkin, Hsu, Ma & Mandal (2018)
(*) See Ongie, Willett, Soudry & Srebro https://arxiv.org/abs/1910.01635
\(x\mapsto h_{\mu}(x):= \int\,h(x,\theta)\,d\mu(\theta)\) where \(\mu\in M_1(\mathbb{R}^{D})\).
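For instance, a width-\(N\) network whose weights \(\theta_1,\dots,\theta_N\) are i.i.d. samples from \(\mu\) realizes the empirical version of this map,
\[\frac{1}{N}\sum_{i=1}^{N} h(x,\theta_i) = \int\,h(x,\theta)\,d\widehat{\mu}_N(\theta),\qquad \widehat{\mu}_N:=\frac{1}{N}\sum_{i=1}^{N}\delta_{\theta_i},\]
and a law of large numbers gives convergence to \(h_{\mu}(x)\) as \(N\to\infty\).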
Independent random initialization of weights
Layers \(\ell=0,1,\dots,L,L+1\):
Each layer \(\ell\) has \(N_\ell\) units of dimension \(d_\ell\).
"Weight" \(\theta^{(\ell)}_{i_\ell,i_{\ell+1}}\in\mathbb{R}^{D_\ell}\):
Full vector of weights: \(\vec{\theta}_N\).
Initialization:
At output layer \(L+1\), \(N_{L+1}=1\).
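For orientation, a schematic forward pass in this notation would be (a hedged sketch: the precise layer update, activation \(\sigma\), and normalization used in the talk may differ)
\[z^{(0)}_{i_0}(x):=x,\qquad z^{(\ell+1)}_{i_{\ell+1}}(x):=\frac{1}{N_\ell}\sum_{i_\ell=1}^{N_\ell}\sigma\big(z^{(\ell)}_{i_\ell}(x),\,\theta^{(\ell)}_{i_\ell,i_{\ell+1}}\big),\qquad \widehat{y}(x,\vec{\theta}_N):=z^{(L+1)}_{1}(x),\]
where the \(1/N_\ell\) factors provide the mean-field scaling.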
Assume \(L\geq 3\) hidden layers + technicalities on \(\sigma\).
One can couple the weights \(\theta^{(\ell)}_{i_\ell,i_{\ell+1}}(t)\) to certain limiting random variables \(\overline{\theta}^{(\ell)}_{i_\ell,i_{\ell+1}}(t)\) with small error:
Limiting random variables coincide with the weights at time 0 & satisfy a series of properties we will now describe.
Full independence except at 1st and Lth hidden layers:
The following are all independent from one another.
1st and Lth hidden layers are deterministic functions:
Distribution \(\mu_t\) of limiting weights along a path at time \(t\),
\[(A_{i_1},\overline{\theta}^{(1)}_{i_1,i_2}(t),\overline{\theta}^{(2)}_{i_2,i_3}(t),\dots,\overline{\theta}^{(L)}_{i_L,i_{L+1}}(t),B_{i_{L+1}})\sim \mu_t\]
has the following factorization into independent components:
\[\mu_t = \mu^{(0,1)}_t\otimes \mu^{(2)}_t \otimes \dots \otimes \mu_{t}^{(L-1)}\otimes \mu_t^{(L,L+1)}.\]
Contrast with time 0 (full product).
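For instance, with \(L=3\) hidden layers the factorization reads \(\mu_t=\mu^{(0,1)}_t\otimes\mu^{(2)}_t\otimes\mu^{(3,4)}_t\): the pair \((A_{i_1},\overline{\theta}^{(1)}_{i_1,i_2}(t))\) may be dependent, as may \((\overline{\theta}^{(3)}_{i_3,i_4}(t),B_{i_4})\), but these two blocks and the middle weight \(\overline{\theta}^{(2)}_{i_2,i_3}(t)\) are mutually independent.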
At any time \(t\), the loss of the function \(\widehat{y}(x,\vec{\theta}_N(t))\) is approximately the loss of a composition of functions of the generic form \(\int\,h(x,\theta)\,d\mu(\theta)\). Specifically,
\[L_N(\vec{\theta}_N(t))\approx \frac{1}{2}\mathbb{E}_{(X,Y)\sim P }\,\|Y - \overline{y}(X,\mu_t)\|^2\] where
Backpropagation (Rumelhart, Hinton & Williams '86): the chain rule describes partial derivatives as sums over paths going backward and forward through the network.
Ansatz: Terms involving a large number of random weights can be replaced by interactions with their density (a law of large numbers).
\(\Rightarrow\) McKean-Vlasov type of behavior.
Consider:
Self-consistency: \(Z(t)\sim \mu_t\) for all times \(t\geq 0\)
Density \(p(t,x)\) of \(\mu_t\) evolves according to a (possibly) nonlinear PDE.
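Schematically, for a generic McKean–Vlasov dynamics (the actual drift in the deep case is more involved and, as discussed below, discontinuous):
\[\frac{d}{dt}Z(t)=b\big(Z(t),\mu_t\big),\qquad \mu_t=\mathrm{Law}(Z(t)),\qquad \partial_t p(t,x) = -\nabla_x\cdot\big(p(t,x)\,b(x,\mu_t)\big),\]
where the dependence of the drift \(b\) on the current law \(\mu_t\) is what makes the continuity equation nonlinear.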
In the case of shallow (\(L=1\)) nets, this corresponds to a
gradient flow with convex potential in the space of probability measures over \(\R^D\).
Ambrosio, Gigli, Savaré, Figalli, Otto, Villani...
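In that shallow setting the limiting evolution can be written, schematically, as a Wasserstein gradient flow of a convex risk functional (in the spirit of the references below):
\[\partial_t\mu_t = \nabla_\theta\cdot\Big(\mu_t\,\nabla_\theta\,\frac{\delta F}{\delta \mu}(\mu_t)\Big),\qquad F(\mu)=\frac{1}{2}\,\mathbb{E}_{(X,Y)\sim P}\,\|Y-h_{\mu}(X)\|^2.\]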
Shallow:
Mei, Montanari & Nguyen (2018); Sirignano & Spiliopoulos (2018); Rotskoff & Vanden-Eijnden (2018).
"Easy case".
Deep (before us):
Modified limit by S&S (2019).
Heuristic by Nguyen (2019).
Hard case: discontinuous drift in McKean-Vlasov
Dependencies in our system are trickier.
Shallow: weights remain i.i.d. at all times when \(N\gg 1\); direct connection between the finite system and i.i.d. McKean–Vlasov trajectories.
Deep: trajectories are not i.i.d. even in the limit, because paths through the network intersect. Moreover, the McKean–Vlasov drift is discontinuous due to the conditioning appearing in it.
It looks like this: [image, public domain / Wikipedia].
\(x\mapsto h_{\mu}(x):= \int\,h(x,\theta)\,d\mu(\theta)\) where \(\mu\in M_1(\mathbb{R}^{D})\).