\[ \sigma(x;w,a,b) = a \phi(w\cdot x + b) \]
where \(x\in\R^d\) is the input (a vector),
\(\phi:\R\to\R\) is a non-affine function,
and \(w\in\R^d\), \(a\in\R\), \(b\in\R\) are weights.
Common choices for \(\phi\): the sigmoid \(\phi(t)=1/(1+e^{-t})\), the hyperbolic tangent, the ReLU \(\phi(t)=\max\{t,0\}\).
Fitting to data \((X_i,Y_i)_{i=1}^n\): minimize the empirical loss
\[\widehat{L}_n(w,a,b):=\frac{1}{n}\sum_{i=1}^n (Y_i - \sigma(X_i;w,a,b))^2.\]
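A minimal numpy sketch of a single perceptron and its empirical loss (the ReLU choice of \(\phi\) and all names are illustrative, not prescribed by the talk):

```python
import numpy as np

def phi(t):
    # illustrative non-affine activation: ReLU
    return np.maximum(t, 0.0)

def perceptron(x, w, a, b):
    # sigma(x; w, a, b) = a * phi(w . x + b)
    return a * phi(np.dot(w, x) + b)

def empirical_loss(X, Y, w, a, b):
    # (1/n) * sum_i (Y_i - sigma(X_i; w, a, b))^2
    preds = np.array([perceptron(x, w, a, b) for x in X])
    return np.mean((Y - preds) ** 2)
```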
Grey layer = vector of perceptrons:
\(\vec{h}=(a_i\phi(w_i\cdot x+b_i))_{i=1}^N.\)
Output = another perceptron from the grey layer to the green one:
\(\widehat{y}=a\phi(w\cdot\vec{h}+b).\)
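A hedged numpy sketch of this forward pass, one hidden layer of \(N\) perceptrons feeding a single output perceptron (all names and the ReLU activation are illustrative):

```python
import numpy as np

def phi(t):
    # illustrative activation; the talk leaves phi generic
    return np.maximum(t, 0.0)

def shallow_net(x, W, a_hid, b_hid, w_out, a_out, b_out):
    """One hidden layer of N perceptrons followed by an output perceptron."""
    h = a_hid * phi(W @ x + b_hid)                  # hidden vector: h_i = a_i * phi(w_i . x + b_i)
    return a_out * phi(np.dot(w_out, h) + b_out)    # output: y_hat = a * phi(w . h + b)

# tiny usage example with random weights (d = 3 inputs, N = 5 hidden perceptrons)
rng = np.random.default_rng(0)
d, N = 3, 5
x = rng.normal(size=d)
y_hat = shallow_net(x, rng.normal(size=(N, d)), rng.normal(size=N), rng.normal(size=N),
                    rng.normal(size=N), 1.0, 0.0)
```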
Huge literature on how to fit these models. The basic tool is gradient descent:
\[\theta^{(k+1)} = \theta^{(k)} - \alpha(k)\nabla\widehat{L}(\theta^{(k)})\]
"Backpropagation" (Hinton): compact form for writing gradients via chain rule.
Apparently yes.
\(O(10^1)\)–\(O(10^2)\) layers.
\(O(10^7)\) or more neurons per layer.
ImageNet database with \(O(10^9)\) images.
Requires a lot of computational power.
Nobody knows.
This talk: some theorems towards an explanation.
Error rates down from 21% to 1%.
Leo Breiman,
"Statistical Modeling: The Two Cultures" (with discussion),
Statistical Science 16(3) (2001).
\[L(\theta):= \mathbb{E}_{(X,Y)\sim P}(Y-f(X;\theta))^2.\]
In practice, computing the exact expectation is impossible, but we have an i.i.d. sample:
\[(X_1,Y_1),(X_2,Y_2),\dots,(X_n,Y_n)\sim P.\]
Problem: how can we use the data to find a nearly optimal \(\theta\)?
Idea: replace expected loss by empirical loss and optimize that instead.
\[\widehat{L}_n(\theta):= \frac{1}{n}\sum_{i=1}^n(Y_i-f(X_i;\theta))^2.\]
Let \(\widehat{\theta}_n\) be a minimizer of \(\widehat{L}_n\).
Law of large numbers: for large \(n\),
\[ \frac{1}{n}\sum_{i=1}^n(Y_i-f(X_i;\theta))^2 \approx \mathbb{E}_{(X,Y)\sim P}(Y-f(X;\theta))^2 \]
\[\Rightarrow \widehat{L}_n(\theta)\approx L(\theta).\]
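A quick numerical sanity check of this approximation (a sketch with an assumed synthetic distribution \(P\); the toy model, data and sample sizes are all illustrative): the empirical loss on \(n\) samples should be close to a large-sample Monte Carlo proxy for the population loss.

```python
import numpy as np

rng = np.random.default_rng(0)

def f(x, theta):
    # toy model: a single tanh perceptron, theta = (w1, w2, a, b), d = 2
    w, a, b = theta[:2], theta[2], theta[3]
    return a * np.tanh(x @ w + b)

def loss(theta, X, Y):
    # empirical squared loss (1/n) sum_i (Y_i - f(X_i; theta))^2
    return np.mean((Y - f(X, theta)) ** 2)

def sample(n):
    # assumed synthetic distribution P: Gaussian inputs, noisy sine response
    X = rng.normal(size=(n, 2))
    Y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=n)
    return X, Y

theta = np.array([1.0, -0.5, 0.8, 0.1])
X_n, Y_n = sample(1_000)            # "the data"
X_big, Y_big = sample(1_000_000)    # Monte Carlo proxy for the expectation under P
print(loss(theta, X_n, Y_n), loss(theta, X_big, Y_big))  # the two numbers should be close
```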
Mathematical theory explains when and how this works: Vapnik, Chervonenkis, Devroye, Lugosi, Koltchinskii, Mendelson...
Bias: if \(\Theta\) is "simple", then \(\mathbb{E}(Y-f(X;\theta))^2\) is large for any choice of \(\theta\).
Variance: if \(\Theta\) is "complex", the Law of Large Numbers may not kick in at the available sample size (see the decomposition sketched below).
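One standard way to make this trade-off precise (a sketch in the notation above, not stated in the talk; the infimum over \(f\) ranges over all measurable functions):
\[
L(\widehat{\theta}_n)-\inf_{f}\mathbb{E}(Y-f(X))^2
=\underbrace{\Big(L(\widehat{\theta}_n)-\inf_{\theta\in\Theta}L(\theta)\Big)}_{\text{estimation error (variance)}}
+\underbrace{\Big(\inf_{\theta\in\Theta}L(\theta)-\inf_{f}\mathbb{E}(Y-f(X))^2\Big)}_{\text{approximation error (bias)}}.
\]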
Traditional: overfitting \(\approx\) interpolation
Assume \[Y=f_*(X) + \text{noise}. \] Then some parameter setting with \(N\to +\infty\) neurons will give you a bias that is as small as you like.
Kurt Hornik (1991), "Approximation Capabilities of Multilayer Feedforward Networks", Neural Networks 4(2), 251–257.
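A version of the statement, sketched here from memory (so treat the exact hypotheses as an assumption): if \(\phi\) is continuous, bounded and nonconstant, then for every compact \(K\subset\R^d\), every continuous \(f_*:K\to\R\) and every \(\varepsilon>0\) there exist \(N\) and weights \((a_i,w_i,b_i)_{i=1}^N\) such that
\[
\sup_{x\in K}\,\Big|f_*(x)-\sum_{i=1}^N a_i\,\phi(w_i\cdot x+b_i)\Big|<\varepsilon.
\]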
Traditional theory was designed for convex problems.
DNNs are far from convex.
There is no guarantee that gradient descent converges to a global minimum.
Local minima may or may not be good.
DNNs are capable of interpolation. Current theory is useless.
Zhang et al., "Understanding deep learning requires rethinking generalization", ICLR 2017.
Theory for methods/estimators that converge to local minima.
(Arora, Ge, Jordan, Ma, Loh, Wainwright, etc)
One can find very simple (non-DNN) methods that interpolate and still perform well in some cases.
(Rakhlin, Belkin, Montanari, Mei, etc)
Problem: none of this is about neural networks.
Evolution of \(\mu(t)\): gradient flow in the space of probability measures over \(\R^D\).
PDEs studied by Ambrosio, Gigli, Savaré; Villani; Otto; etc.
\(\Rightarrow\) convexity in the limit.
Shallow nets are a discretization of a convex nonparametric method.
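In more detail, a sketch of the standard mean-field picture for shallow nets (assuming a linear output layer and the usual \(1/N\) scaling, so each hidden unit's parameters \((a,w,b)\) live in \(\R^D\) with \(D=d+2\)): write the network as an integral against a probability measure \(\mu\) over parameters,
\[
f(x;\mu)=\int_{\R^D} a\,\phi(w\cdot x+b)\,\mu(da,dw,db),
\qquad
L(\mu)=\mathbb{E}_{(X,Y)\sim P}\big(Y-f(X;\mu)\big)^2 .
\]
Since \(f(x;\mu)\) is linear in \(\mu\), the loss \(L\) is convex on the space of probability measures. A width-\(N\) network corresponds to the empirical measure \(\mu=\frac{1}{N}\sum_{i=1}^N\delta_{(a_i,w_i,b_i)}\), and gradient descent on the weights approximates, as \(N\to\infty\), the Wasserstein gradient flow of \(L\).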
How does one extend that to deep nets?
Joint with Dyego Araújo (IMPA) and Daniel Yukimura (IMPA/NYU).
"A mean field model for certain deep neural networks"
https://arxiv.org/abs/1906.00193