Empirical Risk Minimization - Part 2
Study Group on Learning Theory
Daniel Yukimura
Goals for today:
- 4.5 - Rademacher Complexity
- 4.6 - Relationship with asymptotic statistics
Quick Overview:
Learning Problem:
(X,Y)\sim \mathbb{P}
\text{Find } f:\mathcal{X}\rightarrow \mathcal{Y} \text{ that minimizes the risk}
\mathcal{R}(f) = \mathbb{E}\left[ \ell(Y, f(X)) \right]
Empirical Risk Minimization:
\text{Given an i.i.d. sample } \left\{ X_i, Y_i \right\}_{i=1}^n, \text{ find}
\hat{f} \in \argmin\limits_{f\in\mathcal{F}} \underbrace{\frac{1}{n} \sum\limits_{i=1}^n \ell\left( Y_i, f(X_i) \right)}_{=\, \hat{\mathcal{R}}(f)}
Error decomposition:
\underbrace{\mathcal{R}(\hat{f}) - \mathcal{R}^*}_{\text{excess risk}} = \underbrace{\mathcal{R}(\hat{f}) - \inf\limits_{f'\in \mathcal{F}} \mathcal{R}(f')}_{\text{estimation error}} + \underbrace{\inf\limits_{f'\in \mathcal{F}} \mathcal{R}(f') - \mathcal{R}^*}_{\text{approximation error}}
Uniform Control:
\mathcal{R}(\hat{f}) - \inf\limits_{f\in \mathcal{F}} \mathcal{R}(f) \leq 2 \sup\limits_{f\in\mathcal{F}} \left|\hat{\mathcal{R}}(f) - \mathcal{R}(f)\right|
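The uniform-control inequality is deterministic once \(\hat{f}\) minimizes the empirical risk, so it can be checked directly on a toy problem. A minimal sketch, assuming a hypothetical finite class of 1-D threshold classifiers with 0-1 loss (the class, data distribution, and noise level are invented for illustration; the "population" risk is approximated by a large sample):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical finite class: threshold classifiers f_t(x) = 1[x > t].
thresholds = np.linspace(0.0, 1.0, 21)

def risk(t, X, Y):
    # empirical 0-1 risk of the threshold classifier 1[x > t]
    return np.mean((X > t).astype(int) != Y)

# "Population" approximated by a large sample; Y = 1[X > 0.3] with 10% label noise.
X_pop = rng.uniform(size=200_000)
flips = (rng.uniform(size=X_pop.size) < 0.1).astype(int)
Y_pop = (X_pop > 0.3).astype(int) ^ flips

R = np.array([risk(t, X_pop, Y_pop) for t in thresholds])  # ~ true risks

# Training sample and empirical risk minimizer over the class.
n = 200
X, Y = X_pop[:n], Y_pop[:n]
R_hat = np.array([risk(t, X, Y) for t in thresholds])
f_hat = np.argmin(R_hat)

estimation_error = R[f_hat] - R.min()
uniform_dev = np.abs(R_hat - R).max()

# Uniform control: estimation error <= 2 * sup_f |Rhat(f) - R(f)|.
assert estimation_error <= 2 * uniform_dev
```

The assertion holds for any realization of the sample: it follows from the three-term decomposition \(\mathcal{R}(\hat f)-\mathcal{R}(f^\star) = [\mathcal{R}(\hat f)-\hat{\mathcal{R}}(\hat f)] + [\hat{\mathcal{R}}(\hat f)-\hat{\mathcal{R}}(f^\star)] + [\hat{\mathcal{R}}(f^\star)-\mathcal{R}(f^\star)]\), with the middle term nonpositive.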
4.5 Rademacher Complexity
Context:
- Domain space: \mathcal{Z} = \mathcal{X}\times \mathcal{Y}
- Data: \mathcal{D} = \left\{ Z_1, Z_2,\dots, Z_n \right\}, \hspace{2mm} Z_i = (X_i, Y_i)\sim \mathbb{P}, \text{ i.i.d.}
- Class of functions: \mathcal{H} = \left\{ (x,y) \mapsto \ell(y, f(x)) : f\in\mathcal{F} \right\}
- Uniform deviation: \sup\limits_{f\in\mathcal{F}} \left(\hat{\mathcal{R}}(f) - \mathcal{R}(f)\right) = \sup\limits_{h\in\mathcal{H}} \left( \frac{1}{n}\sum\limits_{i=1}^n h(Z_i) - \mathbb{E}\left[ h(Z) \right]\right)
Rademacher Complexity:
\text{Given } \mathcal{H}\subseteq \mathcal{F}(\mathcal{Z}, \mathbb{R}), \text{ an i.i.d. sample } Z_i \sim \mathbb{P} \text{ in } \mathcal{Z}, \text{ and } \varepsilon\in\{-1,1\}^n \text{ with i.i.d. } \varepsilon_i\sim\mathcal{U}(\{-1,1\}):
R_n(\mathcal{H}) = \mathbb{E}_{\varepsilon, \mathcal{D}}\left( \sup\limits_{h\in\mathcal{H}} \frac{1}{n} \sum\limits_{i=1}^n \varepsilon_i h(Z_i) \right)
\bullet \text{ Measures the correlation between } h\in\mathcal{H} \text{ and ``random labels" } \varepsilon
\bullet \text{ } R_n(\mathcal{H}) \text{ near } 1 \Rightarrow \text{``capacity to memorize random labels"}
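The definition can be estimated by Monte Carlo for a small class. A sketch, assuming a hypothetical class of threshold functions \(h_t(z) = 1[z > t]\) on a grid with \(Z_i \sim \mathcal{U}[0,1]\) (class and distribution are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)

n = 100
thresholds = np.linspace(0.0, 1.0, 51)

def rademacher_estimate(n, n_mc=2000):
    # Monte Carlo average over data D and signs eps of
    # sup_h (1/n) sum_i eps_i h(Z_i).
    vals = np.empty(n_mc)
    for k in range(n_mc):
        Z = rng.uniform(size=n)
        eps = rng.choice([-1.0, 1.0], size=n)
        # H has shape (len(thresholds), n): H[t, i] = 1[Z_i > t]
        H = (Z[None, :] > thresholds[:, None]).astype(float)
        vals[k] = (H @ eps).max() / n
    return vals.mean()

r_n = rademacher_estimate(n)
# Thresholds form a small (VC-type) class, so R_n is far below the
# trivial value 1 that would signal "memorizing random labels".
print(f"estimated R_n ~ {r_n:.3f}")
```

For this class the estimate comes out on the order of \(\sqrt{\log n / n}\), illustrating the gap between a small class and one that can fit arbitrary signs.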
\text{Proposition 4.2 (symmetrization)}
\mathbb{E}_\mathcal{D} \left[ \sup\limits_{h\in\mathcal{H}} \left( \frac{1}{n}\sum\limits_{i=1}^n h(Z_i) - \mathbb{E}\left[ h(Z) \right]\right) \right] \leq 2 R_n(\mathcal{H})
\text{and likewise}
\mathbb{E}_\mathcal{D} \left[ \sup\limits_{h\in\mathcal{H}} \left( \mathbb{E}\left[ h(Z) \right] - \frac{1}{n}\sum\limits_{i=1}^n h(Z_i) \right) \right] \leq 2 R_n(\mathcal{H}) = 2 \mathbb{E}_{\varepsilon, \mathcal{D}}\left( \sup\limits_{h\in\mathcal{H}} \frac{1}{n} \sum\limits_{i=1}^n \varepsilon_i h(Z_i) \right)
\text{Proof:}
\text{Let } \mathcal{D}' = \{Z_1',Z_2',\dots,Z_n'\} \text{ be an i.i.d. copy of } \mathcal{D}, \text{ so that } \mathbb{E}[h(Z)] = \mathbb{E}_{\mathcal{D}'}\left[ \frac{1}{n}\sum\limits_{i=1}^n h(Z_i') \right]
\sup\limits_{h\in\mathcal{H}} \left( \frac{1}{n}\sum\limits_{i=1}^n h(Z_i) - \mathbb{E}\left[ h(Z) \right]\right) = \sup\limits_{h\in\mathcal{H}} \mathbb{E}_{\mathcal{D}'}\left[ \frac{1}{n}\sum\limits_{i=1}^n \left(h(Z_i) - h(Z_i')\right) \right] \leq \mathbb{E}_{\mathcal{D}'}\left[ \sup\limits_{h\in\mathcal{H}} \frac{1}{n}\sum\limits_{i=1}^n \left(h(Z_i) - h(Z_i')\right) \right]
\text{Taking expectations over } \mathcal{D}:
\mathbb{E}_\mathcal{D} \left[ \sup\limits_{h\in\mathcal{H}} \left( \frac{1}{n}\sum\limits_{i=1}^n h(Z_i) - \mathbb{E}\left[ h(Z) \right]\right) \right] \leq \mathbb{E}_{\mathcal{D},\mathcal{D}'}\left[ \sup\limits_{h\in\mathcal{H}} \frac{1}{n}\sum\limits_{i=1}^n \left(h(Z_i) - h(Z_i')\right) \right]
\text{Since } h(Z_i) - h(Z_i') \text{ is symmetric, it has the same distribution as } \varepsilon_i \left(h(Z_i) - h(Z_i')\right), \text{ hence the right-hand side equals}
\mathbb{E}_{\varepsilon, \mathcal{D},\mathcal{D}'}\left[ \sup\limits_{h\in\mathcal{H}} \frac{1}{n}\sum\limits_{i=1}^n \varepsilon_i\left(h(Z_i) - h(Z_i')\right) \right] \leq \mathbb{E}_{\varepsilon, \mathcal{D}}\left[ \sup\limits_{h\in\mathcal{H}} \frac{1}{n}\sum\limits_{i=1}^n \varepsilon_i h(Z_i) \right] + \mathbb{E}_{\varepsilon, \mathcal{D}'}\left[ \sup\limits_{h\in\mathcal{H}} \frac{1}{n}\sum\limits_{i=1}^n (-\varepsilon_i) h(Z_i') \right] = 2 R_n(\mathcal{H})
\text{where the last equality uses } -\varepsilon \overset{d}{=} \varepsilon. \hspace{2mm} \square
Lipschitz Continuous Losses
\text{Proposition 4.3 (Contraction principle)}
\text{Let } b, a_i: \Theta \rightarrow \mathbb{R}, \text{ and let } \varphi_i: \mathbb{R}\rightarrow \mathbb{R} \text{ be } 1\text{-Lipschitz. Then}
\mathbb{E}_\varepsilon \left[ \sup\limits_{\theta\in\Theta} b(\theta) + \sum\limits_{i=1}^n \varepsilon_i \varphi_i(a_i(\theta)) \right] \leq \mathbb{E}_\varepsilon \left[ \sup\limits_{\theta\in\Theta} b(\theta) + \sum\limits_{i=1}^n \varepsilon_i a_i(\theta) \right]
\text{Proof: by induction on } n. \text{ Condition on } \varepsilon_1,\dots,\varepsilon_n \text{ and average over } \varepsilon_{n+1} \sim \frac{1}{2}(\delta_{-1} + \delta_{+1}):
\mathbb{E}_{\varepsilon_1,\dots,\varepsilon_{n+1}} \left[ \sup\limits_{\theta\in\Theta} b(\theta) + \sum\limits_{i=1}^{n+1} \varepsilon_i \varphi_i(a_i(\theta)) \right]
= \dfrac{1}{2} \left( \mathbb{E}_{\varepsilon_1,\dots,\varepsilon_n} \left[ \sup\limits_{\theta\in\Theta} b(\theta) + \sum\limits_{i=1}^{n} \varepsilon_i \varphi_i(a_i(\theta)) + \varphi_{n+1}(a_{n+1}(\theta)) \right] + \mathbb{E}_{\varepsilon_1,\dots,\varepsilon_n} \left[ \sup\limits_{\theta\in\Theta} b(\theta) + \sum\limits_{i=1}^{n} \varepsilon_i \varphi_i(a_i(\theta)) - \varphi_{n+1}(a_{n+1}(\theta)) \right] \right)
= \mathbb{E}_{\varepsilon_1,\dots,\varepsilon_n} \left[ \sup\limits_{\theta, \theta' \in\Theta} \frac{b(\theta)+b(\theta')}{2} + \sum\limits_{i=1}^{n} \varepsilon_i \frac{\varphi_i(a_i(\theta))+\varphi_i(a_i(\theta'))}{2} + \frac{\varphi_{n+1}(a_{n+1}(\theta)) - \varphi_{n+1}(a_{n+1}(\theta'))}{2} \right]
\leq \mathbb{E}_{\varepsilon_1,\dots,\varepsilon_n} \left[ \sup\limits_{\theta, \theta' \in\Theta} \frac{b(\theta)+b(\theta')}{2} + \sum\limits_{i=1}^{n} \varepsilon_i \frac{\varphi_i(a_i(\theta))+\varphi_i(a_i(\theta'))}{2} + \frac{|a_{n+1}(\theta) - a_{n+1}(\theta')|}{2} \right] \hspace{2mm} (\varphi_{n+1} \text{ is } 1\text{-Lipschitz})
\text{By symmetry in } (\theta,\theta') \text{ the absolute value can be dropped, and splitting the sup back up gives}
= \mathbb{E}_{\varepsilon_1,\dots,\varepsilon_{n+1}} \left[ \sup\limits_{\theta\in\Theta} b(\theta) + \varepsilon_{n+1} a_{n+1}(\theta) + \sum\limits_{i=1}^{n} \varepsilon_i \varphi_i(a_i(\theta)) \right]
\leq \mathbb{E}_{\varepsilon_1,\dots,\varepsilon_{n+1}} \left[ \sup\limits_{\theta\in\Theta} b(\theta) + \sum\limits_{i=1}^{n+1} \varepsilon_i a_i(\theta) \right] \hspace{2mm} \text{(induction hypothesis, with } b \text{ replaced by } b + \varepsilon_{n+1} a_{n+1}\text{)}
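Because the expectation over signs can be computed exactly by enumerating all \(2^n\) patterns, the contraction inequality can be verified numerically. A sketch with invented ingredients: \(\varphi_i = \tanh\) as an arbitrary 1-Lipschitz map, and random \(b(\theta)\), \(a_i(\theta)\) over a finite parameter set \(\Theta\):

```python
import numpy as np

rng = np.random.default_rng(5)

# Finite setting: n signs, m elements of Theta; expectation over eps is
# computed exactly by enumerating all 2^n sign patterns.
n, m = 8, 30
b = rng.normal(size=m)           # b(theta) for each theta in Theta
a = rng.normal(size=(m, n))      # a_i(theta)

signs = np.array([[1.0 if (k >> i) & 1 else -1.0 for i in range(n)]
                  for k in range(2 ** n)])

# LHS: phi_i = tanh (1-Lipschitz) applied inside; RHS: identity.
lhs = np.mean([(b + signs[k] @ np.tanh(a).T).max() for k in range(2 ** n)])
rhs = np.mean([(b + signs[k] @ a.T).max() for k in range(2 ** n)])
assert lhs <= rhs + 1e-12
```

Since both expectations are exact (no Monte Carlo error), the inequality holds up to floating-point rounding for any draw of `b` and `a`.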
Application:
\text{If } u \mapsto \ell(Y_i, u) \text{ is } G\text{-Lipschitz for all } i \text{ a.s., applying the contraction principle with } \varphi_i = \ell(Y_i, \cdot)/G \text{ gives}
\mathbb{E}_\varepsilon \left( \left.\sup\limits_{f\in\mathcal{F}} \frac{1}{n}\sum\limits_{i=1}^n \varepsilon_i \ell(Y_i, f(X_i)) \right| \mathcal{D}\right) \leq G \mathbb{E}_\varepsilon \left( \left.\sup\limits_{f\in\mathcal{F}} \frac{1}{n}\sum\limits_{i=1}^n \varepsilon_i f(X_i) \right| \mathcal{D}\right)
\Rightarrow R_n(\mathcal{H}) \leq G R_n(\mathcal{F})
Ball-constrained linear predictions:
\mathcal{F} = \{ f_\theta(x) = \theta^T \varphi(x) : \Omega(\theta)\leq D \}, \hspace{2mm} \Omega \text{ a norm on }\mathbb{R}^d
R_n(\mathcal{F}) \leq \text{ ?}
R_n(\mathcal{F}) = \mathbb{E}\left[ \sup\limits_{\Omega(\theta) \leq D} \frac{1}{n} \sum\limits_{i=1}^n \varepsilon_i \theta^T \varphi(X_i) \right] = \mathbb{E}\left[ \sup\limits_{\Omega(\theta) \leq D} \frac{1}{n} \varepsilon^T \Phi \theta \right] = \frac{D}{n} \mathbb{E}\left[ \Omega^* (\Phi^T \varepsilon) \right]
\text{where } \Phi\in\mathbb{R}^{n\times d} \text{ has rows } \varphi(X_i)^T \text{ and } \Omega^*(u) = \sup\limits_{\Omega(\theta)\leq 1} u^T\theta \text{ is the dual norm.}
Ball-constrained linear predictions:
\text{When } \Omega = \|\cdot\|_2 \text{ (so } \Omega^* = \|\cdot\|_2\text{):}
R_n(\mathcal{F}) = \frac{D}{n} \mathbb{E}\left[ \| \Phi^T\varepsilon \|_2 \right] \leq \frac{D}{n} \sqrt{\mathbb{E} \| \Phi^T\varepsilon \|_2^2 } = \frac{D}{n} \sqrt{\mathbb{E} \left[ \text{tr}\left(\Phi^T\varepsilon \varepsilon^T\Phi\right)\right] }
= \frac{D}{n} \sqrt{ \mathbb{E}\left[ \text{tr}\left(\Phi^T\Phi\right)\right] } \hspace{2mm} (\mathbb{E}[\varepsilon\varepsilon^T] = I)
= \frac{D}{n} \sqrt{ \sum\limits_{i=1}^n \mathbb{E} \|\varphi(X_i)\|_2^2 } = \frac{D}{\sqrt{n}} \sqrt{ \mathbb{E} \|\varphi(X)\|_2^2 }
\text{dimension-free!}
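For the \(\ell_2\) ball the sup is available in closed form, \(\sup_{\|\theta\|_2\le D} \frac{1}{n}\varepsilon^T\Phi\theta = \frac{D}{n}\|\Phi^T\varepsilon\|_2\), so the dimension-free bound can be checked by Monte Carlo. A sketch with invented dimensions and a Gaussian feature map standing in for \(\varphi\):

```python
import numpy as np

rng = np.random.default_rng(2)

n, d, D = 50, 10, 2.0

def mc_rademacher_l2(n_mc=5000):
    # Monte Carlo estimate of R_n(F) for the l2 ball, plus E||phi(X)||^2.
    total, sq_norm = 0.0, 0.0
    for _ in range(n_mc):
        Phi = rng.normal(size=(n, d))          # rows = features phi(X_i)
        eps = rng.choice([-1.0, 1.0], size=n)
        total += D / n * np.linalg.norm(Phi.T @ eps)   # closed-form sup
        sq_norm += np.mean(np.sum(Phi**2, axis=1))
    return total / n_mc, sq_norm / n_mc

r_n, e_phi_sq = mc_rademacher_l2()
bound = D / np.sqrt(n) * np.sqrt(e_phi_sq)     # (D/sqrt(n)) sqrt(E||phi||^2)
assert r_n <= bound * 1.01                     # small Monte Carlo slack
```

The only inequality in the chain is Jensen's, so the estimate typically sits a few percent below the bound regardless of `d`.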
Linear Predictions:
\text{Proposition 4.4 (Estimation Error):}
\text{Assume a } G\text{-Lipschitz loss, ball-constrained linear prediction, and } \mathbb{E} \|\varphi(X)\|_2^2 \leq R^2. \text{ For}
\hat{\theta} = \argmin\limits_{\|\theta\|_2\leq D} \hat{\mathcal{R}}(f_\theta),
\mathbb{E}[\mathcal{R}(f_{\hat{\theta}})] - \inf\limits_{\|\theta\|_2\leq D} \mathcal{R}(f_\theta) \leq \frac{2 G R D}{\sqrt{n}}
\text{Proof: uniform control } + \text{ contraction for the } G\text{-Lipschitz loss } + \text{ dimension-free bound.}
Linear Predictions:
\text{Let } \theta^* \text{ be a minimizer of } \mathcal{R}(f_\theta) \text{ over } \mathbb{R}^d. \text{ Approximation error:}
\inf\limits_{\|\theta\|_2\leq D} \mathcal{R}(f_\theta) - \mathcal{R}(f_{\theta^*}) \leq G \inf\limits_{\|\theta\|_2\leq D} \mathbb{E} [|f_\theta(X) - f_{\theta^*}(X)|] \leq G R \inf\limits_{\|\theta\|_2\leq D} \|\theta-\theta^*\|_2
\text{Combining with Proposition 4.4:}
\mathbb{E}[ \mathcal{R}(f_{\hat{\theta}}) ] - \mathcal{R}(f_{\theta^*}) \leq \underbrace{G R \inf\limits_{\|\theta\|_2\leq D} \|\theta-\theta^*\|_2}_{\text{grows for small } D} + \underbrace{\dfrac{2 G R D}{\sqrt{n}}}_{\text{grows for large } D}
Regularized estimation:
\hat{\theta}_\lambda \text{ - minimizer of } \hspace{2mm} \hat{\mathcal{R}}(f_\theta) + \frac{\lambda}{2} \|\theta\|_2^2
\text{For a nonnegative loss,}
\frac{\lambda}{2} \|\hat{\theta}_\lambda\|_2^2 \leq \hat{\mathcal{R}}(f_{\hat{\theta}_\lambda}) + \frac{\lambda}{2} \|\hat{\theta}_\lambda\|_2^2 \leq \hat{\mathcal{R}}(f_0)
\Rightarrow \|\hat{\theta}_\lambda\|_2 = \mathcal{O}(\lambda^{-\frac{1}{2}})
D = \mathcal{O}(1/\sqrt{\lambda}) \Rightarrow \text{ deviation } = \mathcal{O}(1/\sqrt{\lambda n})
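The norm bound on the regularized minimizer is a deterministic consequence of the loss being nonnegative. A sketch using ridge regression (squared loss, where \(\hat\theta_\lambda\) has a closed form; the dimensions and data-generating model are invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(3)

# Ridge regression: Rhat(theta) = (1/2n)||Y - Phi theta||^2, penalty lam/2 ||theta||^2.
n, d, lam = 200, 5, 0.1
Phi = rng.normal(size=(n, d))
Y = Phi @ rng.normal(size=d) + rng.normal(size=n)

# Closed-form minimizer of Rhat + (lam/2)||.||^2.
theta_hat = np.linalg.solve(Phi.T @ Phi / n + lam * np.eye(d), Phi.T @ Y / n)

# Rhat(f_0): empirical risk of the zero predictor with loss (y-u)^2/2.
risk_at_zero = np.mean(Y ** 2) / 2

# (lam/2)||theta_hat||^2 <= Rhat(f_0)  =>  ||theta_hat|| <= sqrt(2 Rhat(f_0)/lam)
assert np.linalg.norm(theta_hat) <= np.sqrt(2 * risk_at_zero / lam)
```

The assertion holds for every sample: the regularized objective at \(\hat\theta_\lambda\) is at most its value at \(\theta = 0\), and the loss term is nonnegative.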
Regularized estimation:
\text{Proposition 4.5 (Fast rates for regularized objectives):}
\mathbb{E}\left[\mathcal{R}(f_{\hat{\theta}_\lambda})\right] \leq \inf\limits_{\theta\in\mathbb{R}^d} \left( \mathcal{R}(f_\theta) + \frac{\lambda}{2}\|\theta\|_2^2 \right) + \frac{32 G^2 R^2}{\lambda n}
4.6 Relationship with asymptotic statistics
\text{(Consistency) Does } \hat{\theta}_n \text{ converge to } \theta^* \text{ as } n\rightarrow \infty \text{ ?}
\bullet \text{ Law of Large Numbers (LLN) } \Rightarrow \hat{\mathcal{R}}_n(\theta)\rightarrow \mathcal{R}(\theta) \text{ pointwise}
\text{Example (van der Vaart): suppose}
\sup\limits_{\theta\in\Theta} \left|\hat{\mathcal{R}}_n(\theta) - \mathcal{R}(\theta)\right| \overset{P}{\rightarrow} 0
\text{and, for all } \varepsilon > 0,
\inf\limits_{\theta:\, d(\theta,\theta^*)\geq \varepsilon} \mathcal{R}(\theta) > \mathcal{R}(\theta^*)
\text{Then any sequence } \hat{\theta}_n \text{ with } \hat{\mathcal{R}}_n(\hat{\theta}_n) \leq \hat{\mathcal{R}}_n(\theta^*) + o_P(1) \text{ converges in probability to } \theta^*.
Intuitive justification:
\bullet \hspace{2mm} \ell: \mathcal{Y}\times\mathbb{R}\rightarrow \mathbb{R} \text{ sufficiently differentiable}
\bullet \hspace{2mm} \theta^*\in\mathbb{R}^d \text{ minimizer of } \mathcal{R}(\theta), \text{ with } H = \mathcal{R}''(\theta^*) \text{ positive definite}
\mathcal{R}'(\theta^*) = 0 \Rightarrow \hat{\mathcal{R}}'(\theta^*) = \frac{1}{n}\sum\limits_{i=1}^n \left.\frac{\partial \ell(Y_i, f_\theta(X_i))}{\partial \theta}\right|_{\theta=\theta^*} \rightarrow 0 \text{ a.s. (LLN)}
\bullet \hspace{2mm} \text{Taylor expansion:}
0 = \hat{\mathcal{R}}'(\hat{\theta}_n) \approx \hat{\mathcal{R}}'(\theta^*) + \hat{\mathcal{R}}''(\theta^*)(\hat{\theta}_n - \theta^*)
\hat{\mathcal{R}}''(\theta^*) \rightarrow \mathcal{R}''(\theta^*) = H \text{ a.s. (LLN)}
\hat{\theta}_n - \theta^* \approx -H^{-1}\hat{\mathcal{R}}'(\theta^*) \rightarrow 0
Intuitive justification:
\bullet \hspace{2mm} \text{By the Central Limit Theorem (CLT),}
\hat{\mathcal{R}}'(\theta^*) \approx \mathcal{N}\left(0, \frac{1}{n} G(\theta^*)\right)
G(\theta^*) = \mathbb{E}\left[ \left. \left( \frac{\partial \ell(Y, f_\theta(X))}{\partial\theta} \right) \left( \frac{\partial \ell(Y, f_\theta(X))}{\partial\theta} \right)^T \right|_{\theta=\theta^*}\right]
\bullet \hspace{2mm} \hat{\theta}_n \approx \text{ normal with mean }\theta^* \text{ and covariance } \frac{1}{n} H^{-1} G(\theta^*) H^{-1}
\mathbb{E}\left[ \|\hat{\theta}_n - \theta^*\|_2^2 \right] \sim \frac{1}{n} \text{tr}\left[ H^{-1}G(\theta^*)H^{-1} \right]
\mathbb{E}\left[ \mathcal{R}(\hat{\theta}_n) -\mathcal{R}(\theta^*)\right] \sim \frac{1}{2n} \text{tr}\left[ H^{-1}G(\theta^*) \right]
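The sandwich covariance \(\frac{1}{n}H^{-1}G(\theta^*)H^{-1}\) can be checked by simulation. A sketch with an invented scalar model where \(G \neq H\), so the sandwich genuinely differs from \(H^{-1}/n\): least squares with heteroscedastic noise, \(Y = \theta^* X + |X|\,\xi\), \(X \sim \mathcal{U}[-1,1]\), loss \(\ell(y,u) = (y-u)^2/2\):

```python
import numpy as np

rng = np.random.default_rng(4)

theta_star, n, n_rep = 1.5, 500, 2000

est = np.empty(n_rep)
for r in range(n_rep):
    X = rng.uniform(-1, 1, size=n)
    Y = theta_star * X + np.abs(X) * rng.normal(size=n)  # noise scale |X|
    est[r] = (X @ Y) / (X @ X)                           # least-squares theta_hat

# With loss (y-u)^2/2:  H = E[X^2] = 1/3,
# G = E[(Y - theta* X)^2 X^2] = E[X^4] = 1/5  (E[xi^2] = 1).
H, G = 1.0 / 3.0, 1.0 / 5.0
sandwich_var = G / (H * H) / n    # (1/n) H^{-1} G H^{-1}, scalar case

# Empirical variance of theta_hat across repetitions should match.
assert abs(est.var() / sandwich_var - 1) < 0.15
```

Here \(H^{-1}G H^{-1} = 9/5 \neq H^{-1} = 3\), so matching the empirical variance confirms the sandwich form rather than the naive inverse-Hessian scaling.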