Robustness of neural networks for classification problems

Davide Murari

MaGIC 2022 - 02/03/2022

Part of a joint project with Elena Celledoni, Brynjulf Owren, Carola-Bibiane Schönlieb and Ferdia Sherry

The classification problem

Given a "sufficiently large" set of \(N\) points in \(\mathcal{M}\subset\mathbb{R}^k\) that belong to \(C\) classes, we want to learn a function \(F\) assigning all the points of \(\mathcal{M}\) to the correct class.

\mathcal{M} = \bigcup\limits_{i=1}^C \mathcal{M}_i \subset \mathbb{R}^k
F : \mathbb{R}^k \rightarrow \mathbb{R}^C
\ell_F(x):=\arg\max\{F(x)_j : j=1,\dots,C\} = i =: \ell(x) \quad \forall x\in \mathcal{M}_i
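
To fix ideas, here is a minimal NumPy sketch of the label map \(\ell_F\). The affine classifier and the names `W`, `b`, `F` are illustrative choices, not from the talk.

```python
# Minimal sketch of the label map ell_F (illustrative affine classifier).
import numpy as np

k, C = 4, 3                         # input dimension and number of classes
rng = np.random.default_rng(0)
W, b = rng.normal(size=(C, k)), rng.normal(size=C)

def F(x):
    """Toy classifier F : R^k -> R^C (here simply affine)."""
    return W @ x + b

def ell_F(x):
    """Predicted label: index of the largest component of F(x)."""
    return int(np.argmax(F(x)))

x = rng.normal(size=k)
print(ell_F(x), F(x))
```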

Adversarial examples

\text{Find }\eta\in\mathbb{R}^k\\ \text{s.t.}\,\,\|\eta\|\leq \varepsilon,\,\,\ell_F(x)\neq \ell_F(x+\eta)
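
One standard way to search for such an \(\eta\) (a sketch, not necessarily the attack used in the talk): take the largest admissible step in the \(\varepsilon\)-ball along the direction that increases the runner-up score relative to the predicted one. For a toy affine classifier this direction is analytic; all names here are hypothetical.

```python
# Sketch of an l2 adversarial perturbation for an affine classifier.
import numpy as np

def adversarial_step(W, b, x, eps):
    scores = W @ x + b
    i = int(np.argmax(scores))                    # predicted label ell_F(x)
    j = int(np.argsort(scores)[-2])               # runner-up class
    g = W[j] - W[i]                               # gradient of F_j - F_i w.r.t. x
    return eps * g / (np.linalg.norm(g) + 1e-12)  # largest step with ||eta|| <= eps

rng = np.random.default_rng(1)
W, b = rng.normal(size=(3, 4)), rng.normal(size=3)
x = rng.normal(size=4)
eta = adversarial_step(W, b, x, eps=2.0)
print(np.argmax(W @ x + b), np.argmax(W @ (x + eta) + b))
```

Whether the label actually flips depends on \(\varepsilon\) and on how large the gap between the top two scores of \(F\) is at \(x\), which is exactly the margin discussed later.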

What is a robust classifier?

Suppose that

d(\mathcal{M}_i,\mathcal{M}_j)>2\varepsilon\quad \forall i\neq j

and define

\mathcal{M}_{\varepsilon} := \{x\in\mathbb{R}^k\,:\,d(x,\mathcal{M})\leq \varepsilon\}.

An \(\varepsilon\)-robust classifier is a function that correctly classifies not only the points in \(\mathcal{M}\) but also those in \(\mathcal{M}_{\varepsilon}\). In other words, we should learn a

F : \mathbb{R}^k \rightarrow \mathbb{R}^C

such that

1. \ell_F(x) = i =: \ell(x) \quad \forall x\in \mathcal{M}_i

2. \ell_F(x+\eta) = i \quad \forall x\in \mathcal{M}_i,\ \|\eta\|\leq \varepsilon

The choice of the Euclidean metric

From now on, we denote by \(\|\cdot\|\) the \(\ell^2\) norm.

The sensitivity to input perturbations

F : (\mathbb{R}^k,d_1)\rightarrow (\mathbb{R}^C,d_2)

The Lipschitz constant of \(F\) can be seen as a measure of its sensitivity to input perturbations:

d_2(F(x+\eta),F(x))\leq \mathrm{Lip}(F)\, d_1(x+\eta,x)
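
As a sanity check, \(\mathrm{Lip}(F)\) can be lower-bounded empirically by sampling difference quotients. A sketch with a toy affine map, for which the \(\ell^2\) Lipschitz constant is exactly the largest singular value of `W` (all names illustrative):

```python
# Empirical lower bound on Lip(F) from sampled difference quotients.
import numpy as np

rng = np.random.default_rng(2)
W, b = rng.normal(size=(3, 4)), rng.normal(size=3)
F = lambda x: W @ x + b

ratios = []
for _ in range(10_000):
    x, eta = rng.normal(size=4), 1e-3 * rng.normal(size=4)
    ratios.append(np.linalg.norm(F(x + eta) - F(x)) / np.linalg.norm(eta))

# For an affine map, Lip(F) equals the spectral norm of W.
print(max(ratios), np.linalg.svd(W, compute_uv=False)[0])
```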

Lipschitz networks based on dynamical systems

\Gamma(x) = [\gamma(x_1),\dots,\gamma(x_k)],\qquad V_{\theta}(x) = \boldsymbol{1}^T\Gamma(Ax+b),\quad \theta =[A,b]

X_{\pm,\theta}(z) := \pm \nabla V_{\theta}(z)

If the gradient field satisfies the one-sided Lipschitz condition

\langle \nabla V_{\theta}(z)- \nabla V_{\theta}(y),\,z-y\rangle \leq \mu \|z-y\|^2,

then any two solutions of \(\dot{z}(t)=\nabla V_{\theta}(z(t))\) satisfy

\|z(t)-y(t)\|\leq e^{\mu t} \|z_0-y_0\|.
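
A small sketch of this construction, assuming the concrete choice \(\gamma(s)=\log\cosh(s)\), so that \(\gamma'(s)=\tanh(s)\) (the talk leaves \(\gamma\) generic). Then \(\nabla V_{\theta}(z) = A^T\tanh(Az+b)\):

```python
# Gradient field of V_theta(z) = sum_i gamma((A z + b)_i) with gamma' = tanh.
import numpy as np

def grad_V(z, A, b):
    return A.T @ np.tanh(A @ z + b)

# Empirical estimate of the one-sided Lipschitz constant mu on a random pair:
# the quotient below is bounded by mu for the field +grad V_theta.
rng = np.random.default_rng(0)
A, b = rng.normal(size=(5, 5)), rng.normal(size=5)
z, y = rng.normal(size=5), rng.normal(size=5)
print((grad_V(z, A, b) - grad_V(y, A, b)) @ (z - y) / np.linalg.norm(z - y) ** 2)
```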

Lipschitz networks based on dynamical systems

F(x) \approx F_{\theta}(x)=\Phi_{-\nabla V_{\theta_k}}^{h_k} \circ \Phi_{\nabla V_{\theta_{k-1}}}^{h_{k-1}} \circ \dots \circ \Phi_{-\nabla V_{\theta_2}}^{h_2} \circ \Phi_{\nabla V_{\theta_{1}}}^{h_1}(x)

This is a splitting of the switched system

\dot{x}(t) = a_{s(t)}\nabla V_{s(t)}(x(t))=f(t,x(t)),

where \(s:[0,T]\rightarrow \{1,\dots,k\}\) is piecewise constant and \(a_{s(t)}=(-1)^{s(t)-1}\). The contraction estimates for each flow then give

\|F_{\theta}(x)-F_{\theta}(y)\|\leq \exp\left(\sum_{i=1}^k \mu_i h_i\right)\|x-y\|,

i.e. \(\exp\left(\sum_{i=1}^k \mu_i h_i\right)\) bounds the Lipschitz constant of \(F_{\theta}\).
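
A minimal forward pass in the same spirit, with the exact flows \(\Phi^{h_i}\) replaced by explicit Euler steps (one possible discretization; the talk does not commit to one) and the \(\tanh\) field from the previous sketch; sizes and parameters are illustrative.

```python
# Network as a composition of Euler steps of z' = a_i * grad V_{theta_i}(z).
import numpy as np

def grad_V(z, A, b):
    return A.T @ np.tanh(A @ z + b)      # gamma'(s) = tanh(s), assumed

def network(x, params, h):
    """params: list of (A_i, b_i); h: list of step sizes h_i."""
    z = x
    for i, (A, b) in enumerate(params):
        sign = (-1) ** i                 # a_i = (-1)^{i-1} with 1-based i
        z = z + sign * h[i] * grad_V(z, A, b)
    return z

rng = np.random.default_rng(3)
params = [(rng.normal(size=(4, 4)) / 4, rng.normal(size=4)) for _ in range(6)]
print(network(rng.normal(size=4), params, h=[0.1] * 6))
```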

The notion of margin

It is not enough to have a small Lipschitz constant: the classifier might have a small Lipschitz constant but still be very sensitive to input perturbations, because a small perturbation can flip an \(\arg\max\) whose top two scores are close.

Idea: ask for a margin between the score of the correct class and all the others.

"GOOD": F(x) = [0.99, 0.05, 0.05]

"BAD": F(x) = [0.34, 0.33, 0.33]

Suppose \(\ell(x)=i\), i.e. \(x\in\mathcal{M}_i\), and define the margin

M_F(x) := \max\{0,\ F(x)_{i} - \max_{j\neq i} F(x)_j\}.

Then

M_F(x)\geq \sqrt{2}\,\mathrm{Lip}(F)\,\varepsilon \implies \ell_F(x) = \ell_F(x+\eta)\quad \forall\, \|\eta\|\leq \varepsilon.
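
Read as a certificate, the bound yields a certified radius \(\varepsilon = M_F(x)/(\sqrt{2}\,\mathrm{Lip}(F))\) inside which the label cannot change. A sketch, with `lip_F` assumed known (e.g. the bound \(\exp(\sum_i \mu_i h_i)\) from the previous slides):

```python
# Certified l2 radius from the margin and a Lipschitz bound.
import numpy as np

def margin(scores, true_label):
    others = np.delete(scores, true_label)
    return max(0.0, scores[true_label] - others.max())

def certified_radius(scores, true_label, lip_F):
    return margin(scores, true_label) / (np.sqrt(2) * lip_F)

print(certified_radius(np.array([0.99, 0.05, 0.05]), 0, lip_F=1.0))  # "GOOD" example
```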

Loss function

\mathcal{L} = \frac{1}{N}\sum_{i=1}^N \sum_{j\neq \ell(x_i)} \max\{0,\ m-(F(x_i)_{\ell(x_i)}-F(x_i)_j)\}
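
A direct vectorized transcription of this loss (the batch layout and shapes are my choice):

```python
# Multi-class hinge loss with target margin m.
import numpy as np

def margin_loss(scores, labels, m):
    """scores: (N, C) array of outputs F(x_i); labels: (N,) true classes."""
    N = scores.shape[0]
    true = scores[np.arange(N), labels][:, None]   # F(x_i)_{ell(x_i)}
    hinge = np.maximum(0.0, m - (true - scores))   # per-class hinge terms
    hinge[np.arange(N), labels] = 0.0              # skip j = ell(x_i)
    return hinge.sum(axis=1).mean()

scores = np.array([[0.99, 0.05, 0.05], [0.34, 0.33, 0.33]])
print(margin_loss(scores, labels=np.array([0, 0]), m=0.5))
```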

Experiments for \(\ell^2\) robust accuracy

[Experimental results shown as figures in the original slides]

Thank you for your attention
