a review
Misclassification in image recognition
Face recognition
Object detection
Image segmentation
Reinforcement learning
Generative modeling
Adapted JSMA, a gradient-based algorithm
Genetic algorithm for evasive malicious PDF files
Local search in the latent space of MalGAN
Reinforcement learning algorithm where evasion is treated as the reward (see the sketch below)
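A minimal sketch of the reinforcement-learning idea, assuming a hypothetical action set of functionality-preserving mutations and a placeholder detector score (none of these names come from a real tool); the agent earns a reward only when the mutated sample evades the detector:

```python
# Sketch of RL-based malware evasion: the reward signal is evasion itself.
# ACTIONS, mutate() and malware_score() are hypothetical placeholders, not a real API.
import random

ACTIONS = ["append_bytes", "add_section", "rename_section", "pack"]

def mutate(sample, action):
    # Placeholder: apply a functionality-preserving transformation to the binary.
    return sample + [action]

def malware_score(sample):
    # Placeholder: the target detector's maliciousness score in [0, 1].
    return max(0.0, 0.9 - 0.1 * len(sample))

q_table = {a: 0.0 for a in ACTIONS}      # state-less Q-values, for simplicity
epsilon, alpha = 0.2, 0.5                # exploration rate, learning rate

for episode in range(50):
    sample = []                          # start from the original (unmodified) binary
    for step in range(10):
        action = (random.choice(ACTIONS) if random.random() < epsilon
                  else max(q_table, key=q_table.get))
        sample = mutate(sample, action)
        evaded = malware_score(sample) < 0.5
        reward = 1.0 if evaded else 0.0  # evasion is the only reward
        q_table[action] += alpha * (reward - q_table[action])
        if evaded:
            break
```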
Mostly used in the computer vision domain
Uses the gradient of the target model to directly perturb pixel values
Mechanism lies in optimizing an objective function that contains two components:
A loss term that drives misclassification (e.g., the model's loss with respect to the true or target label)
Distance between the clean and adversarial input: Lp-norm distance in the input space
Can be applied in a white-box or black-box manner
White-box: the adversary has access to 1) the target model's architecture and parameters, 2) the training data, 3) the training algorithm, and 4) the hyperparameters
Black-box: the adversary may only have access to the target model's predictions (possibly including confidence scores)
Transfer attacks from a single substitute model or an ensemble of substitute models
Trade-off between effectiveness (perturbation size & misclassification rate) and computational time
Single-step or iterative
Successful gradient-based approaches (an FGSM sketch follows this list)
FGSM
i-FGSM
R+FGSM
JSMA
C&W
PGD
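As a concrete example of a single-step gradient attack, here is a minimal FGSM sketch, assuming a PyTorch classifier `model` that outputs logits and a correctly labelled batch `(x, y)` with pixel values in [0, 1]:

```python
# Minimal FGSM sketch (PyTorch assumed): perturb the input along the sign of the
# gradient of the loss with respect to the input.
import torch
import torch.nn.functional as F

def fgsm(model, x, y, eps=0.03):
    x = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x), y)   # objective the attacker wants to increase
    loss.backward()
    x_adv = x + eps * x.grad.sign()       # single-step perturbation
    return x_adv.clamp(0, 1).detach()     # keep pixels in the valid range

# Iterative variants (i-FGSM / PGD) repeat this step with a smaller step size and
# project the result back into the L-infinity ball of radius eps around x.
```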
Evolutionary & genetic algorithms (see the sketch after these examples)
PDF-Malware evasion
Image misclassification from noisy images
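A sketch of the black-box evolutionary idea, assuming only query access to a placeholder fitness function `black_box_score` (standing in for the target model's confidence in the attacker's desired outcome); crossover is omitted for brevity, so this is closer to a simple evolution strategy:

```python
# Black-box evolutionary attack sketch: candidates are perturbed inputs, fitness is a
# placeholder query to the target model. In practice a distance penalty is often
# added to the fitness to keep perturbations small.
import numpy as np

rng = np.random.default_rng(0)

def black_box_score(x):
    # Placeholder for the target model's score of the attacker's desired outcome.
    return float(-np.abs(x - 0.7).mean())

def genetic_attack(x0, pop_size=20, generations=100, sigma=0.05):
    pop = x0 + sigma * rng.standard_normal((pop_size,) + x0.shape)
    for _ in range(generations):
        fitness = np.array([black_box_score(p) for p in pop])
        parents = pop[np.argsort(fitness)[-pop_size // 2:]]            # selection
        children = parents[rng.integers(len(parents), size=pop_size)]  # reproduction
        children += sigma * rng.standard_normal(children.shape)        # mutation
        pop = children
    return pop[np.argmax([black_box_score(p) for p in pop])]

x_adv = genetic_attack(np.zeros((8, 8)))
```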
Local search algorithms (see the sketch after these examples)
Comprehension tasks attacked using greedy search
Malware evasion
Translation task with adversarial examples searched in the latent space of an autoencoder
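A sketch of greedy local search over discrete substitutions, with `candidates` and `score` as hypothetical stand-ins for a perturbation generator and the attacker's objective (e.g., the target model's loss):

```python
# Greedy local search sketch: at each step, apply the single substitution that most
# increases the attacker's objective. candidates() and score() are placeholders.
def candidates(token):
    return [token.upper(), token + "s"]           # placeholder perturbations

def score(tokens):
    return sum(t.isupper() for t in tokens)       # placeholder attacker objective

def greedy_attack(tokens, budget=3):
    tokens = list(tokens)
    for _ in range(budget):
        base = score(tokens)
        best_gain, best_edit = 0, None
        for i, tok in enumerate(tokens):
            for cand in candidates(tok):
                gain = score(tokens[:i] + [cand] + tokens[i + 1:]) - base
                if gain > best_gain:
                    best_gain, best_edit = gain, (i, cand)
        if best_edit is None:
            break                                  # no improving substitution left
        i, cand = best_edit
        tokens[i] = cand                           # apply the best single edit
    return tokens

print(greedy_attack("the model reads this sentence".split()))
```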
Most defenses are in the computer vision domain
Adversarial retraining (a training-step sketch follows this list)
Regularization techniques
Certification & Guarantees
Network distillation
Adversarial detection
Input reconstruction
Ensemble of defenses
New model architecture
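A minimal sketch of an adversarial retraining step, assuming a PyTorch classifier and reusing the FGSM step shown earlier; adversarial examples are crafted against the current parameters and mixed into the loss:

```python
# Adversarial retraining sketch: each minibatch is augmented with adversarial
# examples generated on the fly against the current model.
import torch
import torch.nn.functional as F

def adversarial_training_step(model, optimizer, x, y, eps=0.03):
    # Craft FGSM adversarial examples against the current parameters.
    x_pert = x.clone().detach().requires_grad_(True)
    F.cross_entropy(model(x_pert), y).backward()
    x_adv = (x_pert + eps * x_pert.grad.sign()).clamp(0, 1).detach()

    # Train on a mixture of clean and adversarial examples.
    optimizer.zero_grad()
    loss = 0.5 * F.cross_entropy(model(x), y) + 0.5 * F.cross_entropy(model(x_adv), y)
    loss.backward()
    optimizer.step()
    return loss.item()
```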
Regularize the model’s confidence in its predictions
Adversarial Logit Pairing, Clean Logit Pairing, Logit Squeezing
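A sketch of the logit-regularization losses, assuming PyTorch and a model that outputs logits; logit squeezing penalizes large (over-confident) logits, while adversarial logit pairing penalizes the gap between the logits of clean/adversarial pairs:

```python
# Logit-regularization sketches (not the papers' exact formulations).
import torch
import torch.nn.functional as F

def logit_squeezing_loss(model, x, y, lam=0.05):
    logits = model(x)
    return F.cross_entropy(logits, y) + lam * logits.pow(2).mean()

def logit_pairing_loss(model, x, x_adv, y, lam=0.5):
    logits, logits_adv = model(x), model(x_adv)
    pairing = (logits - logits_adv).pow(2).mean()   # keep the two sets of logits close
    return F.cross_entropy(logits_adv, y) + lam * pairing
```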
Direct and approximate methods to certify guarantees about adversarial examples within a region of the input space
Direct methods (e.g., Reluplex) are computationally intensive and limited in scope
Reformulating the approximation as a training objective can make the model more robust
A convex approximation provides an upper bound on the worst-case loss, which is then minimized (a simple interval-bound sketch follows)
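As a simple stand-in for the convex relaxations used in certification, here is an interval-bound sketch for one linear + ReLU layer; it gives guaranteed (if loose) output bounds for every input in an L-infinity ball, and the shapes/values are arbitrary assumptions:

```python
# Interval bound propagation for a single linear + ReLU layer: a loose outer
# approximation of the set of reachable outputs for all inputs within eps of x.
import numpy as np

def interval_linear_relu(W, b, x, eps):
    lower, upper = x - eps, x + eps
    W_pos, W_neg = np.clip(W, 0, None), np.clip(W, None, 0)
    out_lower = W_pos @ lower + W_neg @ upper + b   # worst case per output unit
    out_upper = W_pos @ upper + W_neg @ lower + b   # best case per output unit
    return np.maximum(out_lower, 0.0), np.maximum(out_upper, 0.0)  # through ReLU

# If, at the final layer, the lower bound of the true-class logit exceeds the upper
# bounds of all other logits, the example is certified robust within that ball.
```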
Network distillation
Overcome by stronger attacks
A second (student) model is trained on the softened predictions of the original model
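A sketch of the distillation loss used to train the second (student) network on the first network's softened predictions, assuming PyTorch logits for both models; the temperature T is the key hyperparameter:

```python
# Distillation loss sketch: the student matches the teacher's softmax at temperature T.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, T=20.0):
    soft_targets = F.softmax(teacher_logits / T, dim=1)    # teacher's soft labels
    log_probs = F.log_softmax(student_logits / T, dim=1)
    return F.kl_div(log_probs, soft_targets, reduction="batchmean") * (T * T)
```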
Adversarial detection
Distinguishes adversarial images from ‘clean’ images
Overcome by including the detector in the attack’s objective function
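A sketch of adversarial detection framed as binary classification, using scikit-learn's LogisticRegression on raw pixels for brevity (in practice detectors typically use hidden activations or statistics of the target model):

```python
# Detector sketch: fit a binary classifier on clean vs. adversarial examples.
import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_detector(clean_images, adv_images):
    X = np.vstack([clean_images.reshape(len(clean_images), -1),
                   adv_images.reshape(len(adv_images), -1)])
    y = np.concatenate([np.zeros(len(clean_images)), np.ones(len(adv_images))])
    return LogisticRegression(max_iter=1000).fit(X, y)

# detector.predict(x.reshape(1, -1)) returns 1 for inputs flagged as adversarial.
```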
Input reconstruction
Scrubs adversarial images ‘clean’ before they reach the classifier
Overcome by attacks using Expectation over Transformation (EOT)
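A sketch of input reconstruction as a preprocessing step, assuming PyTorch and a toy autoencoder architecture for 28x28 images (it would have to be trained to reconstruct clean data before use):

```python
# Input-reconstruction sketch: pass inputs through a denoising autoencoder before
# classification so small perturbations are projected back toward the data manifold.
import torch
import torch.nn as nn

autoencoder = nn.Sequential(                 # toy architecture for 1x28x28 images
    nn.Flatten(), nn.Linear(784, 128), nn.ReLU(),
    nn.Linear(128, 784), nn.Sigmoid(), nn.Unflatten(1, (1, 28, 28)),
)

def defended_predict(classifier, x):
    with torch.no_grad():
        x_clean = autoencoder(x)             # 'scrub' the perturbation
        return classifier(x_clean).argmax(dim=1)
```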
Ensemble of defenses
An ensemble of models combining the above defenses
Can be overcome if the underlying defenses are weak
Allows the model to express its degree of certainty in a prediction: “know when it does not know”
Gaussian Process Hybrid Deep Neural Networks
Expresses the latent variables as a Gaussian distribution with mean and covariance, encoded with RBF kernels (see the sketch below)
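A simplified stand-in for the hybrid idea, fitting a scikit-learn Gaussian Process classifier with an RBF kernel on the network's penultimate-layer features (the actual GP hybrid DNN is trained end to end; this only illustrates the uncertainty-aware prediction head):

```python
# GP prediction head sketch: an RBF-kernel Gaussian Process replaces the softmax
# layer and operates on latent features, giving uncertainty-aware predictions.
import numpy as np
from sklearn.gaussian_process import GaussianProcessClassifier
from sklearn.gaussian_process.kernels import RBF

def fit_gp_head(latent_features, labels):
    # latent_features: (n_samples, d) activations from the penultimate DNN layer.
    return GaussianProcessClassifier(kernel=RBF(length_scale=1.0)).fit(latent_features, labels)

# gp.predict_proba(z) returns class probabilities whose spread reflects uncertainty,
# unlike a plain softmax, which can be overconfident far from the training data.
```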
“Capsule” networks for images, shown to be more resistant to adversarial examples
The architecture’s inductive bias might better represent the real data distribution
Definition of an adversarial example
Studies are mostly limited to Lp-norm perturbations of images (a common formalization is written out below)
No standard definition for discrete domains like NLP
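For reference, the commonly used Lp-ball formalization in the image setting (x is a clean input with true label y, f the classifier, ε the perturbation budget):

```latex
x' \text{ is adversarial for } f \text{ at } x
\iff \|x' - x\|_p \le \epsilon \ \text{ and } \ f(x') \ne y,
\qquad \text{assuming } f(x) = y .
```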
Standards for robustness evaluation
Benchmarks like CleverHans
Certification & guarantees
Whether an ultimately robust model is achievable
Adversarial examples exist whenever there is classification error
NLP
Other neural network architectures