Machine Learning Testing
Survey, Landscapes and Horizons
Components
Required Conditions
ML Items
Testing Activities
Required Conditions
Correctness
Robustness
Fairness
Privacy
Efficiency
ML Items
Framework
Data
Learning Algorithm
Testing Activities
Test Input Generation
Test Oracle Identification
Test Adequacy Evaluation
Bug Triage
Testing Properties
Correctness : Metric (Accuracy, F1, ... )
Overfitting Degree
Robustness (Noisy input, stress)
Adversarial Robustness
local
global
...
Testing Properties
Security
Model Extraction
Data Privacy
Linkage Attack
Efficiency
Slow / Infinite Loop
VGG-19 on a mobile device
Interpretability
Transparency
Post-hoc explanations
Robustness
Correctness in the presence of noise
DeepFool
Point-wise robustness
Adversarial Frequency
Adversarial Severity
distance between input and adversarial example
DeepSafe
A cluster should have same label
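The point-wise robustness metrics above can be illustrated with a minimal sketch. It assumes a hypothetical one-dimensional classifier and a simple grid search for the nearest adversarial example. Adversarial frequency is read here as the share of inputs that have an adversarial example within a given budget, and adversarial severity as the mean distance to that example over those inputs. The names `toy_classifier`, `step` and `eps_budget` are illustrative and not from the slides.

```python
def pointwise_robustness(classify, x, step=0.01, max_eps=1.0):
    """Smallest perturbation size (on a grid) that flips classify(x), or None."""
    original = classify(x)
    n_steps = int(round(max_eps / step))
    for i in range(1, n_steps + 1):
        eps = i * step
        if classify(x + eps) != original or classify(x - eps) != original:
            return eps
    return None


def adversarial_frequency_and_severity(classify, inputs, eps_budget):
    """Frequency: share of inputs with an adversarial example within eps_budget.
    Severity: mean distance to that adversarial example, over those inputs only."""
    distances = []
    for x in inputs:
        rho = pointwise_robustness(classify, x, max_eps=eps_budget)
        if rho is not None:
            distances.append(rho)
    frequency = len(distances) / len(inputs)
    severity = sum(distances) / len(distances) if distances else 0.0
    return frequency, severity


# Hypothetical 1-D classifier with a decision boundary at 0.5.
def toy_classifier(x):
    return int(x > 0.5)


inputs = [0.1, 0.455, 0.555, 0.9]
frequency, severity = adversarial_frequency_and_severity(
    toy_classifier, inputs, eps_budget=0.1
)
print(frequency, severity)
```

Only the two inputs close to the boundary have an adversarial example within the budget, so they alone contribute to the severity.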
Overfitting
Cross-validation
fails if test data is unrepresentative of potential unseen data
Perturbed Model Validation (PMV)
Both underfit and overfit models are less sensitive to noise injected into the training data
Generate adversarial examples from test data
If the error increases on adversarial examples, the model is overfitting
Efficiency
Training data reduction approaches
Smaller subset of data
Faster ML testing
Test Oracle
Determining if a test has passed
Pseudo Oracle
Metamorphic Relations
Test Adequacy
Coverage
Test Input Generation
Natural Inputs and Adversarial Inputs
Perturbed Natural inputs
Neuron Coverage
Transformed images
Detected 1000 erroneous behaviours in autonomous vehicles (AV)
GAN
Image-to-image Transformation
Simulate Weather conditions
Test Input Generation
Inputs for text classification
Grammar
Distance between inputs
NLI
Mutate sentences for robustness testing
DeepCheck
Translates a DNN into a program (to enable symbolic execution)
LIME, DeepConcolic
Metamorphic Test Oracles
sin(x) = sin(pi - x)
Same input in different forms must yield same outputs
Coarse-grained Data Transformation
Enlarge dataset, Change data order
Fine-grained Data Transformation
Mutate attributes, pixels
...
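The relation sin(x) = sin(pi - x) can serve as an oracle without knowing the expected value of any single call. The sketch below checks it on random inputs. `buggy_sin` is a hypothetical faulty implementation, and the tolerance and input range are assumptions chosen for illustration.

```python
import math
import random


def check_sine_relation(sin_impl, trials=1000, tol=1e-9, seed=0):
    """Return the inputs for which sin_impl(x) != sin_impl(pi - x)."""
    rng = random.Random(seed)
    violations = []
    for _ in range(trials):
        x = rng.uniform(-10.0, 10.0)
        if abs(sin_impl(x) - sin_impl(math.pi - x)) > tol:
            violations.append(x)
    return violations


def buggy_sin(x):
    # Hypothetical faulty implementation: a truncated Taylor series,
    # which is only accurate close to 0.
    return x - x ** 3 / 6


sine_violations = check_sine_relation(math.sin)
buggy_violations = check_sine_relation(buggy_sin)
print(len(sine_violations), len(buggy_violations))
```

The check never needs the true sine value: a violation of the relation is enough to flag the implementation.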
Metamorphic Test Oracles
Perturbed Model Validation (PMV)
Inject noise into the training data
Overfit or underfit models are less sensitive to the injected noise
Image Transformation to weather conditions
Steering angle shouldn't change significantly
Classification consistency among similar images
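The steering-angle relation can be sketched as a check that a transformed image does not move the predicted angle by more than a tolerance. Everything in this example is a stand-in: `darken` is a crude substitute for a weather transformation, both models are hypothetical toys, and `max_delta` is an arbitrary threshold.

```python
def darken(image, factor=0.7):
    """Crude stand-in for a weather transformation: scale every pixel."""
    return [[pixel * factor for pixel in row] for row in image]


def normalized_model(image):
    """Hypothetical model: steer toward the brighter half, relative to total brightness."""
    width = len(image[0])
    left = sum(sum(row[: width // 2]) for row in image)
    right = sum(sum(row[width // 2:]) for row in image)
    total = left + right
    return 0.0 if total == 0 else (right - left) / total


def unnormalized_model(image):
    """Hypothetical model that depends on absolute brightness."""
    width = len(image[0])
    left = sum(sum(row[: width // 2]) for row in image)
    right = sum(sum(row[width // 2:]) for row in image)
    return right - left


def steering_violations(model, images, transform, max_delta=0.1):
    """Return (index, delta) for images whose angle changes by more than max_delta."""
    violations = []
    for index, image in enumerate(images):
        delta = abs(model(image) - model(transform(image)))
        if delta > max_delta:
            violations.append((index, delta))
    return violations


images = [
    [[0.2, 0.4, 0.9, 0.8], [0.1, 0.3, 0.7, 0.6]],
    [[0.9, 0.8, 0.1, 0.2], [0.7, 0.9, 0.3, 0.1]],
]
robust_violations = steering_violations(normalized_model, images, darken)
fragile_violations = steering_violations(unnormalized_model, images, darken)
print(robust_violations, fragile_violations)
```

The brightness-normalized model satisfies the relation, while the model that depends on absolute brightness violates it on both images.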
Metamorphic Test Oracles
Metamorphic relations between datasets
P(training_data) == P(new_data)
Automate Metamorphic relations to detect bugs
Amsterdam
Corduroy
Cross-referencing as Oracle
Differential Testing
Two similar applications should produce similar outputs for the same input
Mirror program
Program that represents the training data
Similar behaviour on test data
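A minimal sketch of differential testing, assuming two independent implementations of the same computation (population variance here, chosen only for illustration). `variance_two_pass` and `variance_buggy` are hypothetical implementations written for this example, and `statistics.pvariance` from the Python standard library acts as the cross-reference.

```python
import math
import random
import statistics


def variance_two_pass(values):
    """Population variance, computed in two passes."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)


def variance_buggy(values):
    """Hypothetical bug: divides by n - 1 instead of n."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / (len(values) - 1)


def differential_test(impl_a, impl_b, inputs, rel_tol=1e-9):
    """Return the inputs on which the two implementations disagree."""
    disagreements = []
    for values in inputs:
        a, b = impl_a(values), impl_b(values)
        if not math.isclose(a, b, rel_tol=rel_tol, abs_tol=1e-12):
            disagreements.append((values, a, b))
    return disagreements


rng = random.Random(0)
inputs = [
    [rng.uniform(-100.0, 100.0) for _ in range(rng.randint(5, 20))]
    for _ in range(50)
]
agreeing = differential_test(variance_two_pass, statistics.pvariance, inputs)
disagreeing = differential_test(variance_buggy, statistics.pvariance, inputs)
print(len(agreeing), len(disagreeing))
```

A disagreement does not say which implementation is wrong, only that at least one of them deserves inspection.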
Metrics for Oracle Design
Flickering Test
along the boundary of the road
Nested car boundary
Test Adequacy
Test Coverage
Degree to which the source code is executed by the test suite
Neuron Coverage
Ratio of unique neurons activated by the test inputs to all neurons
A neuron is activated when its output exceeds a threshold
MC/DC coverage
Instead of the change of a boolean variable, observe
a change in a neuron's sign, value, or distance
in response to a change in the test input
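Neuron coverage as described above can be sketched in a few lines. The sketch assumes that the activation trace of every test input is already available as a list of neuron outputs. The traces and the threshold below are made-up values.

```python
def neuron_coverage(activation_traces, threshold=0.0):
    """Share of neurons whose output exceeds threshold for at least one test input.

    activation_traces: one list of neuron outputs per test input,
    all of the same length.
    """
    n_neurons = len(activation_traces[0])
    covered = set()
    for trace in activation_traces:
        for neuron, value in enumerate(trace):
            if value > threshold:
                covered.add(neuron)
    return len(covered) / n_neurons


# Two hypothetical test inputs observed on a four-neuron network.
traces = [
    [0.9, 0.0, 0.1, 0.0],
    [0.7, 0.6, 0.0, 0.0],
]
coverage = neuron_coverage(traces, threshold=0.5)
print(coverage)
```

A neuron counts once no matter how many inputs activate it, so adding inputs that exercise the same neurons does not raise the coverage.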
Test Adequacy
Layer-level coverage
Top hyperactive neurons and their combinations
Surprise Adequacy
Kernel Density Estimation
approximate likelihood of system having seen similar input during training
Neuron Activation trace (vector)
Distance between input and training data
Mutation Testing
Inject faults
Minor perturbations on decision boundaries of DNN
Mutation score
Ratio of test instances whose results changed to all test instances
Rule-based Adequacy Test
Training should be reproducible
All features should be useful
A simpler equally successful model shouldn't exist
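The mutation score mentioned above can be sketched as follows. The example assumes that a test instance counts as "changed" when at least one mutant disagrees with the original model on it. That reading, the toy threshold classifier and its mutants are all assumptions made for illustration.

```python
def mutation_score(original_model, mutant_models, test_inputs):
    """Share of test inputs on which at least one mutant changes the result."""
    changed = 0
    for x in test_inputs:
        expected = original_model(x)
        if any(mutant(x) != expected for mutant in mutant_models):
            changed += 1
    return changed / len(test_inputs)


# Hypothetical original model: a threshold classifier.
def original(x):
    return int(x > 0.5)


# Hypothetical mutants: small shifts of the decision boundary.
def mutant_high(x):
    return int(x > 0.55)


def mutant_low(x):
    return int(x > 0.45)


test_inputs = [0.1, 0.48, 0.52, 0.9]
score = mutation_score(original, [mutant_high, mutant_low], test_inputs)
print(score)
```

Only the inputs near the decision boundary are sensitive to the mutants, which is why a test suite made of inputs far from the boundary would score low.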
Test Prioritization
Based on
Cross-entropy
Surprisal
Bayesian Uncertainty
Adversarial inputs are better
Sampling technique guided by last layer of neurons
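Prioritization by model uncertainty can be sketched as a sort over test inputs. This example does not reproduce the cross-entropy, surprisal or Bayesian uncertainty metrics named above. It uses the negative log of the top softmax probability as a simple stand-in score, and the softmax outputs are made-up values.

```python
import math


def uncertainty_score(probabilities):
    """Stand-in score: negative log of the highest class probability."""
    return -math.log(max(probabilities))


def prioritize(softmax_outputs):
    """Return test input indices, most uncertain first."""
    return sorted(
        range(len(softmax_outputs)),
        key=lambda i: uncertainty_score(softmax_outputs[i]),
        reverse=True,
    )


# Hypothetical softmax outputs for three test inputs.
outputs = [
    [0.98, 0.01, 0.01],
    [0.40, 0.35, 0.25],
    [0.70, 0.20, 0.10],
]
order = prioritize(outputs)
print(order)
```

Inputs at the front of the ordering are the ones the model is least sure about, so they are the first candidates for labelling and inspection.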
Debug and Repair
Training on generated inputs improves correctness
Resample to influence faulty neurons
tfdbg
Debugger for TensorFlow models
Analyzer, NodeStepper, RunStepper
PALM
Meta model : partitions training data
Sub-models : approximate patterns in partitions
Which training data impacts prediction most?
Testing Frameworks
CNN testing
Security Testing
Fairness Testing
ML Testing Components
Bugs in Training data
Data Linter
Miscoded data
Outliers
Duplicate/missing examples
MODE : Resample to influence faulty neurons
Test Data
Small network that identifies adversarial examples
Insufficiency of test data is a data bug
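The kinds of training data issues listed above (missing examples, outliers, duplicates) can be detected with simple checks. The sketch below is not the Data Linter implementation. It is a minimal stand-in that uses a z-score rule for outliers, and the threshold and sample data are assumptions.

```python
import statistics


def lint_column(values, z_threshold=2.5):
    """Report indices of missing entries (None) and z-score outliers in a numeric column."""
    missing = [i for i, v in enumerate(values) if v is None]
    present = [v for v in values if v is not None]
    mean = statistics.mean(present)
    stdev = statistics.pstdev(present)
    outliers = [
        i
        for i, v in enumerate(values)
        if v is not None and stdev > 0 and abs(v - mean) / stdev > z_threshold
    ]
    return {"missing": missing, "outliers": outliers}


def duplicate_rows(rows):
    """Return (first_index, duplicate_index) pairs for repeated rows."""
    seen = {}
    duplicates = []
    for i, row in enumerate(rows):
        key = tuple(row)
        if key in seen:
            duplicates.append((seen[key], i))
        else:
            seen[key] = i
    return duplicates


# Hypothetical "age" column with one missing value and one miscoded value.
ages = [31, 29, None, 35, 30, 28, 33, 32, 27, 34, 30, 31, 290]
report = lint_column(ages)
repeats = duplicate_rows([(1, "a"), (2, "b"), (1, "a")])
print(report, repeats)
```

A z-score rule is sensitive to the outlier itself inflating the standard deviation, so a more robust statistic may be preferable on small columns.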
ML Testing Components
Bugs in Data
Skew Detection
KDE
Neural Activation trace
Data Bug Detection Framework
Data Validation System
Integral part of Google's TFX
Metric : distance between training data and new data
ActiveClean : Iterative data cleaning
BoostClean : Domain value violation
...
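A distance between training data and new data can be sketched for a single categorical feature. The example uses the L-infinity distance between the two category distributions as one possible metric. The choice of metric, the threshold of 0.1 and the sample data are assumptions, not details taken from the slides.

```python
from collections import Counter


def category_distribution(values):
    """Relative frequency of each category."""
    counts = Counter(values)
    total = len(values)
    return {category: count / total for category, count in counts.items()}


def linf_distance(train_values, new_values):
    """Largest absolute difference in category frequency between two datasets."""
    p = category_distribution(train_values)
    q = category_distribution(new_values)
    return max(abs(p.get(c, 0.0) - q.get(c, 0.0)) for c in set(p) | set(q))


# Hypothetical feature values seen at training time and at serving time.
train = ["cat"] * 50 + ["dog"] * 50
serving = ["cat"] * 80 + ["dog"] * 20

skew = linf_distance(train, serving)
skew_detected = skew > 0.1  # hypothetical alert threshold
print(skew, skew_detected)
```

The same comparison against identical data yields a distance of zero, which makes the check easy to validate before wiring it into a pipeline.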
Data Bug Detection Framework
Automatic 'Unit' Testing
Common, user-defined quality constraints for data testing
AlphaClean
Greedy tree search to automatically tune the parameters of data cleaning pipelines
Bug Detection in Learning Program
TensorFlow API updates
Most common bug
WALA : Static analysis of tensor behaviour in TensorFlow
Bug Detection in Frameworks
Incorrect implementation of algorithms
Differential testing across multiple implementations
Metamorphic testing
normalize/scale input data
MutPy
Inject mutants to simulate implementation bugs
Autonomous Driving
Disengagement
64% - bugs in ML system
44% Poor Image classification performance
20% Control and Decision system bug
DeepRoad
realistic images for rainy and snowy conditions
DeepBillboard
Trigger steering errors
Continuous and Realistic physical world tests
Machine Translation
Metamorphic Relations
Some changes should not affect the overall structure of translated output
Over-translation : duplicated phrases
Under-translation : missing phrases
NLI
Generate Sentence Mutants
Rule-based adversaries
Semantic Understanding Test
Mutation by swapping sentences
Accuracy should be equal for contradiction
Accuracy should decrease for entailment
Challenges & Opportunities