Machine Learning Testing
Survey, Landscapes and Horizons
Components
- Required Conditions
- ML Items
- Testing Activities
Required Conditions
- Correctness
- Robustness
- Fairness
- Privacy
- Efficiency
ML Items
- Framework
- Data
- Learning Algorithm
Testing Activities
- Test Input Generation
- Test Oracle Identification
- Test Adequacy Evaluation
- Bug Triage
Testing Properties
- Correctness : Metric (Accuracy, F1, ... )
- Overfitting Degree
- Robustness (Noisy input, stress)
- Adversarial Robustness
- local
- global
- ...
Testing Properties
- Security
- Model Extraction
- Data Privacy
- Linkage Attack
- Efficiency
- Slow / Infinite Loop
- VGG-19 in mobile device
- Interpretability
- Transparency
- Post-hoc explanations
Robustness
- Correctness in the presence of noise
- DeepFool
- Point-wise robustness
- Adversarial Frequency
- Adversarial Severity
- distance between input and adversarial example
- DeepSafe
- A cluster should have same label
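The frequency and severity metrics above can be illustrated on a toy model. This is a minimal sketch, assuming a linear classifier sign(w.x + b), for which the smallest L-infinity perturbation that reaches the decision boundary has the closed form |w.x + b| / ||w||_1. The function names and the choice of model are illustrative assumptions, not part of the surveyed tools.

```python
def linf_radius(w, b, x):
    """Smallest L-infinity perturbation that moves x onto the
    decision boundary of the linear classifier sign(w.x + b)."""
    margin = abs(sum(wi * xi for wi, xi in zip(w, x)) + b)
    return margin / sum(abs(wi) for wi in w)


def adversarial_frequency_and_severity(w, b, points, delta):
    """Frequency: share of points with an adversarial example within delta.
    Severity: mean distance to that adversarial example over those points
    (None when no point is fragile)."""
    radii = [linf_radius(w, b, x) for x in points]
    fragile = [r for r in radii if r <= delta]
    frequency = len(fragile) / len(points)
    severity = sum(fragile) / len(fragile) if fragile else None
    return frequency, severity


points = [(0.1, 0.1), (2.0, 2.0), (-0.2, 0.0), (3.0, -1.0)]
frequency, severity = adversarial_frequency_and_severity((1.0, 1.0), 0.0, points, delta=0.5)
```

A high frequency means many inputs are fragile; a small severity means the perturbations that break them are tiny.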
Overfitting
- Cross-validation
- fails if test data is unrepresentative of potential unseen data
- Perturbed Model Validation (PMV)
- Both underfitted and overfitted models are less sensitive to noise injected into the training data
- Generate adversarial examples from test data
- If error increases on adversarial examples, overfitting is detected
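The PMV idea above can be sketched with two toy learners. This is a rough illustration, not the published procedure: the stump, the memoriser, the 20% noise ratio and the synthetic threshold data are all assumptions made for the example.

```python
import random


def accuracy(model, xs, ys):
    return sum(model(x) == y for x, y in zip(xs, ys)) / len(xs)


def flip_labels(ys, ratio, rng):
    """Return a copy of the binary labels with a fraction `ratio` flipped."""
    noisy = list(ys)
    for i in rng.sample(range(len(ys)), int(ratio * len(ys))):
        noisy[i] = 1 - noisy[i]
    return noisy


def fit_stump(xs, ys):
    """Well-matched learner for threshold data: predict 1 when x > t."""
    best_t = max(xs, key=lambda t: sum(int(x > t) == y for x, y in zip(xs, ys)))
    return lambda x: int(x > best_t)


def fit_memoriser(xs, ys):
    """Overfitting learner: a lookup table of the training labels."""
    table = dict(zip(xs, ys))
    return lambda x: table[x]


def training_accuracy_drop(fit, xs, ys, ratio, seed=0):
    """Training accuracy on clean labels minus training accuracy
    after retraining on labels with injected noise."""
    rng = random.Random(seed)
    noisy = flip_labels(ys, ratio, rng)
    return accuracy(fit(xs, ys), xs, ys) - accuracy(fit(xs, noisy), xs, noisy)


xs = [i / 100 for i in range(100)]
ys = [int(x > 0.5) for x in xs]
stump_drop = training_accuracy_drop(fit_stump, xs, ys, ratio=0.2)
memoriser_drop = training_accuracy_drop(fit_memoriser, xs, ys, ratio=0.2)
```

The memoriser fits the noisy labels perfectly, so its training accuracy does not move; the well-matched stump cannot fit the noise and its training accuracy drops. A small drop is therefore a warning sign.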
Efficiency
- Training data reduction approaches
- Smaller subset of data
- Faster ML testing
Test Oracle
- Determining if a test has passed
- Pseudo Oracle
- Metamorphic Relations
Test Adequacy
- Coverage
Test Input Generation
- Natural Inputs and Adversarial Inputs
- Perturbed Natural inputs
- Neuron Coverage
- Transformed images
- Detected 1000 erroneous behaviours in autonomous vehicles (AV)
- GAN
- Image-to-image Transformation
- Simulate Weather conditions
Test Input Generation
- Inputs for text classification
- Grammar
- Distance between inputs
- NLI
- Mutate sentences for robustness testing
- DeepCheck
- DNN -> Program
- LIME, DeepConcolic
Metamorphic Test Oracles
- sin(x) = sin(pi - x)
- Same input in different forms must yield same outputs
- Coarse-grained Data Transformation
- Enlarge dataset, Change data order
- Fine-grained Data Transformation
- Mutate attributes, pixels
- ...
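Two of the relations above can be checked without knowing any expected output. The sketch below tests sin(x) = sin(pi - x) on random inputs, then applies a coarse-grained transformation (changing data order) to a toy learner. The nearest-centroid classifier and all function names are assumptions chosen to keep the example self-contained.

```python
import math
import random


def check_sin_relation(trials=1000, seed=0, tol=1e-9):
    """Metamorphic relation: sin(x) must equal sin(pi - x)."""
    rng = random.Random(seed)
    xs = [rng.uniform(-10.0, 10.0) for _ in range(trials)]
    return all(math.isclose(math.sin(x), math.sin(math.pi - x), abs_tol=tol) for x in xs)


def train_centroids(samples):
    """samples: list of (x, label). Returns {label: mean of x}."""
    groups = {}
    for x, label in samples:
        groups.setdefault(label, []).append(x)
    # fsum is exactly rounded, so the mean does not depend on sample order.
    return {label: math.fsum(v) / len(v) for label, v in groups.items()}


def predict(centroids, x):
    # Ties are broken by label so the result never depends on dict order.
    return min(centroids, key=lambda label: (abs(x - centroids[label]), label))


def check_order_invariance(samples, probes, seed=0):
    """Metamorphic relation: shuffling the training data must not
    change any prediction."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    original, permuted = train_centroids(samples), train_centroids(shuffled)
    return all(predict(original, x) == predict(permuted, x) for x in probes)
```

A `False` from either check signals a bug in the implementation under test, even though no ground-truth label was consulted.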
Metamorphic Test Oracles
- Perturbed Model Validation (PMV)
- inject noise into training data
- An overfitted model's training accuracy is less sensitive to the noise
- Image Transformation to weather conditions
- Steering angle shouldn't change significantly
- Classification consistency among similar images
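The steering-angle relation can be written as a small oracle. In the sketch below, `toy_steering_angle` is a hypothetical stand-in for a real driving model and `add_fog` is a crude stand-in for a weather transformation; both are assumptions made only to show the shape of the check. Images are lists of pixel rows and are assumed to be at least two pixels wide.

```python
def toy_steering_angle(image):
    """Hypothetical model: steer toward the horizontal brightness centroid.
    Returns an angle in degrees within [-45, 45]."""
    width = len(image[0])
    col_sums = [sum(row[c] for row in image) for c in range(width)]
    total = sum(col_sums)
    if total == 0:
        return 0.0
    centroid = sum(c * s for c, s in enumerate(col_sums)) / total
    centre = (width - 1) / 2
    return 45.0 * (centroid - centre) / centre


def add_fog(image, strength):
    """Blend every pixel toward white (255); strength is in [0, 1]."""
    return [[(1 - strength) * p + strength * 255 for p in row] for row in image]


def violates_relation(model, image, transform, tolerance_deg):
    """Metamorphic oracle: the angle must not change by more than the tolerance."""
    return abs(model(image) - model(transform(image))) > tolerance_deg


image = [[0, 0, 255], [0, 0, 255]]
light = violates_relation(toy_steering_angle, image, lambda im: add_fog(im, 0.05), 10.0)
heavy = violates_relation(toy_steering_angle, image, lambda im: add_fog(im, 0.5), 10.0)
```

With these values, light fog stays within the tolerance while heavy fog shifts the toy model's angle far enough to be reported as erroneous behaviour.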
Metamorphic Test Oracles
- Metamorphic relations between datasets
- P(training_data) == P(new_data)
- Automate Metamorphic relations to detect bugs
- Amsterdam
- Corduroy
Cross-referencing as Oracle
- Differential Testing
- 2 similar applications should respond to an input with similar outputs
- Mirror program
- Program that represents the training data
- Similar behaviour on test data
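Differential testing needs only two implementations that are supposed to agree. As a minimal sketch, the example below cross-references a naive softmax against a numerically stable one; the choice of softmax and the helper name `differential_test` are assumptions for illustration.

```python
import math


def softmax_naive(z):
    exps = [math.exp(v) for v in z]  # overflows for large inputs
    total = sum(exps)
    return [e / total for e in exps]


def softmax_stable(z):
    m = max(z)  # shifting by the maximum keeps every exponent <= 0
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


def differential_test(impl_a, impl_b, inputs, tol=1e-9):
    """Report every input on which the two implementations disagree
    or on which one of them fails."""
    disagreements = []
    for z in inputs:
        try:
            a, b = impl_a(z), impl_b(z)
        except (OverflowError, ZeroDivisionError) as exc:
            disagreements.append((z, repr(exc)))
            continue
        if any(abs(p - q) > tol for p, q in zip(a, b)):
            disagreements.append((z, "outputs differ"))
    return disagreements


report = differential_test(softmax_naive, softmax_stable,
                           [[0.0, 1.0, 2.0], [1000.0, 1001.0]])
```

The two versions agree on small inputs, and the large input exposes the overflow in the naive version. Neither implementation is assumed correct; a disagreement only says that at least one is wrong.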
Metrics for Oracle Design
- Flickering Test
- along the boundary of the road
- Nested car boundary
Test Adequacy
- Test Coverage
- Degree to which source code is executed by test suite
- Neuron Coverage
- (ratio) Unique neurons activated by test inputs
- Activation => output of neuron > threshold
- MC/DC coverage
- Change of boolean variable
- sign, value, distance
- for change in test input
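Neuron coverage as defined above is straightforward to compute. The sketch below uses a tiny hand-written ReLU network; the weights, the threshold of 0 and the function names are illustrative assumptions, and published tools may scale activations per layer before thresholding.

```python
def relu_layer(weights, biases, inputs):
    """One dense ReLU layer; `weights` holds one row per neuron."""
    return [max(0.0, sum(w * x for w, x in zip(row, inputs)) + b)
            for row, b in zip(weights, biases)]


def neuron_coverage(layers, test_inputs, threshold=0.0):
    """Ratio of unique neurons whose output exceeds `threshold`
    for at least one test input."""
    activated = set()
    total = sum(len(biases) for _, biases in layers)
    for x in test_inputs:
        acts = x
        for layer_idx, (weights, biases) in enumerate(layers):
            acts = relu_layer(weights, biases, acts)
            activated.update((layer_idx, n) for n, a in enumerate(acts) if a > threshold)
    return len(activated) / total


# Hypothetical 2-2-2 network used only for the demonstration.
layers = [
    ([[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0]),
    ([[1.0, 1.0], [-1.0, -1.0]], [-1.0, 0.5]),
]
coverage_one = neuron_coverage(layers, [[1.0, 0.0]])
coverage_all = neuron_coverage(layers, [[1.0, 0.0], [0.0, 2.0], [0.0, 0.0]])
```

Adding inputs that exercise different neurons raises the ratio, which is the signal coverage-guided input generators try to maximise.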
Test Adequacy
- Layer-level coverage
- Top hyperactive neurons and their combinations
- Surprise Adequacy
- Kernel Density Estimation
- approximate likelihood of system having seen similar input during training
- Neuron Activation trace (vector)
- Distance between input and training data
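A likelihood-based surprise score can be sketched with a hand-rolled Gaussian KDE over activation traces. The traces below are made-up vectors, and the fixed bandwidth is an assumption; in practice the traces would come from a chosen layer of the model under test.

```python
import math


def kde_density(trace, training_traces, bandwidth):
    """Gaussian kernel density estimate of `trace` under the training traces."""
    dim = len(trace)
    norm = (2 * math.pi * bandwidth ** 2) ** (dim / 2)
    total = 0.0
    for t in training_traces:
        sq_dist = sum((a - b) ** 2 for a, b in zip(trace, t))
        total += math.exp(-sq_dist / (2 * bandwidth ** 2))
    return total / (len(training_traces) * norm)


def likelihood_surprise(trace, training_traces, bandwidth=1.0):
    """Negative log density: higher means the input looks less like
    anything seen during training."""
    density = kde_density(trace, training_traces, bandwidth)
    return math.inf if density == 0.0 else -math.log(density)


training_traces = [(0.0, 0.0), (0.1, -0.1), (-0.1, 0.1), (0.2, 0.0)]
familiar = likelihood_surprise((0.05, 0.0), training_traces)
surprising = likelihood_surprise((3.0, 3.0), training_traces)
```

A test suite covering a wide range of surprise values exercises both familiar and unfamiliar regions of the input space.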
Mutation Testing
- Inject faults
- Minor perturbations on decision boundaries of DNN
- Mutation score
- (ratio) #instances where results changed
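The mutation score above can be computed for any model whose parameters can be perturbed. The sketch below uses a two-weight linear classifier and Gaussian weight noise as the fault-injection operator; both are simplifying assumptions, and real DNN mutation tools offer richer operators.

```python
import random


def predict(weights, x):
    return int(sum(w * xi for w, xi in zip(weights, x)) > 0)


def make_mutants(weights, n_mutants, sigma, seed=0):
    """Inject faults by adding Gaussian noise to every weight."""
    rng = random.Random(seed)
    return [[w + rng.gauss(0.0, sigma) for w in weights] for _ in range(n_mutants)]


def mutation_score(weights, mutants, test_inputs):
    """Ratio of mutants whose predictions differ from the original
    model on at least one test input."""
    original = [predict(weights, x) for x in test_inputs]
    killed = sum(1 for m in mutants
                 if [predict(m, x) for x in test_inputs] != original)
    return killed / len(mutants)


weights = [1.0, -1.0]
mutants = [[1.0, -1.2], [1.0, -1.0]]           # second mutant is equivalent
far_only = mutation_score(weights, mutants, [(5.0, -5.0)])
with_boundary = mutation_score(weights, mutants, [(1.0, 0.9), (5.0, -5.0)])
```

Only the test set that includes an input near the decision boundary kills the first mutant, which is why boundary inputs make a test suite more adequate under this criterion.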
Rule-based Adequacy Test
- Training should be reproducible
- All features should be useful
- A simpler equally successful model shouldn't exist
Test Prioritization
- Based on
- Cross-entropy
- Surprisal
- Bayesian Uncertainty
- Adversarial inputs are better
- Sampling technique guided by last layer of neurons
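One simple way to rank test inputs is by the uncertainty of the model's output. The sketch below uses the Shannon entropy of the predicted class probabilities as a stand-in for the cross-entropy, surprisal and Bayesian measures listed above; the probability vectors are invented for the example.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of a predicted class distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def prioritise(predictions):
    """predictions: {input_id: probability vector}.
    Returns the ids ordered from most to least uncertain."""
    return sorted(predictions, key=lambda k: entropy(predictions[k]), reverse=True)


predictions = {
    "a": [0.98, 0.01, 0.01],
    "b": [0.34, 0.33, 0.33],
    "c": [0.70, 0.20, 0.10],
}
order = prioritise(predictions)
```

Labelling effort then goes first to the inputs on which the model is least sure, which are the most likely to reveal faults.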
Debug and Repair
- Training on generated inputs improves correctness
- Resample to influence faulty neurons
- tfdbg
- Debugger for TensorFlow models
- Analyzer, NodeStepper, RunStepper
- PALM
- Meta model : partitions training data
- Sub-models : approximate patterns in partitions
- Which training data impacts prediction most?
Testing Frameworks
- CNN testing
- Security Testing
- Fairness Testing
ML Testing Components
- Bugs in Training data
- Data Linter
- Miscoded data
- Outliers
- Duplicate/missing examples
- MODE : Resample to influence faulty neurons
- Test Data
- Small network that identifies adversarial examples
- Insufficiency of test data is a data bug
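The data bugs listed above can be caught with a few lightweight checks. This is a minimal sketch in the spirit of a data linter, not the Data Linter tool itself: the function name, the row format and the median-absolute-deviation outlier rule are all assumptions.

```python
import statistics


def lint_rows(rows, field, mad_factor=5.0):
    """Flag duplicate rows, missing values, numbers stored as strings,
    and outliers in one numeric field. Returns row indices per finding."""
    findings = {"duplicates": [], "missing": [], "miscoded": [], "outliers": []}
    seen = set()
    numeric = []  # (row index, numeric value)
    for i, row in enumerate(rows):
        key = tuple(sorted(row.items()))
        if key in seen:
            findings["duplicates"].append(i)
        seen.add(key)
        value = row.get(field)
        if value is None:
            findings["missing"].append(i)
            continue
        if isinstance(value, str):
            try:
                value = float(value)
            except ValueError:
                continue  # genuinely non-numeric text is out of scope here
            findings["miscoded"].append(i)
        numeric.append((i, value))
    if numeric:
        median = statistics.median(v for _, v in numeric)
        mad = statistics.median(abs(v - median) for _, v in numeric)
        if mad > 0:
            findings["outliers"] = [i for i, v in numeric
                                    if abs(v - median) > mad_factor * mad]
    return findings


rows = [
    {"age": 10, "label": 0}, {"age": 11, "label": 1},
    {"age": 11, "label": 1},    # duplicate example
    {"age": None, "label": 0},  # missing value
    {"age": "9", "label": 0},   # number miscoded as a string
    {"age": 10, "label": 1}, {"age": 12, "label": 0},
    {"age": 500, "label": 1},   # outlier
]
report = lint_rows(rows, "age")
```

Each finding is a hint for a human to inspect, not proof of a bug; an outlier may be a valid rare example.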
ML Testing Components
- Bugs in Data
- Skew Detection
- KDE
- Neural Activation trace
Data Bug Detection Framework
- Data Validation System
- Integral part of Google's TFX
- Metric : distance between training data and new data
- ActiveClean : Iterative data cleaning
- BoostClean : Domain value violation
- ...
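A distance between training data and new data can be as simple as comparing value frequencies. The sketch below uses the L-infinity distance between the empirical distributions of one categorical feature; this particular metric, the threshold and the function names are assumptions for illustration rather than a description of any specific system.

```python
from collections import Counter


def linf_distance(train_values, new_values):
    """Largest absolute difference in relative frequency of any value
    between the training data and the new data."""
    p, q = Counter(train_values), Counter(new_values)
    n_p, n_q = len(train_values), len(new_values)
    return max(abs(p[v] / n_p - q[v] / n_q) for v in set(p) | set(q))


def has_skew(train_values, new_values, threshold=0.1):
    return linf_distance(train_values, new_values) > threshold


train = ["a"] * 8 + ["b"] * 2
serving = ["a"] * 5 + ["b"] * 5
skewed = has_skew(train, serving)
```

A value that appears only in the new data contributes its full frequency to the distance, so unseen categories are flagged as well.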
Data Bug Detection Framework
- Automatic 'Unit' Testing
- Common, user-defined quality constraints for data testing
- AlphaClean
- Greedy tree search to automatically tune the parameters of data cleaning pipelines
Bug Detection in Learning Program
- TensorFlow API update
- Most common bug
- WALA : Static analysis of tensor behaviour in TensorFlow
Bug Detection in Frameworks
- Incorrect implementation of algorithms
- Differential testing across multiple implementations
- Metamorphic testing
- normalize/scale input data
- MutPy
- Inject mutants to simulate implementation bugs
Autonomous Driving
- Disengagement
- 64% - bugs in ML system
- 44% Poor Image classification performance
- 20% Control and Decision system bug
- DeepRoad
- realistic images for rainy and snowy conditions
- DeepBillboard
- Trigger steering errors
- Continuous and Realistic physical world tests
Machine Translation
- Metamorphic Relations
- Some changes should not affect the overall structure of translated output
- Over Translation : duplicates
- Under translation : Missing phrases
NLI
- Generate Sentence Mutants
- Rule-based adversaries
- Semantic Understanding Test
- Mutation by swapping sentences
- Accuracy should stay comparable for contradiction pairs
- Accuracy should decrease for entailment pairs
Challenges & Opportunities
By Suriyadeepan R