Machine Learning Testing

Survey, Landscapes and Horizons

Components

  • Required Conditions
  • ML Items
  • Testing Activities

Required Conditions

  • Correctness
  • Robustness
  • Fairness
  • Privacy
  • Efficiency

ML Items

  • Framework
  • Data
  • Learning Algorithm

Testing Activities

  • Test Input Generation
  • Test Oracle Identification
  • Test Adequacy Evaluation
  • Bug Triage

Testing Properties

  • Correctness : Metric (Accuracy, F1, ... )
  • Overfitting Degree
  • Robustness (Noisy input, stress)
    • Adversarial Robustness
      • local 
      • global
  • ...

Testing Properties

  • Security
    • Model Extraction
  • Data Privacy 
    • Linkage Attack
  • Efficiency
    • Slow / Infinite Loop
    • VGG-19 on a mobile device
  • Interpretability
    • Transparency
    • Post-hoc explanations

Robustness

  • Correctness in the presence of noise
  • DeepFool
    • Point-wise robustness
    • Adversarial Frequency
    • Adversarial Severity
      • distance between input and adversarial example
  • DeepSafe
    • Inputs in the same cluster should share the same label
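
The three metrics listed above can be illustrated on a linear classifier, where the distance to the decision boundary has a closed form. This is a minimal sketch, assuming an L-infinity perturbation norm and a hypothetical classifier sign(w·x + b); the function names are illustrative and not taken from any tool named in the slides.

```python
def pointwise_robustness(w, b, x):
    """Smallest L-infinity perturbation that moves x onto the boundary
    of the linear classifier sign(w.x + b) (assumed model)."""
    margin = abs(sum(wi * xi for wi, xi in zip(w, x)) + b)
    return margin / sum(abs(wi) for wi in w)

def adversarial_frequency(w, b, inputs, eps):
    """Fraction of inputs that have an adversarial example within eps."""
    hits = [x for x in inputs if pointwise_robustness(w, b, x) <= eps]
    return len(hits) / len(inputs)

def adversarial_severity(w, b, inputs, eps):
    """Mean distance to the nearest adversarial example, taken over
    the inputs that are not eps-robust. None if all are robust."""
    dists = [pointwise_robustness(w, b, x) for x in inputs]
    hits = [d for d in dists if d <= eps]
    return sum(hits) / len(hits) if hits else None

w, b = [1.0, 1.0], 0.0
inputs = [[0.1, 0.1], [1.0, 1.0], [-0.2, 0.0], [2.0, -1.0]]
frequency = adversarial_frequency(w, b, inputs, eps=0.25)
severity = adversarial_severity(w, b, inputs, eps=0.25)
```

For a real DNN the pointwise robustness has no closed form and would have to be estimated, for example by an adversarial attack.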

Overfitting

  • Cross-validation
    • fails if test data is unrepresentative of potential unseen data
  • Perturbed Model Validation (PMV)
    • Both underfitted and overfitted models are less sensitive to injected noise than a well-fitted model
  • Generate adversarial examples from test data
    • If error increases on adversarial examples
      • Overfitting
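
The PMV idea can be sketched on toy learners: flip a fraction of the training labels, retrain, and compare training accuracy before and after. This is a simplified illustration, not the published procedure; the three learners, the 1-D dataset, and the single noise ratio are assumptions made for the example.

```python
import random

def make_data(n=200):
    xs = [(i + 0.5) / n for i in range(n)]
    ys = [1 if x > 0.5 else 0 for x in xs]
    return xs, ys

def flip_labels(ys, ratio, rng):
    """Inject label noise by flipping a fixed fraction of labels."""
    noisy = list(ys)
    for i in rng.sample(range(len(ys)), round(ratio * len(ys))):
        noisy[i] = 1 - noisy[i]
    return noisy

def fit_memorizer(xs, ys):
    # Overfitting learner: stores every training label.
    table = dict(zip(xs, ys))
    return lambda x: table[x]

def fit_threshold(xs, ys):
    # Well-matched learner: picks the best single cut point.
    def correct(t):
        return sum((x > t) == bool(y) for x, y in zip(xs, ys))
    best = max([0.0] + xs, key=correct)
    return lambda x: int(x > best)

def fit_constant(xs, ys):
    # Underfitting learner: always predicts the majority class.
    majority = int(2 * sum(ys) >= len(ys))
    return lambda x: majority

def training_accuracy(fit, xs, ys):
    model = fit(xs, ys)
    return sum(model(x) == y for x, y in zip(xs, ys)) / len(xs)

def accuracy_decrease_rate(fit, noise=0.3, seed=0):
    """Drop in training accuracy per unit of injected label noise."""
    xs, ys = make_data()
    rng = random.Random(seed)
    clean = training_accuracy(fit, xs, ys)
    noisy = training_accuracy(fit, xs, flip_labels(ys, noise, rng))
    return (clean - noisy) / noise
```

In this toy setting the memorizer fits the noise and the constant learner ignores the data, so both show a decrease rate near zero, while the threshold learner loses accuracy roughly in proportion to the noise.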

Efficiency

  • Training data reduction approaches
    • Smaller subset of data
    • Faster ML testing

Test Oracle

  • Determining if a test has passed
  • Pseudo Oracle
  • Metamorphic Relations

Test Adequacy

  • Coverage

Test Input Generation

  • Natural Inputs and Adversarial Inputs
  • Perturbed Natural inputs
  • Neuron Coverage
  • Transformed images
    • Detected 1000 erroneous behaviours in autonomous vehicles (AV)
  • GAN
    • Image-to-image Transformation
    • Simulate Weather conditions

Test Input Generation

  • Inputs for text classification
    • Grammar
    • Distance between inputs
  • NLI 
    • Mutate sentences for robustness testing
  • DeepCheck
    • DNN -> Program
  • LIME, DeepConcolic

Metamorphic Test Oracles

  • sin(x) = sin(pi - x)
  • Same input in different forms must yield same outputs
  • Coarse-grained Data Transformation
    • Enlarge dataset, Change data order
  • Fine-grained Data Transformation
    • Mutate attributes, pixels
  • ...
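
The sine relation above can be turned into an executable check that needs no expected output values. This is a minimal sketch; `buggy_sin` is a hypothetical faulty implementation added only to show a violation being caught.

```python
import math

def check_sine_relation(sin_impl, xs, tol=1e-9):
    """Return the inputs that violate sin(x) == sin(pi - x)."""
    violations = []
    for x in xs:
        if abs(sin_impl(x) - sin_impl(math.pi - x)) > tol:
            violations.append(x)
    return violations

def buggy_sin(x):
    # Hypothetical fault: truncated Taylor series, inaccurate away from 0.
    return x - x ** 3 / 6

inputs = [0.1 * k for k in range(-20, 21)]
good = check_sine_relation(math.sin, inputs)   # no violations expected
bad = check_sine_relation(buggy_sin, inputs)   # relation is broken
```

The check never states what sin(x) should be; it only requires the two related calls to agree, which is what makes it usable when no oracle is available.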

Metamorphic Test Oracles

  • Perturbed Model Validation (PMV)
    • Inject noise into training data
    • Overfitted models are less sensitive to the injected noise
  • Image Transformation to weather conditions
    • Steering angle shouldn't change significantly
  • Classification consistency among similar images

Metamorphic Test Oracles

  • Metamorphic relations between datasets
    • P(training_data) == P(new_data)
    • Automate Metamorphic relations to detect bugs
      • Amsterdam
      • Corduroy

Cross-referencing as Oracle

  • Differential Testing
    • Two similar applications should respond to an input with similar outputs
  • Mirror program
    • Program that represents the training data
    • Similar behaviour on test data
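
Differential testing can be sketched as a small harness that feeds the same random inputs to two implementations and records where they disagree. Both functions here are hypothetical stand-ins; `mse_under_test` contains a deliberately seeded fault.

```python
import random

def mse_reference(y_true, y_pred):
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def mse_under_test(y_true, y_pred):
    # Hypothetical second implementation with a seeded bug: divides by n - 1.
    n = len(y_true)
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / max(n - 1, 1)

def differential_test(impl_a, impl_b, trials=100, tol=1e-9, seed=0):
    """Run both implementations on random inputs; collect disagreements."""
    rng = random.Random(seed)
    disagreements = []
    for _ in range(trials):
        n = rng.randint(1, 10)
        y_true = [rng.uniform(-1, 1) for _ in range(n)]
        y_pred = [rng.uniform(-1, 1) for _ in range(n)]
        a, b = impl_a(y_true, y_pred), impl_b(y_true, y_pred)
        if abs(a - b) > tol * max(1.0, abs(a), abs(b)):
            disagreements.append((y_true, y_pred, a, b))
    return disagreements

found = differential_test(mse_reference, mse_under_test)
```

A disagreement shows that at least one implementation is wrong, but not which one; that still has to be decided by inspection.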

Metrics for Oracle Design

  • Flickering Test
    • along the boundary of the road
  • Nested car boundary

Test Adequacy

  • Test Coverage
    • Degree to which source code is executed by test suite
  • Neuron Coverage
    • Ratio of unique neurons activated by the test inputs to all neurons in the DNN
    • Activation => output of neuron > threshold
  • MC/DC-inspired coverage
    • MC/DC observes the change of a boolean variable
    • DNN criteria observe a sign, value, or distance change of a neuron
    • Captures causal changes in the test input
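
Neuron coverage as defined above can be computed directly from recorded activations. This is a minimal sketch on a hand-written two-layer ReLU network; the weights are made up for the example, and the raw activation is compared to the threshold without any per-layer scaling.

```python
def relu(v):
    return max(0.0, v)

def forward(x, layers):
    """Return the activations of every neuron, layer by layer."""
    activations = []
    for weights, biases in layers:
        x = [relu(sum(w * xi for w, xi in zip(row, x)) + b)
             for row, b in zip(weights, biases)]
        activations.append(x)
    return activations

def neuron_coverage(inputs, layers, threshold=0.0):
    """Ratio of neurons whose output exceeds threshold for some input."""
    total = sum(len(biases) for _, biases in layers)
    covered = set()
    for x in inputs:
        for li, layer_out in enumerate(forward(x, layers)):
            for ni, a in enumerate(layer_out):
                if a > threshold:
                    covered.add((li, ni))
    return len(covered) / total

# Hypothetical network: 2 inputs, 3 hidden neurons, 2 output neurons.
layers = [
    ([[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]], [0.0, 0.0, 0.0]),
    ([[1.0, 1.0, 0.0], [0.0, 0.0, 1.0]], [0.0, 0.0]),
]
partial = neuron_coverage([[1.0, 1.0]], layers)                # 3 of 5 neurons
full = neuron_coverage([[1.0, 1.0], [-1.0, -1.0]], layers)     # 5 of 5 neurons
```

Adding the second input activates the neurons the first one left silent, which is the signal coverage-guided input generation tries to maximize.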

Test Adequacy

  • Layer-level coverage
    • Top hyperactive neurons and their combinations
  • Surprise Adequacy
    • Kernel Density Estimation
      • approximate likelihood of system having seen similar input during training
    • Neuron Activation trace (vector)
      • Distance between input and training data
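
The likelihood-based variant can be sketched with a Gaussian kernel density estimate over activation traces: the lower the estimated density of a new trace, the more surprising the input. This is a simplified illustration; the fixed bandwidth, the two-dimensional traces, and the function names are assumptions, and no neuron selection or per-class grouping is performed.

```python
import math

def kde_density(point, traces, bandwidth=0.5):
    """Gaussian KDE estimate of the density of `point` given training traces."""
    d = len(point)
    norm = (2 * math.pi * bandwidth ** 2) ** (d / 2)
    total = 0.0
    for t in traces:
        sq_dist = sum((p - q) ** 2 for p, q in zip(point, t))
        total += math.exp(-sq_dist / (2 * bandwidth ** 2))
    return total / (len(traces) * norm)

def likelihood_surprise(point, traces, bandwidth=0.5):
    """Negative log density; the small constant avoids log(0)."""
    return -math.log(kde_density(point, traces, bandwidth) + 1e-300)

# Hypothetical activation traces recorded on training inputs.
training_traces = [[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [0.1, 0.1]]
familiar = likelihood_surprise([0.05, 0.05], training_traces)
novel = likelihood_surprise([3.0, 3.0], training_traces)
```

A test set whose inputs span a wide range of surprise values exercises the model more diversely than one clustered near the training data.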

Mutation Testing

  • Inject faults
  • Minor perturbations on decision boundaries of DNN
  • Mutation score
    • Ratio of test instances whose results changed to the total number of instances
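
The mutation score described above can be sketched for a linear classifier whose mutants shift the decision boundary slightly. The model, the mutants, and the choice to count an instance as changed when any mutant disagrees are assumptions made for this example.

```python
def predict(weights, bias, x):
    return int(sum(w * xi for w, xi in zip(weights, x)) + bias > 0)

def mutation_score(weights, bias, mutants, test_inputs):
    """Ratio of test inputs whose prediction changes under some mutant."""
    changed = 0
    for x in test_inputs:
        original = predict(weights, bias, x)
        if any(predict(mw, mb, x) != original for mw, mb in mutants):
            changed += 1
    return changed / len(test_inputs)

weights, bias = [1.0, -1.0], 0.0
# Mutants: the same weights with the boundary shifted by a small bias.
mutants = [([1.0, -1.0], 0.2), ([1.0, -1.0], -0.2)]

far_from_boundary = [[2.0, 0.0], [0.0, 2.0]]
near_boundary = [[0.1, 0.0], [0.0, 0.1]]
weak_suite = mutation_score(weights, bias, mutants, far_from_boundary)
strong_suite = mutation_score(weights, bias, mutants, near_boundary)
```

Inputs close to the decision boundary detect the mutants, while inputs far from it do not, so a higher score indicates a test suite that probes the boundary more closely.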

Rule-based Adequacy Test

  • Training should be reproducible
  • All features should be useful
  • A simpler equally successful model shouldn't exist

Test Prioritization

  • Based on 
    • Cross-entropy
    • Surprisal
    • Bayesian Uncertainty
  • Adversarial inputs are better
  • Sampling technique guided by last layer of neurons
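
Prioritization by model uncertainty can be sketched by scoring each test input from its softmax output and sorting. This is an illustrative sketch; the probability vectors are made up, and entropy is used here as a simple stand-in for an uncertainty estimate rather than a Bayesian method.

```python
import math

def surprisal(probs):
    """Negative log probability of the predicted (most likely) class."""
    return -math.log(max(probs))

def entropy(probs):
    """Shannon entropy of the predicted distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def prioritize(softmax_outputs, score=surprisal):
    """Indices of test inputs, most uncertain first."""
    return sorted(range(len(softmax_outputs)),
                  key=lambda i: score(softmax_outputs[i]),
                  reverse=True)

# Hypothetical softmax outputs for three unlabeled test inputs.
outputs = [
    [0.98, 0.01, 0.01],
    [0.40, 0.35, 0.25],
    [0.70, 0.20, 0.10],
]
by_surprisal = prioritize(outputs)
by_entropy = prioritize(outputs, score=entropy)
```

Labeling effort is then spent first on the inputs at the front of the list, where the model is least confident.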

Debug and Repair

  • Training on generated inputs improves correctness
  • Resample to influence faulty neurons
  • tfdbg
    • debugger for TensorFlow models
    • Analyzer, NodeStepper, RunStepper
  • PALM
    • Meta model : partitions training data
    • Sub-models : approximate patterns in partitions
    • Which training data impacts prediction most?

Testing Frameworks

  • CNN testing
  • Security Testing
  • Fairness Testing

ML Testing Components

  • Bugs in Training data
    • Data Linter
      • Miscoded data
      • Outliers
      • Duplicate/missing examples
    • MODE : Resample to influence faulty neurons
  • Test Data
    • Small network that identifies adversarial examples
    • Insufficiency of test data is a data bug
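
The lint categories listed above can be sketched as simple checks over a table of rows. This is not the Data Linter tool itself; the rules, the outlier cutoff based on the median absolute deviation, and the sample rows are assumptions chosen for the example.

```python
import statistics

def is_number(text):
    try:
        float(text)
        return True
    except ValueError:
        return False

def lint_rows(rows):
    """Return lint name -> indices of offending rows (rectangular input)."""
    findings = {"duplicate": [], "missing": [], "miscoded": [], "outlier": []}
    seen = set()
    for i, row in enumerate(rows):
        key = tuple(row)
        if key in seen:
            findings["duplicate"].append(i)
        seen.add(key)
        if any(v is None or v == "" for v in row):
            findings["missing"].append(i)
        # Miscoded: a number stored as a string.
        if any(isinstance(v, str) and is_number(v) for v in row):
            findings["miscoded"].append(i)
    for col in range(len(rows[0])):
        nums = [(i, r[col]) for i, r in enumerate(rows)
                if isinstance(r[col], (int, float)) and not isinstance(r[col], bool)]
        if len(nums) < 3:
            continue
        med = statistics.median(v for _, v in nums)
        mad = statistics.median(abs(v - med) for _, v in nums)
        if mad == 0:
            continue
        for i, v in nums:
            # Modified z-score; 3.5 is a commonly used cutoff.
            if 0.6745 * abs(v - med) / mad > 3.5 and i not in findings["outlier"]:
                findings["outlier"].append(i)
    return findings

rows = [
    [34, "NY"], [36, "CA"], [34, "NY"], [None, "TX"],
    ["35", "WA"], [33, "OR"], [35, "NV"], [420, "FL"],
]
report = lint_rows(rows)
```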

ML Testing Components

  • Bugs in Data
    • Skew Detection
      • KDE
      • Neural Activation trace

Data Bug Detection Framework

  • Data Validation System
    • Integral part of Google's TFX
    • Metric : distance between training data and new data
  • ActiveClean : Iterative data cleaning
  • BoostClean : Domain value violation
  • ...
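
A distance between training data and new data can be sketched for a single categorical feature by comparing value frequencies. The L-infinity distance used here is one possible choice, stated as an assumption; the slides do not say which metric is used, and the country codes are made-up sample data.

```python
from collections import Counter

def distribution(values):
    """Relative frequency of each distinct value."""
    counts = Counter(values)
    return {k: c / len(values) for k, c in counts.items()}

def linf_distance(train_values, new_values):
    """Largest absolute difference in frequency over all values."""
    p, q = distribution(train_values), distribution(new_values)
    return max(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in set(p) | set(q))

train = ["US"] * 6 + ["DE"] * 3 + ["IN"] * 1
serving = ["US"] * 2 + ["DE"] * 3 + ["IN"] * 5
skew = linf_distance(train, serving)

SKEW_THRESHOLD = 0.1  # hypothetical alerting threshold
alert = skew > SKEW_THRESHOLD
```

A single number per feature keeps the alert easy to interpret: it is the largest shift in the share of any one value.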

Data Bug Detection Framework

  • Automatic 'Unit' Testing
    • Common, user-defined quality constraints for data testing
  • AlphaClean
    • Greedy tree search to automatically tune the parameters of data cleaning pipelines
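
Unit tests for data can be sketched as named constraints evaluated against a batch of rows. The constraint helpers and the sample rows below are hypothetical and are not the API of any system named in the slides.

```python
def is_complete(column):
    """Constraint: every row has a non-missing value in `column`."""
    return lambda rows: all(row.get(column) is not None for row in rows)

def is_unique(column):
    """Constraint: no value in `column` appears twice."""
    def check(rows):
        values = [row.get(column) for row in rows]
        return len(values) == len(set(values))
    return check

def in_range(column, low, high):
    """Constraint: every value in `column` lies in [low, high]."""
    return lambda rows: all(
        row.get(column) is not None and low <= row[column] <= high
        for row in rows
    )

def run_data_tests(rows, constraints):
    """Map each constraint name to True (pass) or False (fail)."""
    return {name: check(rows) for name, check in constraints.items()}

rows = [
    {"id": 1, "age": 34},
    {"id": 2, "age": 29},
    {"id": 2, "age": 210},  # duplicate id and implausible age
]
constraints = {
    "id is complete": is_complete("id"),
    "id is unique": is_unique("id"),
    "age in [0, 120]": in_range("age", 0, 120),
}
report = run_data_tests(rows, constraints)
```

Common constraints such as completeness and uniqueness can be shipped as a library, while domain rules such as the age range are supplied by the user.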

Bug Detection in Learning Program

  • TensorFlow API update
    • Most common bug
  • WALA : Static Analysis of tensor behaviour in TensorFlow

Bug Detection in Frameworks

  • Incorrect implementation of algorithms
  • Differential testing across multiple implementations
  • Metamorphic testing
    • normalize/scale input data
  • MutPy
    • Inject mutants to simulate implementation bugs
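
Metamorphic testing of an algorithm implementation can be sketched with a 1-nearest-neighbour classifier: scaling every feature of both training and test data by the same factor should not change any prediction. The classifier and the seeded fault in `mutant_distance` are hypothetical; the mutant is written by hand here rather than generated by a mutation tool.

```python
def nearest_label(train, query, distance):
    """Label of the training point closest to `query`."""
    return min(train, key=lambda item: distance(item[0], query))[1]

def squared_euclidean(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mutant_distance(a, b):
    # Seeded fault: the first feature uses |x - y| instead of (x - y)^2.
    return abs(a[0] - b[0]) + sum((x - y) ** 2 for x, y in zip(a[1:], b[1:]))

def scale(point, factor):
    return [factor * v for v in point]

def scaling_relation_holds(train, queries, distance, factor=10.0):
    """Check that uniform scaling leaves every prediction unchanged."""
    scaled_train = [(scale(p, factor), label) for p, label in train]
    return all(
        nearest_label(train, q, distance)
        == nearest_label(scaled_train, scale(q, factor), distance)
        for q in queries
    )

train = [([0.0, 0.6], "a"), ([0.5, 0.0], "b")]
queries = [[0.5, 0.6]]
correct_ok = scaling_relation_holds(train, queries, squared_euclidean)
mutant_ok = scaling_relation_holds(train, queries, mutant_distance)
```

The mutant mixes a linear and a quadratic term, so its ranking of neighbours depends on the scale of the data and the relation exposes it.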

Autonomous Driving

  • Disengagement
    • 64% - bugs in ML system
      • 44% Poor Image classification performance
      • 20% Control and Decision system bug
  • DeepRoad
    • realistic images for rainy and snowy conditions
  • DeepBillboard
    • Trigger steering errors
    • Continuous and Realistic physical world tests

Machine Translation

  • Metamorphic Relations
    • Some changes should not affect the overall structure of translated output
  • Over Translation : duplicates
  • Under translation : Missing phrases

NLI

  • Generate Sentence Mutants
    • Rule-based adversaries
  • Semantic Understanding Test
  • Mutation by swapping the two sentences of a pair
    • Accuracy should stay comparable for contradiction pairs
    • Accuracy should decrease for entailment pairs

Challenges & Opportunities
