Stockwell
Reduce the impact of intermittent tests
Teammates
  • Geoff Brown (:gbrown)
  • Joel Maher (:jmaher)
  • William Lachance (:wlach)
What is the impact?
  • Sheriff time to star failures
    
  • Developer distractions
  • More tooling needed
  • Increased load on limited resources
  • How do intermittents impact your job?
August 29th 2014: OrangeFactor 2.49
August 29th 2016: OrangeFactor 26.98

 

What happened?
What has changed?
  • Hired sheriffs
  • More platforms
  • More configurations (e10s, ASan)
  • More tests and test suites
  • Many changes in Firefox
What do we run?
  • 19 build/config types
  • 1.05M possible tests/push
    
  • 490K tests run/push on average
  • 11 failures / push (OF=11.0)
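OrangeFactor (OF), as used on these slides, is the average number of intermittent failures per push: 11 failures per push gives OF=11.0. A minimal sketch of that arithmetic, assuming you already have total failure and push counts for the same period; the helper name `orange_factor` and the sample numbers are hypothetical, not OrangeFactor's actual code.

```python
def orange_factor(failures: int, pushes: int) -> float:
    """Average number of intermittent failures per push.

    Hypothetical helper: `failures` is the count of classified intermittent
    failures and `pushes` the number of pushes over the same period.
    """
    if pushes <= 0:
        raise ValueError("pushes must be positive")
    return failures / pushes

# Hypothetical sample: 44 failures across 4 pushes.
print(orange_factor(44, 4))  # 11.0
```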
How many intermittents?
  • Between 700 and 950 bugs / week
  • For 6 months (April to September):
    • 7332 bugs occurred, totaling 249279 failures
    • 3310 bugs occurred fewer than 10 times
    • 6018 (82%) low-frequency bugs = 14% of failures
    • 560 (7%) high-frequency bugs = 68% of failures
What is intermittent?
  • High frequency >=50 times/week
  • Medium frequency 10<x<50 times/week
  • Low frequency <=10 times/week
  • What is your definition of intermittent?
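The three buckets above have exact edges: 10 occurrences/week is still low, and 50 is already high. A minimal sketch of that bucketing, assuming the input is a bug's occurrence count for one week; the `Frequency` enum and `classify` function are hypothetical names for illustration.

```python
from enum import Enum

class Frequency(Enum):
    HIGH = "high"
    MEDIUM = "medium"
    LOW = "low"

def classify(occurrences_per_week: int) -> Frequency:
    """Bucket a bug by weekly occurrences using the slide's thresholds."""
    if occurrences_per_week >= 50:
        return Frequency.HIGH
    if occurrences_per_week > 10:
        return Frequency.MEDIUM  # 10 < x < 50
    return Frequency.LOW         # x <= 10

for count in (3, 10, 11, 49, 50, 120):
    print(count, classify(count).value)
```

Running the loop prints low for 3 and 10, medium for 11 and 49, and high for 50 and 120, which exercises both boundaries.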
    
What fails?
  • test timeouts
  • test failures
  • harness/task timeouts
    
  • Firefox crash/leak/assertion/hang
  • harness/infrastructure
Bad tests?
  • Majority of fixes are test fixes
  • 178 mochitests fail when run with --repeat
    
  • Many uses of setTimeout()
    
  • Poor use of APIs
  • Old tests written for older versions of Firefox
Do we care?
  • Talked to dozens of engineers
  • Everyone wants to help 
    
  • Not all intermittents have a clear owner
    
  • Engineers have deliverables
  • Engineers don't want to waste time
    
  • What prevents you from fixing intermittent tests?
Experiments in Q4
  • quarantine jobs
  • test-lint jobs 
    
  • manual triage
    
  • OrangeFactor enhancements
    
Quarantine jobs
  • Always orange, long run times
    
  • Difficult to hack manifests
    
  • Leaks, crashes, etc. would still show up in other jobs
    
  • Quarantine jobs would be ignored; their value is unclear
    
Test Lint
  • Run extra tests on new/edited test cases
    
  • Did this for mochitests: 178 failures
    
  • Improves trust in tests
    
  • Will deploy in Q1 for mochitest
    
  • What causes you to not trust tests?
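One way to "run extra tests" on a new or edited test case is to repeat it many times and reject it if any run fails, similar in spirit to the mochitest --repeat runs mentioned earlier. A minimal sketch, assuming `run_once` is a hypothetical stand-in for invoking the real harness on a single test and returning True on a pass; the default of 20 repetitions is an arbitrary choice, not the value the test-lint jobs use.

```python
import itertools
from collections.abc import Callable

def lint_test(run_once: Callable[[], bool], repeat: int = 20) -> tuple[bool, int]:
    """Run one test `repeat` times; it passes lint only if every run passes.

    Returns (passed, number_of_failed_runs).
    """
    failures = sum(1 for _ in range(repeat) if not run_once())
    return failures == 0, failures

# Hypothetical flaky test: fails on every 7th call.
calls = itertools.count(1)
flaky = lambda: next(calls) % 7 != 0
print(lint_test(flaky, repeat=20))  # (False, 2)
```

A single failure is enough to reject the test, which matches the goal of improving trust: a test that fails 1 time in 20 on its own is already a future intermittent.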
Manual Triage
  • In 2 weeks dropped OF from 23 to 11
    
  • Found many patterns across bugs
    
  • Added info to make bugs actionable
    
  • Will continue to do this in Q1
    
OrangeFactor++
  • Bugzilla comments improved
    • relative frequency
    • ranking and priority
  • Updated dev.tree-alerts to highlight the number of high/mid/low-frequency bugs
    
The Master Plan
  • Accept the fact that intermittents are here to stay
    
  • Develop a positive relationship with intermittent failures
    
  • Intermittent test failures are not visible on Treeherder
  • On January 4, 2018, what would you expect to see?
Q1 Plan
  • P1 intermittents >=30 times/week
    
  • Make triaging easier
    
    • doing it full time
    • finding test owners
    • component filters on OrangeFactor
  • Increase confidence in tests/bugs
    • test-lint jobs
    • more data in bugs
  • More Experiments
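The Q1 cut-off for P1 intermittents (>=30 times/week) can be combined with ranking by frequency to produce a triage queue. A minimal sketch, assuming weekly occurrence counts keyed by bug number are already available; the function `triage_queue`, the "later" label for bugs under the threshold, and the bug numbers in the example are all hypothetical.

```python
P1_THRESHOLD = 30  # occurrences/week, from the Q1 plan

def triage_queue(weekly_counts: dict[int, int]) -> list[tuple[int, int, str]]:
    """Rank bugs by weekly occurrences, most frequent first.

    Bugs at or above the threshold are tagged "P1"; the rest get the
    placeholder tag "later" (not a real Bugzilla priority).
    """
    ranked = sorted(weekly_counts.items(), key=lambda kv: kv[1], reverse=True)
    return [(bug, count, "P1" if count >= P1_THRESHOLD else "later")
            for bug, count in ranked]

# Hypothetical bug numbers and counts.
print(triage_queue({1001: 12, 1002: 45, 1003: 30}))
# [(1002, 45, 'P1'), (1003, 30, 'P1'), (1001, 12, 'later')]
```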
More Experiments?
Don't you have enough data?


What experiments should we be doing?
Q1+ - More Experiments
Dashboards - more data for you
  • Triage bugs by component in OF
  • Disabled bugs in your component
  • New bugs in your component
Q1+ - More Experiments
Triage++
  • Identify common actionable data
  • List of data to include in new bugs
  • Create tools for getting common data
    
  • Identify spikes in occurrences faster
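Spotting a spike early means noticing when a bug's daily occurrences jump well above its recent baseline. A minimal sketch using a simple trailing mean plus k standard deviations, which is an assumed technique, not what OrangeFactor actually does; `find_spikes`, the 7-day window, and k=3 are illustrative choices.

```python
from statistics import mean, pstdev

def find_spikes(daily_counts: list[int], window: int = 7, k: float = 3.0) -> list[int]:
    """Return indices of days whose count exceeds the trailing baseline.

    The baseline is the previous `window` days; a day is a spike when it
    exceeds mean + k * stddev. A floor of 1.0 on the stddev keeps a
    perfectly flat baseline from flagging tiny bumps.
    """
    spikes = []
    for i in range(window, len(daily_counts)):
        baseline = daily_counts[i - window:i]
        threshold = mean(baseline) + k * max(pstdev(baseline), 1.0)
        if daily_counts[i] > threshold:
            spikes.append(i)
    return spikes

# Hypothetical daily occurrences for one bug.
counts = [2, 3, 2, 4, 3, 2, 3, 3, 15, 4]
print(find_spikes(counts))  # [8]
```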
Q2+ - More Experiments
Reduce Noise / Better Tests
  • Improve auto classification
  • Consider ignoring low frequency failures
  • Look at rr chaos mode for the lint jobs
    
  • Best practices for writing and reviewing tests
Our Expectations
  • Assume good intent and common goals
  • Actionable bug == fix it!
  • Disabling tests can be a good thing
    
Q&A
Goal: Reduce the impact of intermittents
  • What is your definition of intermittent?
    
  • How do intermittents impact your job?
    
  • What prevents you from fixing intermittent tests?
  • What causes you to not trust tests?
  • On January 4, 2018, what would you expect to see?
  • What experiments should we be doing?

Stockwell

By Joel Maher