Testing for data scientists
- Data scientist at Dato
- Sports analytics fan & consultant
- Blog at thespread.us
- Why testing for data scientists?
- Kinds of tests
- Quick review of unit testing
- Unique problems for data scientists
- Promising Python packages
Life of a data scientist
Work looks like this
Data is messy...
... so is your code
Assumptions at every step.
Software development skills for data scientists
- Writing modular, reusable code
- Version control
What is testing?
"Getting Started Testing"
When to test?
- Model -- the Wild West
Testing helps you:
- Find bugs
- Check your assumptions
(by making them explicit)
- Write simpler code
- Work with others
- Be surprised less
Most relevant kinds of tests
(for data scientists)
- Unit tests
- Regression (why?!?) tests
- Integration tests
- Test one "unit" of code
- No dependencies on
other code you've written
- Don't require access
to databases, APIs, etc.
The standard Python unit testing landscape
- unittest, unittest2
A simple function...
... what could go wrong?
def mean(values):
    return sum(values) / len(values)
A simple test
def test_mean():
    assert mean([1, 2, 3, 4, 5]) == 3
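A few of the things that could go wrong with the one-line mean function can be pinned down as tests. This is a minimal sketch assuming Python 3; the test names are illustrative and not from the talk.

```python
def mean(values):
    return sum(values) / len(values)


def test_mean_typical():
    assert mean([1, 2, 3, 4, 5]) == 3


def test_mean_empty_raises():
    # An empty input has no mean; the naive version fails with ZeroDivisionError.
    try:
        mean([])
    except ZeroDivisionError:
        pass
    else:
        raise AssertionError("expected ZeroDivisionError for empty input")


def test_mean_rejects_strings():
    # sum() starts from the int 0, so adding a str raises TypeError.
    try:
        mean(["1", "2"])
    except TypeError:
        pass
    else:
        raise AssertionError("expected TypeError for string input")
```

Under Python 2, `/` on two ints floors the result, so `mean([1, 2])` would silently return 1 instead of 1.5. That is another assumption worth making explicit in a test.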
- Less boilerplate
- Fewer classes
- Gets you testing quickly
- Easy to interpret errors
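For contrast, here is roughly what the same check looks like with the standard library's `unittest`, next to the plain-function style that runners such as pytest collect. This is only a sketch; the class name `TestMean` is made up for illustration.

```python
import unittest


def mean(values):
    return sum(values) / len(values)


# unittest style: a TestCase subclass and assert* methods.
class TestMean(unittest.TestCase):
    def test_mean(self):
        self.assertEqual(mean([1, 2, 3, 4, 5]), 3)


# Plain-function style: a bare assert, no class needed.
def test_mean():
    assert mean([1, 2, 3, 4, 5]) == 3
```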
When & what to test?
- When you change code, add a test.
- Test the outcome, not the implementation
- When you find a bug, add a test.
- Help identify complexity
- Don't test code that's already tested!
Write failing tests first,
fix code until tests pass.
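The test-first loop can be sketched as follows. The bug report here (missing values encoded as `None` crash `mean`) and the name `mean_ignoring_missing` are hypothetical, chosen only to illustrate the workflow.

```python
def mean(values):
    return sum(values) / len(values)


# Step 1: write a test that captures the desired behavior.
def check_ignores_missing(mean_fn):
    assert mean_fn([1, None, 3]) == 2


# Step 2: confirm the test fails against the current code.
try:
    check_ignores_missing(mean)
    failed_first = False
except TypeError:
    # sum() cannot add None to a number, so the old code blows up.
    failed_first = True


# Step 3: fix the code until the test passes.
def mean_ignoring_missing(values):
    present = [v for v in values if v is not None]
    return sum(present) / len(present)


check_ignores_missing(mean_ignoring_missing)
```

Seeing the test fail first is what shows that it actually exercises the bug.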
Testing for data science can be a little different...
... deterministic answers may not exist
Laziness (not the good kind)
- Extract data once, build many models
- Data is representative of the future
- Using specific samples to spot-check
Better ways to test
Test properties, not specific values
Make assumptions about data shape & type
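When no single correct answer exists, checks can target properties of the output and assumptions about the input. The sketch below uses a toy scoring function, `predict_proba`, and a hypothetical row schema; both are stand-ins for illustration, not anything from a real library.

```python
import math

# Hypothetical schema: column name -> expected Python type.
SCHEMA = {"age": int, "income": float}


def check_rows(rows):
    """Fail fast if any row has the wrong columns or types."""
    for row in rows:
        assert set(row) == set(SCHEMA), f"unexpected columns: {sorted(row)}"
        for col, expected in SCHEMA.items():
            assert isinstance(row[col], expected), f"{col} is not {expected.__name__}"
    return rows


def predict_proba(rows):
    """Toy model: logistic function of a made-up linear score."""
    return [
        1 / (1 + math.exp(-(0.03 * r["age"] + 0.00001 * r["income"] - 2)))
        for r in rows
    ]


def test_prediction_properties():
    rows = check_rows([
        {"age": 25, "income": 30000.0},
        {"age": 40, "income": 30000.0},
        {"age": 60, "income": 30000.0},
    ])
    probs = predict_proba(rows)
    assert len(probs) == len(rows)          # one prediction per row
    assert all(0 <= p <= 1 for p in probs)  # valid probabilities
    assert probs == sorted(probs)           # in this toy model, score rises with age
```

None of these assertions names a specific predicted value, so they keep holding when the model is retrained.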
For "defensive" data analysis
"The raison d’être for engarde is the fact of life that data are messy."
Great for ETL on changing data
- Built with pandas
- Very lightweight
- Use for functions that
accept & return a DataFrame
- Just add decorators!
(or use DataFrame.pipe)
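The decorator pattern can be sketched in plain Python. This is not engarde's implementation or API; `returns_checked` and `none_missing` are hypothetical helpers written for this example, and a list of dicts stands in for a DataFrame so the example has no dependencies.

```python
import functools


def returns_checked(check):
    """Decorator factory: run `check` on the wrapped function's return value."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            result = func(*args, **kwargs)
            check(result)  # raises AssertionError if the assumption is violated
            return result
        return wrapper
    return decorator


def none_missing(rows):
    """Assumption: no value in any row is None."""
    for i, row in enumerate(rows):
        missing = [k for k, v in row.items() if v is None]
        assert not missing, f"row {i} has missing values in {missing}"


@returns_checked(none_missing)
def drop_incomplete(rows):
    """ETL step that should leave no missing values behind."""
    return [r for r in rows if all(v is not None for v in r.values())]


@returns_checked(none_missing)
def passthrough(rows):
    """A buggy step that forgets to clean its input."""
    return rows
```

Calling `passthrough` on data with a missing value fails loudly at the step that broke the assumption, instead of several steps later.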
Hypothesis (by David R. MacIver)
Property-based testing inspired
by Haskell's QuickCheck
How it works
Generate data randomly
according to some specs
(and be slightly diabolical about it)
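The generate-and-check idea can be illustrated with the standard library alone. This is a hand-rolled sketch, not how any property-based testing library is implemented; the generator `diabolical_int_lists` and its mix of edge cases are assumptions made for the example.

```python
import random


def mean(values):
    return sum(values) / len(values)


def diabolical_int_lists(n_cases, seed=0):
    """Yield non-empty lists of ints, starting with hand-picked nasty cases."""
    rng = random.Random(seed)
    # Edge cases first: single element, all equal, extremes, mixed signs.
    yield [0]
    yield [7, 7, 7]
    yield [-10**6, 10**6]
    yield [-1, 0, 1]
    # Then random lists drawn according to the "spec" (size and value range).
    for _ in range(n_cases):
        size = rng.randint(1, 50)
        yield [rng.randint(-10**6, 10**6) for _ in range(size)]


def test_mean_is_bounded():
    # Property: the mean always lies between the smallest and largest value.
    for values in diabolical_int_lists(n_cases=200):
        assert min(values) <= mean(values) <= max(values), values
```

Integers are used on purpose. With floats, rounding error can push the computed mean slightly outside that range, which is the kind of corner case a property-based tool tends to surface.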
- Very flexible
- Plugin support
- Finds corner cases fast
- Ideal for code that will be
accepting input "from the wild"
- Works with existing testing frameworks
- Works with Faker
- Has a datetime plugin
- Tests many Python WTFs
- Experimental NumPy support
Declare feature schemas & test them
- Designed with ML and sklearn in mind
- Can build feature creation & testing pipelines
- Supports experiments for testing a variety of
features and evaluating many models,
storing results to a database
Model testing is the Wild West
engarde: is_monotonic(), within_n_std(), within_set()
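To make those three checks concrete, here are hand-written stand-ins that work on plain lists. They borrow the names only; the signatures and behavior below are assumptions for illustration and should not be read as engarde's code.

```python
import statistics


def is_monotonic(values, increasing=True):
    """True if values never decrease (or never increase)."""
    pairs = list(zip(values, values[1:]))
    if increasing:
        return all(a <= b for a, b in pairs)
    return all(a >= b for a, b in pairs)


def within_n_std(values, n=3):
    """True if every value is within n population std devs of the mean."""
    center = statistics.mean(values)
    spread = statistics.pstdev(values)
    return all(abs(v - center) <= n * spread for v in values)


def within_set(values, allowed):
    """True if every value belongs to the allowed set."""
    return all(v in allowed for v in values)


def test_model_outputs():
    # Hypothetical model outputs, made up for the example.
    predicted_risk_by_dose = [0.10, 0.15, 0.15, 0.30]
    predicted_labels = ["cat", "dog", "cat"]
    residuals = [10, 11, 9, 10, 10]
    assert is_monotonic(predicted_risk_by_dose)
    assert within_set(predicted_labels, {"cat", "dog", "bird"})
    assert within_n_std(residuals, n=3)
```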
Don't test algorithms you haven't personally implemented
scikit-learn, SciPy, NumPy have excellent test suites
Testing algorithms you have implemented
pandas, SciPy, NumPy have excellent testing methods
Numerical computing is tricky.
Try to use existing tools as much as possible.
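One reason numerical code is tricky is that exact equality rarely holds for floats, so tests usually compare within a tolerance. Below is a minimal sketch with the standard library's `math.isclose`; NumPy's `numpy.testing.assert_allclose` plays a similar role for arrays. The `variance` function and the tolerance value are arbitrary choices for illustration.

```python
import math


def variance(values):
    """Population variance, written by hand for illustration."""
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values) / len(values)


def test_float_equality_is_fragile():
    # Classic example: binary floats cannot represent 0.1 or 0.2 exactly.
    assert 0.1 + 0.2 != 0.3
    assert math.isclose(0.1 + 0.2, 0.3, rel_tol=1e-9)


def test_variance_against_known_value():
    # The variance of 1..5 is 2; compare within a tolerance, not with ==.
    assert math.isclose(variance([1, 2, 3, 4, 5]), 2.0, rel_tol=1e-9)
```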
What I didn't cover
- Testing MCMC code
- Follow @tdhopper
- Continuous integration
- See @digitallogic's answer here: