Probabilistic Programming
A Brief Introduction to
Probabilistic Programming and Python
EuroSciPy - University of Cambridge August 2015
All opinions my own
Who am I?
I work as a Data Scientist for a large Telecommunications Company
- Masters in Mathematics
- Interned at Amazon
- Was a consultant for a while
- Occasional contributor to Pandas and other projects
- Co-organizer of the Data Science Meetup in Luxembourg
- Member of Royal Statistical Society and NumFOCUS
- @springcoil
What is Probabilistic Programming
- Basically using random variables instead of variables
- Allows you to create a generative story rather than a black box
- A different tool to Machine Learning
- A different paradigm to frequentist statistics
- Forces you to be explicit about your 'subjective' assumptions
- Source: Olivier Grisel
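A minimal sketch of "random variables instead of variables", using plain NumPy rather than a probabilistic programming library. The success rate, the Beta(2, 18) prior and the 1000 trials are all hypothetical numbers chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)

# Ordinary programming: the rate is one fixed number.
fixed_rate = 0.1
fixed_successes = 1000 * fixed_rate

# Probabilistic view: the rate is a random variable with a prior
# (hypothetical Beta(2, 18), which has mean 0.1).
n_draws = 10_000
rate = rng.beta(2, 18, size=n_draws)

# Generative story: draw a rate, then draw the number of successes given it.
successes = rng.binomial(1000, rate)

low, high = np.percentile(successes, [2.5, 97.5])
print(f"fixed answer: {fixed_successes:.0f}")
print(f"simulated 95% interval: {low:.0f} to {high:.0f}")
```

The fixed version gives a single number; the generative version gives a whole distribution, so the uncertainty in the assumptions is visible in the output.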
Bayesian Statistics
- I studied Mathematics and encountered Bayesian methods in textbooks
- This is a hard area to do with pen and paper, and most of the integrals can't be solved in closed form
- Thankfully, Monte Carlo simulation was invented
- These simulations are used to approximate your posterior distribution
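A minimal sketch of simulation standing in for an integral: a hand-written Metropolis sampler for the bias of a coin, checked against the one case where the exact answer is known. The data (14 heads in 20 flips), the flat prior and the proposal width are assumptions made for illustration; this is not how PyMC3 samples internally.

```python
import numpy as np

rng = np.random.default_rng(0)

heads, flips = 14, 20  # hypothetical data


def log_posterior(p):
    """Unnormalised log posterior: flat prior on (0, 1) times binomial likelihood."""
    if p <= 0.0 or p >= 1.0:
        return -np.inf
    return heads * np.log(p) + (flips - heads) * np.log(1.0 - p)


samples = []
current = 0.5
for _ in range(20_000):
    # Propose a small random step away from the current value.
    proposal = current + rng.normal(0.0, 0.1)
    # Accept with probability min(1, posterior ratio), computed on the log scale.
    if np.log(rng.uniform()) < log_posterior(proposal) - log_posterior(current):
        current = proposal
    samples.append(current)

draws = np.array(samples[2_000:])  # discard the first draws as burn-in
mc_mean = draws.mean()
exact_mean = (heads + 1) / (flips + 2)  # mean of the exact Beta(15, 7) posterior

print(f"Monte Carlo posterior mean: {mc_mean:.3f}")
print(f"Exact posterior mean:       {exact_mean:.3f}")
```

Here the exact posterior is available, so the sampler can be checked. For realistic models it is not, and the samples are all you have.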
Some terminology
Attribution: Quantopian blog
How do you pick your prior?
- This is a bit of an art
- You generally base the prior on experience
- As you add more data this matters less and less
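A small worked example of the last point, using the Beta-Binomial model, where the posterior mean has a closed form. The two priors and the 60% success rate are hypothetical.

```python
def posterior_mean(prior_a, prior_b, successes, trials):
    """Posterior mean of a Beta(prior_a, prior_b) prior after binomial data."""
    return (prior_a + successes) / (prior_a + prior_b + trials)


# Two analysts with very different (hypothetical) priors.
priors = {"sceptical": (2, 8), "optimistic": (8, 2)}

gaps = []
for trials in (10, 100, 1000, 10000):
    successes = 6 * trials // 10  # data with a 60% success rate
    means = {
        name: posterior_mean(a, b, successes, trials)
        for name, (a, b) in priors.items()
    }
    gap = means["optimistic"] - means["sceptical"]
    gaps.append(gap)
    print(f"n={trials:>5}: sceptical={means['sceptical']:.3f} "
          f"optimistic={means['optimistic']:.3f} gap={gap:.4f}")
```

The gap between the two posterior means is 6 / (10 + n), so it shrinks towards zero as the number of observations n grows.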
Huh, but isn't Probabilistic Programming just Stan and BUGS?
No, in Python you have PyMC3
- A complete rewrite of PyMC2 now in 'Beta' status
- Based upon Theano
- Computational techniques for handling gradients
- Automatic Differentiation and GPU speedup
- Theano is also used in deep learning!
- Currently there is a project to port 'BMH' (Bayesian Methods for Hackers) from PyMC2 to PyMC3
- I gave a thorough tutorial on this - my github
- Key authors: John Salvatier, Thomas Wiecki, Chris Fonnesbeck
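To give a feel for what "automatic differentiation" means, here is a toy forward-mode version using dual numbers in pure Python. This is only an illustration of the idea and is not how Theano implements it; the `Dual` class, the `log` helper and the coin-flip log posterior are all hypothetical names and data.

```python
import math


class Dual:
    """A value carried together with its derivative."""

    def __init__(self, value, deriv=0.0):
        self.value = value
        self.deriv = deriv

    def __add__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        return Dual(self.value + other.value, self.deriv + other.deriv)

    __radd__ = __add__

    def __mul__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        # Product rule: (uv)' = u'v + uv'
        return Dual(self.value * other.value,
                    self.deriv * other.value + self.value * other.deriv)

    __rmul__ = __mul__


def log(x):
    """Natural log of a Dual, using the chain rule: (log u)' = u' / u."""
    return Dual(math.log(x.value), x.deriv / x.value)


def log_posterior(p, heads=14, tails=6):
    """Coin-flip log posterior with a flat prior (hypothetical data)."""
    return heads * log(p) + tails * log(1 + (-1) * p)


# Seed the derivative with 1.0 to differentiate with respect to p.
result = log_posterior(Dual(0.5, 1.0))
print(result.value, result.deriv)  # gradient is 14/0.5 - 6/0.5 = 16
```

Gradients like this are what let gradient-based samplers move efficiently through the posterior without anyone deriving derivatives by hand.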
Case study: Rugby Analytics
I wanted to do a model of the Six Nations last year.
I wanted to build an understandable model to predict the winner
Key Info: Inferring the 'strength' of each team.
We only have scoring data, which is noisy hence Bayesian Stats
What did I do?
1. I picked Gamma as a prior for all teams
2. I used a Hierarchical Model because I wanted home advantage to be stronger for stronger teams based
3. From this I was able to create a novel model based only on historical results and scoring intensity
4. I sampled from the posterior distribution using MCMC
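The generative story behind a model like this can be sketched as a forward simulation in NumPy. This is not the model from the talk: the log-linear Poisson form, the normal team effects, the single shared home advantage and every number below are assumptions made for illustration, and no real match data is used. In the real analysis the team strengths are unknown and are inferred from historical scores.

```python
import numpy as np

rng = np.random.default_rng(2015)

teams = ["England", "France", "Ireland", "Italy", "Scotland", "Wales"]
n_teams = len(teams)

# Hypothetical hyperparameters, on the log scale.
intercept = 3.0        # exp(3) is roughly 20 points per side
home_advantage = 0.2   # one shared value here, purely for simplicity
sigma_attack, sigma_defence = 0.3, 0.3

# Hierarchical part: every team's strength is drawn from a shared distribution.
attack = rng.normal(0.0, sigma_attack, n_teams)
defence = rng.normal(0.0, sigma_defence, n_teams)


def simulate_match(home, away):
    """Draw Poisson scores for one match from the two teams' strengths."""
    home_rate = np.exp(intercept + home_advantage + attack[home] - defence[away])
    away_rate = np.exp(intercept + attack[away] - defence[home])
    return rng.poisson(home_rate), rng.poisson(away_rate)


def simulate_tournament():
    """Each pair meets once; return the index of the winning team."""
    wins = np.zeros(n_teams, dtype=int)
    points_diff = np.zeros(n_teams, dtype=int)
    for home in range(n_teams):
        for away in range(home + 1, n_teams):  # lower index hosts (a simplification)
            home_score, away_score = simulate_match(home, away)
            points_diff[home] += home_score - away_score
            points_diff[away] += away_score - home_score
            if home_score > away_score:
                wins[home] += 1
            elif away_score > home_score:
                wins[away] += 1
    # Rank by wins, then by points difference.
    return max(range(n_teams), key=lambda t: (wins[t], points_diff[t]))


n_sims = 1000
winner_counts = np.zeros(n_teams, dtype=int)
for _ in range(n_sims):
    winner_counts[simulate_tournament()] += 1

for team, count in zip(teams, winner_counts):
    print(f"{team:<9} wins {count / n_sims:.1%} of simulated tournaments")
```

Running the story forwards like this is also a handy sanity check: if the simulated scores look nothing like rugby scores, the assumptions need another look.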
Run the model
What actually happened
- The model incorrectly predicted that England would come out on top.
- Ireland actually won on points difference, by 6 points.
- It really came down to the wire!
- "Prediction is difficult, especially about the future"
- One of the problems is what we call 'over-shrinkage'. You can delve into the results to see what the errors are.
- My model was within the errors.
- Hat tip: Thanks to Abraham Flaxman and the PyMC3 community for helping me port this from PyMC2 to PyMC3
Lessons learned
- I can build an explainable model using PyMC2 and PyMC3
- Generative stories help you build up interest with your colleagues
- Communication is the 'last mile' problem of Data Science
- PyMC3 is cool: please use it and please contribute
Wanna learn more?
BMH
PyMC3
peadarcoyle@googlemail.com
ProbabilisticProgramming
By springcoil
A discussion of Probabilistic programming