Probabilistic Programming

A Brief introduction to
Probabilistic Programming and Python

EuroSciPy - University of Cambridge August 2015

All opinions my own

Who am I?

I work as a Data Scientist for a large Telecommunications Company

• Masters in Mathematics
• Interned at Amazon
• Was a consultant for a while
• Occasional contributor to Pandas and other projects
• Co-organizer of the Data Science Meetup in Luxembourg
• Member of Royal Statistical Society and NumFOCUS
• @springcoil

What is Probabilistic Programming

• Basically using random variables instead of variables
• Allows you to create a generative story rather than a black box
• A different tool to Machine Learning
• A different paradigm to frequentist statistics

Bayesian Statistics

• I studied Mathematics, and encountered in textbooks Bayesians
• This is a hard area to do by pen and paper, and most integrals can't be solved in exact form
• Thankfully there was an invention of Monte Carlo Simulations
• These simulations are used to approximate your likelihood function

How do you pick your prior?

• This is a bit of an art
• You generally base the prior on experience
• As you add more data this matters less and less

No in Python you have PyMC3

• A complete rewrite of PyMC2 now in 'Beta' status
• Based upon Theano
•  Computational techniques for handling gradients
• Automatic Differentiation and GPU speedup
• Theano - is also used in deep learning!
• Currently there is a project to port 'BMH' from PyMC2 to PyMC3
• I gave a thorough tutorial on this - my github
• Key authors: John Salvatier, Thomas Wiecki, Chris Fonnesbeck

Case study: Rugby Analytics

I wanted to do a model of the Six Nations last year.

I wanted to build an understandable model to predict the winner

Key Info: Inferring the 'strength' of each team.

We only have scoring data, which is noisy hence Bayesian Stats

What did I do?

1. I picked Gamma as a prior for all teams

2. I used a Hierarchical Model because I wanted home advantage to be stronger for stronger teams based

3. From this I was able to create a novel model based only on historical results and scoring intensity

4. I simulated the likelihood function using MCMC

What actually happened

• The model incorrectly predicted that England would come out on top.
• Ireland actually won by points difference of 6 points.
• It really came down to the wire!
• "Prediction is difficult especially about the future"
• One of the problems is what we call 'over-shrinkage' and you can delve into the results to see what the errors are, my model was within the errors.
• Hat tip: Thanks to Abraham Flaxman and the PyMC3 on helping me port this from PyMC2 to PyMC3

Lessons learned

• I can build an explainable model using PyMC2 and PyMC3

• Communication is the 'last mile' problem of Data Science

By springcoil

ProbabilisticProgramming

A discussion of Probabilistic programming

• 2,030