Probabilistic Programming

A Brief Introduction to Probabilistic Programming and Python

EuroSciPy - University of Cambridge August 2015

All opinions my own

Who am I?

 I work as a Data Scientist for a large Telecommunications Company

  • Masters in Mathematics
  • Interned at Amazon
  • Was a consultant for a while
  • Occasional contributor to Pandas and other projects
  • Co-organizer of the Data Science Meetup in Luxembourg
  • Member of Royal Statistical Society and NumFOCUS
  • @springcoil

What is Probabilistic Programming?

  • Basically using random variables instead of variables
  • Allows you to create a generative story rather than a black box
  • A different tool to Machine Learning
  • A different paradigm to frequentist statistics
  • Forces you to be explicit about your 'subjective' assumptions
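
A minimal sketch of "random variables instead of variables", using only the Python standard library. The coin-flip story, the Beta(2, 2) prior and all names here are illustrative assumptions, not something from the talk.

```python
import random

random.seed(0)

def generative_story(n_flips):
    """Sample one possible world: first the unknown bias, then the data it generates."""
    # The bias is a random variable drawn from an explicit (assumed) prior,
    # not a fixed number hard-coded into the program.
    bias = random.betavariate(2, 2)
    heads = sum(random.random() < bias for _ in range(n_flips))
    return bias, heads

# Running the story many times gives a distribution over outcomes, not one answer.
worlds = [generative_story(10) for _ in range(5000)]
mean_heads = sum(heads for _, heads in worlds) / len(worlds)
print(f"average number of heads in 10 flips under the prior: {mean_heads:.2f}")
```

The prior is written down in the code, which is the sense in which the approach forces you to be explicit about your assumptions.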

Bayesian Statistics

  • I studied Mathematics and encountered Bayesian methods in textbooks
  • This is a hard area to do with pen and paper, and most of the integrals can't be solved in closed form
  • Thankfully, Monte Carlo simulation was invented
  • These simulations are used to approximate your posterior distribution
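
To make the Monte Carlo idea concrete, here is a small random-walk Metropolis sampler in plain Python. It is a sketch under assumptions: the data (7 heads in 10 flips), the uniform prior and the step size are invented for illustration. This toy problem happens to have an exact answer, a Beta(8, 4) posterior with mean 8/12, so the simulation can be checked against it.

```python
import math
import random

def log_posterior(p, heads, flips):
    """Unnormalised log posterior: uniform prior on (0, 1) times binomial likelihood."""
    if not 0.0 < p < 1.0:
        return -math.inf
    return heads * math.log(p) + (flips - heads) * math.log(1.0 - p)

def metropolis(heads, flips, n_samples=20000, step=0.1, seed=1):
    """Random-walk Metropolis: propose a nearby value, accept it with the posterior ratio."""
    rng = random.Random(seed)
    p = 0.5
    current = log_posterior(p, heads, flips)
    samples = []
    for _ in range(n_samples):
        proposal = p + rng.gauss(0.0, step)
        candidate = log_posterior(proposal, heads, flips)
        # Accept with probability min(1, posterior(proposal) / posterior(p)).
        if rng.random() < math.exp(min(0.0, candidate - current)):
            p, current = proposal, candidate
        samples.append(p)
    return samples[n_samples // 4:]  # drop the first quarter as burn-in

samples = metropolis(heads=7, flips=10)
estimate = sum(samples) / len(samples)
print(f"simulated posterior mean: {estimate:.3f}, exact: {8 / 12:.3f}")
```

No integral is solved anywhere: the sampler only ever evaluates the unnormalised posterior, which is why the approach still works when the integrals have no closed form.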

Some terminology

Attribution: Quantopian blog

How do you pick your prior?

  • This is a bit of an art
  • You generally base the prior on experience 
  • As you add more data this matters less and less
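
A quick way to see the last bullet is a Beta-Binomial update, where the posterior mean has a closed form. The two priors and the 30% success rate below are made-up numbers chosen only for illustration.

```python
def posterior_mean(prior_a, prior_b, successes, trials):
    """Mean of the Beta posterior after a Binomial observation (conjugate update)."""
    return (prior_a + successes) / (prior_a + prior_b + trials)

# Two analysts with very different (hypothetical) priors see the same data.
flat_prior = (1, 1)           # Beta(1, 1): no strong opinion
opinionated_prior = (20, 5)   # Beta(20, 5): expects a rate near 0.8

gaps = []
for trials in (10, 100, 10000):
    successes = trials * 3 // 10  # observed success rate of 30%
    flat = posterior_mean(*flat_prior, successes, trials)
    opinionated = posterior_mean(*opinionated_prior, successes, trials)
    gaps.append(abs(flat - opinionated))
    print(f"n={trials:>5}: flat={flat:.3f} opinionated={opinionated:.3f}")
```

The gap between the two posterior means shrinks from about 0.32 at 10 trials to about 0.001 at 10,000 trials.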

Huh, but isn't Probabilistic Programming just Stan and BUGS?

No, in Python you have PyMC3

  • A complete rewrite of PyMC2 now in 'Beta' status
  • Based upon Theano 
  •  Computational techniques for handling gradients
  • Automatic Differentiation and GPU speedup
  • Theano is also used in deep learning!
  • Currently there is a project to port 'BMH' (Bayesian Methods for Hackers) from PyMC2 to PyMC3
  • I gave a thorough tutorial on this - my github
  • Key authors: John Salvatier, Thomas Wiecki, Chris Fonnesbeck 
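
Gradients matter because gradient-based samplers need the derivative of the log-probability at every step. The sketch below illustrates automatic differentiation with forward-mode dual numbers in plain Python. This is not how Theano does it (Theano differentiates a symbolic expression graph); the Dual class and the test function f are hypothetical and serve only to show that exact derivatives can be computed mechanically.

```python
class Dual:
    """A number that carries its value and its derivative together."""

    def __init__(self, value, deriv=0.0):
        self.value = value
        self.deriv = deriv

    def __add__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        return Dual(self.value + other.value, self.deriv + other.deriv)

    __radd__ = __add__

    def __mul__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        # Product rule: (uv)' = u v' + u' v
        return Dual(self.value * other.value,
                    self.value * other.deriv + self.deriv * other.value)

    __rmul__ = __mul__

def derivative(f, x):
    """Evaluate f at x with a unit derivative seed and read off df/dx."""
    return f(Dual(x, 1.0)).deriv

def f(x):
    return x * x * x + 2 * x  # f'(x) = 3x^2 + 2

print(derivative(f, 3.0))  # 3 * 9 + 2 = 29.0
```

The derivative is exact rather than a finite-difference approximation, and nobody had to work it out by hand.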

Case study: Rugby Analytics

I wanted to do a model of the Six Nations last year.

I wanted to build an understandable model to predict the winner

Key Info: Inferring the 'strength' of each team.

We only have scoring data, which is noisy, hence Bayesian statistics.

What did I do?

1. I picked Gamma as a prior for all teams

2. I used a Hierarchical Model because I wanted home advantage to be stronger for stronger teams

3. From this I was able to create a novel model based only on historical results and scoring intensity 

4. I simulated the posterior distribution using MCMC
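
To give a feel for what a generative story for match scores can look like, here is a forward simulation in plain Python. It is a sketch under stated assumptions, not the model from the talk: it assumes points are Poisson distributed with a log-linear scoring intensity (intercept + home advantage + attack - defence), uses one shared home advantage for simplicity, and every team name and number is hypothetical. It only simulates data from fixed strengths; in a real analysis those strengths would be the unknowns inferred by MCMC.

```python
import math
import random

def sample_poisson(rate, rng):
    """Draw from a Poisson distribution using Knuth's multiplication method."""
    threshold = math.exp(-rate)
    count, product = 0, rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count

# Hypothetical team strengths on the log scale (made-up values).
INTERCEPT = 3.0
HOME_ADVANTAGE = 0.2
ATTACK = {"Team A": 0.15, "Team B": 0.10}
DEFENCE = {"Team A": 0.10, "Team B": 0.12}

def simulate_match(home, away, rng):
    """Return (home points, away points) for one simulated match."""
    home_rate = math.exp(INTERCEPT + HOME_ADVANTAGE + ATTACK[home] - DEFENCE[away])
    away_rate = math.exp(INTERCEPT + ATTACK[away] - DEFENCE[home])
    return sample_poisson(home_rate, rng), sample_poisson(away_rate, rng)

rng = random.Random(2015)
results = [simulate_match("Team A", "Team B", rng) for _ in range(2000)]
home_win_share = sum(h > a for h, a in results) / len(results)
mean_home_points = sum(h for h, _ in results) / len(results)
print(f"home win share: {home_win_share:.2f}, mean home points: {mean_home_points:.1f}")
```

Because the story runs forward from strengths to scores, every parameter has a plain-language meaning that can be explained to colleagues.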

Run the model

What actually happened

  • The model incorrectly predicted that England would come out on top.
  • Ireland actually won on points difference, by 6 points.
  • It really came down to the wire!
  • "Prediction is difficult, especially about the future"
  • One of the problems is what we call 'over-shrinkage'. You can delve into the results to see what the errors are; my model was within the errors.
  • Hat tip: Thanks to Abraham Flaxman and the PyMC3 team for helping me port this from PyMC2 to PyMC3

Lessons learned

  • I can build an explainable model using PyMC2 and PyMC3

  • Generative stories help you build up interest with your colleagues

  • Communication is the 'last mile' problem of Data Science

  • PyMC3 is cool, please use it and please contribute

Wanna learn more?




By springcoil


A discussion of Probabilistic programming
