Data Products

Or how to get models into production

PyData track at PyCon Italy

Friday 17th of April 2015

peadarcoyle@googlemail.com

All opinions my own

Who am I?

I work as a Data Scientist for a large Telecommunications Company

Masters in Mathematics

Specialized in Statistics and Machine Learning

Interned at Amazon

Was a consultant for a while

I've been an analytics product architect on one product

Occasional contributor to Pandas and other projects

@springcoil

We can't agree what data science is

I think a data scientist is someone with enough programming ability to apply their mathematical skills and domain-specific knowledge to turn data into solutions.

The solution should ideally be a product

To help the business most

I believe that data science offers the most value when the models are in production.

Some of us call this a 'Data Product'

In this talk I will explain how to use ScienceOps from Yhat to put a model into production.
Why should Amazon or Google get all the fun? Or competitive advantage?

The last mile problem

Sean Taylor at Facebook calls this the 'last mile problem'.

Or how do you translate the insight into something people use?

It is hard to incorporate data into day-to-day operations.

Data scientists are not software engineers

Although it is not acknowledged by some!

Producing models in code is not the same as producing a good web application. You need domain-specific knowledge of model building and the challenges it presents.

R and D != Engineering

Many software engineers think that data science is just an engineering problem.

However, scoping a model-building task is hard: you never quite know in advance how long it will take or whether it will work.

Takeaway: Make sure your stakeholders are ready for such high-risk, high-reward projects.

Hiring data scientists is hard...

Why?

The data science process involves something like OSEMIC

Obtain

Scrub

Explore

Model

Interpret

Communicate

Building the model involved porting code from Matlab and understanding a new domain specific problem.

The API data sources were messy and hard to understand

Case study: Problem description

A client was working on a visualization tool and needed to provide the results of a differential equation in a usable form to users.

The research was already done, so once the code was prototyped in Python, what next?

One key requirement was that the results of the 'mathematical engine' had to be incorporated quickly into a Ruby on Rails / JavaScript-based product.

The challenge, therefore, is one of interoperability.

Possible Solutions (and their problems)

Port code to Java --> Cross-language validation

PMML --> Doesn't have great language support

Batch jobs --> High maintenance and configuration

Write models in Ruby --> Turned out Ruby doesn't have an ODE solver
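Why 'cross-language validation' is a real cost: every port has to be checked against the Python prototype. A minimal sketch of such a check, assuming both implementations can export their results as lists of floats. The function name `compare_outputs` and the tolerances are illustrative and not from the project.

```python
import math


def compare_outputs(reference, ported, rel_tol=1e-9, abs_tol=1e-12):
    """Return the indices where the ported output disagrees with the reference.

    `reference` would come from the Python prototype and `ported` from the
    re-implementation (both are hypothetical inputs in this sketch).
    """
    if len(reference) != len(ported):
        raise ValueError("outputs have different lengths")
    return [
        i
        for i, (r, p) in enumerate(zip(reference, ported))
        if not math.isclose(r, p, rel_tol=rel_tol, abs_tol=abs_tol)
    ]
```

An empty list means the two implementations agree within tolerance. Anything else is a list of positions to investigate, and that investigation repeats every time the model changes.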

More tools, more work, more time

My first solution

Teach Math....

So I did what all data scientists do when stuck...

I found these guys

I could use the tools from YhatHQ to build a model as a service...

This is a much better solution!

I used ScienceOps from YhatHQ

Key Tenets
1. Work with the tools you already know

2. Iterate quickly

3. Low touch

4. No rewriting code
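The 'model as a service' idea can be sketched without any vendor tooling. This is not the ScienceOps API. It is a standard-library stand-in showing the shape of the idea: the model stays in Python and other languages talk to it through JSON over HTTP. The toy model, the field names (`k`, `x0`, `times`) and the port are assumptions made for illustration.

```python
import json
import math
from http.server import BaseHTTPRequestHandler, HTTPServer


def toy_model(k, x0, times):
    """Stand-in model: closed-form solution of dx/dt = -k * x."""
    return [x0 * math.exp(-k * t) for t in times]


def predict_json(body):
    """Turn a JSON request string into a JSON response string."""
    params = json.loads(body)
    values = toy_model(params["k"], params["x0"], params["times"])
    return json.dumps({"times": params["times"], "x": values})


class ModelHandler(BaseHTTPRequestHandler):
    """Answer POST requests with the model output as JSON."""

    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        body = self.rfile.read(length).decode("utf-8")
        payload = predict_json(body).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)


def serve(port=8000):
    """Run the service until interrupted (the port is a placeholder)."""
    HTTPServer(("127.0.0.1", port), ModelHandler).serve_forever()
```

Calling `serve()` starts the service. A Rails or JavaScript client then only needs to POST JSON and parse the JSON that comes back, so no model code is rewritten. A hosted platform adds what this sketch leaves out, such as authentication, versioning and scaling.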

Code!

http://bit.ly/1J3T4qf


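import numpy as np
from scipy import integrate

# Model parameters. The slide does not show their values; the numbers
# below are placeholders (assumptions) so that the script runs.
bs, astr, N, z0, C = 1.0, 1.0, 1.0, 1.0, 1.0
c1, c2, tdS, tt = 1.0, 1.0, 1.0, 1.0

# Coefficients of the ODE system
A1 = bs * (astr * N) ** 2
A2 = c1 / tdS
A4 = A1 * z0                      # moved up: A3 depends on A4
A3 = (1 + bs) * (A4 * N) ** 2
A5 = A3 * z0
A6 = C                            # A6 to A8 are not used in dX_dt below
A7 = 0.5 * ((c2 / tt) + (c1 / tdS))
A8 = (c2 / tt) - (c1 / tdS)

def dX_dt(X, t=0):
    """Return the right-hand side of the three coupled ODEs."""
    return np.array([-A1 * X[2] + A4,
                     -A2 * X[1] + A3 * X[2] - A5,
                     X[0] - X[1]])

t = np.linspace(0, 35, 1000)      # time
X0 = np.array([0, 1, 0])          # initial conditions
X, infodict = integrate.odeint(dX_dt, X0, t, full_output=True)
print(infodict['message'])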

What are the key takeaways?

1. The 'magic quickly' problem

2. Lack of a shared language between software engineers and data scientists. Investing in the right tooling and using open standards helps bridge it.

3. To help data scientists and analysts succeed, your business needs to be prepared to invest in tooling.

Magic quickly

https://xkcd.com/1425/
Research is not engineering!

Lack of a shared language

Statisticians and software engineers don't necessarily have a shared language.

Services like ScienceOps help bridge the gap.

"Watch for high skew and kurtosis"

Think about your team balance in your projects. Math folk versus coders.
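For the coders in the room, the statistician's remark above translates into a few lines of arithmetic. A minimal sketch using the population (biased) moment definitions; the function name `skew_and_kurtosis` is illustrative, and other libraries may apply bias corrections that give slightly different numbers.

```python
def skew_and_kurtosis(values):
    """Return (skewness, excess kurtosis) from central moments.

    skewness        = m3 / m2 ** 1.5
    excess kurtosis = m4 / m2 ** 2 - 3   (0 for a normal distribution)
    """
    n = len(values)
    if n == 0:
        raise ValueError("need at least one value")
    mean = sum(values) / n
    deviations = [v - mean for v in values]
    m2 = sum(d ** 2 for d in deviations) / n
    if m2 == 0:
        raise ValueError("all values are identical; moments are undefined")
    m3 = sum(d ** 3 for d in deviations) / n
    m4 = sum(d ** 4 for d in deviations) / n
    return m3 / m2 ** 1.5, m4 / m2 ** 2 - 3
```

High skew means a lopsided distribution with one long tail. High kurtosis means heavy tails, that is, more extreme values than a normal distribution would produce. Both are warnings that averages alone may mislead.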

Invest in tooling

For your analysts and data scientists to succeed you need to invest in infrastructure to empower them.
Think carefully about how you want your company to spend its innovation tokens, and take advantage of excellent existing tools such as ScienceOps and AWS.
I think there is great scope for entrepreneurs to take advantage of this arbitrage opportunity and build good tooling to empower data scientists by building platforms.
Contribute to Open Source Software such as the PyData stack!

Alternatives to YhatHQ

(that I know of)

Lessons learned

I can write a model in Python and have it deployed!
Software Engineers aren't data scientists and shouldn't be expected to write models in code.
Models only provide value when they are in production.
Getting information from stakeholders is really valuable in improving models.

Successes

Within a few months it was possible to have an analytics product in production, using information consumed from a variety of APIs.
I have no idea how else I could have deployed the models, except perhaps with PMML.
Total development time was 3 months with 5 people. Only two of us (including me) were working full-time on the project.
That development time includes the time we needed to learn the domain: the models, the API sources, etc.

Other kinds of data science Products

Credit risk modelling
Customer attrition modelling
Recommendation engines
Airline delay analysis
The list goes on....

Wanna learn more?

www.yhathq.com

peadarcoyle@googlemail.com

Models in Production

By springcoil

Models in Production

How to create a Data Product on a budget :)

