Data Products
Or how to get models into production
All opinions my own
Who am I?
- Masters in Mathematics
- Specialized in Statistics and Machine Learning
- Interned at Amazon
- Was a consultant for a while
- I've been an analytics product architect on one product
- Occasional contributor to Pandas and other projects
- @springcoil
We can't agree what data science is
I think a data scientist is someone with enough programming ability to turn data into solutions using their mathematical skills and domain-specific knowledge.
The solution should ideally be a product
To help the business most
- I believe that data science offers the most value when the models are in production.
- Some of us call this a 'Data Product'
- In this talk I will explain how to use ScienceOps from Yhat to put a model into production
- Why should Amazon or Google get all the fun? Or competitive advantage?
The last mile problem
Or how do you translate the insight into something people use?
It is hard to incorporate data into day-to-day operations.
Data scientists are not software engineers
R and D != Engineering
Hiring data scientists is hard...
Why?
The data science process involves something like OSEMIC
Obtain
Scrub
Explore
Model
Interpret
Communicate
Case study: Problem description
Building the model involved porting code from MATLAB and understanding a new domain-specific problem.
The API data sources were messy and hard to understand.
Possible Solutions (and their problems)
Port code to Java -> Cross-language validation
PMML -> Doesn't have great language support
Batch jobs -> High maintenance and config
More tools, more work, more time
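To make the "cross-language validation" cost concrete: a port is only trustworthy once its outputs have been compared with the original model's outputs on the same inputs. The sketch below is one minimal way to do that comparison in Python. The function name `compare_outputs`, the tolerances, and the score values are all hypothetical and chosen for illustration only; in practice the second list would be exported from the ported implementation.

```python
import math


def compare_outputs(reference, ported, rel_tol=1e-9, abs_tol=1e-12):
    """Return (index, reference_value, ported_value) for every mismatch."""
    if len(reference) != len(ported):
        raise ValueError("both implementations must score the same cases")
    return [
        (i, r, p)
        for i, (r, p) in enumerate(zip(reference, ported))
        if not math.isclose(r, p, rel_tol=rel_tol, abs_tol=abs_tol)
    ]


# Hypothetical outputs: the first list from the original model, the second
# as exported by the ported implementation for the same inputs.
reference_scores = [0.125, 0.5, 0.731058]
ported_scores = [0.125, 0.5000000001, 0.74]

mismatches = compare_outputs(reference_scores, ported_scores)
print(mismatches)
```

Every model change means regenerating both lists and rerunning the check, which is part of why this route adds work.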
My first solution
So I did what all data scientists do when stuck...
I found these guys
I could use tools from Yhat to build a model as a service...
This is a much better solution!
I used ScienceOps from Yhat
Key Tenets
1. Work with the tools you already know
2. Iterate quickly
3. Low touch
4. No rewriting code
Code!
import numpy as np
from scipy import integrate

# Model coefficients. The parameters bs, astr, N, c1, c2, tdS, tt, z0 and C
# come from the domain problem and must be defined before this point.
A1 = bs * (astr * N) ** 2
A2 = c1 / tdS
A4 = A1 * z0  # defined before A3, which depends on it
A3 = (1 + bs) * (A4 * N) ** 2
A5 = A3 * z0
A6 = C
A7 = 0.5 * ((c2 / tt) + (c1 / tdS))
A8 = (c2 / tt) - (c1 / tdS)


def dX_dt(X, t=0):
    """Return the right-hand side of the three coupled ODEs."""
    return np.array([
        -A1 * X[2] + A4,
        -A2 * X[1] + A3 * X[2] - A5,
        X[0] - X[1],
    ])


t = np.linspace(0, 35, 1000)  # time grid
X0 = np.array([0, 1, 0])  # initial conditions
X, infodict = integrate.odeint(dX_dt, X0, t, full_output=True)
print(infodict['message'])
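To show what "model as a service" means for a model like this one, here is a minimal sketch of a JSON-in, JSON-out entry point around the same three-equation system. This is not the ScienceOps API: `handle_request`, the payload keys (`coefficients`, `x0`, `t_end`, `steps`) and the default values are assumptions made up for illustration. A fixed-step fourth-order Runge-Kutta integrator stands in for `scipy.integrate.odeint` so that the sketch needs only the standard library.

```python
import json

REQUIRED = {"A1", "A2", "A3", "A4", "A5"}


def rhs(x, a):
    """Right-hand side of the three-equation system from the slide."""
    return [
        -a["A1"] * x[2] + a["A4"],
        -a["A2"] * x[1] + a["A3"] * x[2] - a["A5"],
        x[0] - x[1],
    ]


def rk4(x0, a, t_end, steps):
    """Integrate from t=0 to t_end with fixed-step RK4 (stand-in for odeint)."""
    h = t_end / steps
    x = [float(v) for v in x0]
    for _ in range(steps):
        k1 = rhs(x, a)
        k2 = rhs([xi + 0.5 * h * k for xi, k in zip(x, k1)], a)
        k3 = rhs([xi + 0.5 * h * k for xi, k in zip(x, k2)], a)
        k4 = rhs([xi + h * k for xi, k in zip(x, k3)], a)
        x = [
            xi + h / 6.0 * (p + 2 * q + 2 * r + s)
            for xi, p, q, r, s in zip(x, k1, k2, k3, k4)
        ]
    return x


def handle_request(body):
    """Hypothetical service entry point: JSON string in, JSON string out."""
    payload = json.loads(body)
    coefficients = payload.get("coefficients", {})
    missing = sorted(REQUIRED - set(coefficients))
    if missing:
        return json.dumps({"error": "missing coefficients: " + ", ".join(missing)})
    final_state = rk4(
        payload.get("x0", [0.0, 1.0, 0.0]),
        coefficients,
        payload.get("t_end", 35.0),
        payload.get("steps", 1000),
    )
    return json.dumps({"final_state": final_state})
```

The point of the design is that callers only ever see JSON, so the application that consumes the model does not need to know it is written in Python.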
What are the key takeaways?
Magic quickly
https://xkcd.com/1425/
Research is not engineering!
Lack of a shared language
Statisticians and software engineers don't necessarily have a shared language.
Services like ScienceOps help bridge the gap.
"Watch for high skew and kurtosis"
Think about your team balance in your projects. Math folk versus coders.
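For the coders in the room, the quote above is less mysterious once it is written as code. The sketch below computes skewness and excess kurtosis from the population (biased) moment definitions; the function name is hypothetical, and statistics packages often apply bias corrections, so their values can differ slightly on small samples.

```python
def skew_and_excess_kurtosis(values):
    """Return (skewness, excess kurtosis) using population moments."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two values")
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n  # variance
    if m2 == 0:
        raise ValueError("all values are identical")
    m3 = sum((v - mean) ** 3 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    skewness = m3 / m2 ** 1.5         # 0 for symmetric data
    excess_kurtosis = m4 / m2 ** 2 - 3.0  # 0 for a normal distribution
    return skewness, excess_kurtosis


# Hypothetical data with one large value: a long right tail.
skewness, excess_kurtosis = skew_and_excess_kurtosis([1, 1, 1, 1, 10])
print(skewness, excess_kurtosis)
```

Positive skewness signals a long right tail, and high excess kurtosis signals heavy tails; both are warnings that averages and normal-theory assumptions may mislead.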
Invest in tooling
- For your analysts and data scientists to succeed, you need to invest in infrastructure that empowers them.
- Think carefully about how you want your company to spend its innovation tokens, and take advantage of excellent tools such as ScienceOps and AWS.
- I think there is great scope for entrepreneurs to exploit this arbitrage opportunity by building platforms and tooling that empower data scientists.
- Contribute to Open Source Software such as the PyData stack!
Alternatives to YhatHQ
(that I know of)
Lessons learned
- I can write a model in Python and have it deployed!
- Software engineers aren't data scientists and shouldn't be expected to write models in code.
- Models only provide value when they are in production
- Getting information from stakeholders is really valuable in improving models.
Successes
- Within a few months it was possible to have an analytics product in production, using information consumed from a variety of APIs.
- I have no idea how else I could have deployed the models, except perhaps with PMML.
- Total development time was 3 months with 5 people. Only two of us (including me) worked full-time on the project.
- That development time includes the time it took us to learn the domain-specific knowledge: the models, the API sources, etc.
Other kinds of data science Products
- Credit risk modelling
- Customer attrition modelling
- Recommendation engines
- Airline delay analysis
- The list goes on....
Wanna learn more?
www.yhathq.com
peadarcoyle@googlemail.com
Models in Production
By springcoil
How to create a Data Product on a budget :)