The 14th Annual Scientific Computing with Python Conference (SciPy) is scheduled for July 6-12, 2015 in Austin, Texas.
#SciPy2015
http://www.oompu.com
-
The New York Times
-
Open Source Developer
-
Python Scientific Ecosystem
-
Q&A
Who am I?
Park Surk
NEXCORE Platform Expert @SK
Python Evangelist
Travel Swimming SnowBoarding
What is SciPy?
SciPy
“SciPy (pronounced "Sigh Pie") is open-source software for mathematics, science, and engineering. It is also the name of a very popular conference on scientific programming with Python.”
SciPy
“these are some of the core packages.”
NumPy : Base N-Dimensional Array Package
SciPy library : Fundamental Library for scientific computing
Matplotlib : Comprehensive 2D Plotting
IPython : Enhanced Interactive Console
SymPy : Symbolic Mathematics
pandas : Data structures & analysis
Contributors
Data Science @ NYT
Newspapering
1851
1996
2008
Now
Every publisher is now a startup
Technology-Enabled Journalism
How a 164-year-old content company became data-driven
Learnings
Supervised learning
Unsupervised Learning
Reinforcement Learning
Supervised Learning
Which subscribers are going to cancel their subscription?
How many copies of the newspaper are needed tomorrow?
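As a rough illustration of the second question, a supervised model learns from past (input, answer) pairs and then predicts an unseen case. The sketch below fits a straight line by ordinary least squares using only the standard library. The function name fit_line and the circulation numbers are made up for illustration and are not from the talk.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical training data: day index -> copies sold (made-up numbers).
days = [1, 2, 3, 4, 5]
copies = [1000, 1020, 1040, 1060, 1080]

slope, intercept = fit_line(days, copies)
forecast = slope * 6 + intercept  # prediction for "tomorrow" (day 6)
print(forecast)  # 1100.0
```

A real forecast would use many more features (weekday, news events, season) and a library such as scikit-learn, but the train-then-predict pattern is the same.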
Unsupervised Learning
What is the hot issue related to President Obama today?
Who does what on The New York Times web site?
Reinforcement Learning
A/B Test
Google case study : http://fastml.com/ab-testing-with-bayesian-bandits-in-google-analytics/
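The Bayesian-bandit style of A/B test mentioned above can be sketched with Thompson sampling: each variant keeps a Beta posterior over its conversion rate, and traffic drifts toward the variant that currently looks best. This is a minimal simulation under stated assumptions, not the method used by Google Analytics or the NYT. The function name thompson_sampling, the two click-through rates, and the Beta(1, 1) prior are all illustrative choices.

```python
import random


def thompson_sampling(true_rates, n_rounds=5000, seed=0):
    """Simulate a Bayesian-bandit A/B test with Beta(1, 1) priors."""
    rng = random.Random(seed)
    n_arms = len(true_rates)
    wins = [0] * n_arms    # observed successes per variant
    losses = [0] * n_arms  # observed failures per variant
    for _ in range(n_rounds):
        # Draw one plausible conversion rate per variant from its posterior.
        samples = [rng.betavariate(wins[i] + 1, losses[i] + 1)
                   for i in range(n_arms)]
        arm = samples.index(max(samples))
        # Simulated visitor response; in a real test this is an observed click.
        if rng.random() < true_rates[arm]:
            wins[arm] += 1
        else:
            losses[arm] += 1
    return wins, losses


# Hypothetical click-through rates for two headlines (made-up numbers).
wins, losses = thompson_sampling([0.04, 0.06])
pulls = [w + l for w, l in zip(wins, losses)]
print("visitors shown each headline:", pulls)
```

Unlike a fixed 50/50 split, the allocation adapts while the test runs, which is why this approach is usually filed under reinforcement learning.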
Actual Integration with 3 learnings
[Diagram: Supervised, Unsupervised, and Reinforcement Learning mapped onto the stages Explore, Learning, Test, Optimizing, Reporting]
Common Requirements in Data Science
1. People
New MindSet > New ToolSet
2. Ideas : Data Skills
#Data Engineering #Data Science #Data Visualization
#Data Product #Data Multiliteracies #Data Embeds
3. Things : What does the DS Team deliver?
#Build data prototypes #Build APIs
#Collaboration with people & teams #Impact Roadmaps
http://daeilkim.com/refinery.html
Wes McKinney's Stories
as a Python programmer
- Python Projects
: SM(StatsModels : Statistics in Python)
: Pandas
- Companies
: AQR
: AppNexus
: Cloudera
- Books
: 'Python for Data Analysis'
<pre-2007>
- Mathematician.
- No exposure to Python, SQL, R(or any analytics for that matter)
- Rude awakening
<first job : AQR(Quant Hedge Fund)>
- A quant finance operation that lived and breathed SQL and Excel
- Production systems in C++, Java, Visual Basic, and C# .NET
- Some PhD-level researchers used MATLAB for research (as was common in finance/economics departments)
<2008 : Productivity frustrations>
- First year: several analytics and statistical data analysis projects
: A huge amount of SQL
: Some data
: A little bit of R
: and TONS of Excel
- Projects felt like 5% conceptualization, 95% tedium
<Python in early 2008: different times>
- A bleeding edge stack
: Numpy 1.0.4
: SciPy 0.6.0
: matplotlib 0.9.12
: IPython 0.8.4, SVN history begins 2/2008
: Cython 0.9.8
- The scientific Python community seemed mainly focused on attracting MATLAB, HPC, and scientific lab users
<2008 : Things SciPythonists didn't care too much about>
- Relational data or SQL
- Missing data handling
- Statistics and econometrics(first StatsModels release : 2011)
- Statistical graphics
- Machine Learning (scikit-learn 0.1 : 2/2010)
- Analytics and business intelligence
<Taking a gamble>
- Decided to give Python a shot for AQR projects after seeing part of MASS R package ported in scipy.stats.models by Jonathan Taylor at Stanford
- proto-pandas first version built in April 2008
: focused on porting an R project to Python
- May '08 : Embedded Python interpreter in a legacy C++ system
- 5/2008 ~ 12/2008 : Skunkworks Python ports and evangelism across the company
<Why did Python work out?>
- Batteries included
: other SciPy packages
- Interoperability with C++
: Embedding Python interpreter
: Wrapping C++ in Python C extensions
- Productive user interface
: Python language
: IPython + matplotlib
<End 2009 : Pandas!>
- AQR lets me open source pandas 0.1 on Christmas, 2009
<2010~2011 : Python's data growing pains>
- pandas did not evolve much after its initial release
- No consensus or momentum behind any project for analytics / data wrangling
- AQR --> Duke Statistical Science
- AQR sponsors bug fixes and new features in pandas
<May 2011 : Getting inspired>
- 2011-05-13 : Enthought Datarray Summit
: Discussed how to make Python more useful for statistical computing
: Me : "Library fragmentation is destructive; integration is better"
: Data structures, missing data, and data wrangling tools
- 2011-05-13 ~ 2011-06-03 : Python finance consulting engagement
: Realized that Python data tools were sorely needed in industry
: But not nearly mature enough yet
<Making pandas a better tool>
- Consulting at AppNexus(NYC ad tech company) opened eyes to new problems
- June 2011 ~ December 2012
: Fix some pandas design issues
: Build out data wrangling capabilities(hierarchical indexes, etc...)
: Create "killer apps"(time series capabilities)
: Evangelize and collaborate with other projects
<Making a book happen>
- Python for Data Analysis
- A chicken-and-egg problem
- Fernando Perez, Brian Granger, and John Hunter had been toying with the idea of a "SciPy Book" for a couple years
- Decided to forge my own path in Nov 2011
: Writing took about 9 months
: Helped motivate me to "finish" parts of pandas
- 50,000 copies in circulation
<Clarity and software engineering>
- Progress in software not just about hard work
- Solving the right problems
: ...in the right order
: ...while wasting little time/energy on non-impactful issues
: ...while being faced with real world concerns(80/20 rule)
- Taking the time to develop a clear vision and scope for a project is a major factor in its success or failure
<It took a village>
- Fernando Perez & Brian Granger (IPython)
- Skipper Seabold & Josef Perktold (StatsModels)
- Eric Jones (Enthought)
- Travis Oliphant & Peter Wang (Enthought & Continuum)
- John Hunter (matplotlib)
- ... and many others
<serendipity>
An unlikely train ride
SEA ->PDX
November 18, 2011
My seatmate was computational bio professor and 5-year PSF member Titus Brown
And he would later assist the IPython team in their Sloan Foundation $1MM grant in 2012
<Some words about John Hunter(1968~2012)>
<Business ventures 2012~2014>
- 2012 : Lambda Foundry
: Support and develop pandas
: Explored creating a commercial Python financial Toolkit
- 2013~2014 : DataPad
: "Google Drive" for Analytics / BI
: With Chang She (MIT --> AQR --> pandas)
: Silicon Valley VC-backed
: Acquired by Cloudera in September 2014
<Cloudera>
- Sort of "the Red Hat of Big Data"
- The leading open source Hadoop Platform
- Supporting and developing a little over 20 Apache-licensed open source projects
- A dream job
: Full time open source development
: Solving hard data problems faced by the world's largest companies
- P.S. we're hiring engineers in Austin + Bay Area
<What I'm interested in right now>
- Ways to enable collaboration on data tools across programming languages
- Domain Specific language design and compilation
- Improving the Python-on-Hadoop experience
- LLVM + Code generation
<The Great Data Tool Decoupling>
- Thesis : over time, user interfaces, data storage, and execution engines will decouple and specialize
- In fact, you should really want this to happen
: Share systems among languages
: Reduce fragmentation and "lock-in"
: Shift developer focus to usability
- Prediction : we'll be there by 2025; sooner if we all get our act together
[Diagram: Visualization, Computation, and Shell, shown <Early on> and <Today>; labels on the <Today> diagram: Python, Matplotlib, Integration, Scipy, Jupyter]
Python
Scientific Ecosystem
[Diagram of the Python scientific ecosystem; surviving labels: PyMC, NIPY]
<Many more tools>
- Performance : Numba, Weave, Numexpr, Theano ...
- Visualization : Bokeh, Seaborn, Plotly, Chaco, mpld3, ggplot, MayaVi, vincent, toyplot, HoloViews ...
- Data Structures & Computation : Blaze, Dask, DistArray, XRay, Graphlab, SciDB-py, PySpark ...
- Packaging & Distribution : pip/wheels, conda, EPD, Canopy, Anaconda ...
<Recent Development>
1. Foundation
- Python 3
2. Visualization
- Matplotlib 1.4 , 2.0
- Seaborn = Matplotlib + Pandas + statistical visualization
- Bokeh = Powerful interactive visualization, HTML5, JavaScript lib
3. Arrays & Data Structures
- Xray = NumPy + Pandas
- Dask = lightweight tool for general parallelized array storage and computation
4. Computation & Performance
- Numba = with a simple decorator, Python code is JIT-compiled via LLVM and executes at near C/Fortran speed
5. Distribution & Packaging
- Anaconda
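The "simple decorator" point about Numba in the list above can be sketched as follows. This is a minimal example, assuming Numba is installed. The try/except fallback is an addition for illustration so the code still runs (uncompiled) without it, and the function leibniz_pi is a made-up workload, not something from the talk.

```python
import math

try:
    from numba import njit  # Numba's nopython-mode JIT decorator
except ImportError:
    def njit(func):
        # Fallback: without Numba, run the plain Python function unchanged.
        return func


@njit
def leibniz_pi(n_terms):
    """Approximate pi with the Leibniz series, a tight numeric loop."""
    total = 0.0
    sign = 1.0
    for k in range(n_terms):
        total += sign / (2 * k + 1)
        sign = -sign
    return 4.0 * total


approx = leibniz_pi(200000)
print(abs(approx - math.pi))  # small error; the first call includes compile time
```

The function body is ordinary Python; only the decorator changes. Loops over scalars like this one are the kind of code where a JIT tends to help most.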
<IPython & Jupyter>
So much happening ...
- The IPython/Jupyter split
- Widgets = awesome
- Docker-based backends
- Jupyter Hub
- New $6M grant, first week of July 2015
<Why Python?>
- Python was created in the 1980s as a teaching language, and to bridge the gap between the shell and C.
- Guido van Rossum : "I thought we'd write small Python programs, maybe 10 lines, maybe 5, maybe 500 lines - that would be a big one"
- Python is not a scientific programming language
: Why did a "toy language" become the core of a scientific stack?
- Python is glue
- Python glues together this hodge-podge of scientific tools.
- High-level syntax wraps low-level C/Fortran libraries, which is (mostly) where the computation happens.
- It is speed of development, not necessarily speed of execution, that has driven Python's popularity.
- "Why don't you use C instead of Python? It's so much faster!"
: "Why don't you commute by airplane instead of by car? It's so much faster!"
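The "glue" idea in the list above, high-level syntax wrapping a low-level compiled library, can be shown with the standard-library ctypes module. This is a minimal sketch, assuming a Unix-like system where the C math library can be located. NumPy and SciPy wrap their C/Fortran code through compiled extension modules rather than ctypes, so this illustrates only the principle.

```python
import ctypes
import ctypes.util
import math

# Locate and load the C math library (assumption: Unix-like platform;
# the lookup is platform dependent and does not work this way on Windows).
libm = ctypes.CDLL(ctypes.util.find_library("m"))

# Declare the C signature: double cos(double)
libm.cos.argtypes = [ctypes.c_double]
libm.cos.restype = ctypes.c_double

# The call reads like Python, but the computation happens in compiled C.
print(libm.cos(0.0))  # 1.0
print(abs(libm.cos(math.pi) - math.cos(math.pi)) < 1e-12)  # True
```

The Python side contributes the readable call syntax and the type declarations, while the numeric work stays in the compiled library.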
<But this efficiency depends on the Scientific Stack...>
[Timeline 1995-2015 : Numeric, Multipack, Numarray, NumPy, SciPy, IPython, Matplotlib, Notebook, Pandas, Scikits, Conda]
1995 : "Numeric" was an early Python scientific array library, largely written by Jim Hugunin. Numeric -> NumPy
1998 : "Multipack", built on Numeric, was a set of wrappers of Fortran packages written by Travis Oliphant. Multipack -> SciPy
2002 : "Numarray" was created by Perry Greenfield, Paul Dubois, and others to address fundamental deficiencies in Numeric for larger datasets
2006 : In a herculean effort to head off this split in the community, Travis Oliphant incorporated the best parts of Numeric + Numarray into "NumPy"
2000 : Eric Jones, Travis Oliphant, Pearu Peterson, and others spun Multipack into the "SciPy" package, aiming for a full Python MATLAB replacement
2001 : Fernando Perez started the "IPython" project, aiming for a Mathematica-style environment for scientific Python
2002 : John Hunter wanted an open MATLAB replacement, and started "matplotlib" as an effort at MATLAB-style visualization
2012 : The IPython team released the "IPython Notebook" and the world has never been the same
2009 : Wes McKinney began "Pandas", eventually drawing in a much larger Python user base, especially in industry data science
2009 : With SciPy's sheer size making fast development difficult, the community decided to promote "scikits" as an avenue for more specialized algorithms
2012 : Continuum releases "conda", a package manager for scientific computing
<Lessons Learned>
1. No centralized leadership! What is "core" in the ecosystem evolves and is up to the community
- Evolving computational core : Numba?
: Just as Cython matured to become a core piece, perhaps Numba might as well? How might a JIT-enabled SciPy, scikit-learn, pandas, etc. look?
<Lessons Learned>
2. To be most useful as an ecosystem, we must be willing for packages to adapt to the changing landscape.
- Evolving computational core : Pandas?
: Modern data is sparse, heterogeneous, and labeled, and NumPy arrays don't measure up : let's make Pandas a core dependency!
- Evolving computational core : pandas, Seaborn --> matplotlib
: With Pandas as a core dependency, what elements of Seaborn & Pandas could be moved into matplotlib?
- Evolving the core : SciPy
: SciPy's monolithic design was driven by packaging & distribution difficulties.
<Lessons Learned>
3. Interoperability with core pieces of other languages has been key to the success of the SciPy stack (e.g. C/Fortran libraries, new Jupyter framework)
- Universal Plotting Serialization?
: Much of modern interactive plotting (d3, HTML5, Bokeh, ggvis, mpld3, etc.) involves generating & processing plot serializations
: matplotlib -> {JSON} -> javascript -> plotting on the web
: Doing this natively in matplotlib would open up extensibility!
- Universal DataFrames?
: R, Python, Julia use C/Fortran memory blocks
: R, Python, Julia use RDataFrame, Pandas, DataFrames.jl
: In the future R, Python, Julia use ... a so-called "Uber DataFrame"?
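The plot-serialization idea above (a plot described as JSON that a JavaScript renderer can consume) can be sketched with the standard-library json module. The schema and the function name serialize_line_plot are hypothetical, invented here for illustration. They are not matplotlib's, mpld3's, or Bokeh's actual format.

```python
import json


def serialize_line_plot(x, y, title="", xlabel="", ylabel=""):
    """Return a JSON string describing a line plot (hypothetical schema)."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    spec = {
        "type": "line",
        "title": title,
        "axes": {"x": {"label": xlabel}, "y": {"label": ylabel}},
        "data": [{"x": xi, "y": yi} for xi, yi in zip(x, y)],
    }
    return json.dumps(spec, sort_keys=True)


payload = serialize_line_plot([0, 1, 2], [0.0, 1.0, 4.0], title="y = x^2")

# A web client would do the equivalent of JSON.parse(payload) and draw it.
restored = json.loads(payload)
print(restored["data"][2])  # {'x': 2, 'y': 4.0}
```

Because the payload is plain JSON, any renderer in any language can consume it, which is the extensibility argument made on the slide.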
<Lessons Learned>
4. The stack was built from both continuity(e.g. Numeric/Numarray->NumPy) and brand-new efforts(e.g. matplotlib, Pandas). Don't discount either approach!
- Considering the future of Matplotlib (usual complaints about Matplotlib)
: Non-optimal stylistic defaults -> matplotlib 2.0
: Non-optimal API -> Seaborn, ggplot
: Difficulty exporting interactive plots -> Serialization to mpld3/Bokeh
: Difficulty with large datasets -> ???
- Lesson from Numeric/Numarray, etc.
: Stick with matplotlib & modify it (e.g. serialization to VisPy? Numba-driven backend? New backend architecture? etc.)
- Lesson from Pandas & Matplotlib, etc.
: Start something from scratch; features will draw users! (e.g. VisPy, Bokeh, something new?)
Don't discount either approach!!!
[Timeline repeated with two highlights]
Continuity : NumPy (2006), which merged the best parts of Numeric + Numarray
Brand-new Effort : matplotlib (2002) and Pandas (2009)
What's It All Mean?
Build a Data Science Team
Glue all decoupled systems
Make Data-Driven Decisions
Impact Roadmaps
Learn Learning with Python
parksurk@gmail.com
Park Surk
Private MOOCs
http://www.oompu.com
http://www.facebook.com/parksurk
SciPy 2015 Conference Review
By SURK PARK
A review talk on the SciPy 2015 conference, related to the SK Global Expertise Sharing Program