#SciPy2015
http://www.oompu.com
Base N-Dimensional Array Package
Fundamental Library for Scientific Computing
Comprehensive 2D Plotting
Enhanced Interactive Console
Symbolic Mathematics
Data Structures & Analysis
Which subscribers are going to cancel a subscription?
How many copies of the newspaper will be needed tomorrow?
What is the hot issue related to President Obama today?
Who does what on The New York Times web site?
A/B Test
Google case study : http://fastml.com/ab-testing-with-bayesian-bandits-in-google-analytics/
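The linked case study concerns Bayesian bandits as an alternative to a fixed-split A/B test. As a rough sketch of that idea (not the method used by Google Analytics or by the speaker), the following simulates Thompson sampling over two page variants using only the standard library. The conversion rates, the number of rounds, and all function and variable names are hypothetical.

```python
import random


def thompson_sampling(true_rates, rounds, seed=0):
    """Simulate a Bernoulli bandit with a Beta(1, 1) prior on each variant.

    true_rates are hidden from the algorithm; they only drive the simulation.
    Returns how many visitors were sent to each variant.
    """
    rng = random.Random(seed)
    successes = [0] * len(true_rates)
    failures = [0] * len(true_rates)
    pulls = [0] * len(true_rates)
    for _ in range(rounds):
        # Draw one plausible conversion rate per variant from its posterior.
        samples = [
            rng.betavariate(s + 1, f + 1)
            for s, f in zip(successes, failures)
        ]
        # Show the variant whose sampled rate is highest.
        arm = samples.index(max(samples))
        pulls[arm] += 1
        # Simulate whether this visitor converts.
        if rng.random() < true_rates[arm]:
            successes[arm] += 1
        else:
            failures[arm] += 1
    return pulls


# Hypothetical variants: A converts at 5%, B at 15%.
pulls = thompson_sampling(true_rates=[0.05, 0.15], rounds=5000)
print(pulls)
```

Unlike a fixed 50/50 split, the allocation shifts toward the better-performing variant as evidence accumulates, so fewer visitors are spent on the weaker one.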
New MindSet > New ToolSet
#Data Engineering #Data Science #Data Visualization
#Data Product #Data Multiliteracies #Data Embeds
#Build data prototypes #Build APIs
#Collaboration with people & teams #Impact Roadmaps
http://daeilkim.com/refinery.html
- Python Projects
: SM(StatsModels : Statistics in Python)
: Pandas
- Companies
: AQR
: AppNexus
: Cloudera
- Books
: 'Python for Data Analysis'
- Mathematician.
- No exposure to Python, SQL, R (or any analytics, for that matter)
- Rude awakening
- A quant finance operation that lived and breathed SQL and Excel
- Production systems in C++, Java, Visual Basic, and C# .NET
- Some PhD-level researchers used MATLAB for research (as was common in finance/economics departments)
- First year : several analytics and statistical data analysis projects
: A huge amount of SQL
: Some data
: A little bit of R
: and TONS of Excel
- Projects felt like 5% conceptualization, 95% tedium
- A bleeding-edge stack
: NumPy 1.0.4
: SciPy 0.6.0
: matplotlib 0.91.2
: IPython 0.8.4, SVN history begins 2/2008
: Cython 0.9.8
- The scientific Python community seemed mainly focused on attracting MATLAB, HPC, and scientific lab users
- Relational data or SQL
- Missing data handling
- Statistics and econometrics (first StatsModels release : 2011)
- Statistical graphics
- Machine learning (scikit-learn 0.1 : 2/2010)
- Analytics and business intelligence
- Decided to give Python a shot for AQR projects after seeing part of MASS R package ported in scipy.stats.models by Jonathan Taylor at Stanford
- Proto-pandas : first version built in April 2008
: focused on porting an R project to Python
- May '08 : Embedded Python interpreter in a legacy C++ system
- 5/2008 ~ 12/2008 : Skunkworks Python ports and evangelism across the company
- Batteries included
: other SciPy packages
- Interoperability with C++
: Embedding Python interpreter
: Wrapping C++ in Python C extensions
- Productive user interface
: Python language
: IPython + matplotlib
- AQR lets me open source pandas 0.1 on Christmas, 2009
- pandas did not evolve much after its initial release
- No consensus or momentum behind any project for analytics / data wrangling
- AQR --> Duke Statistical Science
- AQR sponsors bug fixes and new features in pandas
- 2011-05-13 : Enthought Datarray Summit
: Discuss how to enable Python to become more useful for statistical computing
: Me : "Library fragmentation is destructive; integration is better"
: Data structures, missing data, and data wrangling tools
- 2011-05-13 ~ 2011-06-03 : Python finance consulting engagement
: Realized that Python data tools were sorely needed in industry
: But not nearly mature enough yet
- Consulting at AppNexus (NYC ad tech company) opened my eyes to new problems
- June 2011 ~ December 2012
: Fix some pandas design issues
: Build out data wrangling capabilities (hierarchical indexes, etc.)
: Create "killer apps"(time series capabilities)
: Evangelize and collaborate with other projects
- Python for Data Analysis
- A chicken-and-egg problem
- Fernando Perez, Brian Granger, and John Hunter had been toying with the idea of a "SciPy Book" for a couple of years
- Decided to forge my own path in Nov 2011
: Writing took about 9 months
: Helped motivate me to "finish" parts of pandas
- 50,000 copies in circulation
- Progress in software is not just about hard work
- Solving the right problems
: ...in the right order
: ...while wasting little time/energy on non-impactful issues
: ...while being faced with real world concerns(80/20 rule)
- Taking the time to develop a clear vision and scope for a project is a major factor in its success or failure
- Fernando Perez & Brian Granger (IPython)
- Skipper Seabold & Josef Perktold (StatsModels)
- Eric Jones (Enthought)
- Travis Oliphant & Peter Wang (Enthought & Continuum)
- John Hunter (matplotlib)
- ... and many others
An unlikely train ride
SEA -> PDX
November 18, 2011
My seatmate was computational biology professor and 5-year PSF member Titus Brown
He would later assist the IPython team with their $1MM Sloan Foundation grant in 2012
- 2012 : Lambda Foundry
: Support and develop pandas
: Explored creating a commercial Python financial toolkit
- 2013~2014 : DataPad
: "Google Drive" for Analytics / BI
: With Chang She (MIT --> AQR --> pandas)
: Silicon Valley VC-backed
: Acquired by Cloudera in September 2014
- Sort of "the Red Hat of Big Data"
- The leading open source Hadoop Platform
- Supporting and developing a little over 20 Apache-licensed open source projects
- A dream job
: Full time open source development
: Solving hard data problems faced by the world's largest companies
- P.S. we're hiring engineers in Austin + Bay Area
- Ways to enable collaboration on data tools across programming languages
- Domain-specific language design and compilation
- Improving the Python-on-Hadoop experience
- LLVM + Code generation
- Thesis : over time, user interfaces, data storage, and execution engines will decouple and specialize
- In fact, you should really want this to happen
: Share systems among languages
: Reduce fragmentation and "lock-in"
: Shift developer focus to usability
- Prediction : we'll be there by 2025; sooner if we all get our act together
[Figure: diagram of the scientific Python ecosystem; surviving labels: Python, Matplotlib, SciPy, Jupyter, PyMC, NIPY]
- Performance : Numba, Weave, Numexpr, Theano ...
- Visualization : Bokeh, Seaborn, Plotly, Chaco, mpld3, ggplot, MayaVi, vincent, toyplot, HoloViews ...
- Data Structures & Computation : Blaze, Dask, DistArray, xray, GraphLab, SciDB-py, PySpark ...
- Packaging & Distribution : pip/wheels, conda, EPD, Canopy, Anaconda ...
1. Foundation
- Python 3
2. Visualization
- Matplotlib 1.4, 2.0
- Seaborn = Matplotlib + Pandas + statistical visualization
- Bokeh = powerful interactive visualization, HTML5, JavaScript lib
3. Arrays & Data Structures
- Xray = NumPy + Pandas
- Dask = lightweight tool for general parallelized array storage and computation
4. Computation & Performance
- Numba = with a simple decorator, JIT-compiles Python to LLVM and executes at near C/Fortran speed
5. Distribution & Packaging
- Anaconda
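The "simple decorator" mentioned for Numba in item 4 can be sketched as follows. This is a minimal illustration, assuming Numba is installed; the fallback `njit` defined in the `except` branch is a hypothetical stand-in added only so the snippet still runs (uncompiled) where Numba is missing. The function `sum_of_squares` is an invented example, and no particular speedup is claimed here.

```python
try:
    # Numba's decorator for "nopython" JIT compilation; compiles on first call.
    from numba import njit
except ImportError:
    def njit(func):
        """Hypothetical fallback: return the function unchanged (no JIT)."""
        return func


@njit
def sum_of_squares(n):
    """Plain Python loop; with Numba present it is compiled via LLVM."""
    total = 0.0
    for i in range(n):
        total += i * i
    return total


result = sum_of_squares(1000)
print(result)
```

The point of the design is that the function body stays ordinary Python: the only change needed to request compilation is the decorator line.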
So much happening ...
- The IPython/Jupyter split
- Widgets = awesome
- Docker-based backends
- JupyterHub
- New $6M grant, first week of July 2015
- Python was created in the 1980s as a teaching language, and to bridge the gap between the shell and C.
- Guido van Rossum : "I thought we'd write small Python programs, maybe 10 lines, maybe 5, maybe 500 lines - that would be a big one"
- Python is not a scientific programming language
: Why did a "toy language" become the core of a scientific stack?
- Python is a glue
- Python glues together this hodge-podge of scientific tools.
- High-level syntax wraps low-level C/Fortran libraries, which is (mostly) where the computation happens.
- It is speed of development, not necessarily speed of execution, that has driven Python's popularity.
- "Why don't you use C instead of Python? It's so much faster!"
: "Why don't you commute by airplane instead of by car? It's so much faster!"
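The "glue" point above, high-level syntax wrapping a low-level compiled library, can be illustrated with the standard-library `ctypes` module. This is only a sketch under assumptions: it looks for the platform's C math library (the lookup name "m" is platform-dependent and may not resolve everywhere), and falls back to `math.sqrt` if the library cannot be loaded. The helper name `load_libm` is hypothetical.

```python
import ctypes
import ctypes.util
import math


def load_libm():
    """Try to load the C math library; return None if it is unavailable."""
    path = ctypes.util.find_library("m")  # may be None on some platforms
    if path is None:
        return None
    try:
        libm = ctypes.CDLL(path)
    except OSError:
        return None
    # Declare the C signature: double sqrt(double)
    libm.sqrt.argtypes = [ctypes.c_double]
    libm.sqrt.restype = ctypes.c_double
    return libm


libm = load_libm()
# One line of Python; the arithmetic itself runs in compiled C code.
root = libm.sqrt(2.0) if libm is not None else math.sqrt(2.0)
print(root)
```

Libraries such as NumPy and SciPy apply the same idea at much larger scale, typically through C extension modules rather than `ctypes`.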
[Figure: timeline of the scientific Python stack, 1995 to 2015]
1995 : "Numeric" was an early Python scientific array library, largely written by Jim Hugunin. Numeric -> NumPy
1998 : "Multipack", built on Numeric, was a set of wrappers of Fortran packages written by Travis Oliphant. Multipack -> SciPy
2002 : "Numarray" was created by Perry Greenfield, Paul Dubois, and others to address fundamental deficiencies in Numeric for larger datasets.
2006 : In a herculean effort to head off this split in the community, Travis Oliphant incorporated the best parts of Numeric and Numarray into "NumPy".
2000 : Eric Jones, Travis Oliphant, Pearu Peterson, and others spun Multipack into the "SciPy" package, aiming for a full Python MATLAB replacement.
2001 : Fernando Perez started the "IPython" project, aiming for a Mathematica-style environment for scientific Python.
2002 : John Hunter wanted an open MATLAB replacement, and started "matplotlib" as an effort at MATLAB-style visualization.
2012 : The IPython team released the "IPython Notebook", and the world has never been the same.
2009 : Wes McKinney began "Pandas", eventually drawing in a much larger Python user base, especially in industry data science.
2009 : With SciPy's sheer size making fast development difficult, the community decided to promote "scikits" as an avenue for more specialized algorithms.
2012 : Continuum released "conda", a package manager for scientific computing.
1. No centralized leadership! What is "core" in the ecosystem evolves, and is up to the community.
- Evolving computational core : Numba?
: Just as Cython matured to become a core piece, perhaps Numba might as well? How might a JIT-enabled SciPy, scikit-learn, pandas, etc. look?
2. To be most useful as an ecosystem, we must be willing for packages to adapt to the changing landscape.
- Evolving computational core : Pandas?
: Modern data is sparse, heterogeneous, and labeled, and NumPy arrays don't measure up : let's make Pandas a core dependency!
- Evolving computational core : Pandas, Seaborn --> matplotlib
: With Pandas as a core dependency, what elements of Seaborn & Pandas could be moved into matplotlib?
- Evolving the core : SciPy
: SciPy's monolithic design was driven by packaging & distribution difficulties.
[Figure: ecosystem diagram; surviving labels: PyMC, NIPY, Seaborn]
3. Interoperability with core pieces of other languages has been key to the success of the SciPy stack (e.g. C/Fortran libraries, new Jupyter framework).
- Universal Plotting Serialization?
: Much of modern interactive plotting (d3, HTML5, Bokeh, ggvis, mpld3, etc.) involves generating & processing plot serializations
: matplotlib -> {JSON} -> JavaScript -> plotting on the web
: Doing this natively in matplotlib would open up extensibility!
- Universal DataFrames?
: R, Python, Julia use C/Fortran memory blocks
: R, Python, Julia use R data.frame, Pandas, DataFrames.jl
: In the future, R, Python, Julia use ... a so-called "Uber DataFrame"?
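The plot-serialization pipeline in item 3 (plot -> JSON -> JavaScript renderer) can be sketched with only the standard library. The schema below, including the keys "axes", "lines", "xdata", and "ydata", is hypothetical; it is not the format used by matplotlib, mpld3, or Bokeh, and the function name is invented for illustration.

```python
import json


def serialize_line_plot(x, y, title="", xlabel="", ylabel=""):
    """Describe a single line plot as a JSON string (hypothetical schema)."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    spec = {
        "type": "figure",
        "axes": [
            {
                "title": title,
                "xlabel": xlabel,
                "ylabel": ylabel,
                "lines": [{"xdata": list(x), "ydata": list(y)}],
            }
        ],
    }
    return json.dumps(spec)


# The producer side (Python) emits a language-neutral description ...
payload = serialize_line_plot([0, 1, 2], [0, 1, 4], title="y = x^2")

# ... and any consumer, such as a JavaScript renderer, can parse it back.
restored = json.loads(payload)
print(restored["axes"][0]["lines"][0]["ydata"])
```

Because the payload carries no Python objects, the rendering side is free to be written in any language, which is the extensibility argument made on the slide.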
4. The stack was built from both continuity (e.g. Numeric/Numarray -> NumPy) and brand-new efforts (e.g. matplotlib, Pandas). Don't discount either approach!
- Considering the future of Matplotlib (usual complaints about Matplotlib)
: Non-optimal stylistic defaults -> matplotlib 2.0
: Non-optimal API -> Seaborn, ggplot
: Difficulty exporting interactive plots -> Serialization to mpld3/Bokeh
: Difficulty with large datasets -> ???
- Lesson from Numeric/Numarray, etc.
: Stick with matplotlib & modify it (e.g. serialization to VisPy? Numba-driven backend? new backend architecture? etc.)
- Lesson from Pandas & Matplotlib, etc.
: Start something from scratch; features will draw users! (e.g. VisPy, Bokeh, something new?)
2006 : In a herculean effort to head off this split in the community, Travis Oliphant incorporated the best parts of Numeric and Numarray into "NumPy".
2009 : Wes McKinney began "Pandas", eventually drawing in a much larger Python user base, especially in industry data science.
2002 : John Hunter wanted an open MATLAB replacement, and started "matplotlib" as an effort at MATLAB-style visualization.