The 14th Annual Scientific Computing with Python Conference (SciPy) is scheduled for July 6-12, 2015 in Austin, Texas.

#SciPy2015

http://www.oompu.com

  • The New York Times

  • Open Source Developer

  • Python Scientific Ecosystem

  • Q&A


Who am I?


Park Surk

NEXCORE Platform Expert @SK 

Python Evangelist

Travel, Swimming, Snowboarding

What is SciPy?

SciPy

SciPy (pronounced "Sigh Pie") is open-source software for mathematics, science, and engineering. It is also the name of a very popular conference on scientific programming with Python.

SciPy

"These are some of the core packages."

NumPy : Base N-Dimensional Array Package

SciPy library : Fundamental Library for scientific computing

Matplotlib : Comprehensive 2D Plotting

IPython : Enhanced Interactive Console

SymPy : Symbolic Mathematics

pandas : Data structures & analysis

Contributors


Data Science @ NYT  


Newspapering


1851

1996

2008

Now


Every publisher is now a startup

Technology-Enabled Journalism

How a 164-year-old content company became data-driven

Learnings

Supervised learning
- Which subscribers are going to cancel their subscription?
- How many copies of the newspaper will be needed tomorrow?

Unsupervised learning
- What is the hot issue related to President Obama today?
- Who does what on The New York Times web site?

Reinforcement learning
- A/B test
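One way to read "A/B test as reinforcement learning" is as a bandit problem: keep showing the variant that looks best so far, but occasionally try the others. The following is a minimal sketch under that assumption, using an epsilon-greedy rule. The variant names and click-through rates are hypothetical, and nothing here is taken from the NYT's actual systems.

```python
import random


def epsilon_greedy_ab_test(true_rates, rounds=5000, epsilon=0.1, seed=0):
    """Simulate an A/B test run as an epsilon-greedy bandit.

    true_rates maps a variant name to its (hypothetical) click-through rate.
    Returns (shows, clicks) dictionaries keyed by variant name.
    """
    rng = random.Random(seed)
    variants = list(true_rates)
    shows = {v: 0 for v in variants}
    clicks = {v: 0 for v in variants}

    def observed_rate(v):
        # Unseen variants get +inf so each one is tried at least once.
        return clicks[v] / shows[v] if shows[v] else float("inf")

    for _ in range(rounds):
        if rng.random() < epsilon:
            choice = rng.choice(variants)              # explore
        else:
            choice = max(variants, key=observed_rate)  # exploit
        shows[choice] += 1
        if rng.random() < true_rates[choice]:
            clicks[choice] += 1
    return shows, clicks


# Hypothetical headlines with made-up click-through rates.
shows, clicks = epsilon_greedy_ab_test({"headline_A": 0.05, "headline_B": 0.50})
best = max(shows, key=shows.get)
print(best, shows, clicks)
```

Unlike a fixed 50/50 split, this rule shifts traffic toward the better-performing variant while the test is still running. The price is a less balanced sample for the weaker variant.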

Actual Integration with

3 learnings


Supervised learning :

Unsupervised Learning :

Reinforcement Learning :

Explore

Learning

Test

Optimizing

Reporting

Common Requirements

in Data Science


1. People

2. Ideas  

3. Things 


1. People

New MindSet > New ToolSet

2. Ideas : Data Skills

#Data Engineering    #Data Science    #Data Visualization

#Data Product    #Data Multiliteracies    #Data Embeds

3. Things : What does the DS Team deliver?

#Build data prototypes    #Build APIs

#Collaboration with people & teams    #Impact Roadmaps

http://daeilkim.com/refinery.html

Wes McKinney's Stories

as a Python programmer

- Python Projects
  : SM(StatsModels : Statistics in Python)
  : Pandas

- Companies
  : AQR
  : AppNexus
  : Cloudera

- Books
  : 'Python for Data Analysis'


<pre-2007>

- Mathematician.
- No exposure to Python, SQL, R (or any analytics for that matter)
- Rude awakening

<first job : AQR (Quant Hedge Fund)>

- A quant finance operation that lived and breathed SQL and Excel
- Production systems in C++, Java, Visual Basic, and C# .NET
- Some PhD-level researchers used MATLAB for research (as was common in finance/economics departments)

<2008 : Productivity frustrations>

- First year: several analytics and statistical data analysis projects
  : A huge amount of SQL
  : Some data
  : A little bit of R
  : and TONS of Excel
- Projects felt like 5% conceptualization, 95% tedium

<Python in early 2008: different times>

- A bleeding edge stack
  : NumPy 1.0.4
  : SciPy 0.6.0
  : matplotlib 0.91.2
  : IPython 0.8.4, SVN history begins 2/2008
  : Cython 0.9.8
- The scientific Python community seemed mainly focused on attracting MATLAB, HPC, and scientific lab users

<2008 : Things SciPythonists didn't care too much about>

- Relational data or SQL
- Missing data handling
- Statistics and econometrics (first StatsModels release : 2011)
- Statistical graphics
- Machine Learning (scikit-learn 0.1 : 2/2010)
- Analytics and business intelligence

<Taking a gamble>

- Decided to give Python a shot for AQR projects after seeing part of the MASS R package ported in scipy.stats.models by Jonathan Taylor at Stanford
- proto-pandas first version built in April 2008
  : focused on porting an R project to Python
- May '08 : Embedded Python interpreter in a legacy C++ system
- 5/2008 ~ 12/2008 : Skunkworks Python ports and evangelism across company

<Why did Python work out?>

- Batteries included
  : other SciPy packages
- Interoperability with C++
  : Embedding Python interpreter
  : Wrapping C++ in Python C extensions
- Productive user interface
  : Python language
  : IPython + matplotlib

<End 2009 : Pandas!>

- AQR lets me open source pandas 0.1 on Christmas, 2009

<2010~2011 : Python's data growing pains>

- pandas did not evolve much after its initial release
- No consensus or momentum behind any project for analytics / data wrangling
- AQR --> Duke Statistical Science
- AQR sponsors bug fixes and new features in pandas

<May 2011 : Getting inspired>

- 2011-05-13 : Enthought Datarray Summit
  : Discussed how to enable Python to become more useful for statistical computing
  : Me : "Library fragmentation is destructive; integration is better"
  : Data structures, missing data, and data wrangling tools
- 2011-05-13 ~ 2011-06-03 : Python finance consulting engagement
  : Realized that Python data tools were sorely needed in industry
  : But not nearly mature enough yet

<Making pandas a better tool>

- Consulting at AppNexus(NYC ad tech company) opened eyes to new problems
- June 2011 ~ December 2012
  : Fix some pandas design issues
  : Build out data wrangling capabilities(hierarchical indexes, etc...)
  : Create "killer apps"(time series capabilities)
  : Evangelize and collaborate with other projects


<Making a book happen>

- Python for Data Analysis
- A chicken-and-egg problem
- Fernando Perez, Brian Granger, and John Hunter had been toying with the idea of a "SciPy Book" for a couple years
- Decided to forge my own path in Nov 2011
  : Writing took about 9 months
  : Helped motivate me to "finish" parts of pandas
- 50,000 copies in circulation


<Clarity and software engineering>

- Progress in software is not just about hard work
- Solving the right problems
  : ...in the right order
  : ...while wasting little time/energy on non-impactful issues
  : ...while being faced with real world concerns (80/20 rule)
- Taking the time to develop a clear vision and scope for a project is a major factor in its success or failure

<It took a village>

- Fernando Perez & Brian Granger (IPython)
- Skipper Seabold & Josef Perktold (StatsModels)
- Eric Jones (Enthought)
- Travis Oliphant & Peter Wang (Enthought & Continuum)
- John Hunter (matplotlib)
- ... and many others

<serendipity>

An unlikely train ride
SEA ->PDX
November 18, 2011
My seatmate was computational bio professor and 5-year PSF member Titus Brown
And he would later assist the IPython team in their Sloan Foundation $1MM grant in 2012

<Some words about John Hunter(1968~2012)>


<Business ventures 2012~2014>

- 2012 : Lambda Foundry
  : Support and develop pandas
  : Explored creating a commercial Python financial Toolkit
- 2013~2014 : DataPad
  : "Google Drive" for Analytics / BI
  : With Chang She (MIT --> AQR --> pandas)
  : Silicon Valley VC-backed
  : Acquired by Cloudera in September 2014


<Cloudera>

- Sort of "the Red Hat of Big Data"
- The leading open source Hadoop Platform
- Supporting and developing a little over 20 Apache-licensed open source projects

- A dream job
  : Full time open source development
  : Solving hard data problems faced by the world's largest companies

- P.S. we're hiring engineers in Austin + Bay Area


<What I'm interested in right now>

- Ways to enable collaboration on data tools across programming languages
- Domain-specific language design and compilation
- Improving the Python-on-Hadoop experience
- LLVM + code generation

<The Great Data Tool Decoupling>

- Thesis : over time, user interfaces, data storage, and execution engines will decouple and specialize
- In fact, you should really want this to happen  
  : Share systems among languages
  : Reduce fragmentation and "lock-in"
  : Shift developer focus to usability
- Prediction : we'll be there by 2025; sooner if we all get our act together
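The decoupling thesis can be illustrated with a small sketch in which the user-facing function only depends on two narrow interfaces, so a storage layer or an execution engine can be swapped without touching it. All class and function names below are hypothetical and exist only for illustration. They are not part of pandas or any other library mentioned in the talk.

```python
from typing import Iterable, List, Protocol


class Storage(Protocol):
    """Anything that can hand back a column of numbers."""

    def read_column(self, name: str) -> List[float]: ...


class Engine(Protocol):
    """Anything that can compute over a column."""

    def mean(self, values: Iterable[float]) -> float: ...


class InMemoryStorage:
    """Hypothetical storage layer: columns held in a dict."""

    def __init__(self, columns):
        self._columns = {k: list(v) for k, v in columns.items()}

    def read_column(self, name):
        return self._columns[name]


class PurePythonEngine:
    """Hypothetical execution engine: a plain Python computation."""

    def mean(self, values):
        values = list(values)
        return sum(values) / len(values)


def describe(storage: Storage, engine: Engine, column: str) -> str:
    """The 'user interface': depends only on the two protocols."""
    return f"{column}: mean={engine.mean(storage.read_column(column)):.2f}"


store = InMemoryStorage({"price": [10.0, 12.0, 14.0]})
summary = describe(store, PurePythonEngine(), "price")
print(summary)
```

Because `describe` never names a concrete class, a different engine (for example one backed by compiled code) could be passed in as long as it offers the same `mean` method.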

[Figure: the Visualization / Computation / Shell layers, compared for <Early on> and <Today>. Labels in the <Today> panel: Python, Matplotlib, Integration, Scipy, Jupyter]

Python

Scientific Ecosystem

[Figure: map of the Python scientific ecosystem. Surviving labels: PyMC, NIPY]

<Many more tools>

- Performance : Numba, Weave, Numexpr, Theano ...
- Visualization : Bokeh, Seaborn, Plotly, Chaco, mpld3, ggplot, MayaVi, vincent, toyplot, HoloViews ...
- Data Structures & Computation : Blaze, Dask, DistArray, xray, GraphLab, SciDB-Py, PySpark ...
- Packaging & Distribution : pip/wheels, conda, EPD, Canopy, Anaconda ...

<Recent Development>

1. Foundation
- Python 3
2. Visualization
- Matplotlib 1.4, 2.0
- Seaborn = Matplotlib + Pandas + statistical visualization
- Bokeh = powerful interactive visualization, HTML5, JavaScript lib
3. Arrays & Data Structures
- Xray = NumPy + Pandas
- Dask = lightweight tool for general parallelized array storage and computation
4. Computation & Performance
- Numba = with a simple decorator, Python code is JIT-compiled via LLVM and executes at near C/Fortran speed
5. Distribution & Packaging
- Anaconda
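The idea behind chunked, parallelized array computation mentioned in the list above can be sketched with the standard library alone: split the data into blocks, reduce each block independently, then combine the partial results. This is not Dask's API. The function names and chunk size are assumptions chosen for illustration, and threads in CPython mainly show the structure here rather than giving a real speed-up for pure Python arithmetic.

```python
from concurrent.futures import ThreadPoolExecutor


def chunked(values, chunk_size):
    """Yield consecutive slices of at most chunk_size items."""
    for start in range(0, len(values), chunk_size):
        yield values[start:start + chunk_size]


def parallel_sum_of_squares(values, chunk_size=1000, workers=4):
    """Sum x*x over values by reducing each chunk independently."""

    def partial(chunk):
        return sum(x * x for x in chunk)

    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(partial, chunked(values, chunk_size)))
    return sum(partials)  # combine the per-chunk results


data = list(range(10_000))
total = parallel_sum_of_squares(data)
print(total)
```

The same split / reduce / combine shape is what lets a chunk-based tool work on data larger than memory, since only one block per worker needs to be loaded at a time.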

 

<IPython & Jupyter>

So much happening ...
- The IPython/Jupyter split
- Widgets = awesome
- Docker-based backends
- JupyterHub
- New $6M grant (first week of July 2015)


<Why Python?>

- Python was created in the 1980s as a teaching language, and to bridge the gap between the shell and C.
- Guido van Rossum : "I thought we'd write small Python programs, maybe 10 lines, maybe 50, maybe 500 lines - that would be a big one"
- Python is not a scientific programming language
  : Why did a "toy language" become the core of a scientific stack?
- Python is a glue : it glues together this hodge-podge of scientific tools.
- High-level syntax wraps low-level C/Fortran libraries, which is (mostly) where the computation happens.
- It is speed of development, not necessarily speed of execution, that has driven Python's popularity.
- "Why don't you use C instead of Python? It's so much faster!"
  : "Why don't you commute by airplane instead of by car? It's so much faster!"
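The point about high-level syntax wrapping low-level code can be seen even without NumPy. In CPython the built-in `sum()` runs its loop in C, while a hand-written `for` loop runs in interpreted bytecode. The sketch below compares the two. The list size and repeat count are arbitrary choices, and the timings will vary by machine.

```python
import timeit


def python_loop_sum(values):
    """Add numbers one at a time in interpreted Python bytecode."""
    total = 0
    for v in values:
        total += v
    return total


values = list(range(100_000))

# Same answer either way; built-in sum() runs its loop inside CPython's C code.
loop_result = python_loop_sum(values)
builtin_result = sum(values)

loop_seconds = timeit.timeit(lambda: python_loop_sum(values), number=20)
builtin_seconds = timeit.timeit(lambda: sum(values), number=20)
print(f"Python loop: {loop_seconds:.4f}s, built-in sum: {builtin_seconds:.4f}s")
```

The one-line call is both shorter to write and typically faster to run, which is the same trade the scientific stack makes at larger scale with C and Fortran libraries.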

<But this efficiency depends on the Scientific Stack...>

[Figure: timeline 1995-2015 showing Numeric, Multipack, Numarray, NumPy, SciPy, IPython, Matplotlib, Notebook, Pandas, Scikits, Conda]

1995 : "Numeric" was an early Python scientific array library, largely written by Jim Hugunin. Numeric -> NumPy

1998 : "Multipack", built on Numeric, was a set of wrappers of Fortran packages written by Travis Oliphant. Multipack -> SciPy

2002 : "Numarray" was created by Perry Greenfield, Paul Dubois, and others to address fundamental deficiencies in Numeric for larger datasets.

2006 : In a herculean effort to head off this split in the community, Travis Oliphant incorporated the best parts of Numeric + Numarray into "NumPy".

2000 : Eric Jones, Travis Oliphant, Pearu Peterson, and others spun Multipack into the "SciPy" package, aiming for a full Python MATLAB replacement.

2001 : Fernando Perez started the "IPython" project, aiming for a Mathematica-style environment for scientific Python.

2002 : John Hunter wanted an open MATLAB replacement, and started "matplotlib" as an effort at MATLAB-style visualization.

2012 : The IPython team released the "IPython Notebook" and the world has never been the same.

2009 : Wes McKinney began "Pandas", eventually drawing in a much larger Python user base, especially in industry data science.

2009 : With SciPy's sheer size making fast development difficult, the community decided to promote "scikits" as an avenue for more specialized algorithms.

2012 : Continuum releases "conda", a package manager for scientific computing.

<Lessons Learned>

1. No centralized leadership! What is "core" in the ecosystem evolves and is up to the community.
- Evolving computational core : Numba?
  : Just as Cython matured to become a core piece, perhaps Numba might as well? How might a JIT-enabled SciPy, scikit-learn, pandas, etc. look?

<Lessons Learned>

2. To be most useful as an ecosystem, we must be willing for packages to adapt to the changing landscape.
- Evolving computational core : Pandas?
  : Modern data is sparse, heterogeneous, and labeled, and NumPy arrays don't measure up : let's make Pandas a core dependency!
- Evolving computational core : Pandas, Seaborn --> matplotlib
  : With Pandas as a core dependency, what elements of Seaborn & Pandas could be moved into matplotlib?
- Evolving the core : SciPy
  : SciPy's monolithic design was driven by packaging & distribution difficulties.

<Lessons Learned>

3. Interoperability with core pieces of other languages has been key to the success of the SciPy stack (e.g. C/Fortran libraries, new Jupyter framework).
- Universal Plotting Serialization?
  : Much of modern interactive plotting (d3, HTML5, Bokeh, ggvis, mpld3, etc.) involves generating & processing plot serializations
  : matplotlib -> {JSON} -> JavaScript -> plotting on the web
  : Doing this natively in matplotlib would open up extensibility!
- Universal DataFrames?
  : R, Python, Julia use C/Fortran memory blocks
  : R, Python, Julia use RDataFrame, Pandas, DataFrames.jl
  : In the future, R, Python, Julia use ... a so-called "Uber DataFrame"?
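The plot-serialization pipeline in the list above amounts to describing a plot as plain data and sending it as JSON to whatever renders it. A minimal sketch of that idea follows. The schema (`type`, `title`, `data`) and the function name are hypothetical. This is not matplotlib's, mpld3's, or Bokeh's actual format.

```python
import json


def line_plot_spec(x, y, title=""):
    """Describe a line plot as plain data (hypothetical schema)."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    return {
        "type": "line",
        "title": title,
        "data": [{"x": xi, "y": yi} for xi, yi in zip(x, y)],
    }


spec = line_plot_spec([0, 1, 2], [0.0, 1.0, 4.0], title="y = x^2")
payload = json.dumps(spec)      # what the Python side would send
restored = json.loads(payload)  # what a JavaScript renderer would parse
print(payload)
```

Because the payload is only data, any front end that understands the schema could draw it, which is the extensibility argument made on the slide.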


<Lessons Learned>

4. The stack was built from both continuity (e.g. Numeric/Numarray -> NumPy) and brand-new efforts (e.g. matplotlib, Pandas). Don't discount either approach!
- Considering the future of matplotlib (usual complaints about matplotlib)
  : Non-optimal stylistic defaults -> matplotlib 2.0
  : Non-optimal API -> Seaborn, ggplot
  : Difficulty exporting interactive plots -> Serialization to mpld3/Bokeh
  : Difficulty with large datasets -> ???
- Lesson from Numeric/Numarray, etc.
  : Stick with matplotlib & modify it (e.g. serialization to VisPy? Numba-driven backend? new backend architecture? etc.)
- Lesson from Pandas & matplotlib, etc.
  : Start something from scratch; features will draw users! (e.g. VisPy, Bokeh, something new?)

Don't discount either approach!

Continuity

2006 : In a herculean effort to head off this split in the community, Travis Oliphant incorporated the best parts of Numeric + Numarray into "NumPy".

Brand-new Effort

2002 : John Hunter wanted an open MATLAB replacement, and started "matplotlib" as an effort at MATLAB-style visualization.

2009 : Wes McKinney began "Pandas", eventually drawing in a much larger Python user base, especially in industry data science.

What's It All Mean?

Build a Data Science Team

Glue all decoupled systems

Make Data-Driven Decisions

Impact Roadmaps

Learn Learning with Python

parksurk@gmail.com

Park Surk

Private MOOCs

http://www.oompu.com

http://www.facebook.com/parksurk
