analyticS and visualization
A PYPTUG presentation
by Francois Dion ( @f_dion )
page left intentionally blank
beyond the usual suspects
- SAS (and SPSS etc)
is python a contender?
First up, Matlab
Thanks to numpy, scipy, matplotlib(.pylab)
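As a rough illustration of the Matlab-style workflow these libraries allow, here is a minimal sketch that solves a small linear system with numpy. The matrix and vector values are made up for the example.

```python
import numpy as np

# Hypothetical 2x2 system A @ x = b (values chosen only for illustration)
A = np.array([[3.0, 1.0],
              [1.0, 2.0]])
b = np.array([9.0, 8.0])

# Roughly what Matlab's  x = A \ b  does for a square, non-singular A
x = np.linalg.solve(A, b)
print(x)
```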
SAS and the like (SPSS, minitab to a much lesser extent) have been in use for years, but there are new kids on the block that are changing things.
So, let's first talk about trends
trends in data science
- There is still a strong growth of R users
- There is a strong growth of Python users
- Businesses want:
- increased productivity
- decreased cost
- R & Python both fit that bill
- There is a shift away from closed source
- cost reason
- improved inter-connectivity
- GPU & storage acceleration
- the cloud ->
R >= SAS
Since 2010, R has displaced everybody else
top 8 data mining/analytics tools used in 2010 (2013)
Party like it's 2013
matplotlib, numpy, scipy
pandas: over 60,000 installs per month.
pandas is an open source, BSD-licensed library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language.
Python has long been great for data munging and preparation, but less so for data analysis and modeling. pandas helps fill this gap, enabling you to carry out your entire data analysis workflow in Python without having to switch to a more domain specific language like R.
Combined with the excellent IPython toolkit and other libraries, the environment for doing data analysis in Python excels in performance, productivity, and the ability to collaborate.
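To give a feel for the kind of workflow described above, here is a minimal sketch of a group-and-aggregate step in pandas. The column names and numbers are invented for the example and are not survey data.

```python
import pandas as pd

# Made-up sample data, purely for illustration
df = pd.DataFrame({
    "tool": ["R", "Python", "R", "Python", "SAS"],
    "users": [120, 95, 130, 110, 60],
})

# Group, aggregate and sort in one chained expression
totals = df.groupby("tool")["users"].sum().sort_values(ascending=False)
print(totals)
```

The same chain works unchanged on a DataFrame loaded from a file, for example with `pd.read_csv`.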
Mix and match
pandas does not implement significant modelling functionality outside of linear and panel regression; for this, look to statsmodels and scikit-learn. More work is still needed to make Python a first class statistical modelling environment, but we are well on our way toward that goal.
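A minimal sketch of this mix-and-match approach, assuming scikit-learn is installed: the data lives in a pandas DataFrame and the model comes from scikit-learn. The data is synthetic and follows y = 2x + 1 exactly, so the fitted coefficients are easy to check.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Synthetic data on the line y = 2x + 1 (an assumption made for the demo)
df = pd.DataFrame({"x": np.arange(10.0)})
df["y"] = 2.0 * df["x"] + 1.0

# scikit-learn accepts DataFrame columns directly as features and target
model = LinearRegression().fit(df[["x"]], df["y"])
slope, intercept = model.coef_[0], model.intercept_
print(slope, intercept)
```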
In the meantime, if you want to use R from Python:
Or if you are simply missing ggplot2, see:
more than pretty pictures
and cute pandas
# Python 2 only: renders the Mandelbrot set to M.bmp (obfuscated on purpose)
_ = (255,
     lambda V, B, c: c and Y(V*V+B, B, c-1) if (abs(V) < 6) else (2+c-4*abs(V)**-0.4)/i)
v, x = 1500, 1000
C = range(v*x)
import struct
P = struct.pack
M, j = '<QIIHHHH', open('M.bmp', 'wb').write
for X in j('BM'+P(M, v*x*3+26, 26, 12, v, x, 1, 24)) or C:
    i, Y = _
    j(P('BBB', *(lambda T: (T*80+T**9*i-950*T**99, T*70-880*T**18+701*T**9, T*i**(1-T**45*2)))(
        sum([Y(0, (A%3/3.+X%v+(X/v+A/3/3.-x/2)/1j)*2.5/x-2.7, i)**2 for A in C[:9]])/9)))
- PIL - http://www.pythonware.com/products/pil - Python Imaging Library provides basic image handling and processing for various image types including jpg, gif, tiff, and bmp. Reads and writes graphics files. Allows pixel-by-pixel data access and has functions for cropping and transposing an image. Also has various filters built-in.
- matplotlib - http://matplotlib.sourceforge.net/ - matplotlib is a python 2D plotting library which produces publication quality figures in a variety of hardcopy formats and interactive environments across platforms. matplotlib can be used in python scripts, the python and ipython shell (ala matlab or mathematica), web application servers, and six graphical user interface toolkits.
Speaking of Matplotlib
from matplotlib import pyplot as plt
import numpy as np

# Switch matplotlib to the hand-drawn xkcd style
plt.xkcd()

fig = plt.figure()
ax = fig.add_subplot(1, 1, 1)
# Hide the right and top borders, and remove all tick marks
ax.spines['right'].set_color('none')
ax.spines['top'].set_color('none')
plt.xticks([])
plt.yticks([])
ax.set_ylim([-30, 10])

# Flat line that starts dropping at x = 70
data = np.ones(100)
data[70:] -= np.arange(30)

plt.annotate(
    'THE DAY I REALIZED\nI COULD COOK BACON\nWHENEVER I WANTED',
    xy=(70, 1), arrowprops=dict(arrowstyle='->'), xytext=(15, -10))

plt.plot(data)

plt.xlabel('time')
plt.ylabel('my overall health')
stove ownership (xkcd 418)
- Glumpy - http://code.google.com/p/glumpy - a small python library that uses OpenGL for the rapid visualization of (mainly two dimensional) numpy arrays. Not so much for nice figures for inclusion in a scientific article, more for rapid visualization of your running simulation.
- pyqtgraph - http://luke.campagnola.me/code/pyqtgraph/ - Pure-python graphics and GUI library for scientific/engineering applications based on PyQt and numpy. This library provides fast plotting and image/video display, multidimensional image slicing, volumetric / isosurface rendering, interactive data manipulation tools, and a variety of Qt widgets including an editable property tree, visual programming flowchart, and gradient editor.
- VTK - http://vtk.org/ - is an open source, freely available software system for 3D computer graphics, image processing, and visualization used by thousands of researchers and developers around the world. It has a very good python interface.
- WrapITK - http://code.google.com/p/wrapitk/ - an interface between ITK (http://itk.org) and several languages, with a particular focus on python. The ITK module used with the python interpreter is particularly useful for quick and easy prototyping of image analysis procedures. Some glue classes make it possible to pass data efficiently to other modules like NumPy or VTK.
- Plotly - https://plot.ly/ - is a collaborative graphing and analytics platform. The web app has an online Python sandbox - NumPy supported - and grid for data analysis. The Plotly graphing library produces graphs that are interactive, publication quality, and browser-based. Graphs can be styled with Python or a GUI, shared, embedded, and exported.
D3.js data to graphs using HTML, CSS and SVG
d3py is for the python side of things
There is even a web based editor.
A Python to Vega translator
The folks at Trifacta are making it easy to build visualizations on top of D3 with Vega. Vincent makes it easy to build Vega with Python.
Loop de loop
Perhaps most importantly, Vincent groks Pandas DataFrames and Series in an intuitive way.
Bokeh is a Python interactive visualization library for large datasets that natively uses the latest web technologies. Its goal is to provide elegant, concise construction of novel graphics in the style of Protovis/D3, while delivering high-performance interactivity over large data to thin clients.
Your data is not big
Tera scale? Giga scale? Hahaha.
Exa scale? Peta scale? OK, it is, what next?
Roll your own?
Look at hardware acceleration
your data is big
Time to look at Manta:
- Amazon EMR
- Hadoop streaming
(you'll be doing java too...)
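For the Hadoop streaming option listed above, a mapper and a reducer can be plain Python scripts that read lines and emit tab-separated key/value records. The following is a minimal word-count sketch, not a production job: the function names `mapper` and `reducer` and the sample text are hypothetical, and the shuffle/sort phase is simulated locally with `sorted`.

```python
from itertools import groupby


def mapper(lines):
    """Emit one 'word<TAB>1' record per word, as a streaming mapper would."""
    for line in lines:
        for word in line.split():
            yield "%s\t1" % word.lower()


def reducer(sorted_records):
    """Sum the counts per key; records must arrive sorted by key."""
    pairs = (rec.rstrip("\n").split("\t", 1) for rec in sorted_records)
    for key, group in groupby(pairs, key=lambda kv: kv[0]):
        yield "%s\t%d" % (key, sum(int(value) for _, value in group))


# Local simulation: sorted() stands in for Hadoop's shuffle/sort phase
sample = ["big data is big", "data is data"]
counts = list(reducer(sorted(mapper(sample))))
print(counts)
```

In a real streaming job, each function would live in its own script, iterate over `sys.stdin` and print each record to standard output.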