analytics and visualization

Twitter Communications Network as captured and visualized by Netlytic, one of the analytic tools being developed at the Dalhousie Social Media Lab. CREDIT: Social Media Lab

A PYPTUG presentation

by Francois Dion  ( @f_dion )





page left intentionally blank

beyond the usual suspects


  • Matlab
  • SAS (and SPSS etc)
  • R

is python a contender?


First up, Matlab

Thanks to numpy, scipy, matplotlib(.pylab)
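
To make the Matlab comparison concrete, here is a minimal sketch of two everyday Matlab idioms written with numpy. The matrix and vector values are made up for illustration and are not from the talk.

```python
import numpy as np

# Solve the linear system A @ x = b, the numpy analogue of Matlab's A \ b.
# (A and b are hypothetical sample values.)
A = np.array([[3.0, 1.0],
              [1.0, 2.0]])
b = np.array([9.0, 8.0])
x = np.linalg.solve(A, b)

# Element-wise math on whole arrays, as with Matlab's .* and .^ operators.
t = np.linspace(0.0, 1.0, 5)   # like linspace(0, 1, 5)
y = t ** 2                     # like t .^ 2

print(x)
print(y)
```

From there, matplotlib's pyplot functions (plot, xlabel, ylabel) follow the Matlab plotting commands closely.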

ipython notebook

(demo matplotlib)

Learn:


see also pythonxy, canopy



Next, SAS


SAS and the like (SPSS, minitab to a much lesser extent) have been in use for years, but there are new kids on the block that are changing things.


So, let's first talk about  trends

trends in data science


  • There is still a strong growth of R users
  • There is a strong growth of Python users
  • Businesses want:
    •  increased productivity
    • decreased cost
  • R & Python both fit that bill
  • There is a shift away from closed source
    • cost reason
    • improved inter-connectivity
    • GPU & storage acceleration
    • the cloud ->


R >= SAS

Since 2010, R has displaced everybody else

top 8 data mining/analytics tools used in 2010 (2013)

Party like it's 2013


matplotlib, numpy, scipy

pandas: over 60,000 installs per month.

PANDAS

pandas is an open source, BSD-licensed library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language.

Python has long been great for data munging and preparation, but less so for data analysis and modeling. pandas helps fill this gap, enabling you to carry out your entire data analysis workflow in Python without having to switch to a more domain specific language like R.

Combined with the excellent IPython toolkit and other libraries, the environment for doing data analysis in Python excels in performance, productivity, and the ability to collaborate.
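
As a rough illustration of the "easy-to-use data structures" mentioned above, the sketch below builds a small DataFrame and summarizes it. The column names and numbers are hypothetical sample data, not figures from the talk.

```python
import pandas as pd

# Hypothetical sample data: sales per region and quarter.
df = pd.DataFrame({
    "region": ["east", "east", "west", "west"],
    "quarter": ["Q1", "Q2", "Q1", "Q2"],
    "sales": [100, 120, 90, 150],
})

# Split-apply-combine: total sales per region.
totals = df.groupby("region")["sales"].sum()

# Reshape the long table into a region x quarter grid.
table = df.pivot(index="region", columns="quarter", values="sales")

print(totals)
print(table)
```

The same groupby and pivot calls apply unchanged to data loaded from a CSV file or a database query.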

Pandas demo


let's check it out

& demo

Mix and match

pandas does not implement significant modelling functionality outside of linear and panel regression; for this, look to statsmodels and scikit-learn. More work is still needed to make Python a first class statistical modelling environment, but we are well on our way toward that goal.


In the meantime, if you want to use R from Python:

http://rpy.sourceforge.net/

Or if you are simply missing ggplot2, see:

http://blog.yhathq.com/posts/ggplot-for-python.html

more than pretty pictures

and cute pandas

_                                      =   (
                                        255,
                                      lambda
                               V       ,B,c
                             :c   and Y(V*V+B,B,  c
                               -1)if(abs(V)<6)else
               (              2+c-4*abs(V)**-0.4)/i
                 )  ;v,      x=1500,1000;C=range(v*x
                  );import  struct;P=struct.pack;M,\
            j  ='<QIIHHHH',open('M.bmp','wb').write
for X in j('BM'+P(M,v*x*3+26,26,12,v,x,1,24))or C:
            i  ,Y=_;j(P('BBB',*(lambda T:(T*80+T**9
                  *i-950*T  **99,T*70-880*T**18+701*
                 T  **9     ,T*i**(1-T**45*2)))(sum(
               [              Y(0,(A%3/3.+X%v+(X/v+
                               A/3/3.-x/2)/1j)*2.5
                             /x   -2.7,i)**2 for  \
                               A       in C
                                      [:9]])
                                        /9)
                                       )   )
The modules, starting with the classics

  • PIL - http://www.pythonware.com/products/pil - Python Imaging Library provides basic image handling and processing for various image types including jpg, gif, tiff, and bmp. Reads and writes graphics files. Allows pixel-by-pixel data access and has functions for cropping and transposing an image. Also has various filters built-in.
  • matplotlib - http://matplotlib.sourceforge.net/ - matplotlib is a python 2D plotting library which produces publication quality figures in a variety of hardcopy formats and interactive environments across platforms. matplotlib can be used in python scripts, the python and ipython shell (ala matlab or mathematica), web application servers, and six graphical user interface toolkits.

Speaking of Matplotlib

from matplotlib import pyplot as plt
import numpy as np

plt.xkcd()
fig = plt.figure()
ax = fig.add_subplot(1, 1, 1)
ax.spines['right'].set_color('none')
ax.spines['top'].set_color('none')
plt.xticks([])
plt.yticks([])
ax.set_ylim([-30, 10])
data = np.ones(100)
data[70:] -= np.arange(30)
plt.annotate(
    'THE DAY I REALIZED\nI COULD COOK BACON\nWHENEVER I WANTED',
    xy=(70, 1), arrowprops=dict(arrowstyle='->'), xytext=(15, -10))

plt.plot(data)

plt.xlabel('time')
plt.ylabel('my overall health')

# needed when running as a script rather than in a notebook
plt.show()


stove ownership (xkcd 418)


  • Glumpy - http://code.google.com/p/glumpy - a small python library that uses OpenGL for the rapid visualization of (mainly two dimensional) numpy arrays. Not so much for nice figures for inclusion in a scientific article, more for rapid visualization of your running simulation.

  • pyqtgraph - http://luke.campagnola.me/code/pyqtgraph/ - Pure-python graphics and GUI library for scientific/engineering applications based on PyQt and numpy. This library provides fast plotting and image/video display, multidimensional image slicing, volumetric / isosurface rendering, interactive data manipulation tools, and a variety of Qt widgets including an editable property tree, visual programming flowchart, and gradient editor.

  • VTK - http://vtk.org/ - is an open source, freely available software system for 3D computer graphics, image processing, and visualization used by thousands of researchers and developers around the world. It has a very good python interface.
  • WrapITK - http://code.google.com/p/wrapitk/ - an interface between ITK (http://itk.org) and several languages, with a particular focus on python. Using the ITK module from the python interpreter is particularly useful for quick and easy prototyping of image analysis procedures. Some glue classes make it possible to efficiently pass data to other modules such as NumPy or VTK.

WEB based

  • Plotly - https://plot.ly/ - is a collaborative graphing and analytics platform. The web app has an online Python sandbox - NumPy supported - and grid for data analysis. The Plotly graphing library produces graphs that are interactive, publication quality, and browser-based. Graphs can be styled with Python or a GUI, shared, embedded, and exported.

  • PyAlgoViz - http://pyalgoviz.appspot.com/ - Python Algorithm Visualizations done in Python running in the browser.  As you can see, you can interface Python with javascript visualization libraries. And speaking of javascript...


D3


D3.js turns data into graphs using HTML, CSS and SVG
d3py is for the python side of things

vega

Vega is a visualization grammar, a declarative format for creating, saving and sharing visualization designs.
With Vega you can describe data visualizations in a JSON format, and generate interactive views using either HTML5 Canvas or SVG.

There is even a web based editor.
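
To show what "describing a visualization in JSON" can look like, here is a sketch that builds a bar chart specification as a plain Python dict and serializes it with the standard library. The layout follows the current Vega documentation (data, scales, axes, marks), which may differ from the Vega version available at the time of this talk. The field names "category" and "amount" and the values are hypothetical.

```python
import json

# Hypothetical sample data for a small bar chart.
values = [
    {"category": "A", "amount": 28},
    {"category": "B", "amount": 55},
    {"category": "C", "amount": 43},
]

# A Vega-style specification: the chart is data, not drawing code.
spec = {
    "width": 400,
    "height": 200,
    "data": [{"name": "table", "values": values}],
    "scales": [
        {"name": "xscale", "type": "band", "range": "width",
         "domain": {"data": "table", "field": "category"}},
        {"name": "yscale", "type": "linear", "range": "height",
         "domain": {"data": "table", "field": "amount"}},
    ],
    "axes": [
        {"orient": "bottom", "scale": "xscale"},
        {"orient": "left", "scale": "yscale"},
    ],
    "marks": [{
        "type": "rect",
        "from": {"data": "table"},
        "encode": {"enter": {
            "x": {"scale": "xscale", "field": "category"},
            "width": {"scale": "xscale", "band": 1},
            "y": {"scale": "yscale", "field": "amount"},
            "y2": {"scale": "yscale", "value": 0},
        }},
    }],
}

# Serialize to JSON text that a Vega runtime or editor could load.
spec_json = json.dumps(spec, indent=2)
print(spec_json)
```

Because the specification is only a nested dict, it can be generated from any Python data source before being handed to the browser.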

VINCENT

A Python to Vega translator

The folks at Trifacta are making it easy to build visualizations on top of D3 with Vega. Vincent makes it easy to build Vega with Python.

Loop de loop

Perhaps most importantly, Vincent groks Pandas DataFrames and Series in an intuitive way.

bokeh

Bokeh is a Python interactive visualization library for large datasets that natively uses the latest web technologies. Its goal is to provide elegant, concise construction of novel graphics in the style of Protovis/D3, while delivering high-performance interactivity over large data to thin clients.



demo

Questions?


Your data is not big


Tera scale? Giga scale? Hahaha.


Exa scale? Peta scale? OK, it is, what next?


Roll your own?

Look at hardware acceleration


Cloud based?

your data is big


Time to look at Manta:

  • computes directly on object store
  • native R sdk
  • native Python (and mantash)


Or Hadoop:

  • Amazon EMR
  • Hadoop streaming
  • mrjob

(you'll be doing java too...)
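
For a feel of the Hadoop streaming model, here is a word count sketch in plain Python that runs locally. The function names and sample lines are hypothetical; in a real streaming job the mapper and reducer would be separate scripts reading stdin and writing tab-separated lines to stdout, and Hadoop would perform the sort between them.

```python
from itertools import groupby


def mapper(lines):
    """Emit a (word, 1) pair for every word in the input lines."""
    for line in lines:
        for word in line.split():
            yield word.lower(), 1


def reducer(pairs):
    """Sum the counts per word; pairs must arrive sorted by word."""
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        yield word, sum(count for _, count in group)


# Hypothetical input; sorted() stands in for Hadoop's shuffle/sort phase.
sample = ["big data is big", "your data is not big"]
counts = dict(reducer(sorted(mapper(sample))))

print(counts)
```

Libraries such as mrjob wrap this same map, sort, reduce pattern in a class so that the job can run locally or on a cluster.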
