PYTHON FOR DATA SCIENCE

Darius Aliulis
Data Scientist & Engineer @ DATA DOG / SATALIA

A little about me

  • Data scientist & software engineer @ DATA DOG / SATALIA
     
  • Assistant lecturer @ KTU
    Business Big Data Analytics Master's study programme
     
  • Formal background in Applied Mathematics (MSc), studied
    @ Kaunas University of Technology (KTU)
    @ Technical University of Denmark (DTU)
     
  • Interested in:
    + Solving hard problems
    + Operational machine learning
    + Streaming analytics
    + Functional & reactive programming

What about you?

  • Who is looking into machine learning or data analytics?
  • Who is looking into [big] data engineering?
  • Who are R users?
  • Who are Matlab users?
  • Who are Python users?
  • Who are Java users?
  • Who are Scala users?
  • Clojure?
  • Haskell?

Goals

  • Provide [an opinionated] guide to data science in Python
     
  • Give advice on Python libraries for particular use cases
     
  • Enable a quicker start with / switch to scientific Python
     

Talk outline

  • Motivation to use Python for Data Science
     
  • Command line interfaces
     
  • Machine Learning
     
  • Parallel & Distributed Computing
     
  • Visualization

Python

Python is a programming language that lets you work quickly
and integrate systems more effectively.

 

www.python.org

Python

  • strongly typed
  • dynamic
  • imperative
  • object oriented
  • multi-paradigm - to an extent

    programming language
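
The first two bullets are easy to confuse, so here is a minimal sketch (not part of the original slides) of what "dynamic" and "strongly typed" mean in practice. The variable and function names are made up for illustration.

```python
# Dynamic typing: a name can be rebound to values of different types.
x = 42
x = "forty-two"  # no declaration needed; the name now refers to a str


# Strong typing: values are not silently coerced between unrelated types.
def add_number_to_text():
    return "1" + 1  # str + int is not defined, so this raises TypeError


try:
    add_number_to_text()
    outcome = "no error"
except TypeError as exc:
    outcome = f"TypeError: {exc}"

print(type(x).__name__)
print(outcome)
```

A weakly typed language would typically coerce one operand and return a result; Python refuses and asks for an explicit conversion such as `int("1") + 1`.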

Python

In [1]: import this
The Zen of Python, by Tim Peters

Beautiful is better than ugly.
Explicit is better than implicit.
Simple is better than complex.
Complex is better than complicated.
Flat is better than nested.
Sparse is better than dense.
Readability counts.
Special cases aren't special enough to break the rules.
Although practicality beats purity.
Errors should never pass silently.
Unless explicitly silenced.
In the face of ambiguity, refuse the temptation to guess.
There should be one-- and preferably only one --obvious way to do it.
Although that way may not be obvious at first unless you're Dutch.
Now is better than never.
Although never is often better than *right* now.
If the implementation is hard to explain, it's a bad idea.
If the implementation is easy to explain, it may be a good idea.
Namespaces are one honking great idea -- let's do more of those!

Functional Python ?

shamelessly borrowed from talk by Alexey Kachayev
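
The borrowed slides themselves are not preserved in this text. As a generic illustration of the question (a sketch that is not taken from the referenced talk), the same computation can be written imperatively, functionally, and in the comprehension style Python tends to favour. All names below are hypothetical.

```python
from functools import reduce
from operator import add

numbers = [1, 2, 3, 4, 5, 6]

# Imperative style: mutate an accumulator inside a loop.
total_imperative = 0
for n in numbers:
    if n % 2 == 0:
        total_imperative += n * n

# Functional style: compose filter, map and reduce without mutation.
evens = filter(lambda n: n % 2 == 0, numbers)
squares = map(lambda n: n * n, evens)
total_functional = reduce(add, squares, 0)

# Idiomatic Python usually prefers a generator expression.
total_pythonic = sum(n * n for n in numbers if n % 2 == 0)

print(total_imperative, total_functional, total_pythonic)
```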

Why Python ?

Why use a general-purpose language for scientific computing and/or data science when there are specialized languages like R and Matlab?

Why Python ?

I’d rather do math in a general-purpose language than try to do general-purpose programming in a math language.

 

John D. Cook

Why Python ?

Linear algebra operations [finally] feel similar to other languages thanks to PEP 465:

 

A dedicated infix operator for matrix multiplication

Why Python ?

Since Python 3.5, the linear algebra operators map one-to-one onto those in other languages:

# matrix product, element-wise multiplication
# between vectors or matrices A, B

# Python (NumPy arrays, Python >= 3.5)
A @ B
A * B

# R
A %*% B
A * B

# Matlab, Julia
A * B 
A .* B
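
To make the Python column concrete, here is a minimal sketch using NumPy arrays. The matrices `A` and `B` are arbitrary values chosen for illustration.

```python
import numpy as np

A = np.array([[1, 2],
              [3, 4]])
B = np.array([[0, 1],
              [1, 0]])

# Matrix product via the PEP 465 operator (same result as np.matmul(A, B)).
matrix_product = A @ B

# Element-wise (Hadamard) product.
elementwise = A * B

print(matrix_product)
print(elementwise)
```

Note that `@` is only syntax: the behaviour comes from the operand types, so it works on NumPy arrays but not on plain nested lists.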

Not always Python

Command line interfaces

simply use

click

click features

  • is lazily composable without restrictions
  • fully follows the Unix command line conventions
  • supports loading values from environment variables out of the box
  • supports prompting for custom values
  • is fully nestable and composable
  • works the same in Python 2 and 3
  • supports file handling out of the box
  • comes with useful common helpers (getting terminal dimensions, ANSI colors, fetching direct keyboard input, screen clearing, finding config paths, launching apps and editors, etc.)
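
A few of the features above can be sketched in a short command. This is an illustrative example only: the command `greet`, its options, and the environment variable `GREET_NAME` are hypothetical names, and the command is exercised through click's own test runner instead of a real shell.

```python
import click
from click.testing import CliRunner


@click.command()
@click.option("--name", prompt="Your name", envvar="GREET_NAME",
              help="Who to greet; taken from GREET_NAME if set, else prompted.")
@click.option("--count", default=1, show_default=True, type=int,
              help="Number of greetings.")
def greet(name, count):
    """Print a greeting COUNT times."""
    for _ in range(count):
        click.echo(f"Hello, {name}!")


# Invoke the command in-process, as a shell call would do.
runner = CliRunner()
result = runner.invoke(greet, ["--name", "World", "--count", "2"])
print(result.output)

# The same option can be supplied through an environment variable.
env_result = runner.invoke(greet, [], env={"GREET_NAME": "KTU"})
print(env_result.output)
```

In a real script the command would normally be started with `greet()` under an `if __name__ == "__main__":` guard, and `--help` output is generated from the docstring and the `help` strings.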

command line interface creation kit alternatives

Data Science libraries

Core data science libraries:

  • numpy - n-dimensional array data structures
  • scipy - fundamental library for scientific computing
  • pandas - R-like [but much faster] dataframe data structures
  • scikit-learn - machine learning
  • matplotlib - visualization (but please use R for static plots)
  • gensim - for topic models
  • networkx - graph analytics
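
As a rough sketch of how the first libraries in the list fit together, the snippet below builds a pandas dataframe on top of a NumPy array and runs a split-apply-combine aggregation. The data is invented for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical measurements, invented for this example.
df = pd.DataFrame({
    "group": ["a", "a", "b", "b"],
    "value": np.array([1.0, 3.0, 10.0, 20.0]),
})

# Split-apply-combine: mean value per group.
group_means = df.groupby("group")["value"].mean()

print(group_means)
```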

Machine Learning

Question to you:
what is machine learning?

Machine Learning Pipelines

from Machine Learning Pipelines presentation by Evan Sparks
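
The pipeline diagrams are not preserved in this text. As a minimal sketch of the idea in scikit-learn (an assumption about which library to illustrate with, based on the wrap-up recommendation), preprocessing and model are chained into one estimator that is fitted and scored as a unit. The dataset and the step names `scale` and `model` are chosen for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y)

# Each step is a (name, estimator) pair; all but the last must be transformers.
pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("model", LogisticRegression(max_iter=1000)),
])

# fit() learns the scaling on the training data only, then trains the model.
pipeline.fit(X_train, y_train)

# score() applies the same scaling to the test data before predicting.
accuracy = pipeline.score(X_test, y_test)
print(f"test accuracy: {accuracy:.2f}")
```

Keeping preprocessing inside the pipeline avoids leaking test-set statistics into training and lets the whole chain be cross-validated or persisted as a single object.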

Parallel Machine Learning

joblib - internally used by scikit-learn

 

details in interactive discussion
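
For readers without access to the interactive discussion, a minimal joblib sketch looks like this. It follows the library's basic `Parallel` / `delayed` pattern; the workload (square roots of small integers) is a placeholder for a real, more expensive function.

```python
from math import sqrt

from joblib import Parallel, delayed

# delayed(sqrt)(x) captures the call without running it;
# Parallel dispatches the captured calls to 2 workers and
# returns the results in input order.
roots = Parallel(n_jobs=2)(delayed(sqrt)(i ** 2) for i in range(10))

print(roots)
```

Parallelism only pays off when each task is substantially more expensive than the overhead of dispatching it, so a toy workload like this one will not run faster than a plain loop.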

Distributed Machine Learning

PySpark - Python API layer for Apache Spark

 

details in interactive discussion

Visualization

matplotlib is the core visualization library for Python

  • low level
  • non-intuitive

 

opt for declarative visualization libraries

  • ggplot from yhat
  • seaborn
  • altair

or use R!

 

comparisons (1) & (2)

 

Wrap up

  • Command line interfaces - use click
     
  • Data wrangling - use pandas, stdlib
     
  • Machine learning - use sklearn / pyspark pipelines
     
  • Simple parallelization - use joblib
     
  • Visualization - use R :) ... or a declarative Python lib

 

Thank you!

Let's talk now!

 

python-datascience

By Darius Aliulis

Talk on usage of Python for different aspects of Data Science @ Infoshow 2017