PYTHON FOR DATA SCIENCE
Darius Aliulis
Data Scientist & Engineer @ DATA DOG / SATALIA
A little about me
- Data scientist & software engineer @ DATA DOG / SATALIA
- Assistant lecturer @ KTU
Business Big Data Analytics Master's study programme
- Formal background in Applied Mathematics (MSc), studied
@ Kaunas University of Technology (KTU)
@ Technical University of Denmark (DTU)
- Interested in:
+ Solving hard problems
+ Operational machine learning
+ Streaming analytics
+ Functional & reactive programming
What about you?
- Who is looking into machine learning or data analytics?
- Who is looking into [big] data engineering?
- Who are R users?
- Who are Matlab users?
- Who are Python users?
- Who are Java users?
- Who are Scala users?
- Clojure?
- Haskell?
Goals
- Provide [an opinionated] guide to data science in Python
- Give advice on Python libraries for particular use cases
- Enable a quicker start with / switch to scientific Python
Talk outline
- Motivation to use Python for Data Science
- Command line interfaces
- Machine Learning
- Parallel & Distributed Computing
- Visualization
Python
Python is a programming language that lets you work quickly
and integrate systems more effectively.
www.python.org
Python
- strongly typed
- dynamic
- imperative
- object oriented
- multi-paradigm - to an extent
programming language
Python
In [1]: import this
The Zen of Python, by Tim Peters
Beautiful is better than ugly.
Explicit is better than implicit.
Simple is better than complex.
Complex is better than complicated.
Flat is better than nested.
Sparse is better than dense.
Readability counts.
Special cases aren't special enough to break the rules.
Although practicality beats purity.
Errors should never pass silently.
Unless explicitly silenced.
In the face of ambiguity, refuse the temptation to guess.
There should be one-- and preferably only one --obvious way to do it.
Although that way may not be obvious at first unless you're Dutch.
Now is better than never.
Although never is often better than *right* now.
If the implementation is hard to explain, it's a bad idea.
If the implementation is easy to explain, it may be a good idea.
Namespaces are one honking great idea -- let's do more of those!
Functional Python ?
shamelessly borrowed from talk by Alexey Kachayev
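As an illustration of the "multi-paradigm - to an extent" point, here is a minimal sketch of functional-style Python using only the standard library. It is not taken from the borrowed talk; the helper `compose` and all variable names are hypothetical and exist only for this example.

```python
from functools import partial, reduce
from itertools import accumulate
from operator import add, mul


def compose(*funcs):
    """Compose functions right to left: compose(f, g)(x) == f(g(x))."""
    return reduce(lambda f, g: lambda x: f(g(x)), funcs)


# Partial application builds small single-argument functions.
double = partial(mul, 2)
increment = partial(add, 1)
double_then_increment = compose(increment, double)

numbers = [1, 2, 3, 4]

# A comprehension is the idiomatic replacement for map + filter.
evens_squared = [n * n for n in numbers if n % 2 == 0]

# accumulate yields running results; reduce folds to a single value.
running_totals = list(accumulate(numbers))
total = reduce(add, numbers, 0)
```

Functions are first-class values here, but Python has no built-in composition operator, no tail-call optimization, and only single-expression lambdas, which is one way to read the "to an extent" caveat.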

Why Python ?
Why use a general-purpose language for scientific computing and/or data science when there are specialized languages like R or Matlab?
Why Python ?
I’d rather do math in a general-purpose language than try to do general-purpose programming in a math language.
John D. Cook
Why Python ?
Linear algebra operations [finally] feel similar to other languages thanks to PEP 465:
A dedicated infix operator for matrix multiplication
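To show what PEP 465 adds at the language level, here is a small sketch of a class that implements `@` through the `__matmul__` hook, next to `*` for the element-wise product. The `Matrix` class is hypothetical and deliberately tiny; in practice NumPy arrays already implement both operators, so this is only meant to show the mechanism.

```python
class Matrix:
    """Hypothetical minimal matrix type, only to illustrate PEP 465."""

    def __init__(self, rows):
        self.rows = [list(row) for row in rows]

    def __matmul__(self, other):
        # A @ B: matrix product (rows of A dotted with columns of B).
        columns = list(zip(*other.rows))
        return Matrix(
            [[sum(a * b for a, b in zip(row, col)) for col in columns]
             for row in self.rows]
        )

    def __mul__(self, other):
        # A * B: element-wise product of same-shaped matrices.
        return Matrix(
            [[a * b for a, b in zip(r1, r2)]
             for r1, r2 in zip(self.rows, other.rows)]
        )

    def __eq__(self, other):
        return isinstance(other, Matrix) and self.rows == other.rows

    def __repr__(self):
        return f"Matrix({self.rows})"


A = Matrix([[1, 2], [3, 4]])
B = Matrix([[5, 6], [7, 8]])

matrix_product = A @ B        # Matrix([[19, 22], [43, 50]])
elementwise_product = A * B   # Matrix([[5, 12], [21, 32]])
```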
Why Python ?
Since Python 3.5, the linear algebra operators have direct equivalents in other languages
# matrix product, element-wise multiplication
# between vectors or matrices A, B
# Python
A @ B
A * B
# R
A %*% B
A * B
# Matlab, Julia
A * B
A .* B
Not always Python

Command line interfaces
simply use
click
click features
- is lazily composable without restrictions
- fully follows the Unix command line conventions
- supports loading values from environment variables out of the box
- supports prompting for custom values
- is fully nestable and composable
- works the same in Python 2 and 3
- supports file handling out of the box
- comes with useful common helpers (getting terminal dimensions, ANSI colors, fetching direct keyboard input, screen clearing, finding config paths, launching apps and editors, etc.)
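A few of the features above can be sketched in one small command. The command name `greet`, its options, and the environment variable `GREETER_NAME` are made up for this example; the decorators and `CliRunner` are part of click itself. The sketch assumes click is installed.

```python
import click
from click.testing import CliRunner


@click.command()
@click.option("--name", envvar="GREETER_NAME", prompt="Your name",
              help="Name to greet; read from GREETER_NAME if set.")
@click.option("--count", default=1, show_default=True,
              help="Number of greetings.")
def greet(name, count):
    """Print a greeting COUNT times."""
    for _ in range(count):
        click.echo(f"Hello, {name}!")


# Invoke the command in-process. The name comes from the environment
# variable, so the interactive prompt is never shown.
runner = CliRunner()
result = runner.invoke(greet, ["--count", "2"], env={"GREETER_NAME": "Ada"})
```

In a real script the command would be started with `greet()` under an `if __name__ == "__main__":` guard; `CliRunner` is used here only so the example runs without a terminal.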
command line interface creation kit alternatives
Data Science libraries
Core data science libraries:
- numpy - n-dimensional array data structures
- scipy - fundamental library for scientific computing
- pandas - R-like [but much faster] dataframe data structures
- scikit-learn - machine learning
- matplotlib - visualization (but please use R for static plots)
- gensim - for topic models
- networkx - graph analytics
Machine Learning
Question to you:
what is machine learning?
Machine Learning Pipelines
from Machine Learning Pipelines presentation by Evan Sparks
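The pipeline idea maps onto scikit-learn's `Pipeline` class, which chains preprocessing steps and a final estimator behind a single `fit` / `score` interface. The sketch below assumes scikit-learn is installed; the choice of the iris dataset, the scaler, the classifier, and the step names `scale` and `classify` are arbitrary illustrations, not taken from the presentation.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Small built-in dataset, split into train and test parts.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Each step is a (name, transformer-or-estimator) pair; only the last
# step needs to be a predictor.
pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("classify", LogisticRegression(max_iter=1000)),
])

# fit() fits the scaler on the training data only, then the classifier.
pipeline.fit(X_train, y_train)
accuracy = pipeline.score(X_test, y_test)
```

Because the scaler is fitted inside the pipeline, the test data never leaks into preprocessing, and the whole object can be passed to cross-validation or grid search as one estimator.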

Parallel Machine Learning
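For simple, single-machine parallelization, joblib's `Parallel` and `delayed` are usually enough. This is a minimal sketch assuming joblib is installed; `sqrt` stands in for any independent per-item computation, and `n_jobs=2` is an arbitrary choice.

```python
from math import sqrt

from joblib import Parallel, delayed

# delayed(sqrt)(x) records the call without running it; Parallel then
# distributes the recorded calls over two workers and returns the
# results in input order.
results = Parallel(n_jobs=2)(delayed(sqrt)(i ** 2) for i in range(10))
```

Many scikit-learn estimators and model selection tools expose the same mechanism through their `n_jobs` parameter.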
Distributed Machine Learning
Visualization
Wrap up
- Command line interfaces - use click
- Data wrangling - use pandas, stdlib
- Machine Learning - use sklearn / pyspark pipelines
- Simple parallelization - use joblib
- Visualization - use R :) ... or a declarative Python lib
Thank you!
Let's talk now!
python-datascience
By Darius Aliulis
Talk on usage of Python for different aspects of Data Science @ Infoshow 2017