Data Science Toolchain
presented by Jie-Han Chen
slide: https://goo.gl/1hXBGk
Language & Software
Python
R
Java
Matlab
Octave
Jupyter Notebook
Python
Open Source Community
Package
Web Service
Good Readability
Machine Learning

R
Open Source Community
Built-in Statistics Package
Standalone computing & data analysis
Slower than Python

Java
High Performance
Big Data
Poor Visualization, Modeling

Matlab & Octave
Powerful built-in math functions
Simple Data Visualization tool
Prototyping
Jupyter Notebook

Data Science Roadmap

Data Science Toolchains
Data Collection
Data Visualization
Data Storage
Algorithm & Modeling
Data Collection
- Using API: Facebook, Wikipedia
- Web Scraper
Web Scraper
HTTP request + HTML Parser

HTTP: python-requests
Better than built-in urllib
Sessions with Cookie Persistence
Thread-safety
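
A minimal sketch of session-level cookie persistence with python-requests. To stay offline, it only prepares a request instead of sending it. The URL, the User-Agent string and the `over18` cookie are illustrative assumptions, not values from the slides.

```python
import requests

# A Session keeps headers and cookies across every request made through it.
session = requests.Session()
session.headers.update({"User-Agent": "toolchain-demo/0.1"})  # hypothetical UA
session.cookies.set("over18", "1")  # hypothetical cookie

# Prepare (but do not send) a request to inspect what the session attaches.
request = requests.Request("GET", "https://www.ptt.cc/bbs/Python/index.html")
prepared = session.prepare_request(request)

print(prepared.method, prepared.url)
print(prepared.headers.get("Cookie"))

# To actually fetch the page: response = session.send(prepared, timeout=10)
```

With the built-in urllib, the same cookie handling would need an explicit cookie jar and opener; the Session object bundles this.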
Web page

Parser!
Regular Expression?

BeautifulSoup
HTML/XML parser
Using BeautifulSoup to scrape Ptt post titles
HTML parser
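
A rough sketch of title extraction with BeautifulSoup. The HTML fragment is a hand-written assumption that only imitates a board index page (a `div.title` wrapping a link); the real Ptt markup may differ, and the page would normally come from an HTTP request.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML shaped like a board index; the last entry has no link,
# which is how a deleted post might appear.
html = """
<div class="r-ent"><div class="title"><a href="/bbs/Python/M.1.html">[Q] Reading CSV with pandas</a></div></div>
<div class="r-ent"><div class="title"><a href="/bbs/Python/M.2.html">[Share] Scrapy notes</a></div></div>
<div class="r-ent"><div class="title">(this post was deleted)</div></div>
"""

soup = BeautifulSoup(html, "html.parser")

titles = []
for block in soup.find_all("div", class_="title"):
    link = block.find("a")
    if link is not None:  # skip entries without a link
        titles.append(link.get_text(strip=True))

print(titles)
```

Walking the parsed tree like this tends to be more robust than a regular expression, which breaks easily when attributes or whitespace change.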

More Powerful Tool?

Scrapy
An open source and collaborative framework for extracting the data you need from websites. In a fast, simple, yet extensible way.
Scrapy
$ scrapy startproject tutorial


Scrapy

path: /scrapy/dmoz.py crawler name: dmoz
Scrapy
$ scrapy crawl dmoz
robots.txt

youtube.com/robots.txt
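
A small sketch of honoring robots.txt before scraping, using the standard library's `urllib.robotparser`. The rules, the crawler name and the URLs are made up for illustration; a real crawler would point `set_url()` at the site's robots.txt and call `read()`.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules, parsed from memory instead of being downloaded.
rules = [
    "User-agent: *",
    "Disallow: /private/",
]

parser = RobotFileParser()
parser.parse(rules)

allowed = parser.can_fetch("toolchain-demo", "https://example.com/index.html")
blocked = parser.can_fetch("toolchain-demo", "https://example.com/private/data")
print(allowed, blocked)  # True False
```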

"I believe that visualization is one of the most powerful means of achieving personal goals."
-Harvey Mackay
Data Visualization
- Matplotlib, ggplot2
- D3.js
- Bokeh
- Tableau
- PlotDB
- Leaflet



Matplotlib
ggplot2




D3.js
Bokeh
Python, R, Scala, Julia
Interactive
Jupyter Notebook

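Commercial software for data visualization

Tableau
Commercial analytics software (trial period available)
No code required; user-friendly interface
Handles many kinds of data sources
Relatively few chart types
Slow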


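PlotDB
Tell a story through data visualization
Workflow: choose a template, upload data, drag and drop
The code can be edited
Built by Taiwanese developers!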





Programming
Using GeoJSON with Leaflet
Move information out of the code; configurable
Structures how the data is handled
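
One way to keep map data out of the map code is to generate a GeoJSON file and let Leaflet load it (for example through its `L.geoJSON` layer). A minimal sketch follows; the place names and approximate coordinates are illustrative assumptions.

```python
import json

# Hypothetical records kept as data rather than hard-coded in map code.
places = [
    {"name": "Taipei 101", "lon": 121.5645, "lat": 25.0340},
    {"name": "Tainan Station", "lon": 120.2129, "lat": 22.9971},
]


def to_feature(place):
    """Convert one record into a GeoJSON Feature (coordinates are [lon, lat])."""
    return {
        "type": "Feature",
        "geometry": {
            "type": "Point",
            "coordinates": [place["lon"], place["lat"]],
        },
        "properties": {"name": place["name"]},
    }


collection = {
    "type": "FeatureCollection",
    "features": [to_feature(p) for p in places],
}
geojson_text = json.dumps(collection, indent=2)
print(geojson_text)
```

Note that GeoJSON orders coordinates as longitude, then latitude, which is the reverse of the order many map APIs take for a single point.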

S3
1. Key-value
2. Permission
3. Data Visualization
4. Big Data (Spark)


Algorithm & Modeling
python-numpy + python-pandas + scikit-learn
libsvm
Spark MLlib
Weka
Deep Learning
Numpy + Pandas + Scikit-learn
Numpy
Handles multi-dimensional array (matrix) operations
Performance approaching that of C
Provides common linear algebra operations
Numpy - data structure
ndarray (n-dim array)
ndim
size
shape
dtype
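
The four attributes above can be inspected on any ndarray. A short sketch, with an arbitrarily chosen 3x4 array:

```python
import numpy as np

a = np.arange(12, dtype=np.float64).reshape(3, 4)

print(a.ndim)   # number of axes: 2
print(a.size)   # total number of elements: 12
print(a.shape)  # length of each axis: (3, 4)
print(a.dtype)  # element type: float64
```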

Numpy
generate matrix
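
The original slides showed matrix generation as images. The sketch below lists a few common constructors as a stand-in; the shapes and the seed are arbitrary choices.

```python
import numpy as np

zeros = np.zeros((2, 3))            # 2x3 matrix of 0.0
ones = np.ones((2, 3))              # 2x3 matrix of 1.0
identity = np.eye(3)                # 3x3 identity matrix
grid = np.arange(6).reshape(2, 3)   # 0..5 laid out in 2 rows
steps = np.linspace(0.0, 1.0, 5)    # 5 evenly spaced values from 0 to 1

rng = np.random.default_rng(seed=0)
noise = rng.random((2, 3))          # uniform samples in [0, 1)

print(grid)
print(steps)
```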




Numpy - linalg
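
A brief sketch of the `numpy.linalg` module on a small system chosen for illustration: solving A x = b, plus the determinant, inverse and eigenvalues of A.

```python
import numpy as np

A = np.array([[3.0, 1.0],
              [1.0, 2.0]])
b = np.array([9.0, 8.0])

x = np.linalg.solve(A, b)            # solves A @ x = b
det = np.linalg.det(A)               # determinant of A
A_inv = np.linalg.inv(A)             # inverse of A
eigenvalues = np.linalg.eigvals(A)   # eigenvalues of A

print(x)
print(det)
```

For solving a linear system, `solve` is generally preferable to multiplying by the explicit inverse, both for speed and for numerical stability.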


Pandas
Built on top of numpy
Data types: Series, DataFrame
Handles data in many formats well: csv, json ...
Missing-value (NaN) handling
Data merging

Series - one-dimensional data
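
A small sketch of a Series: one-dimensional values with a label index. The names and scores are invented for illustration.

```python
import pandas as pd

s = pd.Series([90, 85, 70], index=["alice", "bob", "carol"], name="score")

print(s["bob"])     # label-based access
print(s[s > 80])    # boolean filtering keeps alice and bob
print(s.mean())     # aggregate over the values
```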
DataFrame - multi-dimensional data made up of Series
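
A sketch of how a DataFrame relates to Series: each column selected from the frame is itself a Series sharing the same row index. The data is invented for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["alice", "bob", "carol"],
    "score": [90, 85, 70],
})

column = df["score"]            # selecting one column gives a Series
print(type(column).__name__)
print(df.shape)                 # (rows, columns)
print(df.dtypes)                # one dtype per column
```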

Pandas - import
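
A minimal sketch of loading CSV and JSON with pandas. In-memory text stands in for files so the example runs without any data on disk; the contents are made up. With real files, a path would be passed to `read_csv` or `read_json` instead.

```python
import io

import pandas as pd

csv_text = "name,score\nalice,90\nbob,85\n"
json_text = '[{"name": "carol", "score": 70}]'

# Both readers accept a path or a file-like object.
from_csv = pd.read_csv(io.StringIO(csv_text))
from_json = pd.read_json(io.StringIO(json_text))

print(from_csv)
print(from_json)
```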


Pandas - NaN
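
A sketch of the usual choices for missing values: count them, drop the affected rows, or fill them in. The frame is invented for illustration, and which strategy fits depends on the dataset.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "a": [1.0, np.nan, 3.0],
    "b": [np.nan, 5.0, 6.0],
})

missing_per_column = df.isna().sum()   # count NaN in each column
dropped = df.dropna()                  # keep only rows with no NaN
filled = df.fillna(0.0)                # replace NaN with a constant
mean_filled = df.fillna(df.mean())     # replace NaN with the column mean

print(missing_per_column)
print(mean_filled)
```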



Pandas - operation
Merge
Grouping
Reshaping
. . .
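
The three operations listed above can be sketched on two small, invented tables: a merge on a shared key, a group-wise mean, and a reshape from long to wide form.

```python
import pandas as pd

students = pd.DataFrame({
    "name": ["alice", "bob", "carol"],
    "team": ["x", "x", "y"],
})
scores = pd.DataFrame({
    "name": ["alice", "alice", "bob", "carol"],
    "subject": ["math", "art", "math", "math"],
    "score": [90, 80, 85, 70],
})

# Merge: attach each student's team to every score row.
merged = students.merge(scores, on="name", how="left")

# Grouping: average score per team.
team_mean = merged.groupby("team")["score"].mean()

# Reshaping: one row per student, one column per subject.
wide = merged.pivot(index="name", columns="subject", values="score")

print(team_mean)
print(wide)
```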


Dataset
Feature Engineering
Modeling
Evaluation
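
The four stages above can be sketched with scikit-learn. This assumes the slide describes the scikit-learn workflow; the iris dataset, standardization as the feature step, and logistic regression as the model are illustrative choices, not the deck's own example.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Dataset: load features and labels, then hold out a test split.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

# Feature engineering + modeling: scale the features, then fit a classifier.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=200))
model.fit(X_train, y_train)

# Evaluation: accuracy on the held-out split.
accuracy = accuracy_score(y_test, model.predict(X_test))
print(f"accuracy: {accuracy:.3f}")
```

Putting the scaler inside the pipeline means it is fitted on the training split only, which avoids leaking test-set statistics into the model.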
LIBSVM
- C
- Easy to use
LIBSVM - install

$ git clone
$ make
LIBSVM - workflow

LIBSVM - data format


label: the class of the sample
index: the feature column, i.e. the attribute's position
value: the data value, i.e. the attribute's value
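
Each line of a LIBSVM data file has the form `<label> <index>:<value> <index>:<value> ...`, with indices starting at 1 and zero-valued attributes usually left out. The two helper functions below are hypothetical, written only to illustrate the format, and they assume integer class labels.

```python
def to_libsvm_line(label, features):
    """Format one sample; zero-valued features are omitted (sparse format)."""
    pairs = [
        f"{index}:{value:g}"
        for index, value in enumerate(features, start=1)
        if value != 0
    ]
    return " ".join([str(label)] + pairs)


def parse_libsvm_line(line):
    """Parse one line back into (label, {index: value})."""
    label, *pairs = line.split()
    features = {}
    for pair in pairs:
        index, value = pair.split(":")
        features[int(index)] = float(value)
    return int(label), features


line = to_libsvm_line(1, [0.5, 0.0, 2.0])
print(line)                     # 1 1:0.5 3:2
print(parse_libsvm_line(line))  # (1, {1: 0.5, 3: 2.0})
```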

LIBSVM - toy

MLlib
Distributed learning architecture, Hadoop
Supports Java, Scala, R, Python
Provides a complete toolchain
High computational performance

Classification: logistic regression, naive Bayes,...
Regression: generalized linear regression, survival regression,...
Decision trees, random forests, and gradient-boosted trees
Recommendation: alternating least squares (ALS)
Clustering: K-means, Gaussian mixtures (GMMs),...
Topic modeling: latent Dirichlet allocation (LDA)
Frequent itemsets, association rules, and sequential pattern mining
MLlib

Feature transformations: standardization, normalization, hashing,...
ML Pipeline construction
Model evaluation and hyper-parameter tuning
ML persistence: saving and loading models and Pipelines



Weka
Java library
Big Data
Support GUI

Deep Learning
Theano
Pylearn2
Keras
TensorFlow
Caffe
Deeplearning4J
...
Theano
Based on Numpy
Implemented by Cython
Dynamic C code generation
GPU & CUDA
tensor, math expression
A CPU and GPU Math Compiler in Python
Theano tutorial: http://www.slideshare.net/SergiiGavrylov/theano-tutorial
Keras
- Runs on top of Theano or TensorFlow
- Supports GPU
- Makes building a prototype simple and fast
High-level neural networks library





Where can you find the right tool?
Homework
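- Create a GitHub repo that shares data science tools you have used
- Database, Social Network Analytics, ML library, Deep Learning Platform ...
- Tools already covered in class are fine too
- The homework must include:
  - README.md: a guide to the repo
  - Demo Code
  - Documentation for the tool
- How to submit? Fill in your submission details in the Google Form
Questions about the homework: email ita3051@gmail.com, Jie-Han Chen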
https://goo.gl/forms/PQPz8u2glyunQvfM2
data-science-toolchain
By Jie-Han Chen
The Keras explanation comes from Professor Hung-yi Lee's slides, which explain it very clearly.