Data Science Toolchain
presented by Jie-Han Chen
slide: https://goo.gl/1hXBGk
Language & Software
Python
R
Java
Matlab
Octave
Jupyter Notebook
Python
Open Source Community
Package
Web Service
Good Readability
Machine Learning
R
Open Source Community
Built-in Statistics Package
Standalone computing & data analysis
Slower than Python
Java
High Performance
Big Data
Poor Visualization, Modeling
Matlab & Octave
Powerful built-in math functions
Simple Data Visualization tool
Prototyping
Jupyter Notebook
Data Science Roadmap
Data Science Toolchains
Data Collection
Data Visualization
Data Storage
Algorithm & Modeling
Data Collection
- Using API: Facebook, Wikipedia
- Web Scraper
Web Scraper
HTTP request + HTML Parser
HTTP: python-requests
Better than built-in urllib
Sessions with Cookie Persistence
Thread-safety
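A minimal sketch of cookie persistence with a python-requests Session. The URL, the User-Agent string, and the cookie name and value are placeholders chosen for illustration. The request is only prepared, not sent, so the sketch runs offline.

```python
import requests

# A Session keeps cookies and default headers across requests.
session = requests.Session()
session.headers.update({"User-Agent": "toolchain-demo/0.1"})  # hypothetical UA string
session.cookies.set("over18", "1")  # hypothetical cookie standing in for a login/consent cookie

# Build the request without sending it, so no network access is needed.
request = requests.Request("GET", "https://example.com/index.html")
prepared = session.prepare_request(request)

# The cookie and header stored on the session are attached automatically.
print(prepared.headers["Cookie"])
print(prepared.headers["User-Agent"])

# To actually fetch the page:
# response = session.send(prepared, timeout=10)
```

With plain `requests.get(...)` calls, each request starts without the cookies set by the previous one, which is why a Session is the usual choice for scraping pages behind a login or consent step.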
Web page
Parser!
Regular Expression?
BeautifulSoup
HTML/XML parser
Scraping PTT post titles with BeautifulSoup
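A rough sketch of the title-scraping idea. The HTML below is a hand-written stand-in for a board index page, and the `div.title a` selector is an assumption about the markup. The real page structure may differ, and the page would normally be downloaded first.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet imitating a board index page; real markup may differ.
html = """
<div class="r-ent"><div class="title"><a href="/bbs/demo/1.html">[Q] First post</a></div></div>
<div class="r-ent"><div class="title"><a href="/bbs/demo/2.html">[News] Second post</a></div></div>
"""

# Parse with Python's built-in HTML parser and pick the link text inside each title block.
soup = BeautifulSoup(html, "html.parser")
titles = [a.get_text(strip=True) for a in soup.select("div.title a")]

for title in titles:
    print(title)
```

Selecting by tag and class is usually more robust than a regular expression, because the parser copes with attribute order, whitespace, and nesting.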
HTML parser
More Powerful Tool?
Scrapy
An open source and collaborative framework for extracting the data you need from websites. In a fast, simple, yet extensible way.
Scrapy
$ scrapy startproject tutorial
Scrapy
path: /scrapy/dmoz.py
crawler name: dmoz
Scrapy
$ scrapy crawl dmoz
Scrapy
robots.txt
youtube.com/robots.txt
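A small sketch of checking robots.txt before crawling, using the standard library. The rules and the crawler name `my-crawler` are made up for illustration and are not taken from any real site.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules; a real crawler would download the site's own robots.txt.
rules = """\
User-agent: *
Disallow: /private/
Allow: /
""".splitlines()

parser = RobotFileParser()
parser.parse(rules)

allowed = parser.can_fetch("my-crawler", "https://example.com/index.html")
blocked = parser.can_fetch("my-crawler", "https://example.com/private/data")
print(allowed, blocked)

# For a live site, something like:
# parser.set_url("https://www.youtube.com/robots.txt"); parser.read()
```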
"I believe that visualization is one of the most powerful means of achieving personal goals."
-Harvey Mackay
Data Visualization
Data Visualization
- Matplotlib, ggplot2
- D3.js
- Bokeh
- Tableau
- PlotDB
- Leaflet
Matplotlib
ggplot2
D3.js
Bokeh
Python, R, Scala, Julia
Interactive
Jupyter Notebook
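Commercial software for data visualization
Tableau
Commercial analytics software (trial period available)
No coding required; user-friendly interface
Handles many kinds of data sources
Fewer chart types
Slow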
PlotDB
Tell a story through data visualization
Workflow: pick a template, upload data, drag and drop
The code can be edited
Made by Taiwanese developers!
Programming
Using GeoJSON with Leaflet
Move the data out of the code (configurable)
Give the data handling a consistent structure
Using GeoJSON with Leaflet
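A sketch of what "moving the data out of the code" can look like: the points live in a GeoJSON document that a map layer reads. The place names and coordinates are illustrative values, and the file is generated with Python here only to keep one language throughout.

```python
import json

# Hypothetical points of interest; GeoJSON coordinates are [longitude, latitude].
places = [
    ("Taipei 101", 121.5645, 25.0339),
    ("Tainan Station", 120.2130, 22.9971),
]

feature_collection = {
    "type": "FeatureCollection",
    "features": [
        {
            "type": "Feature",
            "geometry": {"type": "Point", "coordinates": [lon, lat]},
            "properties": {"name": name},
        }
        for name, lon, lat in places
    ],
}

# This text can be saved as a .geojson file and loaded by the map page.
geojson_text = json.dumps(feature_collection, indent=2)
print(geojson_text)
```

On the Leaflet side the document can be handed to a GeoJSON layer (for example `L.geoJSON(data).addTo(map)`), so adding a place means editing data, not code.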
S3
1. Key-value
2. Permission
3. Data Visualization
4. Big Data (Spark)
Algorithm & Modeling
Algorithm & Modeling
python-numpy + python-pandas + scikit-learn
libsvm
Spark MLlib
Weka
Deep Learning
Numpy + Pandas + Scikit-learn
Numpy
Handles multi-dimensional array and matrix operations
Performance close to C
Provides common linear algebra operations
Numpy - data structure
ndarray (n-dim array)
ndim
size
shape
dtype
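The four attributes above, shown on a small array. The array contents are arbitrary.

```python
import numpy as np

# A 2 x 3 array of 64-bit floats.
a = np.array([[1, 2, 3], [4, 5, 6]], dtype=np.float64)

print(a.ndim)   # number of dimensions: 2
print(a.size)   # total number of elements: 6
print(a.shape)  # length along each dimension: (2, 3)
print(a.dtype)  # element type: float64
```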
Numpy
generate matrix
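Some common ways to generate a matrix, as a sketch. The shapes and the random seed are arbitrary choices.

```python
import numpy as np

zeros = np.zeros((2, 3))             # 2 x 3 matrix of 0.0
ones = np.ones((2, 3))               # 2 x 3 matrix of 1.0
identity = np.eye(3)                 # 3 x 3 identity matrix
grid = np.arange(6).reshape(2, 3)    # 0..5 laid out as 2 rows of 3
steps = np.linspace(0.0, 1.0, 5)     # 5 evenly spaced values from 0 to 1

rng = np.random.default_rng(0)       # seeded generator for reproducible values
noise = rng.random((2, 3))           # uniform samples in [0, 1)

print(grid)
print(steps)
```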
Numpy - linalg
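A short sketch of `numpy.linalg` on a 2 x 2 system. The matrix and right-hand side are arbitrary.

```python
import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 3.0]])
b = np.array([3.0, 5.0])

x = np.linalg.solve(A, b)        # solve A x = b without forming the inverse
det = np.linalg.det(A)           # determinant
inv = np.linalg.inv(A)           # explicit inverse
eigvals = np.linalg.eigvals(A)   # eigenvalues

print(x)
print(det)
```

`solve` is generally preferred over multiplying by `inv(A)`, since it is faster and numerically more stable.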
Pandas
Built on top of numpy
Data types: Series, DataFrame
Handles many data formats: csv, json ...
Missing value (NaN) handling
Data merging
Series - one-dimensional data
DataFrame - two-dimensional table whose columns are Series
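A small sketch of both types. The names, scores, and ages are made-up sample data.

```python
import pandas as pd

# A Series is a one-dimensional array with an index (labels).
score = pd.Series([90, 75, 60], index=["amy", "bob", "cat"], name="score")
print(score["bob"])

# A DataFrame lines up several Series on a shared index, one per column.
age = pd.Series([20, 22, 21], index=["amy", "bob", "cat"])
df = pd.DataFrame({"score": score, "age": age})
print(df)

# Selecting one column gives back a Series.
print(type(df["score"]))
```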
Pandas - import
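A sketch of loading csv and json data. The data is inlined as strings so the example is self-contained; with real files, a path would be passed instead of the `StringIO` object.

```python
import io

import pandas as pd

# Made-up sample data in two formats.
csv_text = "name,score\namy,90\nbob,75\n"
json_text = '[{"name": "amy", "score": 90}, {"name": "bob", "score": 75}]'

from_csv = pd.read_csv(io.StringIO(csv_text))
from_json = pd.read_json(io.StringIO(json_text))

print(from_csv)
print(from_json)
```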
Pandas - NaN
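A sketch of the usual options for missing values: count them, drop them, or fill them. The column names and numbers are made up, and filling with the column mean is only one possible strategy.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "height": [170.0, np.nan, 160.0],
    "weight": [65.0, 70.0, np.nan],
})

missing = df.isna().sum()        # number of NaN values per column
dropped = df.dropna()            # keep only rows with no NaN
filled = df.fillna(df.mean())    # replace NaN with the column mean

print(missing)
print(dropped)
print(filled)
```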
Pandas - operation
Merge
Grouping
Reshaping
. . .
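The three operations above on two small tables, as a sketch. The table and column names are hypothetical.

```python
import pandas as pd

students = pd.DataFrame({"name": ["amy", "bob", "cat"], "team": ["a", "b", "a"]})
scores = pd.DataFrame({
    "name": ["amy", "amy", "bob", "cat"],
    "subject": ["math", "art", "math", "math"],
    "score": [90, 80, 75, 60],
})

# Merge: join the two tables on the shared "name" column.
merged = students.merge(scores, on="name")

# Grouping: average score per team.
team_mean = merged.groupby("team")["score"].mean()

# Reshaping: one row per student, one column per subject (NaN where no score exists).
wide = scores.pivot(index="name", columns="subject", values="score")

print(merged)
print(team_mean)
print(wide)
```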
Dataset
Feature Engineering
Modeling
Evaluation
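One way the four steps above might look in scikit-learn. The iris dataset, the scaling step, and logistic regression are stand-ins chosen for illustration, not the choices made in the talk.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Dataset: a small built-in dataset, split into training and test parts.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

# Feature engineering + modeling: standardize the features, then fit a classifier.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=200))
model.fit(X_train, y_train)

# Evaluation: accuracy on data the model has not seen.
accuracy = accuracy_score(y_test, model.predict(X_test))
print(accuracy)
```

Putting the scaler inside the pipeline means it is fitted on the training split only, which avoids leaking test data into the preprocessing.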
LIBSVM
- Written in C/C++
- Easy to use
LIBSVM - install
$ git clone
$ make
LIBSVM - workflow
LIBSVM - data format
label: the class of the sample
index: the position of the attribute (feature) in the sample
value: the numeric value of that attribute
LIBSVM - data format
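Each line of a LIBSVM data file has the form `<label> <index>:<value> <index>:<value> ...`, with indices in ascending order and zero-valued attributes usually left out. A minimal reader and writer for one line might look like this. The helper names `parse_libsvm_line` and `format_libsvm_line` and the sample line are hypothetical and not part of LIBSVM.

```python
def parse_libsvm_line(line):
    """Split one line into (label, {index: value})."""
    label, *pairs = line.split()
    features = {}
    for pair in pairs:
        index, value = pair.split(":")
        features[int(index)] = float(value)
    return float(label), features


def format_libsvm_line(label, features):
    """Write a sample back out, indices ascending, zero values skipped."""
    pairs = " ".join(
        f"{index}:{value:g}" for index, value in sorted(features.items()) if value != 0
    )
    return f"{label:g} {pairs}".rstrip()


label, features = parse_libsvm_line("+1 1:0.5 3:1 7:-2.25")
print(label, features)
print(format_libsvm_line(label, features))
```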
LIBSVM - toy
MLlib
Distributed learning architecture, Hadoop
Supports Java, Scala, R, Python
Provides a complete toolchain
High-speed computation
MLlib
Classification: logistic regression, naive Bayes,...
Regression: generalized linear regression, survival regression,...
Decision trees, random forests, and gradient-boosted trees
Recommendation: alternating least squares (ALS)
Clustering: K-means, Gaussian mixtures (GMMs),...
Topic modeling: latent Dirichlet allocation (LDA)
Frequent itemsets, association rules, and sequential pattern mining
MLlib
Feature transformations: standardization, normalization, hashing,...
ML Pipeline construction
Model evaluation and hyper-parameter tuning
ML persistence: saving and loading models and Pipelines
MLlib
Weka
Java library
Big Data
Support GUI
Deep Learning
Theano
Pylearn2
Keras
Tensorflow
Caffe
Deeplearning4J
...
Theano
Based on Numpy
Implemented by Cython
Dynamic C code generation
GPU & CUDA
tensor, math expression
A CPU and GPU Math Compiler in Python
Theano tutorial: http://www.slideshare.net/SergiiGavrylov/theano-tutorial
Keras
- Runs on top of Theano or Tensorflow
- Supports GPU
- Build a prototype simply and quickly
High-level neural networks library
Where can you find the right tool?
Homework
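- Create a repo on GitHub to share data science tools you have used
- Database, Social Network Analytics, ML library, Deep Learning Platform ...
- Tools already covered in class are fine too
- The assignment must include:
  - README.md: a guide to the repo
  - Demo code
  - Documentation for the tool
- How to submit? Fill in the submission details in the Google form
Questions about the homework: email ita3051@gmail.com, Jie-Han Chen (陳杰翰)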
https://goo.gl/forms/PQPz8u2glyunQvfM2
data-science-toolchain
By Jie-Han Chen
The Keras explanation uses Professor 李弘毅's slides, which explain the topic very clearly.