Data Science Toolchain
presented by Jie-Han Chen
slide: https://goo.gl/1hXBGk
Language & Software
Python
R
Java
Matlab
Octave
Jupyter Notebook
Python
Open Source Community
Package
Web Service
Good Readability
Machine Learning
![](https://s3.amazonaws.com/media-p.slid.es/uploads/382096/images/3360517/python-logo.png)
R
Open Source Community
Built-in Statistics Package
Standalone computing & data analysis
Slower than Python
![](https://s3.amazonaws.com/media-p.slid.es/uploads/382096/images/3360534/logo-r.png)
Java
High Performance
Big Data
Poor Visualization, Modeling
![](https://s3.amazonaws.com/media-p.slid.es/uploads/382096/images/3360598/Java_programming_language_logo.svg.png)
Matlab & Octave
Powerful built-in math functions
Simple Data Visualization tool
Prototyping
Jupyter Notebook
![](https://s3.amazonaws.com/media-p.slid.es/uploads/382096/images/3360708/_____2016-12-24___12.38.17.png)
Language & Software
Python
R
Java
Matlab
Octave
Jupyter Notebook
Data Science Roadmap
![](https://s3.amazonaws.com/media-p.slid.es/uploads/382096/images/3360559/RoadToDataScientist1.png)
Data Science Toolchains
Data Collection
Data Visualization
Data Storage
Algorithm & Modeling
Data Collection
- Using API: Facebook, Wikipedia
- Web Scraper
Web Scraper
![](https://s3.amazonaws.com/media-p.slid.es/uploads/382096/images/3361251/_____2016-12-24___5.46.49.png)
Web Scraper
HTTP request + HTML Parser
![](https://s3.amazonaws.com/media-p.slid.es/uploads/382096/images/3361311/web-crawler.jpg)
HTTP: python-requests
Better than built-in urllib
Sessions with Cookie Persistence
Thread-safety
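A minimal sketch of session-level cookie persistence with python-requests. The URL, the User-Agent string, and the cookie name are made up for illustration, and the request is only prepared (not sent) so the sketch runs offline.

```python
import requests

# A Session carries headers and cookies across every request made through it.
session = requests.Session()
session.headers.update({"User-Agent": "toolchain-demo/0.1"})  # hypothetical UA string
session.cookies.set("over18", "1")  # hypothetical cookie, for illustration only

# Build the request without sending it, so no network access is needed.
request = requests.Request(
    "GET", "https://example.com/search", params={"q": "data science"}
)
prepared = session.prepare_request(request)

print(prepared.url)                    # query string is URL-encoded automatically
print(prepared.headers.get("Cookie"))  # the session cookie is attached to the request

# To actually send it: response = session.send(prepared, timeout=10)
```

Sending through the same `session` object reuses the cookies the server sets, which is what makes login-protected pages reachable on later requests.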
HTTP: python-requests
![](https://s3.amazonaws.com/media-p.slid.es/uploads/382096/images/3361254/_____2016-12-24___5.55.41.png)
HTTP: python-requests
![](https://s3.amazonaws.com/media-p.slid.es/uploads/382096/images/3365238/_____2016-12-28___10.13.38.png)
Web page
![](https://s3.amazonaws.com/media-p.slid.es/uploads/382096/images/3361290/_____2016-12-24___9.53.27.png)
Parser!
Regular Expression?
![](https://s3.amazonaws.com/media-p.slid.es/uploads/382096/images/3361292/bs4.jpg)
BeautifulSoup
HTML/XML parser
Scraping Ptt post titles with BeautifulSoup
HTML parser
![](https://s3.amazonaws.com/media-p.slid.es/uploads/382096/images/3361310/_____2016-12-24___11.41.27.png)
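A small sketch of title extraction with BeautifulSoup. The HTML fragment below is hypothetical and only shaped like a board index page; the real page structure is an assumption and may differ.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML fragment, written inline so the sketch runs offline.
html = """
<div class="r-ent"><div class="title"><a href="/bbs/demo/M.1.html">[News] First post</a></div></div>
<div class="r-ent"><div class="title"><a href="/bbs/demo/M.2.html">[Q&amp;A] Second post</a></div></div>
<div class="r-ent"><div class="title">(this post has been deleted)</div></div>
"""

soup = BeautifulSoup(html, "html.parser")  # Python's built-in parser backend

titles = []
links = []
for block in soup.find_all("div", class_="title"):
    link = block.find("a")
    if link is None:
        continue  # a title block without a link (e.g. a deleted post) is skipped
    titles.append(link.get_text(strip=True))
    links.append(link["href"])

print(titles)
print(links)
```

In a real scraper the `html` string would come from the body of an HTTP response, for example `response.text` from python-requests.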
More Powerful Tool?
![](https://s3.amazonaws.com/media-p.slid.es/uploads/382096/images/3361358/scrapy.png)
Scrapy
An open source and collaborative framework for extracting the data you need from websites, in a fast, simple, yet extensible way.
Scrapy
$ scrapy startproject tutorial
![](https://s3.amazonaws.com/media-p.slid.es/uploads/382096/images/3361720/_____2016-12-25___4.22.18.png)
![](https://s3.amazonaws.com/media-p.slid.es/uploads/382096/images/3363558/scrapy-diagram-1.png)
Scrapy
![](https://s3.amazonaws.com/media-p.slid.es/uploads/382096/images/3361750/_____2016-12-25___5.19.35.png)
path: /scrapy/dmoz.py
crawler name: dmoz
Scrapy
![](https://s3.amazonaws.com/media-p.slid.es/uploads/382096/images/3361754/_____2016-12-25___5.22.13.png)
Scrapy
$ scrapy crawl dmoz
Scrapy
![](https://s3.amazonaws.com/media-p.slid.es/uploads/382096/images/3365212/_____2016-12-28___9.48.23.png)
robots.txt
![](https://s3.amazonaws.com/media-p.slid.es/uploads/382096/images/3361848/_____2016-12-25___8.22.53.png)
youtube.com/robots.txt
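A crawler can check robots.txt rules before fetching a page. The sketch below uses the standard library's `urllib.robotparser`; the rules and the crawler name are hypothetical and written inline so it runs without network access.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules for illustration; a real file is served at <site>/robots.txt.
rules = """\
User-agent: *
Disallow: /private/
Allow: /
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

allowed_index = parser.can_fetch("my-crawler", "https://example.com/index.html")
allowed_private = parser.can_fetch("my-crawler", "https://example.com/private/data.html")

print(allowed_index)
print(allowed_private)
```

For a live site, `parser.set_url(...)` followed by `parser.read()` downloads and parses the file instead of `parse()`.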
![](https://s3.amazonaws.com/media-p.slid.es/uploads/382096/images/3361781/Harvey-Mackay.png)
"I believe that visualization is one of the most powerful means of achieving personal goals."
-Harvey Mackay
Data Visualization
Data Visualization
- Matplotlib, ggplot2
- D3.js
- Bokeh
- Tableau
- PlotDB
- Leaflet
![](https://s3.amazonaws.com/media-p.slid.es/uploads/382096/images/3361915/scatter_demo21.png)
![](https://s3.amazonaws.com/media-p.slid.es/uploads/382096/images/3361916/_____2016-12-25___9.50.37.png)
![](https://s3.amazonaws.com/media-p.slid.es/uploads/382096/images/3361917/bachelors_degrees_by_gender.png)
Matplotlib
ggplot2
![](https://s3.amazonaws.com/media-p.slid.es/uploads/382096/images/3361930/3519-69.jpg)
![](https://s3.amazonaws.com/media-p.slid.es/uploads/382096/images/3361937/price-vs-carat.png)
![](https://s3.amazonaws.com/media-p.slid.es/uploads/382096/images/3361940/_____2016-12-25___10.22.01.png)
![](https://s3.amazonaws.com/media-p.slid.es/uploads/382096/images/3361942/d3.png)
D3.js
Bokeh
Python, R, Scala, Julia
Interactive
Jupyter Notebook
![](https://s3.amazonaws.com/media-p.slid.es/uploads/382096/images/3361965/bokeh_logo.png)
Commercial software for data visualization
![](https://s3.amazonaws.com/media-p.slid.es/uploads/382096/images/3362000/1300px_Tableau_Software_logo.png)
Tableau
Commercial analytics software (with a trial period)
No code required; user-friendly interface
Handles many kinds of data sources
Fewer chart types
Slow
![](https://s3.amazonaws.com/media-p.slid.es/uploads/382096/images/3362000/1300px_Tableau_Software_logo.png)
![](https://s3.amazonaws.com/media-p.slid.es/uploads/382096/images/3362714/_____2016-12-26___10.27.44.png)
Tell a story with data visualization
Workflow: pick a template, upload data, drag and drop
The code can be edited
Made in Taiwan!
![](https://s3.amazonaws.com/media-p.slid.es/uploads/382096/images/3362766/_____2016-12-26___10.48.40.png)
![](https://s3.amazonaws.com/media-p.slid.es/uploads/382096/images/3362770/_____2016-12-26___10.54.16.png)
![](https://s3.amazonaws.com/media-p.slid.es/uploads/382096/images/3362772/_____2016-12-26___10.55.40.png)
![](https://s3.amazonaws.com/media-p.slid.es/uploads/382096/images/3362780/_____2016-12-26___10.59.16.png)
Programming
Using GeoJSON with Leaflet
Move the data out of the code; configurable
Give data manipulation a clear structure
Using GeoJSON with Leaflet
![](https://s3.amazonaws.com/media-p.slid.es/uploads/382096/images/3362902/_____2016-12-27___1.09.24.png)
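One way to keep map data out of the code is to generate a GeoJSON file and let the map load it. The sketch below builds a small FeatureCollection with the standard library; the coordinates, name, and popup text are illustrative values, not data from the slides.

```python
import json

# Illustrative point. GeoJSON orders coordinates as [longitude, latitude].
feature = {
    "type": "Feature",
    "geometry": {"type": "Point", "coordinates": [121.5654, 25.0330]},
    "properties": {"name": "Taipei", "popupContent": "Hypothetical marker text"},
}
collection = {"type": "FeatureCollection", "features": [feature]}

geojson_text = json.dumps(collection, indent=2)
print(geojson_text)
```

On the Leaflet side, the parsed object would typically be passed to `L.geoJSON(...)` and added to the map, so changing the data only means changing the file.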
![](https://s3.amazonaws.com/media-p.slid.es/uploads/382096/images/3363571/489px-MySQL.svg.png)
![](https://s3.amazonaws.com/media-p.slid.es/uploads/382096/images/3363572/postgresql-logo1.png)
![](https://s3.amazonaws.com/media-p.slid.es/uploads/382096/images/3363577/2505008-1280px-CouchDB.svg.png)
![](https://s3.amazonaws.com/media-p.slid.es/uploads/382096/images/3363578/MongoDB-Logo.png)
![](https://s3.amazonaws.com/media-p.slid.es/uploads/382096/images/3363585/redis318x260_1.png)
![](https://s3.amazonaws.com/media-p.slid.es/uploads/382096/images/3363588/big_voltdb_logo_2014.gif)
![](https://s3.amazonaws.com/media-p.slid.es/uploads/382096/images/3363590/_____2016-12-27___4.09.25.png)
![](https://s3.amazonaws.com/media-p.slid.es/uploads/382096/images/3363591/Cassandra_logo.svg.png)
![](https://s3.amazonaws.com/media-p.slid.es/uploads/382096/images/3363593/neo4j_logo.png)
![](https://s3.amazonaws.com/media-p.slid.es/uploads/382096/images/3363772/s3_image.png)
S3
1. Key-value
2. Permission
3. Data Visualization
4. Big Data (Spark)
![](https://s3.amazonaws.com/media-p.slid.es/uploads/382096/images/3363776/AWS-logo.png)
![](https://s3.amazonaws.com/media-p.slid.es/uploads/382096/images/3363797/brain.png)
Algorithm & Modeling
Algorithm & Modeling
python-numpy + python-pandas + scikit-learn
libsvm
Spark MLlib
Weka
Deep Learning
Numpy + Pandas + Scikit-learn
Numpy
Handles multidimensional array operations
Performance approaching that of C
Provides common linear algebra routines
Numpy - data structure
ndarray (n-dim array)
ndim
size
shape
dtype
![](https://s3.amazonaws.com/media-p.slid.es/uploads/382096/images/3364620/_____2016-12-28___10.09.00.png)
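The four ndarray attributes listed above can be inspected directly. A minimal sketch with a made-up 2x3 array:

```python
import numpy as np

a = np.array([[1, 2, 3], [4, 5, 6]], dtype=np.float64)

print(a.ndim)   # number of axes
print(a.size)   # total number of elements
print(a.shape)  # length along each axis
print(a.dtype)  # element type shared by every entry
```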
Numpy
generate matrix
![](https://s3.amazonaws.com/media-p.slid.es/uploads/382096/images/3364626/_____2016-12-28___10.15.46.png)
![](https://s3.amazonaws.com/media-p.slid.es/uploads/382096/images/3365479/_____2016-12-29___12.52.01.png)
![](https://s3.amazonaws.com/media-p.slid.es/uploads/382096/images/3364631/_____2016-12-28___10.19.14.png)
![](https://s3.amazonaws.com/media-p.slid.es/uploads/382096/images/3364632/_____2016-12-28___10.20.04.png)
Numpy
generate matrix
![](https://s3.amazonaws.com/media-p.slid.es/uploads/382096/images/3364634/_____2016-12-28___10.22.02.png)
Numpy
generate matrix
![](https://s3.amazonaws.com/media-p.slid.es/uploads/382096/images/3364639/_____2016-12-28___10.25.48.png)
![](https://s3.amazonaws.com/media-p.slid.es/uploads/382096/images/3364640/_____2016-12-28___10.25.53.png)
Numpy
generate matrix
![](https://s3.amazonaws.com/media-p.slid.es/uploads/382096/images/3364645/_____2016-12-28___10.33.01.png)
![](https://s3.amazonaws.com/media-p.slid.es/uploads/382096/images/3364647/_____2016-12-28___10.35.14.png)
Numpy
generate matrix
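A few common ways to generate matrices with Numpy, as a sketch. The shapes and the random seed are arbitrary choices for illustration.

```python
import numpy as np

zeros = np.zeros((2, 3))           # 2x3 matrix filled with 0.0
ones = np.ones((2, 3))             # 2x3 matrix filled with 1.0
identity = np.eye(3)               # 3x3 identity matrix
grid = np.arange(6).reshape(2, 3)  # integers 0..5 laid out in 2 rows
steps = np.linspace(0.0, 1.0, 5)   # 5 evenly spaced values from 0 to 1

rng = np.random.default_rng(seed=0)  # seeded so repeated runs give the same values
noise = rng.random((2, 3))           # uniform samples in [0, 1)

print(grid)
print(steps)
```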
Numpy - linalg
![](https://s3.amazonaws.com/media-p.slid.es/uploads/382096/images/3364686/_____2016-12-28___11.51.12.png)
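A short sketch of `numpy.linalg` on a made-up 2x2 system: solving a linear system, inverting a matrix, and computing a determinant.

```python
import numpy as np

A = np.array([[2.0, 1.0], [1.0, 3.0]])
b = np.array([3.0, 5.0])

x = np.linalg.solve(A, b)   # solves A @ x = b without forming the inverse
A_inv = np.linalg.inv(A)    # explicit inverse, shown for comparison
det = np.linalg.det(A)      # 2*3 - 1*1

print(x)
print(det)
```

`solve` is usually preferred over multiplying by `inv(A)` because it is faster and numerically more stable.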
![](https://s3.amazonaws.com/media-p.slid.es/uploads/382096/images/3364728/pandas_logo.png)
Built on top of numpy
Data types: Series, DataFrame
Handles many data formats well: csv, json ...
Handles missing values (NaN)
Data merging
![](https://s3.amazonaws.com/media-p.slid.es/uploads/382096/images/3364745/_____2016-12-28___1.12.06.png)
Series - one-dimensional data
![](https://s3.amazonaws.com/media-p.slid.es/uploads/382096/images/3364747/_____2016-12-28___1.14.09.png)
Series - one-dimensional data
![](https://s3.amazonaws.com/media-p.slid.es/uploads/382096/images/3364740/_____2016-12-28___1.06.42.png)
Series - one-dimensional data
![](https://s3.amazonaws.com/media-p.slid.es/uploads/382096/images/3364743/_____2016-12-28___1.06.57.png)
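A minimal Series sketch with made-up scores, showing label-based lookup and boolean filtering:

```python
import pandas as pd

# Values plus an index of labels; the names and scores are made up.
s = pd.Series([90, 75, 82], index=["alice", "bob", "carol"], name="score")

bob_score = s["bob"]   # look up a value by its label
high = s[s > 80]       # boolean filtering keeps the matching labels
average = s.mean()

print(high)
print(average)
```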
Series - one-dimensional data
DataFrame - multi-dimensional Series
![](https://s3.amazonaws.com/media-p.slid.es/uploads/382096/images/3364752/_____2016-12-28___1.19.34.png)
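A DataFrame can be thought of as several Series sharing one index. A small sketch with made-up rows:

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["alice", "bob", "carol"],
    "age": [31, 25, 42],
    "city": ["Taipei", "Tainan", "Taipei"],
})

ages = df["age"]                      # selecting one column returns a Series
taipei = df[df["city"] == "Taipei"]   # row filtering with a boolean mask

print(df.shape)
print(taipei)
```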
Pandas - import
![](https://s3.amazonaws.com/media-p.slid.es/uploads/382096/images/3364760/_____2016-12-28___1.39.41.png)
![](https://s3.amazonaws.com/media-p.slid.es/uploads/382096/images/3364761/_____2016-12-28___1.40.02.png)
Pandas - import
![](https://s3.amazonaws.com/media-p.slid.es/uploads/382096/images/3364771/_____2016-12-28___1.50.13.png)
Pandas - import
![](https://s3.amazonaws.com/media-p.slid.es/uploads/382096/images/3364784/_____2016-12-28___2.12.35.png)
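A sketch of importing CSV and JSON data with pandas. The data is inlined through `io.StringIO` so it runs without files; in practice a file path or URL would be passed instead.

```python
import io

import pandas as pd

# Made-up data, inlined so no file is needed.
csv_text = "name,score\nalice,90\nbob,75\n"
df_csv = pd.read_csv(io.StringIO(csv_text))

json_text = '[{"name": "alice", "score": 90}, {"name": "bob", "score": 75}]'
df_json = pd.read_json(io.StringIO(json_text))

print(df_csv)
print(df_json)
```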
Pandas - import
Pandas - NaN
![](https://s3.amazonaws.com/media-p.slid.es/uploads/382096/images/3364778/_____2016-12-28___1.55.27.png)
![](https://s3.amazonaws.com/media-p.slid.es/uploads/382096/images/3364779/_____2016-12-28___1.57.38.png)
![](https://s3.amazonaws.com/media-p.slid.es/uploads/382096/images/3364780/_____2016-12-28___1.57.45.png)
Pandas - NaN
![](https://s3.amazonaws.com/media-p.slid.es/uploads/382096/images/3364779/_____2016-12-28___1.57.38.png)
![](https://s3.amazonaws.com/media-p.slid.es/uploads/382096/images/3364781/_____2016-12-28___1.57.51.png)
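Three common ways to handle NaN in pandas, sketched on a made-up frame: counting, dropping, and filling.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": [4.0, 5.0, np.nan]})

missing = df.isna().sum()   # number of NaN values per column
dropped = df.dropna()       # keep only rows that have no NaN
filled = df.fillna(0.0)     # replace every NaN with a constant

print(missing)
print(dropped)
print(filled)
```

Whether to drop or fill depends on the data; filling with 0.0 here is only an illustrative choice.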
Pandas - NaN
Pandas - operation
Merge
Grouping
Reshaping
. . .
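A sketch of the three operations named above (merge, grouping, reshaping) on two made-up tables:

```python
import pandas as pd

users = pd.DataFrame({"user_id": [1, 2, 3], "city": ["Taipei", "Tainan", "Taipei"]})
orders = pd.DataFrame({"user_id": [1, 1, 2, 3], "amount": [100, 50, 70, 30]})

# Merge: attach each order's city by joining on the shared key.
merged = orders.merge(users, on="user_id", how="left")

# Grouping: total order amount per city.
per_city = merged.groupby("city")["amount"].sum()

# Reshaping: cities as rows, users as columns, summed amounts as values.
wide = merged.pivot_table(index="city", columns="user_id", values="amount", aggfunc="sum")

print(per_city)
print(wide)
```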
![](https://s3.amazonaws.com/media-p.slid.es/uploads/382096/images/3364842/scikit-learn-logo.png)
![](https://s3.amazonaws.com/media-p.slid.es/uploads/382096/images/3364842/scikit-learn-logo.png)
Dataset
Feature Engineering
Modeling
Evaluation
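The four stages above can be sketched in a few lines of scikit-learn. The dataset (iris), the scaler, and the classifier are illustrative choices, not the ones used in the talk.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Dataset: a small built-in example, split into train and test parts.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Feature engineering + modeling: scale the features, then fit a classifier.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=200))
model.fit(X_train, y_train)

# Evaluation: accuracy on data the model has not seen.
accuracy = accuracy_score(y_test, model.predict(X_test))
print(accuracy)
```

Wrapping the scaler and the model in one pipeline means the scaling statistics are learned from the training split only.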
LIBSVM
- C
- Easy to use
LIBSVM - install
![](https://s3.amazonaws.com/media-p.slid.es/uploads/382096/images/3364003/_____2016-12-27___11.38.22.png)
$ git clone
LIBSVM - install
![](https://s3.amazonaws.com/media-p.slid.es/uploads/382096/images/3364004/_____2016-12-27___11.40.10.png)
$ make
LIBSVM - workflow
![](https://s3.amazonaws.com/media-p.slid.es/uploads/382096/images/3364025/_____2016-12-28___12.05.14.png)
LIBSVM - data format
![](https://s3.amazonaws.com/media-p.slid.es/uploads/382096/images/3364045/_____2016-12-28___12.14.30.png)
![](https://s3.amazonaws.com/media-p.slid.es/uploads/382096/images/3364025/_____2016-12-28___12.05.14.png)
label: the class of the sample
index: the position of the attribute (its order among the attributes)
value: the numeric value of that attribute
LIBSVM - data format
![](https://s3.amazonaws.com/media-p.slid.es/uploads/382096/images/3365302/_____2016-12-28___10.50.51.png)
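Each line of a LIBSVM data file has the form `<label> <index>:<value> <index>:<value> ...`. The helper below is a hypothetical parser written for illustration, not part of LIBSVM, and the sample line is made up.

```python
def parse_libsvm_line(line):
    """Split one '<label> <index>:<value> ...' line into (label, {index: value})."""
    parts = line.split()
    label = float(parts[0])
    features = {}
    for item in parts[1:]:
        index, value = item.split(":")
        features[int(index)] = float(value)
    return label, features


# Made-up sample; attributes that are not listed are treated as zero (sparse format).
sample = "+1 1:0.708 3:1 4:-0.32"
label, features = parse_libsvm_line(sample)

print(label)
print(features)
```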
LIBSVM - toy
![](https://s3.amazonaws.com/media-p.slid.es/uploads/382096/images/3364979/_____2016-12-28___6.16.08.png)
MLlib
![](https://s3.amazonaws.com/media-p.slid.es/uploads/382096/images/3364839/spark-logo-hd.png)
![](https://s3.amazonaws.com/media-p.slid.es/uploads/382096/images/3365013/_____2016-12-28___6.46.43.png)
MLlib
Distributed learning framework, Hadoop
Supports Java, Scala, R, Python
Provides a complete toolchain
High computational performance
![](https://s3.amazonaws.com/media-p.slid.es/uploads/382096/images/3364839/spark-logo-hd.png)
MLlib
![](https://s3.amazonaws.com/media-p.slid.es/uploads/382096/images/3364839/spark-logo-hd.png)
![](https://s3.amazonaws.com/media-p.slid.es/uploads/382096/images/3365004/_____2016-12-28___6.40.30.png)
MLlib
Classification: logistic regression, naive Bayes,...
Regression: generalized linear regression, survival regression,...
Decision trees, random forests, and gradient-boosted trees
Recommendation: alternating least squares (ALS)
Clustering: K-means, Gaussian mixtures (GMMs),...
Topic modeling: latent Dirichlet allocation (LDA)
Frequent itemsets, association rules, and sequential pattern mining
MLlib
![](https://s3.amazonaws.com/media-p.slid.es/uploads/382096/images/3364839/spark-logo-hd.png)
Feature transformations: standardization, normalization, hashing,...
ML Pipeline construction
Model evaluation and hyper-parameter tuning
ML persistence: saving and loading models and Pipelines
MLlib
![](https://s3.amazonaws.com/media-p.slid.es/uploads/382096/images/3364839/spark-logo-hd.png)
![](https://s3.amazonaws.com/media-p.slid.es/uploads/382096/images/3364996/_____2016-12-28___6.28.47.png)
![](https://s3.amazonaws.com/media-p.slid.es/uploads/382096/images/3364997/_____2016-12-28___6.28.52.png)
MLlib
![](https://s3.amazonaws.com/media-p.slid.es/uploads/382096/images/3364839/spark-logo-hd.png)
![](https://s3.amazonaws.com/media-p.slid.es/uploads/382096/images/3364979/_____2016-12-28___6.16.08.png)
![](https://s3.amazonaws.com/media-p.slid.es/uploads/382096/images/3365285/_____2016-12-28___10.40.53.png)
Weka
Java library
Big Data
Support GUI
![](https://s3.amazonaws.com/media-p.slid.es/uploads/382096/images/3365328/_____2016-12-28___11.09.55.png)
Deep Learning
Theano
Pylearn2
Keras
Tensorflow
Caffe
Deeplearning4J
...
Theano
Based on Numpy
Implemented by Cython
Dynamic C code generation
GPU & CUDA
tensor, math expression
A CPU and GPU Math Compiler in Python
Theano tutorial: http://www.slideshare.net/SergiiGavrylov/theano-tutorial
Keras
- Runs on top of Theano or Tensorflow
- Supports GPU
- Build a prototype simply and quickly
High-level neural networks library
![](https://s3.amazonaws.com/media-p.slid.es/uploads/382096/images/3365425/_____2016-12-29___12.11.14.png)
![](https://s3.amazonaws.com/media-p.slid.es/uploads/382096/images/3365427/_____2016-12-29___12.14.42.png)
![](https://s3.amazonaws.com/media-p.slid.es/uploads/382096/images/3365439/_____2016-12-29___12.21.50.png)
![](https://s3.amazonaws.com/media-p.slid.es/uploads/382096/images/3365440/_____2016-12-29___12.23.10.png)
![](https://s3.amazonaws.com/media-p.slid.es/uploads/382096/images/3365443/_____2016-12-29___12.23.20.png)
Where to find the right tool?
Homework
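- Create a GitHub repo that shares data science tools you have used
- Database, Social Network Analytics, ML library, Deep Learning Platform ...
- Tools covered in class are fine too
- The homework must include:
  - README.md: a guide to the repo
  - Demo Code
  - Documentation for the tool
- How to submit? Fill in your submission details in the Google form
Homework questions: email ita3051@gmail.com (Jie-Han Chen)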
https://goo.gl/forms/PQPz8u2glyunQvfM2
data-science-toolchain
By Jie-Han Chen
The Keras explanation uses Prof. Hung-yi Lee's slides, which explain the topic very clearly.