Data Science Toolchain

presented by Jie-Han Chen

slide: https://goo.gl/1hXBGk

Language & Software

  • Python

  • R

  • Java

  • Matlab

  • Octave

  • Jupyter Notebook

Python

  • Open Source Community

  • Package

  • Web Service

  • Good Readability

  • Machine Learning

R

  • Open Source Community

  • Built-in Statistics Package

  • Standalone computing & data analysis

  • Slower than Python

Java

  • High Performance

  • Big Data

  • Poor Visualization, Modeling

Matlab & Octave

  • Powerful built-in math functions

  • Simple Data Visualization tool

  • Prototyping

Jupyter Notebook

  • Supports 40+ programming languages, e.g. Python, R, Scala...

  • Excellent for sharing your experiments

  • Markdown, LaTeX

  • example1

  • example2

Data Science Roadmap

Data Science Toolchains

  • Data Collection

  • Data Visualization

  • Data Storage

  • Algorithm & Modeling

Data Collection

  • Using API: Facebook, Wikipedia

  • Web Scraper

Web Scraper

HTTP request + HTML Parser

HTTP: python-requests

  • Better than built-in urllib

  • Sessions with Cookie Persistence

  • Thread-safety
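A minimal sketch of the session features listed above, assuming the `requests` package is installed. The user-agent string, the cookie name `demo_cookie`, and the helper names `make_session` and `fetch_html` are hypothetical and chosen only for illustration.

```python
import requests


def make_session(user_agent="toolchain-demo/0.1"):
    """Create a Session that reuses headers, cookies and connections."""
    session = requests.Session()
    # Headers set here are sent with every request made through the session.
    session.headers.update({"User-Agent": user_agent})
    # Hypothetical cookie, only to show that the session stores cookies.
    session.cookies.set("demo_cookie", "1")
    return session


def fetch_html(session, url, timeout=10):
    """Download one page and return its HTML text (performs a network call)."""
    response = session.get(url, timeout=timeout)
    response.raise_for_status()  # raise an exception on 4xx/5xx responses
    return response.text


session = make_session()
```

Calling `fetch_html(session, url)` with a real URL would download the page. Cookies returned by the server are kept in `session.cookies` and sent again on later requests through the same session.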

Web page

Parser!

Regular Expression?

BeautifulSoup

HTML/XML parser

Using BeautifulSoup to scrape Ptt post titles
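A minimal sketch of title extraction with BeautifulSoup, assuming the `beautifulsoup4` package is installed. The HTML below is a hand-written sample, not a real Ptt page. The `div.title` structure and the helper name `extract_titles` are assumptions made for illustration.

```python
from bs4 import BeautifulSoup

# Hand-written sample; a real page would be downloaded first.
SAMPLE_HTML = """
<div class="r-ent"><div class="title"><a href="/bbs/demo/M.1.html">First post</a></div></div>
<div class="r-ent"><div class="title"><a href="/bbs/demo/M.2.html">Second post</a></div></div>
<div class="r-ent"><div class="title">(deleted)</div></div>
"""


def extract_titles(html):
    """Return (title, link) pairs for every title block that contains a link."""
    soup = BeautifulSoup(html, "html.parser")
    results = []
    for block in soup.find_all("div", class_="title"):
        link = block.find("a")
        if link is None:
            continue  # entries without a link (e.g. deleted posts) are skipped
        results.append((link.get_text(strip=True), link["href"]))
    return results


titles = extract_titles(SAMPLE_HTML)
```

Navigating the parsed tree this way tends to be more robust than matching the raw HTML with regular expressions, because it does not depend on whitespace or attribute order.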

HTML parser

More Powerful Tool?

Scrapy

 

     An open source and collaborative framework for extracting the data you need from websites. In a fast, simple, yet extensible way.

Scrapy

$ scrapy startproject tutorial

path: /scrapy/dmoz.py, crawler name: dmoz

$ scrapy crawl dmoz

robots.txt

youtube.com/robots.txt
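Before crawling a site, a scraper can check its robots.txt rules. The following sketch uses the standard-library `urllib.robotparser` on an inline sample; the rules, the example.com URLs, and the helper names are hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, parsed from memory instead of the network.
SAMPLE_ROBOTS = """\
User-agent: *
Disallow: /private/
Allow: /
"""


def build_parser(robots_text):
    """Parse robots.txt text into a RobotFileParser."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser


def is_allowed(parser, url, user_agent="*"):
    """Return True if the rules let this user agent fetch the URL."""
    return parser.can_fetch(user_agent, url)


parser = build_parser(SAMPLE_ROBOTS)
public_ok = is_allowed(parser, "https://example.com/public/page")
private_ok = is_allowed(parser, "https://example.com/private/data")
```

For a live site, `RobotFileParser.set_url(...)` followed by `read()` downloads the file instead of parsing an inline string.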

"I believe that visualization is one of the most powerful means of achieving personal goals."

                                                 -Harvey Mackay

Data Visualization

  • Matplotlib, ggplot2

  • D3.js

  • Bokeh

  • Tableau

  • PlotDB

  • Leaflet

Matplotlib

ggplot2

D3.js

Bokeh

  • Python, R, Scala, Julia

  • Interactive

  • Jupyter Notebook

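Commercial software for data visualization

Tableau

  • Commercial analytics software (trial period available)

  • No coding required; user-friendly interface

  • Handles many kinds of data sources

  • Fewer chart types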

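PlotDB

  • Cloud community for sharing data

  • Tell a story with data visualization

  • Workflow: choose a template, upload data, drag and drop

  • The code can be edited

  • Made in Taiwan!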

Programming

Using GeoJSON with Leaflet

  • Move information out of the code; configurable

  • Give data operations a consistent structure
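One way to keep map data out of the program code is to generate a GeoJSON document and let Leaflet load it. The sketch below builds a `FeatureCollection` with the standard library only; the place list, the property names, and the helper names are illustrative assumptions. GeoJSON stores coordinates as `[longitude, latitude]`.

```python
import json


def make_point_feature(name, lon, lat, **properties):
    """Build one GeoJSON Feature with Point geometry."""
    return {
        "type": "Feature",
        # GeoJSON order is [longitude, latitude].
        "geometry": {"type": "Point", "coordinates": [lon, lat]},
        "properties": {"name": name, **properties},
    }


def make_feature_collection(features):
    """Wrap features in a GeoJSON FeatureCollection."""
    return {"type": "FeatureCollection", "features": list(features)}


# Illustrative data; in practice this could come from a file or database.
places = [
    ("Taipei 101", 121.5645, 25.0339, {"category": "landmark"}),
    ("Sample point", 120.0, 23.5, {"category": "demo"}),
]

collection = make_feature_collection(
    make_point_feature(name, lon, lat, **props) for name, lon, lat, props in places
)
geojson_text = json.dumps(collection, ensure_ascii=False, indent=2)
```

On the Leaflet side, the resulting document can be passed to `L.geoJSON(data).addTo(map)`, so changing the map content only requires changing the data.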

S3

  • 1. Key-value

  • 2. Permission

  • 3. Data Visualization

  • 4. Big Data (Spark)

Algorithm & Modeling

  • python-numpy + python-pandas + scikit-learn

  • libsvm

  • Spark MLlib

  • Weka

  • Deep Learning

Numpy + Pandas + Scikit-learn

Numpy

  • Handles multi-dimensional array (matrix) operations

  • Performance approaching that of C

  • Provides common linear algebra operations

Numpy - data structure

ndarray (n-dim array)

  • ndim

  • size

  • shape

  • dtype
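The four attributes above can be inspected on any array. A small sketch, assuming NumPy is installed; the array contents are arbitrary.

```python
import numpy as np

# A 3x4 array of float64 values 0.0 .. 11.0.
a = np.arange(12, dtype=np.float64).reshape(3, 4)

info = {
    "ndim": a.ndim,          # number of axes
    "size": a.size,          # total number of elements
    "shape": a.shape,        # length of each axis
    "dtype": str(a.dtype),   # element type
}
print(info)
```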

Numpy

generate matrix
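The original slides showed matrix-generation code as images. The sketch below lists some commonly used NumPy constructors as a stand-in; it is not the author's original code, and the shapes and seed are arbitrary.

```python
import numpy as np

zeros = np.zeros((2, 3))            # 2x3 matrix of 0.0
ones = np.ones((2, 3))              # 2x3 matrix of 1.0
identity = np.eye(3)                # 3x3 identity matrix
ramp = np.arange(6).reshape(2, 3)   # integers 0..5 laid out as 2x3
grid = np.linspace(0.0, 1.0, 5)     # 5 evenly spaced values from 0 to 1

# Seeded generator so the random matrix is reproducible.
rng = np.random.default_rng(0)
noise = rng.random((2, 3))          # uniform values in [0, 1)
```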

Numpy - linalg
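A short sketch of a few `numpy.linalg` routines, assuming NumPy is installed. The 2x2 system is an arbitrary example.

```python
import numpy as np

# Solve the linear system A x = b.
A = np.array([[2.0, 1.0],
              [1.0, 3.0]])
b = np.array([3.0, 5.0])

x = np.linalg.solve(A, b)        # solution vector
det = np.linalg.det(A)           # determinant
A_inv = np.linalg.inv(A)         # matrix inverse
eigvals = np.linalg.eigvalsh(A)  # eigenvalues of a symmetric matrix
```

`solve` is generally preferable to multiplying by `inv(A)` when only the solution of the system is needed.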

Pandas

  • Built on top of NumPy

  • Data types: Series, DataFrame

  • Good at handling many data formats: csv, json ...

  • Handling of missing values (NaN)

  • Data merging

Series - one-dimensional data
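A minimal Series sketch, assuming pandas is installed. The values, labels, and the name `score` are made up for illustration.

```python
import pandas as pd

# A Series is a one-dimensional array with an index of labels.
s = pd.Series([3, 1, 4, 1], index=["a", "b", "c", "d"], name="score")

value_c = s["c"]        # label-based access
above_one = s[s > 1]    # boolean filtering keeps the matching labels
average = s.mean()      # vectorized statistics
```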

DataFrame - multi-dimensional data (a collection of Series)
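A minimal DataFrame sketch, assuming pandas is installed. The column names and values are made up for illustration.

```python
import pandas as pd

# Each column of a DataFrame is a Series sharing the same row index.
df = pd.DataFrame({
    "name": ["a", "b", "c"],
    "height": [170.0, 165.5, 180.2],
    "age": [20, 25, 30],
})

heights = df["height"]                         # selecting a column yields a Series
older_names = df.loc[df["age"] > 21, "name"]   # filter rows, then pick a column
```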

Pandas - import
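A sketch of loading CSV and JSON data with pandas, assuming pandas is installed. To stay self-contained it reads from in-memory strings; the data is made up. With real files, a path would be passed to `pd.read_csv` or `pd.read_json` instead.

```python
import io

import pandas as pd

CSV_TEXT = "id,city,temp\n1,Taipei,28.5\n2,Tainan,30.1\n"
JSON_TEXT = '[{"id": 1, "city": "Taipei"}, {"id": 2, "city": "Tainan"}]'

# read_csv and read_json accept file paths or file-like objects.
df_csv = pd.read_csv(io.StringIO(CSV_TEXT))
df_json = pd.read_json(io.StringIO(JSON_TEXT))
```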

Pandas - NaN
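A sketch of common missing-value operations, assuming pandas and NumPy are installed. The data is made up, and which strategy fits (drop, constant fill, mean fill) depends on the dataset.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "a": [1.0, np.nan, 3.0],
    "b": [np.nan, 5.0, 6.0],
})

missing_per_column = df.isna().sum()   # count NaN values in each column
dropped = df.dropna()                  # keep only rows without any NaN
zero_filled = df.fillna(0.0)           # replace NaN with a constant
mean_filled = df.fillna(df.mean())     # replace NaN with each column's mean
```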

Pandas - operation

  • Merge

  • Grouping

  • Reshaping

  • . . .
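The merge, grouping and reshaping operations above can be sketched as follows, assuming pandas is installed. The two small tables are made up for illustration.

```python
import pandas as pd

cities = pd.DataFrame({"id": [1, 2, 3], "city": ["Taipei", "Tainan", "Taipei"]})
sales = pd.DataFrame({"id": [1, 2, 3], "sales": [10, 20, 30]})

# Merge: join the two tables on their shared key column.
merged = cities.merge(sales, on="id")

# Grouping: total sales per city.
totals = merged.groupby("city")["sales"].sum()

# Reshaping: one row per id, one column per city.
wide = merged.pivot(index="id", columns="city", values="sales")
```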

Scikit-learn

  • Dataset

  • Feature Engineering

  • Modeling

  • Evaluation
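The dataset, feature engineering, modeling and evaluation steps can be sketched with scikit-learn as below, assuming scikit-learn is installed. The Iris dataset, standard scaling, and logistic regression are arbitrary choices for illustration, not a recommendation from the slides.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Dataset: load a bundled dataset and hold out a test split.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

# Feature engineering + modeling: scale features, then fit a classifier.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=200))
model.fit(X_train, y_train)

# Evaluation: accuracy on data the model has not seen.
accuracy = accuracy_score(y_test, model.predict(X_test))
print(f"test accuracy: {accuracy:.3f}")
```

Putting the scaler inside the pipeline means it is fitted on the training split only, which avoids leaking test data into the preprocessing step.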

LIBSVM

LIBSVM - install

$ git clone 

$ make

LIBSVM - workflow

LIBSVM - data format

  • label: the class of the data sample

  • index: the data field, i.e. the position of the attribute

  • value: the data value, i.e. the value of the attribute
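Each line of a LIBSVM data file has the form `label index:value index:value ...`, and only non-zero attributes need to be listed. The sketch below reads and writes that format with plain Python; the helper names and the sample line are hypothetical.

```python
def parse_libsvm_line(line):
    """Split one 'label index:value ...' line into (label, {index: value})."""
    parts = line.split()
    label = float(parts[0])
    features = {}
    for item in parts[1:]:
        index, value = item.split(":")
        features[int(index)] = float(value)
    return label, features


def format_libsvm_line(label, features):
    """Write a label and a sparse feature dict back as one LIBSVM line."""
    # Indices are written in ascending order.
    pairs = " ".join(f"{i}:{features[i]:g}" for i in sorted(features))
    return f"{label:g} {pairs}".strip()


label, features = parse_libsvm_line("+1 1:0.5 3:2")
line = format_libsvm_line(label, features)
```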

LIBSVM - toy

MLlib

  • Distributed learning architecture, Hadoop

  • Supports Java, Scala, R, Python

  • Provides a complete toolchain

  • High computing performance

MLlib - algorithms

  • Classification: logistic regression, naive Bayes,...

  • Regression: generalized linear regression, survival regression,...

  • Decision trees, random forests, and gradient-boosted trees

  • Recommendation: alternating least squares (ALS)

  • Clustering: K-means, Gaussian mixtures (GMMs),...

  • Topic modeling: latent Dirichlet allocation (LDA)

  • Frequent itemsets, association rules, and sequential pattern mining

MLlib - workflow utilities

  • Feature transformations: standardization, normalization, hashing,...

  • ML Pipeline construction

  • Model evaluation and hyper-parameter tuning

  • ML persistence: saving and loading models and Pipelines

Weka

  • Java library

  • Big Data

  • GUI support

Deep Learning

  • Theano

  • Pylearn2

  • Keras

  • TensorFlow

  • Caffe

  • Deeplearning4j

  • ...

Theano

  • Based on NumPy

  • Implemented in Python and Cython

  • Dynamic C code generation

  • GPU & CUDA

  • tensor, math expression

A CPU and GPU Math Compiler in Python

Keras

  • Runs on top of Theano or TensorFlow

  • Supports GPU

  • Build prototypes simply and quickly

High-level neural networks library

Where to find a suitable tool?

Homework

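  • Create a GitHub repo that shares data science tools you have used

  • Database, Social Network Analytics, ML library, Deep Learning Platform ...

  • Tools covered in class are fine too

  • The homework must include:

    • README.md: a guide to the repo

    • Demo Code

    • Documentation for the tool

  • How to submit? Fill in the Google Form with your submission details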

 

Homework questions, email: ita3051@gmail.com (Jie-Han Chen)

https://goo.gl/forms/PQPz8u2glyunQvfM2
