presented by Jie-Han Chen
slide: https://goo.gl/1hXBGk
Python
R
Java
Matlab
Octave
Jupyter Notebook
Open Source Community
Package
Web Service
Good Readability
Machine Learning
Open Source Community
Built-in Statistics Package
Standalone computing & data analysis
Slower than Python
High Performance
Big Data
Poor Visualization, Modeling
Powerful built-in math functions
Simple Data Visualization tool
Prototyping
Data Collection
Data Visualization
Data Storage
Algorithm & Modeling
Using API: Facebook, Wikipedia
Web Scraper
Better than built-in urllib
Sessions with Cookie Persistence
Thread-safety
Regular Expression?
HTML/XML parser
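The question mark after "Regular Expression?" suggests why a real parser is usually the safer choice for scraped pages: attribute order and quoting vary, which a parser handles and a hand-written pattern often does not. Below is a minimal sketch using Python's standard-library `html.parser`, chosen here only as a stand-in for whichever HTML/XML parser the slide had in mind. The `LinkCollector` class, the `extract_links` helper, and the sample markup are all made up for illustration.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the opening tag
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.links.append(value)


# Made-up sample page: note the differing attribute order and quote styles
SAMPLE_HTML = """
<html><body>
  <a href="https://example.com/a">first</a>
  <a class="x" href='/relative/b'>second</a>
  <p>no link here</p>
</body></html>
"""


def extract_links(html):
    """Return all link targets found in the given HTML string."""
    parser = LinkCollector()
    parser.feed(html)
    return parser.links


print(extract_links(SAMPLE_HTML))
```

The same idea carries over to third-party parsers; only the API for walking the tags changes.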
An open source and collaborative framework for extracting the data you need from websites, in a fast, simple, yet extensible way.
$ scrapy startproject tutorial
path: /scrapy/dmoz.py
crawler name: dmoz
$ scrapy crawl dmoz
Matplotlib, ggplot2
D3.js
Bokeh
Tableau
PlotDB
Leaflet
Python, R, Scala, Julia
Interactive
Jupyter Notebook
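Commercial software for data visualization
Commercial analytics software (trial period available)
No coding required; user-friendly interface
Handles many kinds of data sources
Fewer chart types
Slow
Tells a story through data visualization
Workflow: pick a template, upload data, drag and drop
The code can be edited
Made in Taiwan!
Moves information out of the code; configurable
Makes data operations structured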
1. Key-value
2. Permission
3. Data Visualization
4. Big Data (Spark)
python-numpy + python-pandas + scikit-learn
libsvm
Spark MLlib
Weka
Deep Learning
Handles multi-dimensional matrix operations
Performance approaching that of C
Provides common linear algebra operations
ndarray (n-dim array)
ndim
size
shape
dtype
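The four `ndarray` attributes listed above can be inspected directly. This is a small sketch; the array contents and the variable names `a` and `product` are arbitrary choices for illustration.

```python
import numpy as np

# Build a 3x4 array of floats: 0.0, 1.0, ..., 11.0
a = np.arange(12, dtype=np.float64).reshape(3, 4)

print(a.ndim)   # number of dimensions (axes)
print(a.size)   # total number of elements
print(a.shape)  # length of each axis, as a tuple
print(a.dtype)  # element type shared by every entry

# A common linear algebra operation: matrix product of a with its transpose
product = a @ a.T
print(product.shape)
```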
Built on top of numpy
Data types: Series, DataFrame
Good at handling data in many formats: csv, json ...
Handling missing values (NaN)
Combining data
Merge
Grouping
Reshaping
. . .
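The pandas operations named above (missing-value handling, merge, grouping, reshaping) might look like the following sketch. The two small tables and all column names are invented for illustration, and filling NaN with the column mean is just one possible strategy.

```python
import numpy as np
import pandas as pd

# Made-up data: one score is missing (NaN)
scores = pd.DataFrame({
    "student": ["amy", "bob", "amy", "bob"],
    "subject": ["math", "math", "art", "art"],
    "score": [90.0, np.nan, 70.0, 50.0],
})
classes = pd.DataFrame({"student": ["amy", "bob"], "class_id": ["A", "B"]})

# Missing values: replace NaN with the mean of the observed scores
filled = scores.copy()
filled["score"] = filled["score"].fillna(filled["score"].mean())

# Merge: attach each student's class to every score row
merged = filled.merge(classes, on="student")

# Grouping: average score per student
by_student = merged.groupby("student")["score"].mean()

# Reshaping: one row per student, one column per subject
wide = merged.pivot(index="student", columns="subject", values="score")

print(merged)
print(by_student)
print(wide)
```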
Dataset
Feature Engineering
Modeling
Evaluation
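The four stages above (dataset, feature engineering, modeling, evaluation) map onto scikit-learn roughly as in this sketch. The choice of the bundled iris dataset, standard scaling as the feature step, and logistic regression as the model are assumptions made only to keep the example short; they are not taken from the slides.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Dataset: a small built-in classification dataset (assumed for illustration)
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Feature engineering + modeling: scale the features, then fit a classifier
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)

# Evaluation: accuracy on the held-out split
accuracy = accuracy_score(y_test, model.predict(X_test))
print(f"test accuracy: {accuracy:.3f}")
```

Wrapping the scaler and the classifier in one pipeline keeps the scaling statistics fitted on the training split only, which avoids leaking information from the test split.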
C
Easy to use
$ git clone
$ make
label: the class of the data
index: the data field, i.e. the attribute's position
value: the data value, i.e. the attribute's value
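In the libsvm data format each line has the form `<label> <index>:<value> <index>:<value> ...`, which ties the three terms above together. The sketch below parses one such line in plain Python; the function name `parse_libsvm_line` and the sample line are hypothetical, and this is not part of libsvm itself.

```python
def parse_libsvm_line(line):
    """Split one libsvm-format line into (label, {index: value})."""
    parts = line.split()
    label = float(parts[0])  # first token is the class label
    features = {}
    for item in parts[1:]:
        index, value = item.split(":")
        # index is the attribute position; omitted indices are treated as zero
        features[int(index)] = float(value)
    return label, features


# Made-up sample: class +1 with three non-zero attributes
sample = "+1 1:0.5 3:1.2 7:-0.3"
print(parse_libsvm_line(sample))
```

Because only non-zero attributes are written out, the format stays compact for sparse data.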
Distributed learning architecture, Hadoop
Supports Java, Scala, R, Python
Provides a complete toolchain
High computing performance
Classification: logistic regression, naive Bayes,...
Regression: generalized linear regression, survival regression,...
Decision trees, random forests, and gradient-boosted trees
Recommendation: alternating least squares (ALS)
Clustering: K-means, Gaussian mixtures (GMMs),...
Topic modeling: latent Dirichlet allocation (LDA)
Frequent itemsets, association rules, and sequential pattern mining
Feature transformations: standardization, normalization, hashing,...
ML Pipeline construction
Model evaluation and hyper-parameter tuning
ML persistence: saving and loading models and Pipelines
Java library
Big Data
Support GUI
Theano
Pylearn2
Keras
TensorFlow
Caffe
Deeplearning4J
...
Based on NumPy
Implemented by Cython
Dynamic C code generation
GPU & CUDA
tensor, math expression
Theano tutorial: http://www.slideshare.net/SergiiGavrylov/theano-tutorial
Uses Theano or TensorFlow as the backend
Supports GPU
Build prototypes simply and quickly
High-level neural networks library
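Create a repo on GitHub to share a data science tool you have used
Database, Social Network Analytics, ML library, Deep Learning Platform ...
Tools already covered in class are fine too
The assignment must include:
README.md: a guide to the repo
Demo Code
Documentation for the tool
How to submit? Fill in your submission details in the Google Form
Questions about the assignment: email ita3051@gmail.com, Jie-Han Chen (陳杰翰)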
https://goo.gl/forms/PQPz8u2glyunQvfM2