MLDM Monday -- spideR Series
RCurl, XML & Encoding
Yi-Hsi Lee (EC)
Who Am I?
- Education:
  - Ph.D. in Finance, NSYSU, Taiwan.
  - Ph.D. Candidate in MS&E, CSU, China.
- Experience:
  - Assistant Professor (P/T), NKFUST & NKU.
  - Managing Director, Chihfeng Financial Advisors Ltd.
  - Manager, Folion Financial Technology Inc.
  - Software Engineer, U-Vision Biotech Inc.
  - System Analyzer, Taiyuan Ltd.
- Programming Skills:
  - Basic, COBOL, C++/STL, SAS, Splus, Matlab, PHP, Mathematica, R, Lingo, EViews, Stata.
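Why Am I Here?
"How about a talk on the Encoding problems you ran into with that spideR you wrote earlier?"
"Ah! So excited, yet so afraid of getting hurt!"
Put on a helmet and (cluelessly) charge ahead!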
Agenda
- Building (DIY) your own spideR is VERY EASY! (Trust Me, You Can Make It! --> R Spirit)
  - R is Easy & Powerful!
  - Taiwan R User Group is Great!
- Keeping Yourself Out of Trouble
  - Legal Issues
  - Stealthy Rules
- Encoding Issues and General Solution (In My Opinion)
  - OS / Software / Data / URL / Server
- Some Other Useful Tips
Goal & Tools
- Goal:
  - Scrape news from a Chinese website with R (spideR)
- Tools: R
  - RCurl (getURL)
  - XML (readHTMLTable / XPath)
  - RMessenger (Powered by Wush Wu)
- Platform OS:
  - Windows / Mac / Linux (Ubuntu)
- Time (Coding): 7 Days
- Time (Implementation): 1 Month (In fact, ...)
- # of Queries: 1,538,992
- # of Obs.: 25,828,673
Keeping Yourself Out of Trouble
- Legal Issues
  - Copyright
    - Assume “All Rights Reserved”
  - Trespass to Chattels !!!
    - Traditional Trespass: unauthorized use of real property (land or real estate) [news]
    - Trespass to Chattels: interference that prevents or impairs an owner’s use of or access to personal property
Keeping Yourself Out of Trouble
- Stealthy Rules
- Do It Right the First Time
- Be Kind to Your Resources
- Do Not Place an Undue Load on a Target Server
- Target Multiple Servers instead of Relying on a Single Source
- Use Proxies (But ... Be Careful). You may try Tor.
- Use Utility Computers, e.g. Statlab (National Center for High-performance Computing)
- Run It During Busy Hours / Days (so the traffic blends in with normal usage)
- Behave Like a Human
- Use Random, Intra-fetch Delays
- Don’t Run It at the Same Time Each Day
- Limit Downloads to the Absolute Minimum
SOP for spideR
- Step 1: Analyzing a Form (Reverse Engineering)
- Step 2: Automating Form Submission & Parsing Query Results
- Form Emulation
- Parsing Query Results (X)
- Parsing with Basic String Functions
- Parsing with Regular Expressions
- Encoding Problemsss
- Fault-Tolerant Mechanism (X)
- Managing Large Amounts of Data (X)
Step 1: Form Emulation
- Form Elements
- Data Fields (Text / Select / Radio / Checkbox / ...)
- Method (GET / POST / Multipart Encoding)
- Form Handler
- Event Trigger (Submit / onClick / onMouseOut / ...)
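Once the fields and the method of a form are known, a GET submission can be emulated by assembling the query string by hand. The following is a minimal sketch in base R, assuming a made-up handler URL (`http://example.com/search`) and made-up field names (`q`, `page`); `build_query_url` is a hypothetical helper, not part of any package. A POST form would instead go through something like RCurl's `postForm()`.

```r
# Hypothetical helper: build the URL that a GET form submission would request.
build_query_url <- function(base_url, params) {
  # Percent-encode field names and values; reserved = TRUE also escapes
  # characters such as "&", "=" and "?" that would break the query string.
  keys <- vapply(names(params), utils::URLencode, character(1),
                 reserved = TRUE, USE.NAMES = FALSE)
  vals <- vapply(params,
                 function(v) utils::URLencode(as.character(v), reserved = TRUE),
                 character(1), USE.NAMES = FALSE)
  paste0(base_url, "?", paste(keys, vals, sep = "=", collapse = "&"))
}

# Made-up handler and fields, for illustration only
form_url <- build_query_url("http://example.com/search",
                            list(q = "R spider", page = 2))
print(form_url)
```

The resulting string is what would be handed to `getURL()`; the real field names come from inspecting the form, for example with Chrome DevTools.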
Chrome DevTools
WebbotsSpiderScreenScrapers Form Analyzer
(DEMO)
A look back at ...

Encoding
- OS (Windows / Mac / Linux / ...)
- Software (RStudio / Revolution R / ...)
- Data (File / Database)
- URL
- Server
I Hate "Encoding"!!!
我愛你 ("I love you")
UTF-8 (no BOM at the start of the file): 我愛你
Big5: ??雿?
GB2312: 鎴戞剾浣
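The garbage on this slide comes from reading the same bytes under the wrong encoding. A small sketch of that effect in base R follows; the variable names are illustrative, and it assumes the platform's iconv knows the name "GBK" (if it does not, the misread step simply yields NA).

```r
# "我愛你" written with \u escapes, so the script does not depend on the
# encoding of its own source file
x <- "\u6211\u611b\u4f60"

# Bytes of the UTF-8 representation: three bytes per character here
utf8_bytes <- charToRaw(enc2utf8(x))
print(utf8_bytes)

# Deliberately misread those UTF-8 bytes as GBK and convert the result back
# to UTF-8: this is the kind of mismatch that produces unreadable text.
# Assumption: "GBK" is supported by the local iconv.
misread <- tryCatch(
  iconv(rawToChar(utf8_bytes), from = "GBK", to = "UTF-8", sub = "?"),
  error = function(e) NA_character_
)
print(misread)
```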
encoding
Encoding - Windows
- OS: UTF-8 / Big5
- Software: RStudio
- Sys.getlocale()
- [1]  "LC_COLLATE=Chinese (Traditional)_Taiwan.950;
 LC_CTYPE=Chinese (Traditional)_Taiwan.950;
 LC_MONETARY=Chinese (Traditional)_Taiwan.950;
 LC_NUMERIC=C;
 LC_TIME=Chinese (Traditional)_Taiwan.950"
 
Encoding - Mac
- Software: RStudio
- Sys.getlocale()
- [1] "zh_TW.UTF-8/zh_TW.UTF-8/zh_TW.UTF-8/C/zh_TW.UTF-8/zh_TW.UTF-8"
Encoding - Ubuntu
- Software: RStudio
- Sys.getlocale()
- [1]  "LC_CTYPE=zh_TW.UTF-8;
 LC_NUMERIC=C;
 LC_TIME=en_SG.UTF-8;
 LC_COLLATE=zh_TW.UTF-8;
 LC_MONETARY=en_SG.UTF-8;
 LC_MESSAGES=zh_TW.UTF-8;
 LC_PAPER=C;
 LC_NAME=C;
 LC_ADDRESS=C;
 LC_TELEPHONE=C;
 LC_MEASUREMENT=en_SG.UTF-8;
 LC_IDENTIFICATION=C"
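The three outputs differ mainly in the character set attached to LC_CTYPE: code page 950 (Big5) on Windows versus UTF-8 on Mac and Ubuntu. A short sketch for checking this from inside a script, using base R's `l10n_info()`; the variable names are only illustrative.

```r
# Query only the category that governs character encoding
ctype <- Sys.getlocale("LC_CTYPE")

# l10n_info() summarizes the current locale (multibyte? UTF-8? Latin-1?)
info <- l10n_info()
is_utf8 <- isTRUE(info[["UTF-8"]])

cat("LC_CTYPE:", ctype, "\n")
cat("UTF-8 locale:", is_utf8, "\n")
```

Querying a single category is also more portable than parsing the combined LC_ALL string, whose layout differs between the three systems shown here.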
Tip: General Encoding Solution (In My Opinion)
- ## [Start] First switch to the "C" locale
 Sys.setlocale(category = "LC_ALL", locale = "C") # Q: C?
- ## [Middle] Convert everything to one specific encoding, as required
 myStrVec = read.table("myFile.csv", sep=",", ... , encoding="UTF-8")
 myStr <- iconv(myStrVec[i], from="UTF-8", to="gb2312")
 myURL <- URLencode(myStr)
 myRes <- getURL(myURL, ... , .encoding='gb2312')
 myRes <- readHTMLTable(myRes, encoding='gb2312', which=7)
- ## [End] Switch back to ""
 Sys.setlocale(category = "LC_ALL", locale = "") # "" restores the original (system default) setting
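The switch-and-restore pattern can be wrapped so that the locale is put back even when the code in the middle fails. This is a sketch under my own assumptions, not the author's code: `with_c_ctype` is a hypothetical helper, and it narrows the change to LC_CTYPE (the category that governs character handling) because a single category can be saved and restored more portably than LC_ALL.

```r
# Hypothetical helper: evaluate `expr` with LC_CTYPE set to "C",
# then restore the previous setting even if `expr` raises an error.
with_c_ctype <- function(expr) {
  old <- Sys.getlocale("LC_CTYPE")
  on.exit(Sys.setlocale("LC_CTYPE", old), add = TRUE)
  Sys.setlocale("LC_CTYPE", "C")
  force(expr)  # the argument is lazy, so it is evaluated here, under "C"
}

before <- Sys.getlocale("LC_CTYPE")
inside <- with_c_ctype(Sys.getlocale("LC_CTYPE"))
after  <- Sys.getlocale("LC_CTYPE")

cat("before:", before, "| inside:", inside, "| after:", after, "\n")
```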
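Key Functions
- Sys.info()["sysname"] # {base} query the OS type
  .Platform$OS.type # {base}
- Sys.getlocale() # {base} query the system encoding (locale)
  sessionInfo() # {utils}
- Sys.setlocale() # {base} set the system encoding (locale)
  - Sys.setlocale(category = 'LC_ALL', locale = 'C')
- read.table() # {utils} load data, declaring a specific encoding
  - myStrVec <- read.table(myFile , ... , encoding='UTF-8')
- Encoding() # {base} declare (mark) the encoding of loaded strings; the bytes are not converted
  - Encoding(myStrVec) <- 'UTF-8' # only "latin1", "UTF-8", "bytes" and "unknown" are recognized; 'gb2312' would be treated as "unknown"
- iconv() # {base} convert the encoding after loading the data file
  - myURL <- iconv(myStrVec[i] , from='UTF-8', to='gb2312')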
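`Encoding()` and `iconv()` are easy to confuse: the first only declares how existing bytes should be interpreted, while the second rewrites the bytes. A small sketch of the difference, using Latin-1 instead of GB2312 purely because that conversion is available on every platform; the sample word is my own.

```r
# The six Latin-1 bytes of "façade", built from raw values so that no
# locale or source-file encoding is involved
x <- rawToChar(as.raw(c(0x66, 0x61, 0xe7, 0x61, 0x64, 0x65)))

# Declare the encoding: the bytes stay exactly the same
Encoding(x) <- "latin1"
marked_bytes <- charToRaw(x)

# Convert the encoding: the single byte e7 becomes the two bytes c3 a7
y <- iconv(x, from = "latin1", to = "UTF-8")
converted_bytes <- charToRaw(y)

print(marked_bytes)
print(converted_bytes)
```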
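Key Functions
- URLencode() # {utils} percent-encode the URL string (before sending the query)
  - myURL <- URLencode(myURL)
- getURL() # {RCurl} receive the query result (after sending the query)
  - myRes <- getURL(myURL, ... , .encoding='gb2312')
- readLines() # {base} receive the query result (after sending the query)
  - myRes <- readLines(myURL, encoding='gb2312') # note: this `encoding` only marks strings as Latin-1 or UTF-8; to re-encode the input, read from url(myURL, encoding='gb2312')
- readHTMLTable() # {XML} parse the result and extract a specific table
  - myRes <- readHTMLTable(myRes, encoding='gb2312', which=7)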
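The parsing step can be tried offline before any server is queried. The sketch below parses an inline HTML string with the XML package; the sample table, the `parse_html_table` helper and its default `which = 1` are all made up for illustration, and the call is skipped when XML is not installed.

```r
# Hypothetical helper: parse HTML text and return one of its tables
parse_html_table <- function(html, which = 1) {
  doc <- XML::htmlParse(html, asText = TRUE, encoding = "UTF-8")
  XML::readHTMLTable(doc, which = which, stringsAsFactors = FALSE)
}

# Made-up stand-in for a query result
sample_html <- paste0(
  "<html><body><table>",
  "<tr><th>title</th><th>date</th></tr>",
  "<tr><td>News A</td><td>2013-05-01</td></tr>",
  "<tr><td>News B</td><td>2013-05-02</td></tr>",
  "</table></body></html>"
)

if (requireNamespace("XML", quietly = TRUE)) {
  tab <- parse_html_table(sample_html)
  print(tab)
}
```

On a real page the value of `which` has to be found by inspecting the page; a layout-heavy site can contain many tables before the one that holds the data.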
Change Encoding
- Code
ChangeEncoding <- function(Flag) {  # Flag: 0 = start; 1 = end
  os <- Sys.info()[["sysname"]]     # alternative: .Platform$OS.type
  if (Flag == 0) {
    # Start: switch to the "C" locale
    switch(os,
           Windows = Sys.setlocale(category = "LC_ALL", locale = "C"),
           Darwin  = Sys.setlocale(category = "LC_ALL", locale = "C"),  # macOS reports "Darwin", not "Mac"
           Linux   = Sys.setlocale(category = "LC_ALL", locale = "C"))
  } else if (Flag == 1) {
    # End: switch back to the system default locale
    switch(os,
           Windows = Sys.setlocale(category = "LC_ALL", locale = ""),
           Darwin  = Sys.setlocale(category = "LC_ALL", locale = ""),
           Linux   = Sys.setlocale(category = "LC_ALL", locale = ""))
  } else {
    stop("Flag must be 0 (start) or 1 (end)")  # `break` is only valid inside a loop
  }
}
Tips: tryCatch
- tryCatch({
   StatementA
  }, error = function(Err) {
   StatementB
   myLog <<- "E"
  }, warning = function(War) {
   StatementC
   myLog <<- "W"
  }, finally = {
   StatementD
  })
- <<- # roughly assign(... , envir = .GlobalEnv)
- See also: 20130520 MLDM Monday R Data Import/Export (by Wush)
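A runnable variant of the pattern, returning the status instead of writing to a global with `<<-`. This is only a sketch: `run_logged` and the three sample tasks are hypothetical, and the status codes "OK", "W" and "E" are my own labels.

```r
# Hypothetical wrapper: run a task and report how it ended
run_logged <- function(task) {
  tryCatch({
    list(status = "OK", value = task(), message = NA_character_)
  }, warning = function(w) {
    list(status = "W", value = NULL, message = conditionMessage(w))
  }, error = function(e) {
    list(status = "E", value = NULL, message = conditionMessage(e))
  })
}

ok   <- run_logged(function() 1 + 1)
warn <- run_logged(function() as.integer("x"))          # coercion warning
err  <- run_logged(function() stop("server unreachable"))

cat(ok$status, warn$status, err$status, "\n")
```

In a long-running spideR, the returned status can be appended to a log so that failed queries can be retried later without stopping the loop.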
Tips: RMessenger
- RMessenger (Powered by Wush)
- Demo:
  ## Google Talk settings
  MessFrom <- "abc@gmail.com"
  MessFromPass <- "123"
  MessTo <- "efg@gmail.com"
  ## Get the computer (node) name
  SysName <- Sys.info()["nodename"]
  ## Send the Google Talk message (d is defined elsewhere in the script)
  MessContent <- paste("SysName:",SysName,"; NumDate:",d,"; \t",date(),"\n")
  sendXMPPMessage(jid = MessFrom, password = MessFromPass, to = MessTo, message = MessContent)
Tips: Random Sleep
- runif() {stats} + Sys.sleep() {base}
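A minimal sketch of the random, intra-fetch delay built from these two functions. `polite_sleep` and its default bounds of 2 to 8 seconds are assumptions for illustration, not values from the talk.

```r
# Hypothetical helper: pause for a uniformly random number of seconds
polite_sleep <- function(min_sec = 2, max_sec = 8) {
  delay <- runif(1, min = min_sec, max = max_sec)
  Sys.sleep(delay)
  invisible(delay)  # return the delay so it can be logged
}

# Short bounds here so the example finishes quickly
waited <- polite_sleep(0.01, 0.05)
cat("waited", round(waited, 3), "seconds\n")
```

Calling it between consecutive queries keeps the request rate low and avoids a fixed, machine-like rhythm.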
Tips: Data Purification
Summary
- Taiwan R User Group is GREAT! Join Us Now!
- Encoding - A General Solution In My Opinion
  - Sys.setlocale(category = 'LC_ALL', locale = 'C')
    ...
    Sys.setlocale(category = 'LC_ALL', locale = '')
- Keep Yourself Out of Trouble
  - Legal Issues
  - Stealthy Rules
Thanks for your attention

References

- COS 統計之都 (Capital of Statistics)
  - http://cos.name/cn/topic/17816 (an introduction to RCurl)
- Stack Overflow (a very good technical forum)
- inside-R (nicer-looking R documentation)
- Others
  - Webbots, Spiders, and Screen Scrapers: A Guide to Developing Internet Agents with PHP/CURL (Chinese edition)
Acknowledgement
MLDM Monday -- spideR Series
By ecleetw