Taiwan R User Group

MLDM Monday--spideR Series 

RCurl, XML, Encoding
Yi-Hsi Lee (EC)

Who Am I?

  • Education:
    • Ph.D. in Finance, NSYSU, Taiwan.
    • Ph.D. Candidate in MS&E, CSU, China.
  • Experience:
    • Assistant Professor (P/T), NKFUST NKU.
    • Managing Director, Chihfeng Financial Advisors Ltd.
    • Manager, Folion Financial Technology Inc.
    • Software Engineer, U-Vision Biotech Inc.
    • System Analyst, Taiyuan Ltd.
  • Programming Skills:
    • Basic, COBOL, C++/STL, SAS, Splus, Matlab, PHP, Mathematica, R, Lingo, EViews, Stata.

Why am I here?


                                                                         
"Come and talk about the Encoding problems your spideR ran into?"
"Ah! So excited, yet so afraid of getting hurt!"

Put on a helmet and (foolishly) charge ahead!

Agenda

  • Building Your Own spideR is VERY EASY!
    (Trust Me, You Can Make It! --> R Spirit)
  • Keeping Yourself out of Trouble
    • Legal Issues
    • Stealthy Rules
  • Encoding Issues and General Solution (In My Opinion)
    • OS / Software /  Data /  URL / Server
  • Some Other Useful Tips

Goal & Tools

  • Goal:
    • Scrape news from a Chinese website with R (spideR)
  • Tools: R
  • Platform OS:
    •  Windows / Mac / Linux (Ubuntu) 
  • Time (Coding): 7 Days
  • Time (Implementation): 1 Month (In fact, ...)
  • # of Queries: 1,538,992
  • # of Obs.: 25,828,673

    Keeping Yourself Out of Trouble

    • Legal Issues
      • Copyright
        • Assume “All Rights Reserved”
      • Trespass to Chattels !!!
        • Traditional Trespass: unauthorized use of real property (land or real estate) [News]
        • Trespass to Chattels: prevents or impairs an owner’s use of or access to personal property

    Keeping Yourself Out of Trouble

    • Stealthy Rules
      • Do It Right the First Time
      • Be Kind to Your Resources
        • Not Placing an Undue Load on a Target Server
        • Targeting Multiple Servers instead of Relying on a Single Source
      • Use Proxies (But ... Be Careful). You May Try Tor
      • Use Utility Computers, e.g., Statlab (National Center for High-performance Computing)
      • Run It During Busy Hours / Days
      • Behaving Like Humans
        • Use Random, Intra-fetch Delays
        • Don’t Run It at the Same Time Each Day
        • Limit Downloads to the Absolute Minimum

    SOP for spideR

    • Step 1: Analyzing a Form (Reverse Engineering)
    • Step 2: Automating Form Submission & Parsing Query Results
      • Form Emulation
      • Parsing Query Results (X) 
        • Parsing with Basic String Functions 
        • Parsing with Regular Expressions 
      • Encoding Problemsss
    [Advanced Issues] 
      • Fault-Tolerant Mechanism (X)
      • Managing Large Amounts of Data (X)
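
The two parsing approaches named in Step 2 can be sketched on a small example. The HTML fragment and all variable names below are made up for illustration; a real query result would come from getURL() or readLines().

```r
# Hypothetical HTML fragment standing in for a query result page.
html <- paste0(
  "<table><tr><td class=\"title\">Market opens higher</td>",
  "<td class=\"date\">2013-06-03</td></tr>",
  "<tr><td class=\"title\">Rates unchanged</td>",
  "<td class=\"date\">2013-06-04</td></tr></table>"
)

# Basic string functions: locate a marker, cut out what follows it,
# then drop everything from the next tag onward.
marker <- "class=\"title\">"
start <- as.integer(regexpr(marker, html, fixed = TRUE))
first_title <- substring(html, start + nchar(marker))
first_title <- sub("<.*$", "", first_title)

# Regular expressions: pull every date-like token in one pass.
dates <- regmatches(html, gregexpr("[0-9]{4}-[0-9]{2}-[0-9]{2}", html))[[1]]

# Regular expressions: match each title cell, then strip the marker.
title_cells <- regmatches(html, gregexpr("class=\"title\">[^<]*", html))[[1]]
titles <- sub("^class=\"title\">", "", title_cells)
```

For pages that are mostly tables, readHTMLTable() from the XML package is usually less fragile than hand-written patterns.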

    Step 1: Form Emulation

    • Form Elements
      • Data Fields (Text / Select / Radio / Checkbox / ...)
      • Method (GET / POST / Multipart Encoding)
      • Form Handler
      • Event Trigger (Submit / onClick / onMouseOut / ...)
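
Once the form elements are known, emulating a GET submission amounts to rebuilding the URL the browser would send. A minimal sketch, assuming a hypothetical form handler and hypothetical field names (replace them with whatever the form analysis reveals):

```r
# Hypothetical form handler and data fields (assumptions, not from a real site).
form_handler <- "http://www.example.com/search.php"
fields <- list(keyword = "interest rate", year = "2013", page = "1")

# GET: the data fields are appended to the handler URL as name=value pairs.
encode_fields <- function(fields) {
  values <- vapply(
    fields,
    function(v) utils::URLencode(as.character(v), reserved = TRUE),
    character(1)
  )
  paste(names(fields), values, sep = "=", collapse = "&")
}

query_url <- paste0(form_handler, "?", encode_fields(fields))

# POST: the same fields travel in the request body instead of the URL.
# With RCurl this would be along the lines of (not run here):
#   RCurl::postForm(form_handler, .params = fields)
```

This sketch only handles ASCII field values; values in other encodings need the conversion steps discussed in the encoding slides.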

      Chrome DevTools


      Webbots, Spiders, and Screen Scrapers
      Form Analyzer


      A look back at ...


      Encoding

      • OS (Windows / Mac / Linux / ...)
      • Software (RStudio / Revolution R / ...)
      • Data (File / Database)
      • URL
      • Server

      I Hate "Encoding"!!!

      我愛你
      UTF-8 (no BOM at the start of the file): 我愛你
      Big5: ??雿?
      GB2312: 鎴戞剾浣
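
The garbled lines above come from the same text being stored as different bytes and then decoded with the wrong label. A small sketch of that effect, using the slide's own string written with Unicode escapes so the script does not depend on the file's encoding; the encoding names "BIG5" and "GBK" are assumed to be available in the platform's iconv:

```r
original <- "\u6211\u611b\u4f60"   # the slide's three-character string, as UTF-8

# Same text, different bytes.
utf8_bytes <- charToRaw(enc2utf8(original))
big5_bytes <- iconv(original, from = "UTF-8", to = "BIG5", toRaw = TRUE)[[1]]

length(utf8_bytes)   # 9: three bytes per character
length(big5_bytes)   # 6: two bytes per character

# Decoding the Big5 bytes with the right label recovers the text.
round_trip <- iconv(list(big5_bytes), from = "BIG5", to = "UTF-8")

# Decoding the UTF-8 bytes as if they were GBK yields garbage;
# sub = "?" replaces any byte that cannot be decoded.
garbled <- iconv(list(utf8_bytes), from = "GBK", to = "UTF-8", sub = "?")
```

The exact garbage printed depends on the wrong encoding chosen, which is why the Big5 and GB2312 lines on the slide look different.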

      Encoding


      Encoding - Windows

      • OS: UTF-8 / Big5
      • Software: RStudio
        • Sys.getlocale()
          • [1]  "LC_COLLATE=Chinese (Traditional)_Taiwan.950;
            LC_CTYPE=Chinese (Traditional)_Taiwan.950;
            LC_MONETARY=Chinese (Traditional)_Taiwan.950;
            LC_NUMERIC=C;
            LC_TIME=Chinese (Traditional)_Taiwan.950"

      Encoding - Mac

      • OS: UTF-8 / Big5

      Encoding - Ubuntu

      • OS: ???
      • Software: RStudio
        • Sys.getlocale()
          • [1]  "LC_CTYPE=zh_TW.UTF-8;
            LC_NUMERIC=C;
            LC_TIME=en_SG.UTF-8;
            LC_COLLATE=zh_TW.UTF-8;
            LC_MONETARY=en_SG.UTF-8;
            LC_MESSAGES=zh_TW.UTF-8;
            LC_PAPER=C;
            LC_NAME=C;
            LC_ADDRESS=C;
            LC_TELEPHONE=C;
            LC_MEASUREMENT=en_SG.UTF-8;
            LC_IDENTIFICATION=C"

      Tip: General Encoding Solution
       (In My Opinion)

      • ## [At the start] switch to "C" first
        Sys.setlocale(category = "LC_ALL", locale = "C")     # Q: C?
      • ## In between, convert everything to the specific encoding each step needs
        myStrVec <- read.table("myFile.csv", sep=",", ... , encoding="UTF-8")
        myStr <- iconv(myStrVec[i], from="UTF-8", to="gb2312")
        myURL <- URLencode(myStr)
        myRes <- getURL(myURL, ... , .encoding='gb2312')
        myRes <- readHTMLTable(myRes, encoding='gb2312', which=7)
      • ## [At the end] switch back to ""
        Sys.setlocale(category = "LC_ALL", locale = "")        # "" restores the original (system default) setting
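
The iconv() and URLencode() steps in the middle of this recipe can be tried offline. The sketch below is not the recipe itself: it uses a hypothetical helper, percent_encode_as(), that works directly on raw bytes so the result does not depend on the current locale. The sample text is chosen to exist in GB2312; characters that exist only in Traditional Chinese may fail to convert, in which case the helper stops with an error.

```r
# Hypothetical helper (an assumption, not part of base R or RCurl):
# convert a UTF-8 string to the target encoding, then percent-encode its bytes.
percent_encode_as <- function(x, to = "GB2312") {
  raw_bytes <- iconv(enc2utf8(x), from = "UTF-8", to = to, toRaw = TRUE)[[1]]
  if (is.null(raw_bytes)) {
    stop("text cannot be represented in ", to)
  }
  # Bytes that may appear unescaped in a URL component.
  unreserved <- charToRaw(
    paste(c(LETTERS, letters, 0:9, "-", ".", "_", "~"), collapse = "")
  )
  encoded <- sprintf("%%%02X", as.integer(raw_bytes))
  keep <- raw_bytes %in% unreserved
  encoded[keep] <- rawToChar(raw_bytes[keep], multiple = TRUE)
  paste(encoded, collapse = "")
}

keyword <- "\u4e2d\u6587"                        # two Chinese characters
gb_query   <- percent_encode_as(keyword, "GB2312")
utf8_query <- percent_encode_as(keyword, "UTF-8")
```

The same keyword produces different query strings for a GB2312 server and a UTF-8 server, which is why the target server's encoding has to be known before the query is sent.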

        Key Functions

        • Sys.setlocale()    # {base}  Set the system encoding (locale)
        • read.table()       # {utils} Load data with a specific encoding
          • myStrVec <- read.table(myFile, ... , encoding='UTF-8')
        • Encoding()         # {base}  Read or set the declared encoding of strings after loading (the bytes are not converted)
        • iconv()            # {base}  Convert strings to another encoding after loading
          • myURL <- iconv(myStrVec[i], from='UTF-8', to='gb2312')

        Key Functions

        • URLencode()        # {utils} Percent-encode the URL string (before sending the query)
        • getURL()           # {RCurl} Receive the query result (after sending the query)
          • myRes <- getURL(myURL, ... , .encoding='gb2312')
        • readLines()        # {base}  Receive the query result (after sending the query)
          • myRes <- readLines(myURL, encoding='gb2312')
        • readHTMLTable()    # {XML}   Parse the query result and extract a specific table

        Change Encoding

        • Code

        ChangeEncoding <- function(Flag) {
          # Flag: 0 at the start (switch to "C"); 1 at the end (restore "")
          # Sys.info()[["sysname"]] reports macOS as "Darwin";
          # .Platform$OS.type is a coarser alternative ("windows" / "unix")
          if (Flag == 0) {
            switch(Sys.info()[["sysname"]],
              Windows = {Sys.setlocale(category = "LC_ALL", locale = "C")},
              Darwin  = {Sys.setlocale(category = "LC_ALL", locale = "C")},
              Linux   = {Sys.setlocale(category = "LC_ALL", locale = "C")})
          } else if (Flag == 1) {
            switch(Sys.info()[["sysname"]],
              Windows = {Sys.setlocale(category = "LC_ALL", locale = "")},
              Darwin  = {Sys.setlocale(category = "LC_ALL", locale = "")},
              Linux   = {Sys.setlocale(category = "LC_ALL", locale = "")})
          } else {
            # break is only valid inside a loop, so signal the bad argument instead
            stop("Flag must be 0 (start) or 1 (end)")
          }
        }
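
One risk with a start/end pair is that an error in between leaves the session stuck in the "C" locale. A possible variation, sketched here with a hypothetical wrapper named with_c_locale(), saves the current settings and restores them through on.exit(), so the restore also runs when the wrapped code fails. It restores the previously active locale rather than the system default "".

```r
# Hypothetical wrapper (not from the slides): run an expression under the
# "C" locale and always restore the previous locale afterwards.
with_c_locale <- function(expr) {
  cats <- c("LC_COLLATE", "LC_CTYPE", "LC_MONETARY", "LC_TIME")
  old <- vapply(cats, Sys.getlocale, character(1))
  # Registered before switching, so it runs even if expr raises an error.
  on.exit(for (k in cats) Sys.setlocale(k, old[[k]]), add = TRUE)
  for (k in cats) Sys.setlocale(k, "C")
  force(expr)   # lazy evaluation: expr is evaluated here, under "C"
}

before <- Sys.getlocale("LC_CTYPE")
inside <- with_c_locale(Sys.getlocale("LC_CTYPE"))
after  <- Sys.getlocale("LC_CTYPE")

# The locale is restored even when the wrapped code fails.
failed <- try(with_c_locale(stop("simulated fetch error")), silent = TRUE)
after_error <- Sys.getlocale("LC_CTYPE")
```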

        Tips: tryCatch

        • tryCatch({
               StatementA
          }, error = function(Err) {
               StatementB
               myLog <<- "E"
          }, warning = function(War) {
               StatementC
               myLog <<- "W"
          }, finally = {
               StatementD
          })
        • <<-     # assigns in an enclosing environment; acts like assign(... , envir = .GlobalEnv) when no enclosing function defines the variable
        • See also: 2013-05-20 MLDM Monday, R Data Import/Export (by Wush)
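
The template above can be filled in as follows. This is only a sketch: safe_fetch() is a hypothetical wrapper, and fetch_fun stands in for whatever actually downloads the page (for example getURL() or readLines()), so the example runs without a network connection.

```r
# Hypothetical wrapper: call fetch_fun(url) and record how it went.
safe_fetch <- function(fetch_fun, url) {
  status <- "OK"
  result <- tryCatch({
    fetch_fun(url)
  }, warning = function(w) {
    status <<- "W"        # <<- updates `status` in safe_fetch(), not globally
    NA_character_
  }, error = function(e) {
    status <<- "E"
    NA_character_
  })
  list(status = status, result = result)
}

# Stand-ins for real downloads (assumptions for illustration only).
ok_fetch   <- function(u) "<html>page</html>"
warn_fetch <- function(u) { warning("slow response"); "<html>page</html>" }
fail_fetch <- function(u) stop("connection timed out")

res_ok   <- safe_fetch(ok_fetch,   "http://www.example.com/1")
res_warn <- safe_fetch(warn_fetch, "http://www.example.com/2")
res_fail <- safe_fetch(fail_fetch, "http://www.example.com/3")
```

Returning the status alongside the result lets a long-running loop log failed queries and retry them later instead of stopping.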

        Tips: RMessenger

        • RMessenger (Powered by Wush)
        • Demo:
          • ## Google Talk settings
            • MessFrom     <- "abc@gmail.com"
            • MessFromPass <- "123"
            • MessTo       <- "efg@gmail.com"
          • ## Get the computer name
          • ## Send a Google Talk message
            • MessContent  <- paste("SysName:",SysName,"; NumDate:",d,"; \t",date(),"\n")
            • sendXMPPMessage(jid = MessFrom, password = MessFromPass, to = MessTo, message = MessContent)

        Tips: Random Sleep
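
A random pause between fetches can be written in a few lines. The helper name random_sleep() and its default bounds are assumptions for illustration; suitable delays depend on the target server.

```r
# Hypothetical helper: pause for a random time between min_sec and max_sec.
random_sleep <- function(min_sec = 2, max_sec = 10) {
  delay <- runif(1, min = min_sec, max = max_sec)
  Sys.sleep(delay)
  invisible(delay)   # return the delay so it can be logged
}

# Short bounds here so the example finishes quickly.
waited <- random_sleep(min_sec = 0.01, max_sec = 0.05)

# Typical use inside a fetch loop (not run here):
#   for (u in urls) {
#     page <- getURL(u)
#     random_sleep()
#   }
```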


        Tips: Data Purification

        Summary

        • Taiwan R User Group is GREAT! Join Us Now!
        • Encoding - A General Solution In My Opinion
        • Keep Yourself Out of Trouble 
          • Legal Issues
          • Stealthy Rules

        Thanks for your attention



        References

        Webbots, Spiders, and Screen Scrapers: A Guide to Developing Internet Agents with PHP/CURL

         (Chinese edition)

        Acknowledgement

        MLDM Monday -- spideR Series

        By ecleetw
