Those Years: Xiao Qiang (Little Cockroach), Please Don't Die

My Cockroach, Don't Die 

(web crawler/scraping experience sharing)



夏偉傑

twitter: @twxia

First





data is money






The Inspect Element tool is your best friend!


Intro


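Your crawler can't read data that the front end has encrypted!?
Your crawler runs into frames, JS, and other messy content!?
Your crawler got banned by the host!?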

how to crawl JS data

(... or crawl by JS)

http://phantomjs.org/


http://casperjs.org/
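
Why a headless browser such as PhantomJS or CasperJS is needed can be shown with a small sketch. It parses a page the way a plain HTTP crawler would, without running any JavaScript. The `PAGE` markup, the `price` id, and the helper names are made up for illustration.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not change nesting depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class TextById(HTMLParser):
    """Collect the static text inside the element with a given id."""

    def __init__(self, target_id):
        super().__init__()
        self.target_id = target_id
        self.depth = 0      # > 0 while we are inside the target element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        if self.depth:
            self.depth += 1
        elif dict(attrs).get("id") == self.target_id:
            self.depth = 1

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        if self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.chunks.append(data)


def static_text(html, target_id):
    """Return the text a non-JS crawler sees inside the element."""
    parser = TextById(target_id)
    parser.feed(html)
    parser.close()
    return "".join(parser.chunks).strip()


# Hypothetical page: the value is filled in by a script, not by the markup.
PAGE = """
<html><body>
  <div id="price"></div>
  <script>document.getElementById("price").textContent = "42";</script>
</body></html>
"""

if __name__ == "__main__":
    print(repr(static_text(PAGE, "price")))
```

The static parse finds an empty `div`, because the value only exists after a browser engine runs the script. That gap is what PhantomJS and CasperJS fill.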





What can you do

 after your crawler has been banned?





Solution


  • VPN
  • Proxy
  • VPS ( !? )
  • other

Proxy





Google "free proxy list" (UNSTABLE)

or pay for a proxy list (STABLE)
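
Since entries on a free list tend to die without warning, one way to use such a list is to try each proxy in turn. This is a minimal sketch using only Python's standard library. The function names, proxy URLs, and timeout are assumptions for illustration, not part of the talk.

```python
import http.client
import urllib.request


def build_proxy_opener(proxy_url):
    """Create an opener that routes HTTP and HTTPS requests through one proxy."""
    handler = urllib.request.ProxyHandler({"http": proxy_url, "https": proxy_url})
    return urllib.request.build_opener(handler)


def fetch_via_proxy(url, proxy_url, timeout=10):
    """Fetch a URL through the given proxy and return the raw response body."""
    opener = build_proxy_opener(proxy_url)
    with opener.open(url, timeout=timeout) as response:
        return response.read()


def fetch_with_fallback(url, proxy_urls, fetch=fetch_via_proxy):
    """Try each proxy in order and return the first successful response body."""
    errors = []
    for proxy_url in proxy_urls:
        try:
            return fetch(url, proxy_url)
        except (OSError, http.client.HTTPException) as exc:
            # URLError and socket timeouts are OSError subclasses.
            errors.append((proxy_url, exc))
    raise RuntimeError(f"all {len(proxy_urls)} proxies failed: {errors}")


if __name__ == "__main__":
    # Placeholder addresses from the documentation range; replace with real proxies.
    proxies = ["http://203.0.113.10:8080", "http://203.0.113.11:3128"]
    print(len(fetch_with_fallback("http://example.com/", proxies)))
```

The `fetch` parameter exists only so the fallback logic can be exercised without a network.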

Proxy

Be careful about the anonymity level
("X-Forwarded-For" can tell the host where you are from.)
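
A rough way to check a proxy's anonymity level is to send a request through it to an endpoint you control that echoes the headers it received, then look for your own address. The sketch below assumes you already have those echoed headers as a dict. The labels "transparent", "anonymous", and "elite" follow common proxy-list jargon, and the header set is an assumption, not an exhaustive list. It handles IPv4 only.

```python
import re

# Headers that proxies commonly add (assumed set, not exhaustive).
PROXY_HEADERS = ("x-forwarded-for", "forwarded", "via", "x-real-ip")
IPV4 = re.compile(r"\d{1,3}(?:\.\d{1,3}){3}")


def anonymity_level(headers_seen_by_host, real_ip):
    """Classify a proxy from the request headers the target host received."""
    # HTTP header names are case-insensitive, so normalize them first.
    headers = {name.lower(): value for name, value in headers_seen_by_host.items()}
    revealing = [headers[name] for name in PROXY_HEADERS if name in headers]
    for value in revealing:
        # Compare whole addresses so 1.2.3.4 does not match 11.2.3.45.
        if real_ip in IPV4.findall(value):
            return "transparent"   # the host can see where you are from
    return "anonymous" if revealing else "elite"


if __name__ == "__main__":
    seen = {"X-Forwarded-For": "198.51.100.7, 203.0.113.9"}
    print(anonymity_level(seen, "198.51.100.7"))
```

A proxy that comes out as "transparent" will not help against an IP ban, because the host still sees the original address.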

VPS(!?)



If you use a modern VPS,
you probably have some command-line tools.

I use EC2,
so...






ec2-associate-address
ec2-allocate-address
ec2-release-address
( ˙    ˇ   ˙ )
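
The three commands above could be chained to swap an instance's public IP after a ban: allocate a new address, associate it with the instance, then release the old one. This is a sketch under several assumptions: the legacy Amazon EC2 API tools are installed with credentials configured, the `-i` instance flag is as I recall it from those tools, the addresses are EC2-Classic style (VPC addresses use allocation IDs instead), and the instance ID and IPs are placeholders.

```python
import ipaddress
import subprocess


def parse_allocated_ip(output):
    """Return the first IPv4 address found in the tool's output, or None."""
    for token in output.split():
        try:
            return str(ipaddress.IPv4Address(token))
        except ValueError:
            continue
    return None


def rotation_commands(instance_id, new_ip, old_ip=None):
    """Build the commands that attach new_ip and then drop old_ip."""
    # Assumption: "-i" selects the instance in the legacy EC2 API tools.
    commands = [["ec2-associate-address", "-i", instance_id, new_ip]]
    if old_ip:
        commands.append(["ec2-release-address", old_ip])
    return commands


def rotate_ip(instance_id, old_ip=None, run=subprocess.check_output):
    """Allocate a fresh address, attach it, release the old one, return the new IP."""
    output = run(["ec2-allocate-address"])
    if isinstance(output, bytes):
        output = output.decode()
    new_ip = parse_allocated_ip(output)
    if new_ip is None:
        raise RuntimeError(f"no IPv4 address in output: {output!r}")
    for command in rotation_commands(instance_id, new_ip, old_ip):
        run(command)
    return new_ip


if __name__ == "__main__":
    # Hypothetical instance ID and old address.
    print(rotate_ip("i-0123456789abcdef0", old_ip="203.0.113.5"))
```

Releasing the old address only after the new one is attached keeps the instance reachable during the swap. The `run` parameter lets the sequence be dry-run with a stub instead of real AWS calls.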




other






summary





Analyze your data by thinking about how a browser would execute the page.

Patience is the foundation of everything.
