Back in the Day: Xiao Qiang, Don't Die!
My Cockroach, Don't Die
(web crawler/scraping experience sharing)
夏偉傑
twitter: @twxia
The Inspect Element tool is your best friend!
Intro
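Your crawler can't read data that is encrypted on the front end!?
Your crawler runs into frames, JS, and other messy content!?
Your crawler got banned by the host!?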
How to crawl JS data
(... or crawl with JS)
What can you do
after your crawler has been banned?
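One common case of "JS data" is a page that ships its data as an object literal inside an inline script. The sketch below is only an illustration of that case, not the method from the talk: the helper name `extract_js_object` and the variable name `pageData` are hypothetical, and it assumes the literal is strict JSON (double-quoted keys, no trailing commas, no function calls).

```python
import json
import re


def extract_js_object(html, var_name):
    """Return the JSON value assigned to `var_name` in an inline script, or None."""
    # Find "var_name =" and skip the whitespace after the equals sign.
    pattern = re.compile(r"\b" + re.escape(var_name) + r"\s*=\s*")
    match = pattern.search(html)
    if match is None:
        return None
    # raw_decode parses exactly one JSON value starting at the given index
    # and ignores whatever JavaScript follows it (for example the ";").
    value, _end = json.JSONDecoder().raw_decode(html, match.end())
    return value


html = '<script>var pageData = {"items": [1, 2], "next": "/p/2"};</script>'
print(extract_js_object(html, "pageData"))
```

If the literal is not strict JSON, or the data is built at runtime, this approach fails and running the page's JS in a real or headless browser is the more dependable route.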
Solution
- VPN
- Proxy
- VPS ( !? )
- other
Proxy
Google "free proxy list" (UNSTABLE)
or pay for a proxy list (STABLE)
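A minimal sketch of putting a proxy list to use, assuming Python's standard `urllib` and a list of HTTP proxies. The helper names `proxy_openers` and `fetch`, and the retry count, are illustrative choices; the proxy addresses in the usage note are placeholders from a documentation-only IP range.

```python
import itertools
import urllib.request


def proxy_openers(proxy_urls):
    """Yield urllib openers forever, cycling through the given proxies."""
    for proxy_url in itertools.cycle(proxy_urls):
        handler = urllib.request.ProxyHandler(
            {"http": proxy_url, "https": proxy_url}
        )
        yield urllib.request.build_opener(handler)


def fetch(url, openers, attempts=3, timeout=10):
    """Try the URL through successive proxies until one of them answers."""
    last_error = None
    for _ in range(attempts):
        opener = next(openers)
        try:
            with opener.open(url, timeout=timeout) as response:
                return response.read()
        except OSError as error:
            # URLError and socket timeouts are OSError subclasses;
            # move on to the next proxy in the rotation.
            last_error = error
    raise last_error
```

Usage would look like `fetch("http://example.com/", proxy_openers(["http://192.0.2.10:8080", "http://192.0.2.11:3128"]))`. With a free list, expect many attempts to fail, which is why the retry loop moves to the next proxy instead of giving up.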
Proxy
Be careful about the anonymity level
(the "X-Forwarded-For" header will tell the host where you are coming from.)
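One way to check a proxy's anonymity level is to send a request through it to a server you control (or a header-echo service) and look at which headers arrived. The sketch below assumes you already have those received headers as a dict. The labels "transparent", "anonymous", and "elite" follow common proxy-list terminology rather than anything in the slides, and the header set checked is an assumption, not an exhaustive list.

```python
def classify_proxy(seen_headers, real_ip):
    """Classify a proxy from the request headers the target server received."""
    # Header names are case-insensitive, so normalize them first.
    headers = {name.lower(): value for name, value in seen_headers.items()}
    # Headers that proxies commonly add and that reveal proxy use.
    revealing = ("x-forwarded-for", "x-real-ip", "via", "forwarded")
    present = [headers[name] for name in revealing if name in headers]
    if any(real_ip in value for value in present):
        return "transparent"  # the host can see your real address
    if present:
        return "anonymous"    # the host knows a proxy is in use, but not who you are
    return "elite"            # no proxy-related headers were forwarded


print(classify_proxy({"X-Forwarded-For": "198.51.100.4"}, "198.51.100.4"))
```

For crawling a host that has already banned you, a "transparent" result means the proxy is not helping: the host still sees the banned address.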
VPS(!?)
If you use a modern VPS,
you probably have some command-line tools.
I use EC2
so...
ec2-associate-address
ec2-allocate-address
ec2-release-address
( ˙ ˇ ˙ )
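The three commands above can be chained to give an instance a fresh public IP. This is a rough sketch, not the author's script: the function `rotate_elastic_ip` is hypothetical, it assumes the legacy EC2 API command-line tools are installed and configured, that EC2-Classic style arguments apply (an instance ID plus a plain IP address), and that the output of `ec2-allocate-address` contains the new address. Allocated addresses may be billed, so check your account's pricing first.

```python
import re
import subprocess

# Matches the first IPv4-looking token in the CLI output.
IPV4 = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")


def run_cli(argv):
    """Run a command and return its standard output as text."""
    result = subprocess.run(argv, check=True, capture_output=True, text=True)
    return result.stdout


def rotate_elastic_ip(instance_id, old_ip=None, run=run_cli):
    """Allocate a new Elastic IP, attach it to the instance, release the old one."""
    # 1. Allocate a fresh address (assumption: the output contains the new IP).
    match = IPV4.search(run(["ec2-allocate-address"]))
    if match is None:
        raise RuntimeError("no IP address found in ec2-allocate-address output")
    new_ip = match.group(0)
    # 2. Point the new address at the crawler instance.
    run(["ec2-associate-address", "-i", instance_id, new_ip])
    # 3. Give back the banned address so it is no longer held.
    if old_ip:
        run(["ec2-release-address", old_ip])
    return new_ip
```

The `run` parameter exists only so the sequence can be dry-run with a fake runner before it touches a real account.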
Summary
Analyze your data by thinking the way a browser executes the page
Patience is the foundation of everything