Python Web Crawler

Computer Science senior, 楊翔鈞

2020.05.07

requests

  • pip install requests
  • 用來發 HTTP request
    • e.g., GET, POST, PUT, DELETE, ...
  • synchronous

beautifulsoup4

  • pip install beautifulsoup4==4.8.0
  • parse HTML
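As a minimal sketch of what "parse HTML" looks like in practice, the snippet below turns an HTML string into a tree and reads the title and link targets. The HTML content and the `summarize` helper are hypothetical, and the parser is assumed to be Python's bundled `html.parser`.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML used only for illustration.
HTML = """
<html>
  <head><title>Demo page</title></head>
  <body>
    <h1>Hello</h1>
    <a href="/first">First link</a>
    <a href="/second">Second link</a>
  </body>
</html>
"""


def summarize(html):
    """Parse an HTML string and return its title and all link targets."""
    # "html.parser" ships with Python, so no extra parser install is needed.
    soup = BeautifulSoup(html, "html.parser")
    title = str(soup.title.string)
    links = [a["href"] for a in soup.find_all("a")]
    return title, links


title, links = summarize(HTML)
print(title)
print(links)
```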

[Figure: HTML element tree under body, with nodes div, div, table, span, and div. Attribute labels shown: class="ooxx" id="xxoo"; class="abc" id="div2"; class="abc" id="div1".]
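A tree of elements tagged with `class` and `id` like this one can be searched with `find` and `find_all`. The sketch below is only an illustration: the exact nesting of the elements is an assumption, since only the tag names and attribute values are known.

```python
from bs4 import BeautifulSoup

# Hypothetical nesting; only the tag names and attribute values are given.
HTML = """
<body>
  <div class="abc" id="div1">first</div>
  <div class="abc" id="div2">second</div>
  <div class="ooxx" id="xxoo">
    <table><tr><td>cell</td></tr></table>
    <span>note</span>
  </div>
</body>
"""

soup = BeautifulSoup(HTML, "html.parser")

# An id is unique in a page, so find() returns a single node.
first_text = soup.find(id="div1").get_text()

# A class can be shared; class_ has a trailing underscore because
# "class" is a reserved word in Python.
abc_ids = [div["id"] for div in soup.find_all("div", class_="abc")]

# recursive=False restricts the search to direct children.
container = soup.find("div", id="xxoo")
child_tags = [child.name for child in container.find_all(recursive=False)]

print(first_text, abc_ids, child_tags)
```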

lxml

  • apt-get install python-lxml

  • easy_install lxml

  • pip install lxml
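Once installed, lxml can also parse HTML on its own and query it with XPath. This is a rough sketch with a hypothetical page, not necessarily how the demo uses the library.

```python
from lxml import html

# Hypothetical page used only for illustration.
PAGE = """
<html><body>
  <div id="div1" class="abc">first</div>
  <div id="div2" class="abc">second</div>
</body></html>
"""

tree = html.fromstring(PAGE)

# XPath: the text of every div whose class attribute is exactly "abc".
texts = tree.xpath('//div[@class="abc"]/text()')

# XPath: select one element by id, then read all the text inside it.
second = tree.xpath('//div[@id="div2"]')[0].text_content()

print(texts, second)
```

The same parser can be handed to beautifulsoup4 with `BeautifulSoup(markup, "lxml")`.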

re
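The standard-library `re` module is handy for pulling small patterns out of text that has already been fetched. A minimal sketch, assuming a hypothetical HTML fragment and helper name:

```python
import re

# Hypothetical fragment used only for illustration.
HTML = '<a href="/page/1">One</a> <a href="https://example.com/2">Two</a> <img src="x.png">'

# Capture whatever sits between the quotes of an href attribute.
HREF_PATTERN = re.compile(r'href="([^"]+)"')


def extract_hrefs(html):
    """Return every href value found in the given HTML string."""
    return HREF_PATTERN.findall(html)


hrefs = extract_hrefs(HTML)

# search() returns the first match; group(1) is the captured page number.
page_number = re.search(r"/page/(\d+)", hrefs[0]).group(1)

print(hrefs, page_number)
```

Regular expressions work for simple, regular fragments like this one; for nested markup an HTML parser is usually the safer choice.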

scrapy

  • web crawling framework
  • asynchronous

Commonly used requests methods

  • requests.get(url, ...)
  • requests.post(url, data={'key': 'value', ...})
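To see what these two calls put on the wire, the sketch below builds the requests with `requests.Request(...).prepare()` without sending them. The URLs and the `build_get` / `build_post` helpers are hypothetical.

```python
import requests


def build_get(url, params):
    """Build (but do not send) a GET request with query parameters."""
    return requests.Request("GET", url, params=params).prepare()


def build_post(url, data):
    """Build (but do not send) a POST request with form-encoded data."""
    return requests.Request("POST", url, data=data).prepare()


# Hypothetical URLs used only for illustration.
get_req = build_get("https://example.com/search", {"q": "python"})
post_req = build_post("https://example.com/login", {"key": "value"})

# params end up in the query string of the URL.
print(get_req.method, get_req.url)

# data ends up form-encoded in the request body.
print(post_req.method, post_req.body, post_req.headers["Content-Type"])
```

Calling `requests.get(url, params=...)` or `requests.post(url, data=...)` sends the same kind of request and returns a response object whose `status_code` and `text` can then be inspected.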

Commonly used beautifulsoup4 methods

CSS selector
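In beautifulsoup4, CSS selectors are used through `select` (all matches) and `select_one` (first match). A small sketch with hypothetical markup:

```python
from bs4 import BeautifulSoup

# Hypothetical markup used only for illustration.
HTML = """
<div id="menu">
  <ul class="items">
    <li class="item active"><a href="/a">A</a></li>
    <li class="item"><a href="/b">B</a></li>
  </ul>
</div>
<div id="footer"><a href="/about">About</a></div>
"""

soup = BeautifulSoup(HTML, "html.parser")

# "#menu" matches by id, "li.item" by tag and class, ">" means direct child.
menu_links = [a["href"] for a in soup.select("#menu li.item > a")]

# select_one() returns only the first match (or None if nothing matches).
active_text = soup.select_one("li.item.active a").get_text()

# "[href]" keeps only anchors that actually carry an href attribute.
footer_links = [a["href"] for a in soup.select("div#footer a[href]")]

print(menu_links, active_text, footer_links)
```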

Commonly used beautifulsoup4 methods

  • node.decompose(): removes node and everything inside it from the tree
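A typical use of `decompose()` is to drop unwanted nodes before extracting text. The sketch below assumes a hypothetical page where `script` tags and paragraphs with class `ad` are the noise; `strip_nodes` is an illustrative helper name.

```python
from bs4 import BeautifulSoup

# Hypothetical markup used only for illustration.
HTML = (
    '<div id="post"><p>Keep this text.</p>'
    '<script>track()</script><p class="ad">Buy now</p></div>'
)


def strip_nodes(html):
    """Remove script tags and ad paragraphs, then return text and markup."""
    soup = BeautifulSoup(html, "html.parser")
    # decompose() deletes the node and all of its children from the tree.
    for node in soup.select("script, p.ad"):
        node.decompose()
    return soup.get_text(strip=True), str(soup)


text, markup = strip_nodes(HTML)
print(text)
print(markup)
```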

References

DEMO :))))

Crawler

By Yang Eugene

Club class on web crawling :) @ ccca