Python Web Scraping Basics

Speaker: 蔡孟軒

Date: 2020/06/10

Outline

Introduction to web crawlers
Learning the packages
Hands-on practice

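Introduction to web crawlers

What is a web crawler?

Term: web crawler (爬蟲 in Chinese)

Think of the Internet as one large spider web

A crawler is the spider that moves across that web

It collects data for you

Applications

Search engines, price comparison, review analysis, trend analysis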

Example

Required packages

requests: sends requests to a web server
BeautifulSoup4: parses HTML

Steps

Send the request

Parse the data

Return the data

Data acquired!
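The steps above can be sketched roughly as follows. This is a minimal illustration, not the speaker's code: the function names `fetch_html` and `parse_links` and the inline sample HTML are assumptions made for this example, and the network call is defined but not executed here.

```python
import requests
from bs4 import BeautifulSoup


def fetch_html(url):
    """Step 1: send a GET request and return the page source as text."""
    response = requests.get(url, timeout=10)
    response.raise_for_status()  # raise an error for 4xx/5xx responses
    return response.text


def parse_links(html):
    """Steps 2 and 3: parse the HTML and return (text, href) pairs."""
    soup = BeautifulSoup(html, 'html.parser')
    return [(a.text, a.get('href')) for a in soup.select('a')]


# Hypothetical sample page, used here instead of a live request.
sample_html = (
    '<html><body>'
    '<a href="/movie/1">Movie One</a>'
    '<a href="/movie/2">Movie Two</a>'
    '</body></html>'
)
links = parse_links(sample_html)
print(links)
```

Against a real site, `parse_links(fetch_html(url))` would chain the two functions; the selector `'a'` would normally be narrowed to the elements you actually want.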

Learning the packages

Installing the packages

requests

$ pip install requests

BeautifulSoup4

$ pip install beautifulsoup4

Using requests

Case 1: no anti-scraping measures

Target site: 開眼電影網 (atmovies.com.tw)

Step 1

Check the page status

import requests

url = 'http://www.atmovies.com.tw/movie/new/'
response = requests.get(url)

print(response)    # shows the status code, in the form <Response [200]>

Common status codes

200 - OK (request succeeded)
403 - Forbidden (often a sign of anti-scraping measures)
404 - Not Found
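Python's standard library knows the official name of every status code, which helps when you meet an unfamiliar one. A small sketch using `http.HTTPStatus`; the list of codes is simply the three from this slide plus 503, which shows up later in the deck.

```python
from http import HTTPStatus

# Look up the standard reason phrase for a few status codes.
codes = [200, 403, 404, 503]
phrases = {code: HTTPStatus(code).phrase for code in codes}

for code, phrase in phrases.items():
    print(code, '-', phrase)
```

With requests, the numeric code of a response is available as `response.status_code`, so it can be compared against these values directly.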

Step 2

Fetch the page

import requests
    
url = 'http://www.atmovies.com.tw/movie/new/'
response = requests.get(url)

print(response.text)

Seeing garbled text? Set the encoding explicitly:

response.encoding = 'utf-8'
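Garbled text usually means the bytes were decoded with the wrong encoding. requests falls back to ISO-8859-1 when a server sends a text content type without declaring a charset, which mangles UTF-8 pages. The sketch below reproduces the effect with plain bytes and no network access; the sample string is chosen only for illustration.

```python
# The same bytes decoded two ways: the wrong codec produces garbled text.
original = '本週新片'                    # sample text, chosen for illustration
raw_bytes = original.encode('utf-8')     # what a UTF-8 server would send

wrong = raw_bytes.decode('iso-8859-1')   # garbled: every byte becomes one character
right = raw_bytes.decode('utf-8')        # decoded correctly

print(repr(wrong))
print(right)
```

Setting `response.encoding = 'utf-8'` before reading `response.text` tells requests which codec to use. If you do not know the encoding, `response.apparent_encoding` offers a guess based on the content.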

Writing to a file

Method 1

with open("data.html", "w", encoding='utf-8') as file:
    file.write(response.text)

Method 2

f = open('example.html', 'w', encoding='utf-8')
f.write(response.text)
f.close()

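Using requests

Case 2: anti-scraping measures in place

Target site: 巴哈姆特陰陽師哈拉版 (the Onmyoji board on the Bahamut forum)

Step 1

Check the page status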

import requests

url = 'https://forum.gamer.com.tw/B.php?bsn=31078'
response = requests.get(url)

print(response)

The status is 503 (Service Unavailable)

This may be an anti-scraping measure

Try adding a user-agent header and see whether it works

import requests
    
user_agent = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.131 Safari/537.36'}
url = 'https://forum.gamer.com.tw/B.php?bsn=31078'
response = requests.get(url, headers = user_agent)
    
print(response)

To find your browser's user-agent: press F12 → Network → Doc → reload the page → click the file under Name → Headers → look for user-agent under Request Headers
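One likely reason a site rejects a script is that requests identifies itself honestly by default. The sketch below prints that default value and then builds, without sending, a request with a custom header to show what would go out instead. The shortened user-agent string is a placeholder, not a real browser value.

```python
import requests

# The User-Agent that requests sends when none is supplied.
default_ua = requests.utils.default_user_agent()
print(default_ua)

# Build (but do not send) a request with a custom header to see what would go out.
headers = {'user-agent': 'Mozilla/5.0 (example placeholder)'}
session = requests.Session()
prepared = session.prepare_request(
    requests.Request('GET', 'https://forum.gamer.com.tw/B.php?bsn=31078', headers=headers)
)
print(prepared.headers['User-Agent'])
```

Header names are case-insensitive, so the lowercase `'user-agent'` key replaces the default `User-Agent` entry.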

Fetch the page

import requests
    
user_agent = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.131 Safari/537.36'}
url = 'https://forum.gamer.com.tw/B.php?bsn=31078'
response = requests.get(url, headers = user_agent)
    
print(response.text)

Combining with BeautifulSoup4

Step 1: Parse the HTML

from bs4 import BeautifulSoup

html_doc = '<html> \
    <body> \
        <h1 id="title">Hello World</h1> \
        <a href="#" class="link">This is link1</a> \
        <a href="#link2" class="link">This is link2</a> \
        <a href="#link3" class="link1">This is link3</a> \
    </body> \
</html> '

soup = BeautifulSoup(html_doc, 'html.parser')
# 'html.parser' selects Python's built-in HTML parser

Step 2: Extract the data you need

1. soup.select()

print(soup.select('html'))    # by tag
print(soup.select('h1'))    # by tag
print(soup.select('a'))    # by tag
print(soup.select('#title'))    # by id
print(soup.select('.link'))    # by class

# narrow the selection to <a> tags whose class is "link"
print(soup.select('a.link'))

2. soup.find()

print(soup.find(id='title'))    # the element whose id is "title"
print(soup.find(href='#link2'))    # the element whose href is "#link2"
print(soup.find_all('a'))    # every <a> element
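A few behaviors of these methods are worth seeing side by side: `find` returns only the first match, or `None` when nothing matches, while `find_all` returns a list. The sketch below reuses the sample HTML from Step 1 in compact form; the variable names are chosen for illustration.

```python
from bs4 import BeautifulSoup

html_doc = (
    '<html><body>'
    '<h1 id="title">Hello World</h1>'
    '<a href="#" class="link">This is link1</a>'
    '<a href="#link2" class="link">This is link2</a>'
    '<a href="#link3" class="link1">This is link3</a>'
    '</body></html>'
)
soup = BeautifulSoup(html_doc, 'html.parser')

first_link = soup.find('a')                       # first match only
missing = soup.find('h2')                         # nothing matches
exact_class = soup.find_all('a', class_='link')   # class "link1" is not included
title = soup.select_one('#title')                 # CSS selector, first match

print(first_link.text)
print(missing)
print(len(exact_class))
print(title.text)
```

Because `find` can return `None`, checking the result before reading `.text` avoids an `AttributeError` on pages that lack the element.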

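Special note: getting the href of an <a> tag

data = soup.select('a')
print(data)
# [<a class="link" href="#">This is link1</a>, <a class="link" href="#link2">This is link2</a>, <a class="link1" href="#link3">This is link3</a>]

for i in data:
    print(i.text)    # text content
    print(i['href'])    # get the link (method 1)
    print(i.get('href'))    # get the link (method 2)

Use a for loop to go through each element in the list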

Hands-on practice

LAB-01

Goal: from the "this week's new releases" page of 開眼電影網, get each movie's title and its full, clickable URL

import requests
from bs4 import BeautifulSoup
    
url = 'http://www.atmovies.com.tw/movie/new/'
response = requests.get(url)
response.encoding = 'utf-8'

soup = BeautifulSoup(response.text, 'html.parser')
filmTitle = soup.select('div.filmTitle a')

for i in filmTitle:
    print(i.text)    # print the movie title
    print('http://www.atmovies.com.tw' + i['href'])    # print the full URL
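String concatenation works when every href has the same shape, but `urllib.parse.urljoin` from the standard library handles both root-relative and page-relative links. A short sketch: the page URLs are the ones used in this deck, while the href values are made-up examples, not links taken from the real sites.

```python
from urllib.parse import urljoin

# Page URLs from this deck; the href values below are made-up examples.
movie_page = 'http://www.atmovies.com.tw/movie/new/'
forum_page = 'https://forum.gamer.com.tw/B.php?bsn=31078'

absolute_href = urljoin(movie_page, '/movie/example-id/')      # href starting with "/"
relative_href = urljoin(forum_page, 'C.php?bsn=31078&snA=1')   # href relative to the page

print(absolute_href)
print(relative_href)
```

Passing `response.url` as the first argument keeps the join correct even if the request was redirected.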

LAB-02

Goal: using the same approach as LAB-01, find the post titles and URLs on 巴哈姆特陰陽師哈拉版

import requests
from bs4 import BeautifulSoup

# Send a browser-like user-agent so the request looks like it comes from a browser
user_agent = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.61 Safari/537.36'}
url = 'https://forum.gamer.com.tw/B.php?bsn=31078'
response = requests.get(url, headers=user_agent)

soup = BeautifulSoup(response.text, 'html.parser')
titles = soup.select('div.b-list__tile p')

for i in titles:
    href = i.get('href')
    if href is None:    # skip <p> elements that carry no link
        continue
    print(i.text)    # print the post title
    print('https://forum.gamer.com.tw/' + href)    # print the full URL

Thanks for listening