Speaker: 蔡孟軒
Date: 2020/06/10
Topic (English): Web Crawler
Think of the Internet as a spider web:
a crawler is the spider walking that web,
helping you collect data.
Applications: search engines, price comparison, review analysis, trend analysis
Send a request
Parse the data
Return the data
Data: get!!
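Those steps map directly onto requests plus BeautifulSoup. A minimal sketch (the `url` argument and the `<a>` selector are placeholders for illustration, not a real target):

```python
import requests
from bs4 import BeautifulSoup

def crawl(url):
    """Send a request, parse the HTML, return the link texts found."""
    response = requests.get(url)                        # step 1: send a request
    soup = BeautifulSoup(response.text, 'html.parser')  # step 2: parse the data
    return [a.text for a in soup.select('a')]           # step 3: return the data

# The parse/extract part, demonstrated on a static snippet (no network needed):
soup = BeautifulSoup('<html><body><a href="#">hello</a></body></html>', 'html.parser')
links = [a.text for a in soup.select('a')]
print(links)  # ['hello']
```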
$ pip install requests
$ pip install beautifulsoup4

Case 1: no anti-crawling measures
Example site: 開眼電影網 (atmovies)
Check the page status
import requests

url = 'http://www.atmovies.com.tw/movie/new/'
response = requests.get(url)
print(response)

Fetch the page
import requests

url = 'http://www.atmovies.com.tw/movie/new/'
response = requests.get(url)
print(response.text)

Hitting garbled text (mojibake)?
response.encoding = 'utf-8'

Method 1:
with open('data.html', 'w', encoding='utf-8') as file:
    file.write(response.text)

Method 2:
f = open('example.html', 'w', encoding='utf-8')
f.write(response.text)
f.close()
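What the mojibake actually is: requests guesses the text encoding from the Content-Type header, and when no charset is declared it falls back to ISO-8859-1, so the UTF-8 bytes of a Chinese page get decoded with the wrong codec. A network-free demonstration:

```python
raw = '開眼電影網'.encode('utf-8')  # the bytes a UTF-8 page really sends

wrong = raw.decode('iso-8859-1')   # wrong codec -> mojibake
right = raw.decode('utf-8')        # right codec -> readable text

print(wrong)
print(right)  # 開眼電影網
```

Setting `response.encoding = 'utf-8'` before reading `response.text`, as above, is the fix; when the charset is unknown, `response.apparent_encoding` (requests' guess from the page content itself) is a reasonable value to assign instead.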
Case 2: anti-crawling measures
Example site: 巴哈姆特 陰陽師 board (forum.gamer.com.tw)
Check the page status
import requests

url = 'https://forum.gamer.com.tw/B.php?bsn=31078'
response = requests.get(url)
print(response)

This response is probably an anti-crawling measure.
Try adding a user-agent header.
import requests

user_agent = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.131 Safari/537.36'}
url = 'https://forum.gamer.com.tw/B.php?bsn=31078'
response = requests.get(url, headers=user_agent)
print(response)

Where to find your user-agent: F12 → Network → Doc → reload → click a file under Name → Headers → Request Headers → user-agent
import requests

user_agent = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.131 Safari/537.36'}
url = 'https://forum.gamer.com.tw/B.php?bsn=31078'
response = requests.get(url, headers=user_agent)
print(response.text)

Step 1: parse the HTML
from bs4 import BeautifulSoup

html_doc = '''
<html>
<body>
<h1 id="title">Hello World</h1>
<a href="#" class="link">This is link1</a>
<a href="#link2" class="link">This is link2</a>
<a href="#link3" class="link1">This is link3</a>
</body>
</html>
'''
soup = BeautifulSoup(html_doc, 'html.parser')
# html.parser is the parser that turns the HTML text into a tree

1. soup.select()
print(soup.select('html'))    # by tag
print(soup.select('h1'))      # by tag
print(soup.select('a'))       # by tag
print(soup.select('#title'))  # by id
print(soup.select('.link'))   # by class
# scoped selection: <a> tags whose class is "link"
print(soup.select('a.link'))

2. soup.find()
print(soup.find(id='title'))     # the element whose id is "title"
print(soup.find(href='#link2'))  # the element whose href is "#link2"
print(soup.find_all('a'))        # every <a> tag

Use a for loop to walk the list of results:

data = soup.select('a')
print(data)
# [<a class="link" href="#">This is link1</a>, <a class="link" href="#link2">This is link2</a>, <a class="link1" href="#link3">This is link3</a>]
for i in data:
    print(i.text)         # the text content
    print(i['href'])      # get the link (way 1)
    print(i.get('href'))  # get the link (way 2)
LAB-01 goal: from 開眼電影網's new-release page, get each movie's title and a full clickable URL.

import requests
from bs4 import BeautifulSoup

url = 'http://www.atmovies.com.tw/movie/new/'
response = requests.get(url)
response.encoding = 'utf-8'
soup = BeautifulSoup(response.text, 'html.parser')
filmTitle = soup.select('div.filmTitle a')
for i in filmTitle:
    print(i.text)                                    # movie title
    print('http://www.atmovies.com.tw' + i['href'])  # full URL

LAB-02 goal: following LAB-01, get the thread titles and URLs from 巴哈姆特's 陰陽師 board.
import requests
from bs4 import BeautifulSoup

user_agent = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.61 Safari/537.36'}
url = 'https://forum.gamer.com.tw/B.php?bsn=31078'
response = requests.get(url, headers=user_agent)
soup = BeautifulSoup(response.text, 'html.parser')
# select the <a> elements so ['href'] works; adjust the selector to the site's current markup
Title = soup.select('div.b-list__tile p a')
for i in Title:
    print(i.text)
    print('https://forum.gamer.com.tw/' + i['href'])
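A side note on building the full URL: both labs concatenate strings, which only works when the href is root-relative and the slashes happen to line up. The standard library's urljoin covers the other cases too; the href values below are made-up examples, not real pages:

```python
from urllib.parse import urljoin

base = 'http://www.atmovies.com.tw/movie/new/'

# a root-relative href (hypothetical path) resolves against the host:
print(urljoin(base, '/movie/fxxx12345678/'))
# -> http://www.atmovies.com.tw/movie/fxxx12345678/

# an already-absolute href passes through unchanged:
print(urljoin(base, 'http://example.com/page'))
# -> http://example.com/page
```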