Web Crawler
Lecturer: Chia
Date: Jun. 1st, 2020
OUTLINE
- What is Web crawler?
- Install packages
- Get text
- Get pictures
What is a Web crawler?
Writing code is too much trouble! I refuse to write any, hmph~
Then... what other options are there?
- Open the web page
- Copy the page content
- Paste it into Word / Excel
- Repeat
A web crawler is a program that fetches web page content automatically.
Install packages
$ pip install requests
$ pip install beautifulsoup4
- requests: sends a request to the target page's server
- beautifulsoup4: parses the returned HyperText Markup Language (HTML)
[Diagram] The client uses requests to send a request (POST/GET) to the server; the server responds with HTML, which beautifulsoup4 then parses.
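To make that round trip concrete, here is a minimal sketch tying the two packages together (example.com is a stand-in URL, not from the slides):

import requests
from bs4 import BeautifulSoup

# Client: send a GET request to the server
response = requests.get('https://www.example.com')
# The server answered with HTML; parse it with beautifulsoup4
soup = BeautifulSoup(response.text, 'html.parser')
print(soup.title.text)  # prints the page's <title> text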
Get text
Before building the crawler...
import requests
url = 'https://www.cw.com.tw/today'
# Press F12 - Network tab - look at the GET request
response = requests.get(url)
print(response)
Check the status code the site returns
to confirm whether the request succeeded.
Common HTTP status codes: 200 (OK), 301 (Moved Permanently), 400 (Bad Request), 403 (Forbidden), 404 (Not Found), 500 (Internal Server Error).
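A minimal sketch of acting on the status code before parsing (reusing the url from the snippet above):

import requests
url = 'https://www.cw.com.tw/today'
response = requests.get(url)
# response.status_code holds the numeric HTTP status
if response.status_code == 200:
    print('OK, safe to parse')
else:
    print('Request failed:', response.status_code)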
fake_browser = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36'
}
response = requests.get(url, headers=fake_browser)
=> disguise the request as a regular browser
import requests
url = 'https://www.cw.com.tw/today'
response = requests.get(url)
print(response.text)  # instead of print(response)
# Save the page content to a file
f = open('example.html', 'w', encoding='utf-8')
f.write(response.text)
f.close()
print('Success!')
# Alternative using a context manager:
# with open('example1.html', 'w', encoding='utf-8') as f:
#     f.write(response.text)
Get text
(1) Fetch the whole page
import requests
from bs4 import BeautifulSoup  # newly added
url = 'https://www.cw.com.tw/today'
response = requests.get(url)
# Parse the HTML with BeautifulSoup
soup = BeautifulSoup(response.text, 'html.parser')
Get text
(2) Search for specific HTML tags
# ...
# Get the text inside every hyperlink <a>
a_tags = soup.find_all('a')
for tag in a_tags:
    print(tag.text)
# ...
# Get the text inside <title>
title = soup.find('title')
print(title.text)
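Besides .text, a tag's attributes are also accessible; a small sketch (not from the slides) printing each link's target:

# tag.get('href') returns the attribute value, or None if it is missing
for tag in soup.find_all('a'):
    print(tag.get('href'))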
Get text
(3) Search using CSS selectors
import requests
from bs4 import BeautifulSoup  # newly added
url = 'https://www.cw.com.tw/today'
response = requests.get(url)
# Parse the HTML with BeautifulSoup
soup = BeautifulSoup(response.text, 'html.parser')
# ...
# Find the <a class="channelTitle"> elements inside id="item1"
tag_list = soup.select('#item1 a.channelTitle')
print(tag_list)
# Find the <a> elements under <h3> inside id="item1"
title_list = soup.select('#item1 h3 a')
print(title_list)
# ... print them another way
tag_list = soup.select('#item1 a.channelTitle')
title_list = soup.select('#item1 h3 a')
for i in range(len(tag_list)):
    print(tag_list[i].text, title_list[i].text)
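The same pairing loop can also be written with zip, which is the more idiomatic form (assuming, as on this page, the two lists match one-to-one):

for channel, title in zip(tag_list, title_list):
    print(channel.text, title.text)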
Get pictures
(1) Crawl images from Google and save them
from bs4 import BeautifulSoup
import requests
url = "https://www.google.com.tw/search?q=%E7%8B%97&tbm=isch&ved=2ahUKEwj21fuat97pAhUF0pQKHR51CuoQ2-cCegQIABAA&oq=%E7%8B%97&gs_lcp=CgNpbWcQDFAAWABglztoAHAAeACAAQCIAQCSAQCYAQCqAQtnd3Mtd2l6LWltZw&sclient=img&ei=Is_TXva8GYWk0wSe6qnQDg"
response = requests.get(url)
soup = BeautifulSoup(response.content, 'html.parser')
items = soup.find_all('img')
import os
# ...
folder_path = './photo/'
if not os.path.exists(folder_path):
    os.makedirs(folder_path)
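On Python 3 the existence check can be folded into a single call, one possible simplification:

import os
# exist_ok=True makes makedirs a no-op if the folder already exists
os.makedirs('./photo/', exist_ok=True)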
# ...
# Set how many images to download (photolimit)
photolimit = 10
for index, item in enumerate(items):
    if index == 0:
        pass  # the first <img> is typically not a search result, so skip it
    elif index <= photolimit:
        img_src = requests.get(item['src']).content
        img_name = folder_path + str(index) + '.png'
        with open(img_name, 'wb') as f:  # write the image out as bytes
            f.write(img_src)
            print('Image %d' % (index))
- Note: enumerate (see the sketch below)
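A minimal example of enumerate, which yields (index, item) pairs:

for index, item in enumerate(['cat', 'dog', 'bird']):
    print(index, item)
# 0 cat
# 1 dog
# 2 bird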
Save the novel's chapter title and content
to novel.txt
- Novel (芸汐傳.天才小毒妃): https://www.ck101.org/293/293983/51812965.html
Hints:
1. First check the status the page returns
2. Use a CSS selector to grab the text inside the class yuedu_zhengwen
3. Write the text to novel.txt and save the file
import requests
from bs4 import BeautifulSoup
url = 'https://www.ck101.org/293/293983/51812965.html'
response = requests.get(url)  # pass headers (e.g. the fake_browser dict above) if the site restricts access
#print(response)
soup = BeautifulSoup(response.content, 'html.parser')
title = soup.find('title').text
items = soup.select('.yuedu_zhengwen')
file_name = './novel.txt'
file = open(file_name, 'w', encoding = 'utf8')
file.write(title + '\n' + '\n')
for i in items:
    file.write(i.text + '\n')
file.close()
import requests
from bs4 import BeautifulSoup
url = 'https://www.ck101.org/293/293983/51812965.html'
response = requests.get(url)
soup = BeautifulSoup(response.content, 'html.parser')
title = soup.find('title').text
items = soup.select('.yuedu_zhengwen')
file_name = './Lab.txt'
file = open(file_name, 'w', encoding = 'utf8')
file.write(title + '\n' + '\n')
for i in items:
    # Strip unwanted text, e.g. 小÷說◎網 】,♂小÷說◎網 】,
    i = str(i).replace('小÷說◎網 】,♂小÷說◎網 】,','').replace('<br/>','').replace('<div class="yuedu_zhengwen" id="content">','').replace('</div>','')
    file.write(i + '\n')
    print(i + '\n')
file.close()
- Strip unwanted text, e.g. 小÷說◎網 】,♂小÷說◎網 】,
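An alternative sketch: BeautifulSoup's get_text() strips the HTML markup for you, so only the site's watermark string still needs replacing:

for i in items:
    # get_text() removes <br/>, the wrapping <div>, and all other tags
    text = i.get_text().replace('小÷說◎網 】,♂小÷說◎網 】,', '')
    file.write(text + '\n')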
import requests
from bs4 import BeautifulSoup
import os
url = 'https://www.ck101.org/293/293983/51812965.html'
response = requests.get(url)
soup = BeautifulSoup(response.content, 'html.parser')
title = soup.find('title').text
items = soup.select('.yuedu_zhengwen')
# Check whether the folder exists
folder_path = './novel/'
if not os.path.exists(folder_path):
    os.makedirs(folder_path)  # create the folder
# Create Lab.txt under the novel folder
file_name = './novel/Lab.txt'
file = open(file_name, 'w', encoding = 'utf8')
file.write(title + '\n' + '\n')
for i in items:
    i = str(i).replace('小÷說◎網 】,♂小÷說◎網 】,','').replace('<br/>','').replace('<div class="yuedu_zhengwen" id="content">','').replace('</div>','')
    file.write(i + '\n')
    print(i + '\n')
file.close()
- Check whether the novel folder exists, then create Lab.txt inside it and save.
import requests
from bs4 import BeautifulSoup
import os
index = 0
# Check whether the folder exists
folder_path = './novel/'
if not os.path.exists(folder_path):
    os.makedirs(folder_path)  # create the folder
def get_content(url):
    response = requests.get(url)
    soup = BeautifulSoup(response.content, 'html.parser')
    title = soup.find('title').text
    items = soup.select('.yuedu_zhengwen')
    file_write(items, title)
def file_write(items, title):
    global index
    file_name = './novel/Lab' + str(index + 1) + '.txt'
    f = open(file_name, 'w', encoding='utf-8')
    f.write(title + '\n' + '\n')
    for i in items:
        i = str(i).replace('小÷說◎網 】,♂小÷說◎網 】,','').replace('<br/>','').replace('<div class="yuedu_zhengwen" id="content">','').replace('</div>','')
        f.write(i + '\n')
        #print(i + '\n')
    f.close()  # close the file
    index += 1
    print('Done!')
# Crawl multiple chapter pages automatically and save each one
url = ['https://www.ck101.org/293/293983/5181{}.html'.format(str(i)) for i in range(2965, 4330)]
for u in url:
    get_content(u)
- Crawl multiple chapter pages automatically and save each one.
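With over a thousand chapter URLs in the list, it is worth pausing between requests; a politeness sketch (the one-second delay is an assumption, not from the slides):

import time
for u in url:
    get_content(u)
    time.sleep(1)  # assumed delay: pause so the site is not flooded with requests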
Thank you for listening.
Python Crawler
By BessyHuang
2020-06-01: course materials updated