Web Crawler

Lecturer: Chia

Date: Jun. 1st, 2020


OUTLINE

  • What is a Web crawler?
  • Install packages
  • Get text
  • Get pictures

What is a Web crawler?

Writing a program is too much trouble! I can't be bothered, hmph~

So... what other way is there?

  1. Open the web page
  2. Copy the page content
  3. Paste it into Word / Excel
  4. Repeat

A web crawler is a program that automatically fetches web page content.

Install packages

$ pip install requests
$ pip install beautifulsoup4
  • requests
    • sends HTTP requests to the target page's server
  • beautifulsoup4
    • parses HyperText Markup Language (HTML)
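
A quick sanity check after installing: import both packages and print their versions. If an import raises ModuleNotFoundError, the corresponding pip install did not succeed.

import requests
import bs4

# both packages expose a __version__ string
print(requests.__version__)
print(bs4.__version__)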

[Diagram] Client → Server: send a request (POST/GET) using requests
Server → Client: respond by returning HTML, which beautifulsoup4 then parses

Get text

Before we crawl...

import requests

url = 'https://www.cw.com.tw/today'

# Press F12 → Network to see that this page is fetched with a GET request
response = requests.get(url)
print(response)

Check the status code the site returns to confirm whether the request succeeded.

Common HTTP status codes

200 - The client's request succeeded.

403 - Forbidden (the site is blocking the crawler).

404 - Not found.
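
In code, the numeric code is available as response.status_code, so the cases above can be handled explicitly; a minimal sketch:

import requests

url = 'https://www.cw.com.tw/today'
response = requests.get(url)

if response.status_code == 200:
    print('OK: the request succeeded')
elif response.status_code == 403:
    print('Forbidden: the site may be blocking crawlers')
elif response.status_code == 404:
    print('Not found: check the URL')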

fake_browser = {
  'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36'
}

response = requests.get(url, headers=fake_browser)

=> Disguise the request as a regular browser

import requests

url = 'https://www.cw.com.tw/today'
response = requests.get(url)
print(response.text) # print(response)

# Save the page content to a file
f = open('example.html', 'w', encoding='utf-8')
f.write(response.text)
f.close()
print('Success!')

# Alternatively, a with-block closes the file automatically:
# with open('example1.html', 'w', encoding='utf-8') as f:
#     f.write(response.text)

Get text

(1) Fetch the entire page

import requests
from bs4 import BeautifulSoup # new import

url = 'https://www.cw.com.tw/today'
response = requests.get(url)

# Parse the HTML with BeautifulSoup
soup = BeautifulSoup(response.text, 'html.parser')

Get text

(2) Search for specific HTML tags

Get text

# ...
# Find every hyperlink <a> and print its text
a_tags = soup.find_all('a')

for tag in a_tags:
    print(tag.text)
# ...
# Find the <title> tag and print its text
title = soup.find('title')
print(title.text)
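
Each tag also exposes its attributes; for example, tag.get('href') returns the link target, or None when the attribute is absent. Continuing from the code above:

# print the destination URL of every hyperlink
for tag in a_tags:
    href = tag.get('href')
    if href:
        print(href)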

Get text

(3) Search with CSS selectors

import requests
from bs4 import BeautifulSoup # new import

url = 'https://www.cw.com.tw/today'
response = requests.get(url)

# Parse the HTML with BeautifulSoup
soup = BeautifulSoup(response.text, 'html.parser')

Get text

# ...

# Select every <a class="channelTitle"> inside the element with id="item1"
tag_list = soup.select('#item1 a.channelTitle')
print(tag_list)

# Select every <a> under an <h3> inside the element with id="item1"
title_list = soup.select('#item1 h3 a')
print(title_list)
# ... display the two lists side by side
tag_list = soup.select('#item1 a.channelTitle')
title_list = soup.select('#item1 h3 a')

for i in range(len(tag_list)):
    print(tag_list[i].text, title_list[i].text)
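
An equivalent, slightly more idiomatic way to pair the two lists is zip(), which walks them in lockstep:

# zip() pairs the i-th elements of both lists, stopping at the shorter one
for tag, title in zip(tag_list, title_list):
    print(tag.text, title.text)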

Get pictures

(1) Crawl images from Google and save them

from bs4 import BeautifulSoup
import requests

url = "https://www.google.com.tw/search?q=%E7%8B%97&tbm=isch&ved=2ahUKEwj21fuat97pAhUF0pQKHR51CuoQ2-cCegQIABAA&oq=%E7%8B%97&gs_lcp=CgNpbWcQDFAAWABglztoAHAAeACAAQCIAQCSAQCYAQCqAQtnd3Mtd2l6LWltZw&sclient=img&ei=Is_TXva8GYWk0wSe6qnQDg"
response = requests.get(url)

soup = BeautifulSoup(response.content, 'html.parser')
items = soup.find_all('img')
import os
# ...
folder_path = './photo/'

# Create the output folder if it does not exist
if not os.path.exists(folder_path):
    os.makedirs(folder_path)
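
On Python 3.2 and later, the existence check can be folded into a single call via the exist_ok flag:

import os

# creates ./photo/ if missing; does nothing when it already exists
os.makedirs('./photo/', exist_ok=True)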

Get pictures

# ...
# Set how many images to download (photolimit)
photolimit = 10

for index, item in enumerate(items):
    if index == 0:
        pass  # skip the first <img>, which is not a search result
    elif index <= photolimit:
        img_src = requests.get(item['src']).content

        img_name = folder_path + str(index) + '.png'
        with open(img_name, 'wb') as f:  # write the image bytes in binary mode
            f.write(img_src)

        print('Image %d saved' % index)
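
One caveat: not every <img> tag carries a fetchable src attribute; on some pages the value is an inline base64 "data:" URI that requests cannot download over HTTP. A more defensive version of the loop above (a sketch under that assumption) could be:

for index, item in enumerate(items):
    if index == 0 or index > photolimit:
        continue
    src = item.get('src')  # get() returns None instead of raising KeyError
    # skip missing src values and inline "data:" URIs
    if not src or src.startswith('data:'):
        continue
    with open(folder_path + str(index) + '.png', 'wb') as f:
        f.write(requests.get(src).content)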

Save the novel chapter's title + content

as novel.txt

  • Novel (芸汐傳.天才小毒妃): https://www.ck101.org/293/293983/51812965.html

Hints:

1. First check the status code the page returns

2. Use a CSS selector to crawl the text inside class yuedu_zhengwen

3. Write the text to novel.txt and save the file

Save the novel chapter's title + content

as novel.txt

import requests
from bs4 import BeautifulSoup

url = 'https://www.ck101.org/293/293983/51812965.html'
response = requests.get(url) # pass headers=fake_browser here if access is blocked
#print(response)

soup = BeautifulSoup(response.content, 'html.parser')
title = soup.find('title').text
items = soup.select('.yuedu_zhengwen')

file_name = './novel.txt'
file = open(file_name, 'w', encoding = 'utf8') 
file.write(title + '\n' + '\n')

for i in items:
    file.write(i.text + '\n')
file.close()

import requests
from bs4 import BeautifulSoup

url = 'https://www.ck101.org/293/293983/51812965.html'
response = requests.get(url) 

soup = BeautifulSoup(response.content, 'html.parser')
title = soup.find('title').text
items = soup.select('.yuedu_zhengwen')

file_name = './Lab.txt'
file = open(file_name, 'w', encoding = 'utf8') 
file.write(title + '\n' + '\n')

for i in items:
    # Strip unwanted text such as: 小÷說◎網 】,♂小÷說◎網 】,
    i = str(i).replace('小÷說◎網 】,♂小÷說◎網 】,','').replace('<br/>','').replace('<div class="yuedu_zhengwen" id="content">','').replace('</div>','')
    file.write(i + '\n')
    print(i  + '\n')  
file.close()

  • Strip unwanted text such as: 小÷說◎網 】,♂小÷說◎網 】,
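
Instead of casting each tag to str and stripping markup with chained replace() calls, one can take the tag's .text first, which drops all HTML tags and leaves only the watermark to remove. Note that .text also discards the <br/> line breaks, so paragraph formatting differs slightly; a sketch of the alternative loop:

for i in items:
    # .text removes every tag, so only the watermark string needs stripping
    clean = i.text.replace('小÷說◎網 】,♂小÷說◎網 】,', '')
    file.write(clean + '\n')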
import requests
from bs4 import BeautifulSoup
import os

url = 'https://www.ck101.org/293/293983/51812965.html'
response = requests.get(url) 

soup = BeautifulSoup(response.content, 'html.parser')
title = soup.find('title').text
items = soup.select('.yuedu_zhengwen')

# Create the output folder if it does not exist
folder_path = './novel/'
if not os.path.exists(folder_path):
    os.makedirs(folder_path)  # create folder

# Create Lab.txt inside the novel folder
file_name = './novel/Lab.txt'
file = open(file_name, 'w', encoding = 'utf8') 
file.write(title + '\n' + '\n')

for i in items:
    i = str(i).replace('小÷說◎網 】,♂小÷說◎網 】,','').replace('<br/>','').replace('<div class="yuedu_zhengwen" id="content">','').replace('</div>','')
    file.write(i + '\n')
    print(i  + '\n')  
file.close()

  • Check whether the novel folder exists, then create Lab.txt inside it and save the file.
import requests
from bs4 import BeautifulSoup
import os
index = 0  # global counter used to number the output files
    
# Create the output folder if it does not exist
folder_path = './novel/'
if not os.path.exists(folder_path):
    os.makedirs(folder_path)  # create folder
    
def get_content(url):
    response = requests.get(url)
    soup = BeautifulSoup(response.content, 'html.parser')
    title = soup.find('title').text
    items = soup.select('.yuedu_zhengwen')

    file_write(items, title)

def file_write(items, title):
    global index
    
    file_name = './novel/Lab' + str(index + 1) + '.txt'
    f = open(file_name, 'w', encoding='utf-8')
    f.write(title + '\n' + '\n')
    
    for i in items:
        i = str(i).replace('小÷說◎網 】,♂小÷說◎網 】,','').replace('<br/>','').replace('<div class="yuedu_zhengwen" id="content">','').replace('</div>','')
        f.write(i + '\n')
        #print(i  + '\n')

    f.close() #close file
    index += 1
    print('Done!')

# Automatically crawl multiple chapter pages and save each one
url = ['https://www.ck101.org/293/293983/5181{}.html'.format(str(i)) for i in range(2965,4330)]

for u in url:
    get_content(u)

  • Automatically crawl multiple chapter pages and save each one.
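
Since this loop issues over a thousand requests in a row, it is wise to pause between chapters so the server is not flooded (which also lowers the odds of being blocked with a 403); a minimal variant using time.sleep:

import time

for u in url:
    get_content(u)
    time.sleep(1)  # wait one second between chapter downloads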

Thank you for listening.
