Python Web Crawling Basics

SIRLA

Speaker: 楊子右

Date: 2019/12/18

Outline

Understanding Web Crawlers
Crawling in Practice
LAB

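Understanding Web Crawlers

Web Crawler

A web crawler is a bot that automatically browses web pages and extracts the target information from them.

Applications

Price comparison, trend tracking, data analysis, search engines, and more.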

Steps of a crawler program

Request: browser -> server
Send data: server -> browser
Parse the data
Get the information

Crawler packages

Requests

Sends a request to the server that hosts the target page.

BeautifulSoup

Parses HTML.

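Crawling in Practice

GET vs. POST

	GET	POST
URL	The URL carries the HTML form's parameters and data.	The URL does not change when data is sent.
Data size	Data travels in the URL, so it is subject to URL length limits.	Data does not travel in the URL, so it is not bound by URL length limits.
Security	Form parameters and entered values are visible in the URL.	Parameters are sent in the HTTP request body, so they do not appear in the URL.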
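
To make the table concrete, here is a minimal sketch that builds one GET and one POST request without sending either. It uses the standard library's urllib only so that it runs without network access or extra packages; the endpoint and form fields are made up for illustration.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Hypothetical endpoint and form fields, used only for illustration.
URL = 'https://example.com/search'
form = {'q': 'python', 'page': '2'}

# GET: the encoded form data becomes the query string of the URL.
get_req = Request(URL + '?' + urlencode(form), method='GET')

# POST: the URL stays the same and the encoded form data travels in the body.
post_req = Request(URL, data=urlencode(form).encode('utf-8'), method='POST')

# Nothing is sent here; we only inspect what each request would contain.
print(get_req.get_method(), get_req.full_url, get_req.data)
print(post_req.get_method(), post_req.full_url, post_req.data)
```

The first line printed shows the parameters inside the URL and no body. The second shows an unchanged URL with the parameters in the body, which matches the "URL" and "Security" rows of the table.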

Fetching page content

Install the package: Requests

 pip install requests

Fetching page content

Taiwan Railways (TRA) timetable query

import requests 

response = requests.get('https://www.railway.gov.tw/tra-tip-web/tip/tip001/tip112/querybytime')

print(response.text)

with open('request.html', 'w', encoding='utf-8') as f:
    f.write(response.text)

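Sending data

# Form fields for the timetable query.
# The _csrf token is tied to a browser session, so the value below may no
# longer be valid; copy a fresh one from the form page before running this.
data1 = {'_csrf': '56538d1c-2a43-41cf-a65c-d0ed7cec7c8f',
         'trainTypeList': 'ALL',
         'transfer': 'ONE',
         'startStation': '3360-彰化',
         'endStation': '3470-斗六',
         'rideDate': '2019/12/19',
         'startOrEndTime': 'true',
         'startTime': '00:00',
         'endTime': '23:59'}

# POST the form data; the result is stored in response1 so that the GET
# response above is not overwritten.
response1 = requests.post('https://www.railway.gov.tw/tra-tip-web/tip/tip001/tip112/querybytime', data=data1)

with open('requests.html', 'w', encoding='utf-8') as f:
    f.write(response1.text)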
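
The `_csrf` value in the form data is hard-coded, so it will go stale. One possible way to obtain a fresh value is to read it out of the form page before posting. The sketch below assumes the token sits in a hidden `<input name="_csrf">` element, which is a common pattern but has not been verified against the TRA site. `extract_csrf` and the sample HTML are hypothetical and exist only for this illustration.

```python
from bs4 import BeautifulSoup


def extract_csrf(html, field_name='_csrf'):
    """Return the value of the input named field_name, or None if absent."""
    soup = BeautifulSoup(html, 'html.parser')
    tag = soup.find('input', attrs={'name': field_name})
    return tag.get('value') if tag is not None else None


# Made-up HTML standing in for the query form page (an assumption).
sample_form = (
    '<form method="post">'
    '<input type="hidden" name="_csrf" value="abc-123">'
    '<input name="startStation">'
    '</form>'
)

token = extract_csrf(sample_form)
print(token)
```

In a live crawl, one would typically fetch the form page and send the POST through the same `requests.Session()`, so that the cookies that belong to the token are shared between the two requests.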

Simulating a user agent

Stock website

import requests

# Without a user-agent header, the site may not return the expected page.
res = requests.get('https://www.stockdog.com.tw/stockdog/index.php?m=overview&sid=1101')

print(res.text)

Adding a user-agent

# Send a browser-like user-agent header with the request.
user_agent = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) ...'}

res = requests.get('https://www.stockdog.com.tw/stockdog/index.php?m=overview&sid=1101', headers=user_agent)
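
A short sketch of why the header matters: by default Requests announces itself as `python-requests/<version>`, which a site can easily tell apart from a browser. The code below inspects both cases without sending anything. The URL and the user-agent string are placeholders, not values from the stock site.

```python
import requests

# What Requests sends as User-Agent when no headers argument is given.
default_ua = requests.utils.default_user_agent()
print(default_ua)

# Preparing a request (without sending it) shows the header that would go out.
browser_like = {'user-agent': 'Mozilla/5.0 (illustrative value)'}
prepared = requests.Request('GET', 'https://example.com/', headers=browser_like).prepare()

# Header lookup is case-insensitive, so 'User-Agent' finds 'user-agent'.
print(prepared.headers['User-Agent'])
```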

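encoding

New Book Information Network (全國新書資訊網)

import requests

res = requests.get('http://isbn.ncl.edu.tw/NEW_ISBNNet/main_DisplayRecord_Popup.php?Pact=view&Pkey=1080117*0046')

# The Chinese text may come out garbled here if Requests guessed the wrong encoding.
print(res.text)

# Tell Requests to decode the response body as UTF-8 from now on.
res.encoding = 'utf-8'

with open('encoding.html', 'w', encoding='utf-8') as f:
    f.write(res.text)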
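
The garbling can be reproduced with plain bytes and no network. When a text response does not declare a charset in its Content-Type header, Requests falls back to ISO-8859-1, which is probably what happens with this page. The sketch below is an illustration of that effect, not a trace of the actual response.

```python
# Bytes as a server might send them: UTF-8 encoded Chinese text.
raw = '全國新書資訊網'.encode('utf-8')

# Decoding with the wrong codec does not fail; it silently produces garbage.
wrong = raw.decode('iso-8859-1')

# Decoding with the codec the page was written in restores the text.
right = raw.decode('utf-8')

print(repr(wrong))
print(right)
```

Setting `res.encoding = 'utf-8'` has the same effect as the second decode: it changes the codec that `res.text` uses. When the correct encoding is unknown, `res.apparent_encoding` gives Requests' guess based on the content, which may be a reasonable starting point.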

Extracting page content

Install the package: BeautifulSoup

 pip install beautifulsoup4

Before extracting page content...

You first need to know what HTML is.

Sample

from bs4 import BeautifulSoup

# Raw HTML source (the closing tag is </html>)
html_doc = '<html> \
	<body> \
		<h1 id="title">Hello World</h1> \
		<a href="#" class="link">This is link1</a> \
		<a href="# link2" class="link">This is link2</a> \
	</body> \
</html> '

# Parse the HTML source with Beautiful Soup
soup = BeautifulSoup(html_doc, 'html.parser')

<h1 id="title">Hello World</h1>

Tag: h1
Attribute: id="title"
Text: Hello World

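print(soup.text)              # all text, with the tags stripped
print(soup.contents)          # list of the top-level nodes
print(soup.select('html'))    # select by tag name
print(soup.select('h1'))      # select by tag name
print(soup.select('a'))       # select by tag name
print(soup.select('#title'))  # select by id
print(soup.select('.link'))   # select by class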
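
`select()` returns a list of tags, and each tag exposes the three parts named earlier (tag, attribute, text). The sketch below repeats the sample HTML so that it runs on its own, then reads those parts from the selected elements.

```python
from bs4 import BeautifulSoup

html_doc = '''<html><body>
<h1 id="title">Hello World</h1>
<a href="#" class="link">This is link1</a>
<a href="# link2" class="link">This is link2</a>
</body></html>'''

soup = BeautifulSoup(html_doc, 'html.parser')

# select() always returns a list, even when only one element matches.
h1 = soup.select('#title')[0]
print(h1.name)    # tag name: h1
print(h1['id'])   # attribute value: title
print(h1.text)    # text inside the tag: Hello World

# Loop over every element with class "link" and read its text and href.
for a in soup.select('.link'):
    print(a.text, '->', a['href'])
```

Because the result is a list, indexing with `[0]` on a selector that matches nothing raises an `IndexError`, which is worth keeping in mind for the labs.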

Lab1

Extract book information

URL : https://www.kingstone.com.tw/new/basic/2010230006305?zone=book&lid=search&actid=WISE
Output :

資訊組織
作者：張慧銖、陳淑燕、邱子恒、陳
出版社：華藝數位  
出版日：2017/6/21
ISBN：9789864371310
適讀年齡：全齡適讀

STEP1. Fetch the page content

import requests
from bs4 import BeautifulSoup

# Browser-like user-agent header
user_agent = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.108 Safari/537.36'}

res = requests.get('https://www.kingstone.com.tw/new/basic/2010230006305?zone=book&lid=search&actid=WISE', headers=user_agent)

# Parse the returned HTML
soup = BeautifulSoup(res.text, 'html.parser')

print(soup)

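STEP2. Extract page content (1)

# Book title
a = soup.select('.pdname_basic')[0]
print(a.text)

# Block that holds the book details
b = soup.select('.basiccol')[0]
result = ''
for i in range(1, 6):
    c = b.select('.basicunit')[i]
    result += c.text  # append the tag's text; a Tag cannot be added to a str

print(result)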

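STEP3. Extract page content (2)

result = ''
for i in range(1, 6):
    c = b.select('.basicunit')[i]
    if i < 3:
        # Author and publisher: the label followed by the linked name
        d = c.select('.title_basic')[0]
        result += d.text
        e = c.select('a')[0]
        result += e.text
    else:
        # Other fields: collapse runs of whitespace into single spaces
        result += ' '.join(c.text.split())
    if i < 5:
        result += '\n'  # newline between fields, none after the last one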

Complete reference code

import requests
from bs4 import BeautifulSoup

user_agent = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.108 Safari/537.36'}

res = requests.get('https://www.kingstone.com.tw/new/basic/2010230006305?zone=book&lid=search&actid=WISE', headers=user_agent)

soup = BeautifulSoup(res.text, 'html.parser')

# Book title
a = soup.select('.pdname_basic')[0]
print(a.text)

# Book details: units 1 to 5 of the .basiccol block
b = soup.select('.basiccol')[0]
result = ''
for i in range(1, 6):
    c = b.select('.basicunit')[i]
    if i < 3:
        # Author and publisher: the label followed by the linked name
        d = c.select('.title_basic')[0]
        result += d.text
        e = c.select('a')[0]
        result += e.text
    else:
        # Other fields: collapse runs of whitespace into single spaces
        result += ' '.join(c.text.split())
    if i < 5:
        result += '\n'

print(result)
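
The lab code indexes every `select()` result with `[0]`, which raises an `IndexError` as soon as the site changes a class name. A slightly more forgiving variant is sketched below with `select_one`, which returns `None` when nothing matches. `first_text` is a hypothetical helper, and the HTML fragment is made up; it borrows only the class names from the lab, not the real page structure.

```python
from bs4 import BeautifulSoup


def first_text(node, selector, default=''):
    """Whitespace-normalized text of the first match, or default if none."""
    match = node.select_one(selector)
    if match is None:
        return default
    return ' '.join(match.text.split())


# Made-up fragment that only reuses the lab's class names (an assumption).
sample = '''
<div class="pdname_basic">Sample   Title</div>
<div class="basiccol"><div class="basicunit">ISBN：  0000000000000</div></div>
'''
soup = BeautifulSoup(sample, 'html.parser')

print(first_text(soup, '.pdname_basic'))
print(first_text(soup, '.basiccol .basicunit'))
print(first_text(soup, '.no_such_class', default='(not found)'))
```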

Lab2

Extract movie information

URL : https://movies.yahoo.com.tw/movieinfo_main/小小夜曲-little-nights-little-love-10213
Output :

小小夜曲
上映日期：2019-11-22
片　　長：02時00分
發行公司：天馬行空

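Complete reference code

import requests
from bs4 import BeautifulSoup

# The Chinese title in the URL is percent-encoded
res = requests.get('https://movies.yahoo.com.tw/movieinfo_main/%E5%B0%8F%E5%B0%8F%E5%A4%9C%E6%9B%B2-little-nights-little-love-10213')

soup = BeautifulSoup(res.text, 'html.parser')

# Block that holds the movie information
a = soup.select('.movie_intro_info_r')[0]

# Movie title
b = a.select('h1')[0]
print(b.text)

# First three <span> elements: release date, running time, distributor
for i in range(0, 3):
    c = a.select('span')[i]
    print(c.text)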
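
The URL in the Lab2 description contains Chinese characters, while the one passed to `requests.get` is percent-encoded. The two are the same address. A small standard-library sketch of the conversion follows; the variable names are chosen only for illustration.

```python
from urllib.parse import quote, unquote

readable = 'https://movies.yahoo.com.tw/movieinfo_main/小小夜曲-little-nights-little-love-10213'

# Percent-encode the non-ASCII characters as UTF-8 bytes, leaving ':' and '/' alone.
encoded = quote(readable, safe=':/')
print(encoded)

# unquote() reverses the encoding.
print(unquote(encoded))
```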


Thank you for listening.
