Python Asyncio

Speaker: 土豆

Date: 2021/05/30

Outline

  • What is Asynchronous?
  • Introduction to asyncio
  • asyncio in practice: an asynchronous crawler

What is Asynchronous?

You have probably seen one of these in a restaurant.

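Synchronous restaurant

Assumptions: ordering takes 1 min, preparing takes 2 min, and there is only one counter.

  • Minute 1: A orders
  • Minutes 2-3: A's order is prepared
  • Minute 4: B orders
  • Minutes 5-6: B's order is prepared
  • Minute 7: C orders
  • Minutes 8-9: C's order is prepared

Total time: 9 min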

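Asynchronous restaurant

Assumptions: ordering takes 1 min, preparing takes 2 min, and there is only one counter.

  • A: orders in minute 1, prepared in minutes 2-3
  • B: orders in minute 2, prepared in minutes 3-4
  • C: orders in minute 3, prepared in minutes 4-5

Total time: 5 min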
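
The two totals can be checked with a little arithmetic. The sketch below assumes, as the slides do, that only taking an order occupies the counter and that there is at least one customer; the function and constant names are made up for illustration.

```python
ORDER_MIN = 1    # minutes to take one order
PREPARE_MIN = 2  # minutes to prepare one order

def synchronous_total(customers):
    # Each customer blocks the counter until their food is ready.
    return customers * (ORDER_MIN + PREPARE_MIN)

def asynchronous_total(customers):
    # Orders are still taken one at a time, but preparation overlaps with
    # the following orders; only the last preparation remains at the end.
    return customers * ORDER_MIN + PREPARE_MIN

print(synchronous_total(3))   # 9
print(asynchronous_total(3))  # 5
```

The saving comes entirely from overlapping the waiting (preparation) time; the counter itself never serves two customers at once.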

Introduction to asyncio

Example in synchronous way

import time

def fetch_data(data_name):
    print('start fetching data:', data_name)
    time.sleep(3)  # blocking: nothing else can run during these 3 seconds
    print('stop fetching data:', data_name)

start_time = time.time()
for data_name in ['A', 'B', 'C']:
    fetch_data(data_name)  # runs one after another, about 9 seconds in total
print('Elapsed time:', time.time() - start_time, 'seconds')

Example in asynchronous way

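import asyncio
import time

async def fetch_data(data_name):
    print('start fetching data:', data_name)
    await asyncio.sleep(3)  # non-blocking: the event loop runs other tasks meanwhile
    print('stop fetching data:', data_name)

async def main():
    # Schedule all three coroutines as tasks so that their waits overlap.
    tasks = []
    for data_name in ['A', 'B', 'C']:
        tasks.append(asyncio.create_task(fetch_data(data_name)))
    await asyncio.gather(*tasks)  # wait until every task has finished

start_time = time.time()
asyncio.run(main())  # about 3 seconds in total instead of 9
print('Elapsed time:', time.time() - start_time, 'seconds')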
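
A common pitfall worth illustrating: the speed-up only appears when the wait itself is awaitable. The sketch below is not from the slides; it uses shortened 0.1-second waits and hypothetical helper names to compare time.sleep with asyncio.sleep inside otherwise identical coroutines.

```python
import asyncio
import time

async def blocking_wait():
    time.sleep(0.1)           # blocks the whole event loop

async def cooperative_wait():
    await asyncio.sleep(0.1)  # yields to the event loop while waiting

async def run_three(worker):
    # Run three copies of the worker concurrently and time them.
    start = time.perf_counter()
    await asyncio.gather(worker(), worker(), worker())
    return time.perf_counter() - start

blocking_elapsed = asyncio.run(run_three(blocking_wait))
cooperative_elapsed = asyncio.run(run_three(cooperative_wait))
print(f'time.sleep:    {blocking_elapsed:.2f} s')     # roughly 0.3
print(f'asyncio.sleep: {cooperative_elapsed:.2f} s')  # roughly 0.1
```

Marking a function async does not make its body non-blocking; any blocking call inside it still stalls every other task.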

Terminology

Coroutine

A function whose execution can be paused and resumed. At each await it can hand control back so that other coroutines get a chance to run.
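
A minimal sketch of this idea, with a made-up greet function: calling an async function does not run its body, it only creates a coroutine object that something else (here asyncio.run) has to drive.

```python
import asyncio
import inspect

async def greet(name):
    await asyncio.sleep(0)  # a pause point: control goes back to the event loop
    return 'hello ' + name

coro = greet('A')                 # nothing inside greet() has run yet
print(inspect.iscoroutine(coro))  # True
result = asyncio.run(coro)        # the event loop drives it to completion
print(result)                     # hello A
```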

Event loop

Schedules, runs, and monitors all coroutines.
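
To make the scheduling visible, the following sketch (worker and log are illustrative names, not part of the slides) lets two coroutines record each step. Every await asyncio.sleep(0) hands control back to the event loop, which then resumes the next ready task.

```python
import asyncio

log = []

async def worker(name):
    for step in range(2):
        log.append(f'{name}{step}')
        await asyncio.sleep(0)  # hand control back to the event loop

async def main():
    await asyncio.gather(worker('A'), worker('B'))

asyncio.run(main())
print(log)  # ['A0', 'B0', 'A1', 'B1']
```

The interleaved order shows that a single thread runs both workers by switching between them at their await points.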

asyncio in practice: an asynchronous crawler

Install packages

pip install requests beautifulsoup4  # for the synchronous crawler
pip install aiohttp  # for the asynchronous crawler

Synchronous crawler

import os
import time
import requests
from bs4 import BeautifulSoup

urls = [
    "https://zh.wikipedia.org/wiki/%E7%8F%8D%E7%8F%A0%E5%A5%B6%E8%8C%B6",
    "https://www.cosmopolitan.com/tw/lifestyle/food-and-drink/g34501236/milktea-20201104/",
    "https://www.cna.com.tw/news/firstnews/202104170147.aspx",
    "https://www.elle.com/tw/beauty/health/g33947289/drinking-bubble-milk-tea/",
    "https://www.womenshealthmag.com/tw/food-nutrition/restaurant/g35087801/pearl-milk-tea-drinks-top10/",
    "https://www.oktea.com.tw/product.php?pid_for_show=3340",
    "https://www.chingshin.tw/product/pearl-milk-tea",
    "https://www.books.com.tw/products/N001116299"
]

def fetch_data(url):
    print('start fetching data:', url)
    r = requests.get(url)  # blocks until the whole response has arrived
    soup = BeautifulSoup(r.text, 'html.parser')
    result = ''.join(soup.stripped_strings)  # keep only the visible text
    print('stop fetching data:', url)
    return result

def main():
    # Pages are fetched one at a time, so the waiting times add up.
    results = []
    for url in urls:
        results.append(fetch_data(url))

    os.makedirs('results', exist_ok=True)  # open() does not create the directory
    for i, result in enumerate(results):
        with open('results/result' + str(i) + '.txt', 'w', encoding='utf8') as f:
            f.write(result)


start_time = time.time()
main()
print('Elapsed time:', time.time() - start_time, 'seconds')

Asynchronous crawler

import asyncio
import os
import time
from bs4 import BeautifulSoup
from aiohttp import ClientSession

urls = [
    "https://zh.wikipedia.org/wiki/%E7%8F%8D%E7%8F%A0%E5%A5%B6%E8%8C%B6",
    "https://www.cosmopolitan.com/tw/lifestyle/food-and-drink/g34501236/milktea-20201104/",
    "https://www.cna.com.tw/news/firstnews/202104170147.aspx",
    "https://www.elle.com/tw/beauty/health/g33947289/drinking-bubble-milk-tea/",
    "https://www.womenshealthmag.com/tw/food-nutrition/restaurant/g35087801/pearl-milk-tea-drinks-top10/",
    "https://www.oktea.com.tw/product.php?pid_for_show=3340",
    "https://www.chingshin.tw/product/pearl-milk-tea",
    "https://www.books.com.tw/products/N001116299"
]

async def fetch_data(url, session):
    print('start fetching data:', url)
    # The connection goes back to the session's pool when the block exits.
    async with session.get(url) as r:
        html = await r.text()  # other tasks run while this one waits
    soup = BeautifulSoup(html, 'html.parser')
    result = ''.join(soup.stripped_strings)  # keep only the visible text
    print('stop fetching data:', url)
    return result

async def main():
    tasks = []
    async with ClientSession() as session:
        # Start every request first, then wait for all of them together.
        for url in urls:
            tasks.append(asyncio.create_task(fetch_data(url, session)))
        results = await asyncio.gather(*tasks)  # same order as urls

    os.makedirs('results', exist_ok=True)  # open() does not create the directory
    for i, result in enumerate(results):
        with open('results/result' + str(i) + '.txt', 'w', encoding='utf8') as f:
            f.write(result)


start_time = time.time()
asyncio.run(main())
print('Elapsed time:', time.time() - start_time, 'seconds')
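
One possible refinement, not part of the original slides: by default asyncio.gather raises the first exception it sees, so a single unreachable URL would abort the whole crawl. The sketch below passes return_exceptions=True and separates failures from successes. fake_fetch and the example.com URLs are stand-ins so that it runs without network access.

```python
import asyncio

async def fake_fetch(url):
    # Hypothetical stand-in for a real HTTP request so the sketch runs offline.
    await asyncio.sleep(0.01)
    if 'bad' in url:
        raise ValueError('cannot fetch ' + url)
    return 'content of ' + url

async def main():
    urls = [
        'https://example.com/a',
        'https://example.com/bad',
        'https://example.com/c',
    ]
    # Exceptions are returned in place of results instead of being raised.
    results = await asyncio.gather(
        *(fake_fetch(url) for url in urls), return_exceptions=True
    )
    pages, failed = {}, {}
    for url, result in zip(urls, results):
        if isinstance(result, Exception):
            failed[url] = result
        else:
            pages[url] = result
    return pages, failed

pages, failed = asyncio.run(main())
print(len(pages), 'succeeded,', len(failed), 'failed')  # 2 succeeded, 1 failed
```

In the real crawler the same pattern would let the successful pages be written to disk while the failed URLs are logged or retried.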

References

Python Async

By Sam Yang
