Club Activity: Web Crawler Tutorial

Web crawlers?

no.

What is it?

The way search engines automatically browse and fetch web pages

Automated extraction of web page data

Common modules

Requests + Beautiful Soup / PyQuery
Scrapy / Pyspider
Selenium

Selenium

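A toolkit for browser automation

Pros:

Easy to operate

Can perform actions such as logging in and scrolling

Simulates a real browser visiting the site, so it is less likely to be blocked

Cons:

Not a dedicated crawling package

Slower than a dedicated crawler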

Setting Up

Install Selenium

pip install selenium

ChromeDriver

Check your Chrome browser version:

https://www.whatismybrowser.com/detect/what-version-of-chrome-do-i-have

Install ChromeDriver

http://chromedriver.chromium.org/

Put the ChromeDriver executable in the following location

... > User > anaconda3 >

... > User > anaconda3 > Scripts

Basics

Import the Selenium module

from selenium import webdriver

WebDriver: the application that drives the browser

driver = webdriver.Chrome()
url = 'URL of the site you want to crawl'
driver.get(url)

# program

driver.quit()

Create the crawler itself

Wait

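Give the browser time to load

1. Forced wait

import time
from selenium import webdriver

driver = webdriver.Chrome()
url = 'URL of the site you want to crawl'
driver.get(url)

time.sleep(10)  # always pause for the full 10 seconds

# program

driver.quit()

Rigid
Wastes time when the page is ready sooner

Meaning

The next line does not run until the specified number of seconds has passed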

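2. Implicit wait

driver = webdriver.Chrome()
driver.implicitly_wait(30)
url = 'URL of the site you want to crawl'
driver.get(url)

# program

driver.quit()

Meaning

Each element lookup keeps retrying for up to the specified number of seconds.

If the element appears in time, the next line runs immediately.

If the time runs out,

the lookup raises NoSuchElementException.

More flexible
Can save time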

3. Explicit wait

from selenium import webdriver
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.support import expected_conditions as ec
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
driver.implicitly_wait(30)
url = 'URL of the site you want to crawl'
driver.get(url)

locator = (By.LINK_TEXT, '7122')
WebDriverWait(driver, 10).until(ec.presence_of_element_located(locator))

# program

driver.quit()

Meaning

Uses until and until_not

to flexibly check whether something meets a custom condition

If the condition is met, execution continues with the next line

If the specified time runs out,

a TimeoutException is raised
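To make the idea of an explicit wait concrete, here is a minimal pure-Python sketch of the same polling pattern. It is not Selenium's implementation; `wait_until`, `page_ready`, and `checks` are hypothetical names made up for illustration, and the "page" is simulated by a counter.

```python
import time


def wait_until(condition, timeout=10.0, poll_frequency=0.5):
    """Call condition() repeatedly until it returns a truthy value.

    Returns that value, or raises TimeoutError once `timeout` seconds pass.
    (Hypothetical helper that mirrors the idea behind WebDriverWait.until.)
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} s")
        time.sleep(poll_frequency)


# A simulated page that only becomes "ready" on the third check.
checks = {"count": 0}


def page_ready():
    checks["count"] += 1
    return checks["count"] >= 3


outcome = wait_until(page_ready, timeout=2.0, poll_frequency=0.01)
print(outcome, checks["count"])
```

Unlike a forced wait, the loop returns as soon as the condition holds, and unlike silently continuing, it fails loudly when the time runs out.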

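Syntax

WebDriverWait(driver, timeout, poll_frequency, ignored_exceptions).until(condition, message)
WebDriverWait(driver, timeout, poll_frequency, ignored_exceptions).until_not(condition, message)

timeout: how long to wait, in seconds; poll_frequency: how often the condition is checked; ignored_exceptions: exceptions to ignore while polling; message: the text reported on timeout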

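Available conditions

selenium.webdriver.support.expected_conditions

The following two conditions check the title: whether the given title argument equals, or is contained in, driver.title
title_is
title_contains

The following two conditions check whether elements are present; both take a locator tuple such as (By.ID, 'kw')
The first passes as soon as one matching element is present in the DOM; the second returns the list of all matching elements once at least one is present
presence_of_element_located
presence_of_all_elements_located

The following three conditions check element visibility; the first two take a locator tuple, the third takes a WebElement
The first and third are essentially the same check
visibility_of_element_located
invisibility_of_element_located
visibility_of

The following two conditions check whether some text appears in an element: one checks the element's text, the other its value attribute
text_to_be_present_in_element
text_to_be_present_in_element_value

The following condition checks whether a frame can be switched into; it accepts a locator tuple or a direct reference: id, name, index, or WebElement
frame_to_be_available_and_switch_to_it

The following condition checks whether an alert is present
alert_is_present

The following condition checks whether an element is clickable; it takes a locator
element_to_be_clickable

The following four conditions check whether an element is selected: the first takes a WebElement, the second a locator tuple
The third takes a WebElement and an expected state, returning True if they match and False otherwise
The fourth takes a locator and an expected state, returning True if they match and False otherwise
element_to_be_selected
element_located_to_be_selected
element_selection_state_to_be
element_located_selection_state_to_be

The last condition checks whether an element is still attached to the DOM; it takes a WebElement and can be used to tell whether the page has refreshed
staleness_of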

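How do we get the crawler moving?

>> Make it start looking for things

How do we make it start looking for things?

>> Selector

HTML

<Reading page source>

Press F12,

and you get this.

↑ Page source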

Selector

Xpath

A language for locating data in XML documents

driver.find_element_by_xpath("object_xpath")

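Selector syntax

Absolute XPath

/html/body/div[5]/div/div[2]/section[2]/div[2]/div[2]/article/div/a/img

The full path to an element
Only suitable for elements whose position never changes

Relative XPath

//*[@id="cf44340633"]/div/a/img

Syntax overview:

nodename selects the child nodes with this name
/ selects direct children of the current node
// selects descendants of the current node at any depth
. selects the current node
.. selects the parent of the current node
@ selects an attribute

[ ]: a condition the element must satisfy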
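The syntax above can be tried without a browser. The sketch below uses Python's standard `xml.etree.ElementTree`, which supports only a subset of XPath and requires paths on an element to start with `.`; the `PAGE` markup and its `gallery` id are invented for illustration.

```python
import xml.etree.ElementTree as ET

# A tiny, made-up page used only to exercise the path syntax.
PAGE = """
<html>
  <body>
    <div id="gallery">
      <div><a href="/big/1"><img src="1.jpg"/></a></div>
      <div><a href="/big/2"><img src="2.jpg"/></a></div>
    </div>
    <div id="footer"><img src="logo.png"/></div>
  </body>
</html>
"""

root = ET.fromstring(PAGE)  # root is the <html> element

# Relative path: // for any depth, [@id='...'] as a condition on the element.
gallery_srcs = [img.get("src") for img in root.findall(".//div[@id='gallery']/div/a/img")]

# // alone: every <img> anywhere below the current node.
all_srcs = [img.get("src") for img in root.findall(".//img")]

# Step-by-step path with positions, similar in spirit to an absolute XPath.
first_src = root.find("./body/div[1]/div[1]/a/img").get("src")

# .. moves from each <img> to its parent element.
parent_tags = [parent.tag for parent in root.findall(".//img/..")]

print(gallery_srcs, all_srcs, first_src, parent_tags)
```

The position-based path breaks as soon as the page layout shifts, while the id-based path keeps working, which is why the relative form is usually preferred.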

> Attribute list <

Testing XPath

1. Open Chrome DevTools (F12)

2. Select "Console"

3. Enter

$x("your xpath here");

Note: this is a helper built into the browser console, not standard JavaScript

If anything matches, it is shown below (obviously)

CSS selector

About 0.1 to 0.3 ms faster than XPath
The officially recommended approach

find_element_by_id(idName)
find_element_by_class_name(className)
find_element_by_tag_name(tag_name)
find_element_by_css_selector(css)
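As a rough guide, the first three lookups above correspond to simple CSS selectors that could be passed to the fourth. The sketch below shows that mapping for plain names without special characters; `to_css_selector` is a hypothetical helper written for this note, not part of Selenium.

```python
def to_css_selector(kind, value):
    """Translate a simple id / class / tag lookup into a CSS selector string.

    Hypothetical helper: assumes `value` is a plain name with no characters
    that would need escaping in CSS.
    """
    if kind == "id":
        return f"#{value}"   # <div id="kw">     ->  #kw
    if kind == "class":
        return f".{value}"   # <div class="box"> ->  .box
    if kind == "tag":
        return value         # <h1>              ->  h1
    raise ValueError(f"unsupported lookup kind: {kind!r}")


selectors = [
    to_css_selector("id", "kw"),
    to_css_selector("class", "MSG-list8C"),
    to_css_selector("tag", "h1"),
]
print(selectors)
```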

Warm-up exercise before the hands-on session

Sharky > <

I want them all!

1. Create the crawler

from selenium import webdriver
import os
import urllib
from urllib.request import urlopen

url = 'https://home.gamer.com.tw/creationDetail.php?sn=4660391'
driver = webdriver.Chrome()
driver.implicitly_wait(30)
driver.get(url)

#program

2. Find the XPath of the images

pic_path = '//div[@class="MSG-list8C"]/div/div//img'
pic_links = driver.find_elements_by_xpath(pic_path)

3. Create a folder with os

import os

img_folder = 'D:\\Sharky_imgs\\'
if not os.path.isdir(img_folder):
  os.mkdir(img_folder)

4. Use get_attribute() to read each image element's data-src

for i in range(len(pic_links)):
    pic_src = pic_links[i].get_attribute('data-src')

5. Download the images with urllib

import urllib.request

try:
  urllib.request.urlretrieve(pic_src, img_folder + str(i+1) + '.jpg')
except Exception:
  pass  # skip images that fail to download
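Steps 4 and 5 can be separated into "decide what to download" and "download it". The sketch below covers only the first half, with no network access; `build_download_plan` is a hypothetical helper, the example URLs are placeholders, and unlike the loop above it numbers files consecutively instead of by element index.

```python
import os


def build_download_plan(sources, folder):
    """Pair each usable image URL with a numbered target path inside `folder`.

    Entries that are None or empty are skipped, which is what get_attribute()
    gives back when an element has no such attribute.
    """
    plan = []
    for src in sources:
        if not src:
            continue
        filename = f"{len(plan) + 1}.jpg"
        plan.append((src, os.path.join(folder, filename)))
    return plan


# Placeholder values standing in for the data-src attributes read in step 4.
sample_sources = ["https://example.com/a.jpg", None, "https://example.com/b.jpg"]
plan = build_download_plan(sample_sources, "Sharky_imgs")
for url, path in plan:
    print(url, "->", path)
```

Each `(url, path)` pair could then be handed to `urllib.request.urlretrieve(url, path)` inside the try block from step 5.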

6. Close the browser

driver.quit()

Done!

from selenium import webdriver
import os
import urllib.request

url = 'https://home.gamer.com.tw/creationDetail.php?sn=4660391'
driver = webdriver.Chrome()
driver.implicitly_wait(30)
driver.get(url)

pic_path = '//div[@class="MSG-list8C"]/div/div//img'
pic_links = driver.find_elements_by_xpath(pic_path)

img_folder = 'D:\\Sharky_imgs\\'
if not os.path.isdir(img_folder):
    os.mkdir(img_folder)

for i in range(len(pic_links)):
    pic_src = pic_links[i].get_attribute('data-src')
    print(pic_src)
    try:
        urllib.request.urlretrieve(pic_src, img_folder + str(i+1) + '.jpg')
    except:
        pass

driver.quit()

Hands-on time!

Exercise I. Stocks

Stocks!

Draw a line chart from data on a web page

0. Inspect the page

Yahoo Stocks - Tesla

-> All the data is on the Historical Data page

https://finance.yahoo.com/quote/TSLA/history?p=TSLA

1. Create the crawler

from selenium import webdriver

driver = webdriver.Chrome()
url = "https://finance.yahoo.com/quote/TSLA/history?p=TSLA"
driver.implicitly_wait(30)
driver.get(url)

driver.close()  # close the browser before finishing

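2. Select the title

title = driver.find_element_by_tag_name("h1")
print(title.text)  # title is a Selenium element object; .text returns its text

The title is inside an <h1> tag,

so it can be found with find_element_by_tag_name, one of the CSS-selector-style lookups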

3. Select the stock price

price = driver.find_element_by_xpath("//section/div[2]/table/tbody/tr[1]/td[5]")
print(price.text)

Use the data in the Close* column to draw the chart

Locate this column and copy its XPath:
"/html/body/div[1]/div/div/div[1]/div/div[3]/div[1]/div/div[2]/section/div[2]/table/tbody/tr[1]/td[5]"

Simplified to: "//section/div[2]/table/tbody/tr[1]/td[5]"

4. Select multiple prices

prices = []
for i in range(1,31):
    # get price
    price = driver.find_element_by_xpath(f"//section/div[2]/table/tbody/tr[{i}]/td[5]/span")
    daily_price = price.text
    daily_price = float(daily_price)
    prices.insert(0, daily_price)

Use a loop to get the prices from the 30 most recent rows

Change the tr index to get the data of each row

and put the data into a list
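The loop relies on two small ideas: the row and column numbers are substituted into one XPath template, and `insert(0, ...)` reverses the newest-first table into oldest-first order for plotting. A browser-free sketch of both follows; `cell_xpath` is a hypothetical helper and the sample prices are made up.

```python
# XPath template taken from the slides; {row} and {col} are filled in per cell.
CELL_XPATH = "//section/div[2]/table/tbody/tr[{row}]/td[{col}]/span"


def cell_xpath(row, col):
    """Build the XPath for one table cell (rows and columns count from 1)."""
    return CELL_XPATH.format(row=row, col=col)


# The first three price-cell XPaths (column 5 is Close*).
price_xpaths = [cell_xpath(row, 5) for row in range(1, 4)]

# Made-up values in the order the table shows them: newest first.
newest_first = [101.0, 100.5, 99.0]

# insert(0, ...) puts each later (older) value in front of the earlier ones.
oldest_first = []
for price in newest_first:
    oldest_first.insert(0, price)

print(price_xpaths[0])
print(oldest_first)
```

Appending and then calling `reverse()` once at the end would give the same order with less shuffling; the slides use `insert(0, ...)`, so the sketch does too.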

5. Select the dates

dates = []
for i in range(1,31):
    # get date
    date_elem = driver.find_element_by_xpath(f"//section/div[2]/table/tbody/tr[{i}]/td[1]/span")
    date = date_elem.text
    dates.insert(0, date)

Same approach as for the prices,
but change the final td (column) in the XPath to column 1

Part one complete!

options = webdriver.ChromeOptions()
options.add_argument("--headless")
driver = webdriver.Chrome(options=options)

url = "https://finance.yahoo.com/quote/TSLA/history?p=TSLA"
driver.implicitly_wait(30)
driver.get(url)

★ You can add the --headless argument to the WebDriver options

Plotting

0. Install the package

pip install matplotlib

import matplotlib.pyplot as plt

1. Draw the line

plt.plot(dates,prices,color="red",marker="o",label=title.text)

plt.show()

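The first two arguments to plot are the x-axis data and the y-axis data;
both are lists

2. Set the title and axis labels

Showing the x-axis tick labels vertically is recommended to avoid overlap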

plt.title("Stocks",fontsize=24)
plt.xticks(fontsize=12, rotation='vertical')
plt.yticks(fontsize=12)
plt.xlabel('Day',fontsize=14,labelpad=20)
plt.ylabel('Price',fontsize=14,labelpad=20)

3. Add grid lines and a legend

plt.legend(loc="best",fontsize=14) #add legend
plt.grid(True) #add grids
plt.show()

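Finished chart

Exercise 1

Try plotting Amazon's stock price

https://finance.yahoo.com/quote/AMZN?p=AMZN&.tsrc=fin-srch

★ The price string needs extra handling

    daily_price = price.text
    try:
        daily_price = float(daily_price)
    except ValueError:
        # values such as "3,104.25" contain a thousands separator
        daily_price = float(daily_price.replace(',', ''))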
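The extra handling is needed because `float()` rejects thousands separators. A minimal sketch of the conversion on its own, using made-up sample strings; `parse_price` is a hypothetical helper name.

```python
def parse_price(text):
    """Convert a displayed price such as "3,104.25" into a float.

    Removing every comma works for any number of digits, which
    fixed-position slicing does not.
    """
    return float(text.replace(",", ""))


samples = ["845.00", "3,104.25", "1,234,567.8"]
parsed = [parse_price(s) for s in samples]
print(parsed)
```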

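Exercise 2

Modify the program so that, after three company ticker symbols are entered,

it plots the three companies' prices on the same chart

★ Format the URL
★ Use a list to store lists

companies = [input("Company 1:"),input("Company 2:"),input("Company 3:")] #set search targets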
#companies = ["AMZN","GOOG","AAPL"]
titles = []
all_prices = []
all_dates = []
for elem in companies:
    # get url
    url = f"https://finance.yahoo.com/quote/{elem}/history?p={elem}"
    
    #Your code...

GOOG AMZN DIS

Full version

from selenium import webdriver
import matplotlib.pyplot as plt
#settings
options = webdriver.ChromeOptions()
options.add_argument("--headless")
driver = webdriver.Chrome(options=options)

companies = [input("Company 1:"),input("Company 2:"),input("Company 3:")] #set search targets
titles = []
all_prices = []
all_dates = []
for elem in companies:
    # get url
    url = f"https://finance.yahoo.com/quote/{elem}/history?p={elem}"
    driver.implicitly_wait(30)
    driver.get(url)

    #get stock title
    title_tag = driver.find_element_by_xpath('//*[@id="quote-header-info"]/div[2]/div[1]/div[1]/h1')
    titles.append(title_tag.text)

    dates = []
    prices = []
    for i in range(30):
        #get date
        date_tag = driver.find_element_by_xpath(f"//section/div[2]/table/tbody/tr[{i+1}]/td[1]/span")
        date = date_tag.text
        date = date[:-6]  # (remove the year)
        dates.insert(0,date)

        # get price
        price = driver.find_element_by_xpath(f"//section/div[2]/table/tbody/tr[{i+1}]/td[5]/span")
        daily_price = price.text
        try:
            daily_price=float(daily_price)
        except ValueError:
            # values such as "3,104.25" contain a thousands separator
            daily_price=float(daily_price.replace(',',''))
        prices.insert(0,daily_price)

    all_dates.append(dates)
    all_prices.append(prices)
driver.close()

# draws three lines
plt.plot(all_dates[0],all_prices[0],color="red",marker="o",label=titles[0])
plt.plot(all_dates[1],all_prices[1],color="green",marker="o",label=titles[1])
plt.plot(all_dates[2],all_prices[2],color="blue",marker="o",label=titles[2])


plt.title("Stocks",fontsize=24) #set title
plt.xticks(fontsize=12, rotation='vertical')#set x axis
plt.yticks(fontsize=12 )#set y axis
plt.xlabel('Day',fontsize=14,labelpad=20)
plt.ylabel('Price',fontsize=14,labelpad=20)
plt.legend(loc="best",fontsize=14) #add legend
plt.grid(True,axis="y") #add grids
plt.show()

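Exercise 3

Fetch 150 days of data

Yahoo Stocks - Tesla

★ Scroll (mouse wheel) control

★ You can remove the --headless argument to make it easier to watch what happens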

driver.execute_script("window.scrollTo(0, Y)")

    driver.execute_script("window.scrollTo(0, 10000)")
    time.sleep(1)
    for i in range(150):
      #Your Code...

The x-axis is too crowded

import matplotlib.dates as mdates
from datetime import datetime

def mon_to_num(month):
    MON2NUM = {
        "Jan": "01",
        "Feb": "02",
        "Mar": "03",
        "Apr": "04",
        "May": "05",
        "Jun": "06",
        "Jul": "07",
        "Aug": "08",
        "Sep": "09",
        "Oct": "10",
        "Nov": "11",
        "Dec": "12",
    }
    return MON2NUM.get(month, "")

def format_date(date_text):
    # e.g. "Jan 05, 2021" -> "01/05/2021"
    return mon_to_num(date_text[:-9]) + "/" + date_text[-8:-6] + "/" + date_text[-4:]

xs = [datetime.strptime(d,'%m/%d/%Y').date() for d in dates]
plt.gca().xaxis.set_major_formatter(mdates.DateFormatter('%m/%d/%Y'))
plt.gca().xaxis.set_major_locator(mdates.DayLocator(interval=10))
plt.gcf().autofmt_xdate()

Convert the dates to datetime objects
and use matplotlib's date utilities to control how often tick labels appear
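As an alternative to mapping month names by hand, `datetime.strptime` can parse the abbreviated month directly with `%b`. This is a sketch under two assumptions: the table shows dates like "Jan 05, 2021" (the format implied by the slicing above), and the program runs under an English or C locale, since `%b` is locale-dependent. `parse_table_date` is a hypothetical helper name.

```python
from datetime import date, datetime


def parse_table_date(text):
    """Parse a date string such as "Jan 05, 2021" into a datetime.date."""
    return datetime.strptime(text, "%b %d, %Y").date()


sample = parse_table_date("Jan 05, 2021")
print(sample)                       # a date object, ready for matplotlib
print(sample.strftime("%m/%d/%Y"))  # same text format that format_date produces
```

The resulting `date` objects can be passed straight to `plt.plot`, so the intermediate "MM/DD/YYYY" string is optional.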

Full Version

from selenium import webdriver
import matplotlib.pyplot as plt
import matplotlib.dates as mdates
from datetime import datetime
import time

def mon_to_num(str):
    if str=="Jan":   return "01"
    elif str=="Feb": return "02"
    elif str=="Mar": return "03"
    elif str=="Apr": return "04"
    elif str=="May": return "05"
    elif str=="Jun": return "06"
    elif str=="Jul": return "07"
    elif str=="Aug": return "08"
    elif str=="Sep": return "09"
    elif str=="Oct": return "10"
    elif str=="Nov": return "11"
    elif str=="Dec": return "12"
def format_date(str):
    new = mon_to_num(str[:-9])+"/"+str[-8:-6]+"/"+str[-4:]
    return new
#settings
options = webdriver.ChromeOptions()
options.add_argument("--headless")
driver = webdriver.Chrome(options=options)

#companies = [input("Company 1:"),input("Company 2:"),input("Company 1:")] #set search targets#
companies = ["AMZN","GOOG","DIS"]
titles = []
all_prices = []
dates = []
for elem in companies:
    # get url
    url = f"https://finance.yahoo.com/quote/{elem}/history?p={elem}"
    driver.implicitly_wait(30)
    driver.get(url)

    #get stock title
    title_tag = driver.find_element_by_tag_name("h1")
    titles.append(title_tag.text)

    prices = []
    driver.execute_script("window.scrollTo(0, 10000)")
    time.sleep(1)
    dates.clear()
    for i in range(150):
        #time.sleep((random.randint(0,100))/1000)
        #get date
        date_tag = driver.find_element_by_xpath(f"//section/div[2]/table/tbody/tr[{i+1}]/td[1]/span")
        date = date_tag.text
        date = format_date(date)
        dates.insert(0,date)

        # get price
        price = driver.find_element_by_xpath(f"//section/div[2]/table/tbody/tr[{i+1}]/td[5]/span")
        daily_price = price.text
        try:
            daily_price=float(daily_price)
        except ValueError:
            # values such as "3,104.25" contain a thousands separator
            daily_price=float(daily_price.replace(',',''))
        prices.insert(0,daily_price)
    all_prices.append(prices)
driver.close()

xs = [datetime.strptime(d,'%m/%d/%Y').date() for d in dates]
plt.gca().xaxis.set_major_formatter(mdates.DateFormatter('%m/%d/%Y'))
plt.gca().xaxis.set_major_locator(mdates.DayLocator(interval=15))
plt.gcf().autofmt_xdate()

plt.plot(xs,all_prices[0],color="red",label=titles[0])
plt.plot(xs,all_prices[1],color="green",label=titles[1])
plt.plot(xs,all_prices[2],color="blue",label=titles[2])

plt.title("Stocks",fontsize=24) #set title
plt.xticks(fontsize=12,rotation="vertical")#set x axis
plt.yticks(fontsize=12 )#set y axis
plt.xlabel('Day',fontsize=14,labelpad=20)
plt.ylabel('Price',fontsize=14,labelpad=20)
plt.legend(loc="best",fontsize=14) #add legend
plt.grid(True,axis="y") #add grids
plt.show()

Better Version (wrapping the program in a function)

from selenium import webdriver
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.common.by import By
import matplotlib.pyplot as plt
import matplotlib.dates as mdates
from datetime import datetime
import time
import re

COLORS=['red','green','blue']
def getColor(i):
    return COLORS[i%len(COLORS)]
MON2NUM={
    "Jan":"01",
    "Feb":"02",
    "Mar":"03",
    "Apr":"04",
    "May":"05",
    "Jun":"06",
    "Jul":"07",
    "Aug":"08",
    "Sep":"09",
    "Oct":"10",
    "Nov":"11",
    "Dec":"12"
}
def mon_to_num(month):
    return MON2NUM.get(month,"")
def format_date(date_text):
    return '/'.join([mon_to_num((x:=re.split(', | ',date_text))[0])]+x[1:])


def generate(*companies):
    # settings
    options = webdriver.ChromeOptions()
    # options.add_argument("--headless")
    driver = webdriver.Chrome(options=options)
    driver.implicitly_wait(30)
    titles = {}
    all_prices = {}
    prices = {}
    for company in companies:
        # get url
        url = f"https://finance.yahoo.com/quote/{company}/history?p={company}"
        driver.get(url)

        # get stock title
        title_tag = driver.find_element_by_tag_name("h1")
        titles[company] = title_tag.text

        driver.execute_script("window.scrollTo(0, 10000)")
        prices.clear()
        for i in range(150):
            # time.sleep((random.randint(0,100))/1000)

            xpath=f"//section/div[2]/table/tbody/tr[{i + 1}]"
            try:
              WebDriverWait(driver,2,0.01).until(EC.presence_of_element_located((By.XPATH,xpath)))
            except:
              pass

            # get date
            date_tag = driver.find_element_by_xpath(xpath+"/td[1]/span")
            date = format_date(date_tag.text)

            # get price
            price = driver.find_element_by_xpath(xpath+"/td[5]/span")
            daily_price = price.text
            daily_price = float(daily_price.replace(',',''))

            prices[date] = daily_price

        all_prices[company]=prices.copy()

    driver.close()

    getDates = lambda dates: [datetime.strptime(d, '%m/%d/%Y').date() for d in dates]
    plt.gca().xaxis.set_major_formatter(mdates.DateFormatter('%m/%d/%Y'))
    plt.gca().xaxis.set_major_locator(mdates.DayLocator(interval=10))
    plt.gcf().autofmt_xdate()
    for i,company in enumerate(companies):
        plt.plot(getDates(all_prices[company].keys()), all_prices[company].values(), color=getColor(i), label=titles[company])
    plt.title("Stocks", fontsize=24)  # set title
    plt.xticks(fontsize=12,rotation="vertical")#set x axis
    plt.yticks(fontsize=12)  # set y axis
    plt.xlabel('Day', fontsize=14, labelpad=20)
    plt.ylabel('Price', fontsize=14, labelpad=20)
    plt.legend(loc="best", fontsize=14)  # add legend
    plt.grid(True, axis="y")  # add grids

    plt.show()
generate("AMZN", "GOOG", "DIS")

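Exercise II. Stonks

a new way to view web crawling

Export the chart as an image

plt.savefig("stocks.png")

plt.show() (which displays the chart) clears the drawn figure, so either skip that line
or call plt.savefig() before it

Looks like it needs some changes...

Adjust the colors so the chart looks like the meme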

getDates = lambda dates: [datetime.strptime(d, '%m/%d/%Y').date() for d in dates]
plt.figure(figsize=(6.68,5.02),facecolor='#1B46AF')
axis = plt.gca()
axis.set_facecolor('#1B46AF')
xaxis = plt.gca().xaxis
xaxis.set_major_formatter(mdates.DateFormatter('%m/%d/%Y'))
xaxis.set_major_locator(mdates.DayLocator(interval=10))
for spine in axis.spines.values():
    spine.set_color('#649FDF')
plt.gcf().autofmt_xdate()
for i,company in enumerate(companies):
    plt.plot(getDates(all_prices[company].keys()), all_prices[company].values(), color=getColor(i), label=titles[company])
# plt.title("Stocks", fontsize=24, color="#649FDF")  # set title
plt.yticks(fontsize=12)  # set y axis
xlabel = plt.xlabel('Day', fontsize=14, labelpad=10)
ylabel = plt.ylabel('Price', fontsize=14, labelpad=10)
xlabel.set_color('#649FDF')
ylabel.set_color('#649FDF')
plt.legend(loc="upper right", fontsize=14) # add legend
plt.grid(True, axis="y")  # add grids
plt.tick_params(color="#649FDF",labelcolor="#649FDF",grid_color='#649FDF')

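Remember to remove the title!

Remember to adjust the figure size!

Remember to adjust the legend position!

The new output image ><

Next, create a new Python file, here called "meme_factory.py"

and name the file that generated the chart "get_stock_graph.py"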

#meme_factory.py

import get_stock_graph

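Import the file from before ><
so that this file can run the crawler

and the chart-generating code from before!

Try running the function we just wrapped up!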

get_stock_graph.generate("AMZN", "GOOG", "DIS")

Hold on a second
How do we process the image?

Of course we use

Pillow
aka PIL

aka Python Imaging Library

Import Pillow

Load the chart from before (.png)

from PIL import Image

imageA = Image.open('stocks.png')
imageA = imageA.convert('RGBA')
widthA , heightA = imageA.size

Load meme man

imageB = Image.open('mememan.png')
imageB = imageB.convert('RGBA')
widthB , heightB = imageB.size

But meme man's width has to be half the width of the whole image (the background),
and the aspect ratio must not change OAO

newWidthB = int(widthA/2)
newHeightB = int(heightB/widthB*newWidthB)

imageB_resize = imageB.resize((newWidthB, newHeightB))
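The size arithmetic can be checked without loading any image. Below is a small sketch with made-up dimensions; `scale_to_width` is a hypothetical helper, and the 668 x 502 background and 500 x 400 overlay sizes are assumptions chosen only for the example.

```python
def scale_to_width(width, height, target_width):
    """Return (new_width, new_height) that keeps the aspect ratio.

    Same formula as in the slides: new height = height / width * new width,
    truncated to an integer number of pixels.
    """
    new_height = int(height / width * target_width)
    return target_width, new_height


# Assumed sizes, for illustration only.
background_width, background_height = 668, 502
overlay_width, overlay_height = 500, 400

# The overlay should be half as wide as the background.
new_size = scale_to_width(overlay_width, overlay_height, int(background_width / 2))

# Top-left corner at which to paste so the overlay sits in the bottom-left corner.
bottom_left = (0, background_height - new_size[1])

print(new_size, bottom_left)
```

The tuple `new_size` is what `Image.resize` expects, and `bottom_left` is the kind of offset passed to `Image.paste`.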

Create a new image

resultPicture = Image.new('RGBA', imageA.size, (0, 0, 0, 0))
resultPicture.paste(imageA,(0,0))

Paste the background (the line chart) onto it

left_bottom = (0, heightA - newHeightB)  # x = 0 puts meme man in the bottom-left corner
resultPicture.paste(imageB_resize, left_bottom, imageB_resize)

Paste the resized meme man onto it

from PIL import ImageDraw, ImageFont

Let's write some text

First import the modules

font = ImageFont.truetype('arial.ttf', 90, encoding='utf-8')

Load the font (typeface, size, encoding)

canvas = ImageDraw.Draw(resultPicture)

Create the canvas (that is, the drawing pen)!

text_position = (widthA/2 +20, heightA/2 +20)
text_color = (255,255,255,255) #color in RGBA
stroke_color = (0,0,0,255) #color in RGBA

Text position and colors (RGBA)

canvas.text(text_position, 
            "stonks", 
            fill=text_color, 
            font=font, 
            stroke_width=2, 
            stroke_fill=stroke_color)

Write the text!!!

resultPicture.save("stonks.png")

Export the image!

Full code (get_stock_graph.py)

from selenium import webdriver
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.common.by import By
import matplotlib.pyplot as plt
import matplotlib.dates as mdates
from datetime import datetime
import os
import re

COLORS=['red','green','blue','yellow','purple']
def getColor(i):
    return COLORS[i%len(COLORS)]
MON2NUM={
    "Jan":"01",
    "Feb":"02",
    "Mar":"03",
    "Apr":"04",
    "May":"05",
    "Jun":"06",
    "Jul":"07",
    "Aug":"08",
    "Sep":"09",
    "Oct":"10",
    "Nov":"11",
    "Dec":"12"
}
def mon_to_num(month):
    return MON2NUM.get(month,"")
def format_date(date_text):
    return '/'.join([mon_to_num((x:=re.split(', | ',date_text))[0])]+x[1:])


def generate(*companies):
    # settings
    options = webdriver.ChromeOptions()
    # options.add_argument("--headless")
    driver = webdriver.Chrome(os.path.join(os.getcwd(), 'koronedriver.exe'), options=options)
    driver.implicitly_wait(30)
    titles = {}
    all_prices = {}
    prices = {}
    for company in companies:
        # get url
        url = f"https://finance.yahoo.com/quote/{company}/history?p={company}"
        driver.get(url)

        # get stock title
        title_tag = driver.find_element_by_tag_name("h1")
        titles[company] = title_tag.text

        driver.execute_script("window.scrollTo(0, 10000)")
        prices.clear()
        for i in range(150):
            # time.sleep((random.randint(0,100))/1000)

            xpath=f"//section/div[2]/table/tbody/tr[{i + 1}]"
            try:
                WebDriverWait(driver,2,0.01).until(EC.presence_of_element_located((By.XPATH,xpath)))
            except:
                pass

            # get date
            date_tag = driver.find_element_by_xpath(xpath+"/td[1]/span")
            date = format_date(date_tag.text)

            # get price
            price = driver.find_element_by_xpath(xpath+"/td[5]/span")
            daily_price = price.text
            daily_price = float(daily_price.replace(',',''))

            prices[date] = daily_price

        all_prices[company]=prices.copy()

    driver.close()

    getDates = lambda dates: [datetime.strptime(d, '%m/%d/%Y').date() for d in dates]
    plt.figure(figsize=(6.68,5.02),facecolor='#1B46AF')
    axis = plt.gca()
    axis.set_facecolor('#1B46AF')
    xaxis = plt.gca().xaxis
    xaxis.set_major_formatter(mdates.DateFormatter('%m/%d/%Y'))
    xaxis.set_major_locator(mdates.DayLocator(interval=10))
    for spine in axis.spines.values():
        spine.set_color('#649FDF')
    plt.gcf().autofmt_xdate()
    for i,company in enumerate(companies):
        plt.plot(getDates(all_prices[company].keys()), all_prices[company].values(), color=getColor(i), label=titles[company])
    # plt.title("Stocks", fontsize=24, color="#649FDF")  # set title
    plt.yticks(fontsize=12)  # set y axis
    xlabel = plt.xlabel('Day', fontsize=14, labelpad=10)
    ylabel = plt.ylabel('Price', fontsize=14, labelpad=10)
    xlabel.set_color('#649FDF')
    ylabel.set_color('#649FDF')
    plt.legend(loc="upper right", fontsize=14) # add legend
    plt.grid(True, axis="y")  # add grids
    plt.tick_params(color="#649FDF",labelcolor="#649FDF",grid_color='#649FDF')


    plt.savefig("stocks.png")
    # plt.show()

Full code (meme_factory.py)

from PIL import Image, ImageDraw, ImageFont
import get_stock_graph
get_stock_graph.generate("AMZN", "GOOG", "DIS")

imageA = Image.open('stocks.png')
imageA = imageA.convert('RGBA')
widthA , heightA = imageA.size


imageB = Image.open('mememan.png')
imageB = imageB.convert('RGBA')
widthB , heightB = imageB.size

newWidthB = int(widthA/2)
newHeightB = int(heightB/widthB*newWidthB)

imageB_resize = imageB.resize((newWidthB, newHeightB))


resultPicture = Image.new('RGBA', imageA.size, (0, 0, 0, 0))
resultPicture.paste(imageA,(0,0))


left_bottom = (0, heightA - newHeightB)  # bottom-left corner
resultPicture.paste(imageB_resize, left_bottom, imageB_resize)

# load the font (typeface, size, encoding)
font = ImageFont.truetype('arial.ttf', 90, encoding='utf-8')


# create the drawing canvas
canvas = ImageDraw.Draw(resultPicture)
text_position = (widthA/2 +20, heightA/2 +20)
text_color = (255,255,255,255) #color in RGBA
stroke_color = (0,0,0,255) #color in RGBA
canvas.text(text_position, "stonks", fill=text_color, font=font, stroke_width=2, stroke_fill=stroke_color)

resultPicture.save("stonks.png")

(Pretty bad) output

To be improved

Orange arrow
Text glow
3D-rotated background (rotation matrix?)

Test it yourself

Exercise III. Spicy Images

Key point

img_links = driver.find_elements_by_xpath('//div[@class="center"]//img')

A different way to fetch the images

import time
import os
import urllib
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
import requests

driver = webdriver.Chrome()
driver.implicitly_wait(30)
driver.get('https://wall.alphacoders.com/tags.php?tid=229')

img_folder = 'D:\\alpha_coders'

if not os.path.isdir(img_folder):
    os.mkdir(img_folder)

curr = int(1)
for i in range(1):
    img_links = driver.find_elements_by_xpath('//div[@class="center"]//img')

    for j in range(len(img_links)):
        if(j%2 == 0):
            img_url = img_links[j].get_attribute('src')
            r = requests.get(img_url)
            with open(img_folder  + '\\' + str(curr) + '.jpg', "wb") as handler:
                handler.write(r.content)
            curr += 1
            
driver.quit()
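The inner loop keeps only the elements at even indexes and numbers the saved files with a running counter. The same selection can be sketched with slicing and `enumerate`; `number_every_other` is a hypothetical helper and the sample strings stand in for the image elements.

```python
def number_every_other(links, start=1):
    """Keep the items at indexes 0, 2, 4, ... and pair each with a file name."""
    kept = links[::2]  # same items as the `j % 2 == 0` test selects
    return [(f"{n}.jpg", link) for n, link in enumerate(kept, start=start)]


# Placeholder values standing in for the elements found on the page.
sample_links = ["img_a", "img_a_copy", "img_b", "img_b_copy", "img_c"]
numbered = number_every_other(sample_links)
print(numbered)
```

Passing the last number plus one as `start` on the next page would continue the numbering, as `curr` does in the loop above.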

Next page?

from selenium.webdriver.common.keys import Keys

nextpage = driver.find_element_by_tag_name('body')
nextpage.send_keys(Keys.ARROW_RIGHT)

Documentation

import time
import os
import urllib
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
import requests
#import urllib
#from urllib.request import urlopen

driver = webdriver.Chrome()
driver.implicitly_wait(30)
driver.get('https://wall.alphacoders.com/tags.php?tid=229')

img_folder = 'D:\\alpha_coders'

if not os.path.isdir(img_folder):
    os.mkdir(img_folder)

curr = int(1)
for i in range(1):
    img_links = driver.find_elements_by_xpath('//div[@class="center"]//img')

    for j in range(len(img_links)):
        if(j%2 == 0):
            img_url = img_links[j].get_attribute('src')
            r = requests.get(img_url)
            with open(img_folder  + '\\' + str(curr) + '.jpg', "wb") as handler:
                handler.write(r.content)
            curr += 1
    
    nextpage = driver.find_element_by_tag_name('body')
    nextpage.send_keys(Keys.ARROW_RIGHT)
        

driver.quit()

Last of all

Adapt to each website
Don't understand something? Look it up!
Don't break the law > <
