Speaker: Chia
Date: Oct. 10th, 2020
A vector: a displacement that has both a direction and a magnitude.
(Figure: decomposing a vector into its components using trigonometric functions)
What do we want to obtain from the vector dot product?
The angle θ between the two vectors.
Dot product formula: a⋅b = |a| |b| cosθ
Here |a| (written ‖A‖) and |b| (‖B‖) are the vectors' lengths (norms); |b| cosθ is the length of the adjacent side, i.e. the projection of b onto a.
The smaller the angle, the closer cos(θ) is to 1,
and the more similar the two vectors are.
Rearranged: a⋅b / (|a| |b|) = cos(θ)
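To make this concrete, here is a tiny worked example with two hypothetical 2-D vectors, recovering the angle from the cosine:

```python
import math

# Toy 2-D vectors (hypothetical): a points along the x-axis, b at 45 degrees
a, b = (1, 0), (1, 1)

dot = sum(x * y for x, y in zip(a, b))               # a.b = 1
cos_theta = dot / (math.hypot(*a) * math.hypot(*b))  # 1 / (1 * sqrt(2))

print(round(cos_theta, 4))                           # 0.7071
print(round(math.degrees(math.acos(cos_theta))))     # 45
```

cos(θ) ≈ 0.707 corresponds to the 45° angle between the two vectors; identical directions would give exactly 1.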
Feature extraction algorithm
(Ex: TF-IDF)
+ cosine similarity computation
How to use?
# Vectors
vec_a = [1, 2, 3, 4, 5]
vec_b = [1, 3, 5, 7, 9]
# Dot and norm
dot = sum(a*b for a, b in zip(vec_a, vec_b))
norm_a = sum(a*a for a in vec_a) ** 0.5
norm_b = sum(b*b for b in vec_b) ** 0.5
# Cosine similarity
cos_sim = dot / (norm_a*norm_b)
# Results
print('My version:', cos_sim)
# My version: 0.9972413740548081
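The same computation can also be written with NumPy (an optional dependency, not otherwise used in this talk): `@` is the dot product and `np.linalg.norm` is the vector length.

```python
import numpy as np

vec_a = np.array([1, 2, 3, 4, 5])
vec_b = np.array([1, 3, 5, 7, 9])

# dot product divided by the product of the two norms
cos_sim = vec_a @ vec_b / (np.linalg.norm(vec_a) * np.linalg.norm(vec_b))
print('NumPy version:', cos_sim)  # matches the manual version (≈ 0.99724)
```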
from sklearn.metrics.pairwise import cosine_similarity
# Vectors
vec_a = [1, 2, 3, 4, 5]
vec_b = [1, 3, 5, 7, 9]
# Results
print('Scikit-Learn:', cosine_similarity([vec_a], [vec_b]))
# Scikit-Learn: [[0.99724137]]
sklearn version
# Install the required packages
pip install scikit-learn pandas
# Install the Taiwan Traditional Chinese fork of the jieba tokenizer
pip install git+https://github.com/APCLab/jieba-tw.git
Design notes
from sklearn.feature_extraction.text import TfidfTransformer, CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity
import pandas as pd
import jieba
def stopword(docs_seg_list):
    stopWords = []
    with open('stopWords.txt', 'r', encoding='UTF-8') as file:
        for data in file.readlines():
            data = data.strip()
            stopWords.append(data)
    new_docs_seg = []
    for doc in docs_seg_list:
        doc = doc.split()
        remainderWords = list(filter(lambda a: a not in stopWords and a != '\n', doc))
        # print(remainderWords)
        new_docs_seg.append(' '.join(remainderWords))
    # print(new_docs_seg)
    return new_docs_seg
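A minimal, self-contained sketch of how stopword() behaves. The stop-word file contents here are hypothetical, and the filtering logic is a compact restatement of the function above:

```python
# Create a tiny hypothetical stop-word file: one word per line
with open('stopWords.txt', 'w', encoding='UTF-8') as f:
    f.write('嗎\n有\n')

def stopword(docs_seg_list):
    # Load the stop-word list, then drop those words from each document
    with open('stopWords.txt', 'r', encoding='UTF-8') as file:
        stopWords = [line.strip() for line in file]
    return [' '.join(w for w in doc.split() if w not in stopWords)
            for doc in docs_seg_list]

print(stopword(['圖書館 有 提供 掃描 服務 嗎']))
# ['圖書館 提供 掃描 服務']
```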
# Step 1: tokenize the documents (word segmentation with jieba)
docs = ["圖書館有提供掃描服務嗎?", "圖書館有提供影印及列印服務嗎?", "圖書館裡面可以吃東西嗎?"]
docs_seg = []
for doc in docs:
    seg = jieba.cut(doc)
    # seg = stopword(seg)
    docs_seg.append(' '.join(seg))
print(docs_seg)
# Step 2: count how many times each term appears in each document
# Pass a custom tokenizer to override the default tokenization;
# otherwise single Chinese characters are silently dropped
# See https://github.com/scikit-learn/scikit-learn/issues/7251#issuecomment-242897897
vectorizer = CountVectorizer(tokenizer=lambda x: x.split())
text_count_vector = vectorizer.fit_transform(docs_seg)
tf_vector = text_count_vector.toarray()
# Step 3: compute TF-IDF
tfidf_transformer = TfidfTransformer()
docs_tfidf = tfidf_transformer.fit_transform(text_count_vector)
# Note: get_feature_names() was removed in newer scikit-learn; use get_feature_names_out()
df = pd.DataFrame(docs_tfidf.T.toarray(), index=vectorizer.get_feature_names_out())
# Step 4: TF-IDF + cosine similarity
print('============TF-IDF + Cosine Similarity============')
print('Documents:\n{}\t{}\nTF-IDF + cosine similarity: {}\n'.format(docs[0], docs[1], cosine_similarity([df[0]], [df[1]])))
print('Documents:\n{}\t{}\nTF-IDF + cosine similarity: {}\n'.format(docs[0], docs[2], cosine_similarity([df[0]], [df[2]])))
print('Documents:\n{}\t{}\nTF-IDF + cosine similarity: {}\n'.format(docs[1], docs[2], cosine_similarity([df[1]], [df[2]])))
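As a footnote, scikit-learn's TfidfVectorizer fuses Steps 2 and 3 (CountVectorizer + TfidfTransformer), and cosine_similarity returns the full pairwise matrix when given a single matrix. A self-contained sketch using the same three documents, segmented by hand here (a hypothetical segmentation) instead of with jieba:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hand-segmented (space-joined) versions of the three documents above
docs_seg = ['圖書館 有 提供 掃描 服務 嗎',
            '圖書館 有 提供 影印 及 列印 服務 嗎',
            '圖書館 裡面 可以 吃 東西 嗎']

tfidf = TfidfVectorizer(tokenizer=lambda x: x.split())
X = tfidf.fit_transform(docs_seg)

# 3x3 matrix: entry [i, j] is the similarity between documents i and j
print(cosine_similarity(X))
```

The diagonal is 1.0 (every document is identical to itself), and the matrix is symmetric, so the three pairwise print calls above correspond to its upper triangle.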