Speaker: Chia
Date: Oct. 10th, 2020
A vector: a displacement that has both a direction and a magnitude.
(Figure: decomposing a vector into its components using trigonometric functions)
What do we want to obtain from the vector dot product?
The angle θ between the two vectors.
Dot product formula: a⋅b = |a| |b| cosθ
Here |a| (written ‖A‖) and |b| (‖B‖) are the vectors' lengths (norms); |b| cosθ is the length of the adjacent side, i.e. the projection of b onto a.
The smaller the angle, the closer cos(θ) is to 1,
and the more similar the two vectors are.
Rearranged: a⋅b / (|a| |b|) = cos(θ)
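To make this concrete, here is a tiny worked example with two hypothetical 2-D vectors, recovering the angle from the cosine:

```python
import math

# Toy 2-D vectors (hypothetical): a points along the x-axis, b at 45 degrees
a, b = (1, 0), (1, 1)

dot = sum(x * y for x, y in zip(a, b))               # a.b = 1
cos_theta = dot / (math.hypot(*a) * math.hypot(*b))  # 1 / (1 * sqrt(2))

print(round(cos_theta, 4))                           # 0.7071
print(round(math.degrees(math.acos(cos_theta))))     # 45
```

cos(θ) ≈ 0.707 corresponds to the 45° angle between the two vectors; identical directions would give exactly 1.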
Feature extraction algorithm
(Ex: TF-IDF)
+ cosine similarity computation
How to use?
# Vectors
vec_a = [1, 2, 3, 4, 5]
vec_b = [1, 3, 5, 7, 9]
# Dot and norm
dot = sum(a*b for a, b in zip(vec_a, vec_b))
norm_a = sum(a*a for a in vec_a) ** 0.5
norm_b = sum(b*b for b in vec_b) ** 0.5
# Cosine similarity
cos_sim = dot / (norm_a*norm_b)
# Results
print('My version:', cos_sim)
# My version: 0.9972413740548081
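The same computation can also be written with NumPy (an optional dependency, not otherwise used in this talk): `@` is the dot product and `np.linalg.norm` is the vector length.

```python
import numpy as np

vec_a = np.array([1, 2, 3, 4, 5])
vec_b = np.array([1, 3, 5, 7, 9])

# dot product divided by the product of the two norms
cos_sim = vec_a @ vec_b / (np.linalg.norm(vec_a) * np.linalg.norm(vec_b))
print('NumPy version:', cos_sim)  # matches the manual version (≈ 0.99724)
```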
from sklearn.metrics.pairwise import cosine_similarity
# Vectors
vec_a = [1, 2, 3, 4, 5]
vec_b = [1, 3, 5, 7, 9]
# Results
print('Scikit-Learn:', cosine_similarity([vec_a], [vec_b]))
# Scikit-Learn: [[0.99724137]]
sklearn version
# Install the required packages
pip install scikit-learn pandas
# Install the Taiwan Traditional Chinese fork of the jieba tokenizer
pip install git+https://github.com/APCLab/jieba-tw.git
Design notes
from sklearn.feature_extraction.text import TfidfTransformer, CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity
import pandas as pd
import jieba
def stopword(docs_seg_list):
    stopWords = []
    with open('stopWords.txt', 'r', encoding='UTF-8') as file:
        for data in file.readlines():
            data = data.strip()
            stopWords.append(data)
    new_docs_seg = []
    for doc in docs_seg_list:
        doc = doc.split()
        remainderWords = list(filter(lambda a: a not in stopWords and a != '\n', doc))
        # print(remainderWords)
        new_docs_seg.append(' '.join(remainderWords))
    # print(new_docs_seg)
    return new_docs_seg
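A minimal, self-contained sketch of how stopword() behaves. The stop-word file contents here are hypothetical, and the filtering logic is a compact restatement of the function above:

```python
# Create a tiny hypothetical stop-word file: one word per line
with open('stopWords.txt', 'w', encoding='UTF-8') as f:
    f.write('嗎\n有\n')

def stopword(docs_seg_list):
    # Load the stop-word list, then drop those words from each document
    with open('stopWords.txt', 'r', encoding='UTF-8') as file:
        stopWords = [line.strip() for line in file]
    return [' '.join(w for w in doc.split() if w not in stopWords)
            for doc in docs_seg_list]

print(stopword(['圖書館 有 提供 掃描 服務 嗎']))
# ['圖書館 提供 掃描 服務']
```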
# Step 1: tokenize the documents (word segmentation with jieba)
docs = ["圖書館有提供掃描服務嗎?", "圖書館有提供影印及列印服務嗎?", "圖書館裡面可以吃東西嗎?"]
docs_seg = []
for doc in docs:
    seg = jieba.cut(doc)
    # seg = stopword(seg)
    docs_seg.append(' '.join(seg))
print(docs_seg)
# Step 2: count how many times each term appears in each document
# Pass a custom tokenizer to override the default tokenization;
# otherwise single Chinese characters are silently dropped
# See https://github.com/scikit-learn/scikit-learn/issues/7251#issuecomment-242897897
vectorizer = CountVectorizer(tokenizer=lambda x: x.split())
text_count_vector = vectorizer.fit_transform(docs_seg)
tf_vector = text_count_vector.toarray()
# Step 3: compute TF-IDF
tfidf_transformer = TfidfTransformer()
docs_tfidf = tfidf_transformer.fit_transform(text_count_vector)
# Note: get_feature_names() was removed in newer scikit-learn; use get_feature_names_out()
df = pd.DataFrame(docs_tfidf.T.toarray(), index=vectorizer.get_feature_names_out())
# Step 4: TF-IDF + cosine similarity
print('============TF-IDF + Cosine Similarity============')
print('Documents:\n{}\t{}\nTF-IDF + cosine similarity: {}\n'.format(docs[0], docs[1], cosine_similarity([df[0]], [df[1]])))
print('Documents:\n{}\t{}\nTF-IDF + cosine similarity: {}\n'.format(docs[0], docs[2], cosine_similarity([df[0]], [df[2]])))
print('Documents:\n{}\t{}\nTF-IDF + cosine similarity: {}\n'.format(docs[1], docs[2], cosine_similarity([df[1]], [df[2]])))
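As a footnote, scikit-learn's TfidfVectorizer fuses Steps 2 and 3 (CountVectorizer + TfidfTransformer), and cosine_similarity returns the full pairwise matrix when given a single matrix. A self-contained sketch using the same three documents, segmented by hand here (a hypothetical segmentation) instead of with jieba:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hand-segmented (space-joined) versions of the three documents above
docs_seg = ['圖書館 有 提供 掃描 服務 嗎',
            '圖書館 有 提供 影印 及 列印 服務 嗎',
            '圖書館 裡面 可以 吃 東西 嗎']

tfidf = TfidfVectorizer(tokenizer=lambda x: x.split())
X = tfidf.fit_transform(docs_seg)

# 3x3 matrix: entry [i, j] is the similarity between documents i and j
print(cosine_similarity(X))
```

The diagonal is 1.0 (every document is identical to itself), and the matrix is symmetric, so the three pairwise print calls above correspond to its upper triangle.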