An Introduction to N-grams and Their Applications
Lecturer: Lamuyag
Date: Oct. 18th, 2020
OUTLINE
- What is an N-gram
- Bigram and trigram model formulas
- Practical use
- Lab
- References
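What is an N-gram
Each fragment obtained by sliding a window of size N over the text is one gram
Assumption: the occurrence of the N-th word depends only on the preceding N-1 words
Commonly used
Bi-Gram (bigram model), Tri-Gram (trigram model)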
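A minimal sketch of how a token sequence is sliced into grams. The helper name `extract_ngrams` is hypothetical and introduced only for illustration; the sentence is a plain English example tokenized on whitespace.

```python
def extract_ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


# <s> and </s> mark the start and end of the sentence
tokens = "<s> I want English food </s>".split()

bigrams = extract_ngrams(tokens, 2)   # pairs of adjacent tokens
trigrams = extract_ngrams(tokens, 3)  # triples of adjacent tokens

print(bigrams)
print(trigrams)
```

A sequence of L tokens yields L - N + 1 grams, so the six tokens above give five bigrams and four trigrams.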

Bigram model formula
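In the standard maximum-likelihood form, the bigram and trigram estimates and the resulting bigram sentence probability can be written as follows. The notation is an assumption introduced here: $C(\cdot)$ is the number of times a word sequence occurs in the corpus, $w_0$ is the start marker `<s>`, and $w_n$ is the end marker `</s>`.

```latex
\[
P(w_i \mid w_{i-1}) = \frac{C(w_{i-1}\, w_i)}{C(w_{i-1})}
\qquad
P(w_i \mid w_{i-2}, w_{i-1}) = \frac{C(w_{i-2}\, w_{i-1}\, w_i)}{C(w_{i-2}\, w_{i-1})}
\]
\[
P(w_1 \dots w_n) \approx \prod_{i=1}^{n} P(w_i \mid w_{i-1})
\]
```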

Worked example




Judging plausibility from a corpus
s1 = "<s> I want English food </s>"
s2 = "<s> want I English food </s>"

Judging plausibility from a corpus
P(s1)
= P(i|<s>) P(want|i) P(english|want) P(food|english) P(</s>|food)
= 0.25 × 0.33 × 0.0011 × 0.5 × 0.68 = 0.000031
P(s2)
= P(want|<s>) P(i|want) P(english|i) P(food|english) P(</s>|food)
= 0.25 × 0.0022 × 0.0011 × 0.5 × 0.68 = 0.0000002057

Judging plausibility from a corpus
P(s1) = 0.000031
P(s2) = 0.0000002057
P(s1) > P(s2), so s1 is the more plausible sentence.
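The comparison can be reproduced with a short sketch. The table below holds only the bigram probabilities quoted on the slides; the dictionary `bigram_prob` and the function `sentence_probability` are hypothetical names, and the third factor of P(s2) is keyed as the pair ("i", "english") with the value 0.0011 that the slide uses in that position.

```python
from math import prod

# Bigram probabilities P(current | previous), as quoted on the slides.
# Keys are (previous, current) pairs in lower case.
bigram_prob = {
    ("<s>", "i"): 0.25,
    ("i", "want"): 0.33,
    ("want", "english"): 0.0011,
    ("english", "food"): 0.5,
    ("food", "</s>"): 0.68,
    ("<s>", "want"): 0.25,
    ("want", "i"): 0.0022,
    ("i", "english"): 0.0011,  # value used for the third factor of P(s2)
}


def sentence_probability(sentence):
    """Multiply the bigram probabilities of every adjacent token pair."""
    tokens = sentence.lower().split()
    return prod(bigram_prob[(prev, cur)] for prev, cur in zip(tokens, tokens[1:]))


p_s1 = sentence_probability("<s> I want English food </s>")
p_s2 = sentence_probability("<s> want I English food </s>")

print(f"P(s1) = {p_s1:.3g}")
print(f"P(s2) = {p_s2:.3g}")
print("s1 is more plausible" if p_s1 > p_s2 else "s2 is more plausible")
```

A pair missing from the table raises a `KeyError` here; a real model would need counts for every pair it may encounter.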
Implementation
import json
import re
from collections import Counter, namedtuple

# Path to the news dataset: a JSON list of articles, each with a 'detailcontent' field
DATASET_DIR = './WebNews.json'
with open(DATASET_DIR, encoding='utf8') as f:
    dataset = json.load(f)

# Take the body text of every article
seg_list = [d['detailcontent'] for d in dataset]
# Match every character outside U+4E00 to U+9FA5, i.e. anything that is not a Chinese character
rule = re.compile(r"[^\u4e00-\u9fa5]")
seg_list = [rule.sub('', seg) for seg in seg_list]


def ngram(documents, N=2):
    ngram_prediction = dict()
    total_grams = list()
    words = list()
    Word = namedtuple('Word', ['word', 'prob'])
    for doc in documents:
        # Character-level tokens, wrapped in sentence boundary markers
        split_words = ['<s>'] + list(doc) + ['</s>']
        # Numerator: every run of N consecutive characters
        total_grams.extend(tuple(split_words[i:i+N]) for i in range(len(split_words)-N+1))
        # Denominator: every run of N-1 consecutive characters
        words.extend(tuple(split_words[i:i+N-1]) for i in range(len(split_words)-N+2))
    total_word_counter = Counter(total_grams)
    word_counter = Counter(words)
    for key in total_word_counter:
        word = ''.join(key[:N-1])
        if word not in ngram_prediction:
            ngram_prediction[word] = set()
        next_word_prob = total_word_counter[key] / word_counter[key[:N-1]]
        # A set drops duplicate (character, probability) pairs: if "縣" following
        # "桃園" were computed with probability 1 twice, only one entry is kept.
        w = Word(key[-1], '{:.3g}'.format(next_word_prob))
        ngram_prediction[word].add(w)
    return ngram_prediction


# Use a trigram, i.e. the probability of the third character given the two before it.
# Sort each result so the most probable next characters can be read off directly.
tri_prediction = ngram(seg_list, N=3)
for word, ng in tri_prediction.items():
    # prob is stored as a formatted string, so convert to float before comparing
    tri_prediction[word] = sorted(ng, key=lambda x: float(x.prob), reverse=True)

# Predict the character that follows the input
text = '桃園'
# .get avoids a KeyError when the context never occurs in the corpus
next_words = tri_prediction.get(text, [])[:5]
for next_word in next_words:
    print('next word: {}, probability: {}'.format(next_word.word, next_word.prob))
References
An Introduction to N-grams and Their Applications
By Lamuyang