Lecturer: Chia
Singular / Plural
NLTK
Natural Language Toolkit
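A Python-based natural language processing toolkit that makes text analysis easier.
Advantages:
Includes many corpora that can be downloaded.
Makes text preprocessing convenient.
Can be used to process your own texts.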
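To illustrate the last point, here is a minimal sketch of loading your own text into NLTK. It assumes NLTK is installed; the sample sentence and the regular-expression tokenizer are placeholders chosen for illustration (not NLTK's own tokenizer), so no corpus or model download is needed.

```python
import re
import nltk

# Placeholder for your own text (hypothetical sample, not from any corpus).
my_text = "The cat sat on the mat. The dog sat too."

# Simple regex tokenization: runs of word characters, or single punctuation marks.
tokens = re.findall(r"\w+|[^\w\s]", my_text)

# Wrap the token list in nltk.Text to get NLTK's text-analysis methods.
text = nltk.Text(tokens)
print(text.count('sat'))          # how often the token 'sat' occurs

# Frequency distribution over case-folded tokens.
freq = nltk.FreqDist(t.lower() for t in tokens)
print(freq.most_common(3))
```

For real documents you would read the string from a file instead of hard-coding it.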
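When collecting a corpus
Try to balance the material across different genres and moods
Genres: news, editorials, literature, chat...
Moods: declarative, imperative, conditional...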
Gutenberg Corpus (literary works)
Web and Chat Text (informal text)
Brown Corpus
Reuters Corpus
Inaugural Address Corpus (presidential inaugural addresses)
Annotated Text Corpora (linguistically annotated texts)
Corpora in Other Languages
Isolated: no internal structure
Categorized: organized by category
Overlapping: categories sometimes overlap
Temporal: organized over time
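The four structures listed above can be pictured as plain Python data. This is only an illustrative sketch, not NLTK's internal representation: the variable names are assumptions, and the sample entries are copied from the NLTK outputs shown later in these slides.

```python
# Illustrative sketch only: plain-Python pictures of the four corpus structures.
# Variable names are hypothetical; sample values come from outputs in these slides.

# Isolated: a flat collection of texts with no further organization.
isolated = ['austen-emma.txt', 'austen-persuasion.txt', 'austen-sense.txt', 'bible-kjv.txt']

# Categorized: each text belongs to one category.
categorized = {'cg22': 'belles_lettres'}

# Overlapping: one text may belong to several categories at once.
overlapping = {'test/15618': ['barley', 'corn', 'grain', 'wheat']}

# Temporal: texts are ordered along a time dimension (here, the year in the file ID).
temporal = ['1789-Washington.txt', '1793-Washington.txt', '1797-Adams.txt']

print(categorized['cg22'])
print(len(overlapping['test/15618']))   # number of categories for one document
```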
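The Gutenberg Corpus contains a small selection of texts from the Project Gutenberg electronic text archive, which holds about 25,000 free electronic books.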
>>> import nltk
>>> nltk.corpus.gutenberg.fileids()
Output:
['austen-emma.txt', 'austen-persuasion.txt', 'austen-sense.txt', 'bible-kjv.txt', ...]
Structure: isolated (no internal structure)
>>> emma = nltk.corpus.gutenberg.raw('austen-emma.txt')
>>> emma
Output:
'[Emma by Jane Austen 1816]\n\nVOLUME I\n\nCHAPTER I\n\n\nEmma Woodhouse,
handsome, clever, and rich, with a comfortable home\nand happy disposition,
seemed to unite some of the best blessings\nof existence; and had lived nearly
twenty-one years in the world\nwith very little to distress or vex her. ...'
>>> emma = nltk.corpus.gutenberg.words('austen-emma.txt')
>>> emma
Output:
['[', 'Emma', 'by', 'Jane', 'Austen', '1816', ']', ...]
>>> emma = nltk.corpus.gutenberg.sents('austen-emma.txt')
>>> emma
Output:
[['[', 'Emma', 'by', 'Jane', 'Austen', '1816', ']'], ['VOLUME', 'I'], ...]
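The three views above (raw characters, words, sentences) can be combined into simple statistics. The sketch below is an assumption-laden illustration: `corpus_stats` is a hypothetical helper, and the toy data is just the opening of the outputs shown above, so the block runs without NLTK.

```python
def corpus_stats(raw, words, sents):
    """Return (average word length, average sentence length, lexical diversity)."""
    avg_word_len = len(raw) / len(words)     # characters per token (whitespace included)
    avg_sent_len = len(words) / len(sents)   # tokens per sentence
    vocab = set(w.lower() for w in words)    # distinct tokens, case-folded
    diversity = len(words) / len(vocab)      # average uses per distinct token
    return avg_word_len, avg_sent_len, diversity

# Toy data: the opening of the raw / words / sents outputs shown above.
raw = '[Emma by Jane Austen 1816]\n\nVOLUME I'
words = ['[', 'Emma', 'by', 'Jane', 'Austen', '1816', ']', 'VOLUME', 'I']
sents = [['[', 'Emma', 'by', 'Jane', 'Austen', '1816', ']'], ['VOLUME', 'I']]

print(corpus_stats(raw, words, sents))
```

With NLTK and the Gutenberg data installed, the same helper could be called with `nltk.corpus.gutenberg.raw(fileid)`, `.words(fileid)`, and `.sents(fileid)` for any file ID.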
>>> import nltk
>>> emma = nltk.Text(nltk.corpus.gutenberg.words('austen-emma.txt'))
>>> emma.concordance("surprize")
Output:
Displaying 25 of 37 matches:
er father , was sometimes taken by 'surprize' at his being still able to pity `
hem do the other any good ." " You 'surprize' me ! Emma must do Harriet good : a
Knightley actually looked red with 'surprize' and displeasure , as he stood up ,
r . Elton , and found to his great 'surprize' , that Mr . Elton was actually on
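To make clear what a concordance shows, here is a minimal keyword-in-context sketch in plain Python. It is not NLTK's implementation: the `concordance` function, its `width` parameter (counted in tokens rather than characters), and the sample tokens (taken from one output line above) are assumptions for illustration.

```python
def concordance(tokens, target, width=3):
    """Return each match of target with up to `width` tokens of context on each side."""
    target = target.lower()
    lines = []
    for i, tok in enumerate(tokens):
        if tok.lower() == target:                           # case-insensitive match
            left = ' '.join(tokens[max(0, i - width):i])    # context before the match
            right = ' '.join(tokens[i + 1:i + 1 + width])   # context after the match
            lines.append((left, tok, right))
    return lines

# Sample tokens based on one of the concordance lines above.
tokens = 'You surprize me ! Emma must do Harriet good'.split()

for left, word, right in concordance(tokens, 'surprize'):
    print(f'{left:>20} {word} {right}')
```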
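Web and Chat Text: the chat posts have been anonymized, with usernames replaced by "UserNNN" and personal information removed by hand.
Structure: isolated (no internal structure)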
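Corpora in other languages: NLTK includes corpora in many languages, such as the udhr corpus.
It contains the Universal Declaration of Human Rights in over 300 languages.
Structure: isolated (no internal structure)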
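The Brown Corpus was the first million-word electronic corpus of English.
It contains 500 texts, categorized by genre.
It is used to study systematic differences between genres (stylistics).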
>>> from nltk.corpus import brown
>>> brown.categories()  # list all categories
Output:
['adventure', 'belles_lettres', 'editorial', 'fiction', 'government', 'hobbies', 'humor',
'learned', 'lore', 'mystery', 'news', 'religion', 'reviews', 'romance', 'science_fiction']
>>> brown.categories(fileids=['cg22'])
Output:
['belles_lettres']
Structure: categorized (organized by category)
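The Reuters Corpus contains 10,788 news documents.
Its categories overlap, because a single news story often covers multiple topics.
Structure: overlapping (categories sometimes overlap)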
>>> from nltk.corpus import reuters
>>> reuters.categories()
Output:
['acq', 'alum', 'barley', 'bop', 'carcass', 'castor-oil', ...]
>>> reuters.categories('test/15618')
Output:
['barley', 'corn', 'grain', 'wheat']
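Because categories overlap, a lookup is useful in both directions: from a document to its categories, and from a category to its documents. The sketch below inverts such a mapping in plain Python. The first entry comes from the output above; the second document, `'example/0001'`, and its categories are hypothetical and added only so the overlap is visible.

```python
from collections import defaultdict

# Document -> categories. The first entry is from the output above;
# 'example/0001' and its categories are hypothetical, added for illustration.
doc_categories = {
    'test/15618': ['barley', 'corn', 'grain', 'wheat'],
    'example/0001': ['grain', 'wheat'],
}

# Invert the mapping: category -> documents.
category_docs = defaultdict(list)
for fileid, categories in doc_categories.items():
    for category in categories:
        category_docs[category].append(fileid)

print(category_docs['wheat'])    # documents tagged 'wheat'
print(category_docs['barley'])   # documents tagged 'barley'
```

In NLTK itself, `reuters.fileids('barley')` and `reuters.categories(fileid)` provide these two directions.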
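The Inaugural Address Corpus contains 55 texts.
There is one text per presidential address; its distinctive property is the time dimension (the year).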
>>> from nltk.corpus import inaugural
>>> inaugural.fileids()
Output:
['1789-Washington.txt', '1793-Washington.txt', '1797-Adams.txt', ...]
Structure: temporal (organized by time)
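The time dimension can be recovered from the file IDs themselves, since each begins with a four-digit year. The following is a small sketch in plain Python using the three file IDs shown above; the variable names are assumptions, and the slicing relies on the `YYYY-Name.txt` pattern visible in that output.

```python
# File IDs copied from the output above; each follows the pattern 'YYYY-Name.txt'.
fileids = ['1789-Washington.txt', '1793-Washington.txt', '1797-Adams.txt']

# The first four characters are the year of the address.
years = [int(fileid[:4]) for fileid in fileids]

# Group years by president: the name sits between the hyphen and the '.txt' suffix.
by_president = {}
for fileid in fileids:
    name = fileid[5:-4]
    by_president.setdefault(name, []).append(int(fileid[:4]))

print(years)
print(by_president)
```

With NLTK installed, the same slicing could be applied to the full list returned by `inaugural.fileids()`.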