Getting to Know Corpora
Introduction to Natural Language Processing
Lecturer: Chia
Outline
- Text vs. Corpus/Corpora
- Corpora included in NLTK
- Internal structure of corpora
Text vs. Corpus/Corpora
- Text
  - A text (in the sense of literary theory) is any object that can be read.
- Corpus (singular) / corpora (plural)
  - Text corpus: in linguistics, a large and structured set of texts, nowadays usually stored and processed electronically.
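Corpora included in NLTK
- NLTK (Natural Language Toolkit)
  - A Python-based toolkit for natural language processing that makes text analysis easier.
  - Advantages:
    - Comes with many corpora that can be downloaded.
    - Makes text preprocessing convenient.
    - Can also be used to process your own texts.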
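The last advantage can be tried without downloading anything. The sketch below is plain Python, not NLTK's own loader: it writes two made-up files to a temporary directory and reads them back as a mapping from file id to token list. The file names, the sample sentences, and the helper `load_own_corpus` are illustrative assumptions; the regular expression is chosen to resemble the word/punctuation split of NLTK's `WordPunctTokenizer`.

```python
import re
import tempfile
from pathlib import Path

# Word/punctuation pattern; an assumption chosen to resemble NLTK's WordPunctTokenizer.
TOKEN_PATTERN = re.compile(r"\w+|[^\w\s]+")


def load_own_corpus(root):
    """Map each .txt file name under root to its list of tokens."""
    corpus = {}
    for path in sorted(Path(root).glob("*.txt")):
        text = path.read_text(encoding="utf-8")
        corpus[path.name] = TOKEN_PATTERN.findall(text)
    return corpus


# Hypothetical sample files, created only for this demonstration.
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "note-a.txt").write_text("Corpora are collections of texts.", encoding="utf-8")
    Path(tmp, "note-b.txt").write_text("NLTK can read them, too!", encoding="utf-8")
    my_corpus = load_own_corpus(tmp)

for fileid, tokens in my_corpus.items():
    print(fileid, len(tokens), tokens[:4])
```

With NLTK installed, `nltk.corpus.PlaintextCorpusReader` gives similar access to a directory of your own files through `fileids()` and `words()`.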
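- When collecting corpus material
  - Aim for a balanced spread across genres and moods.
    - Genres: news, editorials, literature, chat, ...
    - Moods: indicative, imperative, conditional, ...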
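One simple way to check how balanced a collection is: count the tokens in each genre and compare each genre's share of the total. The counts below are invented purely for illustration, and `genre_shares` is a hypothetical helper, not part of NLTK.

```python
def genre_shares(token_counts):
    """Return each genre's share of the total token count, as a fraction."""
    total = sum(token_counts.values())
    if total == 0:
        return {genre: 0.0 for genre in token_counts}
    return {genre: count / total for genre, count in token_counts.items()}


# Made-up token counts per genre (illustrative only).
counts = {"news": 4000, "editorial": 3000, "literature": 2000, "chat": 1000}
shares = genre_shares(counts)

for genre, share in shares.items():
    print(f"{genre:<12}{share:.0%}")
```

A strongly skewed table suggests that results computed on the whole collection will mostly reflect the dominant genre.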
- Corpora covered in these slides
  - Gutenberg Corpus (literary works)
  - Web and Chat Text (informal text)
  - Brown Corpus
  - Reuters Corpus
  - Inaugural Address Corpus (presidential inaugural addresses)
  - Annotated Text Corpora (linguistically annotated text)
  - Corpora in Other Languages
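Internal structure of corpora
- Isolated: no internal structure
- Categorized: texts are grouped by category
- Overlapping: categories sometimes overlap
- Temporal: texts are organized by time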
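The four structures can be pictured with ordinary Python data. This is a conceptual sketch with toy values, not how NLTK stores its corpora; the `doc1`-style ids, the labels, and the helper `files_in_category` are made up for illustration.

```python
# Isolated: just a collection of files, no further organization.
isolated = ["austen-emma.txt", "bible-kjv.txt"]

# Categorized: every file belongs to exactly one category.
categorized = {"doc1": "news", "doc2": "fiction", "doc3": "news"}

# Overlapping: a file may belong to several categories at once.
overlapping = {"doc1": {"grain", "wheat"}, "doc2": {"grain", "corn"}}

# Temporal: every file carries a point in time, so files can be ordered.
temporal = {"1789-Washington.txt": 1789, "1797-Adams.txt": 1797, "1793-Washington.txt": 1793}


def files_in_category(mapping, category):
    """List files labelled with category, for single-label or multi-label mappings."""
    result = []
    for fileid, labels in mapping.items():
        label_set = {labels} if isinstance(labels, str) else set(labels)
        if category in label_set:
            result.append(fileid)
    return sorted(result)


print(files_in_category(categorized, "news"))
print(files_in_category(overlapping, "grain"))
print(sorted(temporal, key=temporal.get))  # files in chronological order
```

The only difference between the categorized and overlapping cases is whether a file maps to one label or to a set of labels.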
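Gutenberg Corpus
- Contains a small selection of texts from the Project Gutenberg electronic text archive, which holds about 25,000 free electronic books.
- Which texts does the gutenberg corpus contain?
>>> import nltk  # if the corpus is missing, run nltk.download('gutenberg') once
>>> nltk.corpus.gutenberg.fileids()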
Output:
['austen-emma.txt', 'austen-persuasion.txt', 'austen-sense.txt', 'bible-kjv.txt', ...]
Structure: isolated (no internal structure)
- View the austen-emma text without any processing (one raw string):
>>> emma = nltk.corpus.gutenberg.raw('austen-emma.txt')
>>> emma
Output:
'[Emma by Jane Austen 1816]\n\nVOLUME I\n\nCHAPTER I\n\n\nEmma Woodhouse,
handsome, clever, and rich, with a comfortable home\nand happy disposition,
seemed to unite some of the best blessings\nof existence; and had lived nearly
twenty-one years in the world\nwith very little to distress or vex her. ...'
- View the austen-emma text as a list of words:
>>> emma = nltk.corpus.gutenberg.words('austen-emma.txt')
>>> emma
Output:
['[', 'Emma', 'by', 'Jane', 'Austen', '1816', ']', ...]
- View the austen-emma text as a list of sentences (each sentence is a list of words):
>>> emma = nltk.corpus.gutenberg.sents('austen-emma.txt')
>>> emma
Output:
[['[', 'Emma', 'by', 'Jane', 'Austen', '1816', ']'], ['VOLUME', 'I'], ...]
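The three views differ only in how much segmentation has been applied. The plain-Python sketch below mimics them on a short made-up sample; the helper names and the naive sentence splitter are assumptions for illustration, and real sentence segmentation is more careful than this.

```python
import re

# Made-up sample text (illustrative only).
raw = "Emma was clever. She was rich, too! Was she happy?"


def to_words(text):
    """Split text into word and punctuation tokens."""
    return re.findall(r"\w+|[^\w\s]+", text)


def to_sents(text):
    """Naively split after '.', '!' or '?' and tokenize each sentence."""
    pieces = re.split(r"(?<=[.!?])\s+", text.strip())
    return [to_words(piece) for piece in pieces if piece]


words = to_words(raw)
sents = to_sents(raw)

print(len(raw), "characters")
print(len(words), "tokens:", words[:5])
print(len(sents), "sentences:", sents[0])
```

In the same way, calling `len()` on the results of `raw()`, `words()`, and `sents()` gives the number of characters, tokens, and sentences in a text.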
- Use .concordance() to see the contexts in which a word occurs:
>>> import nltk
>>> emma = nltk.Text(nltk.corpus.gutenberg.words('austen-emma.txt'))
>>> emma.concordance("surprize")
Output:
Displaying 25 of 37 matches:
er father , was sometimes taken by surprize at his being still able to pity `
hem do the other any good ." " You surprize me ! Emma must do Harriet good : a
Knightley actually looked red with surprize and displeasure , as he stood up ,
r . Elton , and found to his great surprize , that Mr . Elton was actually on
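To make clear what a concordance does, here is a minimal plain-Python version: it finds each occurrence of a word in a token list and returns a window of neighbouring tokens on both sides. `simple_concordance`, its `width` parameter (counted in tokens), and the sample tokens are hypothetical; NLTK's `Text.concordance()` additionally aligns the hits in fixed-width lines and limits how many it displays.

```python
def simple_concordance(tokens, word, width=3):
    """Return (left, hit, right) context tuples for each case-insensitive match."""
    target = word.lower()
    hits = []
    for i, token in enumerate(tokens):
        if token.lower() == target:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            hits.append((left, token, right))
    return hits


# Made-up sample tokens (illustrative only).
tokens = "You surprize me ! It was a great surprize to him .".split()

for left, hit, right in simple_concordance(tokens, "surprize"):
    print(f"{left:>20} [{hit}] {right}")
```

Right-aligning the left context, as the `print` call does, is what lines the matched word up in a single column.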
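Web and Chat Text
- A collection of text from the web.
  - For example: forum content, instant-messaging chat sessions, spoken conversations, movie scripts, advertisements, and reviews.
- The chat text has been anonymized: usernames are replaced with generic names of the form "UserNNN", and personal information was removed by hand.
Structure: isolated (no internal structure)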
Corpora in Other Languages
- NLTK includes corpora in many languages, for example the udhr corpus.
  - It contains the Universal Declaration of Human Rights in over 300 languages.
Structure: isolated (no internal structure)
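Brown Corpus
- The first million-word electronic corpus of English.
- It contains 500 texts, categorized by genre.
- Useful for studying systematic differences between genres, a line of inquiry known as stylistics.
>>> from nltk.corpus import brown
>>> brown.categories()  # list all categories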
Output:
['adventure', 'belles_lettres', 'editorial', 'fiction', 'government', 'hobbies', 'humor',
'learned', 'lore', 'mystery', 'news', 'religion', 'reviews', 'romance', 'science_fiction']
>>> brown.categories(fileids=['cg22'])
Output:
['belles_lettres']
Structure: categorized (texts are grouped by category)
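A typical stylistics question is how often certain words occur in each genre. The sketch below counts a few modal verbs per category with `collections.Counter`. The tiny token lists stand in for real genre samples and are made up, as is the helper `count_targets`.

```python
from collections import Counter


def count_targets(words_by_category, targets):
    """For each category, count how often each target word occurs (case-insensitive)."""
    wanted = {t.lower() for t in targets}
    table = {}
    for category, words in words_by_category.items():
        counts = Counter(w.lower() for w in words if w.lower() in wanted)
        table[category] = {t: counts[t] for t in sorted(wanted)}
    return table


# Made-up token lists standing in for two genres (illustrative only).
samples = {
    "news": "The board will meet and may vote . It will decide soon .".split(),
    "romance": "She could not stay . He could not leave . They may meet .".split(),
}

table = count_targets(samples, ["will", "may", "could"])
for category, counts in table.items():
    print(category, counts)
```

With the Brown Corpus loaded, `brown.words(categories='news')` would supply the word list for one category in place of the toy samples.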
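Reuters Corpus
- Contains 10,788 news documents from Reuters.
- Its categories overlap, because a single news story often covers several topics.
Structure: overlapping (categories sometimes overlap)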
>>> from nltk.corpus import reuters
>>> reuters.categories()
Output:
['acq', 'alum', 'barley', 'bop', 'carcass', 'castor-oil', ...]
>>> reuters.categories('test/15618')
Output:
['barley', 'corn', 'grain', 'wheat']
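Because categories overlap, lookups go in both directions: from a document to its categories, and from a category to its documents. The sketch below inverts a document-to-categories mapping in plain Python. The entry for `test/15618` copies the output above; `doc-2` and the helper `invert` are hypothetical.

```python
from collections import defaultdict


def invert(doc_to_categories):
    """Build a category -> sorted list of documents index."""
    index = defaultdict(list)
    for doc, categories in doc_to_categories.items():
        for category in categories:
            index[category].append(doc)
    return {category: sorted(docs) for category, docs in index.items()}


doc_to_categories = {
    "test/15618": ["barley", "corn", "grain", "wheat"],  # from the output above
    "doc-2": ["grain", "rice"],                          # hypothetical document
}

category_to_docs = invert(doc_to_categories)
print(category_to_docs["grain"])
print(category_to_docs["barley"])
```

In NLTK the same two directions are available as `reuters.categories(fileid)` and `reuters.fileids(category)`.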
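Inaugural Address Corpus
- Contains 55 texts, one for each presidential address.
- Its distinguishing property is the time dimension: each file name begins with the year of the address.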
>>> from nltk.corpus import inaugural
>>> inaugural.fileids()
Output:
['1789-Washington.txt', '1793-Washington.txt', '1797-Adams.txt', ...]
Structure: temporal (texts are organized by time)
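Since each file name starts with the year, the time dimension can be recovered from the file id itself. The sketch uses only the three file names shown in the output above and assumes every id follows the same `YYYY-Name.txt` pattern; `parse_fileid` is a hypothetical helper.

```python
fileids = ["1789-Washington.txt", "1793-Washington.txt", "1797-Adams.txt"]


def parse_fileid(fileid):
    """Split 'YYYY-Name.txt' into (year, name)."""
    year, rest = fileid.split("-", 1)
    return int(year), rest.rsplit(".", 1)[0]


parsed = [parse_fileid(f) for f in fileids]
years = [year for year, _ in parsed]

print(parsed)
print("span:", min(years), "to", max(years))
```

Once the year is available as a number, texts can be sorted or grouped by period, which is what makes this corpus useful for tracking how language use changes over time.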
Thanks for listening.
NLP-Corpora
By BessyHuang