Getting to Know Corpora

Introduction to Natural Language Processing

Lecturer: Chia

Outline

Text vs. Corpus/ Corpora
Corpora Included in NLTK
Internal Structure of Corpora

Text vs. Corpus/ Corpora

Text
- A text (in the sense of literary theory) is any object that can be read.

Text vs. Corpus/ Corpora

Corpus / Corpora
- Text corpus
- In linguistics, a large and structured set of texts (nowadays usually stored and processed electronically).

"Corpus" is the singular form; "corpora" is the plural.

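Corpora Included in NLTK

NLTK
- Natural Language Toolkit
- A Python-based toolkit for natural language processing that makes text analysis easier.
- Advantages:
  - Includes many corpora that can be downloaded.
  - Makes text preprocessing convenient.
  - Can also be used to process your own texts.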

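Corpora Included in NLTK

When collecting material for a corpus
- Aim for a balanced distribution across different genres and moods
  - Genre: news, editorials, literature, chat...
  - Mood: declarative, imperative, conditional...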

Corpora Included in NLTK

Gutenberg Corpus (literary works)
Web and Chat Text (informal text)
Brown Corpus
Reuters Corpus
Inaugural Address Corpus (presidential inaugural speeches)
Annotated Text Corpora (linguistically annotated text)
Corpora in Other Languages

Internal Structure of Corpora

Isolated: no particular structure
Categorized: texts divided by category
Overlapping: categories sometimes overlap
Temporal: organized over time
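
As a rough illustration of these four shapes, the sketch below models each one with plain Python data structures. The file names, categories, and years are hypothetical placeholders (they are not taken from any NLTK corpus), and `categories_of` is a helper invented for this example.

```python
# Hypothetical miniature corpora; every name below is a placeholder.

# Isolated: just a collection of texts, with no further organization.
isolated = ["text_a.txt", "text_b.txt", "text_c.txt"]

# Categorized: each text belongs to exactly one category.
categorized = {"text_a.txt": "news", "text_b.txt": "fiction", "text_c.txt": "news"}

# Overlapping: a text may belong to several categories at once.
overlapping = {"text_a.txt": {"grain", "wheat"}, "text_b.txt": {"grain"}}

# Temporal: each text is tied to a point in time (here, a year).
temporal = {1789: "speech_1789.txt", 1793: "speech_1793.txt"}


def categories_of(fileid, mapping):
    """Return the sorted categories of one text, whatever the mapping's shape."""
    value = mapping[fileid]
    # A categorized corpus stores a single string; an overlapping one stores a set.
    return [value] if isinstance(value, str) else sorted(value)


print(categories_of("text_a.txt", categorized))
print(categories_of("text_a.txt", overlapping))
print(sorted(temporal))  # texts can be ordered by year
```

The only difference between the categorized and overlapping shapes is whether a text maps to one label or to several.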

Gutenberg Corpus

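Contains a small selection of texts from the Project Gutenberg electronic text archive, which holds about 25,000 free electronic books.

Which texts does the gutenberg corpus contain?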

>>> import nltk
>>> nltk.corpus.gutenberg.fileids()

Output:
['austen-emma.txt', 'austen-persuasion.txt', 'austen-sense.txt', 'bible-kjv.txt', ...]

Isolated: no particular structure

Gutenberg Corpus

What does the austen-emma text look like with no processing (raw)?

>>> emma = nltk.corpus.gutenberg.raw('austen-emma.txt')
>>> emma

Output:
'[Emma by Jane Austen 1816]\n\nVOLUME I\n\nCHAPTER I\n\n\nEmma Woodhouse, 
handsome, clever, and rich, with a comfortable home\nand happy disposition, 
seemed to unite some of the best blessings\nof existence; and had lived nearly 
twenty-one years in the world\nwith very little to distress or vex her. ...'

Gutenberg Corpus

What does the austen-emma text look like as a list of words?

>>> emma = nltk.corpus.gutenberg.words('austen-emma.txt')
>>> emma

Output:
['[', 'Emma', 'by', 'Jane', 'Austen', '1816', ']', ...]

What does the austen-emma text look like as a list of sentences?

>>> emma = nltk.corpus.gutenberg.sents('austen-emma.txt')
>>> emma

Output:
[['[', 'Emma', 'by', 'Jane', 'Austen', '1816', ']'], ['VOLUME', 'I'], ...]
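
The raw, word, and sentence views differ only in granularity. The sketch below mimics them on a short made-up string with a deliberately naive regex tokenizer. It is not how NLTK's corpus readers tokenize, and `to_words` and `to_sents` are hypothetical helper names used only here.

```python
import re

# A made-up sample string standing in for the raw text.
raw = "Emma Woodhouse was handsome. She was clever, and rich!"


def to_words(text):
    """Split text into word tokens and single punctuation tokens (naive)."""
    return re.findall(r"\w+|[^\w\s]", text)


def to_sents(text):
    """Split after '.', '!' or '?', then tokenize each sentence into words."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [to_words(part) for part in parts if part]


words = to_words(raw)   # flat list of tokens, like .words()
sents = to_sents(raw)   # list of token lists, like .sents()

print(len(raw))         # the raw view is measured in characters
print(words[:5])
print(len(sents))
```

Flattening the sentence view gives back the word view, which is why the two counts in the check below agree.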

Gutenberg Corpus

Use .concordance() to find the contexts in which a word occurs

>>> import nltk
>>> emma = nltk.Text(nltk.corpus.gutenberg.words('austen-emma.txt'))
>>> emma.concordance("surprize")

Output:
Displaying 25 of 37 matches:
er father , was sometimes taken by surprize at his being still able to pity `
hem do the other any good ." " You surprize me ! Emma must do Harriet good : a
Knightley actually looked red with surprize and displeasure , as he stood up ,
r . Elton , and found to his great surprize , that Mr . Elton was actually on
...
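
To show the idea behind a concordance (a keyword shown with its surrounding context), here is a minimal sketch in plain Python. It is not NLTK's implementation: the `concordance` function below is a hypothetical stand-in that counts context in tokens rather than characters, and the token list is taken from one of the output lines above.

```python
def concordance(tokens, target, width=3):
    """Return each occurrence of target with up to `width` tokens of context per side."""
    lines = []
    for i, token in enumerate(tokens):
        if token.lower() == target.lower():
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            # Mark the keyword with brackets so it stands out in the line.
            lines.append(f"{left} [{token}] {right}".strip())
    return lines


tokens = "You surprize me ! Emma must do Harriet good".split()
for line in concordance(tokens, "surprize"):
    print(line)
```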

Web and Chat Text

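Collects text from the web
- For example: forum posts, instant-messaging chat sessions, spoken conversations, movie scripts, advertisements, and reviews.
Usernames have been anonymized as "UserNNN", and personal information has been removed manually.

Isolated: no particular structure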

Corpora in Other Languages

Corpora in many languages, for example the udhr corpus.
- Contains the Universal Declaration of Human Rights in over 300 languages.

Isolated: no particular structure

Brown Corpus

The first million-word electronic corpus of English.
- The corpus contains 500 texts, categorized by genre.
- Used to study differences between genres (stylistics).

>>> from nltk.corpus import brown
>>> brown.categories()  # list all categories
Output:
['adventure', 'belles_lettres', 'editorial', 'fiction', 'government', 'hobbies', 'humor', 
 'learned', 'lore', 'mystery', 'news', 'religion', 'reviews', 'romance', 'science_fiction']

>>> brown.categories(fileids=['cg22'])
Output:
['belles_lettres']

Categorized: texts divided by category

Reuters Corpus

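The Reuters corpus contains 10,788 news documents.
- Categories overlap with one another, because a news story often covers multiple topics.

Overlapping: categories sometimes overlap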

>>> from nltk.corpus import reuters
>>> reuters.categories()
Output:
['acq', 'alum', 'barley', 'bop', 'carcass', 'castor-oil', ...]

>>> reuters.categories('test/15618')
Output:
['barley', 'corn', 'grain', 'wheat']
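
Because categories overlap, the relation between files and categories is many-to-many, so it is often useful to look it up in both directions. The sketch below inverts a small file-to-categories mapping in plain Python. Only the `test/15618` entry mirrors the output above; the other two file ids and the `invert` helper are made up for illustration.

```python
from collections import defaultdict

# File -> categories. The first entry mirrors the output above;
# "doc/a" and "doc/b" are hypothetical.
file_categories = {
    "test/15618": ["barley", "corn", "grain", "wheat"],
    "doc/a": ["grain", "wheat"],
    "doc/b": ["acq"],
}


def invert(mapping):
    """Build a category -> sorted list of file ids index."""
    index = defaultdict(list)
    for fileid, cats in mapping.items():
        for cat in cats:
            index[cat].append(fileid)
    return {cat: sorted(ids) for cat, ids in index.items()}


category_files = invert(file_categories)
print(category_files["wheat"])
print(category_files["acq"])
```

With the real corpus, the same reverse lookup is available by passing a category to `reuters.fileids()`.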

Inaugural Address Corpus

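The Inaugural Address Corpus contains 55 texts.
- There is one text per presidential address; its most distinctive property is the time dimension (year).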

>>> from nltk.corpus import inaugural
>>> inaugural.fileids()

Output:
['1789-Washington.txt', '1793-Washington.txt', '1797-Adams.txt', ...]

Temporal: organized over time
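
Since each file id begins with a four-digit year, the time dimension can be recovered by slicing the name. This is a minimal sketch that uses only the three file ids shown above. The helper names are hypothetical, and the slicing assumes every file id follows the `YYYY-Name.txt` pattern.

```python
from collections import Counter

# The three file ids shown in the output above.
fileids = ["1789-Washington.txt", "1793-Washington.txt", "1797-Adams.txt"]


def year_of(fileid):
    """Read the year from the first four characters of the file id."""
    return int(fileid[:4])


def president_of(fileid):
    """Read the name between the 'YYYY-' prefix and the '.txt' suffix."""
    return fileid[5:-4]


years = [year_of(f) for f in fileids]
addresses_per_president = Counter(president_of(f) for f in fileids)

print(years)
print(addresses_per_president)
```

Sorting or grouping by `year_of` is then enough to study how word usage changes from one address to the next.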

Thanks for listening.