Getting to Know Corpora

Introduction to Natural Language Processing

Lecturer: Chia

Outline

  • Text vs. Corpus/ Corpora

  • Corpora Included in NLTK

  • Internal Structure of Corpora

Text vs. Corpus/ Corpora

  • Text

    • A text (in the sense of literary theory) is any object that can be read.

Text vs. Corpus/ Corpora

  • Corpus (singular) / Corpora (plural)
    • Text corpus
    • In linguistics, a large and structured set of texts, nowadays usually stored and processed electronically.

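Corpora Included in NLTK

  • NLTK

    • Natural Language Toolkit

    • A Python-based toolkit for natural language processing that makes text analysis easier.

    • Advantages:

      • Ships with many corpora that can be downloaded.

      • Makes text preprocessing convenient.

      • Can also be used to process your own texts.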
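
To illustrate the last point, processing your own text, here is a minimal sketch. It assumes NLTK is installed and uses its `PlaintextCorpusReader`; the temporary directory, the file name `my_notes.txt`, and the sample sentence are made up for this demonstration.

```python
import os
import tempfile

from nltk.corpus import PlaintextCorpusReader

# Create a throwaway directory holding one small text file (hypothetical content).
corpus_root = tempfile.mkdtemp()
with open(os.path.join(corpus_root, "my_notes.txt"), "w", encoding="utf-8") as f:
    f.write("Hello corpus. This is my own text.")

# Treat every .txt file under corpus_root as one document of the corpus.
my_corpus = PlaintextCorpusReader(corpus_root, r".*\.txt")

file_ids = my_corpus.fileids()
words = list(my_corpus.words("my_notes.txt"))
print(file_ids)
print(words)
```

The resulting object is queried with `fileids()` and `words()`, the same style of call used for the built-in corpora in the following slides.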

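Corpora Included in NLTK

  • Collecting corpus material

    • Aim for a balanced distribution across genres and moods

      • Genres: news, editorials, literature, chat...

      • Moods: indicative, imperative, conditional...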

Corpora Included in NLTK

  • Gutenberg Corpus (literary works)

  • Web and Chat Text (informal text)

  • Brown Corpus

  • Reuters Corpus

  • Inaugural Address Corpus (U.S. presidential inaugural addresses)

  • Annotated Text Corpora (linguistically annotated text)

  • Corpora in Other Languages

Internal Structure of Corpora

  • Isolated: no internal structure

  • Categorized: divided by category

  • Overlapping: categories sometimes overlap

  • Temporal: organized over time

Gutenberg Corpus

  • Which texts does the gutenberg corpus contain?
>>> import nltk
>>> nltk.corpus.gutenberg.fileids()

Output:
['austen-emma.txt', 'austen-persuasion.txt', 'austen-sense.txt', 'bible-kjv.txt', ...]

Isolated (no internal structure)

Gutenberg Corpus

  • How do we view the austen-emma text with no processing at all?
>>> emma = nltk.corpus.gutenberg.raw('austen-emma.txt')
>>> emma

Output:
'[Emma by Jane Austen 1816]\n\nVOLUME I\n\nCHAPTER I\n\n\nEmma Woodhouse, 
handsome, clever, and rich, with a comfortable home\nand happy disposition, 
seemed to unite some of the best blessings\nof existence; and had lived nearly 
twenty-one years in the world\nwith very little to distress or vex her. ...'

Gutenberg Corpus

  • How do we view the austen-emma text word by word?
>>> emma = nltk.corpus.gutenberg.words('austen-emma.txt')
>>> emma

Output:
['[', 'Emma', 'by', 'Jane', 'Austen', '1816', ']', ...]

  • How do we view the austen-emma text sentence by sentence?
>>> emma = nltk.corpus.gutenberg.sents('austen-emma.txt')
>>> emma

Output:
[['[', 'Emma', 'by', 'Jane', 'Austen', '1816', ']'], ['VOLUME', 'I'], ...]
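
The three views above (raw string, flat word list, list of sentences) can be imitated in plain Python to make the nesting explicit. This is a simplified stand-in, not NLTK's tokenizers: the regular expression and the rule "one blank-line-separated block equals one sentence" are assumptions chosen only to fit the opening lines of the text.

```python
import re

# Opening of the raw text shown earlier.
raw = "[Emma by Jane Austen 1816]\n\nVOLUME I\n\nCHAPTER I"

def tokenize(text):
    # Runs of letters/digits are words; any other non-space character is its own token.
    return re.findall(r"\w+|[^\w\s]", text)

# Naive segmentation: treat each blank-line-separated block as one "sentence".
blocks = [b for b in re.split(r"\n\s*\n", raw) if b.strip()]

words = tokenize(raw)                  # flat list of tokens
sents = [tokenize(b) for b in blocks]  # list of token lists

print(words[:7])
print(sents[:2])
print(len(raw), len(words), len(sents))
```

The last line prints the size of the same text at three granularities: characters, tokens, and sentences.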

Gutenberg Corpus

  • Use .concordance() to find the contexts in which a word occurs
>>> import nltk
>>> emma = nltk.Text(nltk.corpus.gutenberg.words('austen-emma.txt'))
>>> emma.concordance("surprize")

Output:
Displaying 25 of 37 matches:
er father , was sometimes taken by 'surprize' at his being still able to pity `
hem do the other any good ." " You 'surprize' me ! Emma must do Harriet good : a
Knightley actually looked red with 'surprize' and displeasure , as he stood up ,
r . Elton , and found to his great 'surprize' , that Mr . Elton was actually on
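
To show the idea behind a concordance view, here is a small keyword-in-context sketch in plain Python. It is not NLTK's implementation; the function name `kwic` and its `width` parameter are hypothetical, and the sample tokens are taken from one of the lines above.

```python
def kwic(tokens, target, width=30):
    """Return one keyword-in-context line per case-insensitive match of target."""
    lines = []
    for i, token in enumerate(tokens):
        if token.lower() != target.lower():
            continue
        # Join neighbouring tokens, then trim each side to `width` characters.
        left = " ".join(tokens[max(0, i - width):i])[-width:]
        right = " ".join(tokens[i + 1:i + 1 + width])[:width]
        # Right-align the left context so the keyword lines up in a column.
        lines.append(f"{left:>{width}} {token} {right}")
    return lines

tokens = "You surprize me ! Emma must do Harriet good".split()
lines = kwic(tokens, "surprize", width=10)
for line in lines:
    print(line)
```

Padding the left context to a fixed width is what makes every match appear in the same column, as in the output above.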

Web and Chat Text

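  • Collects text from the web
    • For example: forum posts, instant-messaging chat sessions, spoken conversations, movie scripts, advertisements, and reviews.
  • Usernames have been anonymized as "UserNNN", and personal information has been removed by hand.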

Isolated (no internal structure)

Corpora in Other Languages

  • Corpora in many languages, for example the udhr corpus.

    • It contains the Universal Declaration of Human Rights in over 300 languages.

Isolated (no internal structure)

Brown Corpus

  • The first million-word electronic corpus of English.

    • It contains 500 texts, categorized by genre.

    • Used to study differences between genres (stylistics).

>>> from nltk.corpus import brown
>>> brown.categories()  # list all categories
Output:
['adventure', 'belles_lettres', 'editorial', 'fiction', 'government', 'hobbies', 'humor', 
 'learned', 'lore', 'mystery', 'news', 'religion', 'reviews', 'romance', 'science_fiction']

>>> brown.categories(fileids=['cg22'])
Output:
['belles_lettres']

Categorized (divided by category)
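
One way a categorized corpus supports comparing genres is to count the same words in each category. The sketch below assumes only that the corpus object offers `categories()` and `words(categories=...)`, as `brown` does; the helper `modal_counts` and the class `ToyCorpus`, with its two invented one-line "genres", are hypothetical and exist only so the example runs without downloading any data.

```python
from collections import Counter

def modal_counts(corpus, modals):
    """Count how often each modal verb occurs in every category of a corpus.

    `corpus` only needs categories() and words(categories=...).
    """
    table = {}
    for category in corpus.categories():
        counts = Counter(w.lower() for w in corpus.words(categories=category))
        table[category] = {m: counts[m] for m in modals}
    return table

class ToyCorpus:
    """Hypothetical stand-in with two tiny, invented 'genres'."""
    _texts = {
        "news": "The council will meet and may vote".split(),
        "romance": "She could not say what she would do".split(),
    }

    def categories(self):
        return sorted(self._texts)

    def words(self, categories):
        return self._texts[categories]

table = modal_counts(ToyCorpus(), ["will", "may", "could", "would"])
for category, counts in table.items():
    print(category, counts)
```

With the Brown data installed, passing `brown` instead of `ToyCorpus()` should produce one row of counts per genre.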

Reuters Corpus

  • The Reuters Corpus contains 10,788 news documents.

    • Categories overlap because a news story often covers multiple topics.

Overlapping (categories sometimes overlap)

>>> from nltk.corpus import reuters
>>> reuters.categories()
Output:
['acq', 'alum', 'barley', 'bop', 'carcass', 'castor-oil', ...]

>>> reuters.categories('test/15618')
Output:
['barley', 'corn', 'grain', 'wheat']
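
The overlap can be pictured as a many-to-many mapping between files and categories. In this plain-Python sketch the first entry repeats the output above, while the second file ID `demo/00001` and its labels are hypothetical, added only to show two files sharing a category.

```python
from collections import defaultdict

# File -> categories. First entry: from the output above.
# Second entry: hypothetical, for illustration only.
file_categories = {
    "test/15618": ["barley", "corn", "grain", "wheat"],
    "demo/00001": ["grain", "trade"],
}

# Invert the mapping: category -> files.
category_files = defaultdict(list)
for file_id, categories in file_categories.items():
    for category in categories:
        category_files[category].append(file_id)

# Files that carry more than one label are what make the structure "overlapping".
multi_label = [f for f, cats in file_categories.items() if len(cats) > 1]

print(category_files["grain"])
print(multi_label)
```

In a strictly categorized corpus every file would carry exactly one label, so `multi_label` would be empty.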

Inaugural Address Corpus

  • The Inaugural Address Corpus contains 55 texts.

    • One text per presidential address; its distinctive property is the time dimension (the year).

>>> from nltk.corpus import inaugural
>>> inaugural.fileids()

Output:
['1789-Washington.txt', '1793-Washington.txt', '1797-Adams.txt', ...]

Temporal (organized over time)
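
Because each file ID starts with a four-digit year, the time dimension can be recovered with simple string slicing. This sketch uses only the three file IDs shown in the output above; the variable names are illustrative.

```python
from collections import defaultdict

# The three file IDs shown in the output above.
file_ids = ["1789-Washington.txt", "1793-Washington.txt", "1797-Adams.txt"]

# The first four characters of each file ID encode the year of the address.
years = [int(file_id[:4]) for file_id in file_ids]

# Group the years by the name between the hyphen and the ".txt" extension.
by_president = defaultdict(list)
for file_id in file_ids:
    year, name = file_id[:-len(".txt")].split("-", 1)
    by_president[name].append(int(year))

print(years)
print(dict(by_president))
```

Once the year is available as a number, the texts can be sorted or plotted along a timeline.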

Thanks for listening.

NLP-Corpora

By BessyHuang
