Ultra-Fine Entity Typing

 

Eunsol Choi

Omer Levy

Yejin Choi

Luke Zettlemoyer*

 

Paul G. Allen School of Computer Science & Engineering, University of Washington

*Allen Institute for Artificial Intelligence, Seattle WA


Abstract

  • A new entity typing task.
  • New evaluation sets.
  • A new model that can predict open types.
    • Achieves state-of-the-art performance.
    • Sets a baseline for the new dataset.

Introduction

Example

  • Bill robbed John. He was arrested.

    • "Bill", "he" are both "criminal".
      • Due to "robbing" & "arresting".
    • "John" is a "victim".
      • Because he was "robbed".

New Task

  • Given a sentence with a target entity mention.
  • Predict free-form noun phrases (NPs) that describe the target entity.

Existing Datasets

  • Labels are heavily skewed toward coarse-grained types.
  • e.g. the OntoNotes dataset marks about half of its mentions as "other".

New Dataset

  • More diverse and fine-grained.

Task & Data

Task

  • Given a sentence and an entity mention \(e\).
  • Predict a set of natural-language phrases \(T\) that describe the type of \(e\).
  • The selection of \(T\) is context-sensitive.
    • e.g. "Bill Gates has donated billions to eradicate malaria."
    • "Bill Gates" should be typed as "philanthropist" but not "inventor".

Data

  • About 6K mentions via crowdsourcing.
  • Using a large type vocabulary.

Sentence Source

  • Gigaword
  • OntoNotes
  • Web Articles
    • Via links to Wikipedia

Automatic Mention Detection

  • Maximal NPs from a constituency parser.
  • Mentions from a coreference resolution system.
  • e.g. In 1817, in collaboration with David Hare, he set up the Hindu College.

Annotators

  • 5 workers from Mechanical Turk provide labels for each example.
  • The type vocabulary: about 10K frequent NPs from Wiktionary.
  • WordNet is used to expand the labels with synonyms and hypernyms (see the sketch below).
  • Only types selected by at least 3 of the 5 annotators are kept.
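
A minimal sketch of the WordNet expansion step, assuming NLTK with the WordNet corpus installed (`nltk.download("wordnet")`); the paper's exact expansion rules may differ:

```python
from nltk.corpus import wordnet as wn

def expand_type(word):
    """Expand a candidate type with WordNet synonyms and hypernyms."""
    expanded = set()
    for synset in wn.synsets(word, pos=wn.NOUN):
        # Synonyms: other lemmas in the same synset.
        expanded.update(l.name().replace("_", " ") for l in synset.lemmas())
        # Hypernyms: more general types, one level up.
        for hyper in synset.hypernyms():
            expanded.update(l.name().replace("_", " ") for l in hyper.lemmas())
    return expanded

print(expand_type("detective"))  # e.g. {'detective', 'investigator', ...}
```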

Data Analysis

  • Each type is classified into 3 disjoint bins:
    • 9 general types
      • e.g. person, location.
    • 121 fine-grained types
      • Such as film, athlete.
      • Mapped to labels from prior work.
    • 10,201 ultra-fine types
      • Encompassing every other label in the type space.
      • e.g. detective, lawsuit.

Data Analysis

  • 6K examples.
  • About 5 labels per example on average:
    • 0.9 general types.
    • 0.6 fine-grained types.
    • 3.9 ultra-fine types.
  • 2.3K unique types overall.
  • 429 types are needed to cover 80% of the labels.

Type Coverage

  • To cover 80% of the labels:
    • FIGER requires only 7 types.
    • OntoNotes requires only 4 types.
    • The new dataset requires 429 types (see the sketch below).
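
A small sketch of how such coverage numbers can be computed from a frequency table of gold labels (the counts below are made up for illustration):

```python
from collections import Counter

def types_to_cover(freq, fraction=0.8):
    """Greedy count of most frequent types needed to cover `fraction` of labels."""
    total = sum(freq.values())
    covered = 0
    for n, (_, count) in enumerate(freq.most_common(), start=1):
        covered += count
        if covered >= fraction * total:
            return n
    return len(freq)

freq = Counter({"person": 50, "location": 30, "detective": 2, "lawsuit": 1})
print(types_to_cover(freq))  # -> 2
```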

Mention Coverage

  • Existing datasets focus on named entity mentions.
    • OntoNotes also contains nominal expressions.
  • The new dataset:
    • 40% pronouns.
    • 38% nominal expressions.
    • 22% named entity mentions.

Distant Supervision

Distant Supervision

  • Training data for fine-grained NER systems is typically obtained by:
    • Linking entity mentions to a knowledge base (KB).
    • Drawing their types from the KB.
  • Limitations:
    • Recall can suffer due to KB incompleteness.
    • Precision can suffer when the selected types do not fit the context.

Recall Problem

  • Mine entity mentions that were linked to Wikipedia pages.
  • Extract types from their encyclopedic definitions.

Precision Problem

  • A new source of distant supervision.
  • Nominal head words are automatically extracted from raw text.
  • Using head words as a form of distant supervision provides fine-grained information about mentions.
    • e.g. "The 44th president of the United States" → head word "president".

Entity Linking

  • The first sentence of a Wikipedia page often states the entity's type via an "is a" relation.
  • Descriptions were extracted for 3.1M entities, containing 4.6K unique type labels.

Contextualized Supervision

  • Many nominal entity mentions include type information in their head words.
  • Nominal head words are extracted with a dependency parser from Gigaword and Wikilinks.
  • All words are lowercased and plurals are converted to singular (see the sketch below).
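
A minimal sketch of head-word extraction with spaCy; the paper ran a dependency parser over Gigaword and Wikilinks, so spaCy and its small English model are stand-in assumptions here (requires `python -m spacy download en_core_web_sm`):

```python
import spacy

nlp = spacy.load("en_core_web_sm")

def head_word_type(mention):
    """Return the normalized syntactic head of a nominal mention."""
    doc = nlp(mention)
    root = next(doc.sents).root  # head of the dependency tree
    # Lowercase and singularize via the lemma.
    return root.lemma_.lower()

print(head_word_type("The 44th president of the United States"))  # "president"
```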

Model

Model

  • The architecture is based on the neural AttentiveNER model:
    • Improving the representations.
    • Introducing a new multitask objective to handle multiple sources of supervision.

Context Representation

  • Given a sentence \(x_1, ..., x_n\)
  • Represent each token \(x_i\) using a pre-trained word embedding \(w_i\).
  • Concatenate an additional location embedding \(l_i\).
    • Whether \(x_i\) is before, inside, or after the mention.

Context Representation

  • Use \([w_i;l_i]\) as the input to a bidirectional LSTM.
    • Producing a contextualized representation \(h_i\) for each token.
  • Represent the context \(c\) as a weighted sum of the contextualized token representations, using MLP-based attention (sketched below):
    • \(a_i=\mathrm{softmax}_i(v_a\cdot \mathrm{relu}(W_ah_i))\), \(c=\sum_i a_ih_i\)
    • \(W_a\) and \(v_a\) are the parameters of the MLP.
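
A minimal PyTorch sketch of this attention (dimensions are illustrative, not the paper's):

```python
import torch
import torch.nn as nn

class ContextAttention(nn.Module):
    """c = sum_i a_i * h_i, with a_i = softmax_i(v_a . relu(W_a h_i))."""
    def __init__(self, hidden_dim, attn_dim=100):
        super().__init__()
        self.W_a = nn.Linear(hidden_dim, attn_dim)
        self.v_a = nn.Linear(attn_dim, 1, bias=False)

    def forward(self, h):
        # h: (batch, seq_len, hidden_dim) BiLSTM outputs.
        scores = self.v_a(torch.relu(self.W_a(h)))  # (batch, seq_len, 1)
        a = torch.softmax(scores, dim=1)            # attention weights a_i
        return (a * h).sum(dim=1)                   # context vector c
```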

Mention Representation

  • Represent the mention \(m\) as the concatenation of two items:
    • A character-based representation produced by a CNN.
    • A weighted sum of the pre-trained word embeddings in the mention span computed by attention.

Final Representation

  • Concatenation of context and mention representation.
  • The final representation: \(r=[c;m]\) (see the sketch below).
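
A sketch of the mention side and the final concatenation; the sizes and the max-pooling over the character CNN are assumptions, not the paper's exact choices:

```python
import torch
import torch.nn as nn

class MentionRep(nn.Module):
    """m = [char-CNN over span characters; attention-weighted word embeddings]."""
    def __init__(self, n_chars, char_dim=50, word_dim=300):
        super().__init__()
        self.char_emb = nn.Embedding(n_chars, char_dim)
        self.cnn = nn.Conv1d(char_dim, char_dim, kernel_size=3, padding=1)
        self.word_attn = nn.Linear(word_dim, 1)

    def forward(self, chars, words):
        # chars: (batch, span_chars); words: (batch, span_len, word_dim)
        c = self.char_emb(chars).transpose(1, 2)         # (batch, char_dim, len)
        char_rep = torch.relu(self.cnn(c)).max(dim=2).values
        a = torch.softmax(self.word_attn(words), dim=1)  # word attention
        word_rep = (a * words).sum(dim=1)
        return torch.cat([char_rep, word_rep], dim=1)    # mention vector m

# Final representation: r = [c; m]
# r = torch.cat([context_c, mention_m], dim=1)
```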

Label Prediction

  • A type label embedding matrix \(W_t \in R^{n\times d}\):
    • \(n\) is the number of labels in the prediction space.
    • \(d\) is the dimension of \(r\).
  • This matrix is the concatenation of \(W_{general}\), \(W_{fine}\), and \(W_{ultra}\).
  • Each type's probability is the sigmoid of its inner product with \(r\): \(y=\sigma(W_tr)\).
    • Predict every type \(t\) for which \(y_t>0.5\).
    • Fall back to \(\arg\max_t y_t\) if no type crosses the threshold (see the sketch below).
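
A minimal sketch of this decision rule:

```python
import torch

def predict_types(W_t, r):
    """Predict all types with probability > 0.5; fall back to the argmax."""
    y = torch.sigmoid(W_t @ r)  # W_t: (n_labels, d); r: (d,)
    predicted = (y > 0.5).nonzero().flatten().tolist()
    return predicted if predicted else [int(y.argmax())]
```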

Multiple Sources

  • Each source of supervision provides only a partial set of type labels:
    • KBs provide general types.
    • Head words provide only ultra-fine types.

Multitask Objective

  • Divide the labels into three bins (general, fine, and ultra-fine).
  • Only update labels in a bin when the example has gold types in that bin (see the sketch below).
  • The training objective is to minimize \(J_{all}\), where \(t\) is the target vector at each granularity:
    • \(J_{all}=J_{general}\cdot 1_{general}(t)+J_{fine}\cdot 1_{fine}(t)+J_{ultra}\cdot 1_{ultra}(t)\)
    • \(1_{category}(t)\) is an indicator function
      • Checks whether \(t\) contains a type in the category.
    • \(J_{category}\) is the category-specific logistic regression objective:
      • \(J=-\sum_{i}\big[t_i\cdot log(y_i)+(1-t_i)\cdot log(1-y_i)\big]\)
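
A sketch of the masked multitask loss; the bin boundaries below assume the label matrix stacks the 9 general, 121 fine, and 10,201 ultra-fine types in order:

```python
import torch
import torch.nn.functional as F

BINS = {"general": (0, 9), "fine": (9, 130), "ultra": (130, 10331)}

def multitask_loss(y, t):
    """Sum BCE per bin, skipping bins with no gold types (the indicator)."""
    # y: (n_labels,) sigmoid outputs; t: (n_labels,) float 0/1 targets.
    loss = y.new_zeros(())
    for lo, hi in BINS.values():
        if t[lo:hi].sum() > 0:  # 1_category(t)
            loss = loss + F.binary_cross_entropy(y[lo:hi], t[lo:hi],
                                                 reduction="sum")
    return loss
```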

Evaluation

Experiment

  • AttentiveNER was reimplemented for reference.
  • Metrics (see the sketch below):
    • Macro-averaged precision, recall, and F1.
    • Mean reciprocal rank (MRR).
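
A sketch of these metrics over predicted/gold type sets (one common formulation; the paper's exact averaging may differ):

```python
def macro_prf1(preds, golds):
    """Macro-averaged precision, recall, F1 over examples (sets of types)."""
    p = sum(len(pr & go) / len(pr) for pr, go in zip(preds, golds) if pr) / len(preds)
    r = sum(len(pr & go) / len(go) for pr, go in zip(preds, golds) if go) / len(preds)
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

def mrr(rankings, golds):
    """Mean reciprocal rank of each gold type in the model's ranking."""
    scores = []
    for ranking, gold in zip(rankings, golds):
        rank = {t: i + 1 for i, t in enumerate(ranking)}
        scores.extend(1 / rank[t] for t in gold if t in rank)
    return sum(scores) / len(scores)
```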

Results

Breakdown Results

Analysis

  • 50 examples were analyzed from the dev set.

Improving Existing Fine-Grained NER with Better Distant Supervision

Experiment

  • The widely-used OntoNotes dataset was chosen.
  • Augmenting the training data.
  • Compare performance to other published results and the reimplementation of AttentiveNER.
  • Measure
    • Macro- and micro-averaged F1 score and accuracy.

Augmenting the Training Data

  • Manually map labels between the two type vocabularies.
    • 77% directly correspond to suitable labels.
  • Expand labels according to their hypernyms.

Results


Ablation Results

Conclusion

Conclusion

  • These new forms of distant supervision boost performance on both the new and the existing datasets.
  • These results establish a first baseline and suggest that the data will support significant future work.

Ultra-Fine Entity Typing

By Penut Chen (PenutChen)