SQuAD: The Stanford Question Answering Dataset

Update #4: CNN results and answer extraction methodology

August 11th, 2016

Overview

  • Week's outcomes
    • Results for CNN methodology for sentence ranking
    • Methodology proposal and preliminary results for answer extraction baseline
  • Next steps

Week's outcomes

CNN methodology for sentence ranking

Results

First Run:

  • 30% of SQuAD's training data
  • 80% training, 10% dev, 10% test
  • Exact match as labels for training

MAP = 0.2503

MRR = 0.2503
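
For reference, a minimal sketch (illustrative names, not project code) of how MAP and MRR can be computed from the ranked sentence lists. When each question has exactly one answer-bearing sentence, average precision and reciprocal rank reduce to the same quantity per question, which would explain why MAP and MRR coincide in these runs.

    from typing import List, Tuple

    def average_precision(relevance: List[int]) -> float:
        """AP for one question; `relevance` holds 0/1 labels of the
        candidate sentences, ordered by descending ranker score."""
        hits, total = 0, 0.0
        for rank, rel in enumerate(relevance, start=1):
            if rel:
                hits += 1
                total += hits / rank
        return total / hits if hits else 0.0

    def reciprocal_rank(relevance: List[int]) -> float:
        """RR for one question: inverse rank of the first relevant sentence."""
        for rank, rel in enumerate(relevance, start=1):
            if rel:
                return 1.0 / rank
        return 0.0

    def map_mrr(per_question: List[List[int]]) -> Tuple[float, float]:
        """Mean AP and mean RR over all questions in the test split."""
        aps = [average_precision(r) for r in per_question]
        rrs = [reciprocal_rank(r) for r in per_question]
        return sum(aps) / len(aps), sum(rrs) / len(rrs)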

CNN methodology for sentence ranking

Results

Second Run:

  • 30% of SQuAD's training data
  • 80% training, 10% dev, 10% test
  • Jaccard similarity as labels for training (see the label sketch below)

MAP = 0.2394

MRR = 0.2394

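A rough sketch of the two labeling schemes used in the runs above, under the assumption that each candidate sentence is labeled by its overlap with the gold answer string (the exact pairing used for the labels is an assumption here):

    def exact_match_label(sentence: str, answer: str) -> int:
        """First run: 1 if the gold answer string occurs verbatim in the sentence."""
        return int(answer.lower() in sentence.lower())

    def jaccard_label(sentence: str, answer: str) -> float:
        """Second run: graded label from token-set Jaccard overlap."""
        s, a = set(sentence.lower().split()), set(answer.lower().split())
        return len(s & a) / len(s | a) if s | a else 0.0
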
Answer extraction baseline

General Methodologies

  1. Treat sentence ranking and answer extraction as two separate tasks: assume the sentence ranker is right and get an estimate of the error of the exact-answer extraction on its own.
  2. Treat sentence ranking and answer extraction as a unified process: pass the score of the sentence ranker as a feature to the answer extractor (see the sketch after this list).
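
As an illustration of option 2, a hypothetical sketch in which the sentence ranker's score is appended to every candidate word's feature vector; cnn_ranker.score and word_features are placeholder names, not interfaces that exist in the project.

    def build_unified_examples(question, sentences, cnn_ranker, word_features):
        """Option 2 sketch: one feature vector per candidate word, with the
        sentence ranker's score carried along as an extra feature."""
        examples = []
        for sentence in sentences:
            ranker_score = cnn_ranker.score(question, sentence)  # sentence-level score
            for position, word in enumerate(sentence.split()):
                feats = word_features(word, position, sentence, question)
                feats.append(ranker_score)  # couples the two tasks
                examples.append(feats)
        return examples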

Answer extraction baseline

Methodology 1.1 (almost question "agnostic")

Idea: train a classifier on features that capture the lexical, syntactic, and semantic structure of the sentence, question, and answer.

For each word in answers:

  • Indicator as part of the answer (True/False)
  • Lemma (for each of its left and right neighbors)
  • POS (for each of its left and right neighbors)
  • NER (for each of its left and right neighbors)
  • Type of question (wh-word)
  • Animacy, number, gender and emotion

Answer extraction baseline

Methodology 1.1 (almost question "agnostic")

Idea: train a classifier on features that capture the lexical, syntactic, and semantic structure of the sentence, question, and answer.

Example:  "it"

(False, u'It', u'PRP', u'O', 'whom', '', '', '', u'is', u'VBZ', u'O', u'INANIMATE', u'SINGULAR', u'NEUTRAL', u'PRONOMINAL')
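
An illustrative reconstruction of how such a tuple could be assembled; the Token attributes are assumed wrappers around CoreNLP-style annotations, and the field order follows the "it" example above (answer indicator, the word's lemma/POS/NER, question type, left neighbor, right neighbor, coreference attributes). The last field is listed as "emotion" on the slide but shows a mention type (PRONOMINAL) in the example, so it is called mention_type here.

    EMPTY = ('', '', '')  # lemma, POS, NER placeholders when a neighbor is missing

    def feature_tuple_1_1(tokens, i, in_answer, question_type):
        """One training instance per word, mirroring the example tuple for 'it'."""
        tok = tokens[i]
        left = (tokens[i - 1].lemma, tokens[i - 1].pos, tokens[i - 1].ner) if i > 0 else EMPTY
        if i + 1 < len(tokens):
            right = (tokens[i + 1].lemma, tokens[i + 1].pos, tokens[i + 1].ner)
        else:
            right = EMPTY
        return (in_answer,                      # indicator: word is part of the answer
                tok.lemma, tok.pos, tok.ner,    # the word itself
                question_type,                  # wh-word of the question, e.g. 'whom'
                *left, *right,                  # left and right neighbors ('' if absent)
                tok.animacy, tok.number,        # coreference-level attributes, as in
                tok.gender, tok.mention_type)   # INANIMATE/SINGULAR/NEUTRAL/PRONOMINAL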

Answer extraction baseline

Methodology 1.1 (almost question "agnostic")

Random forest classifier

Parameters:

  • 10 trees
  • Bootstrap sampling
  • Gini criterion

Results:

  • F1 = 0.1392

  • Confusion matrix (rows = true label, columns = predicted label):

                 Pred 0    Pred 1
      True 0    502,419     5,853
      True 1     60,095     5,335
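
A minimal, self-contained scikit-learn sketch of this baseline; the random placeholder arrays stand in for the encoded per-word feature vectors (one row per candidate word, label = answer-membership indicator), so the printed scores will not match the numbers above.

    import numpy as np
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import confusion_matrix, f1_score

    # Placeholder data: one row per candidate word, one column per encoded feature.
    rng = np.random.default_rng(0)
    X = rng.random((5000, 15))
    y = rng.integers(0, 2, 5000)
    X_train, X_test = X[:4000], X[4000:]
    y_train, y_test = y[:4000], y[4000:]

    clf = RandomForestClassifier(n_estimators=10,   # 10 trees (100 in the second run)
                                 criterion='gini',  # Gini impurity criterion
                                 bootstrap=True)    # bootstrap sampling
    clf.fit(X_train, y_train)

    y_pred = clf.predict(X_test)
    print(f1_score(y_test, y_pred))
    print(confusion_matrix(y_test, y_pred))  # rows = true label, columns = predicted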

Answer extraction baseline

Methodology 1.1 (almost question "agnostic")

Random forest classifier

Parameters:

  • 100 trees
  • Bootstrap sampling
  • Gini criterion

Results:

  • F1 = 0.1393

  • Confusion matrix (rows = true label, columns = predicted label):

                 Pred 0    Pred 1
      True 0    502,671     5,644
      True 1     60,352     5,345

Answer extraction baseline

Methodology 1.2 (question-sensitive)

Idea: train a classifier on features that capture the lexical, syntactic, and semantic structure of the sentence, question, and answer.

For each word in answers:

  • Indicator as part of the answer (for each of its left, right, and dependency-head neighbors)
  • Indicator as part of the question (for each of its left, right, and dependency-head neighbors)
  • Lemma (for each of its left, right, and dependency-head neighbors)
  • POS (for each of its left, right, and dependency-head neighbors)
  • NER (for each of its left, right, and dependency-head neighbors)
  • Type of question
  • Animacy, number, gender and emotion
  • Type of dependency

Answer extraction baseline

Methodology 1.2 (question-sensitive)

Idea: train a classifier on features that capture the lexical, syntactic, and semantic structure of the sentence, question, and answer.

Example:  "it"

(False, u'It', u'PRP', u'O', False, 'whom', '', '', '', '', '', u'is', u'VBZ', u'O', False, False, u'replica', u'NN', u'O', u'nsubj', False, False, u'INANIMATE', u'SINGULAR', u'NEUTRAL', u'PRONOMINAL')
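
A hypothetical sketch of how the repeated field blocks in this longer tuple might be built; the grouping is a guess matched against the "it" example above, where each neighbor and the dependency head contribute lemma/POS/NER plus answer- and question-membership flags, and the head additionally contributes the dependency relation (u'nsubj').

    def block_1_2(tok, in_answer, in_question, dep_rel=None):
        """Assumed per-token field block for methodology 1.2."""
        fields = [tok.lemma, tok.pos, tok.ner]   # as in methodology 1.1
        if dep_rel is not None:
            fields.append(dep_rel)               # e.g. u'nsubj' for "It" -> "replica"
        fields += [in_answer, in_question]       # new membership indicators
        return fields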

Comparison with Stanford

Next steps

  • Adapt the CNN code to run efficiently on the server
  • Integrate the CNN with BM25
  • Add topicality to the CNN feature array
  • Mix different embeddings
  • Extract and add more question knowledge to the classifier (e.g. TED, question class)
  • Integrate the two tasks into a single pipeline
  • Explore other methodologies (e.g. attention-based models)

By Sophie Germain

LTI-CMU