TREMA-UNH at TREC 2018

Complex Answer Retrieval and News Track

 

Sumanta Kashyapi

sk1105@wildcats.unh.edu

Shubham Chatterjee

sc1242@wildcats.unh.edu

Jordan Ramsdell

jsc57@wildcats.unh.edu

Laura Dietz

dietz@cs.unh.edu

 

TREMA lab, University of New Hampshire, U.S.A

Abstract

  • This paper focuses on:
    • Passage retrieval
    • Entity-aware passage retrieval
    • Entity retrieval

Introduction

  • Based on a combination of low-level input runs such as BM25, query likelihood, SDM, RM3, and the Entity Context Model (ECM).
  • Trained with coordinate ascent Learning-to-Rank.
  • Also experimented with a new Learning-to-Walk method for supervised graph walks.
  • The best run is a combination of BM25 and query likelihood, without RM3.

CAR @ TREC

  • Includes the following two tasks:
    • CAR Passage Task
    • CAR Entity Task
  • Also participated in the NEWS track entity ranking task.

CAR Passage Task

  • Given an article stub \(Q\)
  • Retrieve for each of its sections \(H_i\)
  • A ranking of relevant passages \(P\)
    • Each passage \(P\) is taken from a provided paragraph corpus.

CAR Entity Task

  • Given an article stub \(Q\)
  • Retrieve for each of its sections \(H_i\)
  • A ranking of relevant entities \(E\)
    • Each entity \(E\) is taken from a provided Wikipedia corpus.
    • For each entity \(E\), a support passage from the passage corpus is to be identified that explains why the entity is relevant for the heading \(H_i\) on the stub.

NEWS Entity Task

  • Given a news article with title and content.
  • Annotated with entity links to a set of entities:
    • \(\mathcal{E}=\{E_1, E_2, ... E_n\}\)
  • The task is to rank the given entities by importance for the article.

Overarching Approach

  • The knowledge graph (KG) provides access to knowledge about entities.
    • Derived from Wikipedia.
    • Built from the Wikipedia dump provided by TREC CAR.
    • Does not include pages from the test queries.

Low-level Retrieval Features

  • Each of the approaches is based on:
    • A variety of unsupervised retrieval models.
    • Document indexes.

Indexes

  • The following indexes are created for use by the retrieval methods:
    • A paragraph index built from the text of the passages.
    • A page index built from all visible text on Wikipedia pages.
    • An entity index built from the first paragraph, anchor text, and category information of Wikipedia pages.

Query Models

  • Given a stub with:
    • Page title \(T\)
    • A tree-shaped outline of headings
      • \(H1, H1.1, H1.2, H1.2.1, H2...\)
  • Section Path Queries
    • Concatenate the page title and all parent headings.
      • e.g. \(H1.2 \to T,H1,H1.2\)
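
The sketch below (Python; the nested-tuple outline representation is assumed for illustration and is not the TREC CAR data format) shows the section path query model described above: every section's query concatenates the page title with all headings on the path to that section.

```python
# Minimal sketch of section path queries: each query concatenates the page
# title with all headings on the path from the root to the target section.
def section_path_queries(title, outline):
    """Yield (section_path, keyword_query) pairs for every section.

    `outline` is a list of (heading, children) tuples (illustrative format).
    """
    def walk(path_so_far, sections):
        for heading, children in sections:
            path = path_so_far + [heading]
            yield path, " ".join([title] + path)
            yield from walk(path, children)
    return list(walk([], outline))

# Example: the section H1.2 yields the query "T H1 H1.2".
outline = [("H1", [("H1.1", []), ("H1.2", [("H1.2.1", [])])]), ("H2", [])]
for path, query in section_path_queries("T", outline):
    print(" / ".join(path), "->", query)
```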

Retrieval Models

  • Given a query model that transforms the stub into a keyword query, and an index:
    • BM25
    • Query likelihood with Dirichlet smoothing
  • Both of the above are implemented in Lucene.
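
For reference, the standard formulations of these two models (common textbook/Lucene-style definitions; the notation is ours, not from the paper) are:

\[
\mathrm{BM25}(q,d)=\sum_{t\in q}\mathrm{idf}(t)\,\frac{tf_{t,d}\,(k_1+1)}{tf_{t,d}+k_1\bigl(1-b+b\,\frac{|d|}{\mathrm{avgdl}}\bigr)}
\qquad\qquad
\log P(q\mid d)=\sum_{t\in q}\log\frac{tf_{t,d}+\mu\,P(t\mid C)}{|d|+\mu}
\]

where \(\mu\) is the Dirichlet smoothing parameter and \(P(t\mid C)\) is the collection language model.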

Expansion Models

  • None
    • No expansion; just uses the initial ranking.
  • RM3
    • Query expansion using terms from the top 20 paragraphs.
  • ECM
    • Represents a document as a bag of entity-link targets.
    • Uses the top 100 paragraphs/pages.
  • ECM-psg
    • Expands the query with the top 100 expansion entities (sketched below).
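
A rough sketch of the ECM / ECM-psg idea as we read it (not the authors' code): each feedback passage is treated as a bag of entity-link targets, the most frequently linked targets form an entity ranking (ECM), and their names can be appended to the keyword query (ECM-psg).

```python
from collections import Counter

def ecm_psg_expansion(query, feedback_passages, num_entities=100):
    """Expand `query` with the names of the most frequently linked entities.

    `feedback_passages` is assumed to be a list of lists of entity ids linked
    in the top retrieved paragraphs/pages (an illustrative format only).
    """
    counts = Counter(e for links in feedback_passages for e in links)
    expansion = [e.replace("_", " ") for e, _ in counts.most_common(num_entities)]
    return query + " " + " ".join(expansion)
```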

Low-level Paragraph Retrieval Features

  • sectionPath-bm25-none
    • BM25 retrieval model with no expansion.
  • sectionPath-ql-none
    • Query likelihood retrieval.
  • sectionPath-bm25-rm
    • BM25-based RM3.
  • sectionPath-ql-rm
    • Query-likelihood-based RM3.
  • All of these methods use section path queries.

Low-level Entity Retrieval Features

  • sectionPath-bm25-ecm
    • ECM entity ranking based on BM25 page retrieval.
  • sectionPath-ql-ecm
    • Based on Query likelihood.
  • all-bm25-ecm
    • Constructs a query from the title and all headings.
    • ECM ranking based on BM25 page retrieval.
  • all-ql-ecm
    • Based on Query likelihood.

Methods

UNH-p-l2r

  • The model is trained using the coordinate ascent algorithm from RankLib’s learning-to-rank implementation.
  • Optimized for mean average precision.
  • Trained on combinations of:
    • sectionPath-bm25-none
    • sectionPath-ql-none
    • sectionPath-bm25-rm
    • sectionPath-ql-rm
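
The sketch below is a minimal, illustrative version of coordinate-ascent learning to rank over such run scores; the actual runs use RankLib's implementation, so this only shows the idea of greedily tuning one feature weight at a time to maximize MAP.

```python
import numpy as np

def average_precision(scores, labels):
    """AP of the ranking induced by `scores` with binary relevance `labels`."""
    order = np.argsort(-scores)
    labels = np.asarray(labels)[order]
    if labels.sum() == 0:
        return 0.0
    precision_at_hits = np.cumsum(labels) / np.arange(1, len(labels) + 1)
    return float((precision_at_hits * labels).sum() / labels.sum())

def mean_ap(weights, data):
    """MAP over queries; `data` is a list of (feature_matrix, labels) pairs."""
    return float(np.mean([average_precision(X @ weights, y) for X, y in data]))

def coordinate_ascent(data, num_features, sweeps=25, steps=(0.5, 0.1, 0.02)):
    """Greedily adjust one weight at a time, keeping changes that improve MAP."""
    weights = np.ones(num_features) / num_features
    best = mean_ap(weights, data)
    for _ in range(sweeps):
        for i in range(num_features):
            for step in steps:
                for delta in (+step, -step):
                    candidate = weights.copy()
                    candidate[i] += delta
                    score = mean_ap(candidate, data)
                    if score > best:
                        weights, best = candidate, score
    return weights, best
```

Here each row of a query's feature matrix would hold the four run scores (sectionPath-bm25-none, sectionPath-ql-none, sectionPath-bm25-rm, sectionPath-ql-rm) for one candidate passage.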

UNH-e-l2r

  • Uses Lucene indexes for entities, pages, and paragraphs.
  • Features were created for each index.
  • Combines the query-model, retrieval-model, and expansion-model methods:
    • sectionPath-bm25-ecm
    • sectionPath-ql-ecm
    • all-bm25-ecm
    • all-ql-ecm

UNH-p-SDM

  • Sequential Dependence Model.
  • Using Lucene to index the TREC CAR 2017 paragraph corpus after removing stop words.
  • Indexing the unigrams, bigrams, and unordered windowed-bigrams as document fields.
    • Window size 8 words.
  • Tokenize and stem documents using Lucene’s English analyzer.
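
For reference, this is the standard Metzler-Croft sequential dependence model; the exact interpolation weights used for this run are not stated here:

\[
\mathrm{score}(Q,D)=\lambda_T\sum_{q_i\in Q}f_T(q_i,D)+\lambda_O\sum_{i}f_O(q_i\,q_{i+1},D)+\lambda_U\sum_{i}f_U(q_i\,q_{i+1},D)
\]

where \(f_T\), \(f_O\), and \(f_U\) are smoothed log-likelihoods computed on the unigram, bigram, and windowed-bigram fields, respectively, and the \(\lambda\) weights are commonly set around \(0.85/0.10/0.05\).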

Joint Entity-Passage Methods

  • Features can be used to jointly score entities and passages.
  • The following approaches are used:
    • Passage Retrieval using Entity Features
    • Entity Retrieval using Passage Features

Passage Retrieval using Entity Features

  • Let \(f_e\) be an entity relevance feature.
    • For each passage \(p\):
      • Let \(E\) be the set of entities contained in \(p\).
      • Then a feature that scores the relevance of \(p\) given the relevant entities can be represented as \(f_p(p)=\sum_{e\in E}f_e(e)\).

Entity Retrieval using Passage Features

  • Let \(f_p\) be a passage relevance feature.
    • For each entity \(e\):
      • Let \(P\) be the set of passages that contain at least one instance of \(e\).
      • Sum over the scores of passages that contain \(e\) to produce a score for \(e\): \(f_e(e)=\sum_{p\in P}f_p(p)\).
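
Both directions follow directly from the two formulas above; a compact sketch (ours, with placeholder feature functions) is:

```python
from collections import defaultdict

def passage_scores_from_entities(passage_entities, entity_feature):
    """f_p(p) = sum of f_e(e) over entities e contained in passage p.

    `passage_entities` maps a passage id to the set of entities linked in it.
    """
    return {p: sum(entity_feature(e) for e in entities)
            for p, entities in passage_entities.items()}

def entity_scores_from_passages(passage_entities, passage_feature):
    """f_e(e) = sum of f_p(p) over passages p that contain entity e."""
    scores = defaultdict(float)
    for p, entities in passage_entities.items():
        for e in entities:
            scores[e] += passage_feature(p)
    return dict(scores)
```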

Description of Passage Features

  • UNH-p-SDM
    • Directly uses the obtained passage scores.
  • SDM
    • Passages are scored using a standard SDM model under query likelihood.

Description of Entity Features

  • Entity Link Frequency
  • Fielded Queries
  • Global Entity Context

Entity Link Frequency

  • Sums the total number of times an entity is linked in the candidate passages, then normalizes (see the sketch below).
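
A minimal sketch of this feature (the passage representation is assumed for illustration):

```python
from collections import Counter

def entity_link_frequency(candidate_passages):
    """Normalized frequency of entity links over all candidate passages.

    `candidate_passages` is an iterable of lists of linked entity ids,
    one list per passage.
    """
    counts = Counter(e for links in candidate_passages for e in links)
    total = sum(counts.values()) or 1
    return {entity: count / total for entity, count in counts.items()}
```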

Fielded Queries

  • Consider page attributes as document fields:
    • Categories
    • Inlinks
    • Outlinks
    • Section headers
    • Page name disambiguations
    • Page name redirects
  • The relevance of entities is scored with BM25 over these fields.
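
A rough sketch of fielded entity scoring under our assumptions (a weighted sum of per-field BM25 scores; the actual field weighting and query construction are not described here):

```python
# `bm25_score` stands in for any per-field BM25 scorer (e.g. a Lucene field
# query); `entity_doc` maps field names to their text. Field names follow the
# list above; uniform weights are illustrative only.
FIELDS = ["categories", "inlinks", "outlinks", "section_headers",
          "disambiguations", "redirects"]

def fielded_entity_score(query, entity_doc, bm25_score, field_weights=None):
    """Score an entity as a weighted sum of BM25 scores over its fields."""
    weights = field_weights or {f: 1.0 for f in FIELDS}
    return sum(weights[f] * bm25_score(query, entity_doc.get(f, ""))
               for f in FIELDS)
```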

Global Entity Context

  • For each passage mentioning the entity, create a pseudo-document that contains the unigrams, bigrams, and windowed bigrams derived from that passage.
  • Pseudo-documents are scored with BM25.
  • The entity’s score is that of its highest-scoring context.
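
The sketch below illustrates our reading of this feature: each passage mentioning the entity yields a pseudo-document of its unigrams, bigrams, and windowed bigrams, and the entity keeps the best BM25 score among them (`bm25_score` is again a placeholder scorer).

```python
def ngrams(tokens, n):
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def windowed_bigrams(tokens, window=8):
    """Unordered token pairs that co-occur within `window` positions."""
    return [f"{tokens[i]} {tokens[j]}"
            for i in range(len(tokens))
            for j in range(i + 1, min(i + window, len(tokens)))]

def global_entity_context_score(query, mention_passages, bm25_score):
    """Max BM25 score over the entity's context pseudo-documents."""
    best = 0.0
    for passage in mention_passages:
        tokens = passage.lower().split()
        pseudo_doc = " ".join(tokens + ngrams(tokens, 2) + windowed_bigrams(tokens))
        best = max(best, bm25_score(query, pseudo_doc))
    return best
```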

UNH-e-SDM

  • Uses the top 100 candidate passages ranked by UNH-p-SDM.

UNH-p-Mixed

  • Using all of the passage and entity features.
  • The top 100 candidate passages from UNH-p-SDM.
  • All entities linked in these passages are used.
  • Learns a weighted combination of passage features.
    • Using learning to rank.
    • The UNH-p-SDM feature receives the highest weight.

UNH-e-Mixed

  • Using all of the entity features.
  • Candidate passages and entities are retrieved using UNH-p-Mixed.
  • Learns a linear combination of entity features.

UNH-e-graph

  • Entities are nodes.
  • Paragraphs form edges between nodes, as sketched below.
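
As a rough illustration of the graph construction (and only an unsupervised stand-in for the supervised Learning-to-Walk method described next), one can build the entity co-occurrence graph from paragraph links and run personalized PageRank seeded by an initial entity ranking:

```python
import itertools
import networkx as nx

def build_entity_graph(paragraph_entity_links):
    """Entities are nodes; each paragraph adds (weighted) edges between the
    entities it links. `paragraph_entity_links` is an iterable of entity-id
    lists, one per paragraph (illustrative format).
    """
    graph = nx.Graph()
    for entities in paragraph_entity_links:
        for e1, e2 in itertools.combinations(sorted(set(entities)), 2):
            if graph.has_edge(e1, e2):
                graph[e1][e2]["weight"] += 1
            else:
                graph.add_edge(e1, e2, weight=1)
    return graph

def walk_scores(graph, seed_scores, alpha=0.85):
    """Personalized PageRank seeded by an initial entity ranking.

    `seed_scores` maps graph nodes to nonnegative weights; this fixed-weight
    walk is not the learned walk used in the paper.
    """
    return nx.pagerank(graph, alpha=alpha, personalization=seed_scores,
                       weight="weight")
```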

Learning to Walk

  • Uses all Wikipedia pages in Y1 Train and Y1 Test.
  • These pages are entity linked with DBpedia Spotlight.

Learning to Walk

  • Predicted entity links are compared to the links contained in the Wikipedia page.
  • Spotlight links are only retained if a link on the Wikipedia page points to the same target.

Evaluation

Evaluation on TREC CAR

  • Y1 Train
    • benchmarkY1train
    • Automatic tree-qrels
    • 5-fold CV
  • Y1 Test
    • benchmarkY1test
    • Automatic tree-qrels
    • Trained on Y1 Train
  • Y2 Test
    • benchmarkY2test
    • Manual assessments
    • Trained on Y1 Train
    • Best methods selected on Y1 Test

Evaluation Measures

  • MAP
    • Mean Average Precision
  • RPrec
    • R-Precision
  • NDCG
    • Normalized Discounted Cumulative Gain

Performance

  • UNH-p-l2r and UNH-e-l2r work best on Y2 Test.
    • This is surprising because the methods are fairly simple.
    • Neither performed best on Y1 Train or Y1 Test.

Evaluation on TREC News

  • UNH-TitleBm25
    • Using the article title as a query.
    • Use BM25 to retrieve from the page index.
  • UNH-ParaBM25
    • Using the first paragraph of the news article as the query.
    • Use BM25 to retrieve from the page index.
  • UNH-ParaBm25Ecm
    • Using the first paragraph of the news article as the query to retrieve CAR paragraphs.
    • The ECM entity ranking is derived from these paragraphs.

Performance

  • Retrieving entities with the first paragraph works best.
    • Performs above the median and slightly below the best run.

Conclusion

  • In CAR Entity Retrieval
    • Using context relevance results in a significant improvement over using fielded query features alone.
  • On the NEWS entity ranking task
    • Using the first paragraph of the news article is as effective as using its title.
