TREMA-UNH at TREC 2018

Complex Answer Retrieval and News Track

 

Sumanta Kashyapi

sk1105@wildcats.unh.edu

Shubham Chatterjee

sc1242@wildcats.unh.edu

Jordan Ramsdell

jsc57@wildcats.unh.edu

Laura Dietz

dietz@cs.unh.edu

 

TREMA lab, University of New Hampshire, U.S.A

Abstract

  • This paper focuses on:
    • Passage retrieval
    • Entity-aware passage retrieval
    • Entity retrieval

Introduction

  • Based on a combination of low-level input runs such as BM25, query likelihood, SDM, RM3, and the Entity Context Model (ECM).
  • Trained with coordinate ascent Learning-to-Rank.
  • Also experimented with a new Learning-to-Walk method for supervised graph walks.
  • The best run is a combination of BM25 and query likelihood, without RM3.

CAR @ TREC

  • Includes the following two tasks:
    • CAR Passage Task
    • CAR Entity Task
  • Also participated in the NEWS track entity ranking task.

CAR Passage Task

  • Given an article stub \(Q\)
  • Retrieve for each of its sections \(H_i\)
  • A ranking of relevant passages \(P\)
    • Each passage \(P\) is taken from a provided paragraph corpus.

CAR Entity Task

  • Given an article stub \(Q\)
  • Retrieve for each of its sections \(H_i\)
  • A ranking of relevant entities \(E\)
    • Each entity \(E\) is taken from a provided Wikipedia corpus.
    • For each entity \(E\), a support passage from the passage corpus is to be identified that explains why the entity is relevant for the heading \(H_i\) on the stub.

NEWS Entity Task

  • Given a news article with title and content.
  • Annotated with entity links to a set of entities:
    • \(\mathcal{E}=\{E_1, E_2, ... E_n\}\)
  • The task is to rank the given entities by importance for the article.

Overarching Approach

  • The knowledge graph (KG) provides access to knowledge about entities.
    • Derived from Wikipedia.
    • Built from the Wikipedia dump provided by TREC CAR.
    • Does not include pages from the test queries.

Low-level Retrieval Features

  • Each of the approaches is based on:
    • A variety of unsupervised retrieval models.
    • Document indexes.

Indexes

  • The following indexes are created for use by the retrieval methods:
    • A paragraph index built from the text of the passages.
    • A page index built from all visible text on Wikipedia pages.
    • An entity index built from the first paragraph, anchor text, and category information of Wikipedia pages.

Query Models

  • Given a stub with:
    • Page title \(T\)
    • A tree-shaped outline of headings
      • \(H1, H1.1, H1.2, H1.2.1, H2...\)
  • Section Path Queries
    • Concatenate the page title and all parent headings.
      • e.g. \(H1.2 \to T,H1,H1.2\)
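
The sketch below (Python; the nested-tuple outline representation is assumed for illustration and is not the TREC CAR data format) shows the section path query model described above: every section's query concatenates the page title with all headings on the path to that section.

```python
# Minimal sketch of section path queries: each query concatenates the page
# title with all headings on the path from the root to the target section.
def section_path_queries(title, outline):
    """Yield (section_path, keyword_query) pairs for every section.

    `outline` is a list of (heading, children) tuples (illustrative format).
    """
    def walk(path_so_far, sections):
        for heading, children in sections:
            path = path_so_far + [heading]
            yield path, " ".join([title] + path)
            yield from walk(path, children)
    return list(walk([], outline))

# Example: the section H1.2 yields the query "T H1 H1.2".
outline = [("H1", [("H1.1", []), ("H1.2", [("H1.2.1", [])])]), ("H2", [])]
for path, query in section_path_queries("T", outline):
    print(" / ".join(path), "->", query)
```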

Retrieval Models

  • Given a query model that transforms the stub into a keyword query, and an index:
    • BM25
    • Query likelihood with Dirichlet smoothing
  • Both of the above are implemented in Lucene.
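
For reference, the standard formulations of these two models (common textbook/Lucene-style definitions; the notation is ours, not from the paper) are:

\[
\mathrm{BM25}(q,d)=\sum_{t\in q}\mathrm{idf}(t)\,\frac{tf_{t,d}\,(k_1+1)}{tf_{t,d}+k_1\bigl(1-b+b\,\frac{|d|}{\mathrm{avgdl}}\bigr)}
\qquad\qquad
\log P(q\mid d)=\sum_{t\in q}\log\frac{tf_{t,d}+\mu\,P(t\mid C)}{|d|+\mu}
\]

where \(\mu\) is the Dirichlet smoothing parameter and \(P(t\mid C)\) is the collection language model.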

Expansion Models

  • None
    • No expansion; just uses the initial ranking.
  • RM3
    • Query expansion using terms from the top 20 paragraphs.
  • ECM
    • Represents a document as a bag of entity-link targets.
    • Uses the top 100 paragraphs/pages.
  • ECM-psg
    • Expands the query with the top 100 expansion entities (sketched below).
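
A rough sketch of the ECM / ECM-psg idea as we read it (not the authors' code): each feedback passage is treated as a bag of entity-link targets, the most frequently linked targets form an entity ranking (ECM), and their names can be appended to the keyword query (ECM-psg).

```python
from collections import Counter

def ecm_psg_expansion(query, feedback_passages, num_entities=100):
    """Expand `query` with the names of the most frequently linked entities.

    `feedback_passages` is assumed to be a list of lists of entity ids linked
    in the top retrieved paragraphs/pages (an illustrative format only).
    """
    counts = Counter(e for links in feedback_passages for e in links)
    expansion = [e.replace("_", " ") for e, _ in counts.most_common(num_entities)]
    return query + " " + " ".join(expansion)
```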

Low-level Paragraph Retrieval Features

  • sectionPath-bm25-none
    • BM25 retrieval model with no expansion.
  • sectionPath-ql-none
    • Query likelihood retrieval.
  • sectionPath-bm25-rm
    • BM25-based RM3.
  • sectionPath-ql-rm
    • Query-likelihood-based RM3.
  • All of these methods use section path queries.

Low-level Entity Retrieval Features

  • sectionPath-bm25-ecm
    • ECM entity ranking based on BM25 page retrieval.
  • sectionPath-ql-ecm
    • Based on Query likelihood.
  • all-bm25-ecm
    • Constructs a query from the title and all headings.
    • ECM ranking based on BM25 page retrieval.
  • all-ql-ecm
    • Based on Query likelihood.

Methods

UNH-p-l2r

  • The model is trained using the coordinate ascent algorithm from RankLib’s learning-to-rank implementation.
  • Optimized for mean average precision.
  • Trained on combinations of:
    • sectionPath-bm25-none
    • sectionPath-ql-none
    • sectionPath-bm25-rm
    • sectionPath-ql-rm
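
The sketch below is a minimal, illustrative version of coordinate-ascent learning to rank over such run scores; the actual runs use RankLib's implementation, so this only shows the idea of greedily tuning one feature weight at a time to maximize MAP.

```python
import numpy as np

def average_precision(scores, labels):
    """AP of the ranking induced by `scores` with binary relevance `labels`."""
    order = np.argsort(-scores)
    labels = np.asarray(labels)[order]
    if labels.sum() == 0:
        return 0.0
    precision_at_hits = np.cumsum(labels) / np.arange(1, len(labels) + 1)
    return float((precision_at_hits * labels).sum() / labels.sum())

def mean_ap(weights, data):
    """MAP over queries; `data` is a list of (feature_matrix, labels) pairs."""
    return float(np.mean([average_precision(X @ weights, y) for X, y in data]))

def coordinate_ascent(data, num_features, sweeps=25, steps=(0.5, 0.1, 0.02)):
    """Greedily adjust one weight at a time, keeping changes that improve MAP."""
    weights = np.ones(num_features) / num_features
    best = mean_ap(weights, data)
    for _ in range(sweeps):
        for i in range(num_features):
            for step in steps:
                for delta in (+step, -step):
                    candidate = weights.copy()
                    candidate[i] += delta
                    score = mean_ap(candidate, data)
                    if score > best:
                        weights, best = candidate, score
    return weights, best
```

Here each row of a query's feature matrix would hold the four run scores (sectionPath-bm25-none, sectionPath-ql-none, sectionPath-bm25-rm, sectionPath-ql-rm) for one candidate passage.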

UNH-e-l2r

  • Uses Lucene indexes for entities, pages, and paragraphs.
  • Features were created for each index.
  • Combines the query-model, retrieval-model, and expansion-model methods:
    • sectionPath-bm25-ecm
    • sectionPath-ql-ecm
    • all-bm25-ecm
    • all-ql-ecm

UNH-p-SDM

  • Sequential Dependence Model.
  • Using Lucene to index the TREC CAR 2017 paragraph corpus after removing stop words.
  • Indexing the unigrams, bigrams, and unordered windowed-bigrams as document fields.
    • Window size 8 words.
  • Tokenize and stem documents using Lucene’s English analyzer.
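
For reference, this is the standard Metzler-Croft sequential dependence model; the exact interpolation weights used for this run are not stated here:

\[
\mathrm{score}(Q,D)=\lambda_T\sum_{q_i\in Q}f_T(q_i,D)+\lambda_O\sum_{i}f_O(q_i\,q_{i+1},D)+\lambda_U\sum_{i}f_U(q_i\,q_{i+1},D)
\]

where \(f_T\), \(f_O\), and \(f_U\) are smoothed log-likelihoods computed on the unigram, bigram, and windowed-bigram fields, respectively, and the \(\lambda\) weights are commonly set around \(0.85/0.10/0.05\).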

Joint Entity-Passage Methods

  • Features can be used to jointly score entities and passages.
  • The following approaches are used:
    • Passage Retrieval using Entity Features
    • Entity Retrieval using Passage Features

Passage Retrieval using Entity Features

  • Let \(f_e\) be an entity relevance feature.
    • For each passage \(p\):
      • Let \(E\) be the set of entities contained in \(p\).
      • Then a feature that scores the relevance of \(p\) given the relevant entities can be represented as \(f_p(p)=\sum_{e\in E}f_e(e)\).

Entity Retrieval using Passage Features

  • Let \(f_p\) be a passage relevance feature.
    • For each entity \(e\):
      • Let \(P\) be the set of passages that contain at least one instance of \(e\).
      • Sum over the scores of passages that contain \(e\) to produce a score for \(e\): \(f_e(e)=\sum_{p\in P}f_p(p)\).
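
Both directions follow directly from the two formulas above; a compact sketch (ours, with placeholder feature functions) is:

```python
from collections import defaultdict

def passage_scores_from_entities(passage_entities, entity_feature):
    """f_p(p) = sum of f_e(e) over entities e contained in passage p.

    `passage_entities` maps a passage id to the set of entities linked in it.
    """
    return {p: sum(entity_feature(e) for e in entities)
            for p, entities in passage_entities.items()}

def entity_scores_from_passages(passage_entities, passage_feature):
    """f_e(e) = sum of f_p(p) over passages p that contain entity e."""
    scores = defaultdict(float)
    for p, entities in passage_entities.items():
        for e in entities:
            scores[e] += passage_feature(p)
    return dict(scores)
```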

Description of Passage Features

  • UNH-p-SDM
    • Directly uses the obtained passage scores.
  • SDM
    • Passages are scored using a standard SDM model under query likelihood.

Description of Entity Features

  • Entity Link Frequency
  • Fielded Queries
  • Global Entity Context

Entity Link Frequency

  • Sums the total number of times an entity is linked in the candidate passages, then normalizes (see the sketch below).
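
A minimal sketch of this feature (the passage representation is assumed for illustration):

```python
from collections import Counter

def entity_link_frequency(candidate_passages):
    """Normalized frequency of entity links over all candidate passages.

    `candidate_passages` is an iterable of lists of linked entity ids,
    one list per passage.
    """
    counts = Counter(e for links in candidate_passages for e in links)
    total = sum(counts.values()) or 1
    return {entity: count / total for entity, count in counts.items()}
```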

Fielded Queries

  • Consider page attributes as document fields:
    • Categories
    • Inlinks
    • Outlinks
    • Section headers
    • Page name disambiguations
    • Page name redirects
  • The relevance of entities is scored with BM25 over these fields.
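
A rough sketch of fielded entity scoring under our assumptions (a weighted sum of per-field BM25 scores; the actual field weighting and query construction are not described here):

```python
# `bm25_score` stands in for any per-field BM25 scorer (e.g. a Lucene field
# query); `entity_doc` maps field names to their text. Field names follow the
# list above; uniform weights are illustrative only.
FIELDS = ["categories", "inlinks", "outlinks", "section_headers",
          "disambiguations", "redirects"]

def fielded_entity_score(query, entity_doc, bm25_score, field_weights=None):
    """Score an entity as a weighted sum of BM25 scores over its fields."""
    weights = field_weights or {f: 1.0 for f in FIELDS}
    return sum(weights[f] * bm25_score(query, entity_doc.get(f, ""))
               for f in FIELDS)
```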

Global Entity Context

  • For each passage mentioning the entity, create a pseudo-document that contains the unigrams, bigrams, and windowed bigrams derived from that passage.
  • Pseudo-documents are scored with BM25.
  • The entity’s score is that of its highest-scoring context.
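
The sketch below illustrates our reading of this feature: each passage mentioning the entity yields a pseudo-document of its unigrams, bigrams, and windowed bigrams, and the entity keeps the best BM25 score among them (`bm25_score` is again a placeholder scorer).

```python
def ngrams(tokens, n):
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def windowed_bigrams(tokens, window=8):
    """Unordered token pairs that co-occur within `window` positions."""
    return [f"{tokens[i]} {tokens[j]}"
            for i in range(len(tokens))
            for j in range(i + 1, min(i + window, len(tokens)))]

def global_entity_context_score(query, mention_passages, bm25_score):
    """Max BM25 score over the entity's context pseudo-documents."""
    best = 0.0
    for passage in mention_passages:
        tokens = passage.lower().split()
        pseudo_doc = " ".join(tokens + ngrams(tokens, 2) + windowed_bigrams(tokens))
        best = max(best, bm25_score(query, pseudo_doc))
    return best
```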

UNH-e-SDM

  • Uses the top 100 candidate passages ranked by UNH-p-SDM.

UNH-p-Mixed

  • Using all of the passage and entity features.
  • The top 100 candidate passages from UNH-p-SDM.
  • All entities linked in these passages are used.
  • Learns a weighted combination of passage features.
    • Using learning to rank.
    • The UNH-p-SDM feature receives the highest weight.

UNH-e-Mixed

  • Using all of the entity features.
  • Candidate passages and entities are retrieved using UNH-p-Mixed.
  • Learns a linear combination of entity features.

UNH-e-graph

  • Entities are nodes.
  • Paragraphs form edges between nodes, as sketched below.
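
As a rough illustration of the graph construction (and only an unsupervised stand-in for the supervised Learning-to-Walk method described next), one can build the entity co-occurrence graph from paragraph links and run personalized PageRank seeded by an initial entity ranking:

```python
import itertools
import networkx as nx

def build_entity_graph(paragraph_entity_links):
    """Entities are nodes; each paragraph adds (weighted) edges between the
    entities it links. `paragraph_entity_links` is an iterable of entity-id
    lists, one per paragraph (illustrative format).
    """
    graph = nx.Graph()
    for entities in paragraph_entity_links:
        for e1, e2 in itertools.combinations(sorted(set(entities)), 2):
            if graph.has_edge(e1, e2):
                graph[e1][e2]["weight"] += 1
            else:
                graph.add_edge(e1, e2, weight=1)
    return graph

def walk_scores(graph, seed_scores, alpha=0.85):
    """Personalized PageRank seeded by an initial entity ranking.

    `seed_scores` maps graph nodes to nonnegative weights; this fixed-weight
    walk is not the learned walk used in the paper.
    """
    return nx.pagerank(graph, alpha=alpha, personalization=seed_scores,
                       weight="weight")
```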

Learning to Walk

  • Uses all Wikipedia pages in Y1 Train and Y1 Test.
  • These pages are entity linked with DBpedia Spotlight.

Learning to Walk

  • Predicted entity links are compared to the links contained in the Wikipedia page.
  • Spotlight links are only retained if a link on the Wikipedia page points to the same target.

Evaluation

Evaluation on TREC CAR

  • Y1 Train
    • benchmarkY1train
    • Automatic tree-qrels
    • 5-fold CV
  • Y1 Test
    • benchmarkY1test
    • Automatic tree-qrels
    • Trained on Y1 Train
  • Y2 Test
    • benchmarkY2test
    • Manual assessments
    • Trained on Y1 Train
    • Best methods selected on Y1 Test

Evaluation Measures

  • MAP
    • Mean Average Precision
  • RPrec
    • R-Precision
  • NDCG
    • Normalized Discounted Cumulative Gain

Performance

  • UNH-p-l2r and UNH-e-l2r work best on Y2 Test.
    • This is surprising because the methods are fairly simple.
    • Neither performed best on Y1 Train or Y1 Test.

Evaluation on TREC News

  • UNH-TitleBm25
    • Using the article title as a query.
    • Use BM25 to retrieve from the page index.
  • UNH-ParaBM25
    • Using the first paragraph of the news article as the query.
    • Use BM25 to retrieve from the page index.
  • UNH-ParaBm25Ecm
    • Using the first paragraph of the news article as the query to retrieve CAR paragraphs.
    • The ECM entity ranking is derived from these paragraphs.

Performance

  • Retrieving entities with the first paragraph works best.
    • Performs above the median and slightly below the best run.

Conclusion

  • In CAR Entity Retrieval
    • Using context relevance results in a significant improvement over using fielded query features alone.
  • On the NEWS entity ranking task
    • Using the first paragraph of the news article is as effective as using its title.
