Context-Aware Personal Information Retrieval From Multiple Social Networks

Presentation

Sophie Le Page and Theodore Morin

Authors

Xiaogang Han, Wei Wei, Chunyan Miao, Jian-Ping Mei, and Hengjie Song

Social Network Services

  • People use SNSs to collect and share information
    • Microblogs (Twitter)
    • Social networks (Facebook)
    • Social bookmarks (del.icio.us)
  • Referring to previously-seen information is common
    • Replying to questions on QA websites
    • Replying to posts on SNSs
  • Three-quarters of web page hits are re-visits

Problem

How do we automatically retrieve the most context-relevant previously-seen web information without user intervention?

 

Sample User Scenario

  1. A film lover has reviewed a movie on Facebook
  2. Film lover's friend posts about the movie on Twitter
  3. The film lover could provide comments about the movie by retrieving the review, but may have forgotten it

Personal Web Information

  • A PWI indicates previously-seen information on different SNSs
  • It is challenging to make connections between the user’s context and their PWIs when the PWIs are spread across multiple SNSs

Problem Statement 

Given a session and the target replier, generate a query to retrieve the most relevant PWIs from the target replier's document collection

Problem Statement Example

Solution

The paper proposes the Context-Aware Personal Information Retrieval (CPIR) algorithm. The algorithm...

  • First builds a query by capturing the user's information need
  • Then retrieves the user's most relevant PWIs

 

Challenges

  • Posts in the conversations are short and ambiguous
  • Documents in SNSs are noisy and complex

Context-Aware Personal Information Retrieval Algorithm

  Session

  • A Session (S) is an online conversation with
    • An initial post, p
    • A set of replies, R
  • It is represented by the Vector Space Model
  • Each term is weighted by its tf-idf score
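
A minimal sketch of this representation, assuming scikit-learn's TfidfVectorizer (the paper does not prescribe a particular implementation, and the variable names here are illustrative):

    # Represent a session (initial post p + replies R) as tf-idf vectors in a
    # shared vocabulary. Illustrative only; the paper's exact weighting may differ.
    from sklearn.feature_extraction.text import TfidfVectorizer

    initial_post = "anyone seen the new sci-fi movie? worth watching?"
    replies = [
        "saw it last weekend, loved the soundtrack",
        "the plot was thin but the visuals were great",
    ]

    vectorizer = TfidfVectorizer(stop_words="english")
    vectors = vectorizer.fit_transform([initial_post] + replies)

    p_vec = vectors[0]        # vector for the initial post p
    reply_vecs = vectors[1:]  # vectors for the replies R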

CPIR Steps

  • Step 1: Query formulation and expansion
  • Step 2: Ranking the PWIs

Query Formulation and Expansion

Query Q is built by

  • Considering both the replies and the PWIs of all users participating in the Session
  • Using the PWIs of the creator and the repliers

Initial Post and Replies

  • First, the initial post p is treated as the basic query
  • Next, combine the replies with p
    • Replies are weighted according to their similarities with p
  • The expanded query is calculated:
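
The slide's formula is not reproduced here; as a hedged sketch consistent with the description above (the notation is ours, not necessarily the paper's exact formula), each reply vector is added to the initial post, scaled by its similarity to p:

    Q_0 = \vec{p} + \sum_{r \in R} \operatorname{sim}(p, r)\, \vec{r}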

Methods, Techniques and External Sources

KL-divergence method

  • Obtains better results than vector space based measures

Smoothing techniques

  • Extract semantic information from texts that are typically short in length

WordNet external source

  • Expands the documents before calculating similarities
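
A rough sketch of a smoothed KL-divergence comparison between two short texts, assuming simple additive smoothing (the paper's exact smoothing scheme and the WordNet expansion step are not shown here):

    import math
    from collections import Counter

    def kl_divergence(text_a, text_b, smoothing=0.01):
        """Smoothed KL divergence D(A || B) between unigram language models.

        Lower values mean more similar texts. Additive smoothing keeps the
        divergence finite when a term of A does not occur in B.
        """
        counts_a, counts_b = Counter(text_a.split()), Counter(text_b.split())
        vocab = set(counts_a) | set(counts_b)
        total_a = sum(counts_a.values()) + smoothing * len(vocab)
        total_b = sum(counts_b.values()) + smoothing * len(vocab)
        divergence = 0.0
        for term in vocab:
            p = (counts_a[term] + smoothing) / total_a
            q = (counts_b[term] + smoothing) / total_b
            divergence += p * math.log(p / q)
        return divergence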

PWIs of the Creator and Existing Repliers

  • The PWIs of the session creator and existing repliers are considered to further expand the query
  • Only the top k most relevant PWIs are selected
  • The expanded query can be represented...
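
As a hedged sketch (β, k, and the notation are illustrative, not necessarily the paper's), the top-k PWIs of the participants are folded into the query, each weighted by its relevance to the current query:

    Q = Q_0 + \beta \sum_{d \in \text{top-}k\ \text{PWIs}} \operatorname{sim}(Q_0, d)\, \vec{d}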

PWI Ranking

Importance Ranking

  • Users in the same session S who share common interests are ranked as more important
  • A Markov random walk model is used (sketched below)
    • The model is represented as a graph with a transition probability matrix
    • It ranks the PWIs of a user u based on implicit relationships between the web information of all users in the session
    • It uses the subset of users' PWIs that is most relevant to the topic of the session
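
A minimal power-iteration sketch of such a random-walk importance score over a document-similarity graph (PageRank-style; the paper's exact transition matrix and restart scheme may differ):

    import numpy as np

    def random_walk_scores(similarity, damping=0.85, iterations=100):
        """Importance scores for PWIs from a Markov random walk.

        `similarity` is an (n x n) non-negative matrix of pairwise document
        similarities. Rows are normalised into transition probabilities and
        the stationary distribution is approximated by power iteration.
        """
        sim = np.asarray(similarity, dtype=float)
        n = sim.shape[0]
        row_sums = sim.sum(axis=1, keepdims=True)
        row_sums[row_sums == 0] = 1.0          # avoid division by zero
        transition = sim / row_sums
        scores = np.full(n, 1.0 / n)
        for _ in range(iterations):
            scores = (1 - damping) / n + damping * (scores @ transition)
        return scores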

Final Ranking

The final ranking is a linear combination of

  • Similarity between the expanded query Q and each document
  • Importance of the document in the collection of PWIs
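
In sketch form, using λ as the mixing weight mentioned under Parameter Settings (the notation is illustrative):

    \operatorname{score}(d) = \lambda \cdot \operatorname{sim}(Q, d) + (1 - \lambda) \cdot \operatorname{importance}(d)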


CPIR Algorithm

Data Description

Tested on a FriendFeed dataset

  • Data was collected by monitoring a data stream on FriendFeed from 1 August 2010 to 30 September 2010 (two months)

Data was filtered to extract

  • Post-reply pairs written in English
  • Posts with repliers that have at least 50 PWIs

Manual Annotation

To construct manual annotation results

  • 105 post-reply pairs were randomly sampled
    • Posted by 73 unique users
    • Users have an average of 316 PWIs
  • Two volunteers manually labeled 23,046 replier PWIs as relevant or irrelevant
  • Tokenization and part-of-speech tagging are performed to eliminate noisy terms
  • Stop words are removed and terms are stemmed
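
A hedged sketch of this preprocessing with NLTK (the paper does not name a toolkit, and which part-of-speech tags count as noisy is an assumption here):

    # Tokenize, POS-tag, drop noisy terms and stop words, then stem.
    # Requires the standard NLTK tokenizer, tagger, and stopword resources
    # (available via nltk.download()).
    import nltk
    from nltk.corpus import stopwords
    from nltk.stem import PorterStemmer

    def preprocess(text, keep_tags=("NN", "JJ", "VB")):
        tokens = nltk.word_tokenize(text.lower())
        tagged = nltk.pos_tag(tokens)
        stops = set(stopwords.words("english"))
        stemmer = PorterStemmer()
        kept = [word for word, tag in tagged
                if tag.startswith(keep_tags) and word.isalpha() and word not in stops]
        return [stemmer.stem(word) for word in kept]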

Data Analysis

  • 98% of conversations have at least three replies
  • 78% of conversations have at least three unique repliers
  • Confirms the feasibility of using conversations to model the task environment for retrieving past information
  • 65% of users use at least two services
    • Confirms that documents are drawn from diverse sources of information
  • 63% of users posted more than 10 PWIs
    • Motivates using users' PWIs to expand the query and improve retrieval performance


Retrieval Performance

  • CPIR with λ=1 achieves improvement over baseline methods
    • Expanding the initial query with replies in the conversation enhanced context cues
    • Adding PWIs further captured the content information
    • The KL-based measure outperforms the cosine-based measure for calculating document similarities
  • The CPIR graph-based ranking algorithm further improves performance

Parameter Settings

  • Optimal parameter values are obtained by fine-tuning
  • The most important parameter, λ, controls how the ranking scores from the random walk model are combined

Conclusions and Future Work

Conclusion

  • CPIR significantly outperforms baseline methods

Future Work

  • Replace importance ranking algorithm with cluster-based techniques to capture multiple topics in a conversation
  • Use fuzzy combination methods instead of linear combination for query expansion and final ranking, since they have been shown to boost performance
  • Treat document recency as an important factor in document ranking