Context-Aware Personal Information Retrieval From Multiple Social Networks

Presented by: Sophie Le Page and Theodore Morin

Authors: Xiaogang Han, Wei Wei, Chunyan Miao, Jian-Ping Mei, and Hengjie Song

Overview

Web services are growing exponentially
People use Social Network Services (SNSs) to collect and share previously seen information

For Example:

Microblogging (e.g., Twitter)
Social network (e.g., Facebook)
Social bookmarking (e.g., Delicious)

Overview (continued)

Referring to and integrating previously seen information is a common activity on the web
58-81% of web page accesses are revisits to previously seen pages

For Example:

Replying to questions on question answering websites
Replying to posts on Social Networking Services (e.g., FriendFeed)

Problem

How to automatically retrieve the most context-relevant previously-seen web information without user intervention

For Example:

A film lover has reviewed a movie on Facebook
Friend posts about the movie on Twitter
The film lover could provide comments about the movie by retrieving the review, but may have forgotten it

Solution

Use Personal Web Information (PWI) from different SNSs with a Context-aware Personal Information Retrieval (CPIR) algorithm

Study how to:

Build a query by capturing the user's information need
Retrieve the user's most relevant PWIs to facilitate information reuse

Challenges:

Posts in the conversations are short and ambiguous
User's documents in SNSs are noisy and complex

A Conversation on FriendFeed

Related Work

Personal Information Retrieval Across Multiple Social Networks

Information fragmentation problems
Diversity among platforms

Context-based query generation and retrieval approaches

Consider the post, the replies to the post, and the PWIs of all participating users
Treat the personal information retrieval problem as a ranking problem

Social Aggregation Services

Assume users involved in the same conversation share common interests across multiple SNSs

Problem Statement

Given a session and the targeting replier, generate a query to retrieve the most relevant PWIs from the target's document collection

Symbols and Definitions

A Session (S) is an online conversation
- initial post p
- set of replies
Represented by the Vector Space Model
Each term is weighted by its tf-idf score
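
The tf-idf weighting of a session can be sketched as follows. This is a minimal illustration, assuming length-normalized term frequency and a plain logarithmic idf; the slides do not state which tf-idf variant the authors use, and the function name and sample session are hypothetical.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return one {term: weight} dict per tokenized document."""
    n = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        # Weight = (relative term frequency) * log(N / document frequency).
        vectors.append({t: (c / len(doc)) * math.log(n / df[t]) for t, c in tf.items()})
    return vectors

# Hypothetical session: an initial post followed by two replies.
session = [
    ["great", "movie", "review"],   # initial post p
    ["movie", "was", "long"],       # reply r1
    ["great", "soundtrack"],        # reply r2
]
vecs = tfidf_vectors(session)
```

Terms that occur in fewer posts of the session (such as "review") receive a higher weight than terms shared by several posts (such as "movie").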

Context-Aware Personal Information Retrieval Algorithm

Composed of two steps:

Query formulation and expansion
PWIs ranking

Query Formulation and Expansion

Participatory context used to reformulate and expand the query
- Considers both replies and PWIs of all participating users
- PWIs of the creator and repliers are used to obtain richer information

Query Expansion

Query Q is built by modeling the session at two levels:

An initial post p and existing replies
The PWIs of the creator and existing repliers

Initial Post and Replies

First, the initial post p is treated as the basic query
Next, combine the replies with p
- weighted according to their similarities with p
The expanded query is calculated as follows:
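
The slide's formula is not reproduced in this transcript. One way to read the description is that each reply vector is added to the post vector, scaled by its similarity to p. The sketch below assumes exactly that additive form; the function name, the sample vectors, and the similarity values are hypothetical.

```python
def expand_query(post_vec, reply_vecs, sims):
    """Add each reply vector to the post vector, scaled by its similarity to the post."""
    query = dict(post_vec)  # start from the initial post p (the basic query)
    for reply, sim in zip(reply_vecs, sims):
        for term, weight in reply.items():
            query[term] = query.get(term, 0.0) + sim * weight
    return query

# Hypothetical term weights and similarities.
p = {"movie": 1.0, "review": 0.5}
replies = [{"movie": 0.8, "actor": 0.4}, {"weather": 1.0}]
q = expand_query(p, replies, sims=[0.9, 0.1])
```

An off-topic reply (here the one about "weather") contributes little to the query because its similarity to p is low.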

Methods, Techniques and External Sources

KL-divergence

Obtains better results than vector space based measures

Smoothing techniques

Take the entire vocabulary into consideration to compare two distributions

External source

Introduce the translation-based language model with WordNet to expand the documents before calculating similarities

KL-divergence

The KL-divergence between p and ri:
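
The slide's formula is missing from this transcript. The standard definition is KL(p || r) = sum over w of P(w|p) * log(P(w|p) / P(w|r)). The sketch below applies it to two term-weight vectors, assuming simple additive smoothing over the shared vocabulary; the smoothing scheme, the `eps` value, and the sample vectors are assumptions, not the authors' settings.

```python
import math

def smoothed_dist(weights, vocab, eps=1e-6):
    """Turn a {term: weight} vector into a distribution over vocab with additive smoothing."""
    total = sum(weights.get(w, 0.0) + eps for w in vocab)
    return {w: (weights.get(w, 0.0) + eps) / total for w in vocab}

def kl_divergence(p, r):
    """KL(p || r) = sum_w p(w) * log(p(w) / r(w)); both must cover the same vocabulary."""
    return sum(p[w] * math.log(p[w] / r[w]) for w in p)

# Hypothetical term weights for the post and one reply.
post = {"movie": 2.0, "review": 1.0}
reply = {"movie": 1.0, "actor": 1.0}
vocab = sorted(set(post) | set(reply))
P, R = smoothed_dist(post, vocab), smoothed_dist(reply, vocab)
divergence = kl_divergence(P, R)
```

Smoothing matters here because a term with zero probability in the reply would otherwise make the logarithm undefined.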

KL-divergence (continued)

P'(w|v) is the expanded distribution:

P(w'|v) is the tf-idf score of w' in v

f(w'|w) is the translation probability of word w to word w' calculated using WordNet sense similarity
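
The expansion formula itself is missing from this transcript. One plausible reading of the definitions above is P'(w|v) = sum over w' of f(w'|w) * P(w'|v). The sketch below assumes that form and renormalizes the result; the translation table is made up for illustration and does not come from WordNet.

```python
def expand_distribution(dist, translation):
    """Assumed form: P'(w|v) = sum over w' of f(w'|w) * P(w'|v), then renormalize.

    translation[w][w2] plays the role of f(w2|w); the values are hypothetical.
    """
    vocab = set(dist) | set(translation)
    expanded = {}
    for w in vocab:
        row = translation.get(w, {w: 1.0})  # words without an entry map to themselves
        expanded[w] = sum(f * dist.get(w2, 0.0) for w2, f in row.items())
    total = sum(expanded.values())
    return {w: s / total for w, s in expanded.items()} if total else expanded

# Hypothetical translation probabilities between two near-synonyms.
toy_translation = {
    "film": {"film": 0.6, "movie": 0.4},
    "movie": {"movie": 0.6, "film": 0.4},
}
doc = {"movie": 0.7, "review": 0.3}
expanded = expand_distribution(doc, toy_translation)
```

After expansion the document assigns probability to "film" even though only "movie" occurs in it, which is what lets two short posts with different wording still match.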

KL-divergence (continued)

Similarity between p and ri:

PWIs of the creator and existing repliers

To further expand the query, consider PWIs of the creator and existing repliers
Only the top k most relevant are selected
The expanded query can be represented as:
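
The slide's formula is not reproduced in this transcript. The sketch below shows the general idea of selecting the top k PWIs and folding them into the query. It uses cosine similarity as a stand-in for the KL-based similarity and assumes a similarity-weighted additive update; both choices, and all names and sample vectors, are assumptions.

```python
import heapq
import math

def cosine(a, b):
    """Cosine similarity between two {term: weight} vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def expand_with_pwis(query, pwis, k=2):
    """Add the k PWIs most similar to the query, each scaled by its similarity."""
    top = heapq.nlargest(k, pwis, key=lambda d: cosine(query, d))
    expanded = dict(query)
    for doc in top:
        sim = cosine(query, doc)  # measured against the original query
        for term, weight in doc.items():
            expanded[term] = expanded.get(term, 0.0) + sim * weight
    return expanded

# Hypothetical PWIs of the creator and existing repliers.
query = {"movie": 1.0}
pwis = [{"movie": 1.0, "oscar": 1.0}, {"recipe": 1.0}, {"movie": 0.5, "trailer": 0.5}]
q2 = expand_with_pwis(query, pwis, k=2)
```

Limiting the expansion to the top k keeps unrelated PWIs (here the one about "recipe") out of the query.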

PWIs Ranking

Implicit-topical context
- Consider shared interests
- The common interest is the topic of the conversation
- Relevant PWIs of the targeting user can be collected by implicitly inferring the subset of documents on the topic

Importance Ranking

Users in the same session S share common interests (at least the topic of S)
Employ a Markov random walk model
Rank the PWIs of a user u based on implicit relationships between the web information of all users in S
Find a subset of u's PWIs that are most relevant to the topic of the session

Transition Probability Matrix

Let G(N, E) be a graph of documents
In G a vertex ni ∈ N is a PWI d ∈ Di (Di ∈ D, and Di≠Dp)
The transition probability matrix of G is represented by P = [pij]
Each transition probability from node ni to node nj is given by:
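
The slide's formula is missing from this transcript. A common construction, assumed here, sets each transition probability to the similarity between two documents divided by the sum of similarities in that row. The function name and the similarity values below are hypothetical.

```python
def transition_matrix(sim):
    """Row-normalize a pairwise similarity matrix, ignoring self-similarity."""
    n = len(sim)
    P = []
    for i in range(n):
        row = [sim[i][j] if i != j else 0.0 for j in range(n)]
        total = sum(row)
        # A node with no similar neighbours keeps an all-zero row (a dangling node).
        P.append([v / total for v in row] if total > 0 else row)
    return P

# Hypothetical pairwise similarities between three PWIs.
sim = [
    [1.0, 0.6, 0.2],
    [0.6, 1.0, 0.0],
    [0.2, 0.0, 1.0],
]
P = transition_matrix(sim)
```

Each non-dangling row of P sums to 1, so it can be read as the probability of stepping from one document to each of the others.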

Similarity Scores

Similarity scores between the generated Q and each PWI in G are used
Done to overcome the "dangling link" problem while conducting a random walk on graph G
For node ni, the reset probability xi is calculated as follows:

Normalize xi to make the sum of all elements in x equal to 1
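
The normalization step can be sketched as below, assuming the unnormalized reset value of each node is its similarity to the query Q. The uniform fallback for an all-zero input is an added assumption, and the sample similarities are hypothetical.

```python
def reset_vector(query_sims):
    """Normalize query-to-document similarities so that they sum to 1."""
    n = len(query_sims)
    total = sum(query_sims)
    if total == 0:
        return [1.0 / n] * n  # assumption: fall back to a uniform reset
    return [s / total for s in query_sims]

# Hypothetical similarities between Q and three PWIs.
x = reset_vector([0.4, 1.0, 0.6])
```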

Eigenvectors

The stationary eigenvector π can be computed iteratively using the power method
P is the transition matrix
x is the reset probability vector
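
A minimal power-method sketch for a random walk with reset is shown below. It assumes the update pi = alpha * P^T pi + (1 - alpha) * x with a damping factor alpha of 0.85; the slides do not give the exact update rule or damping value, so both are assumptions, as are the sample matrix and vector.

```python
def stationary_distribution(P, x, alpha=0.85, tol=1e-10, max_iter=1000):
    """Iterate pi = alpha * P^T pi + (1 - alpha) * x until it stops changing."""
    n = len(x)
    pi = list(x)
    for _ in range(max_iter):
        nxt = []
        for j in range(n):
            walk = sum(pi[i] * P[i][j] for i in range(n))
            nxt.append(alpha * walk + (1 - alpha) * x[j])
        # Probability mass lost at dangling (all-zero) rows is returned via x.
        lost = 1.0 - sum(nxt)
        nxt = [v + lost * x[j] for j, v in enumerate(nxt)]
        delta = sum(abs(a - b) for a, b in zip(nxt, pi))
        pi = nxt
        if delta < tol:
            break
    return pi

# Hypothetical transition matrix and reset vector for three PWIs.
P = [[0.0, 0.75, 0.25], [1.0, 0.0, 0.0], [1.0, 0.0, 0.0]]
x = [0.2, 0.5, 0.3]
pi = stationary_distribution(P, x)
```

In this toy graph the first document ends up with the highest importance because both other documents link only to it.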

Final Ranking

Use a linear combination of the two previously mentioned ranking scores:
- Importance of the document in the collection of PWIs
- Similarity between the expanded query Q and each document
Obtain the final score for each di ∈ Dt as follows:

The top ranked PWIs are selected as the recommendation results to the targeting replier
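
The linear combination can be sketched as score = lambda * similarity + (1 - lambda) * importance. This orientation is inferred from the later statement that lambda = 1 corresponds to no graph ranking; the exact formula is not in this transcript, and the sample scores are hypothetical.

```python
def final_ranking(sims, importance, lam=0.5, top_n=3):
    """Combine query similarity and random-walk importance, then return the top documents."""
    scores = [lam * s + (1 - lam) * imp for s, imp in zip(sims, importance)]
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return [(i, scores[i]) for i in order[:top_n]]

# Hypothetical scores for three PWIs of the targeting replier.
sims = [0.2, 0.5, 0.3]
importance = [0.48, 0.38, 0.14]
ranked = final_ranking(sims, importance, lam=0.5, top_n=2)
```

With lam=1.0 the importance term drops out and the ranking depends on query similarity alone.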

Experiments and Analysis

Involves:

Evaluating the retrieval algorithm
Describing the analysis of the dataset
Discussing the performance of the CPIR algorithm by comparing it with baseline methods

Data Description

FriendFeed dataset
- collected by monitoring the data stream on FriendFeed from 01/08/2010 to 30/09/2010 (two months)
From these conversations select
- Post-reply pairs written in English
- Repliers that have at least 50 PWIs

Manual Annotation

To construct manual annotation results:

Randomly sample 105 post-reply pairs
- replies are posted by 73 unique users
- each user has ~316 PWIs
Two volunteers manually labeled 23,046 PWIs of the repliers as relevant or irrelevant
Tokenization and part-of-speech tagging are performed to eliminate noisy terms
Stop words are removed and terms are stemmed

Data Analysis

98% of conversations have at least three replies
78% of conversations have at least three unique repliers
Confirms the feasibility of using the conversations to model the task environment for retrieving past information

Data Analysis (continued)

65% of users use at least two services
- Confirms documents are extracted from diverse information sources
63% of users posted more than 10 PWIs
- Motivation to utilize PWIs of users to expand query and improve retrieval performance

Data Analysis (continued)

The major portion of PWIs come from sources such as FriendFeed, Twitter, and Google Reader

Retrieval Performance

CPIR λ=1 is without graph ranking
CPIR is CPIR λ=1 combined with graph ranking
CPIR λ=1 achieves improvement over baseline methods
- Expanding the initial query with replies in the conversation enhanced context cues
- Adding PWIs further captured the content information
CPIR graph-based ranking algorithm further improves performance

Distribution of Retrieved Documents

Figure 7 shows the top five social network sites with the largest number of retrieved documents
Number of retrieved documents is proportional to the total number of documents in those platforms

Parameter Settings

Optimal parameter obtained by fine tuning
The most important parameter, λ, controls how the ranking score from the random walk model is combined with the query similarity score

Conclusions and Future Work

Conclusion

CPIR significantly outperforms baseline methods

Future Work

Replace importance ranking algorithm with clustering-based techniques
Take document recency as a factor in document ranking
