Context-Aware Personal Information Retrieval From Multiple Social Networks

Presented by: Sophie Le Page and Theodore Morin

Authors: Xiaogang Han, Wei Wei, Chunyan Miao, Jian-Ping Mei, and Hengjie Song

Overview

  • Exponential growth of web services
  • People use Social Network Services (SNSs) to collect and share previously seen information


For Example:

  • Microblogging (e.g., Twitter)
  • Social network (e.g., Facebook)
  • Social bookmarking (e.g., Delicious)

Overview (continued)

  • Referring to and integrating previously seen information is a common activity on the web
  • 58-81% of web page accesses are revisits to previously seen pages

 

For Example:

  • Replying to questions on question answering websites
  • Replying to posts on Social Networking Services (e.g., FriendFeed)

Problem

  • How to automatically retrieve the most context-relevant previously-seen web information without user intervention

 

For Example:

  • A film lover has reviewed a movie on Facebook
  • Friend posts about the movie on Twitter
  • The film lover could provide comments about the movie by retrieving the review, but may have forgotten it

Solution

Use Personal Web Information (PWI) from different SNSs together with a Context-aware Personal Information Retrieval (CPIR) algorithm

 

Study how to:

  • Build a query by capturing the user's information need
  • Retrieve the user's most relevant PWIs to facilitate information reuse

 

Challenges:

  • Posts in the conversations are short and ambiguous
  • User's documents in SNSs are noisy and complex

A Conversation on FriendFeed

Related Work

Personal Information Retrieval Across Multiple Social Networks

  • Information fragmentation problems
  • Diversity among platforms 

Context-based query generation and retrieval approaches

  • Consider the post, the replies to the post, and the PWIs of all participating users
  • Treat the personal information retrieval problem as a ranking problem

Social Aggregation Services

  • Assume users involved in the same conversation share common interests across multiple SNSs

Problem Statement 

  • Given a session and the target replier, generate a query to retrieve the most relevant PWIs from the target's document collection

Symbols and Definitions

  • A Session (S) is an online conversation
    • initial post p
    • set of replies
  • Represented by the Vector Space Model
  • Each term is weighted by its tf-idf score
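
As an illustration of the vector space representation, here is a minimal sketch that weights each term of a tokenized post or reply by tf-idf. The specific variant (relative term frequency times log(N/df)) and the function name `tfidf_vectors` are assumptions for illustration; the slides do not say which tf-idf formulation the paper uses.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return one {term: tf-idf weight} dict per tokenized, non-empty document."""
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({
            # Assumed variant: relative term frequency times log(N / df).
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in tf.items()
        })
    return vectors

# Toy session: an initial post followed by two replies (already tokenized).
session = [
    ["great", "movie", "tonight"],
    ["loved", "movie", "soundtrack"],
    ["movie", "tickets", "tonight"],
]
vectors = tfidf_vectors(session)
```

In this toy session, "movie" occurs in every document, so its weight is zero, while rarer terms such as "great" receive the largest weights.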

Context-Aware Personal Information Retrieval Algorithm

Composed of two steps:

  1. Query formulation and expansion
  2. PWI ranking

Query Formulation and Expansion

  • Participatory context used to reformulate and expand the query
    • Considers both replies and PWIs of all participating users
    • PWIs of the creator and repliers are used to obtain richer information

Query Expansion

Query Q is built by modeling the session at two levels:

  1. An initial post p and existing replies
  2. The PWIs of the creator and existing repliers

Initial Post and Replies

  • First, the initial post p is treated as the basic query
  • Next, combine the replies with p
    • weighted according to their similarities with p
  • The expanded query is calculated as follows:
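
One plausible reading of this step, offered as a sketch rather than the paper's formula: start from the term vector of p and add each reply's vector scaled by its similarity to p. Cosine similarity is used here only as a simple stand-in for the KL-divergence-based similarity the talk describes, and the names `cosine` and `expand_with_replies` are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two sparse {term: weight} vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def expand_with_replies(post, replies, similarity=cosine):
    """Assumed form: Q = p + sum_i sim(p, r_i) * r_i."""
    query = dict(post)
    for reply in replies:
        weight = similarity(post, reply)
        for term, value in reply.items():
            query[term] = query.get(term, 0.0) + weight * value
    return query

post = {"movie": 1.0, "review": 1.0}
replies = [{"movie": 1.0, "actor": 1.0}, {"weather": 1.0}]
query = expand_with_replies(post, replies)
```

The off-topic reply ("weather") has zero similarity to the post, so it contributes nothing to the expanded query, while the on-topic reply adds "actor" at half weight.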

Methods, Techniques and External Sources

KL-divergence

  • Obtains better results than vector space based measures

Smoothing techniques

  • Take the entire vocabulary into consideration to compare two distributions 

External source

  • Introduce the translation-based language model with WordNet to expand the documents before calculating similarities

KL-divergence

The KL-divergence between p and ri:
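
For reference, the standard KL-divergence between two word distributions sums P(w|p) · log(P(w|p) / P(w|ri)) over the vocabulary. The sketch below computes it after additive smoothing, so that words missing from one text do not produce a division by zero. Additive smoothing with `alpha=0.01` is an assumption for illustration; the slides do not name the smoothing technique used in the paper.

```python
import math

def smoothed_distribution(counts, vocabulary, alpha=0.01):
    """Additively smoothed word distribution over a shared vocabulary."""
    total = sum(counts.get(w, 0.0) for w in vocabulary) + alpha * len(vocabulary)
    return {w: (counts.get(w, 0.0) + alpha) / total for w in vocabulary}

def kl_divergence(p, q):
    """D_KL(p || q) for two distributions defined over the same vocabulary."""
    return sum(p[w] * math.log(p[w] / q[w]) for w in p if p[w] > 0)

post = {"movie": 2, "review": 1}
reply = {"movie": 1, "actor": 1}
vocab = set(post) | set(reply)

p = smoothed_distribution(post, vocab)
r = smoothed_distribution(reply, vocab)
divergence = kl_divergence(p, r)
```

A smaller divergence means the reply's word distribution is closer to the post's; identical distributions give a divergence of zero.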

KL-divergence (continued)

P'(w|v) is the expanded distribution:

P(w'|v) is the tf-idf score of w' in v

 

f(w'|w) is the translation probability of word w to word w' calculated using WordNet sense similarity
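
To make the role of the translation probabilities concrete, the sketch below assumes the expanded score of w is the sum over w' of f(w'|w) · P(w'|v), followed by renormalization. Both that form and the renormalization are assumptions, since the slide's formula is not available in this text. The translation table is hand-made and hypothetical; the paper derives f from WordNet sense similarity.

```python
def expand_distribution(scores, translation):
    """scores: {w': P(w'|v)}; translation: {w: {w': f(w'|w)}}."""
    expanded = {}
    for w, related in translation.items():
        # Assumed form: P'(w|v) is proportional to sum over w' of f(w'|w) * P(w'|v).
        mass = sum(f * scores.get(w_prime, 0.0) for w_prime, f in related.items())
        if mass > 0:
            expanded[w] = mass
    total = sum(expanded.values())
    # Renormalize so the expanded scores sum to 1 (an assumption).
    return {w: m / total for w, m in expanded.items()} if total else {}

# Hypothetical translation table standing in for WordNet-based similarity.
translation = {
    "movie": {"movie": 1.0, "film": 0.8},
    "film": {"film": 1.0, "movie": 0.8},
    "actor": {"actor": 1.0},
}
scores = {"movie": 0.6, "actor": 0.4}
expanded = expand_distribution(scores, translation)
```

Although "film" never appears in the original text, it receives probability mass through its link to "movie", which is what lets short posts match documents that use related vocabulary.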

KL-divergence (continued)

Similarity between p and ri:

PWIs of the creator and existing repliers

  • To further expand the query, consider PWIs of the creator and existing repliers
  • Only the top k most relevant are selected
  • The expanded query can be represented as:
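
A minimal sketch of the top-k selection, assuming each selected PWI is added to the query weighted by its similarity to the current query. The weighting scheme, the cosine stand-in for the similarity measure, and the name `expand_with_pwis` are all assumptions; the slide's formula is not available in this text.

```python
import heapq
import math

def cosine(u, v):
    """Cosine similarity between two sparse {term: weight} vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def expand_with_pwis(query, pwis, k=2, similarity=cosine):
    """Add the k PWIs most similar to the query, weighted by that similarity."""
    scored = [(similarity(query, doc), doc) for doc in pwis]
    top = heapq.nlargest(k, scored, key=lambda pair: pair[0])
    expanded = dict(query)
    for weight, doc in top:
        for term, value in doc.items():
            expanded[term] = expanded.get(term, 0.0) + weight * value
    return expanded

query = {"movie": 1.0}
pwis = [{"movie": 1.0, "oscar": 1.0}, {"recipe": 1.0}, {"movie": 1.0}]
expanded = expand_with_pwis(query, pwis, k=1)
```

With k=1 only the single closest PWI is merged in, so noisy or off-topic documents in the participants' collections do not dilute the query.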

PWIs Ranking

  • Implicit-topical context
    • Consider shared interests
    • The common interest is the topic of the conversation
    • Relevant PWIs of the target user can be collected by implicitly inferring the subset of documents on the topic

Importance Ranking

  • Users in the same session S share common interests (at least the topic of S)
  • Employ a Markov random walk model
  • Rank the PWIs of a user u based on implicit relationships between the web information of all users in S
  • Find a subset of u's PWIs that are most relevant to the topic of the session

Transition Probability Matrix

  • Let G(N, E) be a graph of documents
  • In G, a vertex ni ∈ N is a PWI d ∈ Di (Di ⊆ D, and Di Dp)
  • The transition probability matrix of G is represented by P = [pij]
  • Each transition probability from node ni to node nj is given by:
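
A common way to build such a matrix, shown here only as a sketch, is to row-normalize pairwise document similarities so that each row sums to 1. Whether the paper uses exactly this normalization is an assumption, as is dropping self-loops; `transition_matrix` is a hypothetical helper name.

```python
def transition_matrix(similarity):
    """Row-normalize a pairwise similarity matrix, ignoring self-loops."""
    n = len(similarity)
    matrix = []
    for i in range(n):
        row = [similarity[i][j] if i != j else 0.0 for j in range(n)]
        total = sum(row)
        # A node with no similar neighbours keeps an all-zero row.
        matrix.append([s / total for s in row] if total > 0 else [0.0] * n)
    return matrix

# Toy similarities between three PWIs.
sim = [
    [1.0, 0.5, 0.5],
    [0.5, 1.0, 0.0],
    [0.5, 0.0, 1.0],
]
P = transition_matrix(sim)
```

Here document 0 is equally similar to documents 1 and 2, so a walker at node 0 moves to either with probability 0.5, while nodes 1 and 2 always move back to node 0.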

Similarity Scores

  • Similarity scores between the generated Q and each PWI in G are used
  • Done to overcome the "dangling link" problem while conducting a random walk on graph G
  • For node ni, the reset probability xi is calculated as follows:
  • Normalize xi to make the sum of all elements in x equal to 1
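
The normalization step can be sketched as follows, assuming the unnormalized xi is simply the similarity between Q and document di. The uniform fallback for an all-zero input is an added assumption, not something stated in the slides.

```python
def reset_vector(query_similarities):
    """Normalize query-document similarities into a probability vector."""
    total = sum(query_similarities)
    n = len(query_similarities)
    if total == 0:
        # Assumed fallback: restart uniformly when no document matches the query.
        return [1.0 / n] * n
    return [s / total for s in query_similarities]

x = reset_vector([3.0, 1.5, 0.5])
```

Documents closer to the query receive a larger share of the restart probability, which biases the random walk toward the topic of the session.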

Eigenvectors

  • The stationary eigenvector π can be computed iteratively using the power method
  • P is the transition matrix  
  • x is the reset probability vector
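
A minimal power-method sketch for a random walk with restart, assuming the update π ← (1 − α) Pᵀ π + α x. The restart weight `alpha=0.15`, the tolerance, and the function name are assumptions chosen for illustration; the slides do not give the paper's values.

```python
def stationary_distribution(P, x, alpha=0.15, tol=1e-10, max_iter=1000):
    """Iterate pi <- (1 - alpha) * P^T pi + alpha * x until it stops changing."""
    n = len(P)
    pi = [1.0 / n] * n
    for _ in range(max_iter):
        new_pi = [
            (1 - alpha) * sum(pi[i] * P[i][j] for i in range(n)) + alpha * x[j]
            for j in range(n)
        ]
        # Stop once the L1 change between iterations is negligible.
        if sum(abs(a - b) for a, b in zip(new_pi, pi)) < tol:
            return new_pi
        pi = new_pi
    return pi

# Toy transition matrix (rows sum to 1) and reset vector (sums to 1).
P = [
    [0.0, 0.5, 0.5],
    [1.0, 0.0, 0.0],
    [1.0, 0.0, 0.0],
]
x = [0.6, 0.3, 0.1]
pi = stationary_distribution(P, x)
```

In this toy graph node 0 ends up with the highest importance, because both other nodes link to it and it also has the largest reset probability.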

Final Ranking

  • Use a linear combination of the two previously mentioned ranking scores:
    • Importance of the document in the collection of PWIs
    • Similarity between the expanded query Q and each document
  • Obtain the final score for each di ∈ Dt as follows:
  • The top-ranked PWIs are selected as the recommendation results for the target replier
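
The linear combination can be sketched as score = λ · similarity + (1 − λ) · importance. This form is an assumption, chosen to be consistent with the talk's later note that λ=1 corresponds to CPIR without graph ranking; the function names and toy scores are hypothetical.

```python
def final_scores(similarity, importance, lam=0.5):
    """Assumed form: lam * query similarity + (1 - lam) * random-walk importance."""
    return [lam * s + (1 - lam) * r for s, r in zip(similarity, importance)]

def top_ranked(doc_ids, scores, n=2):
    """Return the ids of the n highest-scoring documents."""
    ranked = sorted(zip(doc_ids, scores), key=lambda pair: pair[1], reverse=True)
    return [doc_id for doc_id, _ in ranked[:n]]

doc_ids = ["d1", "d2", "d3"]
similarity = [0.2, 0.9, 0.5]   # similarity between Q and each PWI
importance = [0.7, 0.1, 0.2]   # stationary random-walk scores

blended = top_ranked(doc_ids, final_scores(similarity, importance, lam=0.5))
similarity_only = top_ranked(doc_ids, final_scores(similarity, importance, lam=1.0))
```

With λ=1 the ranking depends only on query similarity, while lowering λ lets a document that is central in the session graph (d1 here) move up the list.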

Experiments and Analysis

Involves:

  • Evaluating the retrieval algorithm
  • Describing the analysis of the dataset
  • Discussing the performance of the CPIR algorithm by comparing it with baseline methods

Data Description

  • FriendFeed dataset
    • collected by monitoring the data stream on FriendFeed from 01/08/2010 to 30/09/2010 (two months)
  • From these conversations select 
    • Post-reply pairs written in English
    • Repliers that have at least 50 PWIs

Manual Annotation

To construct manual annotation results:

  • Randomly sample 105 post-reply pairs
    • replies are posted by 73 unique users
    • each user has ~316 PWIs
  • Two volunteers manually labeled 23,046 PWIs of the repliers as relevant or irrelevant
  • Tokenization and part-of-speech tagging are performed to eliminate noisy terms
  • Stop words are removed and terms are stemmed
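
A toy sketch of this kind of preprocessing pipeline is shown below. The stop-word list and the suffix-stripping stemmer are deliberately simplistic stand-ins (not the tools used in the paper, which the slides do not name), and part-of-speech tagging is omitted.

```python
import re

# Tiny illustrative stop-word list; a real pipeline would use a fuller one.
STOP_WORDS = {"a", "an", "the", "is", "are", "to", "of", "and", "in", "on"}

def crude_stem(token):
    """Strip a few common suffixes; a rough stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

def preprocess(text):
    """Lowercase, tokenize on letters, drop stop words, then stem."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [crude_stem(t) for t in tokens if t not in STOP_WORDS]

terms = preprocess("The actors are reviewing movies on Twitter")
```

The resulting term list is what would feed the tf-idf weighting described earlier in the talk.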

Data Analysis

  • 98% of conversations have at least three replies
  • 78% of conversations have at least three unique repliers
  • Confirms the feasibility of using conversations to model the task environment for retrieving past information

Data Analysis (continued)

  • 65% of users use at least two services
    • Confirms documents are extracted from diverse information sources
  • 63% of users posted more than 10 PWIs
    • Motivation to utilize PWIs of users to expand query and improve retrieval performance

Data Analysis (continued)

  • The major portion of PWIs come from sources such as FriendFeed, Twitter, and Google Reader

Retrieval Performance

  • CPIR λ=1 is CPIR without graph ranking
  • CPIR is CPIR λ=1 plus graph ranking
  • CPIR λ=1 achieves improvement over baseline methods
    • Expanding the initial query with replies in the conversation enhanced context cues
    • Adding PWIs further captured the content information
  • CPIR graph-based ranking algorithm further improves performance

Distribution of Retrieved Documents

  • Figure 7 shows the top five social network sites with the largest number of retrieved documents
  • Number of retrieved documents is proportional to the total number of documents in those platforms

Parameter Settings

  • Optimal parameters obtained by fine-tuning
  • The most important parameter, λ, controls how the importance score from the random walk model is combined with the query similarity score

Conclusions and Future Work

Conclusion

  • CPIR significantly outperforms baseline methods

 

Future Work

  • Replace importance ranking algorithm with clustering-based techniques
  • Take document recency as a factor in document ranking

Paper Presentation

By sofa13
