Context-Aware Personal Information Retrieval From Multiple Social Networks
Presented by: Sophie Le Page and Theodore Morin
Authors: Xiaogang Han, Wei Wei, Chunyan Miao, Jian-Ping Mei, and Hengjie Song
Overview
- Exponential growth of web services
- People use Social Network Services (SNSs) to collect and share previously seen information
For Example:
- Microblogging (e.g., Twitter)
- Social network (e.g., Facebook)
- Social bookmarking (e.g., Delicious)
Overview (continued)
- Referring to and integrating previously seen information is a common activity on the web
- 58-81% of web page accesses are revisits to previously seen pages
For Example:
- Replying to questions on question answering websites
- Replying to posts on Social Networking Services (e.g., FriendFeed)
Problem
- How to automatically retrieve the most context-relevant previously-seen web information without user intervention
For Example:
- A film lover has reviewed a movie on Facebook
- Friend posts about the movie on Twitter
- The film lover could provide comments about the movie by retrieving the review, but may have forgotten it
Solution
Use Personal Web Information (PWI) on different SNSs and Context-aware Personal Information Retrieval (CPIR) algorithm
Study how to:
- Build a query by capturing the user's information need
- Retrieve the user's most relevant PWIs to facilitate information reuse
Challenges:
- Posts in the conversations are short and ambiguous
- User's documents in SNSs are noisy and complex
A Conversation on FriendFeed
Related Work
Personal Information Retrieval Across Multiple Social Networks
- Information fragmentation problems
- Diversity among platforms
Context-based query generation and retrieval approaches
- Consider the post, the replies to the post, and the PWIs of all participating users
- Treat the personal information retrieval problem as a ranking problem
Social Aggregation Services
- Assume users involved in the same conversation share common interests across multiple SNSs
Problem Statement
- Given a session and the target replier, generate a query to retrieve the most relevant PWIs from the target's document collection
Symbols and Definitions
- A Session (S) is an online conversation
- initial post p
- set of replies
- Represented by the Vector Space Model
- Each term is weighted by its tf-idf score
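The tf-idf weighting above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `tfidf_vectors`, the toy session, and the particular tf and idf variants (relative term frequency, natural-log idf) are assumptions.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return one sparse tf-idf vector (dict: term -> weight) per tokenized document."""
    n = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        # Relative term frequency times log inverse document frequency (assumed variant).
        vectors.append({t: (c / len(doc)) * math.log(n / df[t]) for t, c in tf.items()})
    return vectors

# Hypothetical session: an initial post p followed by two replies.
session = [
    ["great", "movie", "review"],   # initial post p
    ["movie", "sequel", "rumor"],   # reply r1
    ["weather", "today"],           # reply r2
]
vecs = tfidf_vectors(session)
```

Terms that occur in fewer posts of the session (such as "great") receive a higher weight than terms shared across posts (such as "movie").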
Context-Aware Personal Information Retrieval Algorithm
Composed of two steps:
- Query formulation and expansion
- PWIs ranking
Query Formulation and Expansion
- Participatory context is used to reformulate and expand the query
- Considers both replies and PWIs of all participating users
- PWIs of the creator and repliers are used to obtain richer information
Query Expansion
Query Q is built by modeling the session at two levels:
- An initial post p and existing replies
- The PWIs of the creator and existing repliers
Initial Post and Replies
- First, the initial post p is treated as the basic query
- Next, combine the replies with p
- weighted according to their similarities with p
- The expanded query is calculated as follows:
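A minimal sketch of this reply-weighted expansion, assuming an additive form in which each reply vector is scaled by its similarity to p and added to the query. The names `cosine` and `expand_query` are hypothetical, and cosine similarity is used here only as a stand-in for the KL-divergence-based similarity the slides describe.

```python
import math

def cosine(u, v):
    """Cosine similarity between two sparse vectors (dict: term -> weight)."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def expand_query(post, replies, sim=cosine):
    """Start from the post vector and add each reply, weighted by its similarity to the post."""
    query = dict(post)
    for reply in replies:
        weight = sim(post, reply)
        for term, value in reply.items():
            query[term] = query.get(term, 0.0) + weight * value
    return query

# Hypothetical term weights for a post and two replies.
post = {"movie": 1.0, "review": 0.5}
replies = [{"movie": 1.0, "sequel": 1.0}, {"weather": 1.0}]
query = expand_query(post, replies)
```

An off-topic reply (the "weather" one) has zero similarity to p and so contributes nothing to the expanded query.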
Methods, Techniques and External Sources
KL-divergence
- Obtains better results than vector space based measures
Smoothing techniques
- Take the entire vocabulary into consideration to compare two distributions
External source
- Introduce the translation-based language model with WordNet to expand the documents before calculating similarities
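The role of smoothing can be illustrated with a short sketch. It assumes simple additive smoothing with a hypothetical parameter `mu` over the joint vocabulary of the two texts; the slides do not say which smoothing technique the authors use, and the WordNet-based expansion is left out.

```python
import math

def smoothed_distribution(counts, vocabulary, mu=0.1):
    """Additively smoothed word distribution over a shared vocabulary (assumed scheme)."""
    total = sum(counts.get(w, 0) for w in vocabulary) + mu * len(vocabulary)
    return {w: (counts.get(w, 0) + mu) / total for w in vocabulary}

def kl_divergence(p, q):
    """KL divergence D(p || q) for two distributions defined on the same vocabulary."""
    return sum(p[w] * math.log(p[w] / q[w]) for w in p)

# Hypothetical term counts for a post and one reply.
post = {"movie": 2, "review": 1}
reply = {"movie": 1, "sequel": 1}
vocab = set(post) | set(reply)

p = smoothed_distribution(post, vocab)
q = smoothed_distribution(reply, vocab)
divergence = kl_divergence(p, q)
```

Because every word in the shared vocabulary receives non-zero probability after smoothing, the logarithm is always defined, even for words that appear in only one of the two texts.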
KL-divergence
The KL-divergence between p and ri:
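For reference, the standard definition of KL divergence is given below. The notation is an assumption: $\theta_p$ and $\theta_{r_i}$ stand for the smoothed word distributions of the post and the reply, and $V$ for the vocabulary; the paper's own symbols may differ.

```latex
D_{KL}\left(\theta_p \,\|\, \theta_{r_i}\right)
  = \sum_{w \in V} P(w \mid \theta_p) \,
    \log \frac{P(w \mid \theta_p)}{P(w \mid \theta_{r_i})}
```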
KL-divergence (continued)
P'(w|v) is the expanded distribution:
P(w'|v) is the tf-idf score of w' in v
f(w'|w) is the translation probability of word w to word w' calculated using WordNet sense similarity
KL-divergence (continued)
Similarity between p and ri:
PWIs of the creator and existing repliers
- To further expand the query, consider PWIs of the creator and existing repliers
- Only the top k most relevant are selected
- The expanded query can be represented as:
PWIs Ranking
- Implicit-topical context
- Consider shared interests
- The common interest is the topic of the conversation
- Relevant PWIs of the target user can be collected by implicitly inferring the subset of documents on the topic
Importance Ranking
- Users in the same session S share common interests (at least the topic of S)
- Employ a Markov random walk model
- Rank the PWIs of a user u based on implicit relationships between the web information of all users in S
- Find a subset of u's PWIs that are most relevant to the topic of the session
Transition Probability Matrix
- Let G(N, E) be a graph of documents
- In G a vertex ni ∈ N is a PWI d ∈ Di (Di ∈ D, and Di≠Dp)
- The transition probability matrix of G is represented by P = [pij]
- Each transition probability from node ni to node nj is given by:
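One common way to define such transition probabilities, assumed here rather than taken from the paper, is to row-normalize pairwise document similarities. The function name `transition_matrix` and the toy similarity values are hypothetical.

```python
def transition_matrix(similarity):
    """Row-normalize a pairwise similarity matrix into transition probabilities p_ij."""
    n = len(similarity)
    matrix = []
    for i in range(n):
        # Ignore self-similarity so a step always moves to a different document.
        weights = [similarity[i][j] if i != j else 0.0 for j in range(n)]
        total = sum(weights)
        if total == 0:
            # Dangling node: no outgoing similarity mass at all.
            matrix.append([0.0] * n)
        else:
            matrix.append([w / total for w in weights])
    return matrix

# Hypothetical similarities between three PWIs; the third is unrelated to the others.
sim = [
    [1.0, 0.5, 0.5],
    [0.5, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]
P = transition_matrix(sim)
```

Rows with outgoing similarity sum to 1, while the row of an isolated document stays all zero; this is the dangling case that a reset probability is meant to handle.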
Similarity Scores
- Similarity scores between the generated Q and each PWI in G are used
- Done to overcome the "dangling link" problem while conducting a random walk on graph G
- For node ni, the reset probability xi is calculated as follows:
- Normalize xi to make the sum of all elements in x equal to 1
Eigenvectors
- The stationary eigenvector π can be computed iteratively using the power method
- P is the transition matrix
- x is the reset probability vector
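A minimal sketch of the power method for a random walk with reset. The update rule and the reset weight `alpha` are assumptions in the style of personalized PageRank; the slides only state that P and x enter the iteration.

```python
def random_walk_with_reset(P, x, alpha=0.15, tol=1e-10, max_iter=1000):
    """Iterate pi <- alpha * x + (1 - alpha) * P^T pi until the vector stops changing."""
    n = len(x)
    pi = list(x)
    for _ in range(max_iter):
        new = [
            alpha * x[j] + (1 - alpha) * sum(pi[i] * P[i][j] for i in range(n))
            for j in range(n)
        ]
        # Renormalize so probability mass lost at dangling rows is redistributed.
        total = sum(new)
        new = [v / total for v in new]
        if max(abs(a - b) for a, b in zip(new, pi)) < tol:
            return new
        pi = new
    return pi

# Hypothetical three-document graph in which the middle document links to both others.
P = [
    [0.0, 1.0, 0.0],
    [0.5, 0.0, 0.5],
    [0.0, 1.0, 0.0],
]
x = [1 / 3, 1 / 3, 1 / 3]   # normalized reset probabilities
pi = random_walk_with_reset(P, x)
```

In this toy graph the middle document receives the highest stationary probability because both other documents transition to it.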
Final Ranking
- Use a linear combination of the two previously mentioned ranking scores:
- Importance of the document in the collection of PWIs
- Similarity between the expanded query Q and each document
- Obtain the final score for each di ∈ Dt as follows:
- The top-ranked PWIs are selected as the recommendation results for the target replier
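The linear combination can be sketched as below. It assumes the form score = λ · similarity + (1 − λ) · importance, chosen so that λ = 1 reduces to similarity-only ranking, which matches the later remark that CPIR λ=1 is without graph ranking. The function names and the toy scores are hypothetical.

```python
def final_scores(similarity, importance, lam=0.5):
    """Combine query similarity and random-walk importance with weight lam (assumed form)."""
    return [lam * s + (1 - lam) * r for s, r in zip(similarity, importance)]

def top_k(doc_ids, scores, k=3):
    """Return the ids of the k highest-scoring documents."""
    ranked = sorted(zip(doc_ids, scores), key=lambda pair: pair[1], reverse=True)
    return [doc for doc, _ in ranked[:k]]

# Hypothetical scores for three PWIs of the target replier.
docs = ["d1", "d2", "d3"]
similarity = [0.9, 0.2, 0.5]   # similarity between expanded query Q and each document
importance = [0.1, 0.7, 0.2]   # stationary probabilities from the random walk

similarity_only = top_k(docs, final_scores(similarity, importance, lam=1.0), k=2)
importance_only = top_k(docs, final_scores(similarity, importance, lam=0.0), k=2)
combined = top_k(docs, final_scores(similarity, importance, lam=0.5), k=2)
```

Varying `lam` between 0 and 1 shifts the ranking from purely graph-based importance to purely query-based similarity.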
Experiments and Analysis
Involves:
- Evaluating the retrieval algorithm
- Describing the analysis of the dataset
- Discussing performance of CPIR algorithm by comparing with other baselines
Data Description
- FriendFeed dataset
- collected by monitoring the data stream on FriendFeed from 01/08/2010 to 30/09/2010 (two months)
- From these conversations select
- Post-reply pairs written in English
- Repliers that have at least 50 PWIs
Manual Annotation
To construct manual annotation results:
- Randomly sample 105 post-reply pairs
- replies are posted by 73 unique users
- each user has ~316 PWIs
- Two volunteers manually labeled 23,046 PWIs of the repliers as relevant or irrelevant
- Tokenization and part-of-speech tagging are performed to eliminate noisy terms
- Stop words are removed and terms are stemmed
Data Analysis
- 98% of conversations have at least three replies
- 78% of conversations have at least three unique repliers
- Confirms the feasibility of using the conversations to model the task environment for retrieving past information
Data Analysis (continued)
- 65% of users use at least two services
- Confirms that documents are drawn from diverse information sources
- 63% of users posted more than 10 PWIs
- Motivates using the PWIs of users to expand the query and improve retrieval performance
Data Analysis (continued)
- The major portion of PWIs come from sources such as FriendFeed, Twitter, and Google Reader
Retrieval Performance
- CPIR λ=1 is CPIR without graph ranking
- CPIR is CPIR λ=1 with graph ranking added
- CPIR λ=1 achieves improvement over baseline methods
- Expanding the initial query with replies in the conversation enhanced context cues
- Adding PWIs further captured the content information
- CPIR graph-based ranking algorithm further improves performance
Distribution of Retrieved Documents
- Figure 7 shows the top five social network sites with the largest number of retrieved documents
- Number of retrieved documents is proportional to the total number of documents in those platforms
Parameter Settings
- Optimal parameters are obtained by fine-tuning
- The most important parameter, λ, controls how the importance score from the random walk model is combined with the query similarity score
Conclusions and Future Work
Conclusion
- CPIR significantly outperforms baseline methods
Future Work
- Replace importance ranking algorithm with clustering-based techniques
- Take document recency as a factor in document ranking