Context-Aware Personal Information Retrieval From Multiple Social Networks
Presentation
Sophie Le Page and Theodore Morin
Authors
Xiaogang Han, Wei Wei, Chunyan Miao, Jian-Ping Mei, and Hengjie Song
Social Network Services
- People use SNSs to collect and share information
- Microblogs (Twitter)
- Social networks (Facebook)
- Social bookmarks (del.icio.us)
- Referring to previously-seen information is common
- Three-quarters of web page hits are re-visits
- Replying to questions on QA websites
- Replying to posts on SNSs
Problem
How do we automatically retrieve the most context-relevant previously-seen web information without user intervention?
Sample User Scenario
- A film lover has reviewed a movie on Facebook
- Film lover's friend posts about the movie on Twitter
- The film lover could provide comments about the movie by retrieving the review, but may have forgotten it
Personal Web Information
- A PWI indicates previously-seen information on different SNSs
- It is challenging to make connections between the user’s context and their PWIs when the PWIs are spread across multiple SNSs
Problem Statement
Given a session and a target replier, generate a query to retrieve the most relevant PWIs from the target's document collection
Problem Statement Example
Solution
The paper proposes the Context-Aware Personal Information Retrieval (CPIR) algorithm. The algorithm...
- First builds a query by capturing the user's information need
- Then retrieves the user's most relevant PWIs
Challenges
- Posts in the conversations are short and ambiguous
- Documents in SNSs are noisy and complex
Context-Aware Personal Information Retrieval Algorithm
Session
- A Session (S) is an online conversation with
- An initial post, p
- A set of replies, R
- It is represented by the Vector Space Model
- Each term is weighted by its tf-idf score
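As a rough illustration of the session representation, the sketch below turns an initial post and its replies into tf-idf weighted term vectors. The function name `tfidf_vectors`, the whitespace tokenization, and the specific tf-idf variant (relative term frequency times log inverse document frequency) are assumptions made for this example; the slides do not state which variant the paper uses.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return one {term: weight} dict per tokenized document.

    Assumed variant: tf = count / document length, idf = log(N / df).
    """
    n = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append(
            {t: (c / len(doc)) * math.log(n / df[t]) for t, c in tf.items()}
            if doc else {}
        )
    return vectors

# A toy session: the initial post p followed by two replies.
session = [
    "great movie with a great cast".split(),  # initial post p
    "loved the movie".split(),                # reply
    "the cast was weak".split(),              # reply
]
vecs = tfidf_vectors(session)
```

Terms that appear in only one post of the session (such as "great") receive a higher weight than terms shared across posts (such as "movie").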
Step 1: Query formulation and expansion
Step 2: Ranking the PWIs
CPIR
Query Formulation and Expansion
Query Q is built by
- Considering both the replies and the PWIs of all users participating in the Session
- Using the PWIs of the creator and the repliers
Initial Post and Replies
- First, the initial p is treated as the basic query
- Next, combine the replies with p
- Replies are weighted according to their similarities with p
- The expanded query is calculated:
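The formula from the slide is not reproduced in this text. As a hedged illustration only, the sketch below assumes the expanded query is the initial post vector plus each reply vector scaled by its similarity to the post. The names `cosine` and `expand_query` are hypothetical, and cosine similarity is used here only as a simple stand-in for whatever similarity measure the paper applies.

```python
import math

def cosine(a, b):
    """Cosine similarity between two sparse {term: weight} vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def expand_query(post, replies, sim=cosine):
    """Assumed form: Q = p + sum over replies r of sim(p, r) * r."""
    query = dict(post)  # the initial post is the basic query
    for reply in replies:
        weight = sim(post, reply)  # replies closer to p contribute more
        for term, w in reply.items():
            query[term] = query.get(term, 0.0) + weight * w
    return query

post = {"movie": 1.0, "cast": 1.0}
replies = [{"movie": 1.0, "plot": 1.0}, {"weather": 1.0}]
query = expand_query(post, replies)
```

In this toy session the on-topic reply adds weight to "movie" and introduces "plot", while the off-topic reply about the weather contributes nothing because its similarity to the post is zero.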
Methods, Techniques and External Sources
KL-divergence method
- Obtains better results than vector space based measures
Smoothing techniques
- Extract semantic information from texts that are typically short in length
WordNet external source
- Expands the documents before calculating similarities
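To make the KL-divergence and smoothing ideas concrete, here is a minimal sketch that compares two short texts as smoothed unigram language models. Jelinek-Mercer smoothing against a background model built from all texts is an assumption chosen for this example; the slides do not say which smoothing technique the paper uses. The names `smoothed_lm`, `kl_divergence`, and the mixing weight `mu` are likewise hypothetical. WordNet expansion is not shown.

```python
import math
from collections import Counter

def smoothed_lm(tokens, background, mu=0.5):
    """Mix a text's term frequencies with a background model (Jelinek-Mercer).

    Smoothing gives every background term a nonzero probability, which
    matters for short posts where most terms are unseen.
    """
    counts = Counter(tokens)
    total = len(tokens)
    return {
        t: (1 - mu) * (counts[t] / total if total else 0.0) + mu * p_bg
        for t, p_bg in background.items()
    }

def kl_divergence(p, q):
    """KL(p || q) over a shared vocabulary; lower means more similar."""
    return sum(pv * math.log(pv / q[t]) for t, pv in p.items() if pv > 0)

post = "great movie".split()
reply_a = "loved the movie".split()
reply_b = "rainy weather today".split()

# Background model estimated from every token in the toy collection.
all_tokens = post + reply_a + reply_b
background = {t: c / len(all_tokens) for t, c in Counter(all_tokens).items()}

p = smoothed_lm(post, background)
kl_a = kl_divergence(p, smoothed_lm(reply_a, background))
kl_b = kl_divergence(p, smoothed_lm(reply_b, background))
```

Here `kl_a` comes out smaller than `kl_b`, so the reply that mentions the movie is judged closer to the post than the reply about the weather.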
PWIs of the Creator and Existing Repliers
- We consider the PWIs of the session creator and existing repliers to further expand the query
- Only the top k most relevant PWIs are selected
- The expanded query can be represented...
PWIs Ranking
Importance Ranking
- Users in the same session S who share common interests are ranked more important
- A Markov random walk model is used
- The Markov random walk model is represented as a graph with an associated probability matrix
- Ranks the PWIs of a user u based on implicit relationships between the web information of all users in the Session
- Uses a subset of users' PWIs that are most relevant to the topic of the session
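The importance ranking can be pictured with a small random walk over a document similarity graph. The sketch below is a generic PageRank-style power iteration, not the paper's exact model: the damping factor, the iteration count, the row normalization, and the name `random_walk_scores` are all assumptions made for illustration.

```python
def random_walk_scores(similarity, damping=0.85, iterations=50):
    """Score documents by a random walk over a pairwise similarity matrix.

    similarity[i][j] is an assumed non-negative similarity between
    documents i and j; self-loops are ignored.
    """
    n = len(similarity)
    # Row-normalize similarities into transition probabilities.
    transition = []
    for i, row in enumerate(similarity):
        weights = [w if j != i else 0.0 for j, w in enumerate(row)]
        total = sum(weights)
        # A document with no neighbours jumps uniformly at random.
        transition.append([w / total for w in weights] if total else [1.0 / n] * n)
    scores = [1.0 / n] * n
    for _ in range(iterations):
        scores = [
            (1 - damping) / n
            + damping * sum(scores[i] * transition[i][j] for i in range(n))
            for j in range(n)
        ]
    return scores

# Document 0 is strongly related to both other documents.
similarity = [
    [0.0, 0.9, 0.8],
    [0.9, 0.0, 0.1],
    [0.8, 0.1, 0.0],
]
scores = random_walk_scores(similarity)
```

The scores sum to one, and the document most strongly connected to the others (document 0 here) ends up with the highest importance.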
Final Ranking
Make final ranking with a linear combination of
- Similarity between the expanded query Q and each document
- Importance of the document in the collection of PWIs
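A linear combination of the two signals might look like the sketch below. Which term λ multiplies is an assumption here (λ on the query similarity, 1 − λ on the importance score); the slides only say that the two scores are combined linearly. The function name `final_ranking` and the toy scores are hypothetical.

```python
def final_ranking(similarity, importance, lam=0.5):
    """Rank documents by lam * similarity + (1 - lam) * importance.

    Both inputs map a document id to a score; documents missing from
    `importance` are treated as having importance 0.
    """
    combined = {
        doc: lam * similarity[doc] + (1 - lam) * importance.get(doc, 0.0)
        for doc in similarity
    }
    return sorted(combined, key=combined.get, reverse=True)

similarity = {"review": 0.8, "bookmark": 0.6, "tweet": 0.1}
importance = {"review": 0.2, "bookmark": 0.7, "tweet": 0.1}

balanced = final_ranking(similarity, importance, lam=0.5)
similarity_only = final_ranking(similarity, importance, lam=1.0)
```

With `lam=1.0` only the query similarity counts and the review ranks first; with `lam=0.5` the bookmark overtakes it because of its higher importance score.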
CPIR Algorithm
Data Description
Tested on a FriendFeed dataset
- Data was collected by monitoring a data-stream on FriendFeed from 01/08/2010 to 30/09/2010 (two months)
Data was filtered to extract
- Post-reply pairs written in English
- Posts with repliers that have at least 50 PWIs
Manual Annotation
To construct manual annotation results
- 105 post-reply pairs were randomly sampled
- Posted by 73 unique users
- Users have an average of 316 PWIs
- Two volunteers manually labeled 23,046 replier PWIs as relevant or irrelevant
- Tokenization and part-of-speech tagging are performed to eliminate noisy terms
- Stop words are removed and terms are stemmed
Data Analysis
- 98% of conversations have at least three replies
- 78% of conversations have at least three unique repliers
- Confirms the feasibility of using conversations to model the task environment for retrieving past information
- 65% of users use at least two services
- Confirms documents are extracted from diverse information
- 63% of users posted more than 10 PWIs
- Gives motivation to utilize PWIs of users to expand query and improve retrieval performance
Data Analysis
Retrieval Performance
- CPIR λ=1 achieves improvement over baseline methods
- Expanding the initial query with replies in the conversation enhanced context cues
- Adding PWIs further captured the content information
- KL-based measure outperforms cosine-based measure to calculate document similarities
- CPIR graph-based ranking algorithm further improves performance
Parameter Settings
- The optimal parameters are obtained by fine-tuning
- The most important parameter, λ, controls how the ranking scores from the random walk model are combined with the query similarity scores
Conclusions and Future Work
Conclusion
- CPIR significantly outperforms baseline methods
Future Work
- Replace importance ranking algorithm with cluster-based techniques to capture multiple topics in a conversation
- Use fuzzy combination methods instead of linear combination for query expansion and final ranking, since they have been shown to boost performance
- Treat document recency as an important factor in document ranking
CSI4107 -- IR Social Network
By Ted Morin