Context-Aware Personal Information Retrieval From Multiple Social Networks

Presentation

Sophie Le Page and Theodore Morin

Authors

Xiaogang Han, Wei Wei, Chunyan Miao, Jian-Ping Mei, and Hengjie Song

Social Network Services

  • People use SNSs to collect and share information
    • Microblogs (Twitter)
    • Social networks (Facebook)
    • Social bookmarks (del.icio.us)
  • Referring to previously-seen information is common
    • Replying to questions on QA websites
    • Replying to posts on SNSs
  • Three-quarters of web page hits are re-visits

Problem

How do we automatically retrieve the most context-relevant previously-seen web information without user intervention?

 

Sample User Scenario

  1. A film lover has reviewed a movie on Facebook
  2. Film lover's friend posts about the movie on Twitter
  3. The film lover could provide comments about the movie by retrieving the review, but may have forgotten it

Personal Web Information

  • A PWI indicates previously-seen information on different SNSs
  • It is challenging to make connections between the user’s context and their PWIs when the PWIs are spread across multiple SNSs

Problem Statement 

Given a session and the target replier, generate a query to retrieve the most relevant PWIs from the target replier's document collection

Problem Statement Example

Solution

The paper proposes the Context-Aware Personal Information Retrieval (CPIR) algorithm. The algorithm...

  • First builds a query by capturing the user's information need
  • Then retrieves the user's most relevant PWIs

 

Challenges

  • Posts in the conversations are short and ambiguous
  • Documents in SNSs are noisy and complex

Context-Aware Personal Information Retrieval Algorithm

  Session

  • A Session (S) is an online conversation with
    • An initial post, p
    • A set of replies, R
  • It is represented by the Vector Space Model
  • Each term is weighted by its tf-idf score
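
A minimal sketch of this representation, assuming scikit-learn's TfidfVectorizer (the paper does not prescribe a particular implementation, and the variable names here are illustrative):

    # Represent a session (initial post p + replies R) as tf-idf vectors in a
    # shared vocabulary. Illustrative only; the paper's exact weighting may differ.
    from sklearn.feature_extraction.text import TfidfVectorizer

    initial_post = "anyone seen the new sci-fi movie? worth watching?"
    replies = [
        "saw it last weekend, loved the soundtrack",
        "the plot was thin but the visuals were great",
    ]

    vectorizer = TfidfVectorizer(stop_words="english")
    vectors = vectorizer.fit_transform([initial_post] + replies)

    p_vec = vectors[0]        # vector for the initial post p
    reply_vecs = vectors[1:]  # vectors for the replies R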

CPIR Steps

  • Step 1: Query formulation and expansion
  • Step 2: Ranking the PWIs

Query Formulation and Expansion

Query Q is built by

  • Considering both the replies and the PWIs of all users participating in the Session
  • Using the PWIs of the creator and the repliers

Initial Post and Replies

  • First, the initial post p is treated as the basic query
  • Next, combine the replies with p
    • Replies are weighted according to their similarities with p
  • The expanded query is calculated:
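
The slide's formula is not reproduced here; as a hedged sketch consistent with the description above (the notation is ours, not necessarily the paper's exact formula), each reply vector is added to the initial post, scaled by its similarity to p:

    Q_0 = \vec{p} + \sum_{r \in R} \operatorname{sim}(p, r)\, \vec{r}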

Methods, Techniques and External Sources

KL-divergence method

  • Obtains better results than vector space based measures

Smoothing techniques

  • Extract semantic information from texts that are typically short in length

WordNet external source

  • Expands the documents before calculating similarities
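
A rough sketch of a smoothed KL-divergence comparison between two short texts, assuming simple additive smoothing (the paper's exact smoothing scheme and the WordNet expansion step are not shown here):

    import math
    from collections import Counter

    def kl_divergence(text_a, text_b, smoothing=0.01):
        """Smoothed KL divergence D(A || B) between unigram language models.

        Lower values mean more similar texts. Additive smoothing keeps the
        divergence finite when a term of A does not occur in B.
        """
        counts_a, counts_b = Counter(text_a.split()), Counter(text_b.split())
        vocab = set(counts_a) | set(counts_b)
        total_a = sum(counts_a.values()) + smoothing * len(vocab)
        total_b = sum(counts_b.values()) + smoothing * len(vocab)
        divergence = 0.0
        for term in vocab:
            p = (counts_a[term] + smoothing) / total_a
            q = (counts_b[term] + smoothing) / total_b
            divergence += p * math.log(p / q)
        return divergence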

PWIs of the Creator and Existing Repliers

  • The PWIs of the session creator and existing repliers are considered to further expand the query
  • Only the top k most relevant PWIs are selected
  • The expanded query can be represented...
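
As a hedged sketch (β, k, and the notation are illustrative, not necessarily the paper's), the top-k PWIs of the participants are folded into the query, each weighted by its relevance to the current query:

    Q = Q_0 + \beta \sum_{d \in \text{top-}k\ \text{PWIs}} \operatorname{sim}(Q_0, d)\, \vec{d}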

PWI Ranking

Importance Ranking

  • Users in the same session S who share common interests are ranked as more important
  • A Markov random walk model is used (sketched below)
    • The model is represented as a graph with a transition probability matrix
    • It ranks the PWIs of a user u based on implicit relationships between the web information of all users in the session
    • It uses the subset of users' PWIs that is most relevant to the topic of the session
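
A minimal power-iteration sketch of such a random-walk importance score over a document-similarity graph (PageRank-style; the paper's exact transition matrix and restart scheme may differ):

    import numpy as np

    def random_walk_scores(similarity, damping=0.85, iterations=100):
        """Importance scores for PWIs from a Markov random walk.

        `similarity` is an (n x n) non-negative matrix of pairwise document
        similarities. Rows are normalised into transition probabilities and
        the stationary distribution is approximated by power iteration.
        """
        sim = np.asarray(similarity, dtype=float)
        n = sim.shape[0]
        row_sums = sim.sum(axis=1, keepdims=True)
        row_sums[row_sums == 0] = 1.0          # avoid division by zero
        transition = sim / row_sums
        scores = np.full(n, 1.0 / n)
        for _ in range(iterations):
            scores = (1 - damping) / n + damping * (scores @ transition)
        return scores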

Final Ranking

The final ranking is a linear combination of

  • Similarity between the expanded query Q and each document
  • Importance of the document in the collection of PWIs
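
In sketch form, using λ as the mixing weight mentioned under Parameter Settings (the notation is illustrative):

    \operatorname{score}(d) = \lambda \cdot \operatorname{sim}(Q, d) + (1 - \lambda) \cdot \operatorname{importance}(d)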


CPIR Algorithm

Data Description

Tested on a FriendFeed dataset

  • Data was collected by monitoring a data stream on FriendFeed from 1 August 2010 to 30 September 2010 (two months)

Data was filtered to extract

  • Post-reply pairs written in English
  • Posts with repliers that have at least 50 PWIs

Manual Annotation

To construct manual annotation results

  • 105 post-reply pairs were randomly sampled
    • Posted by 73 unique users
    • Users have an average of 316 PWIs
  • Two volunteers manually labeled 23,046 replier PWIs as relevant or irrelevant
  • Tokenization and part-of-speech tagging are performed to eliminate noisy terms
  • Stop words are removed and terms are stemmed
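
A hedged sketch of this preprocessing with NLTK (the paper does not name a toolkit, and which part-of-speech tags count as noisy is an assumption here):

    # Tokenize, POS-tag, drop noisy terms and stop words, then stem.
    # Requires the standard NLTK tokenizer, tagger, and stopword resources
    # (available via nltk.download()).
    import nltk
    from nltk.corpus import stopwords
    from nltk.stem import PorterStemmer

    def preprocess(text, keep_tags=("NN", "JJ", "VB")):
        tokens = nltk.word_tokenize(text.lower())
        tagged = nltk.pos_tag(tokens)
        stops = set(stopwords.words("english"))
        stemmer = PorterStemmer()
        kept = [word for word, tag in tagged
                if tag.startswith(keep_tags) and word.isalpha() and word not in stops]
        return [stemmer.stem(word) for word in kept]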

Data Analysis

  • 98% of conversations have at least three replies
  • 78% of conversations have at least three unique repliers
  • Confirms the feasibility of using conversations to model the task environment for retrieving past information
  • 65% of users use at least two services
    • Confirms that documents are drawn from diverse sources of information
  • 63% of users posted more than 10 PWIs
    • Motivates using users' PWIs to expand the query and improve retrieval performance


Retrieval Performance

  • CPIR with λ=1 achieves improvement over baseline methods
    • Expanding the initial query with replies in the conversation enhanced context cues
    • Adding PWIs further captured the content information
    • The KL-based measure outperforms the cosine-based measure for calculating document similarities
  • The CPIR graph-based ranking algorithm further improves performance

Parameter Settings

  • Optimal parameter values are obtained by fine-tuning
  • The most important parameter, λ, controls how the ranking scores from the random walk model are combined

Conclusions and Future Work

Conclusion

  • CPIR significantly outperforms baseline methods

Future Work

  • Replace importance ranking algorithm with cluster-based techniques to capture multiple topics in a conversation
  • Use fuzzy combination methods instead of linear combination for query expansion and final ranking, since they have been shown to boost performance
  • Treat document recency as an important factor in document ranking