Title Text

Kishor Mohite

Rajat Jain 

Sampath Kolachana

Outline

  • Motivation
  • Problem Statement
  • System Architecture
  • Research Demo
  • Algorithms
  • Novelty & Conclusion
  • Future Work

Motivation

A Random Internet Article

Movie review blog

A financial portal

A news article

People with interests

Want to know

  • What news is currently trending regarding the event discussed in the article or the event of interest?
     
  • What are readers' views on that news?

  • What are the views of people on social networking sites regarding the same news?

Problem Statement

  • Provide the user with other news articles describing the same event
  • Aggregate user opinions about the same event from various platforms
  • Generate an event timeline for the various news events
  • Summarise each news event based on articles from the same cluster

System Architecture

Demo

Algorithms

  • Clustering News Articles
  • Ranking Algorithms
  • Summarisation
  • Tag Generation
  • Trending Tags
  • Categorisation

Clustering Articles

Background

  • 7 Indian News Sources
  • 900-1000 Articles to Cluster

Requirements

  • Incremental clustering
  • Single pass algorithm

Proposed Approach

  • Use news headlines instead of articles
  • Similarity measure based clustering

Solution

  • Stopword Removal
  • Word Stemming
  • Bag of Words Representation
  • Distance computation
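
The steps above can be sketched as a single-pass, incremental clusterer over headlines. This is a minimal illustration, not the system's actual code: the stopword list, the crude `stem` function, the Jaccard distance, and the `threshold` value are all assumptions, since the slides do not name the stemmer, the distance measure, or the cut-off.

```python
import re

# Hypothetical, deliberately tiny stopword list; a real system would use a fuller one.
STOPWORDS = {"a", "an", "the", "of", "in", "on", "to", "for", "and", "is", "as", "at", "by", "with"}


def stem(word):
    """Crude suffix stripping, standing in for a real stemmer (assumption)."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def bag_of_words(headline):
    """Lowercase, tokenize, drop stopwords, and stem; returns a set of terms."""
    tokens = re.findall(r"[a-z0-9]+", headline.lower())
    return {stem(t) for t in tokens if t not in STOPWORDS}


def jaccard_distance(a, b):
    """1 - |intersection| / |union|; two empty bags count as maximally distant."""
    union = a | b
    if not union:
        return 1.0
    return 1.0 - len(a & b) / len(union)


def cluster_headlines(headlines, threshold=0.6):
    """Single pass: each headline joins the nearest cluster or starts a new one."""
    clusters = []  # each cluster is {"terms": set, "headlines": list}
    for headline in headlines:
        bag = bag_of_words(headline)
        best, best_dist = None, threshold
        for cluster in clusters:
            dist = jaccard_distance(bag, cluster["terms"])
            if dist < best_dist:
                best, best_dist = cluster, dist
        if best is None:
            clusters.append({"terms": set(bag), "headlines": [headline]})
        else:
            best["terms"] |= bag
            best["headlines"].append(headline)
    return clusters
```

Because every headline is compared only against the clusters that already exist, new headlines can be added as they arrive without re-clustering the earlier ones.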

Results

  • Works best when each cluster holds a moderate number of headlines
  • Limitations
    • Noisy headlines can end up in a cluster
    • Quality drops when an event gets heavy news coverage
    • i.e., many headlines within a short time span

Ranking Algorithms

  1. Comment Ranking
  2. Cluster Ranking
  3. Article Ranking

Comment Ranking

Flawed Ranking Systems

Score = Upvotes - Downvotes
Score = Upvotes / Total votes

Current Approach - Wilson Score

We use the Wilson score, or more precisely the lower bound of the Wilson score confidence interval for a Bernoulli parameter.

 

With 85% confidence, the real fraction of positive ratings is at least this value.
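
The lower bound can be computed with the standard Wilson formula, sketched below. The function name and the two-sided mapping from the confidence level to the normal quantile `z` are assumptions made for illustration; the slides give only the 85% figure.

```python
from math import sqrt
from statistics import NormalDist


def wilson_lower_bound(upvotes, downvotes, confidence=0.85):
    """Lower bound of the Wilson score interval for the upvote fraction."""
    n = upvotes + downvotes
    if n == 0:
        return 0.0  # no votes yet, so no evidence of quality
    # Assumption: two-sided interval, so z is the (1 - alpha/2) normal quantile.
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    p_hat = upvotes / n
    centre = p_hat + z * z / (2 * n)
    margin = z * sqrt((p_hat * (1 - p_hat) + z * z / (4 * n)) / n)
    return (centre - margin) / (1 + z * z / n)
```

Unlike Upvotes / Total votes, this score places a comment with 1 upvote and 0 downvotes below a comment with 90 upvotes and 10 downvotes, because a single vote is weak evidence.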

Wilson Score - Advantages

  • With 85% confidence, the real fraction of positive ratings is at least this value.

  • Quality comments make it to the top regardless of when they were posted.

  • Auto feedback system

  • Applicable to comments from different platforms

Results

Cluster Ranking

[Figure: Clusters 1-3 and attributes A1-A4]

Linear Regression

  • Attribute Ai is normalized to value ai, and wi is the associated weight.
     
  • Training data set -
     1. Clusters older than 10 days are assumed to have no further comment activity.
     2. The final comment counts on these clusters serve as the popularity measure, and thus as the target values for the scores.
     3. The comments made up to the time of cluster generation form A4.
     
  • Typical values of weights- [w1,w2,w3,w4]
Score = w_1*a_1 + w_2*a_2 + w_3*a_3 + w_4*a_4
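
As a small illustration of the weighted sum, the sketch below scores clusters and sorts them by score. The weight values and the dictionary layout are hypothetical; the slides do not list the fitted weights, and this sketch does not show how they are learned.

```python
def cluster_score(attributes, weights):
    """Weighted sum w_1*a_1 + ... + w_4*a_4 over normalized attributes."""
    if len(attributes) != len(weights):
        raise ValueError("attributes and weights must have the same length")
    return sum(w * a for w, a in zip(weights, attributes))


def rank_clusters(clusters, weights):
    """Return clusters ordered from highest to lowest score."""
    return sorted(
        clusters,
        key=lambda c: cluster_score(c["attributes"], weights),
        reverse=True,
    )


# Hypothetical weights [w1, w2, w3, w4], for illustration only.
EXAMPLE_WEIGHTS = [0.4, 0.3, 0.2, 0.1]
```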

Sectioning

Sections based on-
1. Latest time
2. Number of news sources

Sections sorted based on number of headlines and number of comments.

Section thresholds were chosen by trial and error.


Example -

1. Updated in last 3 hours, news sources reporting >= 5
2. Updated in last 7 hours, news sources reporting >= 5
3. Updated in last 3 hours, news sources reporting >= 3
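
The example thresholds can be turned into a small sectioning routine, sketched below. Two points are assumptions: a cluster goes into the first section whose rule it satisfies, with a catch-all section for everything else, and within a section clusters are ordered by number of headlines and then by number of comments. The field names are hypothetical.

```python
from datetime import datetime, timedelta

# (maximum age in hours, minimum number of news sources), in priority order.
SECTIONS = [(3, 5), (7, 5), (3, 3)]


def assign_section(cluster, now):
    """Index of the first section whose rule the cluster satisfies."""
    age = now - cluster["last_updated"]
    for index, (max_hours, min_sources) in enumerate(SECTIONS):
        if age <= timedelta(hours=max_hours) and cluster["num_sources"] >= min_sources:
            return index
    return len(SECTIONS)  # catch-all section (assumption)


def order_clusters(clusters, now):
    """Sort by section, then by headlines and comments, both descending."""
    return sorted(
        clusters,
        key=lambda c: (assign_section(c, now), -c["num_headlines"], -c["num_comments"]),
    )
```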

Novelty

Cluster ranking has not previously been done using comments.

 

More suitable for cluster ranking than comment ranking.

 

Both the number of news sources and the number of news headlines covering a cluster are used as factors.

Article Ranking

Articles in a cluster

Factors used

1. Time at which article was written or last updated.

2. Normalized number of comments under article.

3. Time at which last comment was made.

 

Normalization is done per -
                          News source + Category

Normalized Value = 0.5 * (Actual Value / Average Value)
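
A minimal sketch of this normalization for comment counts follows. It assumes the average is taken over all articles sharing the same news source and category; the field names are hypothetical.

```python
from collections import defaultdict


def normalize_comment_counts(articles):
    """Return 0.5 * (actual / average) per article, averaged per (source, category)."""
    totals = defaultdict(lambda: [0, 0])  # (source, category) -> [sum, count]
    for article in articles:
        key = (article["source"], article["category"])
        totals[key][0] += article["comments"]
        totals[key][1] += 1

    normalized = []
    for article in articles:
        total, count = totals[(article["source"], article["category"])]
        average = total / count
        # A group with no comments at all gets 0.0 to avoid dividing by zero.
        normalized.append(0.5 * (article["comments"] / average) if average else 0.0)
    return normalized
```

With this scaling, an article with exactly the average number of comments for its source and category maps to 0.5, which keeps heavily commented sources from dominating the ranking.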

Article Summarization

Overview

  • Our system generates a summary for each news cluster.
  • Generating a summary (abstraction) vs extracting a summary (extraction).
  • Problem: Identifying top-k sentences that summarize a news cluster or event.
  • Multiple-source summary generation vs single-source summary

Algorithm

  • Stemming and stop words removal
  • Extracting feature vector for each sentence
  • Generating a complete graph
  • Scoring each sentence based on distance from all other sentences
  • Correction factor due to other headlines
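
The graph-based scoring can be sketched as follows. This is an illustration under assumptions: cosine similarity stands in for the unspecified distance (a sentence close to all others gets a high score), the stopword list is a tiny placeholder, and stemming and the headline correction factor are left out because the slides do not define them.

```python
import math
import re
from collections import Counter

# Hypothetical, deliberately tiny stopword list.
STOPWORDS = {"a", "an", "the", "of", "in", "on", "to", "for", "and", "is", "by", "with"}


def sentence_vector(sentence):
    """Term-frequency feature vector for one sentence."""
    tokens = re.findall(r"[a-z0-9]+", sentence.lower())
    return Counter(t for t in tokens if t not in STOPWORDS)


def cosine_similarity(u, v):
    dot = sum(u[t] * v[t] for t in u if t in v)
    norm = math.sqrt(sum(c * c for c in u.values())) * math.sqrt(sum(c * c for c in v.values()))
    return dot / norm if norm else 0.0


def top_k_sentences(sentences, k=2):
    """Score each sentence against all others (complete graph) and keep the top k."""
    vectors = [sentence_vector(s) for s in sentences]
    scores = [
        sum(cosine_similarity(u, v) for j, v in enumerate(vectors) if j != i)
        for i, u in enumerate(vectors)
    ]
    ranked = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
    # Emit the chosen sentences in their original order for readability.
    return [sentences[i] for i in sorted(ranked[:k])]
```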

Representative Tags Generation

Background

  • Form clusters out of news articles
  • Generate Representative Tags

Solution

  • Use Part of Speech Tagger
    • Bigram Tagger
    • Unigram Tagger
    • Backoff Tagger (Custom Context Free Grammar)
  • Tokenize Headlines
  • Choose everything that is,
    • Proper Noun
    • Proper Noun + Common Noun
    • Noun + Verb
  • Count occurrences of the tags and choose the 3 most frequent tags for the headlines
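
The selection and counting steps can be sketched on headlines that have already been POS-tagged. The tagger itself is not reproduced here; the sketch assumes Penn Treebank style tags (`NNP` for proper nouns, `NN`/`NNS` for common nouns, `VB*` for verbs) and treats each pattern as a single word or an adjacent word pair, which is an interpretation of the rules above.

```python
from collections import Counter


def candidate_tags(tagged_headline):
    """Collect proper nouns, proper+common noun pairs, and noun+verb pairs."""
    candidates = []
    for i, (word, tag) in enumerate(tagged_headline):
        is_proper = tag in ("NNP", "NNPS")
        if is_proper:
            candidates.append(word)
        if i + 1 < len(tagged_headline):
            next_word, next_tag = tagged_headline[i + 1]
            if is_proper and next_tag in ("NN", "NNS"):
                candidates.append(f"{word} {next_word}")
            elif tag.startswith("NN") and next_tag.startswith("VB"):
                candidates.append(f"{word} {next_word}")
    return candidates


def representative_tags(tagged_headlines, n=3):
    """Return the n most frequent candidate tags across a cluster's headlines."""
    counts = Counter()
    for headline in tagged_headlines:
        counts.update(c.lower() for c in candidate_tags(headline))
    return [tag for tag, _ in counts.most_common(n)]
```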

Results & Uses

  • Representative tags give an idea of what a cluster represents
  • Generated tags are also used for generating the trends graph
  • The trends graph shows the media coverage of a particular term over a given period
    • How well an upcoming movie is being covered
    • Tracking product release and people's opinions
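
The data behind a trends graph can be sketched as a per-day count of articles carrying a tag. The article fields (`published`, `tags`) are hypothetical names chosen for this illustration.

```python
from collections import Counter
from datetime import date


def tag_trend(articles, tag):
    """Return (date, article count) pairs for `tag`, in chronological order."""
    counts = Counter(a["published"] for a in articles if tag in a["tags"])
    return sorted(counts.items())
```

Plotting the returned pairs with the date on the x-axis gives the coverage curve for that tag.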

Tags shown for a news cluster

Trends graph for tag 'Panama'

List of articles with same tag

Conclusion and Novelty

  • Proposed a novel system for cross-platform news exploration that aggregates news from various web sources.
  • Proposed a novel news cluster ranking algorithm based on popularity prediction using comments.
  • Proposed a novel news comment ranking algorithm that uses the Wilson score.
