Title Text
Kishor Mohite
Rajat Jain
Sampath Kolachana
Outline
- Motivation
- Problem Statement
- System Architecture
- Research Demo
- Algorithms
- Novelty & Conclusion
- Future Work
Motivation
A Random Internet Article
Movie review blog
A financial portal
A news article
People with interests
Want to know
- What news is currently trending regarding the event discussed in the article or the event of interest?
- What are readers' views on that news?
- What are people on social networking sites saying about it?
Problem Statement
- Provide the user with other news articles describing the same event
- Aggregate user opinions on the same event from various platforms
- Generate an event timeline for the various news events
- Summarise each news event based on the articles in its cluster
System Architecture
Demo
Algorithms
- Clustering News Articles
- Ranking Algorithms
- Summarisation
- Tag Generation
- Trending Tags
- Categorisation
Clustering Articles
Background
- 7 Indian News Sources
- 900-1000 Articles to Cluster
Requirements
- Incremental clustering
- Single pass algorithm
Proposed Approach
- Use news headlines instead of full articles
- Similarity measure based clustering
Solution
- Stopword Removal
- Word Stemming
- Bag-of-words representation
- Distance computation
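The steps above can be sketched in Python as follows. This is a minimal illustration, not the system's actual code: the stopword list, the crude `stem` helper (standing in for a real stemmer such as Porter), the choice of cosine similarity as the distance measure, and the `threshold` value are all assumptions made for the example.

```python
import math
import re
from collections import Counter

# Tiny illustrative stopword list (assumption); a real system would use a fuller one.
STOPWORDS = {"a", "an", "the", "in", "on", "of", "to", "for", "and", "is", "as", "at", "by"}


def stem(word):
    """Crude suffix stripping, standing in for a real stemmer such as Porter."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def bag_of_words(headline):
    """Lowercase, tokenize, drop stopwords, stem, and count terms."""
    tokens = re.findall(r"[a-z0-9]+", headline.lower())
    return Counter(stem(t) for t in tokens if t not in STOPWORDS)


def cosine_similarity(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def cluster_headlines(headlines, threshold=0.3):
    """Single-pass, incremental clustering: each headline is seen exactly once."""
    clusters = []  # each cluster: {"centroid": Counter, "headlines": [str, ...]}
    for headline in headlines:
        vec = bag_of_words(headline)
        best, best_sim = None, 0.0
        for cluster in clusters:
            sim = cosine_similarity(vec, cluster["centroid"])
            if sim > best_sim:
                best, best_sim = cluster, sim
        if best is not None and best_sim >= threshold:
            best["headlines"].append(headline)
            best["centroid"].update(vec)  # fold the new headline into the cluster
        else:
            clusters.append({"centroid": Counter(vec), "headlines": [headline]})
    return clusters
```

Because a headline is only compared against existing cluster centroids, new headlines can be added as they arrive without re-clustering earlier ones.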
Results
- Works best when each cluster holds a moderate number of headlines
- Limitations
- Noisy headlines creep into a cluster when an event gets heavy news coverage
- i.e., many headlines within a short time span
Results
Ranking Algorithms
- Comment Ranking
- Cluster Ranking
- Article Ranking
Comment Ranking
Flawed Ranking Systems
Current Approach - Wilson Score
The score used is the lower bound of the Wilson score confidence interval for a Bernoulli parameter.
With 85% confidence, the real fraction of positive ratings is at least this value.
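A minimal Python sketch of this score, using the standard Wilson interval formula. The function name `wilson_lower_bound` is hypothetical, and reading the 85% level as a two-sided interval is an assumption; a one-sided reading would only change the value of `z`.

```python
import math
from statistics import NormalDist


def wilson_lower_bound(positive, total, confidence=0.85):
    """Lower bound of the Wilson score interval for the fraction of positive ratings."""
    if total == 0:
        return 0.0  # no ratings yet: rank at the bottom
    # z-value for a two-sided interval at the given confidence (assumption).
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    p = positive / total
    centre = p + z * z / (2 * total)
    margin = z * math.sqrt((p * (1 - p) + z * z / (4 * total)) / total)
    return (centre - margin) / (1 + z * z / total)


# Usage: sort comments, given as (upvotes, total votes), best first.
comments = [(1, 1), (90, 100), (3, 10)]
ranked = sorted(comments, key=lambda c: wilson_lower_bound(*c), reverse=True)
```

A comment with 90 positive ratings out of 100 outranks one with a single positive rating, even though the latter has a higher raw fraction, because the lower bound penalises small sample sizes.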
Wilson Score - Advantages
- With 85% confidence, the real fraction of positive ratings is at least this value.
- Quality comments make it to the top regardless of when they were posted.
- Auto feedback system
- Applicable to comments from different platforms
Results
Cluster Ranking
[Figure: Clusters 1, 2 and 3, with attributes A1, A2, A3 and A4]
Linear Regression
- Attribute Ai is normalized to value ai, and wi is the associated weight.
- Training data set -
1. Clusters older than 10 days are assumed to have no further comment activity.
2. The final comment count on these clusters is used as the popularity measure, and thus as the target value for the score.
3. The comments made up to the time of cluster generation form A4.
- Typical values of weights- [w1,w2,w3,w4]
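A hedged Python sketch of how such a model could be scored and trained. It assumes the cluster score is the weighted sum of the normalized attributes (score = w1*a1 + ... + w4*a4), that normalization is min-max scaling, and that the weights are fit by least squares with plain gradient descent and no intercept. None of these details are confirmed by the slides, and all function names are hypothetical.

```python
def normalize(column):
    """Min-max scale raw attribute values A_i to a_i in [0, 1] (assumed scheme)."""
    lo, hi = min(column), max(column)
    if hi == lo:
        return [0.0] * len(column)
    return [(v - lo) / (hi - lo) for v in column]


def score(attributes, weights):
    """Cluster score as the weighted sum of normalized attributes."""
    return sum(w * a for w, a in zip(weights, attributes))


def fit_weights(rows, targets, lr=0.1, epochs=5000):
    """Least-squares fit of the weights by batch gradient descent.

    rows    -- one list of normalized attributes per training cluster
    targets -- final comment count (popularity) of each training cluster
    """
    weights = [0.0] * len(rows[0])
    n = len(rows)
    for _ in range(epochs):
        grad = [0.0] * len(weights)
        for attributes, target in zip(rows, targets):
            err = score(attributes, weights) - target
            for j, a in enumerate(attributes):
                grad[j] += 2 * err * a / n
        weights = [w - lr * g for w, g in zip(weights, grad)]
    return weights
```

Once the weights are fit on clusters older than 10 days, `score` can be applied to fresh clusters to predict their eventual popularity.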
Sectioning
Sections based on-
1. Latest time
2. Number of news sources
Within each section, clusters are sorted by number of headlines and number of comments.
Section thresholds were chosen by trial and error.
Example -
1. Updated in last 3 hours, news sources reporting >= 5
2. Updated in last 7 hours, news sources reporting >= 5
3. Updated in last 3 hours, news sources reporting >= 3
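The sectioning rule can be sketched in Python using the example thresholds above. The dictionary keys, the catch-all final section, and the tie-breaking order (headlines first, then comments, both descending) are assumptions made for illustration.

```python
# (max hours since last update, min number of news sources), in priority order.
SECTIONS = [(3, 5), (7, 5), (3, 3)]


def section_of(hours_since_update, n_sources):
    """Index of the first section whose thresholds the cluster satisfies."""
    for index, (max_hours, min_sources) in enumerate(SECTIONS):
        if hours_since_update <= max_hours and n_sources >= min_sources:
            return index
    return len(SECTIONS)  # catch-all section for everything else (assumption)


def rank_clusters(clusters):
    """Order clusters by section, then by headlines and comments (descending)."""
    return sorted(
        clusters,
        key=lambda c: (
            section_of(c["hours"], c["sources"]),
            -c["headlines"],
            -c["comments"],
        ),
    )
```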
Novelty
Cluster ranking has not previously been done using comments.
More suitable for cluster ranking than comment ranking.
Both the number of news sources and the number of headlines covering a cluster are used as factors.
Article Ranking
Articles in a cluster
Factors used
1. Time at which article was written or last updated.
2. Normalized number of comments under article.
3. Time at which last comment was made.
Normalization is done per -
News source + Category
Article Summarization
Overview
- Our system generates a summary for each news cluster.
- Generating a summary (abstraction) vs. extracting a summary (extraction).
- Problem: identify the top-k sentences that summarize a news cluster or event.
- Multiple-source vs. single-source summary generation.
Algorithm
- Stemming and stop words removal
- Extracting feature vector for each sentence
- Generating a complete graph
- Scoring each sentence based on distance from all other sentences
- Correction factor due to other headlines
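A rough Python sketch of the extraction steps listed above. It assumes bag-of-words feature vectors and cosine similarity as the edge weight of the complete graph, so a sentence scores highly when it is close to all other sentences. Stemming and stopword removal are left out for brevity, and the headline-based correction factor is omitted because its form is not specified here. All names are hypothetical.

```python
import math
import re
from collections import Counter


def sentence_vector(sentence):
    """Bag-of-words feature vector for one sentence."""
    return Counter(re.findall(r"[a-z0-9]+", sentence.lower()))


def cosine_similarity(a, b):
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k_sentences(sentences, k=2):
    """Score each sentence against every other one and keep the k best."""
    vectors = [sentence_vector(s) for s in sentences]
    scores = []
    for i, vi in enumerate(vectors):
        # Sum of edge weights from sentence i to all other nodes of the complete graph.
        scores.append(
            sum(cosine_similarity(vi, vj) for j, vj in enumerate(vectors) if j != i)
        )
    best = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)[:k]
    return [sentences[i] for i in sorted(best)]  # keep original sentence order
```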
Representative Tags Generation
Background
- Form clusters out of news articles
- Generate Representative Tags
Solution
- Use Part of Speech Tagger
- Bigram Tagger
- Unigram Tagger
- Backoff Tagger (Custom Context Free Grammar)
- Tokenize Headlines
- Choose everything that is,
- Proper Noun
- Proper Noun + Common Noun
- Noun + Verb
- Count occurrences of the tags and choose the 3 most frequent tags for the headlines
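The selection and counting steps can be sketched in Python as below. The sketch starts from headlines that are already tokenized and part-of-speech tagged, so the tagger chain itself is not shown. It assumes Penn Treebank tag names (`NNP`, `NN`, `VB...`) and that the two-word patterns apply to adjacent tokens; both are assumptions, and the function names are hypothetical.

```python
from collections import Counter

PROPER = ("NNP", "NNPS")  # proper nouns
COMMON = ("NN", "NNS")    # common nouns


def candidate_tags(tagged_headline):
    """Collect candidate tags from one headline given as (word, POS tag) pairs."""
    candidates = []
    for i, (word, pos) in enumerate(tagged_headline):
        if pos in PROPER:
            candidates.append(word)  # proper noun
        if i + 1 < len(tagged_headline):
            next_word, next_pos = tagged_headline[i + 1]
            if pos in PROPER and next_pos in COMMON:
                candidates.append(f"{word} {next_word}")  # proper noun + common noun
            elif pos.startswith("NN") and next_pos.startswith("VB"):
                candidates.append(f"{word} {next_word}")  # noun + verb
    return candidates


def representative_tags(tagged_headlines, top_n=3):
    """Count candidates across a cluster's headlines and keep the most frequent."""
    counts = Counter()
    for headline in tagged_headlines:
        counts.update(tag.lower() for tag in candidate_tags(headline))
    return [tag for tag, _ in counts.most_common(top_n)]
```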
Results & Uses
- Representative tags give an idea of what a cluster represents
- Generated tags are also used to build the trends graph
- The trends graph shows the media coverage of a particular term over a given period
- How well an upcoming movie is being covered
- Tracking product release and people's opinions
Tags shown for a news cluster
Trends graph for tag 'Panama'
List of articles with same tag
Conclusion and Novelty
- Proposed a novel system for cross-platform news exploration which aggregates news from various web sources.
- Proposed a novel news cluster ranking algorithm based on popularity prediction using comments.
- Proposed a novel news comment ranking algorithm which uses the Wilson score.
By Rajat Jain