Rumor has it: Identifying Misinformation in Microblogs

Vahed Qazvinian
Emily Rosengren
Dragomir R. Radev
Qiaozhu Mei


University of Michigan
Ann Arbor, MI


{vahed,emirose,radev,qmei}@umich.edu

Abstract

  • Rumor
    • A statement whose truth value is unverifiable.

Rumor

  • Misinformation
    • False information
  • Disinformation
    • Deliberately false information

Feature

  • Content-Based
  • Network-Based
  • Identifying Rumors
    • From microblog-specific memes.
  • Identifying Disinformers

Introduction

  • Ambiguous Context Rumor
    • Ex: Office renovation in a company.
  • Potentially Threatening Rumor
    • Ex: Underarm deodorants cause breast cancer.

Definition

  • Rumor is defined as a statement whose truth-value is unverifiable or deliberately false.

Work

  • Retrieving a complete set of tweets that discuss a specific rumor.
  • Retrieving online microblogs that are rumor-related.
  • Identifying tweets in which the rumor is endorsed.

Related Work

  • Analyzing rumors
  • Mining Microblogs
  • Sentiment Analysis
  • Subjectivity Detection

Rumor Identification & Analysis

  • How rumors are manifested and spread.

Leskovec et al., 2009

  • Using the evolution of quotes reproduced online to identify memes and track their spread over time.

Ratkiewicz et al., 2010

  • They created the "Truthy" system.
  • Identifying misleading political memes on Twitter.
  • Using hashtags, links, and mentions.

Ennals et al., 2010

  • Focus on highlighting disputed claims on the Internet
  • Using pattern matching techniques.

Mendoza et al., 2010

  • Analyzing the 2010 earthquake in Chile.
  • The behavior of Twitter users under the emergency.
  • The patterns of propagation of rumors differ from those of news.

Sentiment Analysis

  • The automated detection of rumors is similar to traditional NLP sentiment analysis tasks.

Pang et al., 2002

  • Using machine learning techniques.
  • To identify positive and negative movie reviews.

Hassan et al., 2010

  • Using a supervised Markov model, part of speech, and dependency patterns.
  • To identify attitudinal polarities in threads posted to Usenet discussion groups.

Godbole et al., 2007

  • Using algorithmically generated lexicons of positive and negative words.
  • To assign sentiment scores to news stories and blog posts.

Pang and Lee, 2008

  • Sentiment analysis
  • Opinion mining

Rumor Classification

  • Closely related to opinion mining and sentiment analysis.
  • But concerned with whether the statement is controversial.

Mining Twitter Data

  • Twitter API

Disadvantage

  • Posts are limited to 140 characters.
  • Containing information in an unusually compressed form.
  • The grammar used may be unconventional.

Problem Definition

  • Rumor Retrieval
  • Belief Classification

Retrieval Task

  • Non-Rumor
    • "As Obama bow to Muslim leaders Americans are less safe not only at home but also overseas. ..."
  • Rumor
    • "RT @johnnyA99 Ann Coulter Tells Larry King Why People Think Obama Ia A Muslim ..."

Belief Classification

  • Confirm
    • "RT @moronwatch: Obama's a Muslim. Or if he's not, he sure looks like one #whyimvotingrepublican."
  • Deny
    • "Barack Obama is a Christian man who had a Christian wedding with 2 kids baptised in Jesus name. Tea Party clowns call that muslim #p2 #gop"
  • Doubtful
    • "President Barack Obama’s Religion: Christian, Muslim, or Agnostic? - The News of Today (Google): Share With Friend... http://bit.ly/bk42ZQ"

Data

  • Tweets that are written about a rumor.
  • Using Twitter search API.
  • Matching a given regular expression.
  • Collecting matching tweets once per hour.

Annotation

  • Two annotators
  • "1" if it is about a rumor.
    • "Sarah and Todd Palin to divorce, according to local Alaska paper. http://ow.ly/iNxF"
  • "0" otherwise.
    • "McCain Divorces Palin over her ‘untruths and out right lies’ in the book written for her. McCain’s team says Palin is a petty liar and phony"

Annotation

  • "11" if the tweet poster endorses the rumor.
    • "Todd and Sarah Palin to divorce"
  • "12" if the user refutes the rumor.
    • "Sarah Palin Divorce Rumor Debunked on Facebook"

Datasets

  • More than 10,400 tweets.
  • 35% are not rumor-related.
  • 43% of the posters believe the rumor.

Inter-Judge Agreement

  • Annotated 500 instances twice.
  • Calculate the Kappa coefficient.
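
The Kappa computation over the doubly annotated instances can be sketched as plain Cohen's kappa (the function name and toy labels below are illustrative, not from the paper):

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labeling the same instances."""
    assert len(labels_a) == len(labels_b)
    n = len(labels_a)
    # Observed agreement: fraction of instances both annotators label the same.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected chance agreement, from each annotator's label distribution.
    dist_a, dist_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(dist_a[k] * dist_b.get(k, 0) for k in dist_a) / (n * n)
    return (p_o - p_e) / (1 - p_e)

# Toy run; the paper's 500 doubly annotated instances would be passed the same way.
print(cohens_kappa(["1", "1", "0", "1"], ["1", "0", "0", "1"]))  # → 0.5
```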

Approach

  • Whether it is a rumor-related statement.
  • Whether the user believes the rumor.

Classifiers

  • Building different Bayes classifiers.
  • Calculate the likelihood ratio for a given tweet \(t\).

Classifiers

  • To avoid dealing with very small numbers.
  • Using the log likelihood.
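
A minimal sketch of such a likelihood-ratio classifier, assuming unigram language models \(\theta^+\) and \(\theta^-\) with add-one smoothing (the helper names, smoothing choice, and toy tweets are assumptions, not the paper's exact setup):

```python
import math
from collections import Counter

def train_lm(tweets):
    """Unigram counts and total for one class of tweets (whitespace tokenized)."""
    counts = Counter(w for t in tweets for w in t.split())
    return counts, sum(counts.values())

def log_likelihood_ratio(tweet, pos_lm, neg_lm, vocab_size):
    """log P(t | theta+) - log P(t | theta-), with add-one smoothing."""
    (pos_counts, pos_total), (neg_counts, neg_total) = pos_lm, neg_lm
    score = 0.0
    for w in tweet.split():
        p_pos = (pos_counts[w] + 1) / (pos_total + vocab_size)
        p_neg = (neg_counts[w] + 1) / (neg_total + vocab_size)
        score += math.log(p_pos) - math.log(p_neg)
    return score  # > 0: the tweet is more likely under theta+

pos = train_lm(["obama is a muslim", "is obama muslim"])
neg = train_lm(["obama speech today", "great speech"])
vocab = len(set("obama is a muslim speech today great".split()))
print(log_likelihood_ratio("obama is muslim", pos, neg, vocab) > 0)  # True
```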

Content-Based Features

  • Lexical patterns
    • Tokenized on whitespace.
  • Part-of-speech patterns
    • Treating each hashtag as a word labeled "TAG/"
    • URLs labeled as "URL/"
  • Unigrams and bigrams of each representation.
    • From each tweet, 2 × 2 = 4 feature sets are extracted.
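
The four feature sets can be sketched as n-gram extraction over the two representations (the function names and toy tweet are illustrative):

```python
def ngrams(tokens, n):
    """All contiguous n-grams from a token sequence, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def content_features(words, pos_tags):
    """TXT1/TXT2 from the word sequence, POS1/POS2 from the tag sequence."""
    return {
        "TXT1": ngrams(words, 1),
        "TXT2": ngrams(words, 2),
        "POS1": ngrams(pos_tags, 1),
        "POS2": ngrams(pos_tags, 2),
    }

# Hashtags are kept as words and tagged "TAG/", URLs tagged "URL/".
words = ["obama", "is", "muslim", "#p2", "http://bit.ly/x"]
tags = ["NNP", "VBZ", "JJ", "TAG/", "URL/"]
feats = content_features(words, tags)
print(feats["TXT2"][0])  # → ('obama', 'is')
```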

Content-Based Feature

  • Tweet \(t\) of length \(n\).
  • Lexically as \((w_1w_2...w_n)\)
  • POS tags as \((p_1p_2...p_n)\)

Unigram-Lexical Features (TXT1)

Bigram-Based Lexical Features (TXT2)

Content-Based Feature

  • Unigram-Lexical Features (TXT1)
  • Bigram-Based Lexical Features (TXT2)
  • Unigram POS Features (POS1)
  • Bigram POS Features (POS2)

Network-Based Feature

  • Focus on user behavior on Twitter.
  • User \(u_i\) re-tweets a message \(t\) from the user \(u_j\)
    • (\(u_i\): "RT @\(u_j\) t")
  • \(t\) is more likely to be a rumor if 
    • \(u_j\) has posted or re-tweeted rumors.
    • \(u_i \) has posted or re-tweeted rumors.

User Model

  • \(\theta^+\): Users who have interacted in a positive instance.
    • First feature (USR1): the log-likelihood ratio that \(u_i\) is generated by \(\theta^+\) rather than \(\theta^-\).
  • \(\theta^-\): Users who have interacted in a negative instance.
    • Second feature (USR2): the same log-likelihood ratio for \(u_j\).

Twitter Specific Memes

  • Hashtags
  • URLs

Hashtags

  • Whether hashtags used in rumor-related tweets are different from other tweets.
  • Whether people who believe and spread rumors use different hashtags from people who do not.

Hashtags Feature

  • For a given tweet \(t\)
  • A set of \(m\) hashtags \((\#h_1\#h_2...\#h_m)\)
  • Hashtag feature (TAG)
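
Extracting the hashtag sequence \((\#h_1\#h_2...\#h_m)\) for the TAG feature can be sketched with a simple pattern (the regex is an assumption about tokenization, not the paper's):

```python
import re

def hashtags(tweet):
    """The ordered list of hashtags #h1...#hm used by the TAG feature."""
    return re.findall(r"#\w+", tweet)

print(hashtags("Obama's a Muslim #whyimvotingrepublican #p2"))
# → ['#whyimvotingrepublican', '#p2']
```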

URLs

  • Refer to external sources.
  • Overcome the length limit of tweets.

URLs

  • If a tweet is a positive instance
    • Its linked content will be similar to the content of URLs shared by other positive tweets.
  • If a tweet is a negative instance
    • Its linked content should be more similar to the web pages shared by other negative instances.

Models

  • Build the \(\theta^+\) and \(\theta^-\) for unigrams and bigrams.
  • Calculate the log-likelihood ratio
    • for unigrams (URL1)
    • and bigrams (URL2)

Feature Summary

  • To build these language models
    • Use the CMU Language Modeling toolkit.

Experiments

  • 2 sets of experiments
    • IR framework for rumor retrieval.
    • Detect users' beliefs in rumors.

Rumor Retrieval

  • 5-fold cross-validation
  • Single query \((Q)\)
  • The set of relevant documents \(\{d_1, ... , d_m\}\)
  • \(R_k\) is the set of ranked retrieval results from the top result down to the \(k^{th}\) relevant document, \(d_k\).
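
With \(R_k\) defined this way, per-query average precision can be sketched as follows (a hypothetical helper, not Lemur's implementation):

```python
def average_precision(ranked, relevant):
    """AP for one query: mean precision at each rank where a relevant doc appears."""
    relevant = set(relevant)
    hits, precisions = 0, []
    for k, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            precisions.append(hits / k)  # precision of the prefix ending at d_k
    return sum(precisions) / len(relevant) if relevant else 0.0

# Relevant docs d1 and d3 retrieved at ranks 1 and 3: AP = (1 + 2/3) / 2 = 5/6.
print(average_precision(["d1", "d2", "d3"], {"d1", "d3"}))
```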

Baselines

  • Random method (Random)
    • Ranked by random number.
  • Uniform method (Uniform)
    • Ranked by the majority vote from the training set.
  • Regexp method (regexp)
    • The regexp that was submitted to Twitter.

KL Divergence

  • Using the Lemur Toolkit to employ a KL divergence retrieval model with Dirichlet smoothing (KL).
  • Query Language Model \(\theta_Q\)
  • Document Language Model \(\theta_D\)
  • Documents are ranked by \(D(\theta_Q||\theta_D)\)

Using Bayesian smoothing with Dirichlet priors

KL Divergence

  • Default parameter value in Lemur \((\mu = 2000)\)
  • Tuned based on the data \((\mu = 10)\)
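
The paper uses the Lemur Toolkit; purely as an illustration, a Dirichlet-smoothed document model and a score rank-equivalent to \(-D(\theta_Q||\theta_D)\) can be sketched like this (the names, OOV floor, and toy collection are assumptions):

```python
import math
from collections import Counter

def dirichlet_lm(doc_tokens, collection_probs, mu=2000):
    """p(w | theta_D): document model with Bayesian (Dirichlet-prior) smoothing."""
    counts, dlen = Counter(doc_tokens), len(doc_tokens)
    # Unseen words back off to the collection model (tiny floor for OOV words).
    return lambda w: (counts[w] + mu * collection_probs.get(w, 1e-9)) / (dlen + mu)

def kl_score(query_tokens, doc_lm):
    """Rank-equivalent to -D(theta_Q || theta_D): negative cross entropy of the
    maximum-likelihood query model against the smoothed document model."""
    q, qlen = Counter(query_tokens), len(query_tokens)
    return sum((c / qlen) * math.log(doc_lm(w)) for w, c in q.items())

collection = {"obama": 0.01, "muslim": 0.001, "palin": 0.01, "divorce": 0.001}
query = "obama muslim".split()
on_topic = dirichlet_lm("obama muslim obama rumor".split(), collection, mu=10)
off_topic = dirichlet_lm("palin divorce alaska paper".split(), collection, mu=10)
print(kl_score(query, on_topic) > kl_score(query, off_topic))  # True
```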

Feature Analysis

  • Content-Based (TXT1+TXT2+POS1+POS2)
  • Network-Based (USR1+USR2)
  • Twitter Specific Memes (TAG+URL1+URL2)

Domain Training Data

  • Extract 400 randomly selected tweets.
  • Gradually add the rest of the Obama tweets.
  • Accuracy grows quickly and reaches 80% with 2,000 training instances.

Belief Classification

  • 6774 tweets in total
    • 2971 belief
    • 3803 not

Conclusion

  • Propose a general framework to retrieve rumorous tweets that match a more general query.
  • Capturing tweets that show user endorsement.
  • A manually annotated dataset of 10,000 tweets.

Optimization

  • \(L_1\)-regularized log-linear model
  • A set of inputs \(x\)
  • \(\Phi: X \times Y \to \mathbb{R}^D\) maps each \((x, y)\) to a vector of feature values.
  • \(\theta \in \mathbb{R}^D\) assigns a real-valued weight to each feature.
  • Choose \(\theta\) to minimize the empirical loss (negative log-likelihood) plus a regularization term \(R(\theta)\).

\(\alpha\) is a parameter that controls the amount of regularization.
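
In the standard form of an \(L_1\)-regularized log-linear model, the objective the slide describes looks as follows (a reconstruction consistent with the slide's symbols, not copied from the paper):

```latex
% Conditional log-linear model over labels y given input x:
p(y \mid x; \theta) =
  \frac{\exp\big(\theta \cdot \Phi(x, y)\big)}
       {\sum_{y' \in Y} \exp\big(\theta \cdot \Phi(x, y')\big)}

% Choose theta to minimize negative log-likelihood plus the L1 penalty,
% where alpha controls the amount of regularization:
\hat{\theta} = \arg\min_{\theta}
  \; -\sum_{i} \log p(y_i \mid x_i; \theta)
  \; + \; \alpha \sum_{d=1}^{D} |\theta_d|
```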

Paper Presentation: Rumor has it

By Penut Chen (PenutChen)