Predicting the Future With

Social Media

by

Sitaram Asur, Bernardo A. Huberman

2010 IEEE/WIC/ACM International Conference

Web Intelligence and Intelligent Agent Technology (WI-IAT)

 

Group 5

Why Social Media?

Ubiquitous

Why Social Media?

  • Social Networking has attracted everyone
  • Large Content Sharing Medium
  • This data source is largely untapped

Social Media Examples

Social Media has EVERYTHING

  • Environment
  • Politics
  • Technology
  • Sports
  • Entertainment Industry (Movies)

Source and Spread of Content

  • People create own content
  • Share others' content
  • Either way, the content spreads among their friends
  • This widespread content (opinions and views) can be an important source of information for predictions

How effective is social media for predicting the future?

  • Can we correlate tweet rate with the success of a movie at the box office?
  • Based on previous movies and the Twitter chatter about them, can we create a reliable classifier?
  • Can we make better predictions by coupling it with sentiment analysis?

Related Work

  • Prior work studied sales spikes based on online chatter
    • Found that the outcome of carefully constructed queries can predict market trends
  • Other work on movie success prediction requires metadata such as the movie's genre, MPAA rating, running time and release date
  • Some works tried to correlate sentiments with box-office scores

Twitter

  • Extremely Popular
  • Microblogging Service
  • Content is in the form of Tweets
  • A tweet is a short message (up to 140 characters)
  • A tweet can consist of
    • Text
    • Links to images, videos and articles

Twitter

  • A retweet is a post originally made by one user that is forwarded by another user
  • Twitter is a potential market for viral marketing.
  • Due to its huge reach, a number of businesses use Twitter to advertise products and disseminate information to stakeholders

Dataset

  • 2.89 Million Tweets
  • 1.2 Million Users
  • 24 Different Movies
  • Collected using Twitter Search API
  • Contains Text, Timestamp and User Info

Dataset

  • Movies are released mostly on Fridays and rarely on Wednesdays
  • Average of 2 movie releases per week
  • Data collected over 3 months (24 movies)

Data Consistency

  • Data Consistency:
    • Only movies released on Fridays are considered
    • Only movies released in a large number of theatres are considered

Movie List

  • Sherlock Holmes
  • Avatar
  • Daybreakers
  • Legion
  • Leap Year
  • Twilight: New Moon
  • Spy Next Door
  • When In Rome

Movie Resolution

  • Example: the movie titled "2012"
  • It is hard to tell whether "2012" in a tweet refers to the movie title or the year
  • So sanity checks were performed to remove movies with such ambiguous titles

Critical Period

  • From the week before a movie is released
    • when the promotional campaigns are in full swing
  • to two weeks after release
    • when its initial popularity fades and people's opinions have been disseminated
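
The critical period above can be expressed as a simple date window. This is a minimal sketch, not the authors' code: the function names and the example release date are hypothetical, and treating both endpoints as inclusive is an assumption.

```python
from datetime import date, timedelta


def critical_period(release_date):
    """Return (start, end): one week before release to two weeks after."""
    return release_date - timedelta(weeks=1), release_date + timedelta(weeks=2)


def in_critical_period(tweet_date, release_date):
    """True if the tweet falls inside the window (endpoints inclusive, by assumption)."""
    start, end = critical_period(release_date)
    return start <= tweet_date <= end


# Hypothetical release date, chosen only for illustration.
release = date(2010, 1, 8)
start, end = critical_period(release)
```

Tweets outside this window would simply be dropped before any rates are computed.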

Time-series of tweets over the critical period for different movies

Number of tweets per unique authors for different movies

Log distribution of authors and tweets

Linear Regression

  • Linear Fitting
  • Relating one known variable to an unknown variable
  • R-squared (coefficient of determination)
  • p-value
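
To make the terms above concrete, here is a minimal pure-Python sketch of a least-squares line fit and its R-squared value. The function names and the sample numbers are made up for illustration and are not data from the study. The p-value is not computed here; in practice a statistics library (for example `scipy.stats.linregress`) would report it.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def r_squared(xs, ys, slope, intercept):
    """Coefficient of determination: 1 - SS_residual / SS_total."""
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1.0 - ss_res / ss_tot


# Made-up illustrative values, not data from the study.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
slope, intercept = fit_line(x, y)
r2 = r_squared(x, y, slope, intercept)
```

An R-squared close to 1 means the fitted line explains most of the variance in the unknown variable.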

Attention and Popularity

  • Pre-release attention
    • Includes promos, trailers, pictures
    • Most tweets are expected to contain URLs
    • Retweets are expected to follow the same pattern
  • Post release chatter

Attention and Popularity

  • Pre-release attention
  • But is there a correlation between number of tweets with URLs and movie success?

Attention and Popularity

Some positive correlation, but the coefficient of determination is low

Prediction of Box-office revenues

  • Using the tweets referring to movies prior to their release, can we accurately predict the box-office revenue generated by the movie in its opening weekend?

Tweet rate and box office gross

  • The average tweet rate and the box-office gross for the 24 movies considered showed a strong positive correlation, with a correlation coefficient of 0.90

  • Transylmania, with 2.75 tweets per hour, grossed only $263K

  • Twilight and Avatar, each with more than 1k tweets per hour, grossed $142M and $72M respectively
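
The relationship above can be illustrated with a short sketch that computes tweets per hour and the Pearson correlation coefficient. It assumes the tweet rate is the tweet count divided by the length of the observation window in hours; the function names and all numbers below are made up and are not the study's data.

```python
import math


def tweet_rate(num_tweets, hours):
    """Average number of tweets per hour over the observation window."""
    return num_tweets / hours


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Made-up tweet counts over a one-week (168-hour) window and made-up
# opening grosses in millions of dollars; not data from the study.
rates = [tweet_rate(n, 168) for n in (500, 20_000, 90_000, 200_000)]
gross = [0.5, 12.0, 40.0, 95.0]
r = pearson(rates, gross)
```

A value of `r` near 1 corresponds to the kind of strong positive correlation reported on this slide.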

Comparison with HSX

  • The Hollywood Stock Exchange (HSX) is a virtual stock market for movies

  • Players can buy "shares" in movies, actors, directors, etc.

  • Stock prices are adjusted based on the gross income

  • Earlier studies have shown a correlation between the HSX index and a movie's success at the box office

Comparison with HSX

Predicting HSX index

Predicting for any week

Sentiment Analysis

  • Tweet sentiments used to forecast box-office values
    • Positive
    • Negative
    • Neutral
  • LingPipe Sentiment Analysis Classifier

Supervised Learning

  • How do we obtain class labels for training?
    • Amazon Mechanical Turk (manual classification)
    • Thousands of workers were employed to manually classify a large random sample of tweets
    • Only tweets with a unanimous classification were used to train the model
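
The unanimity filter described above could look like the following sketch. The data structure, the function name and the example tweets are all hypothetical; the slide only says that tweets with unanimous labels were kept for training.

```python
def unanimous_examples(labels_by_tweet):
    """Keep only tweets whose annotators all chose the same label.

    labels_by_tweet maps tweet text to the list of labels it received.
    Returns a list of (tweet_text, label) pairs usable as training data.
    """
    training = []
    for text, labels in labels_by_tweet.items():
        if labels and len(set(labels)) == 1:
            training.append((text, labels[0]))
    return training


# Invented tweets and labels, for illustration only.
annotations = {
    "loved it, go see it": ["positive", "positive", "positive"],
    "it was fine i guess": ["neutral", "positive", "neutral"],
    "total waste of money": ["negative", "negative", "negative"],
}
train = unanimous_examples(annotations)
```

Dropping the disputed tweets trades training-set size for cleaner labels.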

Preprocessing

  • Elimination of stop-words
  • Elimination of all special characters, except exclamation marks (replaced by <EX>) and question marks (replaced by <QM>)
  • Removal of URLs and user-ids
  • Replacing the movie title with <MOV>
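
The preprocessing steps above can be sketched in a few lines of Python. This is an illustrative approximation, not the pipeline used in the study: the regular expressions, the tiny stop-word list, the order of the steps and the function name are all assumptions.

```python
import re

# Tiny illustrative stop-word list; a real pipeline would use a fuller one.
STOP_WORDS = {"a", "an", "the", "is", "it", "was", "to", "of", "and"}
MARKERS = {"<MOV>", "<EX>", "<QM>"}


def preprocess(tweet, movie_title):
    """Return the list of cleaned tokens for one tweet."""
    text = tweet.lower()
    text = re.sub(r"https?://\S+", " ", text)  # remove URLs
    text = re.sub(r"@\w+", " ", text)          # remove user-ids
    text = text.replace(movie_title.lower(), " <MOV> ")
    text = text.replace("!", " <EX> ").replace("?", " <QM> ")
    tokens = []
    for tok in text.split():
        if tok in MARKERS:
            tokens.append(tok)
            continue
        tok = re.sub(r"[^a-z0-9]", "", tok)    # drop other special characters
        if tok and tok not in STOP_WORDS:
            tokens.append(tok)
    return tokens


# Invented tweet, for illustration only.
tokens = preprocess("@bob Avatar was AMAZING!!! Is it the best? http://t.co/x", "Avatar")
```

URLs are removed before punctuation is replaced so that a "?" inside a link is not turned into a <QM> marker.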

Movie Subjectivity

  • Sentiments carry more value after the movie is released
  • Positive sentiments act as recommendations by people
  • To capture this subjectivity we define the ratio of positive and negative tweets to neutral tweets

Movie Subjectivity

Movie Polarity

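  • A movie with more positive than negative tweets is likely to be successful, so we define the PN ratio as the number of positive tweets divided by the number of negative tweets
  • The Blind Side: PN ratio rose from 5.02 to 9.65, and gross rose from 34M to 40.1M
  • New Moon: PN ratio fell from 6.29 to 5, and gross fell from 142M to 42M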
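
The two sentiment measures can be computed from classifier output as in the sketch below. It follows the definitions in the Asur and Huberman paper (subjectivity as positive plus negative tweets over neutral tweets, PN ratio as positive over negative tweets). The function name, the label strings and the counts are hypothetical, and reporting a zero denominator as infinity is an assumption of this sketch.

```python
from collections import Counter


def sentiment_ratios(labels):
    """Return (subjectivity, pn_ratio) for a list of sentiment labels.

    subjectivity = (positive + negative) / neutral
    pn_ratio     = positive / negative
    """
    counts = Counter(labels)
    pos, neg, neu = counts["positive"], counts["negative"], counts["neutral"]
    # A zero denominator is reported as infinity in this sketch.
    subjectivity = (pos + neg) / neu if neu else float("inf")
    pn_ratio = pos / neg if neg else float("inf")
    return subjectivity, pn_ratio


# Invented label counts for illustration; not results from the study.
labels = ["positive"] * 60 + ["negative"] * 12 + ["neutral"] * 48
subjectivity, pn_ratio = sentiment_ratios(labels)
```

Computing the ratios separately for the pre-release and post-release weeks gives the kind of before/after comparison shown for The Blind Side and New Moon.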

Movie Polarity

Results of Regression Experiments

Conclusion

  • Social media can be utilized to forecast future outcomes.
  • Constructed a linear regression model to predict box-office revenues of movies in advance of their release.
  • Analyzed the sentiments present in tweets and demonstrated their efficacy.
  • This method can be extended to a wide range of topics, e.g., future product ratings and election outcomes.

THANK YOU

 

Questions?
