From Oxen to Oscars®: A Machine Learning Journey

Sir Francis Galton, a cousin of Charles Darwin, pioneered the use of questionnaires and surveys to collect data. He also created the statistical concept of correlation and coined the term “regression.”

The History

At a livestock fair in 1906, Galton came upon a contest in which nearly 800 contestants guessed the slaughtered and dressed weight of an ox. No one guessed the exact weight of 1,198 pounds. However, when Galton looked at the individual guesses, he found that the middlemost (median) guess, 1,207 pounds, was less than 1 percent (about 0.8 percent) off the correct weight. When he calculated the average (mean) of all guesses, he came up with 1,197 pounds, just one pound short.

Given the venue, many of the contestants would have had extensive experience with oxen, and on close inspection of the ox’s features (height, girth, etc.) they would have been able to make an educated prediction of its weight. This “wisdom of crowds” is a powerful idea that remains relevant in today’s world of “big data.”
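The arithmetic behind the crowd's accuracy is easy to check. The short Python sketch below uses only the three weights quoted above; the function name percent_error is our own choice for the illustration.

```python
# Figures quoted in the text above (all in pounds).
TRUE_WEIGHT = 1198   # actual dressed weight of the ox
MEDIAN_GUESS = 1207  # middlemost guess
MEAN_GUESS = 1197    # average of all guesses


def percent_error(estimate, truth):
    """Absolute error expressed as a percentage of the true value."""
    return abs(estimate - truth) / truth * 100


median_error = percent_error(MEDIAN_GUESS, TRUE_WEIGHT)
mean_error = percent_error(MEAN_GUESS, TRUE_WEIGHT)

print(f"median guess: {median_error:.2f}% off")  # median guess: 0.75% off
print(f"mean guess:   {mean_error:.2f}% off")    # mean guess:   0.08% off
```

Both summaries of the crowd land within one percent of the true weight, and the mean misses by a single pound.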

The Theory

At Mithun, we build statistical models to predict outcomes important to our clients: foot traffic at retail locations, website conversions, and viable test markets for new marketing initiatives. When building these models, we often use machine-learning techniques that capitalize on the power of Galton’s discovery. These techniques take a given data set, make it larger and more diverse through resampling, and then build an ensemble (a crowd) of predictor algorithms whose outputs are aggregated into the final prediction.
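To make the resample-then-aggregate idea concrete, here is a minimal Python sketch of one such technique, bootstrap aggregation (bagging). It is an illustration under our own assumptions, not the models described in this article: the data are synthetic, the helper names fit_stump and bagged_predictor are invented for the example, and each member of the "crowd" is a deliberately simple one-split predictor.

```python
import random
import statistics


def fit_stump(xs, ys):
    """Fit a one-split regressor: predict the mean of y on each side of a threshold."""
    best = None
    # Skipping the smallest unique x guarantees both sides of the split are non-empty.
    for threshold in sorted(set(xs))[1:]:
        left = [y for x, y in zip(xs, ys) if x < threshold]
        right = [y for x, y in zip(xs, ys) if x >= threshold]
        left_mean = statistics.mean(left)
        right_mean = statistics.mean(right)
        sse = (sum((y - left_mean) ** 2 for y in left)
               + sum((y - right_mean) ** 2 for y in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    if best is None:  # every x identical: fall back to a constant prediction
        constant = statistics.mean(ys)
        return lambda x: constant
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x < threshold else right_mean


def bagged_predictor(xs, ys, n_models=50, seed=0):
    """Train each stump on a bootstrap resample, then average their predictions."""
    rng = random.Random(seed)
    n = len(xs)
    models = []
    for _ in range(n_models):
        idx = [rng.randrange(n) for _ in range(n)]  # sample rows with replacement
        models.append(fit_stump([xs[i] for i in idx], [ys[i] for i in idx]))
    return lambda x: statistics.mean([model(x) for model in models])


# Synthetic data (an assumption for illustration): a noisy upward trend.
noise = random.Random(1)
xs = list(range(1, 21))
ys = [2 * x + noise.uniform(-1, 1) for x in xs]

predict = bagged_predictor(xs, ys)
print(round(predict(2), 1), round(predict(19), 1))
```

Each stump on its own is a crude guesser, but because every stump sees a slightly different resample, their averaged prediction is smoother than any single one, which is the same effect Galton observed at the fair.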

The Test

To demonstrate the power of this approach, we built four distinct models to predict the winners of the four major categories of this year’s Academy Awards®: Best Picture, Best Director, Best Actor and Best Actress. The models were trained on data from films released from 1928 to 1999 and then predicted outcomes for films released from 2000 to 2014. We used the predictions for 2000 to 2013 to calibrate the models’ accuracy. For 2014, all four models agreed on the winners in all four categories.

The following are our predictions by category.

The Nominees

Best Picture: Birdman

Highest probability of winning – 87%

Lowest probability of winning – 51%

Las Vegas probability – 60%

Key Indicators:

• Winning best picture at Producers Guild Awards

• Winning Golden Globe® for best drama

• Total Oscar® nominations

• Number of days between release and Academy Awards® ceremony

Best Director: Alejandro González Iñárritu (Birdman)

Highest probability of winning – 99%

Lowest probability of winning – 64%

Las Vegas probability – 64%

Key Indicators:

• Winning best director at Directors Guild Awards

• Winning Golden Globe® best director

• Total Oscar® nominations


Best Actor: Eddie Redmayne (The Theory of Everything)

Highest probability of winning – 98%

Lowest probability of winning – 66%

Las Vegas probability – 80%

Key Indicators:

• Winning best actor (drama) at Golden Globes®

• Winning best actor Screen Actors Guild Awards®

• Age of actor in release year


Best Actress: Julianne Moore (Still Alice)

Highest probability of winning – 87%

Lowest probability of winning – 47%

Las Vegas probability – 98%

Key Indicators:

• Winning best actress (drama) at Golden Globes®

• Winning best actress Screen Actors Guild Awards®

• Age of actress in release year


The Science

To generate our predictions, we used these machine-learning ensembles: Random Forest, Boosted Trees, and MARSplines (multivariate adaptive regression splines). All three automatically optimize model complexity to minimize total error by leveraging Galton’s discovery.

They do this by building an ensemble of simpler models and then, using varying techniques, aggregating the predictions of those simpler models into a single best estimate for the ensemble. We also had our staff statistician build a logistic regression model for each year from 2000 to 2014, and we compared the overall accuracy of each model.

The Random Forest ensemble was the most accurate, followed closely by the other three models. All of them correctly predicted the winner 70% to 73% of the time over the 14 years from 2000 to 2013. The predictions have improved over time, so we feel confident about the predictions for 2014.
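For readers who want to try a similar evaluation, the Python sketch below shows one way to split records by release year and score a model's picks. It is a hedged illustration only: the year boundaries follow this article, but the function names split_by_year and accuracy, the record layout, and the sample pairs are all invented and are not our actual data or results.

```python
def split_by_year(records, train_end=1999, calibration_end=2013):
    """Partition records into training, calibration, and forecast sets by release year."""
    train = [r for r in records if r["year"] <= train_end]
    calibration = [r for r in records if train_end < r["year"] <= calibration_end]
    forecast = [r for r in records if r["year"] > calibration_end]
    return train, calibration, forecast


def accuracy(pairs):
    """Share of (predicted, actual) pairs in which the prediction was correct."""
    hits = sum(1 for predicted, actual in pairs if predicted == actual)
    return hits / len(pairs)


# Placeholder records: only the "year" field is used here, and the values are invented.
records = [{"year": y} for y in (1928, 1975, 1999, 2000, 2007, 2013, 2014)]
train, calibration, forecast = split_by_year(records)
print(len(train), len(calibration), len(forecast))  # 3 3 1

# Invented (predicted winner, actual winner) pairs for four hypothetical contests.
pairs = [("A", "A"), ("B", "C"), ("D", "D"), ("E", "E")]
print(accuracy(pairs))  # 0.75
```

Keeping the calibration years strictly later than the training years means each model is scored only on ceremonies it has never seen, which is what makes an accuracy figure meaningful as a forecast of future performance.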

The Predictions

This year’s Best Picture category could be a toss-up. Both Birdman and Boyhood appear to be strong contenders, with the models favoring Birdman slightly over Boyhood. Las Vegas seems to agree, projecting Birdman with a 60% chance of winning and Boyhood with 45%.

The Sunday Awards

The 87th Academy Awards® are on Sunday, February 22, 2015. We will be watching with extra interest this year. Let us know if you disagree with our predictions!

The Data Sources

Academy Awards Database: http://awardsdatabase.oscars.org/ampas_awards/BasicSearchInput.jsp

Internet Movie Database: http://www.imdb.com/

Roger Ebert Reviews: http://www.rogerebert.com/

Screen Actors Guild: http://www.sagaftra.org/

Producers Guild: http://www.producersguild.org/

Directors Guild: http://www.dga.org/

The References

Pardoe, I. and Simonton, D. K. (2008). Applying Discrete Choice Models to Predict Academy Award Winners. J. R. Statist. Soc. A, 171, Part 2, pp. 375-394.

http://en.wikipedia.org/wiki/Francis_Galton

http://en.wikipedia.org/wiki/Bias%E2%80%93variance_tradeoff

http://www.statsoft.com/Textbook/Multivariate-Adaptive-Regression-Splines


EverythingTalks.com