Fun with Machine Learning

Predicting the 2018 World Cup

Sorin Peste

Microsoft

An interdisciplinary field which unifies statistics, data analysis and their related methods in order to understand and analyze actual phenomena with data.

 

-- anon.

Data Science

The Data Science Process

You Need a Sharp Question™

Given a match between <Team1> and <Team2>, what is the expected goal differential at the end of the match?

http://aka.ms/PredictTheWorldCup

The R Language

  • Open Source, released 1993
  • Statistical features
    • linear modelling
    • statistical testing
    • time series analysis
    • classification
    • clustering
    • etc.
  • Graphical features
  • CRAN package repository
  • RStudio IDE

R for Data Science

http://r4ds.had.co.nz/

Data Wrangling with R

> install.packages("dplyr")
> install.packages("tidyr")

Data Visualization with R

> install.packages("ggplot2")

Statistics with R

> library("stats")
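As a small illustration of what the built-in stats package offers, the sketch below fits a linear model to made-up numbers. The data frame and its columns (`rank_diff`, `goal_diff`) are hypothetical and are not taken from the talk's dataset.

```r
# stats is attached by default; the explicit call mirrors the slide
library("stats")

# Hypothetical toy data: ranking gap between two teams and the
# goal differential that resulted
matches <- data.frame(
  rank_diff = c(-20, -10, -5, 0, 5, 10, 20),
  goal_diff = c(-2, -1, -1, 0, 1, 1, 2)
)

# Ordinary least squares: goal differential as a function of ranking gap
fit <- lm(goal_diff ~ rank_diff, data = matches)
coef(fit)

# Expected goal differential for a new (hypothetical) ranking gap
predict(fit, newdata = data.frame(rank_diff = 15))
```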

Your Data Needs Cleaning

Top Problems

https://www.kaggle.com/surveys/2017

Missing Data
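A minimal sketch of spotting and dropping missing values in base R. The `results` data frame is hypothetical, and dropping incomplete rows is only one possible treatment; the talk does not say which approach was used.

```r
# Hypothetical match records; NA marks a score that was never recorded
results <- data.frame(
  team1_goals = c(2, NA, 1, 0),
  team2_goals = c(1, 1, NA, 0)
)

# Count missing values per column
missing_per_column <- colSums(is.na(results))
missing_per_column

# One simple option: keep only rows where every field is present
complete_results <- results[complete.cases(results), ]
complete_results
```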

Incorrect Data

source: dilbert.com

Irrelevant Data

Outliers
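One common way to flag outliers is the 1.5 × IQR rule, sketched below on made-up goal differentials. This rule is an assumption chosen for illustration; the slide does not name a specific method.

```r
# Hypothetical goal differentials; 31 mimics a lopsided result or a typo
goal_diff <- c(-2, -1, 0, 0, 1, 1, 2, 3, 31)

# First and third quartiles, and the interquartile range between them
q <- quantile(goal_diff, c(0.25, 0.75))
iqr <- q[[2]] - q[[1]]

# Values more than 1.5 * IQR outside the quartiles are flagged
lower <- q[[1]] - 1.5 * iqr
upper <- q[[2]] + 1.5 * iqr
is_outlier <- goal_diff < lower | goal_diff > upper

goal_diff[is_outlier]
```

Whether a flagged value should be removed, capped, or kept depends on whether it is a data error or a real result.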

Lag Features

last10games_w_per = (number of wins in the past 10 games) / 10
last10games_d_per = (number of draws in the past 10 games) / 10
last10games_l_per = (number of losses in the past 10 games) / 10
last10games_gd_per = (goals scored - goals conceded in the past 10 games) / 10

Rolling aggregates for performance metrics
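The four features above can be sketched in base R as follows. The helper `lag_roll_mean()` and the `games` data frame are hypothetical names introduced for illustration. The window covers only the previous games and excludes the current one, so the feature never leaks the result it is meant to help predict.

```r
# Rolling mean over the previous `n` games, excluding the current game.
# Returns NA until a full window of `n` past games is available.
lag_roll_mean <- function(x, n = 10) {
  out <- rep(NA_real_, length(x))
  if (length(x) > n) {
    for (i in (n + 1):length(x)) {
      out[i] <- mean(x[(i - n):(i - 1)])
    }
  }
  out
}

# Hypothetical results for one team, in chronological order
games <- data.frame(
  goals_for     = c(2, 1, 0, 3, 1, 1, 2, 0, 4, 1, 2, 0),
  goals_against = c(1, 1, 2, 0, 1, 3, 0, 0, 1, 1, 2, 1)
)
gd <- games$goals_for - games$goals_against

# Win / draw / loss indicators averaged over the last 10 games
games$last10games_w_per  <- lag_roll_mean(as.numeric(gd > 0))
games$last10games_d_per  <- lag_roll_mean(as.numeric(gd == 0))
games$last10games_l_per  <- lag_roll_mean(as.numeric(gd < 0))
# Average goal differential over the last 10 games
games$last10games_gd_per <- lag_roll_mean(gd)

games
```

The first 10 rows are NA because there is not yet a full history to aggregate; how to treat those rows (drop them or use a shorter window) is a separate design choice.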

Choosing ML Models

Decision Trees

Random Forests

Training, Testing, Validation
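A minimal sketch of a random 60 / 20 / 20 split in base R. The proportions, the seed, and the `matches` data frame are assumptions made for illustration; the talk does not state how its data was partitioned.

```r
set.seed(42)  # fixed seed so the split is reproducible

# Hypothetical feature table with 100 matches
matches <- data.frame(
  match_id  = 1:100,
  goal_diff = sample(-3:3, 100, replace = TRUE)
)

# Shuffle the row indices, then cut them into three disjoint groups
idx <- sample(nrow(matches))
train      <- matches[idx[1:60], ]
validation <- matches[idx[61:80], ]
test_set   <- matches[idx[81:100], ]

c(train = nrow(train), validation = nrow(validation), test = nrow(test_set))
```

When the features are built from past games, splitting by date instead of at random may be the safer choice, since it keeps future matches out of the training set.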

Evaluation Metrics
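The slide does not name the metrics used. For a numeric target such as goal differential, two common choices are root mean squared error (RMSE) and mean absolute error (MAE), sketched here on made-up values.

```r
# Root mean squared error: penalizes large misses more heavily
rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))

# Mean absolute error: average size of the miss, in goals
mae <- function(actual, predicted) mean(abs(actual - predicted))

# Hypothetical actual vs. predicted goal differentials
actual    <- c(1, 0, -2, 3)
predicted <- c(0.5, 0.5, -1, 2)

rmse(actual, predicted)
mae(actual, predicted)
```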

Simulating the Matches
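One possible way to turn a predicted goal differential into match outcomes is a simple Monte Carlo simulation. Everything in this sketch is an assumption: the function `simulate_match()`, the normal noise model, and the `noise_sd` value are hypothetical and are not described in the talk.

```r
set.seed(2018)  # fixed seed so the simulation is reproducible

# Hypothetical: simulate one match many times by adding normal noise to
# the predicted goal differential and rounding to whole goals.
simulate_match <- function(predicted_gd, noise_sd = 1.5, n_sims = 10000) {
  sims <- round(rnorm(n_sims, mean = predicted_gd, sd = noise_sd))
  c(team1_win = mean(sims > 0),
    draw      = mean(sims == 0),
    team2_win = mean(sims < 0))
}

# Estimated outcome probabilities when Team1 is favored by 0.8 goals
probs <- simulate_match(predicted_gd = 0.8)
probs
```

Repeating this for every fixture, and advancing the simulated winners, would give one simulated tournament; many repetitions would give an estimate of each team's chances.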

And... Results!
