Fun with Machine Learning
Predicting the 2018 World Cup


Sorin Peste
Microsoft
Data Science
"An interdisciplinary field which unifies statistics, data analysis and their related methods in order to understand and analyze actual phenomena with data."
-- anon.

The Data Science Process
You need a Sharp Question™
Given a match between <Team1> and <Team2>, what is the expected goal differential at the end of the match?

http://aka.ms/PredictTheWorldCup



The R Language
- Open Source, released 1993
- Statistical features
  - linear modelling
  - statistical testing
  - time series analysis
  - classification
  - clustering
  - etc.
- Graphical features
- CRAN package repository
- RStudio IDE


R for Data Science
http://r4ds.had.co.nz/

Data Wrangling with R


> install.packages("dplyr")
> install.packages("tidyr")
Data Visualization with R
> install.packages("ggplot2")


Statistics with R
> library("stats")

Your Data Needs Cleaning

Top Problems

https://www.kaggle.com/surveys/2017
Missing Data
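A minimal sketch of spotting and dropping missing values in base R. The `matches` data frame and its column names are hypothetical, and dropping incomplete rows is only one possible policy; the talk does not say which one was used.

```r
# Hypothetical match results; NA marks a score that was not recorded.
matches <- data.frame(
  team1  = c("A", "B", "C", "D"),
  team2  = c("B", "C", "D", "A"),
  goals1 = c(2, NA, 1, 0),
  goals2 = c(1, 3, NA, 0)
)

# Count the missing values in each column.
missing_per_column <- colSums(is.na(matches))

# Keep only the rows where every column is known.
complete_matches <- matches[complete.cases(matches), ]
```

Imputing a value instead of dropping the row is the usual alternative when the data set is small.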


Incorrect Data

source: dilbert.com

Irrelevant Data


Outliers
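One common way to flag outliers is the 1.5 x IQR rule, sketched here in base R. The `goal_diff` values are made up for illustration, and the rule itself is an assumption; the talk does not name the method it used.

```r
# Hypothetical goal differentials, including one extreme score line.
goal_diff <- c(-2, -1, 0, 0, 1, 1, 2, 3, 31)

# First and third quartiles (R's default quantile method).
q <- unname(quantile(goal_diff, c(0.25, 0.75)))
iqr <- q[2] - q[1]

# Values further than 1.5 * IQR outside the quartiles are flagged.
lower <- q[1] - 1.5 * iqr
upper <- q[2] + 1.5 * iqr
is_outlier <- goal_diff < lower | goal_diff > upper
outliers <- goal_diff[is_outlier]
```

Whether a flagged match is dropped, capped, or kept is a judgment call that depends on whether the result is an error or a genuine blowout.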



Lag Features
last10games_w_per = (number of wins in the past 10 games) / 10
last10games_d_per = (number of draws in the past 10 games) / 10
last10games_l_per = (number of losses in the past 10 games) / 10
last10games_gd_per = (goals scored - goals conceded in the past 10 games) / 10
Rolling aggregates for performance metrics
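The four lag features above could be computed along these lines. This is a sketch in base R, assuming one team's matches are already sorted oldest to newest and that the current match is excluded from its own window; the helper `lag_features` and the sample scores are hypothetical.

```r
# Hypothetical chronological record for one team.
scored   <- c(2, 0, 1, 3, 1, 0, 2, 2, 1, 0, 4, 1)
conceded <- c(1, 0, 2, 1, 1, 1, 0, 2, 0, 3, 0, 1)

lag_features <- function(scored, conceded, window = 10) {
  n <- length(scored)
  w <- d <- l <- gd <- rep(NA_real_, n)
  for (i in seq_len(n)) {
    # Leave NA until a full window of earlier matches exists.
    if (i <= window) next
    idx <- (i - window):(i - 1)          # the previous `window` matches
    margin <- scored[idx] - conceded[idx]
    w[i]  <- sum(margin > 0) / window    # share of wins
    d[i]  <- sum(margin == 0) / window   # share of draws
    l[i]  <- sum(margin < 0) / window    # share of losses
    gd[i] <- sum(margin) / window        # average goal differential
  }
  data.frame(
    last10games_w_per  = w,
    last10games_d_per  = d,
    last10games_l_per  = l,
    last10games_gd_per = gd
  )
}

features <- lag_features(scored, conceded)
```

Excluding the current match from the window matters: including it would leak the result being predicted into the features.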
Choosing ML Models

Decision Trees

Random Forests

Training, Testing, Validation
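A minimal sketch of a three-way split in base R. The 60/20/20 proportions, the random shuffle, and the `matches` columns are assumptions made for illustration, not figures from the talk.

```r
set.seed(42)  # make the shuffle reproducible

# Hypothetical data set: one feature and the target (goal differential).
n <- 100
matches <- data.frame(
  rank_diff = rnorm(n),
  goal_diff = round(rnorm(n))
)

# Shuffle the row indices, then cut them into three disjoint parts.
shuffled <- sample(n)
n_train <- round(0.6 * n)
n_valid <- round(0.2 * n)

train <- matches[shuffled[1:n_train], ]
valid <- matches[shuffled[(n_train + 1):(n_train + n_valid)], ]
test  <- matches[shuffled[(n_train + n_valid + 1):n], ]
```

With match data, splitting by date instead of at random may be the safer choice, since lag features tie each row to earlier matches.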


Evaluation Metrics
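Since the target is a number (goal differential), regression metrics such as MAE and RMSE are a natural fit. The sketch below assumes those two metrics and uses made-up values; the talk does not state which metrics it reported.

```r
# Hypothetical actual and predicted goal differentials.
actual    <- c(1, 0, -2, 3, 0)
predicted <- c(0.5, 0.2, -1.0, 2.0, -0.5)

errors <- predicted - actual

# Mean absolute error: average size of the miss, in goals.
mae <- mean(abs(errors))

# Root mean squared error: penalizes large misses more heavily.
rmse <- sqrt(mean(errors^2))
```

Both are expressed in goals, which keeps them easy to interpret: an MAE of 1 means the prediction is off by one goal on average.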

Simulating the Matches
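One simple way to turn predicted goal differentials into group standings is to round each prediction to whole goals and award 3/1/0 points. This is only a sketch under that assumption; the fixtures, team names, and predictions are hypothetical, and the talk may have simulated matches differently.

```r
# Hypothetical predictions for a four-team group
# (predicted_gd = team1 goals minus team2 goals).
fixtures <- data.frame(
  team1 = c("A", "C", "A", "B", "A", "B"),
  team2 = c("B", "D", "C", "D", "D", "C"),
  predicted_gd = c(1.4, -0.2, 0.6, 0.3, 2.1, -1.2),
  stringsAsFactors = FALSE
)

# Round to whole goals; a rounded differential of 0 counts as a draw.
gd <- round(fixtures$predicted_gd)

# 3 points for a win, 1 for a draw, 0 for a loss.
points1 <- ifelse(gd > 0, 3, ifelse(gd == 0, 1, 0))
points2 <- ifelse(gd < 0, 3, ifelse(gd == 0, 1, 0))

# Total the points per team and rank the group.
points <- c(points1, points2)
teams  <- c(fixtures$team1, fixtures$team2)
standings <- sapply(split(points, teams), sum)
standings <- standings[order(standings, decreasing = TRUE)]
```

The top two names in `standings` would advance to the knockout round, where a rounded draw needs an extra tie-break rule.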


And... Results!
