Play Data, Play Ball!
Summit Suen


What Does This Remind You Of?

Why Baseball?
Discrete vs. Continuous
Records vs. Logs
Sabermetrics

History
Henry Chadwick

Hugh Fullerton
Earnshaw Cook
⋯⋯
Bill James
Billy Beane

Historical Data: Lahman

import pandas as pd
# Load the Lahman team-season table from a local copy of the 2015 CSV release
teams = pd.read_csv('../lahman-csv_2015/Teams.csv')

Real-time Data: MLBAM

http://gd2.mlb.com/components/game/mlb/year_2015/month_04/day_05/gid_2015_04_05_slnmlb_chnmlb_1/plays.xml
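The URL above follows a regular pattern: the date fields embedded in the game ID are repeated as year_, month_ and day_ path segments. Below is a minimal Python sketch of that pattern, inferred only from the URLs shown in these slides. The function name `gameday_url` is hypothetical, and whether the server still answers at these addresses is not checked here.

```python
# Base path taken from the Gameday URLs shown in the slides.
BASE = "http://gd2.mlb.com/components/game/mlb"


def gameday_url(game_id: str, resource: str = "plays.xml") -> str:
    """Build a Gameday URL from an ID such as gid_2015_04_05_slnmlb_chnmlb_1.

    Hypothetical helper: the layout is inferred from the example URLs,
    not from any official specification.
    """
    parts = game_id.split("_")
    # Expected pieces: gid, year, month, day, away team, home team, game number
    if len(parts) != 7 or parts[0] != "gid":
        raise ValueError(f"unexpected game id: {game_id!r}")
    _, year, month, day = parts[:4]
    return f"{BASE}/year_{year}/month_{month}/day_{day}/{game_id}/{resource}"


url = gameday_url("gid_2015_04_05_slnmlb_chnmlb_1")
print(url)
```

Passing a different `resource`, for example "inning/inning_all.xml", yields the other file paths under the same game directory.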

Meanwhile, use R:
library(Lahman)
library(dplyr)
totalRS <- Teams %>%
  select(yearID, R, G) %>%
  mutate(AvgRperG = R / G) %>%
  group_by(yearID) %>%
  summarise(RUN = sum(AvgRperG))
head(totalRS)
## Source: local data frame [6 x 2]
##
## yearID RUN
## 1 1871 93.12897
## 2 1872 95.21474
## 3 1873 73.15998
## 4 1874 58.55903
## 5 1875 70.08774
## 6 1876 47.01267
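For comparison, the same aggregation can be sketched in Python with pandas: compute runs per game for each team-season, then sum within each year. The function name `total_runs_per_game` and the three-row sample frame are made up for illustration. The real input would be the Lahman Teams table, which is assumed to carry the same yearID, R and G columns used in the R code.

```python
import pandas as pd


def total_runs_per_game(teams: pd.DataFrame) -> pd.DataFrame:
    """Sum each team's runs per game (R / G) within every season."""
    per_team = teams.assign(AvgRperG=teams["R"] / teams["G"])
    return (
        per_team.groupby("yearID", as_index=False)["AvgRperG"]
        .sum()
        .rename(columns={"AvgRperG": "RUN"})
    )


# Illustrative values only, not taken from the Lahman database.
sample = pd.DataFrame(
    {
        "yearID": [2001, 2001, 2002],
        "R": [800, 700, 750],
        "G": [160, 160, 150],
    }
)
total_rs = total_runs_per_game(sample)
print(total_rs)
```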

Meanwhile, use R:
library(ggplot2)
ggplot(data = totalRS, aes(x = yearID, y = RUN)) + stat_smooth() + geom_line()

Meanwhile, use R:
require(openWAR)
source("~/Documents/openWAR/R/GameDay.R")
getGameIds(date=as.Date("2015-04-05"))
##
## Retrieving data from 2015-04-05 ...
## ...found 2 games
## [1] "gid_2015_04_05_slnmlb_chnmlb_1" "gid_2015_04_05_zznmlb_zzamlb_1"
gd = gameday(gameId="gid_2015_04_05_slnmlb_chnmlb_1")
## gid_2015_04_05_slnmlb_chnmlb_1
gd$url
## bis_boxscore.xml
## "http://gd2.mlb.com/components/game/mlb/year_2015/month_04/day_05/gid_2015_04_05_slnmlb_chnmlb_1/bis_boxscore.xml"
## inning_all.xml
## "http://gd2.mlb.com/components/game/mlb/year_2015/month_04/day_05/gid_2015_04_05_slnmlb_chnmlb_1/inning/inning_all.xml"
## inning_hit.xml
## "http://gd2.mlb.com/components/game/mlb/year_2015/month_04/day_05/gid_2015_04_05_slnmlb_chnmlb_1/inning/inning_hit.xml"
## game.xml
## "http://gd2.mlb.com/components/game/mlb/year_2015/month_04/day_05/gid_2015_04_05_slnmlb_chnmlb_1/game.xml"
## game_events.xml
## "http://gd2.mlb.com/components/game/mlb/year_2015/month_04/day_05/gid_2015_04_05_slnmlb_chnmlb_1/game_events.xml"
Meanwhile, use R:
str(gd$ds)
## 'data.frame': 75 obs. of 62 variables:
## $ pitcherId : num 452657 452657 452657 452657 452657 ...
## $ batterId : num 572761 518792 407812 425509 571431 ...
## $ field_teamId : chr "112" "112" "112" "112" ...
## $ ab_num : num 1 2 3 4 5 6 7 8 9 10 ...
## $ inning : num 1 1 1 1 1 1 1 1 1 2 ...
## $ half : Factor w/ 2 levels "bottom","top": 2 2 2 2 2 1 1 1 1 2 ...
## $ balls : num 2 1 2 0 1 1 1 2 2 1 ...
## $ strikes : num 2 0 0 3 3 0 2 3 1 3 ...
## $ endOuts : num 1 1 1 2 3 0 1 2 3 1 ...
## $ event : Factor w/ 18 levels "Caught Stealing 2B",..: 7 3 15 17 17 3 7 17 7 17 ...
## $ actionId : num NA NA NA NA NA NA NA NA NA NA ...
## $ description : Factor w/ 94 levels "Adam Wainwright called out on strikes. ",..: 51 23 55 32 49 20 39 4 79 93 ...
## $ stand : Factor w/ 2 levels "L","R": 1 1 2 2 1 1 2 1 2 2 ...
## $ throws : Factor w/ 2 levels "L","R": 1 1 1 1 1 2 2 2 2 1 ...
## $ runnerMovement: Factor w/ 45 levels "","[407812::1B::Walk]",..: 1 26 28 1 3 17 18 1 19 1 ...
## $ x : num 142 190 175 NA NA ...
## $ y : num 159 119 127 NA NA ...
## $ game_type : Factor w/ 1 level "R": 1 1 1 1 1 1 1 1 1 1 ...
## $ home_team : Factor w/ 1 level "chn": 1 1 1 1 1 1 1 1 1 1 ...
## $ home_teamId : num 112 112 112 112 112 112 112 112 112 112 ...
## $ home_lg : Factor w/ 1 level "NL": 1 1 1 1 1 1 1 1 1 1 ...
## $ away_team : Factor w/ 1 level "sln": 1 1 1 1 1 1 1 1 1 1 ...
## $ away_teamId : num 138 138 138 138 138 138 138 138 138 138 ...
## $ away_lg : Factor w/ 1 level "NL": 1 1 1 1 1 1 1 1 1 1 ...
## $ venueId : num 17 17 17 17 17 17 17 17 17 17 ...
## $ stadium : Factor w/ 1 level "Wrigley Field": 1 1 1 1 1 1 1 1 1 1 ...
## $ timestamp : chr "2015-04-06 00:16:58" "2015-04-06 00:19:47" "2015-04-06 00:18:55" "2015-04-06 00:20:42" ...
## $ playerId.C : num 424325 424325 424325 424325 424325 ...
## $ playerId.1B : num 519203 519203 519203 519203 519203 ...
## $ playerId.2B : num 6e+05 6e+05 6e+05 6e+05 6e+05 ...
## $ playerId.3B : num 592609 592609 592609 592609 592609 ...
## $ playerId.SS : num 516770 516770 516770 516770 516770 ...
## $ playerId.LF : num 458085 458085 458085 458085 458085 ...
## $ playerId.CF : num 451594 451594 451594 451594 451594 ...
## $ playerId.RF : num 624585 624585 624585 624585 624585 ...
## $ batterPos : chr "3B" "RF" "LF" "SS" ...
## $ batterName : Factor w/ 30 levels "Adams, M","Alcantara",..: 4 9 10 19 1 8 26 22 5 15 ...
## $ pitcherName : Factor w/ 30 levels "Adams, M","Alcantara",..: 13 13 13 13 13 28 28 28 28 13 ...
## $ runsOnPlay : int 0 0 1 0 0 0 0 0 0 0 ...
## $ startOuts : num 0 1 1 1 2 0 0 1 2 0 ...
## $ runsInInning : int 1 1 1 1 1 0 0 0 0 1 ...
## $ runsITD : num 0 0 0 1 1 0 0 0 0 0 ...
## $ runsFuture : num 1 1 1 0 0 0 0 0 0 1 ...
## $ start1B : chr NA NA NA "407812" ...
## $ start2B : chr NA NA "518792" NA ...
## $ start3B : chr NA NA NA NA ...
## $ end1B : chr NA NA "407812" "407812" ...
## $ end2B : chr NA "518792" NA NA ...
## $ end3B : chr NA NA NA NA ...
## $ outsInInning : num 3 3 3 3 3 3 3 3 3 3 ...
## $ startCode : num 0 0 2 1 1 0 2 4 4 0 ...
## $ endCode : num 0 2 1 1 0 2 4 4 0 0 ...
## $ fielderId : num 6e+05 NA NA NA NA ...
## $ gameId : chr "gid_2015_04_05_slnmlb_chnmlb_1" "gid_2015_04_05_slnmlb_chnmlb_1" "gid_2015_04_05_slnmlb_chnmlb_1" "gid_2015_04_05_slnmlb_chnmlb_1" ...
## $ isPA : logi TRUE TRUE TRUE TRUE TRUE TRUE ...
## $ isAB : logi TRUE TRUE TRUE TRUE TRUE TRUE ...
## $ isHit : logi FALSE TRUE TRUE FALSE FALSE TRUE ...
## $ isBIP : logi TRUE TRUE TRUE FALSE FALSE TRUE ...
## $ our.x : num 42.2 163.2 124.3 NA NA ...
## $ our.y : num 99.1 200.4 180.7 NA NA ...
## $ r : num 108 258 219 NA NA ...
## $ theta : num 1.168 0.887 0.968 NA NA ...
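The r and theta columns at the end of this output look like the polar form of (our.x, our.y): distance from the origin and angle from the x-axis. The Python sketch below checks that reading against the first row printed above. It is an inference from the displayed numbers, not a statement of how openWAR computes these columns, and `to_polar` is a hypothetical helper name.

```python
import math


def to_polar(x: float, y: float) -> tuple[float, float]:
    """Return (r, theta) for a point, with theta in radians from the x-axis."""
    return math.hypot(x, y), math.atan2(y, x)


# First row of the str() output: our.x = 42.2, our.y = 99.1 (rounded for display)
r, theta = to_polar(42.2, 99.1)
print(round(r), round(theta, 3))
```

Because the printed coordinates are rounded, the result can only be compared with the listed r and theta to about the precision shown.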
Meanwhile, use R:
ggplot(data = gd$ds, aes(x = x, y = y, color = isHit)) + geom_point(size = 3) + coord_fixed()


Other Data: Crawler

import pandas as pd
# read_html returns one DataFrame per <table> on the page; keep the first
hrTable = pd.read_html("http://www.cpbl.com.tw/stats_hr.aspx", header=0)[0]
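To show what a table crawler does under the hood, here is a minimal standard-library sketch that turns an HTML table into a list of records. It swaps `pd.read_html` for Python's built-in `html.parser`, so it is a different and much simpler technique than the pandas call. The sample markup, player names and numbers are invented, and the class and function names are hypothetical.

```python
from html.parser import HTMLParser


class TableRows(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []      # finished rows
        self._row = None    # cells of the row being read, if any
        self._cell = None   # text fragments of the cell being read, if any

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def parse_table(html: str) -> list[dict]:
    """Treat the first row as the header and return one dict per data row."""
    parser = TableRows()
    parser.feed(html)
    parser.close()
    header, *body = parser.rows
    return [dict(zip(header, row)) for row in body]


# Invented sample markup standing in for a downloaded statistics page.
SAMPLE_HTML = """
<table>
  <tr><th>Player</th><th>HR</th></tr>
  <tr><td>Player A</td><td>12</td></tr>
  <tr><td>Player B</td><td>9</td></tr>
</table>
"""
records = parse_table(SAMPLE_HTML)
print(records)
```

Cell values come back as strings, so numeric columns still need converting before analysis.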
FAQ
Can the analysis or predictions be used for betting?
What about other sports?
Q&A
More Questions?
Thanks!
https://github.com/suensummit
@SummitSuen
https://www.facebook.com/summit.suen
summit.suen@gmail.com



Play Data, Play Ball! - pyconapac2015
By Summit Suen
cc-by-sa