Play Data, Play Ball!

Summit Suen

What Does This Remind You Of?

Why Baseball?

Discrete vs. Continuous

Records vs. Logs

Sabermetrics

History

Henry Chadwick

Hugh Fullerton

Earnshaw Cook 

⋯⋯

Bill James

Billy Beane

Historical Data: Lahman

import pandas as pd
teams = pd.read_csv('../lahman-csv_2015/Teams.csv')

Real-time Data: MLBAM

http://gd2.mlb.com/components/game/mlb/year_2015/month_04/day_05/gid_2015_04_05_slnmlb_chnmlb_1/plays.xml
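The URL above follows a visible pattern: the year, month, and day embedded in the game id are repeated as directory names. A minimal sketch of building such a URL, assuming that pattern holds for other games; `gameday_url` and `GAMEDAY_ROOT` are hypothetical names introduced here, and the sketch does not check whether the server still answers.

```python
import re

# Root taken from the example URL on the slide.
GAMEDAY_ROOT = "http://gd2.mlb.com/components/game/mlb"


def gameday_url(game_id, filename="plays.xml"):
    """Build a Gameday file URL from a game id such as
    'gid_2015_04_05_slnmlb_chnmlb_1' (hypothetical helper)."""
    # The date is encoded at the start of the id: gid_YYYY_MM_DD_...
    match = re.match(r"gid_(\d{4})_(\d{2})_(\d{2})_", game_id)
    if match is None:
        raise ValueError(f"unrecognised game id: {game_id!r}")
    year, month, day = match.groups()
    return (
        f"{GAMEDAY_ROOT}/year_{year}/month_{month}/day_{day}"
        f"/{game_id}/{filename}"
    )


print(gameday_url("gid_2015_04_05_slnmlb_chnmlb_1"))
```

Passing a different `filename`, for example `"game_events.xml"`, gives the other per-game files under the same directory.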

Meanwhile, use R:

library(Lahman)
library(dplyr)
# Per-team runs per game, summed over all teams in each season
totalRS <- Teams %>%
  select(yearID, R, G) %>%
  mutate(AvgRperG = R / G) %>%
  group_by(yearID) %>%
  summarise(sum(AvgRperG))
names(totalRS) <- c("yearID", "RUN")
head(totalRS)
## Source: local data frame [6 x 2]
## 
##   yearID      RUN
## 1   1871 93.12897
## 2   1872 95.21474
## 3   1873 73.15998
## 4   1874 58.55903
## 5   1875 70.08774
## 6   1876 47.01267
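The same aggregation can be sketched in plain Python, for readers following along without R. This is a rough equivalent rather than a port: it assumes each row is a mapping with `yearID`, `R`, and `G` fields that are all populated (as `csv.DictReader` would yield from `Teams.csv`), and `runs_per_game_by_year` is a hypothetical name. Like the R pipeline, it sums the per-team averages, so the result grows with the number of teams in a season.

```python
from collections import defaultdict


def runs_per_game_by_year(rows):
    """Sum each team's runs per game (R / G) within every season.

    rows: iterable of mappings with 'yearID', 'R' and 'G' keys,
    e.g. the rows produced by csv.DictReader (values may be strings).
    """
    totals = defaultdict(float)
    for row in rows:
        games = float(row["G"])
        if games == 0:
            continue  # avoid dividing by zero on an empty season record
        totals[int(row["yearID"])] += float(row["R"]) / games
    # Return the seasons in chronological order.
    return dict(sorted(totals.items()))


# Usage sketch (path as on the earlier slide):
#   import csv
#   with open("../lahman-csv_2015/Teams.csv", newline="") as f:
#       total_rs = runs_per_game_by_year(csv.DictReader(f))
```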

Meanwhile, use R:

library(ggplot2)
ggplot(data = totalRS, aes(x = yearID, y = RUN)) + stat_smooth() + geom_line()

Meanwhile, use R:

library(openWAR)
source("~/Documents/openWAR/R/GameDay.R")
getGameIds(date = as.Date("2015-04-05"))
## 
## Retrieving data from 2015-04-05 ...
## ...found 2 games
## [1] "gid_2015_04_05_slnmlb_chnmlb_1" "gid_2015_04_05_zznmlb_zzamlb_1"
gd <- gameday(gameId = "gid_2015_04_05_slnmlb_chnmlb_1")
## gid_2015_04_05_slnmlb_chnmlb_1
gd$url
##                                                                                                        bis_boxscore.xml 
##      "http://gd2.mlb.com/components/game/mlb/year_2015/month_04/day_05/gid_2015_04_05_slnmlb_chnmlb_1/bis_boxscore.xml" 
##                                                                                                          inning_all.xml 
## "http://gd2.mlb.com/components/game/mlb/year_2015/month_04/day_05/gid_2015_04_05_slnmlb_chnmlb_1/inning/inning_all.xml" 
##                                                                                                          inning_hit.xml 
## "http://gd2.mlb.com/components/game/mlb/year_2015/month_04/day_05/gid_2015_04_05_slnmlb_chnmlb_1/inning/inning_hit.xml" 
##                                                                                                                game.xml 
##              "http://gd2.mlb.com/components/game/mlb/year_2015/month_04/day_05/gid_2015_04_05_slnmlb_chnmlb_1/game.xml" 
##                                                                                                         game_events.xml 
##       "http://gd2.mlb.com/components/game/mlb/year_2015/month_04/day_05/gid_2015_04_05_slnmlb_chnmlb_1/game_events.xml"

Meanwhile, use R:

str(gd$ds)
## 'data.frame':    75 obs. of  62 variables:
##  $ pitcherId     : num  452657 452657 452657 452657 452657 ...
##  $ batterId      : num  572761 518792 407812 425509 571431 ...
##  $ field_teamId  : chr  "112" "112" "112" "112" ...
##  $ ab_num        : num  1 2 3 4 5 6 7 8 9 10 ...
##  $ inning        : num  1 1 1 1 1 1 1 1 1 2 ...
##  $ half          : Factor w/ 2 levels "bottom","top": 2 2 2 2 2 1 1 1 1 2 ...
##  $ balls         : num  2 1 2 0 1 1 1 2 2 1 ...
##  $ strikes       : num  2 0 0 3 3 0 2 3 1 3 ...
##  $ endOuts       : num  1 1 1 2 3 0 1 2 3 1 ...
##  $ event         : Factor w/ 18 levels "Caught Stealing 2B",..: 7 3 15 17 17 3 7 17 7 17 ...
##  $ actionId      : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ description   : Factor w/ 94 levels "Adam Wainwright called out on strikes.  ",..: 51 23 55 32 49 20 39 4 79 93 ...
##  $ stand         : Factor w/ 2 levels "L","R": 1 1 2 2 1 1 2 1 2 2 ...
##  $ throws        : Factor w/ 2 levels "L","R": 1 1 1 1 1 2 2 2 2 1 ...
##  $ runnerMovement: Factor w/ 45 levels "","[407812::1B::Walk]",..: 1 26 28 1 3 17 18 1 19 1 ...
##  $ x             : num  142 190 175 NA NA ...
##  $ y             : num  159 119 127 NA NA ...
##  $ game_type     : Factor w/ 1 level "R": 1 1 1 1 1 1 1 1 1 1 ...
##  $ home_team     : Factor w/ 1 level "chn": 1 1 1 1 1 1 1 1 1 1 ...
##  $ home_teamId   : num  112 112 112 112 112 112 112 112 112 112 ...
##  $ home_lg       : Factor w/ 1 level "NL": 1 1 1 1 1 1 1 1 1 1 ...
##  $ away_team     : Factor w/ 1 level "sln": 1 1 1 1 1 1 1 1 1 1 ...
##  $ away_teamId   : num  138 138 138 138 138 138 138 138 138 138 ...
##  $ away_lg       : Factor w/ 1 level "NL": 1 1 1 1 1 1 1 1 1 1 ...
##  $ venueId       : num  17 17 17 17 17 17 17 17 17 17 ...
##  $ stadium       : Factor w/ 1 level "Wrigley Field": 1 1 1 1 1 1 1 1 1 1 ...
##  $ timestamp     : chr  "2015-04-06 00:16:58" "2015-04-06 00:19:47" "2015-04-06 00:18:55" "2015-04-06 00:20:42" ...
##  $ playerId.C    : num  424325 424325 424325 424325 424325 ...
##  $ playerId.1B   : num  519203 519203 519203 519203 519203 ...
##  $ playerId.2B   : num  6e+05 6e+05 6e+05 6e+05 6e+05 ...
##  $ playerId.3B   : num  592609 592609 592609 592609 592609 ...
##  $ playerId.SS   : num  516770 516770 516770 516770 516770 ...
##  $ playerId.LF   : num  458085 458085 458085 458085 458085 ...
##  $ playerId.CF   : num  451594 451594 451594 451594 451594 ...
##  $ playerId.RF   : num  624585 624585 624585 624585 624585 ...
##  $ batterPos     : chr  "3B" "RF" "LF" "SS" ...
##  $ batterName    : Factor w/ 30 levels "Adams, M","Alcantara",..: 4 9 10 19 1 8 26 22 5 15 ...
##  $ pitcherName   : Factor w/ 30 levels "Adams, M","Alcantara",..: 13 13 13 13 13 28 28 28 28 13 ...
##  $ runsOnPlay    : int  0 0 1 0 0 0 0 0 0 0 ...
##  $ startOuts     : num  0 1 1 1 2 0 0 1 2 0 ...
##  $ runsInInning  : int  1 1 1 1 1 0 0 0 0 1 ...
##  $ runsITD       : num  0 0 0 1 1 0 0 0 0 0 ...
##  $ runsFuture    : num  1 1 1 0 0 0 0 0 0 1 ...
##  $ start1B       : chr  NA NA NA "407812" ...
##  $ start2B       : chr  NA NA "518792" NA ...
##  $ start3B       : chr  NA NA NA NA ...
##  $ end1B         : chr  NA NA "407812" "407812" ...
##  $ end2B         : chr  NA "518792" NA NA ...
##  $ end3B         : chr  NA NA NA NA ...
##  $ outsInInning  : num  3 3 3 3 3 3 3 3 3 3 ...
##  $ startCode     : num  0 0 2 1 1 0 2 4 4 0 ...
##  $ endCode       : num  0 2 1 1 0 2 4 4 0 0 ...
##  $ fielderId     : num  6e+05 NA NA NA NA ...
##  $ gameId        : chr  "gid_2015_04_05_slnmlb_chnmlb_1" "gid_2015_04_05_slnmlb_chnmlb_1" "gid_2015_04_05_slnmlb_chnmlb_1" "gid_2015_04_05_slnmlb_chnmlb_1" ...
##  $ isPA          : logi  TRUE TRUE TRUE TRUE TRUE TRUE ...
##  $ isAB          : logi  TRUE TRUE TRUE TRUE TRUE TRUE ...
##  $ isHit         : logi  FALSE TRUE TRUE FALSE FALSE TRUE ...
##  $ isBIP         : logi  TRUE TRUE TRUE FALSE FALSE TRUE ...
##  $ our.x         : num  42.2 163.2 124.3 NA NA ...
##  $ our.y         : num  99.1 200.4 180.7 NA NA ...
##  $ r             : num  108 258 219 NA NA ...
##  $ theta         : num  1.168 0.887 0.968 NA NA ...
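In the printed rows, `r` and `theta` look like the polar form of (`our.x`, `our.y`): for the first play, 42.2 and 99.1 give a distance of about 108 and an angle of about 1.168 radians. The sketch below reproduces that relationship. It is inferred from the values shown here, not taken from the openWAR documentation, and `to_polar` is a hypothetical helper.

```python
import math


def to_polar(x, y):
    """Convert a hit location to (distance, angle in radians).

    Returns (None, None) when either coordinate is missing, mirroring
    the NA values for plays with no ball in play.
    """
    if x is None or y is None:
        return None, None
    return math.hypot(x, y), math.atan2(y, x)


# First two plays from the structure dump above.
for x, y in [(42.2, 99.1), (163.2, 200.4)]:
    r, theta = to_polar(x, y)
    print(round(r), round(theta, 3))
```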

Meanwhile, use R:

ggplot(data = gd$ds, aes(x = x, y = y, color = isHit)) + geom_point(size = 3) + coord_fixed()

Other Data: Crawler

import pandas as pd
# read_html returns one DataFrame per <table> on the page; take the first
hrTable = pd.read_html("http://www.cpbl.com.tw/stats_hr.aspx", header=0)[0]
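To make the one-liner less of a black box, here is a dependency-free sketch of the same idea: find the first `<table>`, treat its first row as the header, and turn the remaining rows into records. It is an illustration only, not how pandas does it. It ignores nested tables, and the sample HTML with its players and numbers is made up, not taken from the CPBL site.

```python
from html.parser import HTMLParser


class FirstTableParser(HTMLParser):
    """Collect the rows of the first <table> as lists of cell text."""

    def __init__(self):
        super().__init__()
        self.rows = []          # completed rows
        self._row = None        # row being built, or None outside <tr>
        self._cell = None       # text pieces of the current cell, or None
        self._tables_seen = 0   # number of <table> tags opened so far
        self._in_first = False  # True while inside the first table

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self._tables_seen += 1
            self._in_first = self._tables_seen == 1
        elif self._in_first and tag == "tr":
            self._row = []
        elif self._row is not None and tag in ("td", "th"):
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
        elif tag == "table":
            self._in_first = False


def first_table(html):
    """Return the first table as a list of dicts keyed by the header row."""
    parser = FirstTableParser()
    parser.feed(html)
    parser.close()
    if not parser.rows:
        return []
    header, *body = parser.rows
    return [dict(zip(header, row)) for row in body]


# Made-up sample standing in for a downloaded stats page.
SAMPLE = """
<table>
  <tr><th>NAME</th><th>HR</th></tr>
  <tr><td>Player A</td><td>3</td></tr>
  <tr><td>Player B</td><td>1</td></tr>
</table>
"""

print(first_table(SAMPLE))
```

Cell values come back as strings here; converting numeric columns is left to the caller, whereas `read_html` attempts that conversion itself.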

FAQ

Can the analysis or predictions be used for betting?

What about other sports?

Q&A

More Questions?

Thanks!

https://github.com/suensummit

@SummitSuen

https://www.facebook.com/summit.suen

summit.suen@gmail.com

Play Data, Play Ball! - pyconapac2015

By Summit Suen

cc-by-sa