Play Data, Play Ball!
Summit Suen
What Does This Remind You Of?
Why Baseball?
Discrete vs. Continuous
Records vs. Logs
Sabermetrics
History
Henry Chadwick
Hugh Fullerton
Earnshaw Cook
⋯⋯
Bill James
Billy Beane
Historical Data: Lahman
teams = pd.read_csv('../lahman-csv_2015/Teams.csv')
Real-time Data: MLBAM
http://gd2.mlb.com/components/game/mlb/year_2015/month_04/day_05/gid_2015_04_05_slnmlb_chnmlb_1/plays.xml
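The URL above embeds the game date and the game id, so it can be assembled from the id alone. The following is a minimal sketch that assumes the path layout holds in general; the layout is inferred only from the example URLs shown in this talk, and `gameday_url` and `GAMEDAY_ROOT` are hypothetical names, not part of any MLBAM API. It builds the string only and makes no network request.

```python
from datetime import date

# Root taken from the example URL on the slide (assumed to be stable).
GAMEDAY_ROOT = "http://gd2.mlb.com/components/game/mlb"


def gameday_url(game_id: str, filename: str = "plays.xml") -> str:
    """Build a Gameday file URL from an id such as gid_2015_04_05_slnmlb_chnmlb_1."""
    # Fields 1..3 of the id hold the year, month, and day as zero-padded strings.
    _, year, month, day = game_id.split("_")[:4]
    # Constructing a date raises ValueError if the id carries an impossible date.
    date(int(year), int(month), int(day))
    return f"{GAMEDAY_ROOT}/year_{year}/month_{month}/day_{day}/{game_id}/{filename}"


url = gameday_url("gid_2015_04_05_slnmlb_chnmlb_1")
print(url)
```

Passing a different `filename`, for example `"inning/inning_all.xml"`, yields the other per-game files under the same directory.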
Meanwhile, use R:
library(Lahman)
library(dplyr)
totalRS <- Teams %>%
  select(yearID, R, G) %>%
  mutate(AvgRperG = R / G) %>%
  group_by(yearID) %>%
  summarise(sum(AvgRperG))
names(totalRS) <- c("yearID", "RUN")
head(totalRS)
## Source: local data frame [6 x 2]
##
## yearID RUN
## 1 1871 93.12897
## 2 1872 95.21474
## 3 1873 73.15998
## 4 1874 58.55903
## 5 1875 70.08774
## 6 1876 47.01267
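For readers following along in Python, the same per-season summary can be sketched in pandas. This is a hedged equivalent of the dplyr pipeline above, not code from the talk: `runs_per_game_by_year` is a hypothetical helper, and the three-row `sample` frame is made-up data standing in for the Lahman `Teams` table (only the column names `yearID`, `R`, and `G` come from the source).

```python
import pandas as pd


def runs_per_game_by_year(teams: pd.DataFrame) -> pd.DataFrame:
    """Sum each team's runs per game within a season, mirroring the dplyr pipeline."""
    # Per-team runs per game (R / G), like mutate(AvgRperG = R / G).
    per_team = teams.assign(AvgRperG=teams["R"] / teams["G"])
    # Sum over all teams in the same season, like group_by + summarise.
    total = per_team.groupby("yearID", as_index=False)["AvgRperG"].sum()
    return total.rename(columns={"AvgRperG": "RUN"})


# Made-up rows for illustration only; real data would come from Teams.csv.
sample = pd.DataFrame(
    {"yearID": [1871, 1871, 1872], "R": [100, 90, 60], "G": [20, 30, 20]}
)
total_rs = runs_per_game_by_year(sample)
print(total_rs)
```

With the real table, the call would be `runs_per_game_by_year(teams)` on the frame loaded by `pd.read_csv`.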
library(ggplot2)
ggplot(data = totalRS, aes(x = yearID, y = RUN)) + stat_smooth() + geom_line()
require(openWAR)
source("~/Documents/openWAR/R/GameDay.R")
getGameIds(date=as.Date("2015-04-05"))
##
## Retrieving data from 2015-04-05 ...
## ...found 2 games
## [1] "gid_2015_04_05_slnmlb_chnmlb_1" "gid_2015_04_05_zznmlb_zzamlb_1"
gd = gameday(gameId="gid_2015_04_05_slnmlb_chnmlb_1")
## gid_2015_04_05_slnmlb_chnmlb_1
gd$url
## bis_boxscore.xml
## "http://gd2.mlb.com/components/game/mlb/year_2015/month_04/day_05/gid_2015_04_05_slnmlb_chnmlb_1/bis_boxscore.xml"
## inning_all.xml
## "http://gd2.mlb.com/components/game/mlb/year_2015/month_04/day_05/gid_2015_04_05_slnmlb_chnmlb_1/inning/inning_all.xml"
## inning_hit.xml
## "http://gd2.mlb.com/components/game/mlb/year_2015/month_04/day_05/gid_2015_04_05_slnmlb_chnmlb_1/inning/inning_hit.xml"
## game.xml
## "http://gd2.mlb.com/components/game/mlb/year_2015/month_04/day_05/gid_2015_04_05_slnmlb_chnmlb_1/game.xml"
## game_events.xml
## "http://gd2.mlb.com/components/game/mlb/year_2015/month_04/day_05/gid_2015_04_05_slnmlb_chnmlb_1/game_events.xml"
str(gd$ds)
## 'data.frame': 75 obs. of 62 variables:
## $ pitcherId : num 452657 452657 452657 452657 452657 ...
## $ batterId : num 572761 518792 407812 425509 571431 ...
## $ field_teamId : chr "112" "112" "112" "112" ...
## $ ab_num : num 1 2 3 4 5 6 7 8 9 10 ...
## $ inning : num 1 1 1 1 1 1 1 1 1 2 ...
## $ half : Factor w/ 2 levels "bottom","top": 2 2 2 2 2 1 1 1 1 2 ...
## $ balls : num 2 1 2 0 1 1 1 2 2 1 ...
## $ strikes : num 2 0 0 3 3 0 2 3 1 3 ...
## $ endOuts : num 1 1 1 2 3 0 1 2 3 1 ...
## $ event : Factor w/ 18 levels "Caught Stealing 2B",..: 7 3 15 17 17 3 7 17 7 17 ...
## $ actionId : num NA NA NA NA NA NA NA NA NA NA ...
## $ description : Factor w/ 94 levels "Adam Wainwright called out on strikes. ",..: 51 23 55 32 49 20 39 4 79 93 ...
## $ stand : Factor w/ 2 levels "L","R": 1 1 2 2 1 1 2 1 2 2 ...
## $ throws : Factor w/ 2 levels "L","R": 1 1 1 1 1 2 2 2 2 1 ...
## $ runnerMovement: Factor w/ 45 levels "","[407812::1B::Walk]",..: 1 26 28 1 3 17 18 1 19 1 ...
## $ x : num 142 190 175 NA NA ...
## $ y : num 159 119 127 NA NA ...
## $ game_type : Factor w/ 1 level "R": 1 1 1 1 1 1 1 1 1 1 ...
## $ home_team : Factor w/ 1 level "chn": 1 1 1 1 1 1 1 1 1 1 ...
## $ home_teamId : num 112 112 112 112 112 112 112 112 112 112 ...
## $ home_lg : Factor w/ 1 level "NL": 1 1 1 1 1 1 1 1 1 1 ...
## $ away_team : Factor w/ 1 level "sln": 1 1 1 1 1 1 1 1 1 1 ...
## $ away_teamId : num 138 138 138 138 138 138 138 138 138 138 ...
## $ away_lg : Factor w/ 1 level "NL": 1 1 1 1 1 1 1 1 1 1 ...
## $ venueId : num 17 17 17 17 17 17 17 17 17 17 ...
## $ stadium : Factor w/ 1 level "Wrigley Field": 1 1 1 1 1 1 1 1 1 1 ...
## $ timestamp : chr "2015-04-06 00:16:58" "2015-04-06 00:19:47" "2015-04-06 00:18:55" "2015-04-06 00:20:42" ...
## $ playerId.C : num 424325 424325 424325 424325 424325 ...
## $ playerId.1B : num 519203 519203 519203 519203 519203 ...
## $ playerId.2B : num 6e+05 6e+05 6e+05 6e+05 6e+05 ...
## $ playerId.3B : num 592609 592609 592609 592609 592609 ...
## $ playerId.SS : num 516770 516770 516770 516770 516770 ...
## $ playerId.LF : num 458085 458085 458085 458085 458085 ...
## $ playerId.CF : num 451594 451594 451594 451594 451594 ...
## $ playerId.RF : num 624585 624585 624585 624585 624585 ...
## $ batterPos : chr "3B" "RF" "LF" "SS" ...
## $ batterName : Factor w/ 30 levels "Adams, M","Alcantara",..: 4 9 10 19 1 8 26 22 5 15 ...
## $ pitcherName : Factor w/ 30 levels "Adams, M","Alcantara",..: 13 13 13 13 13 28 28 28 28 13 ...
## $ runsOnPlay : int 0 0 1 0 0 0 0 0 0 0 ...
## $ startOuts : num 0 1 1 1 2 0 0 1 2 0 ...
## $ runsInInning : int 1 1 1 1 1 0 0 0 0 1 ...
## $ runsITD : num 0 0 0 1 1 0 0 0 0 0 ...
## $ runsFuture : num 1 1 1 0 0 0 0 0 0 1 ...
## $ start1B : chr NA NA NA "407812" ...
## $ start2B : chr NA NA "518792" NA ...
## $ start3B : chr NA NA NA NA ...
## $ end1B : chr NA NA "407812" "407812" ...
## $ end2B : chr NA "518792" NA NA ...
## $ end3B : chr NA NA NA NA ...
## $ outsInInning : num 3 3 3 3 3 3 3 3 3 3 ...
## $ startCode : num 0 0 2 1 1 0 2 4 4 0 ...
## $ endCode : num 0 2 1 1 0 2 4 4 0 0 ...
## $ fielderId : num 6e+05 NA NA NA NA ...
## $ gameId : chr "gid_2015_04_05_slnmlb_chnmlb_1" "gid_2015_04_05_slnmlb_chnmlb_1" "gid_2015_04_05_slnmlb_chnmlb_1" "gid_2015_04_05_slnmlb_chnmlb_1" ...
## $ isPA : logi TRUE TRUE TRUE TRUE TRUE TRUE ...
## $ isAB : logi TRUE TRUE TRUE TRUE TRUE TRUE ...
## $ isHit : logi FALSE TRUE TRUE FALSE FALSE TRUE ...
## $ isBIP : logi TRUE TRUE TRUE FALSE FALSE TRUE ...
## $ our.x : num 42.2 163.2 124.3 NA NA ...
## $ our.y : num 99.1 200.4 180.7 NA NA ...
## $ r : num 108 258 219 NA NA ...
## $ theta : num 1.168 0.887 0.968 NA NA ...
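The last four columns look related: `r` and `theta` appear to be the polar form of (`our.x`, `our.y`). That reading is inferred from the printed values alone, not from the openWAR documentation, so treat it as an assumption. A small Python check against the first three rows shown above:

```python
import math


def to_polar(x: float, y: float) -> tuple[float, float]:
    """Convert Cartesian (x, y) to polar (r, theta), with theta in radians."""
    return math.hypot(x, y), math.atan2(y, x)


# (our.x, our.y) values copied from the first three rows of the str() output.
rows = [(42.2, 99.1), (163.2, 200.4), (124.3, 180.7)]
polar = [to_polar(x, y) for x, y in rows]

for (x, y), (r, theta) in zip(rows, polar):
    print(f"our.x={x:6.1f} our.y={y:6.1f} -> r={r:6.1f} theta={theta:.3f}")
```

The computed values agree with the printed `r` (108, 258, 219) and `theta` (1.168, 0.887, 0.968) to within the rounding of the display.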
ggplot(data = gd$ds, aes(x = x, y = y, color = isHit)) + geom_point(size = 3) + coord_fixed()
Other Data: Crawler
import pandas as pd
# read_html returns a list of DataFrames, one per <table> on the page; take the first.
hrTable = pd.read_html("http://www.cpbl.com.tw/stats_hr.aspx", header=0)[0]
FAQ
Can the analyzed or predicted results be used for betting?
What about other sports?
Q&A
More Questions?
Thanks!
https://github.com/suensummit
@SummitSuen
https://www.facebook.com/summit.suen
summit.suen@gmail.com
Play Data, Play Ball! - pyconapac2015
By Summit Suen
cc-by-sa