Short Course
QUT, Brisbane, Australia
February 17, 2016
Sponsored by the ARC Centre of Excellence for Mathematical and Statistical Frontiers
Instructor: Ryan Hafen (@hafenstats)
Activity | Time |
---|---|
Tessera Overview | 09:00 - 10:30 |
Break (morning tea) | 10:30 - 10:45 |
Hands-on examples with small data | 10:45 - 12:00 |
Break (lunch) | 12:00 - 13:00 |
Hands-on examples with small and medium data | 13:00 - 15:15 |
Break (afternoon tea) | 15:15 - 15:30 |
Hands-on examples with medium data | 15:30 - 16:30 |
Large scale setup overview and demonstration | 16:30 - 17:00 |
# set the working directory to the course material download location
setwd("path/to/course/material/download")
# add the Tessera package repository and install the required packages
options(repos = c(tessera = "http://packages.tessera.io",
  getOption("repos")))
install.packages(c("datadr", "trelliscope", "rbokeh",
  "housingData", "lubridate"))
# make sure you have set your working directory correctly
list.files()
# [1] "housing.R" "taxi.R" "taxidata"
# test installed packages with a simple example
library(trelliscope)
library(rbokeh)
library(housingData)
panel <- function(x)
  figure() %>% ly_points(time, medListPriceSqft, data = x)
housing %>%
  qtrellis(by = c("county", "state"), panel, layout = c(2, 4))
# note: this will open up a web browser with a shiny app running
# use ctrl-c or esc to get back to your R prompt
Test Installation:
Extension of multi-panel display systems, e.g. Trellis Display or faceting in ggplot
Number of panels can be very large (in the millions)
Panels can be interactively navigated through the use of cognostics
Provides flexible, scalable, detailed visualization of large, complex data
trelliscope example
Data most often does not come with a model
If we already (think we) know the algorithm / model to apply, simply apply it to the data, and do nothing else to assess validity, we are not doing analysis; we are processing
Deep analysis means detailed, comprehensive analysis that does not lose important information in the data
It means trial and error, an iterative process of hypothesizing, fitting, validating, learning
It means a lot of visualization
It means access to 1000s of statistical, machine learning, and visualization methods, regardless of data size
It means access to a high-level language to make iteration efficient
Any or all of the following:
Often complex data is more of a challenge than large data, but most large data sets are also complex
We will see examples of creating Trelliscope displays in the following hands-on section
Tessera is an implementation of D&R built on R
Must be able to break data down into pieces for independent storage / computation
Recall the potential for: “Complex data structures not readily put into tabular form of cases by variables”
Key-value pairs: a flexible storage paradigm for partitioned, distributed data
each subset of a division is an R list with two elements: key, value
keys and values can be any R object
[[1]]
$key
[1] "setosa"
$value
Sepal.Length Sepal.Width Petal.Length Petal.Width
5.1 3.5 1.4 0.2
4.9 3.0 1.4 0.2
4.7 3.2 1.3 0.2
4.6 3.1 1.5 0.2
5.0 3.6 1.4 0.2
...
[[2]]
$key
[1] "versicolor"
$value
Sepal.Length Sepal.Width Petal.Length Petal.Width
7.0 3.2 4.7 1.4
6.4 3.2 4.5 1.5
6.9 3.1 4.9 1.5
5.5 2.3 4.0 1.3
6.5 2.8 4.6 1.5
...
[[3]]
$key
[1] "virginica"
$value
Sepal.Length Sepal.Width Petal.Length Petal.Width
6.3 3.3 6.0 2.5
5.8 2.7 5.1 1.9
7.1 3.0 5.9 2.1
6.3 2.9 5.6 1.8
6.5 3.0 5.8 2.2
...
We can represent the data divided by species as a collection of key-value pairs
Measurements in centimeters of the variables sepal length and width and petal length and width for 50 flowers from each of 3 species of iris: "setosa", "versicolor", and "virginica"
Each key-value pair is an independent unit and can be stored across disks, machines, etc.
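A minimal base-R sketch of how such a structure might be built by hand (datadr constructs these key-value pairs for you; this just shows the shape of the data):
# split iris by species and build a list of key-value pairs
irisKV <- lapply(split(iris, iris$Species), function(d)
  list(key = as.character(d$Species[1]),  # species name as the key
       value = d[, 1:4]))                 # measurements as the value
names(irisKV) <- NULL
str(irisKV[[1]], max.level = 2)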
A collection of key-value pairs that forms a data set is called a distributed data object (ddo) and is the fundamental data type that is passed to and returned from most of the functions in datadr.
A ddo for which every value in each key-value pair is a data frame with the same schema is called a distributed data frame (ddf)
This is a ddf (and a ddo)
[[1]]
$key
[1] "setosa"
$value
Call:
loess(formula = Sepal.Length ~ Sepal.Width, data = x)
Number of Observations: 50
Equivalent Number of Parameters: 5.53
Residual Standard Error: 0.2437
[[2]]
$key
[1] "versicolor"
$value
Call:
loess(formula = Sepal.Length ~ Sepal.Width, data = x)
Number of Observations: 50
Equivalent Number of Parameters: 5
Residual Standard Error: 0.4483
[[3]]
$key
[1] "virginica"
$value
Call:
loess(formula = Sepal.Length ~ Sepal.Width, data = x)
Number of Observations: 50
Equivalent Number of Parameters: 5.59
Residual Standard Error: 0.5683
Suppose we fit a loess model to each subset and store the result in a new collection of key-value pairs.
This is a ddo but not a ddf
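A sketch of one way to create such a collection with datadr's divide() and addTransform(), both covered in the hands-on sections below:
# divide iris by species, then fit a loess model to each subset
bySpecies <- divide(ddf(iris), by = "Species")
irisLo <- addTransform(bySpecies, function(x)
  loess(Sepal.Length ~ Sepal.Width, data = x))
irisLo[[1]]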
Distributed data types / backend connections
localDiskConn(), hdfsConn(), sparkDataConn(): connections to ddo / ddf objects persisted on a backend storage system
ddo(): instantiate a ddo from a backend connection
ddf(): instantiate a ddf from a backend connection
Conversion methods between data stored on different backends
A general-purpose paradigm for processing big data sets
The standard processing paradigm in Hadoop
Map: apply a map function to each input key-value pair, emitting intermediate key-value pairs
Shuffle: group map outputs by intermediate key
Reduce: for each unique intermediate key, gather all values and process with a reduce function
This is a very general and scalable approach to processing data.
Compute maximum sepal length by species for randomly divided iris data
maxMap <- expression({
  # map.values: list of input values
  for(v in map.values) {
    # v is a data.frame
    # for each species, "collect" a
    # k/v pair to be sent to reduce
    by(v, v$Species, function(x) {
      collect(
        x$Species[1],        # key
        max(x$Sepal.Length)  # value
      )
    })
  }
})
Map
# reduce is called for each unique intermediate key
# "pre" is an initialization called once
# "reduce" is called for batches of reduce.values
# "post" runs after processing all reduce.values
maxReduce <- expression(
  pre = {
    speciesMax <- NULL
  },
  reduce = {
    curValues <- do.call(c, reduce.values)
    speciesMax <- max(c(speciesMax, curValues))
  },
  post = {
    collect(reduce.key, speciesMax)
  }
)
Reduce
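To run this job, the map and reduce expressions are passed to mrExec() (listed with the other datadr functions below); a minimal sketch, assuming a randomly divided iris ddf created with rrDiv():
# random replicate division of iris (roughly 50 rows per subset)
irisRandom <- divide(ddf(iris), by = rrDiv(50))
# run the MapReduce job; the result is a new collection of key-value pairs
maxRes <- mrExec(irisRandom, map = maxMap, reduce = maxReduce)
maxRes[[1]]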
That's not very intuitive
But could be run on hundreds of terabytes of data
And it's much easier than what you'd write in Java
As often as possible, functions exist in datadr to shield the analyst from having to write MapReduce code
A divide() function takes a ddf and splits it using conditioning or random replicate division
Analytic methods are applied to a ddo/ddf with the addTransform() function
Recombinations are specified with recombine(), which provides standard combiner methods, such as combRbind(), which binds the transformed results into a single data frame (see the sketch after this list)
Division of ddos with arbitrary data structures must typically be done with custom MapReduce code (unless data can be temporarily transformed into a ddf)
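Putting divide / transform / recombine together, a minimal sketch of a full D&R pass over the built-in iris data:
library(datadr)
# divide by species, compute a per-subset statistic, recombine into a data frame
bySpecies <- divide(ddf(iris), by = "Species")
meanPetal <- addTransform(bySpecies, function(x) mean(x$Petal.Length))
recombine(meanPetal, combine = combRbind)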
drLapply(): apply a function to each subset of a ddo/ddf and obtain a new ddo/ddf (see the sketch after this list)
drJoin(): join multiple ddo/ddf objects by key
drSample(): take a random sample of subsets of a ddo/ddf
drFilter(): filter out subsets of a ddo/ddf that do not meet a specified criterion
drSubset(): return a subset data frame of a ddf
drRead.table() and friends
mrExec(): run a traditional MapReduce job on a ddo/ddf
drQuantile(): estimate all-data quantiles, optionally by a grouping variable
drAggregate(): all-data tabulation
drHexbin(): all-data hexagonal binning aggregation
summary() method computes numerically stable moments, other summary stats (freq table, range, #NA, etc.)
More to come
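For example, drLapply() works much like lapply(); a short sketch, assuming a ddf divided by species (here called bySpecies, as in the earlier sketch):
# apply a function to every subset, obtaining a new ddo
meanPetalDdo <- drLapply(bySpecies, function(x) mean(x$Petal.Length))
meanPetalDdo[[1]]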
dplyr is great for subsetting, aggregating large tabular data
datadr is great for scalable deep analysis of large, complex data
For more information (docs, code, papers, user group, blog, etc.): http://tessera.io
library(housingData)
head(housing)
# fips county state time nSold medListPriceSqft medSoldPriceSqft
# 1 06001 Alameda County CA 2008-10-01 NA 307.9787 325.8118
# 2 06001 Alameda County CA 2008-11-01 NA 299.1667 NA
# 3 06001 Alameda County CA 2008-11-01 NA NA 318.1150
# 4 06001 Alameda County CA 2008-12-01 NA 289.8815 305.7878
# 5 06001 Alameda County CA 2009-01-01 NA 288.5000 291.5977
# 6 06001 Alameda County CA 2009-02-01 NA 287.0370 NA
Variable | Description |
---|---|
fips | Federal Information Processing Standard - a 5-digit county code |
county | US county name |
state | US state name |
time | date (the data is aggregated monthly) |
nSold | number sold this month |
medListPriceSqft | median list price per square foot (USD) |
medSoldPriceSqft | median sold price per square foot (USD) |
## these are examples - don't run this code
# get data from text files on HDFS and save a ddf back to HDFS
myData1 <- drRead.table(
  hdfsConn("/home/input_text_files"),
  header = TRUE, sep = "\t",
  output = hdfsConn("/home/data_ddf")
)
# read from a csv file on disk and save a ddf to disk
myData2 <- drRead.csv(
  "c:/my/local/data.csv",
  output = localDiskConn("c:/my/local/data_ddf")
)
# convert in memory data frame to ddf:
myData3 <- ddf(some_data_frame)
Our housing data is already in memory, so we can use the simple approach of casting the data frame as a ddf:
library(datadr)
library(housingData)
housingDdf <- ddf(housing)
# print result to see what it looks like
housingDdf
# Distributed data frame backed by 'kvMemory' connection
#
# attribute | value
# ----------------+----------------------------------------------------------------
# names | fips(fac), county(fac), state(fac), time(Dat), and 3 more
# nrow | 224369
# size (stored) | 10.58 MB
# size (object) | 10.58 MB
# # subsets | 1
#
# * Other attributes: getKeys()
# * Missing attributes: splitSizeDistn, splitRowDistn, summary
One useful way to look at the data is to divide it by county and state
byCounty <- divide(housingDdf,
by = c("county", "state"), update = TRUE)
* update = TRUE tells datadr to compute and update the attributes of the ddf (seen in printout)
byCounty
# Distributed data frame backed by 'kvMemory' connection
#
# attribute | value
# ----------------+----------------------------------------------------------------
# names | fips(cha), time(Dat), nSold(num), and 2 more
# nrow | 224369
# size (stored) | 15.73 MB
# size (object) | 15.73 MB
# # subsets | 2883
#
# * Other attributes: getKeys(), splitSizeDistn(), splitRowDistn(), summary()
# * Conditioning variables: county, state
Recall that a ddf is a collection of key-value pairs
# look at first key-value pair
byCounty[[1]]
# $key
# [1] "county=Abbeville County|state=SC"
#
# $value
# fips time nSold medListPriceSqft medSoldPriceSqft
# 1 45001 2008-10-01 NA 73.06226 NA
# 2 45001 2008-11-01 NA 70.71429 NA
# 3 45001 2008-12-01 NA 70.71429 NA
# 4 45001 2009-01-01 NA 73.43750 NA
# 5 45001 2009-02-01 NA 78.69565 NA
# ...
# look at first two key-value pairs
byCounty[1:2]
# [[1]]
# $key
# [1] "county=Abbeville County|state=SC"#
#
# $value
# fips time nSold medListPriceSqft medSoldPriceSqft
# 1 45001 2008-10-01 NA 73.06226 NA
# 2 45001 2008-11-01 NA 70.71429 NA
# 3 45001 2008-12-01 NA 70.71429 NA
# 4 45001 2009-01-01 NA 73.43750 NA
# 5 45001 2009-02-01 NA 78.69565 NA
# ...
#
#
# [[2]]
# $key
# [1] "county=Acadia Parish|state=LA"
#
# $value
# fips time nSold medListPriceSqft medSoldPriceSqft
# 1 22001 2008-10-01 NA 67.77494 NA
# 2 22001 2008-11-01 NA 67.16418 NA
# 3 22001 2008-12-01 NA 63.59129 NA
# 4 22001 2009-01-01 NA 65.78947 NA
# 5 22001 2009-02-01 NA 65.78947 NA
# ...
# lookup by key
byCounty[["county=Benton County|state=WA"]]
# $key
# [1] "county=Benton County|state=WA"
#
# $value
# fips time nSold medListPriceSqft medSoldPriceSqft
# 1 53005 2008-10-01 137 106.6351 106.2179
# 2 53005 2008-11-01 80 106.9650 NA
# 3 53005 2008-11-01 NA NA 105.2370
# 4 53005 2008-12-01 95 107.6642 105.6311
# 5 53005 2009-01-01 73 107.6868 105.8892
# ...
Your turn: try creating other divisions of the housing data (possible solutions below)
byState <- divide(housing, by="state")
byMonth <- divide(housing, by="time")
summary(byCounty)
# fips time nSold
# ------------------- ------------------ -------------------
# levels : 2883 missing : 0 missing : 164370
# missing : 0 min : 08-10-01 min : 11
# > freqTable head < max : 14-03-01 max : 35619
# 26077 : 140 mean : 274.6582
# 51069 : 140 std dev : 732.2429
# 08019 : 139 skewness : 10.338
# 13311 : 139 kurtosis : 222.8995
# ------------------- ------------------ -------------------
# ...
names(byCounty)
# [1] "fips" "time" "nSold" "medListPriceSqft"
# [5] "medSoldPriceSqft"
length(byCounty)
# [1] 2883
getKeys(byCounty)
# [[1]]
# [1] "county=Abbeville County|state=SC"
# ...
addTransform(data, fun) applies a function fun to each value of a key-value pair in the input ddo/ddf data
e.g. calculate a summary statistic for each subset
The transformation is not applied immediately; it is deferred until:
A function that kicks off a MapReduce job is called, e.g. recombine()
A subset of the data is requested (e.g. byCounty[[1]])
The drPersist() function explicitly forces the transformation to be computed and its result persisted
Compute the slope of a simple linear regression of list price vs. time
# function to calculate a linear model and extract the slope parameter
lmCoef <- function(x) {
  coef(lm(medListPriceSqft ~ time, data = x))[2]
}
# best practice tip: test transformation function on one subset
lmCoef(byCounty[[1]]$value)
# time
# -0.0002323686
# apply the transformation function to the ddf
byCountySlope <- addTransform(byCounty, lmCoef)
# look at first key-value pair
byCountySlope[[1]]
# $key
# [1] "county=Abbeville County|state=SC"
#
# $value
# time
# -0.0002323686
# what does this object look like?
byCountySlope
# Transformed distributed data object backed by 'kvMemory' connection
#
# attribute | value
# ----------------+----------------------------------------------------------------
# size (stored) | 15.73 MB (before transformation)
# size (object) | 15.73 MB (before transformation)
# # subsets | 2883
#
# * Other attributes: getKeys()
# * Conditioning variables: county, state
*note: this is a ddo (value isn't a data frame) but it is still pointing to byCounty's data
Create a new transformed data set using addTransform()
# get a subset
x <- byCounty[[1]]$value
# now experiment with x and stick resulting code into a transformation function
# some scaffolding for your solution:
transformFn <- function(x) {
  ## you fill in here
}
xformedData <- addTransform(byCounty, transformFn)
Hint: it is helpful to grab a subset and experiment with it and then stick that code into a function:
# example 1 - total sold over time for the county
totalSold <- function(x) {
  sum(x$nSold, na.rm = TRUE)
}
byCountySold <- addTransform(byCounty, totalSold)
# example 2 - range of time for which data exists for county
timeRange <- function(x) {
  range(x$time)
}
byCountyTime <- addTransform(byCounty, timeRange)
# example 3 - convert USD / sq. ft. to AUD / sq. metre
us2aus <- function(x)
  x * 15.2453603
byCountyAus <- addTransform(byCounty, function(x) {
  x$medListPriceMetre <- us2aus(x$medListPriceSqft)
  x$medSoldPriceMetre <- us2aus(x$medSoldPriceSqft)
  x
})
Note that the dependency on the us2aus() function was automatically detected; this becomes important in distributed computing, when code must be shipped to many machines
countySlopes <- recombine(byCountySlope, combine = combRbind)
Let's recombine the slopes of list price vs. time into a data frame.
This is an example of statistic recombination where the result is small and is pulled back into R for further investigation.
head(countySlopes)
# county state val
# Abbeville County SC -0.0002323686
# Acadia Parish LA 0.0019518441
# Accomack County VA -0.0092717711
# Ada County ID -0.0030197554
# Adair County IA -0.0308381951
# Adair County KY 0.0034399585
The combine parameter controls the form of the recombined result (see the sketch after this list)
combine=combRbind: rbind is used to combine all of the transformed key-value pairs into a local data frame in your R session - this is a frequently used option
combine=combCollect: transformed key-value pairs are collected into a local list in your R session
combine=combDdo: results are combined into a new ddo object
Others can be written for more sophisticated goals such as model coefficient averaging, etc.
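For example, to pull the transformed values back as a local list rather than a data frame, a sketch using the byCountySlope object from above:
# collect transformed key-value pairs into a list in the local R session
countySlopeList <- recombine(byCountySlope, combine = combCollect)
countySlopeList[[1]]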
us2aus <- function(x)
  x * 15.2453603
byCountyAus <- addTransform(byCounty, function(x) {
  x$medListPriceMetre <- us2aus(x$medListPriceSqft)
  x$medSoldPriceMetre <- us2aus(x$medSoldPriceSqft)
  x
})
byCountyAus <- drPersist(byCountyAus)
Often we want to persist a transformation, creating a new ddo or ddf that contains the transformed data. Transforming and persisting a transformation is an analytic recombination.
Let's persist our US to Australian transformation using drPersist():
*note: we could also use recombine(combine = combDdf); drPersist() is just a convenience
We have a few additional data sets we'd like to join with our home price data divided by county
head(geoCounty)
# fips county state lon lat rMapState rMapCounty
# 1 01001 Autauga County AL -86.64565 32.54009 alabama autauga
# 2 01003 Baldwin County AL -87.72627 30.73831 alabama baldwin
# 3 01005 Barbour County AL -85.39733 31.87403 alabama barbour
# 4 01007 Bibb County AL -87.12526 32.99902 alabama bibb
# 5 01009 Blount County AL -86.56271 33.99044 alabama blount
# 6 01011 Bullock County AL -85.71680 32.10634 alabama bullock
head(wikiCounty)
# fips county state pop2013 href
# 1 01001 Autauga County AL 55246 http://en.wikipedia.org/wiki/Autauga_County,_Alabama
# 2 01003 Baldwin County AL 195540 http://en.wikipedia.org/wiki/Baldwin_County,_Alabama
# 3 01005 Barbour County AL 27076 http://en.wikipedia.org/wiki/Barbour_County,_Alabama
# 4 01007 Bibb County AL 22512 http://en.wikipedia.org/wiki/Bibb_County,_Alabama
# 5 01009 Blount County AL 57872 http://en.wikipedia.org/wiki/Blount_County,_Alabama
# 6 01011 Bullock County AL 10639 http://en.wikipedia.org/wiki/Bullock_County,_Alabama
byCountyGeo <- # ...
byCountyWiki <- # ...
byCountyGeo <- divide(geoCounty,
  by = c("county", "state"))
byCountyWiki <- divide(wikiCounty,
  by = c("county", "state"))
byCountyGeo
# Distributed data frame backed by 'kvMemory' connection
#
# attribute | value
# ----------------+----------------------------------------------------------------
# names | fips(cha), lon(num), lat(num), rMapState(cha), rMapCounty(cha)
# nrow | 3075
# size (stored) | 8.14 MB
# size (object) | 8.14 MB
# # subsets | 3075
# ...
byCountyGeo[[1]]
# $key
# [1] "county=Abbeville County|state=SC"
#
# $value
# fips lon lat rMapState rMapCounty
# 1 45001 -82.45851 34.23021 south carolina abbeville
drJoin() takes any number of named parameters specifying ddo/ddf objects and returns a ddo of the objects joined by key
joinedData <- drJoin(
  housing = byCountyAus,
  slope = byCountySlope,
  geo = byCountyGeo,
  wiki = byCountyWiki)
joinedData
# Distributed data object backed by 'kvMemory' connection
#
# attribute | value
# ----------------+----------------------------------------------------------------
# size (stored) | 29.23 MB
# size (object) | 29.23 MB
# # subsets | 3144
What does this object look like?
str(joinedData[[1]])
# List of 2
# $ key : chr "county=Abbeville County|state=SC"
# $ value:List of 4
# ..$ housing:'data.frame': 66 obs. of 7 variables:
# .. ..$ fips : chr [1:66] "45001" "45001" "45001" "45001" ...
# .. ..$ time : Date[1:66], format: "2008-10-01" "2008-11-01" ...
# .. ..$ nSold : num [1:66] NA NA NA NA NA NA NA NA NA NA ...
# .. ..$ medListPriceSqft : num [1:66] 73.1 70.7 70.7 73.4 78.7 ...
# .. ..$ medSoldPriceSqft : num [1:66] NA NA NA NA NA NA NA NA NA NA ...
# .. ..$ medListPriceMetre: num [1:66] 1114 1078 1078 1120 1200 ...
# .. ..$ medSoldPriceMetre: num [1:66] NA NA NA NA NA NA NA NA NA NA ...
# ..$ slope : Named num -0.000232
# ..$ geo :'data.frame': 1 obs. of 5 variables:
# .. ..$ fips : chr "45001"
# .. ..$ lon : num -82.5
# .. ..$ lat : num 34.2
# .. ..$ rMapState : chr "south carolina"
# .. ..$ rMapCounty: chr "abbeville"
# ..$ wiki :'data.frame': 1 obs. of 3 variables:
# .. ..$ fips : chr "45001"
# .. ..$ pop2013: int 25007
# .. ..$ href : chr "http://en.wikipedia.org/wiki/Abbeville_County,_South Carolina"
# ...
For a given county/state, the value is a list with one element for each data set provided to drJoin()
Note that joinedData has more subsets than byCountyAus
length(byCountyAus)
# [1] 2883
length(joinedData)
# [1] 3144
# Adams County Iowa only has geo and wiki data
joinedData[["county=Adams County|state=IA"]]$value
# $geo
# fips lon lat rMapState rMapCounty
# 1 19003 -94.70909 41.03166 iowa adams
#
# $wiki
# fips pop2013 href
# 1 19003 3894 http://en.wikipedia.org/wiki/Adams_County,_Iowa
noHousing <- joinedData %>%
  addTransform(function(x) is.null(x$housing)) %>%
  recombine(combRbind)
head(subset(noHousing, val == TRUE))
# county state val
# 2884 Adams County IA TRUE
# 2885 Adams County ND TRUE
# 2886 Antelope County NE TRUE
# 2887 Arthur County NE TRUE
# 2888 Ashley County AR TRUE
# 2889 Aurora County SD TRUE
tx <- addTransform(joinedData, function(x) is.null(x$housing))
noHousing <- recombine(tx, combRbind)
Note: the two versions are equivalent; the %>% pipe simply passes the result of one step as the first argument of the next
Suppose we only want to keep subsets of joinedData that have housing price data
joinedData <- drFilter(joinedData,
  function(x) {
    !is.null(x$housing)
  })
# joinedData should now have as many subsets as byCountyAus
length(joinedData)
# [1] 2883
library(trelliscope)
# establish a connection to a VDB located in a directory "housing_vdb"
vdbConn("housing_vdb", name = "shortcourse")
Plot list price vs. time using rbokeh
library(rbokeh)
timePanel <- function(x) {
  xlim <- c("2008-07-01", "2014-06-01")
  figure(ylab = "Price / Sq. Metre (AUD)",
    xlim = as.Date(xlim)) %>%
    ly_points(time, medListPriceMetre,
      data = x$housing,
      hover = list(time, medListPriceMetre))
}
# test it on a subset
timePanel(joinedData[[1]]$value)
makeDisplay(joinedData,
name = "list_vs_time_rbokeh",
desc = "List price over time for each county.",
panelFn = timePanel
)
view()
# to go directly to this display
view("list_vs_time_rbokeh")
Ideas: recreate the previous plot with your favorite plotting package, or try something new (histogram of prices, etc.)
library(lattice)
library(ggplot2)
latticeTimePanel <- function(x) {
  xyplot(medListPriceMetre + medSoldPriceMetre ~ time,
    data = x$housing, auto.key = TRUE,
    ylab = "Price / Metre (AUD)")
}
}
makeDisplay(joinedData,
name = "list_vs_time_lattice",
desc = "List price over time for each county.",
panelFn = latticeTimePanel
)
ggTimePanel <- function(x) {
  qplot(time, medListPriceMetre, data = x$housing,
    ylab = "Price / Metre (AUD)") +
    geom_point()
}
makeDisplay(joinedData,
name = "list_vs_time_ggplot2",
desc = "List price over time for each county.",
panelFn = ggTimePanel
)
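Another idea from the prompt above: a per-county histogram of list prices, sketched with rbokeh's ly_hist():
histPanel <- function(x) {
  p <- x$housing$medListPriceMetre
  figure(xlab = "List Price / Metre (AUD)") %>%
    ly_hist(p[!is.na(p)], breaks = 30)
}
# test on a subset, then create the display
histPanel(joinedData[[1]]$value)
makeDisplay(joinedData,
  name = "list_price_hist_rbokeh",
  desc = "Histogram of list price per square metre for each county.",
  panelFn = histPanel
)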
zillowCog <- function(x) {
  # build a string for a link to zillow
  st <- getSplitVar(x, "state")
  ct <- getSplitVar(x, "county")
  zillowString <- gsub(" ", "-", paste(ct, st))
  # return a list of cognostics
  list(
    slope = cog(x$slope, desc = "list price slope"),
    meanList = cogMean(x$housing$medListPriceMetre),
    meanSold = cogMean(x$housing$medSoldPriceMetre),
    lat = cog(x$geo$lat, desc = "county latitude"),
    lon = cog(x$geo$lon, desc = "county longitude"),
    wikiHref = cogHref(x$wiki$href, desc = "wiki link"),
    zillowHref = cogHref(
      sprintf("http://www.zillow.com/homes/%s_rb/", zillowString),
      desc = "zillow link")
  )
}
# test it on a subset
zillowCog(joinedData[[1]]$value)
# $slope
# time
# -0.0002323686
#
# $meanList
# [1] 1109.394
#
# $meanSold
# [1] NaN
#
# $lat
# [1] 34.23021
#
# $lon
# [1] -82.45851
#
# $wikiHref
# [1] "<a href=\"http://en.wikipedia.org/wiki/Abbeville_County,_South Carolina\" target=\"_blank\">link</a>"
#
# $zillowHref
# [1] "<a href=\"http://www.zillow.com/homes/Abbeville-County-SC_rb/\" target=\"_blank\">link</a>"
We simply add our cognostics function to makeDisplay()
makeDisplay(joinedData,
name = "list_vs_time_cog",
desc = "List price over time for each county.",
panelFn = timePanel,
cogFn = zillowCog
)
read.csv("taxidata/raw_csv/taxi.csv", nrows = 3)
# medal hack vendor ratelevel flag pickdatetime dropdatetime
# 1 hkm bvca VTS 1 2013-04-07 08:43:00 2013-04-07 08:47:00
# 2 vso vxga VTS 1 2013-09-21 00:08:00 2013-09-21 00:32:00
# 3 nho ukvb CMT 1 N 2013-04-30 02:10:34 2013-04-30 02:13:21
# passnum ridetimemet ridedis picklong picklat droplong droplat paytype
# 1 1 240 1.04 -74.00241 40.74785 -73.99725 40.73743 CSH
# 2 3 1440 6.69 -74.00109 40.73394 -73.96391 40.80785 CRD
# 3 1 166 0.70 -73.97559 40.74529 -73.97772 40.75217 CRD
# meter surcharge tax tip toll fare ridetime pickregion dropregion
# 1 5.0 0.0 0.5 0.00 0 5.50 240 man4 man2
# 2 24.0 0.5 0.5 3.50 0 28.50 1440 man2 man9
# 3 4.5 0.5 0.5 1.37 0 6.87 167 man6 GrandCentral
Variable | Description |
---|---|
medal | yellow-cab medallion ID (obfuscated) |
hack | driver's yellow cab license ID (obfuscated) |
vendor | cab communication system |
ratelevel | taximeter rate charging type |
flag | unknown |
pickdatetime | pick-up time of the ride |
dropdatetime | drop-off time of the ride |
passnum | number of passengers |
ridetimemet | ride time measured by the taximeter in seconds |
ridedis | ride distance measured by the taximeter in miles |
picklong | longitude of the pick-up location |
picklat | latitude of the pick-up location |
droplong | longitude of the drop-off location |
droplat | latitude of the drop-off location |
paytype | payment type: "CRD" "CSH" "DIS" "NOC" "UNK" |
meter | fare showing on meter in USD |
surcharge | extra fees in USD |
tax | tax in USD |
tip | tip in USD |
toll | tolls in USD |
fare | total charge in USD |
ridetime | ride time - drop-off time minus pick-up time, in seconds |
pickregion | pick-up region |
dropregion | drop-off region |
The pick up and drop off locations have been categorized into 12 districts plus three "hot spots" - Penn Station, Grand Central Station, and Central Park
Load packages and set up local "cluster"
# make sure you have set your working directory correctly
list.files()
# [1] "housing.R" "taxi.R" "taxidata"
library(datadr)
library(trelliscope)
library(parallel)
library(lubridate)
library(rbokeh)
# set the default backend to be R multicore with 4 cores
# feel free to replace 4 with 1,2,3
options(defaultLocalDiskControl = localDiskControl(makeCluster(4)))
# this steps through blocks of the input file (serially)
# and creates a ddf with one subset for each block
raw <- drRead.csv("taxidata/raw_csv/taxi.csv",
output = localDiskConn("taxidata/raw"))
raw
# Distributed data frame backed by 'kvLocalDisk' connection
# attribute | value
# ----------------+----------------------------------------------------------------
# names | medal(cha), hack(cha), vendor(cha), ratelevel(int), and 20 more
# nrow | [empty] call updateAttributes(dat) to get this value
# size (stored) | 41.89 MB
# size (object) | [empty] call updateAttributes(dat) to get this value
# # subsets | 20
# * Missing attributes: keys, splitSizeDistn, splitRowDistn, summary
list.files("taxidata/raw")
# [1] "_meta"
# [2] "0aee9b78301d7ec8998971363be87c03.Rdata"
# [3] "1937e7b7bc06d5dc1531976d1c355670.Rdata"
# [4] "23c80a31c0713176016e6e18d76a5f31.Rdata"
# [5] "2522027d230e3dfe02d8b6eba1fd73e1.Rdata"
# [6] "30986e242495ba4a216a86700f7ece29.Rdata"
# [7] "3f880ddf0d5f56067f2833ad76b3b226.Rdata"
# [8] "4a5d7d50676e6d0ea065f445d8a5539d.Rdata"
# [9] "5511291e2e560122bcb2e3261d58e289.Rdata"
# [10] "5e338704a8e069ebd8b38ca71991cf94.Rdata"
# [11] "6717f2823d3202449301145073ab8719.Rdata"
# [12] "709a7faa72cb6f3be79d683e234ccb25.Rdata"
# [13] "7260985235f3e5b1cd7097b1c768c19a.Rdata"
# [14] "90a7653d717dc1553ee564aa27b749b9.Rdata"
# [15] "a86d0670df7fb4f1da7b38943f5ee4e7.Rdata"
# [16] "c72d0198505540505acfb4b2b49911e6.Rdata"
# [17] "c8f1af0c7b420d99059a2c8d079c51bf.Rdata"
# [18] "db8e490a925a60e62212cefc7674ca02.Rdata"
# [19] "dbc09cba9fe2583fb01d63c70e1555a8.Rdata"
# [20] "e5b57f323c7b3719bbaaf9f96b260d39.Rdata"
# [21] "e8c6c490beabf8e7ab89dcd59a75f389.Rdata"
raw <- ddf(localDiskConn("taxidata/raw"))
# make sure it still looks the same
raw
# Distributed data frame backed by 'kvLocalDisk' connection
# attribute | value
# ----------------+----------------------------------------------------------------
# names | medal(cha), hack(cha), vendor(cha), ratelevel(int), and 20 more
# nrow | [empty] call updateAttributes(dat) to get this value
# size (stored) | 41.89 MB
# size (object) | [empty] call updateAttributes(dat) to get this value
# # subsets | 20
# * Missing attributes: keys, splitSizeDistn, splitRowDistn, summary
str(raw[[1]])
# List of 2
# $ key : num 6
# $ value:'data.frame': 50000 obs. of 24 variables:
# ..$ medal : chr [1:50000] "nae" "tal" "ehi" "kpe" ...
# ..$ hack : chr [1:50000] "badc" "erua" "lqva" "lbhb" ...
# ..$ vendor : chr [1:50000] "CMT" "CMT" "VTS" "CMT" ...
# ..$ ratelevel : int [1:50000] 1 1 1 1 1 1 1 1 1 1 ...
# ..$ flag : chr [1:50000] "N" "N" "" "N" ...
# ..$ pickdatetime: chr [1:50000] "2013-05-12 22:58:32" "2013-05-08 13:41:10" "2013-05-07 22:13:00" "2013-02-08 09:52:33" ...
# ..$ dropdatetime: chr [1:50000] "2013-05-12 23:08:05" "2013-05-08 13:50:26" "2013-05-07 22:16:00" "2013-02-08 10:03:19" ...
# ..$ passnum : int [1:50000] 1 2 1 1 1 1 1 6 1 3 ...
# ..$ ridetimemet : int [1:50000] 271 556 180 646 840 1080 493 120 300 1020 ...
# ..$ ridedis : num [1:50000] 1 1.3 0.85 2 3.4 1.54 0.9 0.05 1.15 6.28 ...
# ..$ picklong : num [1:50000] -74 -74 -74 -74 -74 ...
# ..$ picklat : num [1:50000] 40.8 40.8 40.7 40.8 40.7 ...
# ..$ droplong : num [1:50000] -74 -74 -74 -74 -74 ...
# ..$ droplat : num [1:50000] 40.8 40.7 40.7 40.7 40.8 ...
# ..$ paytype : chr [1:50000] "CRD" "CRD" "CSH" "CRD" ...
# ..$ meter : num [1:50000] 5.5 8 4.5 9.5 13 12 7 3.5 6 19.5 ...
# ..$ surcharge : num [1:50000] 0.5 0 0.5 0 0.5 0 0 0.5 0 0.5 ...
# ..$ tax : num [1:50000] 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 ...
# ..$ tip : num [1:50000] 1.3 1.7 0 2 2 2 0 0 1.2 0 ...
# ..$ toll : num [1:50000] 0 0 0 0 0 0 0 0 0 0 ...
# ..$ fare : num [1:50000] 7.8 10.2 5.5 12 16 14.5 7.5 4.5 7.7 20.5 ...
# ..$ ridetime : int [1:50000] 573 556 180 646 840 1080 493 120 300 1020 ...
# ..$ pickregion : chr [1:50000] "GrandCentral" "man4" "man3" "PennStation" ...
# ..$ dropregion : chr [1:50000] "PennStation" "man5" "man6" "man3" ...
* note: keys were assigned by drRead.csv() to be 1, 2, ...
nrow(raw)
# [1] NA
# Warning message:
# In nrow(raw) :
# Number of rows has not been computed - run updateAttributes on this object
To compute the missing attributes, call updateAttributes():
raw <- updateAttributes(raw)
raw
# Distributed data frame backed by 'kvLocalDisk' connection
# attribute | value
# ----------------+----------------------------------------------------------------
# names | medal(cha), hack(cha), vendor(cha), ratelevel(int), and 20 more
# nrow | 1000000
# size (stored) | 41.89 MB
# size (object) | 337.85 MB
# # subsets | 20
# * Other attributes: getKeys(), splitSizeDistn(), splitRowDistn(), summary()
nrow(raw)
# [1] 1000000
After updating, the full set of summary statistics is available:
summary(raw)
# medal hack vendor
# --------------------- --------------------- ------------------
# levels : 10000+ levels : 10000+ levels : 2
# missing : 0 missing : 0 missing : 0
# > freqTable head < > freqTable head < > freqTable head <
# fah : 163 ogtb : 109 VTS : 500811
# jyr : 157 vssa : 106 CMT : 499189
# rfp : 154 qdia : 102
# iet : 153 rwxb : 102
# --------------------- --------------------- ------------------
# ratelevel flag pickdatetime
# -------------------- ------------------ -----------------------
# missing : 0 levels : 3 levels : 10000+
# min : 0 missing : 0 missing : 0
# max : 6 > freqTable head < > freqTable head <
# mean : 1.003798 : 500946 2013-02-12 09:19:00 : 9
# std dev : 0.1100981 N : 487911 2013-03-02 18:55:00 : 9
# skewness : 30.84596 Y : 11143 2013-03-06 07:37:00 : 9
# kurtosis : 1035.261 2013-04-24 08:23:00 : 9
# -------------------- ------------------ -----------------------
# dropdatetime passnum ridetimemet
# ------------------------ ------------------- -------------------
# levels : 10000+ missing : 0 missing : 0
# missing : 0 min : 1 min : -10
# > freqTable head < max : 6 max : 4294739
# 2013-01-26 22:23:00 : 10 mean : 1.707755 mean : 741.4466
# 2013-01-25 22:27:00 : 9 std dev : 1.377656 std dev : 17647.23
# 2013-03-05 18:53:00 : 9 skewness : 2.016229 skewness : 242.322
# 2013-04-03 22:16:00 : 9 kurtosis : 5.836064 kurtosis : 58757.79
# ------------------------ ------------------- -------------------
# ridedis picklong picklat
# ------------------- --------------------- ---------------------
# missing : 0 missing : 0 missing : 0
# min : 0 min : -74.02251 min : 40.68776
# max : 250000 max : -73.91147 max : 40.87636
# mean : 2.434937 mean : -73.98106 mean : 40.75546
# std dev : 250.0045 std dev : 0.01697669 std dev : 0.02392806
# skewness : 999.9183 skewness : 0.3124089 skewness : 0.5125662
# kurtosis : 999891 kurtosis : 2.591296 kurtosis : 3.927271
# ------------------- --------------------- ---------------------
# droplong droplat paytype
# --------------------- --------------------- ------------------
# missing : 0 missing : 0 levels : 5
# min : -74.01847 min : 40.7009 missing : 0
# max : -73.91145 max : 40.87631 > freqTable head <
# mean : -73.98015 mean : 40.75685 CRD : 533697
# std dev : 0.01782197 std dev : 0.02586537 CSH : 462464
# skewness : 0.3684803 skewness : 0.7079717 NOC : 2030
# kurtosis : 2.725493 kurtosis : 4.416856 UNK : 1151
# --------------------- --------------------- ------------------
# meter surcharge tax
# ------------------- -------------------- ---------------------
# missing : 0 missing : 0 missing : 0
# min : -5 min : -1 min : -0.5
# max : 353 max : 7 max : 0.5
# mean : 10.29182 mean : 0.3180572 mean : 0.4997165
# std dev : 5.739544 std dev : 0.3656158 std dev : 0.01219302
# skewness : 2.641248 skewness : 0.6997366 skewness : -44.82106
# kurtosis : 43.3373 kurtosis : 2.413585 kurtosis : 2153.374
# ------------------- -------------------- ---------------------
# tip toll fare
# ------------------- --------------------- -------------------
# missing : 0 missing : 0 missing : 0
# min : 0 min : 0 min : -6
# max : 150 max : 40.75 max : 403
# mean : 1.097984 mean : 0.00499029 mean : 12.21265
# std dev : 1.489039 std dev : 0.1787053 std dev : 6.521503
# skewness : 8.299564 skewness : 48.87434 skewness : 2.666064
# kurtosis : 472.5249 kurtosis : 4429.004 kurtosis : 41.10018
# ------------------- --------------------- -------------------
# ridetime pickregion dropregion
# ------------------- ------------------ ------------------
# missing : 0 levels : 15 levels : 15
# min : -3100 missing : 0 missing : 0
# max : 16213 > freqTable head < > freqTable head <
# mean : 669.5678 man5 : 216790 man5 : 223404
# std dev : 435.7753 man8 : 150980 man8 : 146494
# skewness : 2.055302 man6 : 115442 man6 : 110488
# kurtosis : 19.57769 man4 : 107727 man4 : 105844
# ------------------- ------------------ ------------------
A good place to start is to look at the variable summaries (provided by updateAttributes)
pickft <- summary(raw)$pickregion$freqTable
head(pickft)
# value Freq
# 10 man5 216790
# 13 man8 150980
# 11 man6 115442
# 9 man4 107727
# 7 man2 105491
# 12 man7 93193
figure(ylim = rev(pickft$value),
ylab = NULL) %>%
ly_points(Freq, value,
data = pickft, hover = pickft)
We can extract data from the summaries
pick <- summary(raw)$pickregion$freqTable
drop <- summary(raw)$dropregion$freqTable
# make.groups() comes from the lattice package
library(lattice)
both <- make.groups(pick = pick, drop = drop)
figure() %>%
ly_bar(value, Freq, color = which,
data = both, position = "dodge") %>%
theme_axis("x",
major_label_orientation = 90)
Pick up and drop off distributions
# hack frequency table
hft <- summary(raw)$hack$freqTable
plot(sort(hft$Freq),
ylab = "Hack frequency (top/bottom 5000)")
summary(raw)$tip
# $nna
# [1] 0
# $stats
# $stats$mean
# [1] 1.097984
# $stats$var
# [1] 2.217236
# $stats$skewness
# [1] 8.299564
# $stats$kurtosis
# [1] 472.5249
# $range
# [1] 0 150
pc_quant <- drQuantile(raw,
var = "passnum", tails = 1)
head(pc_quant)
# fval q
# 1 0.000 1
# 2 0.005 1
# 3 0.010 1
# 4 0.015 1
# 5 0.020 1
# 6 0.025 1
figure() %>%
ly_points(fval, q, data = pc_quant)
Compute the distribution of the number of passengers
Over 70% of rides in Manhattan have a single passenger, etc.
rt_quant <- drQuantile(raw,
var = "ridetime", tails = 1)
figure(ylab = "ride time (minutes)") %>%
ly_points(fval, q / 60, data = rt_quant)
summary(raw)$ridedis$range
# [1] 0 250000
# a ride distance of 250k miles!
# do a log + 1 transformation
rd_quant <- drQuantile(raw,
var = "ridedis", tails = 1,
varTransFn = function(x) log10(x + 1))
figure(ylab = "ride distance log10(miles + 1)") %>%
ly_points(fval, q, data = rd_quant,
hover = rd_quant)
* note: it's a good idea to look at the range of a variable prior to computing its distribution
tip_quant <- drQuantile(raw, var = "tip",
by = "paytype", tails = 1,
probs = seq(0, 1, 0.05))
head(tip_quant)
# fval q paytype
# 1 0.00 0.0000000 CRD
# 2 0.05 0.7950795 CRD
# 3 0.10 1.0051005 CRD
# 4 0.15 1.0051005 CRD
# 5 0.20 1.0051005 CRD
# 6 0.25 1.0951095 CRD
figure(ylab = "tip amount (USD)",
legend_location = "top_left") %>%
ly_points(fval, q, color = paytype,
data = subset(tip_quant, q < 100))
pickup_latlon <- drHexbin("picklong", "picklat",
data = raw, xbins = 300, shape = 1.7)
hexbin::plot(pickup_latlon,
trans = log, inv = exp,
style = "colorscale",
xlab = "longitude", ylab = "latitude",
colramp = hexbin::LinOCS)
# get all records that went more than 100 miles
# or have negative ride time or fare
bad <- drSubset(raw, ridedis > 100 | ridetime < 0 | fare < 0)
head(bad)
# medal hack vendor ratelevel flag pickdatetime dropdatetime
# 1 tad bupa VTS 1 2013-08-25 14:47:10 2013-08-25 14:51:00
# 2 rbb qjyb VTS 1 2013-08-07 11:52:10 2013-08-07 11:55:00
# 3 aaq xyna VTS 1 2013-08-08 05:33:10 2013-08-08 05:37:00
# 4 ofs hlka VTS 1 2013-08-21 01:04:10 2013-08-21 01:06:00
# 5 jzq eiub CMT 1 N 2013-03-10 03:51:52 2013-03-10 03:00:12
# 6 yij hdab VTS 1 2013-08-14 20:23:10 2013-08-14 20:27:00
# passnum ridetimemet ridedis picklong picklat droplong droplat paytype
# 1 1 230 0.50 -73.97270 40.78086 -73.97875 40.78270 CSH
# 2 2 170 0.50 -73.98562 40.74410 -73.99101 40.74375 CSH
# 3 1 230 0.68 -73.99831 40.76085 -73.99741 40.75383 CSH
# 4 1 110 0.66 -73.96195 40.77056 -73.96514 40.76356 CSH
# 5 1 499 2.20 -73.97720 40.77970 -73.97684 40.75352 CSH
# 6 1 230 0.74 -73.94568 40.79301 -73.94023 40.78880 CSH
# meter surcharge tax tip toll fare ridetime pickregion dropregion
# 1 -4.5 0.0 -0.5 0 0 -5.0 230 man7 man7
# 2 -4.0 0.0 -0.5 0 0 -4.5 170 man5 man5
# 3 -5.0 -0.5 -0.5 0 0 -6.0 230 man4 man4
# 4 -4.0 -0.5 -0.5 0 0 -5.0 110 man8 man8
# 5 9.0 0.5 0.5 0 0 10.0 -3100 man7 GrandCentral
# 6 -5.0 -0.5 -0.5 0 0 -6.0 230 man11 man11
# function to convert string to POSIXct
char2datetime <- function(x)
  fast_strptime(as.character(x),
    format = "%Y-%m-%d %H:%M:%S", tz = "EST")
# get rid of records that were strange
# fix datetime variables and create pickup hour variable
rawtr <- addTransform(raw, function(x) {
  x <- subset(x, ridedis < 100 & ridetime >= 0 & fare >= 0)
  x$ratelevel <- as.character(x$ratelevel)
  x$pickdatetime <- char2datetime(x$pickdatetime)
  x$dropdatetime <- char2datetime(x$dropdatetime)
  x$pickhour <- format(x$pickdatetime, "%Y-%m-%d %H")
  x
})
# persist this transformation (need to put it in a new location on disk)
# and update its attributes
rawtr <- drPersist(rawtr, output = localDiskConn("taxidata/rawtr"))
rawtr <- updateAttributes(rawtr)
trns <- function(x) log2(x + 1)
amount_vs_dist <- drHexbin("ridedis", "fare", data = rawtr, by = "ratelevel",
xbins = 50, shape = 1, xTransFn = trns, yTransFn = trns)
panels <- lapply(amount_vs_dist, function(x) {
  figure(width = 300, height = 300,
    xlab = "log2(ride distance + 1)",
    ylab = "log2(fare + 1)") %>%
    ly_hexbin(x) %>%
    ly_abline(2.3, 0.842, type = 2) %>%
    ly_abline(h = log2(3 + 1), type = 2) %>%
    ly_abline(h = log2(17.8 + 1), type = 2) %>%
    ly_abline(h = log2(52 + 1), type = 2)
})
names(panels) <- names(amount_vs_dist)
panels <- panels[order(names(panels))]
grid_plot(panels, ncol = 4, same_axes = TRUE)
hourly <- drAggregate(~ pickhour, data = rawtr)
head(hourly)
# pickhour Freq
# 1 2012-12-31 19:00:00 145
# 2 2012-12-31 20:00:00 169
# 3 2012-12-31 21:00:00 156
# 4 2012-12-31 22:00:00 141
# 5 2012-12-31 23:00:00 82
# 6 2013-01-01 00:00:00 47
# turn result pickhour into POSIXct and order by it
hourly$pickhour <- as.POSIXct(hourly$pickhour, format = "%Y-%m-%d %H")
hourly <- hourly[order(hourly$pickhour),]
hourly <- hourly[complete.cases(hourly),]
Let's plot the result:
figure(width = 1000) %>%
ly_lines(hourly) %>%
ly_points(hourly, hover = hourly, alpha = 0) %>%
y_axis(log = 2) %>%
tool_pan(dimensions = "width") %>%
tool_wheel_zoom(dimensions = "width")
regionct <- drAggregate(~ pickregion + dropregion, data = rawtr)
head(regionct)
# pickregion dropregion Freq
# 145 man5 man5 66255
# 193 man8 man8 54641
# 148 man8 man5 29694
# 177 man7 man7 28660
# 146 man6 man5 28481
# 144 man4 man5 27110
# some division ideas
byPick <- divide(rawtr, by = c("pickregion"),
output = localDiskConn("taxidata/byPick"))
byPickDrop <- divide(rawtr, by = c("pickregion", "dropregion"),
output = localDiskConn("taxidata/byPickDrop"))
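As a quick example of what such a division enables, a sketch using the byPickDrop division above: mean tip on card-paid rides for every pick-up / drop-off pair, recombined into a data frame:
# per-subset mean tip for card payments, recombined with combRbind
tipByRegion <- byPickDrop %>%
  addTransform(function(x) mean(subset(x, paytype == "CRD")$tip)) %>%
  recombine(combRbind)
# region pairs with the highest average card tips
head(tipByRegion[order(-tipByRegion$val), ])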
When do we really need a massive data set vs. a sample?
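One option for working with a sample is drSample(), mentioned earlier but not yet demonstrated; a sketch, assuming its fraction argument keeps each subset with the given probability:
# keep roughly 10% of the subsets for quick local exploration
rawSample <- drSample(rawtr, fraction = 0.1)
length(rawSample)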
./tessera-emr.sh -n 10 -s s3://tessera-emr \
-m m3.xlarge -w m3.xlarge -d s3n://tessera-taxi/raw
# example
raw <- drRead.csv(hdfsConn("/tmp/s3data", type = "text"),
output = hdfsConn("/tmp/raw"))