Analysis and Visualization of Large Complex Data with Tessera
Purdue University
April 7th, 2016
Barret Schloerke
Deep Analysis of Large Complex Data
http://wombat2016.org/slides/hadley.pdf
Any or all of the following:
 Large number of records
 Many variables
 Complex data structures not readily put into tabular form of cases by variables
 Intricate patterns and dependencies that require complex models and methods of analysis
 Not i.i.d.!
Often complex data is more of a challenge than large data, but most large data sets are also complex
Goals for Analysis of Large Complex Data
 Work in a familiar high-level statistical programming environment
 Have access to the 1000s of statistical, machine learning, and visualization methods
 Thinking
 Minimize thinking about code
 Minimize thinking about distributed systems
 Maximize thinking about the data
 Be able to analyze large complex data with nearly as much flexibility and ease as small data
What is Tessera?
 tessera.io
 A set of high-level R interfaces for analyzing complex data, large and small
 Code is simple and consistent regardless of data size
 Powered by the statistical methodology Divide and Recombine (D&R)
 Provides access to 1000s of statistical, machine learning, and visualization methods
 Detailed, flexible, scalable visualization with Trelliscope
Tessera Environment
 Front end: two R packages, datadr & trelliscope
 Back ends: R, Hadoop, Spark, etc.
 R <-> back end bridges: RHIPE, SparkR, etc.
Back End Agnostic Interface
Divide and Recombine
 Specify meaningful, persistent divisions of the data
 Analytic or visual methods are applied independently to each subset of the divided data in embarrassingly parallel fashion
 Results are recombined to yield a statistically valid D&R result for the analytic method

The plyr "split-apply-combine" idea, applied across multiple machines
 http://vita.had.co.nz/papers/plyr.pdf
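The three D&R steps above can be sketched on a single machine with base R alone. This is a minimal illustration, not the datadr implementation: the toy data are made up, and averaging per-subset coefficients is just one simple recombination method chosen here as an assumption.

```r
# Toy data, made up for illustration (not from the talk).
set.seed(1)
toy <- data.frame(
  group = rep(c("a", "b", "c"), each = 50),
  x = rnorm(150)
)
toy$y <- 2 * toy$x + rnorm(150, sd = 0.1)

# Divide: one subset per value of the conditioning variable.
subsets <- split(toy, toy$group)

# Apply: fit the same model independently to each subset.
# Each fit touches only its own subset, so this step could run in parallel.
fits <- lapply(subsets, function(d) coef(lm(y ~ x, data = d)))

# Recombine: average the per-subset coefficients
# (an assumed, simple recombination for this sketch).
drCoef <- colMeans(do.call(rbind, fits))
print(drCoef)
```

Because no subset depends on another, the `lapply` step is the part a distributed back end would spread over many machines.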
Visual Recombination: Trelliscope

Most tools and approaches for big data either
 Summarize a lot of data and make a single plot
 Are very specialized for a particular domain
 Summaries are critical
 But we must be able to visualize complex data in detail even when they are large!
 Trelliscope does this by building on Trellis Display
Trellis Display
 Tufte, Edward (1983). The Visual Display of Quantitative Information
 Data are split into meaningful subsets, usually conditioning on variables of the dataset
 A visualization method is applied to each subset
 The image for each subset is called a "panel"
 Panels are arranged in an array of rows, columns, and pages, resembling a garden trellis
Scaling Trellis

Big data lends itself nicely to the idea of small multiples
 Small multiples: a series of similar graphs or charts using the same scale and axes, allowing them to be easily compared
 Typically "big data" is big because it is made up of collections of smaller data from many subjects, sensors, locations, time periods, etc.
 Potentially thousands or millions of panels
 We can create millions of plots, but we will never be able to (or want to) view all of them!
Scaling Trellis

To scale, we can apply the same steps as in Trellis display, with one extra step:
 Data are split into meaningful subsets, usually conditioning on variables of the dataset
 A visualization method is applied to each subset
 A set of cognostic metrics is computed for each subset
 Panels are arranged in an array of rows, columns, and pages, resembling a garden trellis, with the arrangement being specified through interactions with the cognostics
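The extra cognostics step can be sketched in base R as follows. This is only an illustration of the idea, not how Trelliscope stores or uses cognostics: the toy data, the choice of metrics (`slope`, `meanValue`), and the "show the two steepest first" rule are all assumptions made for the example.

```r
# Toy data (made up): four subsets with different underlying slopes.
set.seed(42)
slopes <- c(a = -1, b = 0.5, c = 3, d = 0)
toy <- do.call(rbind, lapply(names(slopes), function(g) {
  tm <- 1:20
  data.frame(group = g, time = tm, value = slopes[[g]] * tm + rnorm(20),
             stringsAsFactors = FALSE)
}))

# Split into subsets, one per future panel.
subsets <- split(toy, toy$group)

# Compute one row of cognostics (per-subset metrics) for each subset.
cogs <- do.call(rbind, lapply(names(subsets), function(g) {
  d <- subsets[[g]]
  data.frame(
    group = g,
    slope = unname(coef(lm(value ~ time, data = d))[2]),
    meanValue = mean(d$value),
    stringsAsFactors = FALSE
  )
}))

# Use the cognostics to decide which panels to look at first:
# here, sort by slope and keep the two steepest.
panelOrder <- cogs$group[order(cogs$slope, decreasing = TRUE)]
firstPage <- head(panelOrder, 2)
print(firstPage)
```

With millions of subsets the same table of metrics is what makes sorting and filtering panels feasible, since only the selected panels need to be drawn.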
Trelliscope

Extension of multi-panel display systems, e.g. Trellis Display or faceting in ggplot

Number of panels can be very large (in the millions)

Panels can be interactively navigated through the use of cognostics (each subset's metrics)

Provides flexible, scalable, detailed visualization of large, complex data
Trelliscope is Scalable
 6 months of high-frequency trading data
 Hundreds of gigabytes of data
 Split by stock symbol and day
 Nearly 1 million subsets
Tessera
Tessera is an implementation of D&R built on R
 Front end R packages that can tie to scalable back ends:
 trelliscope: visual recombination through interactive multi-panel exploration with cognostics
 datadr: provides an interface to data operations, division, and analytical recombination methods
datadr vs. dplyr
 dplyr
 "A fast, consistent tool for working with data frame like objects, both in memory and out of memory"
 Provides a simple interface for quickly performing a wide variety of operations on data frames
 datadr is often mistaken for a dplyr alternative or competitor
 There are some similarities:
 Both are extensible interfaces for data analysis / manipulation
 Both have a flavor of split-apply-combine
datadr vs. dplyr

Back end architecture:
 dplyr ties to SQL-like back ends
 datadr ties to key-value stores

Scalability:
 At scale, dplyr is a wrapper to SQL (all computations must be translatable to SQL operations, no R code)
 datadr's fundamental algorithm is MapReduce, which scales to extremely large volumes and allows ad hoc R code to be applied

Flexibility:
 dplyr data must be tabular, while datadr data can be any R data structure

Speed:
 dplyr will probably always be faster because its focused set of operations is optimized and usually applied against indexed databases, whereas MapReduce always processes the entire data set
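To make the key-value and MapReduce points concrete, here is a small single-machine sketch in base R. It is not datadr's API or its MapReduce engine; the sensor data and the helper names (`kvPairs`, `mapped`, `grouped`, `reduced`) are invented for illustration. It shows values that are arbitrary R objects rather than table rows, with ad hoc R code in both the map and the reduce step.

```r
# Key-value pairs whose values are arbitrary R objects (toy values, made up).
kvPairs <- list(
  list(key = "sensorA", value = list(readings = c(1, 2, 3), meta = "indoor")),
  list(key = "sensorB", value = list(readings = c(10, 20), meta = "outdoor")),
  list(key = "sensorA", value = list(readings = c(4, 5), meta = "indoor"))
)

# Map: emit a partial summary (sum and count) for every pair.
mapped <- lapply(kvPairs, function(kv) {
  list(key = kv$key,
       value = c(sum = sum(kv$value$readings), n = length(kv$value$readings)))
})

# Shuffle: group the emitted values by key.
keys <- vapply(mapped, function(kv) kv$key, character(1))
grouped <- split(lapply(mapped, function(kv) kv$value), keys)

# Reduce: add up the partial summaries per key and turn them into a mean.
reduced <- lapply(grouped, function(vals) {
  total <- Reduce(`+`, vals)
  unname(total["sum"] / total["n"])
})
print(reduced)
```

Note that every pair is visited in the map step, which mirrors the speed trade-off described above: the whole data set is processed, with no index to skip ahead.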
Differences
dplyr is great for subsetting and aggregating medium-sized tabular data
datadr is great for scalable, deep analysis of large, complex data
For more information (docs, code, papers, user group, blog, etc.): http://tessera.io
Example Code
library(magrittr); library(dplyr); library(tidyr); library(ggplot2)
library(trelliscope)
library(datadr)
library(housingData)
# divide housing data by county and state,
# keeping only subsets with more than 10 rows
divide(housing, by = c("county", "state")) %>%
  drFilter(function(x){nrow(x) > 10}) ->
  # drFilter(function(x){nrow(x) > 120}) ->
  byCounty
# calculate the min and max y range across all subsets
byCounty %>%
  drLapply(function(x){
    range(x[,c("medListPriceSqft", "medSoldPriceSqft")], na.rm = TRUE)
  }) %>%
  as.list() %>%
  lapply("[[", 2) %>%  # keep the value of each key-value pair
  unlist() %>%
  range() ->
  yRanges
# for every subset 'x', calculate this information
priceCog <- function(x) {
  # e.g. "Benton County" and "WA" become "Benton-County-WA"
  zillowString <- gsub(" ", "-", do.call(paste, getSplitVars(x)))
  list(
    slopeList = cog(
      coef(lm(medListPriceSqft ~ time, data = x))[2],
      desc = "list price slope"
    ),
    meanList = cogMean(x$medListPriceSqft),
    meanSold = cogMean(x$medSoldPriceSqft),
    nObsList = cog(
      length(which(!is.na(x$medListPriceSqft))),
      desc = "number of non-NA list prices"
    ),
    zillowHref = cogHref(
      sprintf("http://www.zillow.com/homes/%s_rb/", zillowString),
      desc = "zillow link"
    )
  )
}
# for every subset 'x', generate this plot
latticePanel <- function(x) {
  x %>%
    select(time, medListPriceSqft, medSoldPriceSqft) %>%
    # reshape to long form; '-time' keeps time as its own column
    gather(key = "variable", value = "value", medListPriceSqft, medSoldPriceSqft, -time) %>%
    ggplot(aes(x = time, y = value, color = variable)) +
    geom_smooth() +
    geom_point() +
    ylim(yRanges) +
    labs(y = "Price / Sq. Ft.") +
    theme(legend.position = "bottom")
}
# make this display
makeDisplay(
  byCounty,
  group = "fields",
  panelFn = latticePanel,
  cogFn = priceCog,
  name = "list_vs_time_ggplot",
  desc = "List and sold price over time w/ggplot2",
  conn = vdbConn("vdb", autoYes = TRUE)
)
# make a second display, this time with a linear model fit
latticePanelLM <- function(x) {
  x %>%
    select(time, medListPriceSqft, medSoldPriceSqft) %>%
    # reshape to long form; '-time' keeps time as its own column
    gather(key = "variable", value = "value", medListPriceSqft, medSoldPriceSqft, -time) %>%
    ggplot(aes(x = time, y = value, color = variable)) +
    geom_smooth(method = "lm") +
    geom_point() +
    ylim(yRanges) +
    labs(y = "Price / Sq. Ft.") +
    theme(legend.position = "bottom")
}
makeDisplay(
  byCounty,
  group = "fields",
  panelFn = latticePanelLM,
  cogFn = priceCog,
  name = "list_vs_time_ggplot_lm",
  desc = "List and sold price over time w/ggplot2 with lm line",
  conn = vdbConn("vdb")
)
# open the viewer
view()
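Before building a display over every subset, it can help to dry-run the per-subset logic on one small data frame. The sketch below mirrors what priceCog computes using base R only, so it runs without trelliscope or housingData. Everything in it is an assumption for illustration: `priceCogBase` and `splitVars` are hypothetical stand-ins (the latter for `getSplitVars(x)`), the toy values are made up, and the means drop NA values on the assumption that this is the behavior wanted from `cogMean`.

```r
# Toy subset (values made up) using the column names from the example above.
x <- data.frame(
  time = as.Date("2010-01-01") + 30 * (0:5),
  medListPriceSqft = c(100, 102, NA, 106, 108, 110),
  medSoldPriceSqft = c(95, 97, 99, NA, NA, 105)
)
# Hypothetical stand-in for getSplitVars(x).
splitVars <- list(county = "Benton County", state = "WA")

# Base R analogue of priceCog: returns plain values instead of cog objects.
priceCogBase <- function(x, splitVars) {
  zillowString <- gsub(" ", "-", do.call(paste, splitVars))
  list(
    slopeList = unname(coef(lm(medListPriceSqft ~ time, data = x))[2]),
    meanList = mean(x$medListPriceSqft, na.rm = TRUE),
    meanSold = mean(x$medSoldPriceSqft, na.rm = TRUE),
    nObsList = sum(!is.na(x$medListPriceSqft)),
    zillowHref = sprintf("http://www.zillow.com/homes/%s_rb/", zillowString)
  )
}

cogs <- priceCogBase(x, splitVars)
str(cogs)
```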
More Information
 website: http://tessera.io
 code: http://github.com/tesseradata
 @TesseraIO
 Google user group

Try it out
 If you have some applications in mind, give it a try!
 You don’t need big data or a cluster to use Tessera
 Ask us for help, give us feedback
Original Slides: http://slides.com/hafen/tesseraqut2016