Analysis and Visualization of Large Complex Data with Tessera

Spring Research Conference
Barret Schloerke
Purdue University
May 25th, 2016
Purdue University
- PhD Candidate in Statistics (4th Year)
- Dr. William Cleveland and Dr. Ryan Hafen
- Research in large data visualization using R
San Francisco startup (1.5 years)
- Front-end engineer: CoffeeScript / Node.js
Iowa State University
- B.S. in Computer Engineering
- Research in statistical data visualization with R
- Dr. Di Cook, Dr. Hadley Wickham, and Dr. Heike Hofmann
Big Data Deserves a Big Screen

"Big Data"
- Great buzzword!
- But imprecise when put to action
- Needs a floating definition
- Small Data: in memory
- Medium Data: single machine
- Large Data: multiple machines
Large and Complex Data
(Any / all of the following)
- Large number of records
- Large number of variables
- Complex data structures not readily put into tabular form
- Intricate patterns and dependencies
- Require complex models and methods of analysis
- Not i.i.d.!
Often, complex data is more of a challenge than large data, but most large data sets are also complex
Large Data Computation
Computational analysis performance also depends on
- Computational complexity of methods used
- Issue for all sizes of data
- Hardware computing power
- More machines ≈ more power

Divide and Recombine (D&R)
- Statistical Approach for High Performance Computing for Data Analysis
- Specify meaningful, persistent divisions of the data
- Analytic or visual methods are applied independently to each subset of the divided data in embarrassingly parallel fashion
- No communication between subsets
- Results are recombined to yield a statistically valid D&R result for the analytic method
- Like the plyr "split apply combine" idea (Dr. Wickham), but using multiple machines
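The divide, apply, and recombine steps listed above can be sketched on a single machine in base R. This is only an illustration, not the datadr API: the simulated data, the grouping variable, and the choice of averaging coefficients as the recombination method are all assumptions made for the example.

```r
# Divide and Recombine sketch in base R (illustration only, not the datadr API).
set.seed(1)

# Hypothetical data: 4 groups that share a true slope of 2.
n <- 400
dat <- data.frame(
  group = rep(c("a", "b", "c", "d"), each = n / 4),
  x     = runif(n)
)
dat$y <- 1 + 2 * dat$x + rnorm(n, sd = 0.1)

# Divide: one subset per group.
subsets <- split(dat, dat$group)

# Apply: fit the same model to each subset independently.
# No subset needs information from any other, so lapply() could be
# swapped for a parallel equivalent such as parallel::mclapply().
fits <- lapply(subsets, function(s) coef(lm(y ~ x, data = s)))

# Recombine: average the per-subset coefficients (one simple choice).
dr_coef <- colMeans(do.call(rbind, fits))

# Compare against the fit on all of the data at once.
full_coef <- coef(lm(y ~ x, data = dat))
print(rbind(divide_recombine = dr_coef, all_data = full_coef))
```

The recombined estimate approximates the all-data fit rather than reproducing it exactly; how close it gets depends on the division and on the recombination method chosen.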
Divide and Recombine

What is Tessera?
- A set of high-level R interfaces for analyzing complex data, whether small, medium, or large
- Powered by the statistical methodology of Divide & Recombine
- Code is simple and consistent regardless of size
- Provides access to 1000s of statistical, machine learning, and visualization methods
- Detailed, flexible, scalable visualization with Trelliscope

Tessera Environment
- User Interface: two R packages, datadr & trelliscope
- Data Interface: Rhipe
- Can use many different data back ends: R, Hadoop, Spark, etc.
- R <-> backend bridges: Rhipe, SparkR, etc.

Data Back End: Rhipe
- R Hadoop Interface Programming Environment
- R package that communicates with Hadoop
- Hadoop
- Built to handle Large Data
- Already does distributed Divide & Recombine
- Saves data as R objects

Front End: datadr
- R package
- Interface to small, medium, and large data
- Analyst provides
- divisions
- analytics methods
- recombination method
- Protects users from the ugly details of distributed data
- less time thinking about systems
- more time thinking about data
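To make the three analyst-provided pieces concrete, here is a minimal base R sketch. The helper `divide_apply_recombine()` and its argument names are hypothetical and are not part of datadr; they only mirror the shape of the interface (division, analytic method, recombination method) on a small built-in data set.

```r
# Hypothetical helper, NOT part of datadr: it only mirrors the three
# pieces the analyst supplies.
divide_apply_recombine <- function(data, by, analytic, recombine) {
  subsets <- split(data, data[by], drop = TRUE)  # divisions
  results <- lapply(subsets, analytic)           # analytic method, per subset
  recombine(results)                             # recombination method
}

# Usage: mean mpg for every (cyl, am) subset of mtcars,
# recombined into a single named vector.
out <- divide_apply_recombine(
  data      = mtcars,
  by        = c("cyl", "am"),
  analytic  = function(s) mean(s$mpg),
  recombine = function(res) unlist(res)
)
print(out)
```

Because the caller never touches the subsets directly, the same call could in principle run against a different storage back end without changing the analysis code, which is the point the slide makes about datadr.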

datadr vs. dplyr
- dplyr
- "A fast, consistent tool for working with data frame like objects, both in memory and out of memory"
- Provides a simple interface for quickly performing a wide variety of operations on data frames
- Built for data.frames
- Similarities
- Both are extensible interfaces for data analysis / manipulation
- Both have a flavor of split-apply-combine
- datadr is often mistaken for a dplyr alternative or competitor
- Not true!
dplyr is great for subsetting, aggregating up to medium tabular data
datadr is great for scalable deep analysis of large, complex data
Visual Recombination: Trelliscope
Most tools and approaches for big data either
- Summarize a lot of data and make a single plot
- Are very specialized for a particular domain
- Summaries are critical...
- But we must be able to visualize complex data in detail even when they are large!
- Trelliscope does this by building on Trellis Display
Trellis Display
- Tufte, Edward (1983). The Visual Display of Quantitative Information
- Data are split into meaningful subsets, usually conditioning on variables of the dataset
- A visualization method is applied to each subset
- The image for each subset is called a "panel"
- Panels are arranged in an array of rows, columns, and pages, resembling a garden trellis
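As a small illustration of these steps, the lattice package (an R implementation of Trellis Display) can condition a scatterplot on a variable and lay the panels out in a grid. The data set and variables below are chosen only for the example.

```r
library(lattice)  # ships with standard R distributions

# One panel per level of the conditioning variable (number of cylinders).
# All panels share the same axes, so they can be compared directly.
p <- xyplot(
  mpg ~ wt | factor(cyl),
  data   = mtcars,
  layout = c(3, 1),            # 3 columns, 1 row, 1 page
  xlab   = "Weight (1000 lbs)",
  ylab   = "Miles per gallon"
)
print(p)  # draws the trellis display on the active graphics device
```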

Scaling Trellis
Big data lends itself nicely to the idea of small multiples
- small multiple: series of similar graphs or charts using the same scale + axes, allowing them to be easily compared
- Typically "big data" is big because it is made up of collections of smaller data from many subjects, sensors, locations, time periods, etc.
- Potentially thousands or millions of panels
- We can create millions of plots, but we will never be able to (or want to) view all of them!
Scaling Trellis
To scale, we can apply the same steps as in Trellis display, with one extra step:
- Data are split into meaningful subsets, usually conditioning on variables of the dataset
- A visualization method is applied to each subset
- A set of cognostic metrics is computed for each subset
- Panels are arranged in an array of rows, columns, and pages, resembling a garden trellis, with the arrangement being specified through interactions with the cognostics
Extension of multi-panel display systems, e.g. Trellis Display or faceting in ggplot
Number of panels can be very large (in the millions)
Panels can be interactively navigated through the use of cognostics (each subset's metrics)
Provides flexible, scalable, detailed visualization of large, complex data
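The role of cognostics can be sketched in base R: compute a few metrics per subset, then use them to decide which panels are worth drawing. This is an illustration only and does not use the trelliscope API; the data set, the metric names, and the rule of showing the six fastest-growing subjects are assumptions made for the example.

```r
# Cognostics sketch in base R (illustration only, not the trelliscope API).
# Built-in data: chick weights over time; one subset (panel) per chick.
cw <- as.data.frame(ChickWeight)
subsets <- split(cw, cw$Chick, drop = TRUE)

# One row of cognostics (per-subset metrics) for each panel.
cogs <- do.call(rbind, lapply(names(subsets), function(id) {
  s <- subsets[[id]]
  data.frame(
    chick       = id,
    n_obs       = nrow(s),
    mean_weight = mean(s$weight),
    growth_rate = unname(coef(lm(weight ~ Time, data = s))[2]),
    stringsAsFactors = FALSE
  )
}))

# Use the cognostics to choose which panels to view:
# here, the 6 fastest-growing chicks, ordered by growth rate.
to_view <- head(cogs[order(cogs$growth_rate, decreasing = TRUE), ], 6)
print(to_view)

# Draw only the selected panels in a 2 x 3 array with a shared y axis.
op <- par(mfrow = c(2, 3))
for (id in to_view$chick) {
  s <- subsets[[id]]
  plot(s$Time, s$weight, main = paste("Chick", id),
       xlab = "Time (days)", ylab = "Weight (g)",
       ylim = range(cw$weight))
}
par(op)
```

With millions of subsets the same idea applies: the cognostics table stays small enough to sort and filter interactively, and only the panels it selects are ever rendered.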

Trelliscope is Scalable
- 6 months of high frequency trading data
- Hundreds of gigabytes of data
- Split by stock symbol and day
- Nearly 1 million subsets
More Information
For more information (docs, code, papers, user group, blog, etc.):
- website:
- code:
- @TesseraIO
- Google user group
Try it out
- If you have some applications in mind, give it a try!
- You don’t need big data or a cluster to use Tessera
- Ask us for help, give us feedback
Example Code
library(magrittr); library(dplyr); library(tidyr); library(ggplot2)
library(datadr); library(trelliscope)
library(housingData)  # assumed source of the 'housing' data set

# divide housing data by county and state,
# keeping only subsets with more than 10 rows
byCounty <- divide(housing, by = c("county", "state")) %>%
  drFilter(function(x) nrow(x) > 10)
  # stricter alternative: drFilter(function(x) nrow(x) > 120)

# calculate the min and max y range across all subsets
yRanges <- byCounty %>%
  drLapply(function(x) {
    range(x[, c("medListPriceSqft", "medSoldPriceSqft")], na.rm = TRUE)
  }) %>%
  as.list() %>%        # list of key-value pairs
  lapply("[[", 2) %>%  # keep the value (the range) of each pair
  unlist() %>%
  range()

# for every subset 'x', calculate this information (the cognostics)
priceCog <- function(x) {
  # assumption: the split variables (county, state) are pasted together
  # before spaces are replaced; the source line is incomplete here
  zillowString <- gsub(" ", "-", do.call(paste, getSplitVars(x)))
  list(
    slopeList = cog(
      coef(lm(medListPriceSqft ~ time, data = x))[2],
      desc = "list price slope"
    ),
    meanList = cogMean(x$medListPriceSqft),
    meanSold = cogMean(x$medSoldPriceSqft),
    nObsList = cog(
      sum(!is.na(x$medListPriceSqft)),  # assumption: count of non-NA list prices
      desc = "number of non-NA list prices"
    ),
    zillowHref = cogHref(
      sprintf("", zillowString),  # URL format string is missing from the source
      desc = "zillow link"
    )
  )
}

# for every subset 'x', generate this plot
latticePanel <- function(x) {
  x %>%
    select(time, medListPriceSqft, medSoldPriceSqft) %>%
    gather(key = "variable", value = "value", medListPriceSqft, medSoldPriceSqft) %>%
    ggplot(aes(x = time, y = value, color = variable)) +
    geom_smooth() +
    geom_point() +
    ylim(yRanges) +
    labs(y = "Price / Sq. Ft.") +
    theme(legend.position = "bottom")
}

# make this display
makeDisplay(
  byCounty,
  group = "fields",
  panelFn = latticePanel,
  cogFn = priceCog,
  name = "list_vs_time_ggplot",
  desc = "List and sold price over time w/ggplot2",
  conn = vdbConn("vdb", autoYes = TRUE)
)

# make a second display, this time with a linear model fit in each panel
latticePanelLM <- function(x) {
  x %>%
    select(time, medListPriceSqft, medSoldPriceSqft) %>%
    gather(key = "variable", value = "value", medListPriceSqft, medSoldPriceSqft) %>%
    ggplot(aes(x = time, y = value, color = variable)) +
    geom_smooth(method = "lm") +
    geom_point() +
    ylim(yRanges) +
    labs(y = "Price / Sq. Ft.") +
    theme(legend.position = "bottom")
}

makeDisplay(
  byCounty,
  group = "fields",
  panelFn = latticePanelLM,
  cogFn = priceCog,
  name = "list_vs_time_ggplot_lm",
  desc = "List and sold price over time w/ggplot2 with lm line",
  conn = vdbConn("vdb")
)
