Analysis and Visualization of Large Complex Data with Tessera
Spring Research Conference
Barret Schloerke
Purdue University
May 25th, 2016
Background

Purdue University
 PhD Candidate in Statistics (4th Year)
 Dr. William Cleveland and Dr. Ryan Hafen
 Research in large data visualization using R
 www.tessera.io

Metamarkets.com  1.5 years
 San Francisco startup
 Front end engineer  coffee script / node.js

Iowa State University
 B.S. in Computer Engineering

Research in statistical data visualization with R
 Dr. Di Cook, Dr. Hadley Wickham, and Dr. Heike Hofmann
Big Data Deserves a Big Screen
"Big Data"
 Great buzzword!
 But imprecise when put to action
 Needs a floating definition
 Small Data
 In memory
 Medium Data
 Single machine
 Large Data
 Multiple machines
 Small Data
Large and Complex Data
 Large number of records
 Large number of variables
 Complex data structures not readily put into tabular form
 Intricate patterns and dependencies
 Require complex models and methods of analysis
 Not i.i.d.!
Often, complex data is more of a challenge than large data, but most large data sets are also complex
(Any / all of the following)
Large Data Computation

Computational analysis performance also depends on
 Computational complexity of methods used
 Issue for all sizes of data
 Hardware computing power
 More machines ≈ more power
 Computational complexity of methods used
Divide and Recombine (D&R)
 Statistical Approach for High Performance Computing for Data Analysis
 Specify meaningful, persistent divisions of the data
 Analytic or visual methods are applied independently to each subset of the divided data in embarrassingly parallel fashion
 No communication between subsets
 Results are recombined to yield a statistically valid D&R result for the analytic method

plyr "split apply combine" idea, but using multiple machines
 Dr. Wickham: http://vita.had.co.nz/papers/plyr.pdf
Divide and Recombine
What is Tessera?
 tessera.io
 A set high level R interfaces for analyzing complex data
for small, medium, and large data  Powered by statistical methodology of Divide & Recombine
 Code is simple and consistent regardless of size
 Provides access to 1000s of statistical, machine learning, and visualization methods
 Detailed, flexible, scalable visualization with Trelliscope
Tessera Environment
 User Interface: two R packages, datadr & trelliscope
 Data Interface: Rhipe
 Can use many different data back ends: R, Hadoop, Spark, etc.
 R <> backend bridges: Rhipe, SparkR, etc.
Tessera
Computing
Location
{
{
{
Data Back End: Rhipe
 R Hadoop Interface Programming Environment
 R package that communicates with Hadoop
 Hadoop
 Built to handle Large Data
 Already does distributed Divide & Recombine
 Saves data as R objects
Front End: datadr
 R package
 Interface to small, medium, and large data
 Analyst provides
 divisions
 analytics methods
 recombination method
 Protects users from the ugly
details of distributed data less time thinking
about systems  more time thinking
about data
 less time thinking
datadr vs. dplyr
 dplyr
 "A fast, consistent tool for working with data frame like objects, both in memory and out of memory"
 Provides a simple interface for quickly performing a wide variety of operations on data frames
 Built for data.frames
 Similarities
 Both are extensible interfaces for data anlaysis / manipulation
 Both have a flavor of splitapplycombine
 Often datadr is confused as a dplyr alternative or competitor
 Not true!
dplyr is great for subsetting, aggregating up to medium tabular data
datadr is great for scalable deep analysis of large, complex data
Visual Recombination: Trelliscope
 www.tessera.io

Most tools and approaches for big data either
 Summarize lot of data and make a single plot
 Are very specialized for a particular domain
 Summaries are critical...
 But we must be able to visualize complex data in detail even when they are large!
 Trelliscope does this by building on Trellis Display
Trellis Display
 Tufte, Edward (1983). Visual Display of Quantitative Information
 Data are split into meaningful subsets, usually conditioning on variables of the dataset
 A visualization method is applied to each subset
 The image for each subset is called a "panel"
 Panels are arranged in an array of rows, columns, and pages, resembling a garden trellis
Scaling Trellis

Big data lends itself nicely to the idea of small multiples
 small multiple: series of similar graphs or charts using the same scale + axes, allowing them to be easily compared
 Typically "big data" is big because it is made up of collections of smaller data from many subjects, sensors, locations, time periods, etc.
 Potentially thousands or millions of panels
 We can create millions of plots, but we will never be able to (or want to) view all of them!
Scaling Trellis

To scale, we can apply the same steps as in Trellis display, with one extra step:
 Data are split into meaningful subsets, usually conditioning on variables of the dataset
 A visualization method is applied to each subset
 A set of cognostic metrics is computed for each subset
 Panels are arranged in an array of rows, columns, and pages, resembling a garden trellis, with the arrangement being specified through interactions with the cognostics
Trelliscope

Extension of multipanel display systems, e.g. Trellis Display or faceting in ggplot

Number of panels can be very large (in the millions)

Panels can be interactively navigated through the use of cognostics (each subset's metrics)

Provides flexible, scalable, detailed visualization of large, complex data
Trelliscope is Scalable
 6 months of high frequency trading data
 Hundreds of gigabytes of data
 Split by stock symbol and day
 Nearly 1 million subsets
For more information (docs, code, papers, user group, blog, etc.): http://tessera.io
More Information
 website: http://tessera.io
 code: http://github.com/tesseradata
 @TesseraIO
 Google user group

Try it out
 If you have some applications in mind, give it a try!
 You don’t need big data or a cluster to use Tessera
 Ask us for help, give us feedback
Example Code
library(magrittr); library(dplyr); library(tidyr); library(ggplot2)
library(trelliscope)
library(datadr)
library(housingData)
# divide housing data by county and state
divide(housing, by = c("county", "state")) %>%
drFilter(function(x){nrow(x) > 10}) >
# drFilter(function(x){nrow(x) > 120}) >
byCounty
# calculate the min and max y range
byCounty %>%
drLapply(function(x){
range(x[,c("medListPriceSqft", "medSoldPriceSqft")], na.rm = TRUE)
}) %>%
as.list() %>%
lapply("[[", 2) %>%
unlist() %>%
range() >
yRanges
# for every subset 'x', calculate this information
priceCog < function(x) {
zillowString < gsub(" ", "", do.call(paste, getSplitVars(x)))
list(
slopeList = cog(
coef(lm(medListPriceSqft ~ time, data = x))[2],
desc = "list price slope"
),
meanList = cogMean(x$medListPriceSqft),
meanSold = cogMean(x$medSoldPriceSqft),
nObsList = cog(
length(which(!is.na(x$medListPriceSqft))),
desc = "number of nonNA list prices"
),
zillowHref = cogHref(
sprintf("http://www.zillow.com/homes/%s_rb/", zillowString),
desc = "zillow link"
)
)
}
# for every subset 'x', generate this plot
latticePanel < function(x) {
x %>%
select(time, medListPriceSqft, medSoldPriceSqft) %>%
gather(key = "variable", value = "value", medListPriceSqft, medSoldPriceSqft, time) %>%
ggplot(aes(x = time, y = value, color = variable)) +
geom_smooth() +
geom_point() +
ylim(yRanges) +
labs(y = "Price / Sq. Ft.") +
theme(legend.position = "bottom")
}
# make this display
makeDisplay(
byCounty,
group = "fields",
panelFn = latticePanel,
cogFn = priceCog,
name = "list_vs_time_ggplot",
desc = "List and sold priceover time w/ggplot2",
conn = vdbConn("vdb", autoYes = TRUE)
)
# make a second display
latticePanelLM < function(x) {
x %>%
select(time, medListPriceSqft, medSoldPriceSqft) %>%
gather(key = "variable", value = "value", medListPriceSqft, medSoldPriceSqft, time) %>%
ggplot(aes(x = time, y = value, color = variable)) +
geom_smooth(method = "lm") +
geom_point() +
ylim(yRanges) +
labs(y = "Price / Sq. Ft.") +
theme(legend.position = "bottom")
}
makeDisplay(
byCounty,
group = "fields",
panelFn = latticePanelLM,
cogFn = priceCog,
name = "list_vs_time_ggplot_lm",
desc = "List and sold priceover time w/ggplot2 with lm line",
conn = vdbConn("vdb")
)
view()
Tessera  Spring Research Conference
By Barret Schloerke
Tessera  Spring Research Conference
 524