## Analysis and Visualization of Large Complex Data with Tessera

Spring Research Conference

Barret Schloerke

Purdue University

May 25th, 2016

## Background

• Purdue University
  • PhD Candidate in Statistics (4th year)
  • Dr. William Cleveland and Dr. Ryan Hafen
  • Research in large data visualization using R
  • www.tessera.io
• Metamarkets.com - 1.5 years
  • San Francisco startup
  • Front-end engineer - CoffeeScript / Node.js
• Iowa State University
  • B.S. in Computer Engineering
  • Research in statistical data visualization with R
  • Dr. Di Cook, Dr. Hadley Wickham, and Dr. Heike Hofmann

## "Big Data"

• Great buzzword!
• But imprecise when put into action - the definition floats with the computing resources at hand:
  • Small data: fits in memory
  • Medium data: fits on a single machine
  • Large data: requires multiple machines

## Large and Complex Data

(Any or all of the following)

• Large number of records
• Large number of variables
• Complex data structures not readily put into tabular form
• Intricate patterns and dependencies
  • Require complex models and methods of analysis
  • Not i.i.d.!

Often, complex data is more of a challenge than large data, but most large data sets are also complex.

## Large Data Computation

• Computational analysis performance also depends on
  • Computational complexity of the methods used
    • An issue for all sizes of data
  • Hardware computing power
    • More machines ≈ more power

## Divide and Recombine (D&R)

• A statistical approach to high-performance computing for data analysis
• Specify meaningful, persistent divisions of the data
• Analytic or visual methods are applied independently to each subset of the divided data in embarrassingly parallel fashion
  • No communication between subsets
• Results are recombined to yield a statistically valid D&R result for the analytic method
• The plyr "split-apply-combine" idea, but across multiple machines
  • Dr. Wickham: http://vita.had.co.nz/papers/plyr.pdf
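As a sketch of the three D&R steps, using datadr's in-memory back end and the built-in `iris` data (the variable names here are illustrative, not from the slides):

```r
library(datadr)

# Divide: split iris into one persistent subset per species
bySpecies <- divide(iris, by = "Species")

# Apply: fit a linear model independently to each subset
# (embarrassingly parallel - no communication between subsets)
withCoefs <- addTransform(bySpecies, function(x) {
  coef(lm(Sepal.Length ~ Petal.Length, data = x))
})

# Recombine: bind the per-subset coefficients into one result
coefs <- recombine(withCoefs, combine = combRbind)
```

The same divide / transform / recombine code runs unchanged against a distributed back end such as Hadoop via Rhipe.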

## What is Tessera?

• tessera.io
• A set of high-level R interfaces for analyzing complex data - small, medium, and large
• Powered by statistical methodology of Divide & Recombine
• Code is simple and consistent regardless of size
• Provides access to 1000s of statistical, machine learning, and visualization methods
• Detailed, flexible, scalable visualization with Trelliscope

## Tessera Environment

• User Interface: two R packages, datadr & trelliscope
• Data Interface: Rhipe
• Can use many different data back ends: R, Hadoop, Spark, etc.
• R <-> backend bridges: Rhipe, SparkR, etc.

[Diagram: the Tessera stack, grouped by layer - the Tessera interface, the computing back end, and the data location]

## Data Back End: Rhipe

• R Hadoop Interface Programming Environment
• R package that communicates with Hadoop
• Built to handle Large Data
• Already does distributed Divide & Recombine
• Saves data as R objects

## Front End: datadr

• R package
• Interface to small, medium, and large data
• Analyst provides
  • divisions
  • analytic methods
  • a recombination method
• Protects users from the ugly details of distributed data
  • Less time thinking about distributed computing, more time thinking about the analysis

## datadr vs. dplyr

• dplyr
  • "A fast, consistent tool for working with data frame like objects, both in memory and out of memory"
  • Provides a simple interface for quickly performing a wide variety of operations on data frames
  • Built for data.frames
• Similarities
  • Both are extensible interfaces for data analysis / manipulation
  • Both have a flavor of split-apply-combine
• datadr is often mistaken for a dplyr alternative or competitor - not true!

dplyr is great for subsetting and aggregating small-to-medium tabular data.

datadr is great for scalable, deep analysis of large, complex data.
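To make the distinction concrete, here is the same per-group summary sketched in both interfaces, using the `housing` data from the housingData package (an illustrative sketch, not from the slides):

```r
library(dplyr)
library(datadr)
library(housingData)

# dplyr: tabular, in-memory group-and-summarise
housing %>%
  group_by(county, state) %>%
  summarise(meanList = mean(medListPriceSqft, na.rm = TRUE))

# datadr: divide / apply / recombine; the same code also runs
# against distributed back ends such as Hadoop via Rhipe
byCounty <- divide(housing, by = c("county", "state"))
recombine(
  addTransform(byCounty, function(x) {
    data.frame(meanList = mean(x$medListPriceSqft, na.rm = TRUE))
  }),
  combine = combRbind
)
```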

## Visual Recombination: Trelliscope

• www.tessera.io
• Most tools and approaches for big data either
  • summarize a lot of data and make a single plot, or
  • are very specialized for a particular domain
• Summaries are critical...
• But we must be able to visualize complex data in detail even when they are large!
• Trelliscope does this by building on Trellis Display

## Trellis Display

• Tufte, Edward (1983). The Visual Display of Quantitative Information
• Data are split into meaningful subsets, usually conditioning on variables of the dataset
• A visualization method is applied to each subset
• The image for each subset is called a "panel"
• Panels are arranged in an array of rows, columns, and pages, resembling a garden trellis
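The steps above can be sketched with R's lattice package, which conditions on a variable to produce one panel per subset (an illustrative example, not from the slides):

```r
library(lattice)

# One scatterplot panel per species, all sharing the same scales
# and axes; panels are laid out in a trellis-like array
xyplot(Sepal.Length ~ Petal.Length | Species,
       data   = iris,
       layout = c(3, 1),  # 3 columns, 1 row of panels
       xlab   = "Petal Length",
       ylab   = "Sepal Length")
```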

## Scaling Trellis

• Big data lends itself nicely to the idea of small multiples
  • Small multiple: a series of similar graphs or charts using the same scale and axes, allowing them to be easily compared
• Typically "big data" is big because it is made up of collections of smaller data from many subjects, sensors, locations, time periods, etc.
• Potentially thousands or millions of panels
• We can create millions of plots, but we will never be able to (or want to) view all of them!

## Scaling Trellis

• To scale, we can apply the same steps as in Trellis display, with one extra step:
  • Data are split into meaningful subsets, usually conditioning on variables of the dataset
  • A visualization method is applied to each subset
  • A set of cognostic metrics is computed for each subset
  • Panels are arranged in an array of rows, columns, and pages, resembling a garden trellis, with the arrangement being specified through interactions with the cognostics

## Trelliscope

• Extension of multi-panel display systems, e.g. Trellis Display or faceting in ggplot2

• Number of panels can be very large (in the millions)

• Panels can be interactively navigated through the use of cognostics (each subset's metrics)

• Provides flexible, scalable, detailed visualization of large, complex data

### Trelliscope is Scalable

• 6 months of high frequency trading data
• Hundreds of gigabytes of data
• Split by stock symbol and day
• Nearly 1 million subsets

For more information (docs, code, papers, user group, blog, etc.): http://tessera.io

## Example Code

```r
library(magrittr); library(dplyr); library(tidyr); library(ggplot2)
library(datadr)
library(trelliscope)
library(housingData)

# divide the housing data by county and state
divide(housing, by = c("county", "state")) %>%
  drFilter(function(x) nrow(x) > 10) ->
  # drFilter(function(x) nrow(x) > 120) ->
  byCounty

# calculate a common y-axis range across all subsets
byCounty %>%
  drLapply(function(x) {
    range(x[, c("medListPriceSqft", "medSoldPriceSqft")], na.rm = TRUE)
  }) %>%
  as.list() %>%
  lapply("[[", 2) %>%
  unlist() %>%
  range() ->
  yRanges

# for every subset 'x', calculate these cognostics
priceCog <- function(x) {
  zillowString <- gsub(" ", "-", do.call(paste, getSplitVars(x)))
  list(
    slopeList = cog(
      coef(lm(medListPriceSqft ~ time, data = x))[2],
      desc = "list price slope"
    ),
    meanList = cogMean(x$medListPriceSqft),
    meanSold = cogMean(x$medSoldPriceSqft),
    nObsList = cog(
      length(which(!is.na(x$medListPriceSqft))),
      desc = "number of non-NA list prices"
    ),
    zillowHref = cogHref(
      sprintf("http://www.zillow.com/homes/%s_rb/", zillowString),
      desc = "zillow link"
    )
  )
}

# for every subset 'x', generate this plot
latticePanel <- function(x) {
  x %>%
    select(time, medListPriceSqft, medSoldPriceSqft) %>%
    gather(key = "variable", value = "value",
           medListPriceSqft, medSoldPriceSqft) %>%
    ggplot(aes(x = time, y = value, color = variable)) +
    geom_smooth() +
    geom_point() +
    ylim(yRanges) +
    labs(y = "Price / Sq. Ft.") +
    theme(legend.position = "bottom")
}

# make this display
makeDisplay(
  byCounty,
  group   = "fields",
  panelFn = latticePanel,
  cogFn   = priceCog,
  name    = "list_vs_time_ggplot",
  desc    = "List and sold price over time w/ ggplot2",
  conn    = vdbConn("vdb", autoYes = TRUE)
)

# make a second display with a linear-model smoother
latticePanelLM <- function(x) {
  x %>%
    select(time, medListPriceSqft, medSoldPriceSqft) %>%
    gather(key = "variable", value = "value",
           medListPriceSqft, medSoldPriceSqft) %>%
    ggplot(aes(x = time, y = value, color = variable)) +
    geom_smooth(method = "lm") +
    geom_point() +
    ylim(yRanges) +
    labs(y = "Price / Sq. Ft.") +
    theme(legend.position = "bottom")
}

makeDisplay(
  byCounty,
  group   = "fields",
  panelFn = latticePanelLM,
  cogFn   = priceCog,
  name    = "list_vs_time_ggplot_lm",
  desc    = "List and sold price over time w/ ggplot2 with lm line",
  conn    = vdbConn("vdb")
)

view()
```
