## Analysis and Visualization of Large Complex Data with Tessera

Spring Research Conference

Barret Schloerke

Purdue University

May 25th, 2016

## Background

• Purdue University
  • PhD Candidate in Statistics (4th year)
  • Dr. William Cleveland and Dr. Ryan Hafen
  • Research in large data visualization using R
  • www.tessera.io
• Metamarkets.com - 1.5 years
  • San Francisco startup
  • Front-end engineer - CoffeeScript / Node.js
• Iowa State University
  • B.S. in Computer Engineering
  • Research in statistical data visualization with R
  • Dr. Di Cook, Dr. Hadley Wickham, and Dr. Heike Hofmann

## "Big Data"

• Great buzzword!
• But imprecise when put into action - the definition floats with the computing resources at hand:
  • Small data: fits in memory
  • Medium data: fits on a single machine
  • Large data: requires multiple machines

## Large and Complex Data

(Any or all of the following)

• Large number of records
• Large number of variables
• Complex data structures not readily put into tabular form
• Intricate patterns and dependencies
  • Require complex models and methods of analysis
  • Not i.i.d.!

Often, complex data is more of a challenge than large data, but most large data sets are also complex.

## Large Data Computation

• Computational analysis performance also depends on
  • Computational complexity of the methods used
    • An issue for all sizes of data
  • Hardware computing power
    • More machines ≈ more power

## Divide and Recombine (D&R)

• A statistical approach to high-performance computing for data analysis
• Specify meaningful, persistent divisions of the data
• Analytic or visual methods are applied independently to each subset of the divided data in embarrassingly parallel fashion
  • No communication between subsets
• Results are recombined to yield a statistically valid D&R result for the analytic method
• The plyr "split-apply-combine" idea, but across multiple machines
  • Dr. Wickham: http://vita.had.co.nz/papers/plyr.pdf
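As a sketch of the three D&R steps, using datadr's in-memory back end and the built-in `iris` data (the variable names here are illustrative, not from the slides):

```r
library(datadr)

# Divide: split iris into one persistent subset per species
bySpecies <- divide(iris, by = "Species")

# Apply: fit a linear model independently to each subset
# (embarrassingly parallel - no communication between subsets)
withCoefs <- addTransform(bySpecies, function(x) {
  coef(lm(Sepal.Length ~ Petal.Length, data = x))
})

# Recombine: bind the per-subset coefficients into one result
coefs <- recombine(withCoefs, combine = combRbind)
```

The same divide / transform / recombine code runs unchanged against a distributed back end such as Hadoop via Rhipe.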

## What is Tessera?

• tessera.io
• A set of high-level R interfaces for analyzing complex data - small, medium, and large
• Powered by statistical methodology of Divide & Recombine
• Code is simple and consistent regardless of size
• Provides access to 1000s of statistical, machine learning, and visualization methods
• Detailed, flexible, scalable visualization with Trelliscope

## Tessera Environment

• User Interface: two R packages, datadr & trelliscope
• Data Interface: Rhipe
• Can use many different data back ends: R, Hadoop, Spark, etc.
• R <-> backend bridges: Rhipe, SparkR, etc.

[Diagram: the Tessera stack, grouped by layer - the Tessera interface, the computing back end, and the data location]

## Data Back End: Rhipe

• R Hadoop Interface Programming Environment
• R package that communicates with Hadoop
• Built to handle Large Data
• Already does distributed Divide & Recombine
• Saves data as R objects

## Front End: datadr

• R package
• Interface to small, medium, and large data
• Analyst provides
  • divisions
  • analytic methods
  • a recombination method
• Protects users from the ugly details of distributed data
  • Less time thinking about distributed computing, more time thinking about the analysis

## datadr vs. dplyr

• dplyr
  • "A fast, consistent tool for working with data frame like objects, both in memory and out of memory"
  • Provides a simple interface for quickly performing a wide variety of operations on data frames
  • Built for data.frames
• Similarities
  • Both are extensible interfaces for data analysis / manipulation
  • Both have a flavor of split-apply-combine
• datadr is often mistaken for a dplyr alternative or competitor - not true!

dplyr is great for subsetting and aggregating small-to-medium tabular data.

datadr is great for scalable, deep analysis of large, complex data.
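To make the distinction concrete, here is the same per-group summary sketched in both interfaces, using the `housing` data from the housingData package (an illustrative sketch, not from the slides):

```r
library(dplyr)
library(datadr)
library(housingData)

# dplyr: tabular, in-memory group-and-summarise
housing %>%
  group_by(county, state) %>%
  summarise(meanList = mean(medListPriceSqft, na.rm = TRUE))

# datadr: divide / apply / recombine; the same code also runs
# against distributed back ends such as Hadoop via Rhipe
byCounty <- divide(housing, by = c("county", "state"))
recombine(
  addTransform(byCounty, function(x) {
    data.frame(meanList = mean(x$medListPriceSqft, na.rm = TRUE))
  }),
  combine = combRbind
)
```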

## Visual Recombination: Trelliscope

• www.tessera.io
• Most tools and approaches for big data either
  • summarize a lot of data and make a single plot, or
  • are very specialized for a particular domain
• Summaries are critical...
• But we must be able to visualize complex data in detail even when they are large!
• Trelliscope does this by building on Trellis Display

## Trellis Display

• Tufte, Edward (1983). The Visual Display of Quantitative Information
• Data are split into meaningful subsets, usually conditioning on variables of the dataset
• A visualization method is applied to each subset
• The image for each subset is called a "panel"
• Panels are arranged in an array of rows, columns, and pages, resembling a garden trellis
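The steps above can be sketched with R's lattice package, which conditions on a variable to produce one panel per subset (an illustrative example, not from the slides):

```r
library(lattice)

# One scatterplot panel per species, all sharing the same scales
# and axes; panels are laid out in a trellis-like array
xyplot(Sepal.Length ~ Petal.Length | Species,
       data   = iris,
       layout = c(3, 1),  # 3 columns, 1 row of panels
       xlab   = "Petal Length",
       ylab   = "Sepal Length")
```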

## Scaling Trellis

• Big data lends itself nicely to the idea of small multiples
  • Small multiple: a series of similar graphs or charts using the same scale and axes, allowing them to be easily compared
• Typically "big data" is big because it is made up of collections of smaller data from many subjects, sensors, locations, time periods, etc.
• Potentially thousands or millions of panels
• We can create millions of plots, but we will never be able to (or want to) view all of them!

## Scaling Trellis

• To scale, we can apply the same steps as in Trellis display, with one extra step:
  • Data are split into meaningful subsets, usually conditioning on variables of the dataset
  • A visualization method is applied to each subset
  • A set of cognostic metrics is computed for each subset
  • Panels are arranged in an array of rows, columns, and pages, resembling a garden trellis, with the arrangement being specified through interactions with the cognostics

## Trelliscope

• Extension of multi-panel display systems, e.g. Trellis Display or faceting in ggplot2

• Number of panels can be very large (in the millions)

• Panels can be interactively navigated through the use of cognostics (each subset's metrics)

• Provides flexible, scalable, detailed visualization of large, complex data

### Trelliscope is Scalable

• 6 months of high frequency trading data
• Hundreds of gigabytes of data
• Split by stock symbol and day
• Nearly 1 million subsets

For more information (docs, code, papers, user group, blog, etc.): http://tessera.io

## Example Code

```r
library(magrittr); library(dplyr); library(tidyr); library(ggplot2)
library(datadr)
library(trelliscope)
library(housingData)

# divide the housing data by county and state
divide(housing, by = c("county", "state")) %>%
  drFilter(function(x) nrow(x) > 10) ->
  # drFilter(function(x) nrow(x) > 120) ->
  byCounty

# calculate a common y-axis range across all subsets
byCounty %>%
  drLapply(function(x) {
    range(x[, c("medListPriceSqft", "medSoldPriceSqft")], na.rm = TRUE)
  }) %>%
  as.list() %>%
  lapply("[[", 2) %>%
  unlist() %>%
  range() ->
  yRanges

# for every subset 'x', calculate these cognostics
priceCog <- function(x) {
  zillowString <- gsub(" ", "-", do.call(paste, getSplitVars(x)))
  list(
    slopeList = cog(
      coef(lm(medListPriceSqft ~ time, data = x))[2],
      desc = "list price slope"
    ),
    meanList = cogMean(x$medListPriceSqft),
    meanSold = cogMean(x$medSoldPriceSqft),
    nObsList = cog(
      length(which(!is.na(x$medListPriceSqft))),
      desc = "number of non-NA list prices"
    ),
    zillowHref = cogHref(
      sprintf("http://www.zillow.com/homes/%s_rb/", zillowString),
      desc = "zillow link"
    )
  )
}

# for every subset 'x', generate this plot
latticePanel <- function(x) {
  x %>%
    select(time, medListPriceSqft, medSoldPriceSqft) %>%
    gather(key = "variable", value = "value",
           medListPriceSqft, medSoldPriceSqft) %>%
    ggplot(aes(x = time, y = value, color = variable)) +
    geom_smooth() +
    geom_point() +
    ylim(yRanges) +
    labs(y = "Price / Sq. Ft.") +
    theme(legend.position = "bottom")
}

# make this display
makeDisplay(
  byCounty,
  group   = "fields",
  panelFn = latticePanel,
  cogFn   = priceCog,
  name    = "list_vs_time_ggplot",
  desc    = "List and sold price over time w/ ggplot2",
  conn    = vdbConn("vdb", autoYes = TRUE)
)

# make a second display with a linear-model smoother
latticePanelLM <- function(x) {
  x %>%
    select(time, medListPriceSqft, medSoldPriceSqft) %>%
    gather(key = "variable", value = "value",
           medListPriceSqft, medSoldPriceSqft) %>%
    ggplot(aes(x = time, y = value, color = variable)) +
    geom_smooth(method = "lm") +
    geom_point() +
    ylim(yRanges) +
    labs(y = "Price / Sq. Ft.") +
    theme(legend.position = "bottom")
}

makeDisplay(
  byCounty,
  group   = "fields",
  panelFn = latticePanelLM,
  cogFn   = priceCog,
  name    = "list_vs_time_ggplot_lm",
  desc    = "List and sold price over time w/ ggplot2 with lm line",
  conn    = vdbConn("vdb")
)

view()
```
