Analysis and Visualization of Large Complex Data with Tessera

Spring Research Conference

 

Barret Schloerke

Purdue University

May 25th, 2016

Background

  • Purdue University
    • PhD Candidate in Statistics (4th Year)
    • Dr. William Cleveland and Dr. Ryan Hafen
    • Research in large data visualization using R
    • www.tessera.io
  • Metamarkets.com - 1.5 years
    • San Francisco startup
    • Front end engineer - CoffeeScript / Node.js
  • Iowa State University
    • B.S. in Computer Engineering
    • Research in statistical data visualization with R
      • Dr. Di Cook, Dr. Hadley Wickham, and Dr. Heike Hofmann

Big Data Deserves a Big Screen

"Big Data"

  • Great buzzword!
    • But imprecise when put into practice
  • Needs a floating definition
    • Small Data
      • In memory
    • Medium Data
      • Single machine
    • Large Data
      • Multiple machines

Large and Complex Data

(Any or all of the following)

  • Large number of records
  • Large number of variables
  • Complex data structures not readily put into tabular form
  • Intricate patterns and dependencies
  • Require complex models and methods of analysis
  • Not i.i.d.!

Often, complex data is more of a challenge than large data, but most large data sets are also complex.

Large Data Computation

  • Computational analysis performance also depends on

    • Computational complexity of methods used
      • Issue for all sizes of data
    • Hardware computing power
      • More machines ≈ more power

Divide and Recombine (D&R)

  • A statistical approach to high-performance computing for data analysis
  • Specify meaningful, persistent divisions of the data
  • Analytic or visual methods are applied independently to each subset of the divided data in embarrassingly parallel fashion
    • No communication between subsets
  • Results are recombined to yield a statistically valid D&R result for the analytic method
  • The plyr "split-apply-combine" idea, but across multiple machines (see the sketch below)
    • Dr. Wickham: http://vita.had.co.nz/papers/plyr.pdf
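
As a minimal sketch of one D&R round trip with datadr, using R's built-in iris data for illustration:

library(datadr)

# divide: a meaningful, persistent division (one subset per species)
bySpecies <- divide(iris, by = "Species")

# apply: run a method on each subset independently, in embarrassingly parallel fashion
petalMean <- addTransform(bySpecies, function(x) mean(x$Petal.Length))

# recombine: bind the per-subset results back into a single data frame
recombine(petalMean, combRbind)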

What is Tessera?

  • tessera.io
  • A set of high-level R interfaces for analyzing complex data,
    whether small, medium, or large
  • Powered by statistical methodology of Divide & Recombine
  • Code is simple and consistent regardless of size
  • Provides access to 1000s of statistical, machine learning, and visualization methods
  • Detailed, flexible, scalable visualization with Trelliscope

Tessera Environment

  • User Interface: two R packages, datadr & trelliscope
  • Data Interface: Rhipe
  • Can use many different data back ends: R, Hadoop, Spark, etc.
  • R <-> backend bridges: Rhipe, SparkR, etc.

Tessera

(Diagram: the Tessera environment, with braces grouping its components by computing role and data location.)

Data Back End: Rhipe

  • R and Hadoop Integrated Programming Environment
  • R package that communicates with Hadoop
  • Hadoop
    • Built to handle Large Data
    • Already does distributed Divide & Recombine
  • Saves data as R objects (see the sketch below)
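
A sketch of how this looks from datadr when the data live on HDFS (requires a running Hadoop cluster with Rhipe installed; the HDFS paths are hypothetical):

library(datadr)

# key-value pairs are stored on HDFS as serialized R objects
housingHdfs <- ddf(hdfsConn("/tmp/housing"))

# the division runs as a MapReduce job via Rhipe
byCounty <- divide(housingHdfs, by = c("county", "state"),
                   output = hdfsConn("/tmp/housing_by_county", autoYes = TRUE))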

Front End: datadr

  • R package
  • Interface to small, medium, and large data
  • Analyst provides
    • divisions
    • analytics methods
    • recombination method
  • Protects users from the ugly details of distributed data (see the sketch below)
    • less time thinking about systems
    • more time thinking about data
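
For example, moving between back ends changes only the connection, not the analysis code. A minimal sketch, assuming the housingData package ("housing_ldd" is a hypothetical output directory):

library(datadr)
library(housingData)

# in memory: divide a data frame into a distributed data frame (ddf)
byCountyMem <- divide(housing, by = c("county", "state"))

# on local disk: the same divide() call, only the output connection changes
byCountyDisk <- divide(housing, by = c("county", "state"),
                       output = localDiskConn("housing_ldd", autoYes = TRUE))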

datadr vs. dplyr

  • dplyr
    • "A fast, consistent tool for working with data frame like objects, both in memory and out of memory"
    • Provides a simple interface for quickly performing a wide variety of operations on data frames
    • Built for data.frames
  • Similarities
    • Both are extensible interfaces for data analysis / manipulation
    • Both have a flavor of split-apply-combine
  • datadr is often mistaken for a dplyr alternative or competitor
    • Not true!

dplyr is great for subsetting and aggregating tabular data up to medium size

datadr is great for scalable, deep analysis of large, complex data (see the sketch below)
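
A side-by-side sketch of the same per-county summary, assuming the housingData package:

library(dplyr)
library(datadr)
library(housingData)

# dplyr: in-memory split-apply-combine on a data frame
housing %>%
  group_by(county, state) %>%
  summarise(meanList = mean(medListPriceSqft, na.rm = TRUE))

# datadr: the division is persistent, and the apply / recombine steps
# can also run on a distributed back end
byCounty <- divide(housing, by = c("county", "state"))
meanList <- addTransform(byCounty, function(x) mean(x$medListPriceSqft, na.rm = TRUE))
recombine(meanList, combRbind)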

Visual Recombination: Trelliscope

  • www.tessera.io
  • Most tools and approaches for big data either
    • Summarize a lot of data and make a single plot
    • Are very specialized for a particular domain
  • Summaries are critical...
  • But we must be able to visualize complex data in detail even when they are large!
  • Trelliscope does this by building on Trellis Display

Trellis Display

  • Tufte, Edward (1983). The Visual Display of Quantitative Information
  • Data are split into meaningful subsets, usually conditioning on variables of the dataset
  • A visualization method is applied to each subset
  • The image for each subset is called a "panel"
  • Panels are arranged in an array of rows, columns, and pages, resembling a garden trellis (see the sketch below)
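
For instance, a classic Trellis display with lattice. A sketch assuming the housingData package and that these Iowa counties appear in the data:

library(lattice)
library(housingData)

# condition on county: one panel per subset, with common scales
few <- subset(housing, state == "IA" & county %in% c("Polk", "Story", "Linn"))
xyplot(medListPriceSqft ~ time | county, data = few,
       layout = c(3, 1),  # 3 columns, 1 row per page
       ylab = "Median list price / sq. ft.")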

Scaling Trellis

  • Big data lends itself nicely to the idea of small multiples
    • small multiple: series of similar graphs or charts using the same scale + axes, allowing them to be easily compared
    • Typically "big data" is big because it is made up of collections of smaller data from many subjects, sensors, locations, time periods, etc.
  • Potentially thousands or millions of panels
    • We can create millions of plots, but we will never be able to (or want to) view all of them!

Scaling Trellis

  • To scale, we can apply the same steps as in Trellis display, with one extra step:
    • Data are split into meaningful subsets, usually conditioning on variables of the dataset
    • A visualization method is applied to each subset
    • A set of cognostic metrics is computed for each subset (a minimal cognostics function is sketched below)
    • Panels are arranged in an array of rows, columns, and pages, resembling a garden trellis, with the arrangement being specified through interactions with the cognostics
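
A cognostics function is an ordinary R function that returns one named list of metrics per subset. A minimal sketch using trelliscope's cog() and cogMean() helpers (the column names match the housingData example at the end):

library(trelliscope)

# computed once per subset; Trelliscope uses these values to sort,
# filter, and sample panels interactively
priceCogSketch <- function(x) {
  list(
    meanList = cogMean(x$medListPriceSqft),
    nObs     = cog(nrow(x), desc = "number of records in the subset")
  )
}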

Trelliscope

  • Extension of multi-panel display systems, e.g. Trellis Display or faceting in ggplot2

  • Number of panels can be very large (in the millions)

  • Panels can be interactively navigated through the use of cognostics (each subset's metrics)

  • Provides flexible, scalable, detailed visualization of large, complex data

Trelliscope is Scalable

  • 6 months of high frequency trading data
  • Hundreds of gigabytes of data
  • Split by stock symbol and day
  • Nearly 1 million subsets

More Information

For more information (docs, code, papers, user group, blog, etc.): http://tessera.io

Example Code

library(magrittr); library(dplyr); library(tidyr); library(ggplot2)

library(trelliscope)
library(datadr)
library(housingData)

# divide housing data by county and state,
# keeping only counties with more than 10 records
divide(housing, by = c("county", "state")) %>%
  drFilter(function(x){nrow(x) > 10}) ->
  # drFilter(function(x){nrow(x) > 120}) ->
  byCounty

# calculate the overall y range across subsets (for common panel scales)
byCounty %>%
  drLapply(function(x){
    range(x[,c("medListPriceSqft", "medSoldPriceSqft")], na.rm = TRUE)
  }) %>%
  as.list() %>%
  lapply("[[", 2) %>%  # extract the value from each key-value pair
  unlist() %>%
  range() ->
  yRanges


# for every subset 'x', calculate this information
priceCog <- function(x) {
   zillowString <- gsub(" ", "-", do.call(paste, getSplitVars(x)))
   list(
      slopeList = cog(
        coef(lm(medListPriceSqft ~ time, data = x))[2],
        desc = "list price slope"
      ),
      meanList = cogMean(x$medListPriceSqft),
      meanSold = cogMean(x$medSoldPriceSqft),
      nObsList = cog(
        length(which(!is.na(x$medListPriceSqft))),
        desc = "number of non-NA list prices"
      ),
      zillowHref = cogHref(
        sprintf("http://www.zillow.com/homes/%s_rb/", zillowString),
        desc = "zillow link"
      )
   )
}


# for every subset 'x', generate this plot
latticePanel <- function(x) {
  x %>%
    select(time, medListPriceSqft, medSoldPriceSqft) %>%
    gather(key = "variable", value = "value", medListPriceSqft, medSoldPriceSqft) %>%
    ggplot(aes(x = time, y = value, color = variable)) +
      geom_smooth() +
      geom_point() +
      ylim(yRanges) +
      labs(y = "Price / Sq. Ft.") +
      theme(legend.position = "bottom")
}

# make this display
makeDisplay(
  byCounty,
  group   = "fields",
  panelFn = latticePanel,
  cogFn   = priceCog,
  name    = "list_vs_time_ggplot",
  desc    = "List and sold price over time w/ggplot2",
  conn    = vdbConn("vdb", autoYes = TRUE)
)

# make a second display
latticePanelLM <- function(x) {
  x %>%
    select(time, medListPriceSqft, medSoldPriceSqft) %>%
    gather(key = "variable", value = "value", medListPriceSqft, medSoldPriceSqft) %>%
    ggplot(aes(x = time, y = value, color = variable)) +
      geom_smooth(method = "lm") +
      geom_point() +
      ylim(yRanges) +
      labs(y = "Price / Sq. Ft.") +
      theme(legend.position = "bottom")
}
makeDisplay(
  byCounty,
  group   = "fields",
  panelFn = latticePanelLM,
  cogFn   = priceCog,
  name    = "list_vs_time_ggplot_lm",
  desc    = "List and sold price over time w/ggplot2 with lm line",
  conn    = vdbConn("vdb")
)


view()
