Ryan Hafen
http://bit.ly/trelliscopejs1
Modern Approaches to Data Exploration with Trellis Display
All examples in this talk are reproducible after installing and loading the following packages:
install.packages(c("tidyverse", "gapminder", "rbokeh", "visNetwork", "plotly"))
devtools::install_github("hafen/trelliscopejs")
library(tidyverse)
library(gapminder)
library(rbokeh)
library(visNetwork)
library(trelliscopejs)
TrelliscopeJS is an htmlwidget
TrelliscopeJS is a layout engine for collections of plots (including htmlwidgets)
TrelliscopeJS is a framework for creating interactive displays of small multiples, suitable for visualizing large datasets in detail
A series of similar plots, usually each based on a different slice of data, arranged in a grid
"For a wide range of problems in data presentation, small multiples are the best design solution."
Edward Tufte (Envisioning Information)
This idea was formalized and popularized in S/S-PLUS and subsequently R with the trellis and lattice packages
In my experience, small multiples are much more effective than flashier techniques such as animation, linked brushing, or custom interactive visualizations
Trelliscope is a general solution that brings small multiple displays to life by letting you interactively sort and filter the panels based on summary statistics, called cognostics, that are computed automatically for each panel
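As a rough illustration of what per-panel cognostics look like, the sketch below computes a few summary statistics for each country in the gapminder data with dplyr. The column names (mean_lifeExp, min_lifeExp, lifeExp_gain) are chosen here for illustration only; they are not the set that trelliscopejs computes automatically.

```r
library(dplyr)
library(gapminder)

# One row per would-be panel (country), with summary statistics that
# could be used to sort or filter panels. Column names are illustrative.
cogs <- gapminder %>%
  group_by(country, continent) %>%
  summarise(
    mean_lifeExp = mean(lifeExp),
    min_lifeExp  = min(lifeExp),
    # change in life expectancy between the first and last recorded year
    lifeExp_gain = lifeExp[which.max(year)] - lifeExp[which.min(year)],
    .groups = "drop"
  ) %>%
  arrange(desc(lifeExp_gain))

cogs
```

Sorting or filtering on columns like these is what lets a viewer jump straight to the most interesting panels instead of paging through all of them.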
JavaScript library: trelliscopejs-lib
R package: trelliscopejs
Suppose we want to understand mortality over time for each country
glimpse(gapminder)
Observations: 1,704
Variables: 6
$ country   <fctr> Afghanistan, Afghanistan, Afghanistan, Afghanistan, Afgh...
$ continent <fctr> Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, As...
$ year      <int> 1952, 1957, 1962, 1967, 1972, 1977, 1982, 1987, 1992, 199...
$ lifeExp   <dbl> 28.801, 30.332, 31.997, 34.020, 36.088, 38.438, 39.854, 4...
$ pop       <int> 8425333, 9240934, 10267083, 11537966, 13079460, 14880372,...
$ gdpPercap <dbl> 779.4453, 820.8530, 853.1007, 836.1971, 739.9811, 786.113...
qplot(year, lifeExp, data = gapminder, color = country, geom = "line")
Yikes! There are a lot of countries...
qplot(year, lifeExp, data = gapminder, color = continent, group = country, geom = "line")
I can't see what's going on...
qplot(year, lifeExp, data = gapminder, color = continent,
group = country, geom = "line") +
facet_wrap(~ continent, nrow = 1)
That helped a little...
p <- qplot(year, lifeExp, data = gapminder, color = continent,
  group = country, geom = "line") +
  facet_wrap(~ continent, nrow = 1)
plotly::ggplotly(p)
This helps, but there is still too much overplotting...
(and hovering for additional info is too much work, since we can only see the details for one line at a time)
qplot(year, lifeExp, data = gapminder) + xlim(1948, 2011) + ylim(10, 95) + theme_bw() + facet_wrap(~ country + continent)
Turning a ggplot2 faceted display into a Trelliscope display is as easy as changing:
facet_wrap()
or:
facet_grid()
to:
facet_trelliscope()
qplot(year, lifeExp, data = gapminder) +
xlim(1948, 2011) + ylim(10, 95) + theme_bw() +
facet_trelliscope(~ country + continent, nrow = 2, ncol = 7, width = 300)
Note: this and future plots in this presentation are interactive - feel free to explore!
qplot(year, lifeExp, data = gapminder) +
xlim(1948, 2011) + ylim(10, 95) + theme_bw() +
facet_trelliscope(~ country + continent,
nrow = 2, ncol = 7, width = 300, as_plotly = TRUE)
country_model <- function(df) lm(lifeExp ~ year, data = df)
by_country <- gapminder %>%
  group_by(country, continent) %>%
  nest() %>%
  mutate(
    model = map(data, country_model),
    resid_mad = map_dbl(model, function(x) mad(resid(x))))
by_country
Example adapted from "R for Data Science"
# A tibble: 142 × 5
       country continent              data    model resid_mad
        <fctr>    <fctr>            <list>   <list>     <dbl>
1  Afghanistan      Asia <tibble [12 × 4]> <S3: lm> 1.4058780
2      Albania    Europe <tibble [12 × 4]> <S3: lm> 2.2193278
3      Algeria    Africa <tibble [12 × 4]> <S3: lm> 0.7925897
4       Angola    Africa <tibble [12 × 4]> <S3: lm> 1.4903085
5    Argentina  Americas <tibble [12 × 4]> <S3: lm> 0.2376178
6    Australia   Oceania <tibble [12 × 4]> <S3: lm> 0.7934372
7      Austria    Europe <tibble [12 × 4]> <S3: lm> 0.3928605
8      Bahrain      Asia <tibble [12 × 4]> <S3: lm> 1.8201766
9   Bangladesh      Asia <tibble [12 × 4]> <S3: lm> 1.1947475
10     Belgium    Europe <tibble [12 × 4]> <S3: lm> 0.2353342
# ... with 132 more rows
Excerpt from "R for Data Science"
We'll use the rbokeh package to make a plot function and apply it to the first row of our data
country_plot <- function(data, model) {
  figure(xlim = c(1948, 2011), ylim = c(10, 95), tools = NULL) %>%
    ly_points(year, lifeExp, data = data, hover = data) %>%
    ly_abline(model)
}
country_plot(by_country$data[[1]], by_country$model[[1]])
by_country <- by_country %>%
  mutate(plot = map2_plot(data, model, country_plot))
by_country
# A tibble: 142 × 6
       country continent              data    model resid_mad         plot
        <fctr>    <fctr>            <list>   <list>     <dbl>       <list>
1  Afghanistan      Asia <tibble [12 × 4]> <S3: lm> 1.4058780 <S3: rbokeh>
2      Albania    Europe <tibble [12 × 4]> <S3: lm> 2.2193278 <S3: rbokeh>
3      Algeria    Africa <tibble [12 × 4]> <S3: lm> 0.7925897 <S3: rbokeh>
4       Angola    Africa <tibble [12 × 4]> <S3: lm> 1.4903085 <S3: rbokeh>
5    Argentina  Americas <tibble [12 × 4]> <S3: lm> 0.2376178 <S3: rbokeh>
6    Australia   Oceania <tibble [12 × 4]> <S3: lm> 0.7934372 <S3: rbokeh>
7      Austria    Europe <tibble [12 × 4]> <S3: lm> 0.3928605 <S3: rbokeh>
8      Bahrain      Asia <tibble [12 × 4]> <S3: lm> 1.8201766 <S3: rbokeh>
9   Bangladesh      Asia <tibble [12 × 4]> <S3: lm> 1.1947475 <S3: rbokeh>
10     Belgium    Europe <tibble [12 × 4]> <S3: lm> 0.2353342 <S3: rbokeh>
# ... with 132 more rows
Plots as list-columns!!!
by_country %>%
trelliscope(name = "by_country_lm", nrow = 2, ncol = 4)
With plots as columns, TrelliscopeJS provides nearly effortless detailed, flexible, interactive visualization in the Tidyverse
by_country %>%
arrange(-resid_mad) %>%
trelliscope(name = "by_country_lm", nrow = 2, ncol = 4)
Order the data frame to set initial ordering of display
by_country %>%
filter(continent == "Africa") %>%
trelliscope(name = "by_country_africa_lm", nrow = 2, ncol = 4)
Filter the data to only include plots you want in the display
pokemon <- read_csv("http://bit.ly/plot_pokemon") %>%
  mutate_at(vars(matches("_id$")), as.character) %>%
  mutate(panel = img_panel(url_image))
pokemon
trelliscope(pokemon, name = "pokemon", nrow = 3, ncol = 6,
state = list(labels = c("pokemon", "pokedex")))
library(visNetwork)
nnodes <- 100
nnedges <- 1000
nodes <- data.frame(
  id = 1:nnodes,
  label = 1:nnodes,
  value = rep(1, nnodes))
edges <- data.frame(
  from = sample(1:nnodes, nnedges, replace = TRUE),
  to = sample(1:nnodes, nnedges, replace = TRUE)) %>%
  group_by(from, to) %>%
  summarise(value = n())
network_plot <- function(id, hide_select = TRUE) {
  style <- ifelse(hide_select,
    "visibility: hidden; position: absolute", "")
  visNetwork(nodes, edges) %>%
    visIgraphLayout(layout = "layout_in_circle") %>%
    visNodes(fixed = TRUE,
      scaling = list(min = 20, max = 50,
        label = list(min = 35, max = 70,
          drawThreshold = 1, maxVisible = 100))) %>%
    visEdges(scaling = list(min = 5, max = 30)) %>%
    visOptions(
      highlightNearest = list(enabled = TRUE, degree = 0,
        hideColor = "rgba(200,200,200,0.2)"),
      nodesIdSelection = list(selected = as.character(id), style = style))
}
network_plot(1, hide_select = FALSE)
nodedat <- edges %>%
  group_by(from) %>%
  summarise(n_nodes = n(), tot_conns = sum(value)) %>%
  rename(id = from) %>%
  arrange(-n_nodes) %>%
  mutate(panel = map_plot(id, network_plot))
nodedat
# A tibble: 100 × 4
      id n_nodes tot_conns            panel
   <int>   <int>     <int>           <list>
1     58      17        19 <S3: visNetwork>
2     45      16        17 <S3: visNetwork>
3      9      15        18 <S3: visNetwork>
4     31      15        16 <S3: visNetwork>
5     14      14        15 <S3: visNetwork>
6     42      14        15 <S3: visNetwork>
7     90      14        14 <S3: visNetwork>
8     21      13        14 <S3: visNetwork>
9     37      13        14 <S3: visNetwork>
10    43      13        13 <S3: visNetwork>
# ... with 90 more rows
We create a data frame with one row per node, using the number of connected nodes and the total number of connections as cognostics, and add a plot panel column
nodedat %>%
arrange(-n_nodes) %>%
trelliscope(name = "connections", nrow = 2, ncol = 4)
instadf %>%
arrange(-likes_count) %>%
trelliscope(name = "posts", width = 320, height = 320, nrow = 3, ncol = 6,
state = list(labels = c("caption", "post_link", "likes_count")))
If you have an app that takes multiple inputs and produces a plot as output, the idea is simply to enumerate all possible input combinations as rows of a data frame, add the plot that corresponds to each combination as a column, and display the result with Trelliscope
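A minimal sketch of this idea, assuming a hypothetical app with two inputs, a continent and a variable to plot. The names params, input_plot, and "inputs_as_rows" are invented for illustration, and the sketch assumes map2_plot() accepts a function that returns a ggplot2 object.

```r
library(tidyverse)
library(gapminder)
library(trelliscopejs)

# Enumerate every combination of the two hypothetical app inputs
params <- crossing(
  continent = as.character(unique(gapminder$continent)),
  variable  = c("lifeExp", "gdpPercap")
)

# Plot function that takes one value of each input
input_plot <- function(cont, var) {
  d <- gapminder[gapminder$continent == cont, ]
  ggplot(d, aes(x = year, y = .data[[var]], group = country)) +
    geom_line() +
    labs(y = var) +
    theme_bw()
}

# Add a panel column holding one plot per combination of inputs
params <- params %>%
  mutate(panel = map2_plot(continent, variable, input_plot))

# Open the display only in an interactive session
if (interactive()) {
  trelliscope(params, name = "inputs_as_rows", nrow = 2, ncol = 5)
}
```

Each input becomes a column that the viewer can filter on, which plays the role of the app's input controls.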
Trelliscope displays are most useful as exploratory plots to guide the data scientist (because they can be created rapidly)
However, in many cases Trelliscope displays can be used as interactive applications for end-users, domain experts, etc. with the bonus that they are much easier to create than a custom app
library(shiny)
library(ggplot2)
library(gapminder)
server <- function(input, output) {
  output$countryPlot <- renderPlot({
    qplot(year, lifeExp, data = subset(gapminder, country == input$country)) +
      xlim(1948, 2011) + ylim(10, 95) + theme_bw()
  })
}
choices <- sort(unique(gapminder$country))
ui <- fluidPage(
  titlePanel("Gapminder Life Expectancy"),
  sidebarLayout(
    sidebarPanel(
      selectInput("country", label = "Select country: ",
        choices = choices, selected = "Afghanistan")
    ),
    mainPanel(
      plotOutput("countryPlot", height = "500px")
    )
  )
)
runApp(list(ui = ui, server = server))
Just because you can't look at every panel in a display doesn't mean a large display isn't useful or practical. It is in fact beneficial: you get an unprecedented level of detail, and every corner of your data can conceptually be viewed
One insight is all you need for a display to serve a purpose (provided it is quick to create)
We used the previous implementation of Trelliscope to visualize millions of subsets of terabytes of data
SparklyR is the natural solution
But we need a few things...
qplot(year, lifeExp, data = gapminder) +
facet_trelliscope(~ country + continent, nrow = 2, ncol = 7, width = 300)