Ryan Hafen
Context: Exploratory analysis and statistical model building
Image source: Hadley Wickham
A series of similar plots, usually each based on a different slice of data, arranged in a grid
"For a wide range of problems in data presentation, small multiples are the best design solution."
Edward Tufte (Envisioning Information)
This idea was formalized and popularized in S/S-PLUS and subsequently R with the trellis and lattice packages
Trelliscope is a general solution that brings small multiple displays to life by letting you interactively sort and filter the panels based on summary statistics, called cognostics, that capture attributes of interest in the data being plotted
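To make the idea of cognostics concrete, here is a minimal base R sketch. The `toy` data frame and its values are made up purely for illustration, and the column names (`mean_value`, `slope`, `resid_mad`) are assumptions rather than anything Trelliscope requires. It computes one row of summary statistics per panel; sorting and filtering panels then amounts to sorting and filtering that table.

```r
# Toy data (made up for illustration): yearly values for three groups,
# where each group would correspond to one panel in a display
toy <- data.frame(
  group = rep(c("a", "b", "c"), each = 4),
  year  = rep(c(2000, 2005, 2010, 2015), times = 3),
  value = c(50, 52, 54, 56,  60, 58, 61, 59,  40, 45, 43, 52)
)

# One row per panel; each column is a cognostic summarizing that panel's data
cogs <- do.call(rbind, lapply(split(toy, toy$group), function(d) {
  fit <- lm(value ~ year, data = d)
  data.frame(
    group      = d$group[1],
    mean_value = mean(d$value),
    slope      = unname(coef(fit)["year"]),
    resid_mad  = mad(resid(fit))
  )
}))

# Sort panels by how poorly a straight line fits them
cogs[order(-cogs$resid_mad), ]

# Keep only panels with a clearly increasing trend
subset(cogs, slope > 0.3)
```

In a Trelliscope display the same kind of table sits behind the panels, and the viewer does the sorting and filtering interactively.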
Suppose we want to understand mortality over time for each country
glimpse(gapminder)
Observations: 1,704
Variables: 6
$ country   <fctr> Afghanistan, Afghanistan, Afghanistan, Afghanistan, Afgh...
$ continent <fctr> Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, As...
$ year      <int> 1952, 1957, 1962, 1967, 1972, 1977, 1982, 1987, 1992, 199...
$ lifeExp   <dbl> 28.801, 30.332, 31.997, 34.020, 36.088, 38.438, 39.854, 4...
$ pop       <int> 8425333, 9240934, 10267083, 11537966, 13079460, 14880372,...
$ gdpPercap <dbl> 779.4453, 820.8530, 853.1007, 836.1971, 739.9811, 786.113...
qplot(year, lifeExp, data = gapminder, color = country, geom = "line")
Yikes! There are a lot of countries...
qplot(year, lifeExp, data = gapminder, color = continent, group = country, geom = "line")
Still too much going on...
qplot(year, lifeExp, data = gapminder, color = continent,
group = country, geom = "line") +
facet_wrap(~ continent, nrow = 1)
That helped a little...
p <- qplot(year, lifeExp, data = gapminder, color = continent,
  group = country, geom = "line") +
  facet_wrap(~ continent, nrow = 1)
plotly::ggplotly(p)
This helps but there is still too much overplotting...
(and hovering for additional info is too much work and we can only see more info one at a time)
qplot(year, lifeExp, data = gapminder) + theme_bw() + facet_wrap(~ country + continent)
qplot(year, lifeExp, data = gapminder) + theme_bw() +
facet_trelliscope(~ country + continent, nrow = 2, ncol = 7, width = 300)
Note: this and future plots in this presentation are interactive - feel free to explore!
qplot(year, lifeExp, data = gapminder) + theme_bw() +
facet_trelliscope(~ country + continent,
nrow = 2, ncol = 7, width = 300, as_plotly = TRUE)
JavaScript Library: trelliscopejs-lib
R Package: trelliscopejs
devtools::install_github("hafen/trelliscopejs")
country_model <- function(df)
  lm(lifeExp ~ year, data = df)

by_country <- gapminder %>%
  group_by(country, continent) %>%
  nest() %>%
  mutate(
    model = map(data, country_model),
    resid_mad = map_dbl(model, function(x) mad(resid(x))))

by_country
Example adapted from "R for Data Science"
# A tibble: 142 × 5
       country continent              data    model resid_mad
        <fctr>    <fctr>            <list>   <list>     <dbl>
1  Afghanistan      Asia <tibble [12 × 4]> <S3: lm> 1.4058780
2      Albania    Europe <tibble [12 × 4]> <S3: lm> 2.2193278
3      Algeria    Africa <tibble [12 × 4]> <S3: lm> 0.7925897
4       Angola    Africa <tibble [12 × 4]> <S3: lm> 1.4903085
5    Argentina  Americas <tibble [12 × 4]> <S3: lm> 0.2376178
6    Australia   Oceania <tibble [12 × 4]> <S3: lm> 0.7934372
7      Austria    Europe <tibble [12 × 4]> <S3: lm> 0.3928605
8      Bahrain      Asia <tibble [12 × 4]> <S3: lm> 1.8201766
9   Bangladesh      Asia <tibble [12 × 4]> <S3: lm> 1.1947475
10     Belgium    Europe <tibble [12 × 4]> <S3: lm> 0.2353342
# ... with 132 more rows
We'll use the rbokeh package to make a plot function and apply it to the first row of our data
country_plot <- function(data, model) {
  figure(xlim = c(1948, 2011), ylim = c(10, 95), tools = NULL) %>%
    ly_points(year, lifeExp, data = data, hover = data) %>%
    ly_abline(model)
}

country_plot(by_country$data[[1]], by_country$model[[1]])
by_country <- by_country %>%
  mutate(plot = map2_plot(data, model, country_plot))

by_country
# A tibble: 142 × 6
       country continent              data    model resid_mad         plot
        <fctr>    <fctr>            <list>   <list>     <dbl>       <list>
1  Afghanistan      Asia <tibble [12 × 4]> <S3: lm> 1.4058780 <S3: rbokeh>
2      Albania    Europe <tibble [12 × 4]> <S3: lm> 2.2193278 <S3: rbokeh>
3      Algeria    Africa <tibble [12 × 4]> <S3: lm> 0.7925897 <S3: rbokeh>
4       Angola    Africa <tibble [12 × 4]> <S3: lm> 1.4903085 <S3: rbokeh>
5    Argentina  Americas <tibble [12 × 4]> <S3: lm> 0.2376178 <S3: rbokeh>
6    Australia   Oceania <tibble [12 × 4]> <S3: lm> 0.7934372 <S3: rbokeh>
7      Austria    Europe <tibble [12 × 4]> <S3: lm> 0.3928605 <S3: rbokeh>
8      Bahrain      Asia <tibble [12 × 4]> <S3: lm> 1.8201766 <S3: rbokeh>
9   Bangladesh      Asia <tibble [12 × 4]> <S3: lm> 1.1947475 <S3: rbokeh>
10     Belgium    Europe <tibble [12 × 4]> <S3: lm> 0.2353342 <S3: rbokeh>
# ... with 132 more rows
A plot for each model
by_country %>%
trelliscope(name = "by_country_lm", nrow = 2, ncol = 4)
pokemon <- read_csv("http://bit.ly/plot_pokemon") %>%
  mutate_at(vars(matches("_id$")), as.character) %>%
  mutate(panel = img_panel(url_image))

pokemon
trelliscope(pokemon, name = "pokemon", nrow = 3, ncol = 6,
state = list(labels = c("pokemon", "pokedex")))
read_csv("http://bit.ly/trs-mri") %>%
  mutate(img = img_panel(img)) %>%
  trelliscope("brain_MRI", nrow = 2, ncol = 5)
(offline demo)
Observations: 44,509,207
Variables: 27
$ dn_number <chr> "05558306", "05559894", "05559900", "10660701", "...
$ birth_place <fct> Hospital, Hospital, Hospital, Hospital, Hospital,...
$ health_estbl_code <chr> "0000001", "0000009", "0000009", "0000006", "0001...
$ birth_muni_code <int> 110009, 110002, 110002, 110002, 120070, 120070, 1...
$ m_age_yrs <dbl> 25, 15, 35, 17, 31, 23, 24, 16, 15, 19, 19, 19, 2...
$ marital_status <fct> Single, Single, Married, Single, Single, Single, ...
$ m_educ <fct> 4 to 7 years, 1 to 3 years, 8 to 11 years, 1 to 3...
$ occ_code <chr> "00800", NA, NA, NA, "31000", "00800", "00800", N...
$ n_live_child <dbl> 1, NA, 2, 1, 1, 1, 1, NA, NA, 2, 3, NA, 4, NA, NA...
$ n_dead_child <dbl> 0, NA, NA, NA, 0, 0, 0, NA, NA, NA, NA, NA, 1, NA...
$ m_muni_code <int> 120040, 120010, 120025, 120040, 120070, 120070, 1...
$ gest_weeks <fct> 37-41 weeks, 37-41 weeks, 37-41 weeks, 37-41 week...
$ preg_type <fct> Singleton, Singleton, Singleton, Singleton, Singl...
$ deliv_type <fct> Vaginal, Vaginal, Cesarean, Vaginal, Vaginal, Vag...
$ n_prenatal_visit <fct> 1 - 3, 4 - 6, 7+, 1 - 3, 4 - 6, 1 - 3, 7+, 7+, 7+...
$ birth_date <date> 2001-02-20, 2001-03-30, 2001-06-07, 2001-12-05, ...
$ sex <fct> Male, Male, Male, Female, Female, Female, Male, F...
$ apgar1 <int> 8, 8, 8, 9, 7, 6, NA, NA, 3, 5, 5, 5, 5, 5, NA, 7...
$ apgar5 <int> 10, 10, 10, 10, 8, 8, NA, 8, 7, 8, 8, 9, 9, 10, 8...
$ race <fct> White, White, White, White, White, White, Multira...
$ brthwt_g <dbl> 3800, 3100, 3300, 3200, 3600, 3700, 3750, 3500, 3...
$ cong_anom <fct> No, No, No, No, NA, NA, No, No, No, No, No, No, N...
$ cong_icd10 <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
$ birth_year <int> 2001, 2001, 2001, 2001, 2001, 2001, 2001, 2001, 2...
$ m_state_code <chr> "AC", "AC", "AC", "AC", "AC", "AC", "AC", "AC", "...
$ birth_state_code <chr> "RO", "RO", "RO", "RO", "AC", "AC", "AC", "AC", "...
$ m_age_bin <fct> 20-29, 10-19, 30-39, 10-19, 30-39, 20-29, 20-29, ...
ggplot(by_muni_lbwt_time, aes(birth_year, pct_low_bwt)) +
  geom_point(size = 3, alpha = 0.6) +
  geom_line(stat = "smooth", method = rlm, color = "blue",
    size = 1, alpha = 0.5) +
  theme_bw() +
  labs(y = "Percent Low Birth Weight", x = "Year") +
  facet_trelliscope(~ state_name + muni_name,
    nrow = 2, ncol = 4, width = 400, height = 400,
    name = "pct_low_bwt_muni",
    desc = "percent low birth weight yearly by municipality")
(offline demo - for now)
Just because you can't look at every panel in a display doesn't mean a large display isn't useful or practical to make. It is in fact beneficial: you get an unprecedented level of detail in your displays, and every corner of your data can, at least conceptually, be viewed
One insight is all you need for a display to serve a purpose (provided it is quick to create)
We used the previous implementation of Trelliscope to visualize millions of subsets of terabytes of data
sparklyr is the natural solution
But we need a few things...
Most examples in this talk are reproducible after installing and loading the following packages:
install.packages(c("tidyverse", "gapminder", "rbokeh", "plotly"))
devtools::install_github("hafen/trelliscopejs")

library(tidyverse)
library(gapminder)
library(rbokeh)
library(trelliscopejs)
library(shiny)
library(ggplot2)
library(gapminder)

server <- function(input, output) {
  output$countryPlot <- renderPlot({
    qplot(year, lifeExp,
      data = subset(gapminder, country == input$country)) +
      xlim(1948, 2011) + ylim(10, 95) + theme_bw()
  })
}

choices <- sort(unique(gapminder$country))

ui <- fluidPage(
  titlePanel("Gapminder Life Expectancy"),
  sidebarLayout(
    sidebarPanel(
      selectInput("country", label = "Select country: ",
        choices = choices, selected = "Afghanistan")
    ),
    mainPanel(
      plotOutput("countryPlot", height = "500px")
    )
  )
)

runApp(list(ui = ui, server = server))
If you have an app that takes multiple inputs and produces a plot as output, the idea is simply to enumerate all possible input combinations as rows of a data frame, add the plot that corresponds to each row's parameters as a column, and display the result with Trelliscope
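As a rough sketch of that idea with two inputs, the example below imagines a hypothetical app that lets the user pick a continent and a gapminder variable. The function `plot_inputs` and the display name `gapminder_inputs` are made up for illustration; `crossing()` comes from tidyr, and `map2_plot()` and `trelliscope()` are the same trelliscopejs functions used elsewhere in this talk.

```r
library(tidyverse)
library(gapminder)
library(trelliscopejs)

# Hypothetical plot function: what the app would draw for one pair of inputs
plot_inputs <- function(cont, variable) {
  gapminder %>%
    filter(continent == cont) %>%
    ggplot(aes(year, .data[[variable]], group = country)) +
    geom_line() +
    theme_bw() +
    labs(y = variable)
}

# Enumerate every combination of inputs, one row per combination
inputs <- crossing(
  continent = levels(gapminder$continent),
  variable = c("lifeExp", "gdpPercap", "pop")
)

# Add the plot that corresponds to each row as a panel column
display <- inputs %>%
  mutate(panel = map2_plot(continent, variable, plot_inputs))

# View the result as a Trelliscope display (only in an interactive session)
if (interactive()) {
  trelliscope(display, name = "gapminder_inputs", nrow = 1, ncol = 3)
}
```

The input columns double as cognostics, so filtering on `continent` or `variable` in the viewer plays the role of the app's input controls.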
Trelliscope displays are most useful as exploratory plots to guide the data scientist (because they can be created rapidly)
However, in many cases Trelliscope displays can be used as interactive applications for end-users, domain experts, etc. with the bonus that they are much easier to create than a custom app
Turning a ggplot2 faceted display into a Trelliscope display is as easy as changing:
facet_wrap()
or:
facet_grid()
to:
facet_trelliscope()
With plots as columns, TrelliscopeJS provides nearly effortless detailed, flexible, interactive visualization in the Tidyverse
library(visNetwork)

nnodes <- 100
nnedges <- 1000

nodes <- data.frame(
  id = 1:nnodes,
  label = 1:nnodes,
  value = rep(1, nnodes))

edges <- data.frame(
  from = sample(1:nnodes, nnedges, replace = TRUE),
  to = sample(1:nnodes, nnedges, replace = TRUE)) %>%
  group_by(from, to) %>%
  summarise(value = n())

network_plot <- function(id, hide_select = TRUE) {
  style <- ifelse(hide_select,
    "visibility: hidden; position: absolute", "")

  visNetwork(nodes, edges) %>%
    visIgraphLayout(layout = "layout_in_circle") %>%
    visNodes(fixed = TRUE,
      scaling = list(min = 20, max = 50,
        label = list(min = 35, max = 70,
          drawThreshold = 1, maxVisible = 100))) %>%
    visEdges(scaling = list(min = 5, max = 30)) %>%
    visOptions(
      highlightNearest = list(enabled = TRUE, degree = 0,
        hideColor = "rgba(200,200,200,0.2)"),
      nodesIdSelection = list(selected = as.character(id),
        style = style))
}

network_plot(1, hide_select = FALSE)
nodedat <- edges %>%
  group_by(from) %>%
  summarise(n_nodes = n(), tot_conns = sum(value)) %>%
  rename(id = from) %>%
  arrange(-n_nodes) %>%
  mutate(panel = map_plot(id, network_plot))

nodedat
# A tibble: 100 × 4
      id n_nodes tot_conns            panel
   <int>   <int>     <int>           <list>
1     58      17        19 <S3: visNetwork>
2     45      16        17 <S3: visNetwork>
3      9      15        18 <S3: visNetwork>
4     31      15        16 <S3: visNetwork>
5     14      14        15 <S3: visNetwork>
6     42      14        15 <S3: visNetwork>
7     90      14        14 <S3: visNetwork>
8     21      13        14 <S3: visNetwork>
9     37      13        14 <S3: visNetwork>
10    43      13        13 <S3: visNetwork>
# ... with 90 more rows
We create a data frame with one row per node, using the number of nodes it connects to and its total number of connections as cognostics, and add a plot panel column
nodedat %>%
arrange(-n_nodes) %>%
trelliscope(name = "connections", nrow = 2, ncol = 4)