Data Visualization in Practice

Karl Ho

School of Economic, Political and Policy Sciences

University of Texas at Dallas

Overview

Module 1:

  1. Know your data
  2. Grammar of graphics
  3. Statistical judgment                  

Module 2:

  1. Functionalist approach:            
    1. Distribution       
    2. Composition
    3. Comparison
    4. Relationship

Module 3:

  1. Interactive charts                     
    1. Reactive
    2. Interactive
    3. Online publication

What is data visualization?

  • Data visualization delivers a message from your data.

  • It is like telling a story using a chart or a data application.

  • Sometimes the data is huge or the story is too long to tell.

  • Visualization provides an ability to comprehend huge amounts of data. The important information from more than a million measurements is immediately available.

What is data visualization?

  • Visualization often enables problems with the data to become immediately apparent.

  • A visualization commonly reveals things not only about the data itself but also about the way it is collected. With an appropriate visualization, errors and artifacts in the data often jump out at you.

  • Visualization facilitates understanding of both large-scale and small-scale features of the data. It can be especially valuable in allowing the perception of patterns linking local features.

What is data visualization?

  • Visualization facilitates hypothesis formation, inviting further inquiries into building a theory (Colin Ware 2012).

  • It is exploratory data analysis (EDA) but can also provide the tools for hypothesis confirmation.

What to visualize in data?

  1. Data Generating Process

  2. Property

  3. Distribution

  4. Pattern

  5. Differences

  6. Relationship

  7. Dimensionality

Elements of a Chart

  1. Dimensionality

    1. How many dimensions are there?

  2. Relationships

    1. Strength

    2. Fit

    3. Error bands

    4. Panels

What is data visualization?

  • Learn to read your data

    • Visual thinking
    • Educated eyes

Data Story:

Source: Yau 2011

  • Color
  • Font 
    • Size
    • Family
  • Axis
    • Vertical
    • Slant
  • Canvas
    • Size
    • Theme

Data Story:

Data

Message

Mechanical process

Data  (> / =)  Messenger  (> / =)  Message

Know your data: data types

  1. Numeric data

    1. Scale

      1. Nominal 
      2. Ordinal
      3. Interval
      4. Ratio
  2. Categories
  3. Events
  4. Time series

Quantitative vs. Qualitative Data

  1. Numbers vs. Labels

  2. Quantity vs. Quality

  3. Ordinal, Interval, Ratio vs. Nominal

  4. e.g. Yes/No -> Qualitative

  5. e.g. How much -> Quantitative

Quantitative vs. Qualitative Data

  1. Higher quantity means higher quality?

  2. Higher quality leads to higher quantity?

Time series data

  1. Nature

    1. Temporal dependency: non-stationarity, autocorrelation

    2. Periodicity: seasonality, cycle

  2. Zeros -> events?

  3. Scale linearity
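
The points above can be explored with a few lines of base R. This is a minimal sketch that assumes the built-in AirPassengers series as a stand-in for your own data; the choice of a log transform is an illustration of the scale question, not a general recommendation.

```r
# AirPassengers ships with base R (datasets): monthly totals, 1949-1960.
ap <- AirPassengers

op <- par(mfrow = c(2, 2))
plot(ap, main = "Raw series")               # trend plus widening seasonal swings
plot(log(ap), main = "Log scale")           # swings look roughly constant on a log scale
acf_ap <- acf(ap, main = "Autocorrelation") # slow decay suggests temporal dependency
plot(diff(log(ap)), main = "Differenced log series")  # trend removed by differencing
par(op)

frequency(ap)   # number of observations per cycle (periodicity)
```

The four panels are meant to be read together: the raw plot shows the trend and the seasonal cycle, and the autocorrelation plot shows how strongly each month depends on earlier months.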

Event count data

  1. Nature

    1. Distribution

    2. Bounds

      1. No upper bounds

      2. One lower bound: zero

    3. Zeros

  2. Continuous vs. discrete

  3. Intervals vs. duration
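
A quick simulation can make these properties visible. The sketch below assumes Poisson-distributed counts with an arbitrary rate of 1.5 events per interval; both the distribution and the rate are illustrative assumptions, not part of the slides.

```r
set.seed(42)                              # arbitrary seed, for reproducibility
counts <- rpois(n = 1000, lambda = 1.5)   # simulated event counts (assumed rate)

range(counts)        # lower bound is zero; there is no fixed upper bound
mean(counts == 0)    # share of intervals with no event at all

# Counts are discrete, so a bar plot of the frequency table suits them
barplot(table(counts),
        xlab = "Events per interval", ylab = "Frequency",
        main = "Simulated event counts")
```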

Anscombe example (1973)
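
Anscombe's quartet ships with base R as the data frame anscombe, so the example can be reproduced directly. The following is a minimal sketch: it computes a few summary statistics for each of the four sets and then plots them; the choice of statistics and the plot layout are my own, not taken from the slides.

```r
data(anscombe)   # columns x1..x4 and y1..y4

# Summary statistics for each of the four x/y pairs
stats <- sapply(1:4, function(i) {
  x <- anscombe[[paste0("x", i)]]
  y <- anscombe[[paste0("y", i)]]
  fit <- lm(y ~ x)
  c(mean_x = mean(x), mean_y = mean(y),
    cor = cor(x, y), slope = unname(coef(fit)[2]))
})
colnames(stats) <- paste0("set", 1:4)
print(round(stats, 2))   # the four columns are nearly identical

# The plots, however, tell four different stories
op <- par(mfrow = c(2, 2))
for (i in 1:4) {
  x <- anscombe[[paste0("x", i)]]
  y <- anscombe[[paste0("y", i)]]
  plot(x, y, pch = 19, main = paste("Set", i))
  abline(lm(y ~ x), col = "red")
}
par(op)
```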

Tufte: Same relationship? (2001)

Source: https://en.wikipedia.org/wiki/Charles_Joseph_Minard

One of the best data visualizations in history

How much information?

  1. Latitude of army and features (Y-coordinate)
  2. Longitude of army and features (X-coordinate)
  3. Size of army (width of line, numerals)
  4. Advance vs. retreat (color of line)
  5. Division of army (splitting of line)
  6. Temperature (linked line plot)
  7. Time (linked line plot)

Adelson's Checker-Shadow

Colors of A and B boxes different?

Coffer Illusion by Anthony Norcia

See any circles?  How many?

Know your data: Data Literacy

  1. Data generating process

  2. Graphic grammar

  3. Statistical judgement

 

  1. Data generating process

    1. How data are generated

    2. Distribution

    3. Missing values

    4. Wrong data

Know your data: Data Literacy

  1. Graphic grammar

    1. Bad charts deliver an incorrect message

    2. Poor design

    3. Color

    4. Label

    5. Scale

    6. Dimensionality

Know your data: Data Literacy

  1. Statistical understanding

    1. Size does (not) matter

    2. Representativeness does

    3. Forecast/prediction minded

    4. Explanation

Know your data: Data Literacy

Functional Approach

  1. What to plot?

    1. Quantitative/Numeric data
    2. Qualitative/Categorical data
  2. One variable: Univariate

    1. Distribution
    2. Composition
  3. Two or multiple: Multivariate

    1. Comparison
    2. Relationship

Thought starter

ggplot2

Author: Hadley Wickham (from New Zealand, RStudio Chief Scientist)

 

Based on Leland Wilkinson's book, The Grammar of Graphics (2005, available online at UTD library)

ggplot2 vs. lattice

  • ggplot2's default appearance of plots has been carefully chosen with visual perception in mind, like the defaults for lattice plots. The ggplot2 style may be more appealing to some people than the lattice style.

  • The arrangement of plot components and the inclusion of legends is automated. This is also like lattice, but the ggplot2 facility is more comprehensive and sophisticated.

 

Why ggplot2?

  • Although the conceptual framework in ggplot2 can take a little getting used to, once mastered, it provides a very powerful language for concisely expressing a wide variety of plots.

  • The ggplot2 package uses grid for rendering, which provides a lot of flexibility for annotating, editing, and embedding ggplot2 output.

Wickham on ggplot2

 Aim of grammar:

 

“bring together in a coherent way things that previously appeared unrelated and which also will provide a basis for dealing systematically with new situations”.

 

- D. R. Cox 1978

 

Wilkinson

Grammar gives language rules. 

The word stems from the Greek noun for letter or mark, γράμμα.

That derives from the Greek verb for writing, γράφω, which is the source of our English word graph.

Wilkinson

A grammar is a formal system of rules for generating lawful statements in a language.

The grammar of graphics goes beyond a limited set of charts (words) to an  unlimited world of graphical forms (statements). 

The rules of graphics grammar are sometimes mathematical and sometimes aesthetic.

Wilkinson

Mathematics provides symbolic tools for representing abstractions. 

Aesthetics, in the original Greek sense, offers principles for relating sensory attributes (color, shape, sound, etc.) to abstractions.

GG focuses on rules for constructing graphs mathematically and then representing them as graphics aesthetically.

Language of Graphics

A scatterplot is a point graphic embedded in a frame. 

A bar chart is an interval graphic bound to an aggregation function embedded in a frame.

A pie chart is a polar, stacked, interval graphic mapped on proportions.

A radar chart is a line graphic in polar parallel coordinates. 

A SPLOM (scatterplot matrix) is a crossing of nested scatterplots in rectangular coordinates. 

A trellis display is a graphic faceted on crossed categorical variables in a rectangular coordinate system.
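
These definitions translate almost word for word into ggplot2. The sketch below assumes the built-in mtcars data and my own choice of variables; it shows a stacked bar becoming a pie chart once the coordinate system is switched to polar, and a scatterplot becoming a trellis display once it is faceted.

```r
library(ggplot2)

# A stacked interval graphic: one bar whose segments count cylinder classes
bar <- ggplot(mtcars, aes(x = "", fill = factor(cyl))) +
  geom_bar(width = 1)

# The same graphic in polar coordinates is a pie chart
pie <- bar + coord_polar(theta = "y")

# A scatterplot: a point graphic embedded in a (Cartesian) frame
scatter <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point()

# A trellis display: the scatterplot faceted on crossed categorical variables
trellis <- scatter + facet_grid(am ~ cyl)

print(pie)
print(trellis)
```

Note that the pie chart needs no dedicated "pie" function: only the coordinate system changes, which is the point of describing charts by their grammar rather than by name.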

Wickham

A grammar of graphics is a tool that enables  concise description of the components of a graphic.

Such a grammar allows us to move beyond named graphics and gain insight into the deep structure that underlies statistical graphics.

 

Wickham on ggplot2

"layered grammar of graphics"

  1. develop a hierarchy of defaults based on Wilkinson

  2. embed a graphical grammar into a programming language.

  3. build on the grammar to learn how to create graphical “poems.”

 

ggplot2

Two ways of plotting a graphic:

  1. qplot() - quick plot

  2. ggplot() - grammar of graphics plot

ggplot2

Two ways of plotting a graphic:

  1. qplot() - quick plot                              

    1. very similar to plot()

    2. simple to use

    3. quick to produce basic graphs

ggplot2

Two ways of plotting a graphic:

  2. ggplot() - grammar of graphics plot

    1. fuller implementation of The Grammar of Graphics

    2. highly flexible for plotting graphs

    3. steep learning curve
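
The difference between the two interfaces is easiest to see side by side. This is a minimal sketch using the built-in mtcars data (my choice, not the slides'). One caveat worth knowing: qplot() has been deprecated since ggplot2 3.4.0, so it emits a deprecation warning in recent versions and ggplot() is the safer habit.

```r
library(ggplot2)

# Base R, for reference
plot(mtcars$wt, mtcars$mpg)

# qplot(): the call looks much like plot()
# (deprecated since ggplot2 3.4.0; expect a deprecation warning)
q <- qplot(wt, mpg, data = mtcars)

# ggplot(): data, aesthetic mapping, and geom are stated explicitly
g <- ggplot(data = mtcars, mapping = aes(x = wt, y = mpg)) +
  geom_point()

print(g)
```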

ggplot2

A ggplot2 plot is built by creating plot components, or layers, and combining them using the + operator (adding a layer).

It can combine several other important components that allow for more complex plots containing multiple groups, legends, and faceting.

  • Present data with layers of components

  • tidy data concepts (tidyverse.org)

  • Grammar structure enables graphical data analysis and exploration

ggplot2 structure

ggplot2 philosophy (Wickham)

  • Make graphics easier

  • Use the grammar to facilitate research into new types of display

  • Continuum of expertise:

    • start simple by using the results of the theory

    • grow in power by understanding the theory

    • begin to contribute new components

  • Orthogonal components and minimal special cases should make learning easy(er?)

ggplot2 components


Create a graphical display by combining building blocks including:

 

  • data

  • aesthetic mapping

  • geometric object

  • statistical transformations

  • scales

  • coordinate system

  • position adjustments

  • faceting

ggplot2 layers

  • data

  • aesthetic mapping

  • geometric object

Components of ggplot2

  1. data: R data frame
  2. coordinate system:  2-D space plot
  3. geoms: geometric objects representing data, e.g. points, lines, polygons, etc.
  4. aesthetics: visual characteristics, e.g. position, size, color, shape, transparency, fill
  5. scales: govern how visual characteristics are converted to display values, e.g. log scales, color scales, size scales, shape scales.
  6. stats: statistical data transformations, e.g. counts, means, medians, regression lines
  7. facets: split data into subsets to display as multiple graphs
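
The seven components can be seen together in a single plot specification. The sketch below assumes the built-in mtcars data and arbitrary variable choices; the numbered comments refer to the list above.

```r
library(ggplot2)

p <- ggplot(data = mtcars,                         # 1. data: an R data frame
            mapping = aes(x = wt, y = mpg,         # 4. aesthetics: position ...
                          color = factor(cyl))) +  #    ... and color
  geom_point(size = 2) +                           # 3. geom: points
  stat_smooth(method = "lm", formula = y ~ x,      # 6. stat: a regression line
              se = FALSE) +                        #    per color group
  scale_x_log10() +                                # 5. scale: log scale on x
  coord_cartesian() +                              # 2. coord: 2-D Cartesian space
  facet_wrap(~ am) +                               # 7. facets: one panel per am value
  labs(color = "cyl")

print(p)
```

Each line can be removed or swapped without touching the others, which is what "orthogonal components" means in practice.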

Core components 

Common geoms

Murrell, Paul. 2019. R Graphics.  CRC Press.

Scale components 

Common scales

Murrell, Paul. 2019. R Graphics.  CRC Press.

Stat components 

Common stats

Murrell, Paul. 2019. R Graphics.  CRC Press.

Coord components 

ggplot2 coord

  • The default coordinate system component, or coord, in ggplot2 is simple linear Cartesian coordinates, but this can be explicitly set to something else.

  • For example, the coord_trans() function explicitly transforms the variables to be plotted, e.g.:

coord_trans(x = "exp", y = "exp")

More coord examples

  • coord_cartesian(): default Cartesian coordinate system

  • coord_fixed(): Cartesian coordinates with fixed aspect ratio between x and y units

  • coord_flip(): Flipped Cartesian coordinates

  • coord_polar(): Polar coordinates

  • coord_trans(): Transformed Cartesian coordinates
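
A short sketch of how the coord component changes a plot while data, mapping, and geom stay the same. It assumes the built-in mtcars data; the zoom limits and the aspect ratio are arbitrary illustrative values.

```r
library(ggplot2)

bars <- ggplot(mtcars, aes(x = factor(cyl))) + geom_bar()
pts  <- ggplot(mtcars, aes(x = wt, y = mpg)) + geom_point()

bars_flipped <- bars + coord_flip()     # horizontal bars
bars_polar   <- bars + coord_polar()    # same bars drawn in polar coordinates
pts_fixed    <- pts + coord_fixed(ratio = 1 / 10)   # 1 unit of x = 10 units of y
pts_zoomed   <- pts + coord_cartesian(xlim = c(2, 4))  # zoom without dropping data

print(bars_flipped)
print(bars_polar)
```

Zooming with coord_cartesian() keeps all observations in the data, so any statistics are still computed on the full data set.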

Gapminder data set

The gapminder data set has 1704 rows and 6 variables:

  • country (factor): 142 levels

  • continent (factor): 5 levels

  • year: ranges from 1952 to 2007 in increments of 5 years

  • lifeExp: life expectancy at birth, in years

  • pop: population

  • gdpPercap: GDP per capita (US$, inflation-adjusted)
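
The description above can be checked directly, assuming the data come from the gapminder package on CRAN (the slides do not name the package, so treat that as an assumption).

```r
# install.packages("gapminder")   # run once if the package is missing
library(gapminder)

dim(gapminder)                    # rows and columns
str(gapminder)                    # variable names and types

nlevels(gapminder$country)        # number of countries
nlevels(gapminder$continent)      # number of continents
range(gapminder$year)             # first and last year
years <- sort(unique(gapminder$year))
unique(diff(years))               # spacing between survey years
```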

Plot type

  • "p" for points

  • "l" for lines

  • "b" for both

  • "c" for the lines part alone of "b"

  • "o" for both ‘overplotted’

  • "h" for ‘histogram’ like (or ‘high-density’) vertical lines

  • "s" for stair steps, moves first horizontal, then vertical

  • "S" for other steps, contrary to "s"

  • "n" for no plotting.
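
These codes are the values of the type argument of base R's plot(). The sketch below draws one small panel per type; the x and y values are made up purely for illustration.

```r
# Made-up data, only to show the shapes
x <- 1:10
y <- c(2, 4, 3, 6, 5, 8, 7, 9, 8, 10)

types <- c("p", "l", "b", "c", "o", "h", "s", "S", "n")

op <- par(mfrow = c(3, 3), mar = c(2, 2, 2, 1))
for (t in types) {
  plot(x, y, type = t, main = paste0('type = "', t, '"'))
}
par(op)
```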

 

Plot symbols (plot characters)

pch = 0, square
pch = 1, circle
pch = 2, triangle point up
pch = 3, plus
pch = 4, cross
pch = 5, diamond
pch = 6, triangle point down
pch = 7, square cross
pch = 8, star
pch = 9, diamond plus
pch = 10, circle plus
pch = 11, triangles up and down
pch = 12, square plus
pch = 13, circle cross
pch = 14, square and triangle down
pch = 15, filled square
pch = 16, filled circle
pch = 17, filled triangle point up
pch = 18, filled diamond
pch = 19, solid circle
pch = 20, bullet (smaller circle)
pch = 21, filled circle blue
pch = 22, filled square blue
pch = 23, filled diamond blue
pch = 24, filled triangle point up blue
pch = 25, filled triangle point down blue
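
A reference chart of all 26 symbols can be drawn in base R. In this sketch the grid layout is my own; the blue fill is supplied through the bg argument, which only affects symbols 21 to 25.

```r
pch_codes <- 0:25

# Lay the symbols out on a 7-column grid
gx <- (pch_codes %% 7) + 1
gy <- 4 - (pch_codes %/% 7)

op <- par(mar = c(1, 1, 2, 1))
plot(gx, gy, pch = pch_codes, cex = 2,
     col = "black", bg = "blue",          # bg fills symbols 21-25 only
     xlim = c(0.5, 7.5), ylim = c(0.5, 4.5),
     axes = FALSE, xlab = "", ylab = "",
     main = "Plot symbols: pch 0 to 25")
text(gx, gy - 0.35, labels = pch_codes, cex = 0.8)   # label each symbol
par(op)
```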

 

Plot symbols: PCH

Additional:
*

.
o
O
note: takes longer to plot

More on symbols (programming)

& - ampersand

' - apostrophe or single quote

* - asterisk

@ - at

{} - braces or curly brackets

[] - brackets

^ - caret

<> - angle brackets or chevron

~ - tilde

| - pipe 

# - pound

- - hyphen

Line type

Line types can be specified with:

  • An integer or name:
    0 = blank,
    1 = solid,
    2 = dashed,
    3 = dotted,
    4 = dotdash,
    5 = longdash,
    6 = twodash

Line type

  • The lengths of on/off stretches of line can be set with a string containing 2, 4, 6, or 8 hexadecimal digits (1 - f), which give the lengths of consecutive on and off segments.
  • For example, the string "33" specifies three units on followed by three off and "3313" specifies three units on followed by three off followed by one on and finally three off.

 

Line type

  • 44

  • 13

  • 1343

  • 73

  • 2262
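
The five strings above can be drawn directly, since base R graphics accept the same hexadecimal notation for lty. A minimal sketch, with an arbitrary layout:

```r
ltys <- c("44", "13", "1343", "73", "2262")

plot.new()
plot.window(xlim = c(0, 1), ylim = c(0.5, length(ltys) + 0.5))
for (i in seq_along(ltys)) {
  # each digit is the length of an "on" or "off" stretch, alternating
  segments(x0 = 0.2, y0 = i, x1 = 1, y1 = i, lty = ltys[i], lwd = 2)
  text(x = 0.05, y = i, labels = ltys[i])
}
title("Custom line types")
```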

ggplot()


creates a plot object, layer by layer

The plot object p shows no data until at least one layer is added; at this point, there is nothing to see!

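# install.packages(c("ggplot2", "gapminder"))  # run once if needed
library(ggplot2)
library(gapminder)
gm <- gapminder    # assumption: gm is the gapminder data frame
p <- ggplot(data = gm)                 # data only: no layers yet
p <- ggplot(data = gm,
            mapping = aes(x = gdpPercap,
                          y = lifeExp))
p + geom_point(size = 2)               # adding a layer draws the points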

Financial Times: Visual vocabulary

  1. Deviation

  2. Correlation

  3. Ranking

  4. Distribution

  5. Change over time

  6. Magnitude

  7. Part-to-Whole

  8. Spatial

  9. Flow

Data import: vroom

  • col_logical() 'l': containing only T, F, TRUE, FALSE, 1 or 0.

  • col_integer() 'i': integer values.

  • col_double() 'd': floating point values.

  • col_number() 'n': numbers containing the grouping_mark.

  • col_date(format = "") 'D': with the locale's date_format.

  • col_time(format = "") 't': with the locale's time_format.

  • col_datetime(format = "") 'T': ISO8601 date times.

  • col_factor(levels, ordered) 'f': a fixed set of values.

  • col_character() 'c': everything else.

  • col_skip() '_' or '-': don't import this column.

  • col_guess() '?': parse using the "best" type based on the input.
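
Column types can be given either as a compact string, one letter per column, or as explicit col_*() calls. The sketch below assumes the vroom package is installed and writes a small made-up CSV file to a temporary location; the column names and values are hypothetical.

```r
library(vroom)

# A small made-up file, written to a temporary path
tmp <- tempfile(fileext = ".csv")
writeLines(c("id,name,score,passed,joined",
             "1,Ann,90.5,TRUE,2020-01-15",
             "2,Bob,72.0,FALSE,2021-06-30"), tmp)

# Compact form: i = integer, c = character, d = double, l = logical, D = date
df1 <- vroom(tmp, delim = ",", col_types = "icdlD")

# Explicit form: the same specification, column by column
df2 <- vroom(tmp, delim = ",",
             col_types = cols(id     = col_integer(),
                              name   = col_character(),
                              score  = col_double(),
                              passed = col_logical(),
                              joined = col_date()))

sapply(df1, function(col) class(col)[1])   # resulting column classes
```

Stating the types up front avoids relying on col_guess(), which matters when an identifier column would otherwise be read as a number.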

Module 3

  • Creating a Dashboard using Shiny

  • Reactivity

  • Interactive charts using Plotly

Overview

  • What is Shiny?

  • Components of Shiny

  • Structure of Shiny app

  • Publishing Shiny app on GitHub

What is Shiny?

A Shiny app is a web page (the UI) connected to a computer running a live R session (the server).

Users can design the UI, which provides an interactive interface to visualize data (by running R code).

Shiny app = R + Interactivity + Web hosting

Presenting interactive data and charts

Shiny deployment

  • RStudio Shiny server (http://www.shinyapps.io)

    • Needs Shiny account (connect via GitHub/Google)

  • Pro Shiny account (commercial)

  • Install own Shiny server (https://github.com/rstudio/shiny-server)

    • Linux server
    • Free and open source

Shiny reference and resources

Components of Shiny

  1. User Interface (ui.R): The UI is the frontend that accepts user input values.

  2. Server function (server.R): The server is the backend that processes these input values to produce the output results that are displayed on the website.

  3. shinyApp function: The app itself, which combines the UI and server components.

Layout and interface

  • Design & explore UI framework
    • Inputs within the UI framework
    • Outputs within the UI framework
  • Assemble UI with HTML/CSS/... widgets
  • Adjustment of the layout scheme

Structure of Shiny program

# install.packages("shiny")
# install.packages("shinythemes")

library(shiny)
library(shinythemes)
# Create User Interface
ui <- fluidPage()

# Build R objects displayed in UI
server <- function(input, output) {}

# Create Shiny app
shinyApp(ui = ui, server = server)
  • ui: Nested R functions that assemble an HTML user interface for the app (some HTML knowledge needed)

  • server: A function with instructions on how to build and rebuild the R objects displayed in the UI

  • shinyApp: Combines ui and server into a functioning app

ui.R

  • Nested R functions that assemble an HTML user interface for the app

  • Example:


     
    • ui = creates the user interface object
      • The fluidPage() function creates the layout page that includes:
        • input 
        • output
library(shiny)
ui = fluidPage(
  numericInput(inputId = "n", "Sample size", value = 50),
  plotOutput(outputId = "hist")) 

server.R

  • Composed of R code to process input and generate output:

    • Example:


      • Read in the data from input (from ui.R)
      • Create the chart (i.e. histogram)
server = function(input, output) {
  output$hist = renderPlot({ hist(rnorm(input$n)) })
}

shinyApp()

  • Combine ui.R and server.R and execute

    • Example:

       
      • Output in plot window ready for publishing
shinyApp(ui = ui , server = server)
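
Putting the three pieces together gives a complete single-file app. This is a minimal sketch built from the ui and server fragments above; the min argument, the req() guard, and the plot title are my additions, and the app is only launched when R is running interactively.

```r
library(shiny)

# UI: one numeric input and one plot output
ui <- fluidPage(
  numericInput(inputId = "n", label = "Sample size", value = 50, min = 1),
  plotOutput(outputId = "hist")
)

# Server: rebuilds the histogram whenever input$n changes
server <- function(input, output) {
  output$hist <- renderPlot({
    req(input$n)   # wait until the input holds a usable value
    hist(rnorm(input$n), main = paste("n =", input$n))
  })
}

# Combine UI and server into an app object
app <- shinyApp(ui = ui, server = server)

# Launch only in an interactive session (e.g. the RStudio console)
if (interactive()) runApp(app)
```

Saving this as app.R in its own folder is the single-file alternative to keeping separate ui.R and server.R files.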

Reactivity

  • Reactive values work together with reactive functions. Call a reactive value from within the arguments of one of these functions to avoid the error "Operation not allowed without an active reactive context."
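
The rule is easier to see in code. In this sketch, input$n is only ever read inside reactive(), renderPlot(), or renderPrint(), which supply the reactive context; reading it directly in the body of the server function would trigger the error quoted above. The widget choices and the name sample_data are hypothetical.

```r
library(shiny)

ui <- fluidPage(
  sliderInput(inputId = "n", label = "Sample size",
              min = 10, max = 500, value = 100),
  plotOutput(outputId = "hist"),
  verbatimTextOutput(outputId = "stats")
)

server <- function(input, output) {
  # Not allowed here: n <- input$n  (no active reactive context)

  # reactive() creates a cached expression that re-runs when input$n changes
  sample_data <- reactive({
    rnorm(input$n)
  })

  # Both outputs share the same sample, because they call the same reactive
  output$hist  <- renderPlot({ hist(sample_data()) })
  output$stats <- renderPrint({ summary(sample_data()) })
}

app <- shinyApp(ui = ui, server = server)
if (interactive()) runApp(app)
```

Because sample_data() is computed once per change of input$n, the histogram and the summary always describe the same random draw.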

Inputs

fileInput(inputId, label, multiple, accept)
numericInput(inputId, label, value, min, max, step)
passwordInput(inputId, label, value)
radioButtons(inputId, label, choices, selected, inline)
selectInput(inputId, label, choices, selected, multiple, selectize, width, size) (also selectizeInput())
sliderInput(inputId, label, min, max, value, step, round, ticks, animate, width, sep, pre, post)
