Data Visualization
with ggplot2

Joel Ross
Winter 2017

INFO 201

https://slides.com/joelross/info201w17-ggplot2/live

Today's Objectives

By the end of class, you should be able to

  • Describe visualizations using the Grammar of Graphics
  • Use ggplot2 to draw beautiful data charts
  • Organize data in the proper shape

Why create graphical visualizations of data?

What's wrong with tables?

How would you describe this chart?

Grammar of Data Manipulation

Words (verbs) used to describe ways to manipulate data:

  • Select the columns of interest
  • Filter out irrelevant data to keep rows of interest
  • Mutate a data set by adding more columns
  • Arrange the rows in a data set
  • Summarize the data (e.g., mean, median, max)
  • Group the data by category
  • Join multiple data sets together
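As a rough sketch, these verbs map onto functions in the dplyr package. The example below assumes dplyr is installed and borrows the `mpg` data set from ggplot2; the names `efficiency`, `avg_mpg`, and `mean_mpg` are made up for illustration.

```r
library("dplyr")    # assumed installed; provides the verbs
library("ggplot2")  # loaded here only for the `mpg` data set

efficiency <- mpg %>%
  select(manufacturer, class, cty, hwy) %>%  # Select the columns of interest
  filter(class != "2seater") %>%             # Filter to keep rows of interest
  mutate(avg_mpg = (cty + hwy) / 2) %>%      # Mutate: add a new column
  group_by(class) %>%                        # Group the data by category
  summarize(mean_mpg = mean(avg_mpg)) %>%    # Summarize each group
  arrange(desc(mean_mpg))                    # Arrange the rows, largest first

print(efficiency)
```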

Grammar of Graphics

Words used to describe the visual components and aspects of a graphic.

  • Data shown in the plot
  • Geometric objects (geoms) that appear on the plot
  • Aesthetic mappings from the data to the geoms
  • Statistical transformation used to calculate the data
  • Scales (range of values) for each aesthetic
  • Coordinate system to organize the geoms
  • Facets or groups of data shown in different plots

Layers

Organize plots into layers, where each layer has:

  • A geometric object
  • A set of aesthetic mappings
  • A statistical transformation
  • A position adjustment
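One way to see the four parts of a layer is to spell them out with ggplot2's lower-level layer() function. This is only a sketch: in practice the geom_ functions fill in the stat and position for you.

```r
library("ggplot2")

# A single layer with all four components written out explicitly
p <- ggplot(data = mpg) +
  layer(
    geom = "point",                     # geometric object
    mapping = aes(x = displ, y = hwy),  # aesthetic mappings
    stat = "identity",                  # statistical transformation (none)
    position = "identity"               # position adjustment (none)
  )

print(p)
```

This should draw the same scatterplot as geom_point(mapping = aes(x = displ, y = hwy)).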

How to describe with Grammar of Graphics?

ggplot2

ggplot2 is an R package (library) that implements this Grammar of Graphics.
It provides declarative functions for specifying plots in terms of the grammar.

install.packages("ggplot2")  # once per machine
library("ggplot2")  # load the package

Plotting with ggplot2

Use the ggplot() function to draw a plot, specifying plot elements via the grammar.

# plot the `mpg` data set, with engine displacement (power)
# on the x axis and highway mileage on the y axis:
ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy))

  • ggplot(data = mpg): data to plot
  • + : add geometry
  • geom_point(): geometric objects (points)
  • aes(): aesthetic mappings
  • x = displ: property = column

Aesthetics

The aes() function specifies aesthetic mappings from data values to visual channels.

# color the data by car type
ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy, color = class))

x-location based on displ column (continuous)
color based on class column (discrete)

Visual channels can also be set to a fixed value without mapping, by passing them outside aes().

# blue points!
ggplot(data = mpg) +
  geom_point(aes(x = displ, y = hwy), color = "blue")
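A common slip is to put the fixed value inside aes(). The sketch below compares the two; the variable names `p_set` and `p_mapped` are just for illustration, and layer_data() is used to peek at the colors ggplot2 actually computes.

```r
library("ggplot2")

# Setting: color is outside aes(), so every point is drawn in blue
p_set <- ggplot(data = mpg) +
  geom_point(aes(x = displ, y = hwy), color = "blue")

# Mapping: color is inside aes(), so "blue" is treated as a data value
# and ggplot2 assigns it a color from the default scale (plus a legend)
p_mapped <- ggplot(data = mpg) +
  geom_point(aes(x = displ, y = hwy, color = "blue"))

# Inspect the colors each plot ends up using
unique(layer_data(p_set)$colour)
unique(layer_data(p_mapped)$colour)
```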

Geoms

ggplot2 supports many different geoms, each created with a function. Each geom requires/supports different aesthetics.

# line chart of mileage by engine power
ggplot(data = mpg) +
  geom_line(mapping = aes(x = displ, y = hwy))

# bar chart of car type
ggplot(data = mpg) +
  geom_bar(mapping = aes(x = class))

no y mapping: counts are aggregated automatically

Each plot can include multiple geoms, which inherit data and aesthetics unless specified otherwise.

# specify the aesthetics in each geom...
ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy)) +
  geom_smooth(mapping = aes(x = displ, y = hwy), se = FALSE)

# ...or once in ggplot(), so that every geom inherits them
ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
  geom_point() +
  geom_smooth(se = FALSE)
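A small sketch of the "unless specified otherwise" part: a geom can add to (or override) the inherited aesthetics. Here only the points are colored by class, while the smoothed line still uses the shared x and y.

```r
library("ggplot2")

p <- ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
  geom_point(mapping = aes(color = class)) +  # inherits x and y, adds color
  geom_smooth(se = FALSE)                     # inherits x and y only

print(p)
```

Because color is mapped only in geom_point(), a single trend line is drawn rather than one per car type.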

FORK and clone the repo
to turn in for participation

Statistical Transformation

Many geoms have a default statistical transformation used to calculate new data to plot (e.g., for bar graphs).

# bar chart of car type
ggplot(data = mpg) +
  geom_bar(mapping = aes(x = class), stat="count")

explicit "count" for y

Each geom has a default stat, and each stat_ function has a default geom, so the two can often be used interchangeably.

# these two charts are identical
ggplot(data = mpg) +
  geom_bar(mapping = aes(x = class))

ggplot(data = mpg) +
  stat_count(mapping = aes(x = class))
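To see what the "count" stat is doing, you could compute the counts yourself and turn the transformation off with stat = "identity". A sketch, where `class_counts` is an illustrative name and `Freq` is the column name that as.data.frame() gives a table:

```r
library("ggplot2")

# Count the cars in each class by hand (columns: class, Freq)
class_counts <- as.data.frame(table(class = mpg$class))

# Plot the precomputed counts as-is: no statistical transformation
p_identity <- ggplot(data = class_counts) +
  geom_bar(mapping = aes(x = class, y = Freq), stat = "identity")

# Let geom_bar() do the counting (its default stat)
p_count <- ggplot(data = mpg) +
  geom_bar(mapping = aes(x = class))

print(p_identity)
```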

Position Adjustment

Many geoms have a default position adjustment, used to lay out the plot separately from the aesthetic mappings.

# bar chart of mileage, colored by car type (stacked by default)
ggplot(data = mpg) +
  geom_bar(mapping = aes(x = hwy, fill = class))

# the same chart, with each bar filled to the same height (proportions)
ggplot(data = mpg) +
  geom_bar(mapping = aes(x = hwy, fill = class), position = "fill")
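Another adjustment worth trying is "dodge", which places the bars for each car type side by side instead of stacking them. A sketch (the names `p_stack` and `p_dodge` are illustrative):

```r
library("ggplot2")

# Default for geom_bar(): bars for each class are stacked
p_stack <- ggplot(data = mpg) +
  geom_bar(mapping = aes(x = hwy, fill = class), position = "stack")

# Dodge: bars for each class sit next to each other, all starting at 0
p_dodge <- ggplot(data = mpg) +
  geom_bar(mapping = aes(x = hwy, fill = class), position = "dodge")

print(p_dodge)
```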

Scales

Add scales to a plot to determine the range of (aesthetic) values that data should map to, replacing the defaults.

# city/highway mileage relationship
ggplot(data = mpg) +
  geom_point(mapping = aes(x = cty, y = hwy, color = class)) +
  scale_x_reverse() +  # reverse x axis
  scale_color_hue(l = 70, c = 30)  # custom color scale

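scale_x_reverse(): "x" is the aesthetic to scale, "reverse" is the scale to use

[Figure: a scale maps data values (1, 2, 3, 4, 5) to aesthetic values ("red", "yellow", "blue", "green", "purple")]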
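The data-to-aesthetic mapping can also be written out by hand with a manual scale. A sketch using the `drv` column of `mpg`; the particular colors and the name `drive_colors` are arbitrary choices for illustration.

```r
library("ggplot2")

# Each data value (drive type) is paired with the color it should map to
drive_colors <- c("4" = "red", "f" = "blue", "r" = "purple")

p <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy, color = drv)) +
  scale_color_manual(values = drive_colors)  # replace the default color scale

print(p)
```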

ColorBrewer Scales

Use palettes from colorbrewer.org to specify color schemes, many of which are color-blind safe.

# efficiency by engine size, colored nicely
ggplot(data = mpg) +
  geom_point(aes(x = displ, y = hwy, color = class), size=4) +
  scale_color_brewer(palette = "Set3")

Coordinate System

You can also add a specific coordinate system to a plot.

# horizontal bar chart of mileage, colored by car type
ggplot(data = mpg) +
  geom_bar(mapping = aes(x = hwy, fill = class)) +
  coord_flip()

# A pie chart = stacked bar chart + polar coordinates
ggplot(mpg, aes(x = factor(1), fill = factor(cyl))) +
  geom_bar(width = 1) +
  coord_polar(theta = "y")

make numeric vector into factor

angle based on (aggregate) "y"

Facets

Break a plot into parts with facets (similar to group_by() in dplyr). Each facet acts like a "level" in a factor, with a plot for each level.

# a plot with facets based on vehicle type.
ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy)) +
  facet_wrap(~class)

A formula (~), read as "as a function of"
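The formula can also have two sides. As a sketch, facet_grid() lays the plots out in a grid, with rows for the left-hand variable and columns for the right-hand one (drive type and cylinder count here are just example choices):

```r
library("ggplot2")

# rows "as a function of" columns: one row per drv, one column per cyl
p <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy)) +
  facet_grid(drv ~ cyl)

print(p)
```

Combinations with no cars simply appear as empty panels.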

What if we want to facet by exam?

Consider a data set...

Data Shape

Wide Data: 6 rows x 4 cols = 24 scores

Long Data: 24 rows x 1 col = 24 scores

Data Shape

We can convert between wide and long data using the tidyr package.

# Alternatively, install "tidyverse"
install.packages("tidyr")  # once per machine
library("tidyr")



# Make a data.frame (example) in wide format: one column per exam
students.wide <- data.frame(
  name = c('Mason', 'Tabi', 'Bryce', 'Ada', 'Bob', 'Filipe'),
  section = c('a', 'a', 'a', 'b', 'b', 'b'),
  math_exam1 = c(91, 82, 93, 100, 78, 91),
  math_exam2 = c(88, 79, 77, 99, 88, 93),
  spanish_exam1 = c(79, 88, 92, 83, 87, 77),
  spanish_exam2 = c(99, 92, 92, 82, 85, 95)
)

Data Shape

students.long <- gather(students.wide, 
                        key = exam, 
                        value = score, 
                        math_exam1, math_exam2, 
                        spanish_exam1, spanish_exam2
                       )

Convert from wide to long using gather(). The key is a new column containing the gathered column names, and value is a new column containing their values.
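Once the data is long, "exam" is an ordinary column, so faceting by exam becomes possible. A self-contained sketch (the bar chart is just one possible choice of plot):

```r
library("ggplot2")
library("tidyr")

# The example data in wide format: one column per exam
students.wide <- data.frame(
  name = c('Mason', 'Tabi', 'Bryce', 'Ada', 'Bob', 'Filipe'),
  section = c('a', 'a', 'a', 'b', 'b', 'b'),
  math_exam1 = c(91, 82, 93, 100, 78, 91),
  math_exam2 = c(88, 79, 77, 99, 88, 93),
  spanish_exam1 = c(79, 88, 92, 83, 87, 77),
  spanish_exam2 = c(99, 92, 92, 82, 85, 95)
)

# Wide to long: one row per student per exam
students.long <- gather(students.wide, key = exam, value = score,
                        math_exam1, math_exam2,
                        spanish_exam1, spanish_exam2)

# One panel per exam, one bar per student
p <- ggplot(data = students.long) +
  geom_bar(mapping = aes(x = name, y = score), stat = "identity") +
  facet_wrap(~exam)

print(p)
```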

# spread by column "exam"
stu.wide <- spread(students.long, key = exam, value = score)

# spread by column "name"
stu.wide.name <- 
         spread(students.long, key = name, value = score)

Convert from long to wide using spread(). The key is the column that supplies the new column names, and value is the column that supplies their values.

names for new columns

col data to populate with
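Newer versions of tidyr (1.0.0 and later) also offer pivot_longer() and pivot_wider(), which play the same roles as gather() and spread(). A sketch of the round trip, assuming such a version is installed; `long` and `wide_again` are illustrative names.

```r
library("tidyr")  # pivot_longer()/pivot_wider() assume tidyr >= 1.0.0

students.wide <- data.frame(
  name = c('Mason', 'Tabi', 'Bryce', 'Ada', 'Bob', 'Filipe'),
  section = c('a', 'a', 'a', 'b', 'b', 'b'),
  math_exam1 = c(91, 82, 93, 100, 78, 91),
  math_exam2 = c(88, 79, 77, 99, 88, 93),
  spanish_exam1 = c(79, 88, 92, 83, 87, 77),
  spanish_exam2 = c(99, 92, 92, 82, 85, 95)
)

# Wide to long (like gather): exam columns become exam/score pairs
long <- pivot_longer(students.wide,
                     cols = math_exam1:spanish_exam2,
                     names_to = "exam",
                     values_to = "score")

# Long to wide (like spread): one column per exam again
wide_again <- pivot_wider(long, names_from = exam, values_from = score)

print(wide_again)
```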

Questions on anything so far?

Action Items!

  • Be comfortable with module 13

  • Assignment 5 due Thursday before class

    • (Assignment 6 online soon)


Thursday: What makes a good visualization? Also maps.
