Social and Political Data Science: Introduction

Data Visualization

Karl Ho

School of Economic, Political and Policy Sciences

University of Texas at Dallas

Introduction to ggplot2

ggplot2

Author: Hadley Wickham (from New Zealand, RStudio Chief Scientist)

 

Based on Leland Wilkinson's book, The Grammar of Graphics (2005, available online at UTD library)

ggplot2 vs. lattice

  • ggplot2's default appearance of plots has been carefully chosen with visual perception in mind, like the defaults for lattice plots. The ggplot2 style may be more appealing to some people than the lattice style.

  • The arrangement of plot components and the inclusion of legends is automated. This is also like lattice, but the ggplot2 facility is more comprehensive and sophisticated.

 

Why ggplot2?

  • Although the conceptual framework in ggplot2 can take a little getting used to, once mastered, it provides a very powerful language for concisely expressing a wide variety of plots.

  • The ggplot2 package uses grid for rendering, which provides considerable flexibility for annotating, editing, and embedding ggplot2 output.

Wickham on ggplot2

 Aim of grammar:

 

“bring together in a coherent way things that previously appeared unrelated and which also will provide a basis for dealing systematically with new situations”.

 

- D. R. Cox 1978

 

Wilkinson

Grammar gives language rules. 

The word stems from the Greek noun for letter or mark \(\gamma\rho\acute{\alpha}\mu\mu\alpha\).

That derives from the Greek verb for writing \(\gamma\rho\acute{\alpha}\phi\omega\), which is the source of our English word graph.

Wilkinson

A grammar is a formal system of rules for generating lawful statements in a language.

The grammar of graphics goes beyond a limited set of charts (words) to an unlimited world of graphical forms (statements).

The rules of graphics grammar are sometimes mathematical and sometimes aesthetic.

Wilkinson

Mathematics provides symbolic tools for representing abstractions. 

Aesthetics, in the original Greek sense, offers principles for relating sensory attributes (color, shape, sound, etc.) to abstractions.

GG focuses on rules for constructing graphs mathematically and then representing them as graphics aesthetically.

Wickham

A grammar of graphics is a tool that enables concise description of the components of a graphic.

Such a grammar allows us to move beyond named graphics and gain insight into the deep structure that underlies statistical graphics.

 

Wickham on ggplot2

"layered grammar of graphics"

  1. develop a hierarchy of defaults based on Wilkinson

  2. embed a graphical grammar into a programming language.

  3. build on the grammar to learn how to create graphical “poems.”

 

ggplot2

Two ways of plotting a graphic:

  1. qplot() - quick plot (deprecated since ggplot2 3.4.0)

  2. ggplot() - grammar of graphics plot

  1. qplot() - quick plot                              

    1. very similar to plot()

    2. simple to use

    3. quick to produce basic graphs

  2. ggplot() - grammar of graphics plot

    1. fuller implementation of The Grammar of Graphics

    2. highly flexible for plotting graphs

    3. steep learning curve

ggplot2

A ggplot2 plot is built by creating plot components, or layers, and combining them with the + operator (adding layers).

Several other important components can be combined with the layers to build more complex plots with multiple groups, legends, and facetting.

  • Present data with layers of components

  • tidy data concepts (tidyverse.org)

  • Grammar structure enables graphical data analysis and exploration
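As a minimal sketch of combining layers with the + operator, the snippet below uses R's built-in mtcars data (an assumption made here so that it runs without any extra data package). layer_data() is used only to inspect what each layer will draw.

```r
library(ggplot2)

# Start with data and aesthetic mappings only: no layers yet
p <- ggplot(data = mtcars, mapping = aes(x = wt, y = mpg))

# Add two layers with the + operator: points, then a fitted line
p2 <- p +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x)

# layer_data() returns the data a given layer will draw
points_drawn <- layer_data(p2, 1)  # layer 1: one row per car
line_drawn   <- layer_data(p2, 2)  # layer 2: points along the fitted line

print(p2)
```

Each + returns a new plot object, so p itself is left unchanged and can be reused with other layers.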

ggplot2 structure

ggplot2 philosophy (Wickham)

  • Make graphics easier

  • Use the grammar to facilitate research into new types of display

  • Continuum of expertise:

    • start simple by using the results of the theory

    • grow in power by understanding the theory

    • begin to contribute new components

  • Orthogonal components and minimal special cases should make learning easy(er?)

ggplot2 components


Create a graphical display by combining building blocks, including:

 

  • data

  • aesthetic mapping

  • geometric object

  • statistical transformations

  • scales

  • coordinate system

  • position adjustments

  • facetting

ggplot2 layers

  • data

  • aesthetic mapping

  • geometric object

Components of ggplot2

  1. data: R data frame
  2. coordinate system:  2-D space plot
  3. geoms: geometric objects representing data, e.g. points, lines, polygons, etc.
  4. aesthetics: visual characteristics, e.g. position, size, color, shape, transparency, fill
  5. scales: govern how data values are converted to visual characteristics, e.g. log scales, color scales, size scales, shape scales.
  6. stats: statistical data transformations, e.g. counts, means, medians, regression lines
  7. facets: split data into subsets to display as multiple graphs
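The seven components listed above can all appear in a single plot. The sketch below is one possible illustration, using the built-in mtcars data (an assumption, not the course data); the comment numbers refer to the list above.

```r
library(ggplot2)

p <- ggplot(data = mtcars,                        # 1. data: an R data frame
            mapping = aes(x = wt, y = mpg,        # 4. aesthetics: position ...
                          color = factor(cyl))) + #    ... and color
  geom_point(size = 2) +                          # 3. geom: points
  stat_smooth(method = "lm", formula = y ~ x,
              se = FALSE) +                       # 6. stat: regression lines
  scale_y_log10() +                               # 5. scale: log scale for y
  coord_cartesian() +                             # 2. coordinate system
  facet_wrap(~ am)                                # 7. facets: one panel per am value

# Data that the point layer will draw, after scales and facets are applied
d <- layer_data(p, 1)

print(p)
```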

Core components 

Common geoms

Murrell, Paul. 2019. R Graphics.  CRC Press.

Scale components 

Common scales

Murrell, Paul. 2019. R Graphics.  CRC Press.

Stat components 

Common stats

Murrell, Paul. 2019. R Graphics.  CRC Press.

Coord components 

ggplot2 coord

  • The default coordinate system component, or coord, in ggplot2 is simple linear Cartesian coordinates, but this can be explicitly set to something else.

  • For example, the coord_trans() function explicitly transforms the variables to be plotted, e.g.:

coord_trans(x="exp", y="exp")

More coord examples

  • coord_cartesian(): default Cartesian coordinate system

  • coord_fixed(): Cartesian coordinates with fixed aspect ratio between x and y units

  • coord_flip(): flipped Cartesian coordinates

  • coord_polar(): polar coordinates

  • coord_trans(): transformed Cartesian coordinates
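A short sketch of some of the coords listed above, using the built-in mtcars data (an assumption for illustration). The second half contrasts a scale transformation with a coord transformation: a scale transforms the data before any statistics are computed, while a coord leaves the data alone and only changes how they are drawn.

```r
library(ggplot2)

# A bar chart of counts, then the same layer under different coords
bars <- ggplot(mtcars, aes(x = factor(cyl))) + geom_bar()
bars_flipped <- bars + coord_flip()    # horizontal bars
bars_polar   <- bars + coord_polar()   # same counts on polar coordinates

# Scale transformation vs. coord transformation of the x variable
base <- ggplot(mtcars, aes(x = disp, y = mpg)) + geom_point()
p_scale <- base + scale_x_log10()          # data are log10-transformed
p_coord <- base + coord_trans(x = "log10") # data unchanged, drawing transformed

x_scale <- layer_data(p_scale, 1)$x   # log10(disp)
x_coord <- layer_data(p_coord, 1)$x   # disp on its original scale
```

The difference matters most when a stat such as a regression line is added: with scale_x_log10() the line is fitted to the logged data, with coord_trans() it is fitted first and bent afterwards.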

Thought Starter

Gapminder data set

The gapminder data set has 1704 rows and 6 variables:

country (factor) - 142 levels

continent (factor) - 5 levels

year - ranges from 1952 to 2007 in increments of 5 years

lifeExp - life expectancy at birth, in years

pop - population

gdpPercap - GDP per capita (US$, inflation-adjusted)

Gapminder data set

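install.packages("gapminder")  # only needed once
library(gapminder)
gm <- gapminder      # shorter name used in the examples that follow
head(gm)             # first six rows
summary(gm)          # summary statistics for each variable
table(gm$country)    # number of rows per country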

plot() in graphics package

# Plot one variable
hist(gm$lifeExp)
# Plot two variables (2007 only) with a log-scaled x axis
plot(lifeExp ~ gdpPercap, gm, subset = year == 2007, log = "x", pch=20)

Plot type

  • "p" for points

  • "l" for lines

  • "b" for both

  • "c" for the lines part alone of "b"

  • "o" for both ‘overplotted’

  • "h" for ‘histogram’ like (or ‘high-density’) vertical lines

  • "s" for stair steps, moves first horizontal, then vertical

  • "S" for other steps, contrary to "s"

  • "n" for no plotting.
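To compare the plot types side by side, the following sketch draws the same small series once per type. The x and y values are made up for illustration only.

```r
# Draw the same series with each of the plot types
x <- 1:10
y <- c(2, 4, 3, 6, 5, 8, 7, 9, 8, 10)   # made-up values for illustration
types <- c("p", "l", "b", "c", "o", "h", "s", "S", "n")

op <- par(mfrow = c(3, 3), mar = c(2, 2, 2, 1))  # 3 x 3 grid of panels
for (t in types) {
  plot(x, y, type = t, main = paste0('type = "', t, '"'))
}
par(op)  # restore the previous graphics settings
```

The "n" panel is empty by design: it sets up the axes so that points or lines can be added afterwards.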

 

Plot symbols (plot characters)

pch = 0,square
pch = 1,circle
pch = 2,triangle point up
pch = 3,plus
pch = 4,cross
pch = 5,diamond
pch = 6,triangle point down
pch = 7,square cross
pch = 8,star
pch = 9,diamond plus
pch = 10,circle plus
pch = 11,triangles up and down
pch = 12,square plus
pch = 13,circle cross

pch = 14,square and triangle down
pch = 15, filled square
pch = 16, filled circle
pch = 17, filled triangle point-up
pch = 18, filled diamond
pch = 19, solid circle
pch = 20,bullet (smaller circle)
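pch = 21, filled circle (fill color set with bg)
pch = 22, filled square (fill color set with bg)
pch = 23, filled diamond (fill color set with bg)
pch = 24, filled triangle point-up (fill color set with bg)
pch = 25, filled triangle point down (fill color set with bg)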
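A quick way to see all 26 symbols is to draw them on a grid. In this sketch col sets the outline color and bg the fill color; bg only has an effect for pch 21 to 25. The grid layout and the color names are arbitrary choices for illustration.

```r
# Show symbols 0 to 25 on a grid; col is the outline color, bg the fill
pch_values <- 0:25
gx <- (pch_values %% 7) + 1        # column position, 1 to 7
gy <- 4 - (pch_values %/% 7)       # row position, top row first

plot(gx, gy, pch = pch_values, cex = 2,
     col = "black", bg = "steelblue",   # bg only affects pch 21 to 25
     xlim = c(0.5, 7.5), ylim = c(0.5, 4.8),
     axes = FALSE, xlab = "", ylab = "", main = "pch = 0 to 25")

# Label each symbol with its pch number
text(gx, gy, labels = pch_values, pos = 3, offset = 1, cex = 0.8)
```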

 

Plot symbols: PCH

Additional: a single character can also be used as the plotting symbol, e.g. pch = "*", ".", "o", "O"

Note: takes longer to plot

More on symbols (programming)

& - ampersand

' - apostrophe or single quote

* - asterisk

@ - at

{} - braces or curly brackets

[] - brackets

^ - caret

<> - angle brackets or chevrons

~ - tilde

| - pipe 

# - pound

- - hyphen

Line type

Line types can be specified with:

  • An integer or name:
    0 = blank,
    1 = solid,
    2 = dashed,
    3 = dotted,
    4 = dotdash,
    5 = longdash,
    6 = twodash
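The integer codes and the names are interchangeable, so lty = 2 and lty = "dashed" draw the same line. A small sketch that draws all seven types with their labels:

```r
lty_names <- c("blank", "solid", "dashed", "dotted",
               "dotdash", "longdash", "twodash")
lty_codes <- 0:6

# Empty plot area, then one horizontal line per line type
plot(0, 0, type = "n", xlim = c(0, 1.5), ylim = c(-0.5, 6.5),
     axes = FALSE, xlab = "", ylab = "", main = "Line types 0 to 6")
for (i in seq_along(lty_codes)) {
  segments(0, lty_codes[i], 1, lty_codes[i], lty = lty_names[i], lwd = 2)
  text(1.02, lty_codes[i], paste(lty_codes[i], "=", lty_names[i]), pos = 4)
}
```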

Line type

  • The lengths of on/off stretches of line can be specified with a string containing 2, 4, 6, or 8 hexadecimal digits (1 - f), each giving the length of one consecutive on or off stretch.
  • For example, the string "33" specifies three units on followed by three off and "3313" specifies three units on followed by three off followed by one on and finally three off.

 

Line type

  • 44

  • 13

  • 1343

  • 73

  • 2262
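The pattern strings can be read one hexadecimal digit at a time. The sketch below decodes a pattern into its on/off lengths and then draws the five patterns listed above; decode_lty() is a hypothetical helper written for this illustration, not a function in R.

```r
# Hypothetical helper: split a pattern string into on/off segment lengths
decode_lty <- function(pattern) {
  strtoi(strsplit(pattern, "")[[1]], base = 16L)
}

decode_lty("3313")   # lengths: on, off, on, off

# Draw each custom pattern as a labelled horizontal line
patterns <- c("44", "13", "1343", "73", "2262")
plot(0, 0, type = "n", xlim = c(0, 1.3), ylim = c(0.5, 5.5),
     axes = FALSE, xlab = "", ylab = "", main = "Custom line types")
for (i in seq_along(patterns)) {
  segments(0, i, 1, i, lty = patterns[i], lwd = 2)
  text(1.02, i, patterns[i], pos = 4)
}
```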

ggplot()


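creates a plot object, layer by layer

A plot object p with no layers cannot show any data: until at least one layer is added, there is nothing to see!

install.packages("ggplot2")  # only needed once
library(ggplot2)
# Plot object with data only
p <- ggplot(data = gm)
# Plot object with data and aesthetic mappings (replaces the p above)
p <- ggplot(data = gm,
            mapping = aes(x = gdpPercap,
                          y = lifeExp)) 
# Add a point layer so the data become visible
p + geom_point(size=2)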

Add a Layer

p + geom_point()

Add color grouping

# Add some color grouping
p <- ggplot(data = gm,
            mapping = aes(x = gdpPercap,
                          y = lifeExp, color=continent))
p + geom_point()

Add regression line

# Add a regression line, dropped the color grouping
p <- ggplot(data = gm,
            mapping = aes(x = gdpPercap,
                          y = lifeExp))
p + geom_point(pch=16) + geom_smooth(method="lm")

Change dot color?

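# color = "purple" inside aes() is treated as a data value mapped to color,
# not as the literal color, so the points do not come out purple
p <- ggplot(data = gm,
            mapping = aes(x = gdpPercap,
                          y = lifeExp,
                          color = "purple"))
p + geom_point() +
  geom_smooth(method = "loess") +
  scale_x_log10()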

Change dot color?

p <- ggplot(data = gm,
            mapping = aes(x = gdpPercap,
                          y = lifeExp))
p + geom_point(color = "purple") +
  geom_smooth(method = "loess") + scale_x_log10()
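The difference between mapping a color inside aes() and setting it outside can be checked directly. This sketch uses the built-in mtcars data (an assumption, so that it runs without gapminder) and inspects the colors each point layer will actually use.

```r
library(ggplot2)

base <- ggplot(mtcars, aes(x = wt, y = mpg))

# Mapped: "purple" is treated as a one-level categorical variable,
# so the color scale picks the color and a legend is drawn
p_mapped <- base + geom_point(aes(color = "purple"))

# Set: the literal color is applied to every point, with no legend
p_set <- base + geom_point(color = "purple")

mapped_colours <- unique(layer_data(p_mapped, 1)$colour)  # a palette color
set_colours    <- unique(layer_data(p_set, 1)$colour)     # "purple"
```

As a rule of thumb, aes() is for variables in the data, while fixed values such as a single color or size go in the geom as ordinary arguments.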

Adding labels

p <- ggplot(data = gm,
            mapping = aes(x = gdpPercap,
                          y = lifeExp))
p + geom_point(color = "purple") +
  geom_smooth(method = "loess") + scale_x_log10() +
  xlab("GDP per capita") +
  ylab("Life Expectancy")

Facetting

p <- ggplot(data = gm,
            mapping = aes(x = gdpPercap,
                          y = lifeExp))
p + geom_point(color = "purple") +
  geom_smooth(method = "loess") + scale_x_log10() +
  xlab("GDP per capita") +
  ylab("Life Expectancy") +
  facet_wrap(~ continent, nrow=2)
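For comparison, here is a small sketch of facet_wrap() next to facet_grid(), which is also part of ggplot2. It uses the built-in mtcars data (an assumption for illustration) and counts the panels each call produces.

```r
library(ggplot2)

base <- ggplot(mtcars, aes(x = wt, y = mpg)) + geom_point()

# facet_wrap(): one variable, panels wrapped into the requested rows
p_wrap <- base + facet_wrap(~ cyl, nrow = 2)

# facet_grid(): rows by one variable, columns by another
p_grid <- base + facet_grid(am ~ cyl)

# The PANEL column of the layer data records which panel each point is in
n_wrap <- length(unique(layer_data(p_wrap, 1)$PANEL))
n_grid <- length(unique(layer_data(p_grid, 1)$PANEL))
```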

Finishing touches: font

p <- ggplot(data = gm,
            mapping = aes(x = gdpPercap,
                          y = lifeExp))
p + geom_point(color = "purple") +
  geom_smooth(method = "loess") + scale_x_log10() +
  xlab("GDP per capita") +
  ylab("Life Expectancy") +
  facet_wrap(~ continent, nrow=2) +
  theme(text=element_text(size=14, family="Palatino")) 

Finishing touches: theme

p <- ggplot(data = gm,
            mapping = aes(x = gdpPercap,
                          y = lifeExp))
p + geom_point(color = "purple") +
  geom_smooth(method = "loess") + scale_x_log10() +
  xlab("GDP per capita") +
  ylab("Life Expectancy") +
  facet_wrap(~ continent, nrow=2) +
  theme(text=element_text(size=14, family="Palatino")) + 
  theme_bw()

What happened to the font?

Finishing touches: theme

p <- ggplot(data = gm,
            mapping = aes(x = gdpPercap,
                          y = lifeExp))
p + geom_point(color = "purple") +
  geom_smooth(method = "loess") + scale_x_log10() +
  xlab("GDP per capita") +
  ylab("Life Expectancy") +
  facet_wrap(~ continent, nrow=2)  + 
  theme_bw() + 
  theme(text=element_text(size=14, family="Palatino")) 

What is different?
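One way to explore the two questions above is to add the same theme objects in both orders. The sketch below uses font size only (an assumption, so that no particular font such as Palatino has to be installed). theme_bw() is a complete theme, so it replaces any theme() settings that were added before it.

```r
library(ggplot2)

# theme() first, theme_bw() second: the complete theme replaces the size
font_lost <- theme(text = element_text(size = 14)) + theme_bw()

# theme_bw() first, theme() second: the size modifies the complete theme
font_kept <- theme_bw() + theme(text = element_text(size = 14))

size_lost <- font_lost$text$size   # the default base size of theme_bw()
size_kept <- font_kept$text$size   # 14
```

The same reasoning applies when the themes are added to a plot: settings made with theme() are best placed after any complete theme such as theme_bw().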

ggplot2 multi-plot layout

Source: Baptiste Auguie. 2017. Laying out multiple plots on a page (https://cran.r-project.org/web/packages/egg/vignettes/Ecosystem.html)