Vít Gabrhel
vit.gabrhel@mail.muni.cz
FSS MU,
30. 10. 2017
Graphing and figures – general principles
Guidelines for different kinds of graphs and publications:
Journal papers (APA guidelines)
Quick plots using qplot (plus, a review on saving graphs)
More flexible graphs using ggplot
Final formatting using ggplot
Graphics functions in the base package:
Specialised packages:
Specialised functions inside other packages:
General principles: Graphs and figures should...
1. Summarise and/or reveal data, making large datasets coherent
2. Encourage the viewer to think about the data being presented
3. Avoid distorting the data
4. Encourage the viewer to compare different pieces of data
A graph that satisfies all four criteria despite its complexity
A graph that satisfies all four criteria despite its simplicity
All graphs:
Bar graphs:
Histograms:
Graphs involving points (and possibly lines):
Scatter plots:
Think about:
Publication type | Guidelines |
---|---|
Journal articles | • Follow APA guidelines, e.g., page 4 in this style manual (https://intranet.ecu.edu.au/__data/assets/pdf_file/0010/20611/APAstyle.pdf). |
Specialist (colour) articles | • Follow APA guidelines as much as possible, but use colour if an additional explanatory tool is needed. • Colours: think about whether they should be vibrant or neutral, given (a) your topic, and (b) whether you are plotting points or bars (see previous slide). |
Conference presentations | • No caption available, so more labelling in the title or inside the graph is needed. • Figures can • Same advice as for specialist articles regarding colour. • For small sample sizes, you might not even need an x-axis and y-axis; you can instead label points of interest directly. |
The qplot function is in the ggplot2 package.
The function is very useful for data exploration, as it is possible to draw fairly complex plots with one or two lines of code.
The function is not useful for final plots for presentations and publications because the overall appearance of the plots is difficult to change.
Basic principle: "geoms" (representations of data) have "aesthetics" (properties) that can be "mapped" to variables in the dataset or "set" to a desired value.
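The difference between mapping and setting can be sketched as follows. This is only an illustration: it uses the built-in mtcars data rather than the course data, and the object names p_mapped and p_set are made up for the example.

```r
library(ggplot2)

# Mapping: colour is tied to a variable inside aes(),
# so each level gets its own colour and a legend is drawn.
p_mapped <- ggplot(mtcars, aes(x = hp, y = mpg)) +
  geom_point(aes(colour = factor(cyl)))

# Setting: colour is a fixed value outside aes(),
# so all points are blue and no legend is drawn.
p_set <- ggplot(mtcars, aes(x = hp, y = mpg)) +
  geom_point(colour = "blue")

p_mapped
p_set
```

In qplot the same distinction is made with I(), e.g. colour = I("blue") sets a value, while colour = cyl maps a variable.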
ggplot
install.packages("ggplot2")
library("ggplot2")
data
# setwd() needs the path of the folder that contains NPAS.csv, e.g. setwd("path/to/folder")
NPAS = read.csv2("NPAS.csv", header = TRUE)
NPAS_Clean <- na.omit(NPAS)
option | description |
---|---|
alpha | Alpha transparency for overlapping elements expressed as a fraction between 0 (complete transparency) and 1 (complete opacity) |
color, shape, size, fill | Associates the levels of variable with symbol color, shape, or size. For line plots, color associates levels of a variable with line color. For density and box plots, fill associates fill colors with a variable. Legends are drawn automatically. |
data | Specifies a data frame |
facets | Creates a trellis graph by specifying conditioning variables. Its value is expressed as rowvar ~ colvar. To create trellis graphs based on a single conditioning variable, use rowvar ~ . or . ~ colvar. |
geom | Specifies the geometric objects that define the graph type. The geom option is expressed as a character vector with one or more entries. geom values include "point", "smooth", "boxplot", "line", "histogram", "density", "bar", and "jitter". |
main, sub | Character vectors specifying the title and subtitle |
method, formula | If geom="smooth", a loess fit line and confidence limits are added by default. When the number of observations is greater than 1,000, a more efficient smoothing algorithm is employed. Methods include "lm" for regression, "gam" for generalized additive models, and "rlm" for robust regression. The formula parameter gives the form of the fit. For example, to add simple linear regression lines, you'd specify geom="smooth", method="lm", formula=y~x. Changing the formula to y~poly(x,2) would produce a quadratic fit. Note that the formula uses the letters x and y, not the names of the variables. For method="gam", be sure to load the mgcv package. For method="rlm", load the MASS package. |
x, y | Specifies the variables placed on the horizontal and vertical axis. For univariate plots (for example, histograms), omit y |
xlab, ylab | Character vectors specifying horizontal and vertical axis labels |
xlim,ylim | Two-element numeric vectors giving the minimum and maximum values for the horizontal and vertical axes, respectively |
qplot(x, y, data=, color=, shape=, size=, alpha=, geom=, method=, formula=, facets=, xlim=, ylim=, xlab=, ylab=, main=, sub=)
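As a sketch of the method and formula options from the table, the example below overlays a straight-line fit and a quadratic fit. It is written with ggplot() and geom_smooth() from the same package rather than qplot(), and it assumes the built-in mtcars data; p_fit is just an illustrative name.

```r
library(ggplot2)

p_fit <- ggplot(mtcars, aes(x = hp, y = mpg)) +
  geom_point() +
  # simple linear regression line; the formula uses x and y, not hp and mpg
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE, colour = "red") +
  # quadratic fit
  geom_smooth(method = "lm", formula = y ~ poly(x, 2), se = FALSE, colour = "blue")

p_fit
```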
# Data
qplot(data = NPAS, x = urban, geom = "bar") # bar chart with categories on x axis
# Convert the variable "urban" to a factor
NPAS_Clean$urbanFACTOR = NPAS_Clean$urban
NPAS_Clean$urbanFACTOR <- factor(NPAS_Clean$urbanFACTOR,levels=c(1,2,3),
labels=c("Rural","Suburban","Urban"))
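What factor() does with levels and labels can be checked on a small vector. The values below are invented for illustration and are not taken from the NPAS data.

```r
# Hypothetical numeric codes, including one missing value
codes <- c(1, 3, 2, 2, NA, 3)

# levels = the codes as stored, labels = the names shown in plots and tables
settlement <- factor(codes, levels = c(1, 2, 3),
                     labels = c("Rural", "Suburban", "Urban"))

# Count the cases per category, keeping the missing value visible
table(settlement, useNA = "ifany")
```

Any code that is not listed in levels would become NA, so it is worth comparing the table with the raw codes after the conversion.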
# Simple bar plot
qplot(data = NPAS_Clean, x = urbanFACTOR, color = urbanFACTOR, fill = urbanFACTOR, geom = c("bar"), alpha=I(1), main="Respondents by type of settlement", xlab="Type of settlement", ylab="Number of respondents")
# Simple histogram
qplot(NPAS_Clean$age, geom="histogram")
# Determine the range
min(NPAS_Clean$age)
max(NPAS_Clean$age)
qplot(NPAS_Clean$age, geom="histogram", xlim = c(14, 90))
# Width of the individual histogram bins
qplot(NPAS_Clean$age, geom="histogram", binwidth = 1, xlim = c(0, 100))
qplot(NPAS_Clean$age,
geom="histogram",
binwidth = 1,
main = "Histogram of age",
xlab = "Age",
ylab = "Count",
fill=I("blue"),
col=I("red"),
alpha=I(.5),
xlim=c(14,90))
# Nerdiness index
NPAS_Clean$NerdyPersona = rowSums(NPAS_Clean[, 1:26])
# Simple scatterplot
qplot(age, NerdyPersona, data = NPAS_Clean, geom = c("point"))
# Points coloured by a selected variable
qplot(age, NerdyPersona, data = NPAS_Clean, colour = urbanFACTOR, geom = c("point"))
# Scatterplot with a fitted smooth curve
qplot(age, NerdyPersona, data = NPAS_Clean, geom = c("point", "smooth"))
# Scatterplot with a fitted smooth curve for each level of a selected variable
qplot(age, NerdyPersona, data = NPAS_Clean, geom = c("point", "smooth"), colour = urbanFACTOR)
# Line plot connecting the points, by a selected variable
qplot(age, NerdyPersona, data = NPAS_Clean, colour = urbanFACTOR, geom = "line")
# Simple boxplot with a grouping variable of class "integer"
qplot(gender, NerdyPersona, data=NPAS_Clean, geom="boxplot")
# Convert the variable "gender" to a factor
NPAS_Clean$genderFACTOR = NPAS_Clean$gender
NPAS_Clean$genderFACTOR <- factor(NPAS_Clean$genderFACTOR,levels=c(1,2,3),
labels=c("Male","Female","Other"))
# Simple boxplot
qplot(genderFACTOR, NerdyPersona, data=NPAS_Clean, geom="boxplot")
# Simple boxplot with a legend for the grouping variable
qplot(genderFACTOR, NerdyPersona, fill=genderFACTOR, data=NPAS_Clean, geom="boxplot")
In the Plots tab, Export -> Save Plot As Image... Then choose Image Format and Size
By default, graphs are saved to your working directory, but you can choose any folder by clicking “Directory” after clicking “Save Plot As Image”.
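Plots can also be saved from code with ggsave() from ggplot2, which makes the size and format reproducible. The sketch below is an illustration under assumptions: it uses mtcars, an invented file name, and tempdir() so that nothing is written into your project folder.

```r
library(ggplot2)

p <- ggplot(mtcars, aes(x = hp, y = mpg)) + geom_point()

# Hypothetical output path; replace tempdir() with your own folder
out_file <- file.path(tempdir(), "scatter_hp_mpg.pdf")

# The file extension selects the format; width and height are in the given units
ggsave(filename = out_file, plot = p, width = 12, height = 8, units = "cm")

file.exists(out_file)
```

Changing the extension (for example to .png) switches the output format, provided the corresponding graphics device is available on your machine.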
The ggplot function is also in the ggplot2 package.
Key concepts, apart from the already mentioned geoms, aesthetics, position, facet, setting and mapping:
We covered: overlaying of histograms and regression lines, adding error bars to line plots and bar plots, setting axis limits (coord_cartesian), faceting, and free scales.
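Two of these points, axis limits via coord_cartesian and free scales in facets, can be sketched as below. The example assumes the built-in mtcars data, and the object names are illustrative only.

```r
library(ggplot2)

base_plot <- ggplot(mtcars, aes(x = hp, y = mpg)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x)

# coord_cartesian() only zooms: all rows still enter the regression fit
p_zoom <- base_plot + coord_cartesian(xlim = c(50, 200))

# Scale limits drop rows outside 50-200 hp (with a warning) before fitting,
# so the fitted line can differ from the zoomed version
p_clip <- base_plot + scale_x_continuous(limits = c(50, 200))

# Free scales: each facet gets its own x and y axis range
p_free <- base_plot + facet_wrap(~ cyl, scales = "free")

p_zoom
p_free
```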
Aspect | Adjusted through | Terms to look for |
---|---|---|
Overall colour scheme: • black and white? • settings for colours | theme_set, scale_colour_manual | theme_bw(), scale_colour_hue(), scale_colour_grey |
Appearance of points | geom_point, geom_params | |
Appearance of error bars | stat_summary (in our script). Other possibilities: geom_errorbar, geom_params | colour = "gray41" |
Appearance of lines | scale_linetype_manual | scale_linetype_manual(values=c("dotted", "solid", "longdash", "dotdash")) |
Gridlines | theme(panel.grid.major = element_line( )), theme(panel.grid.minor = element_line( )) | panel.grid.major = element_line(colour = "gray41", size = 1), panel.grid.minor.y = element_blank() |
Aspect | Adjusted through | Terms to look for |
---|---|---|
Labels along the axes | scale_x_continuous (when x is not a factor variable), scale_x_discrete, scale_y_continuous, scale_y_discrete | scale_x_continuous(breaks = 1:2, labels = c("Trials 1-24", "Trials 25-48"), name = "Time period"), scale_y_continuous(name = "") |
Facet labels | theme(strip.text = ___); changes to the names of factor levels | strip.text.y = element_text(size=14, face = "bold"); levels(longsub2$Measure) <- c("Kick Dir Entropy", "No. of Player\nChanges") |
Text size | theme( ___ = element_text( )) | theme(axis.title.x = element_text(size= 20), axis.text.y = element_text(size=14, colour = "black")) |
Legend | scale_linetype_manual, scale_colour_hue, etc., depending on which aesthetic (colour, linetype, shape) you have mapped the variable to; theme(legend.text = ___) | scale_linetype_manual(values=c("dotted", "solid", "longdash", "dotdash"), name="Success Slope", breaks=c("Descending", "U-shaped", "Ascending", "Flat"), labels=c("Desc.", "U-shaped", "Ascending", "Flat")); legend.text = element_text(size=12) |
Data + formula
mtc <- mtcars
ggplot(mtc, aes(x = factor(gear))) + geom_bar(stat = "count")
Aggregate data for barplot
summary.mtc <- data.frame(
gear=levels(as.factor(mtc$gear)),
meanwt=tapply(mtc$wt, mtc$gear, mean))
summary.mtc
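Once the means are aggregated, they can be plotted directly, because each row already holds the bar height. The sketch below rebuilds the summary table so that it runs on its own; the name summary_mtc and the axis labels are choices made for this example.

```r
library(ggplot2)

# One row per number of gears, with the mean weight of the cars in that group
summary_mtc <- data.frame(
  gear   = levels(factor(mtcars$gear)),
  meanwt = as.numeric(tapply(mtcars$wt, mtcars$gear, mean))
)

# geom_col() draws bars whose heights are the values in the data;
# it is equivalent to geom_bar(stat = "identity")
p_means <- ggplot(summary_mtc, aes(x = gear, y = meanwt)) +
  geom_col() +
  labs(x = "Number of gears", y = "Mean weight (1000 lbs)")

p_means
```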
Horizontal bars, colors, width of bars
#1. horizontal bars (fun replaces fun.y, which is deprecated since ggplot2 3.3.0)
p1<-ggplot(mtc,aes(x=factor(gear),y=wt)) + stat_summary(fun=mean,geom="bar") +
coord_flip()
p1
#2. change colors of bars
p2<-ggplot(mtc,aes(x=factor(gear),y=wt,fill=factor(gear))) + stat_summary(fun=mean,geom="bar") +
scale_fill_manual(values=c("purple", "blue", "darkgreen"))
p2
#3. change width of bars (width is a fixed setting, so it goes outside aes())
p3<-ggplot(mtc,aes(x=factor(gear),y=wt)) + stat_summary(fun=mean,geom="bar", width=0.5)
p3
Split and color by another variable
#1. next to each other (color must be mapped inside aes() to take effect)
p1<-ggplot(mtc,aes(x=factor(gear),y=wt,fill=factor(vs),color=factor(vs))) +
stat_summary(fun=mean,position=position_dodge(),geom="bar")
p1
#2. stacked
p2<-ggplot(mtc,aes(x=factor(gear),y=wt,fill=factor(vs),color=factor(vs))) +
stat_summary(fun=mean,position="stack",geom="bar")
p2
#3. with facets
p3<-ggplot(mtc,aes(x=factor(gear),y=wt,fill=factor(vs),color=factor(vs))) +
stat_summary(fun=mean, geom="bar") +
facet_wrap(~vs)
p3
Add text to the bars, label axes, and label legend
ag.mtc<-aggregate(mtc$wt, by=list(mtc$gear,mtc$vs), FUN=mean)
colnames(ag.mtc)<-c("gear","vs","meanwt")
ag.mtc
g1<-ggplot(ag.mtc, aes(x = factor(gear), y = meanwt, fill=factor(vs),color=factor(vs))) +
geom_bar(stat = "identity", position=position_dodge()) +
geom_text(aes(y=meanwt, label=meanwt),position= position_dodge(width=0.9), vjust=-.5)
g1
#2. fixing the yaxis problem, changing the color of text, legend labels, and rounding to 2 decimals
g2<-ggplot(ag.mtc, aes(x = factor(gear), y = meanwt, fill=factor(vs))) +
geom_bar(stat = "identity", position=position_dodge()) +
geom_text(aes(y=meanwt, label=round(meanwt,2)), position= position_dodge(width=0.9), vjust=-.5, color="black") +
scale_y_continuous("Mean Weight",limits=c(0,4.5),breaks=seq(0, 4.5, .5)) +
scale_x_discrete("Number of Gears") +
scale_fill_discrete(name ="Engine", labels=c("V-engine", "Straight engine"))
g2
Add error bars
summary.mtc2 <- data.frame(
gear=levels(as.factor(mtc$gear)),
meanwt=tapply(mtc$wt, mtc$gear, mean),
sd=tapply(mtc$wt, mtc$gear, sd))
summary.mtc2
ggplot(summary.mtc2, aes(x = factor(gear), y = meanwt)) +
geom_bar(stat = "identity", position="dodge", fill="lightblue") +
geom_errorbar(aes(ymin=meanwt-sd, ymax=meanwt+sd), width=.3, color="darkblue")
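Error bars can also be computed inside the plot with stat_summary, without building a summary table first. The sketch below is one possible variant: it uses mean_se from ggplot2, so the bars show the mean plus and minus one standard error rather than one standard deviation as above. The name p_se is illustrative.

```r
library(ggplot2)

p_se <- ggplot(mtcars, aes(x = factor(gear), y = wt)) +
  # mean_se() returns y (mean), ymin and ymax (mean -/+ 1 standard error)
  stat_summary(fun.data = mean_se, geom = "errorbar",
               width = 0.3, colour = "darkblue") +
  # the group means as points on top of the error bars
  stat_summary(fun.data = mean_se, geom = "point", size = 3) +
  labs(x = "Number of gears", y = "Mean weight")

p_se
```

Whether standard deviations, standard errors, or confidence intervals are appropriate depends on what the figure is meant to convey, so the caption should say which one is shown.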
# Histogram with custom limits on the X and Y axes
ggplot(data=NPAS_Clean, aes(x = age)) +
geom_histogram(breaks=seq(14, 100, by = 2),
col="red",
fill="green",
alpha = .2) +
labs(title="Histogram for Age") +
labs(x="Age", y="Count") +
xlim(c(14,90)) +
ylim(c(0,250))
# Colour as intensity (fill mapped to the bin count)
ggplot(data=NPAS_Clean, aes(x = age)) +
geom_histogram(breaks=seq(14, 100, by = 2),
col="red",
aes(fill=..count..))
# Two colours to express intensity
ggplot(data=NPAS_Clean, aes(x = age)) +
geom_histogram(breaks=seq(14, 100, by = 2),
col="red",
aes(fill=..count..)) +
scale_fill_gradient("Count", low = "green", high = "red")
# Adjusting the background:
ggplot(data=NPAS_Clean, aes(x = age)) +
geom_histogram(breaks=seq(14, 100, by = 2),
col="red",
aes(fill=..count..)) +
scale_fill_gradient("Count", low = "green", high = "red") +
theme(plot.background = element_blank(),
panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
panel.border = element_blank(),
panel.background = element_blank(),
axis.line = element_line(size=.4))
# You can add a density curve to your histogram by adding geom_density to your code:
ggplot(data=NPAS_Clean, aes(x = age)) +
geom_histogram(aes(y =..density..),
breaks=seq(14, 100, by = 2),
col="red",
fill="green",
alpha = .2) +
geom_density(col=2) +
labs(title="Histogram for Age") +
labs(x="Age", y="Density")
mtc <- mtcars
# Basic scatterplot
p1 <- ggplot(mtc, aes(x = hp, y = mpg))
# Print plot with default points
p1 + geom_point()
Change color of points
p2 <- p1 + geom_point(color="red") #set one color for all points
p2
p3 <- p1 + geom_point(aes(color = wt)) #set color scale by a continuous variable
p3
p4 <- p1 + geom_point(aes(color=factor(am))) #set color scale by a factor variable
p4
Change default colors in color scale
p1 + geom_point(aes(color=factor(am))) + scale_color_manual(values = c("orange", "purple"))
Change shape or size of points
p2 <- p1 + geom_point(size = 5) #increase all points to size 5
p2
p3 <- p1 + geom_point(aes(size = wt)) #set point size by continuous variable
p3
p4 <- p1 + geom_point(aes(shape = factor(am))) #set point shape by factor variable
p4
#change the default shapes
p1 + geom_point(aes(shape = factor(am))) + scale_shape_manual(values=c(0,2))
Add lines to scatterplot
#connect points with line
p2 <- p1 + geom_point(color="blue") + geom_line()
p2
#add regression line
p3 <- p1 + geom_point(color="red") + geom_smooth(method = "lm", se = TRUE)
p3
#add vertical line
p4 <- p1 + geom_point() + geom_vline(xintercept = 100, color="red")
p4
Change axis labels
#label all axes at once
p2 <- ggplot(mtc, aes(x = hp, y = mpg)) + geom_point()
p3 <- p2 + labs(x="Horsepower",
y = "Miles per Gallon")
p3
#label and change font size
p4 <- p2 + theme(axis.title.x = element_text(face="bold", size=20)) +
labs(x="Horsepower")
p4
#adjust axis limits and breaks
p5 <- p2 + scale_x_continuous("Horsepower",
limits=c(0,400),
breaks=seq(0, 400, 50))
p5
Change legend options
g1<-ggplot(mtc, aes(x = hp, y = mpg)) + geom_point(aes(color=factor(vs)))
#move legend inside
g2 <- g1 + theme(legend.position=c(1,1),legend.justification=c(1,1))
g2
#move legend bottom
g3 <- g1 + theme(legend.position = "bottom")
g3
#change labels
g4 <- g1 + scale_color_discrete(name ="Engine",
labels=c("V-engine", "Straight engine"))
g4
Change legend options
If we had changed the shape of the points, we would use scale_shape_discrete() with the same options. We can also remove the legend altogether by using theme(legend.position="none").
g5<-ggplot(mtc, aes(x = hp, y = mpg)) + geom_point(size=2, aes(color = wt))
g5 + scale_color_continuous(name="Weight", #name of legend
breaks = with(mtc, c(min(wt), mean(wt), max(wt))), #choose breaks of variable
labels = c("Light", "Medium", "Heavy"), #label
low = "pink", #color of lowest value
high = "red") #color of highest value
Change background color and style
g2<- ggplot(mtc, aes(x = hp, y = mpg)) + geom_point()
g2
#Completely clear all lines except axis lines and make background white
t1<-theme(
plot.background = element_blank(),
panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
panel.border = element_blank(),
panel.background = element_blank(),
axis.line = element_line(size=.4))
#Use theme to change axis label style
t2<-theme(
axis.title.x = element_text(face="bold", color="black", size=10),
axis.title.y = element_text(face="bold", color="black", size=10),
plot.title = element_text(face="bold", color = "black", size=12))
g3 <- g2 + t1
g3
g4 <- g2 + theme_bw()
g4
g5 <- g2 + theme_bw() + t2 + labs(x="Horsepower", y = "Miles per Gallon", title= "MPG vs Horsepower")
g5
Change background color and style
g2<- ggplot(mtc, aes(x = hp, y = mpg)) +
geom_point(size=2, aes(color=factor(vs), shape=factor(vs))) +
geom_smooth(aes(color=factor(vs)),method = "lm", se = TRUE) +
scale_color_manual(name ="Engine",
labels=c("V-engine", "Straight engine"),
values=c("red","blue")) +
scale_shape_manual(name ="Engine",
labels=c("V-engine", "Straight engine"),
values=c(0,2)) +
theme_bw() +
theme(
axis.title.x = element_text(face="bold", color="black", size=12),
axis.title.y = element_text(face="bold", color="black", size=12),
plot.title = element_text(face="bold", color = "black", size=12),
legend.position=c(1,1),
legend.justification=c(1,1)) +
labs(x="Horsepower",
y = "Miles per Gallon",
title= "Linear Regression (95% CI) of MPG vs Horsepower by Engine type")
g2
# Plain boxplot
bp1 <- ggplot(NPAS_Clean, aes(urbanFACTOR, NerdyPersona))
bp1 + geom_boxplot()
# Colour by variable (no legend)
bp1 + geom_boxplot(aes(color=urbanFACTOR)) + theme(legend.position='none')
bp1 + geom_boxplot(aes(fill=urbanFACTOR), alpha=I(0.5)) + theme(legend.position='none')
# Adjusting how the points are laid out
bp1 + geom_boxplot(aes(fill=urbanFACTOR), alpha=I(0.5)) +
geom_point(aes(color=urbanFACTOR), size=3) +
theme(legend.position='none')
bp1 + geom_boxplot(aes(fill=urbanFACTOR), alpha=I(0.5)) +
geom_point(position="jitter", alpha=0.5) +
geom_boxplot(outlier.size=0, alpha=0.5)
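When raw points are drawn over a boxplot, outliers appear twice: once from the boxplot and once from the point layer. A minimal sketch of one way to avoid this, assuming the built-in mtcars data and an illustrative object name:

```r
library(ggplot2)

p_box <- ggplot(mtcars, aes(x = factor(cyl), y = mpg)) +
  # outlier.shape = NA hides the outlier symbols drawn by the boxplot itself
  geom_boxplot(outlier.shape = NA) +
  # jitter only horizontally so the y values stay exact
  geom_jitter(width = 0.15, height = 0, alpha = 0.6)

p_box
```

Setting height = 0 matters here, because the default jitter also moves points vertically and would misrepresent the measured values.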
# To use the install_github function, you also need to have the devtools library installed and loaded!
# install.packages("devtools")
library(devtools)
install_github("ndphillips/yarrr")
library("yarrr")
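pirateplot(NerdyPersona ~ urbanFACTOR, data = NPAS_Clean, main = "Nerdiness index by type of settlement")
The band around each group mean is an inference interval. For each interval, we can state that there is a 95% probability that the true population mean falls within that interval. This reading holds for a Bayesian highest density interval (HDI), which pirateplot draws by default; it would not be a correct reading of a frequentist confidence interval.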
# 1 IV
pirateplot(NerdyPersona ~ urbanFACTOR, data = NPAS_Clean,
main = "Nerdiness index by type of settlement",
pal = "southpark",
theme = 2,
point.o = 1, # Add points
point.col = "black",
point.bg = "purple",
point.pch = 21,
bean.f.o = 0.2, # Turn down bean filling
inf.f.o = 1, # Turn up inf filling
gl.col = "gray", # gridlines
gl.lwd = c(.5, 0)) # turn off minor grid lines
# 2 IV
pirateplot(formula = NerdyPersona ~ gender + urbanFACTOR,
data = NPAS_Clean,
main = "Nerdiness index by type of settlement and gender",
point.pch = 1, # Point specifications...
point.col = "black",
point.o = .7,
inf.f.o = .9, # inference band opacity
gl.col = "gray")
Wickham, H. (2009). ggplot2: Elegant Graphics for Data Analysis. Available online: http://moderngraphics11.pbworks.com/f/ggplot2-Book09hWickham.pdf
Field, A., Miles, J., & Field, Z. (2012). Discovering Statistics Using R. Sage: UK. Chapter 4. Exploring data with graphs.
Support website: http://docs.ggplot2.org/current/