Introduction to Software: R

“Introduction to Social Network Analysis Methods and Applications”

International Laboratory for Applied Network Research

http://goo.gl/yJU6N0

Benjamin Lind, PhD

Why Program?

Flexibility

Data

Convert formats
Customized cleaning

New methods

Slow GUI development
Open source community
Can write your own functions

Replication

Personal record keeping
Collaboration
Transparency

Awareness

About R

Implements the S programming language

Open source

Object oriented programming language

Base & user-contributed packages

Meets practically all statistical purposes

Availability: r-project.org

Why Program in R?

Open source
Community

Large and growing
Made by statisticians
Wealth of packages (i.e., libraries)

Features

Documentation
Moderately easy language
Graphics

Why Social Networks in R?

Shared platform

Networks
Statistics
Data management
Graphics

Active network software development

Network metrics

igraph
sna
tnet

Network modeling

statnet
RSiena

GeneRal Introduction

Start R!

Type "R" into your terminal

Basics

Command line-only interface

Reliable, Replicable, Records

GUI options do exist, but...

Outside of scope for our purposes

Assignment. Functionally equivalent methods
```
x <- 1 #Preference: primary method 
```
```
x = 1 #Preference: within functions    
```
```
1 -> x #Preference: never    
```
Once assigned, x becomes an "object"
```
ls()    
```

Case Sensitivity

Try

Cheburashka <- "a cartoon character"Cheburashka
cheburashka

What does the error mean?

Hard Drive Navigation

R runs within a folder on your hard drive

Location where R reads & writes

Which folder are you in now?

orig.folder <- getwd()orig.folder
dir.create("new.folder")
list.files()
setwd("new.folder")list.files()
setwd(orig.folder)
file.remove("new.folder")list.files()

Packages

Preloaded (e.g., base, stats)
Dependencies
User-contributed

Social network analysis

statnet
igraph
tnet
RSiena
...

Graphics
Advanced regression models
Web scraping
Spatial
...

Packages

To install a package

 install.packages("igraph")

To load a package

 library(igraph)

To unload a package

 detach("package:igraph", unload=TRUE)

Missing Data

Typically coded as "NA". For example

x <- NA

Missing codes

NA # Non-applicable, missing
NULL # Undefined or empty value
NaN # Not a Number
Inf # Infinite

Indicated by functions

is.na(x)
is.null(x)
is.nan(x)
is.infinite(x)

Data Types

Storing and accessing information within R

Picture by KENPEI

Data Types

Elements

Numeric
Character

Use double quotes

Logical

TRUE, FALSE , T , F , 1 , 0

Objects

Vectors
Matrices
Arrays
Data frames
Lists
Closures

Elements

Identification

is.numeric(42)
is.numeric("42")
is.numeric(3.12)
is.logical(T)
is.logical(FALSE)
is.character(FALSE)
is.character("42")
is.character("HSE Network Workshop")

Conversion

as.numeric("42") + 1
as.character(as.numeric("0") + 1)
as.character(!as.logical("1"))
as.character(as.logical(as.numeric("0") + 1))
as.character(is.logical(as.logical(as.numeric("0")) + 1))
as.character(!as.logical(as.numeric("1")))

Object Types: Vectors

Vectors collect elements in one dimension

(Photo by Frank C. Müller)

Object Types: Vectors

Vectors collect elements in one dimension

RU.cit <- c("Mos", "StP", "Nov", "Yek") # Create vector
RU.cit <- c(RU.cit, "Niz") # Add element
RU.cit[3] # Retrieve element 3
RU.pops <- c(10.4, 4.6, 1.4, 1.3) # Create vector
names(RU.pops) <- RU.cit[c(1, 2, 3, 4)] # Name vector
RU.pops[c(4,2,3,1)] # Permute the order and retrieve
length(RU.pops) # Retrieve the length
RU.vis <- c(TRUE, TRUE, FALSE, FALSE) #Create vector
RU.vis[-c(4, 2)] # Remove elements 4 and 2

Remaining object types build upon this logic

Object Types: Vectors

Sequences

all(1:5 == c(1, 2, 3, 4, 5))
all(1:5 == seq(from = 1, to = 5, by = 1))

Repetitions

rep(1, times = 5)
rep(NA, times = 4)
rep("Beetlejuice", times = 3)

Object Types: Matrices and Arrays

(Picture by David.Asch)

Object Types: Matrices & Arrays

Matrices

Two dimensions

nrow(); ncol()

Assignment

x[1, 5] <- 1; x[2, ] <- x[1, 5] > 0

More columns, rows

cbind(); rbind()

Names: rownames(); colnames()

Arrays

Three dimensions

dim(); length(x[1, 1, ])

Assignment

x[3, 2, 6] <- 1; x[4, , 2] <- NA

Names: dimnames()

Object Types: Matrices

Turning vectors into matrices

RU.mat <- cbind(pop = RU.pops, vis = RU.vis)
rownames(RU.mat) <- RU.cit[c(1:4)]
colnames(RU.mat)

Alternatively

RU.mat <- matrix( , nrow = length(RU.pops), ncol = 2)
colnames(RU.mat) <- c("pop", "vis")
rownames(RU.mat) <- names(RU.pops)
RU.mat[ , "pop"] <- RU.pops
RU.mat[ , "vis"] <- RU.vis

Try

RU.mat[1, ]RU.mat[ , "pop"]
RU.mat[1, 2]RU.mat[1, , ] # What does the error mean?

Object Types: Data Frames

Data frames are like matrices, but allow multiple element types. Compare:

RU.mat[ , "vis"]
RU.vis

To retain initial element type

RU.df <- as.data.frame(RU.mat)
RU.df$vis <- as.logical(RU.df$vis)

These commands are equivalent statements

RU.df$vis[2]
RU.df[2, "vis"]
RU.df[["vis"]][2]
RU.df[[2]][2]

Object Types: Lists

Lists collect objects of (potentially) varying types & dimensions

RU.list <- list(RU.mat, RU.vis, RU.df)
names(RU.list) <- c("Mat", "Vec", "DF")

These commands are equivalent statements

RU.list$DF$vis[2]
RU.list[["DF"]]$vis[2]
RU.list[[3]]$vis[2]
RU.list[["DF"]][["vis"]][2]
RU.list[["DF"]][2, "vis"]
RU.list[["DF"]][2, 2]

Object Types: Closures

Closures are functions

Functions accept input (optional) & return output

(Photo by erix!)

Object Types: Closures

Closures are functions

Functions accept input (optional) & return output

xTimesxMinus1 <- function(x)
  return(x * (x-1))
xTimesxMinus1(42)

InvHypSineTrans <- function(x){
  x.ret <- log(x + sqrt(x^2 + 1))
  return(x.ret)
  }

plot(x = 0:1000, InvHypSineTrans(0:1000), main = "Hyperbolic Sine Transformation", xlab = "x", ylab = "y")

Compiled functions

library(compiler)InvHypSineTrans.cmp <- cmpfun(InvHypSineTrans)system.time(InvHypSineTrans(0:10000000))system.time(InvHypSineTrans.cmp(0:10000000))

Help

How to overcome the inevitable hurdles

Picture from Richard Milnes

Help

If you're looking for a function, but don't know the name

??"Chi Squared"
help.search("Chi Squared")

Once you know the name of the function

?chisq.test

Want to know how the function is written?

chisq.test

Still stumped?

Google: <expression to search> ~rstats
http://rseek.org/
http://stackoverflow.com/questions/tagged/r

Help

Exercise

Find the randomized normal distribution function
Use it to create two vectors of length 100

One with mean of zero & standard deviation of one

Assign it x

A second with mean of 50 & standard deviation of 25

Assign it y

Find the linear regression model command
Regress y on x

Apply Family

(Photo by · · · — — — · · ·)

Apply Family

Apply administers a function across elements within an object

Better than for() loops in R

lab.seq <- seq(from = -100, to = 100, by = 20)
?sapply
lab.mat <- sapply(lab.seq, rnorm, n = 100)
dim(lab.mat)
fix(lab.mat)
?apply
lab.means <- apply(lab.mat, 2, mean)
plot(x = lab.seq, y = lab.means)

Apply Family: Exercise

Recreate the last exercise, but with the following modifications

Use a sequence from 1 to 100 by 5
Run the sapply() line, but...

Set the mean to zero
Apply the sequence to rnorm()'s standard deviation
Run the apply() line, but...

Retrieve the standard deviation instead of the mean

Plot your results and show them to me

Loading and Saving

Reusing material between R sessions

(Photo by mRio)

Loading & Saving: Source

Text files for executable R code

Like 'do files' for Stata

Write them in text editors, following R syntax

In text editor, type

test.val <- "It works!"
test.val2 <- paste(test.val, "Really")

Save as "test.val.source.txt" in working directory

In R, enter

source("test.val.source.txt")
ls()
print(test.val)
print(test.val2)

Loading & Saving

History

RHistory files retain a log of your session
May be opened in a text editor
Should end in ".RHistory" extention

Usage

savehistory("Filename.RHistory")
loadhistory("Filename.RHistory")

Images

Saves every object in the environment (i.e., ls())
Usage

save.image("Filename.RData")
load("Filename.RData")

dput(), dget()

Saves and retrieves individual objects

Network Analysis Packages in R

igraph

statnet

intergraph

tnet

RSiena

igraph

Originated as a fork of the sna package

Libraries C/C++, Python, and R

Created by Gábor Csárdi

Motivations and Strengths

Network science tradition
General and complex network metrics
Data storage
Speed and efficiency

Website , Wiki

statnet

Suite collecting the packages

sna
network
ergm, tergm
...and others

Interdisciplinary, cross-university development team

Motivations

General social network metrics
Data storage
Modeling and statistical inference

Website, Wiki

intergraph

statnet and igraph don't play nicely

Many identical function names
Different data formats

intergraph is a conversion tool

asIgraph()
asNetwork()

Website

tnet

Created by Tore Opsahl

Provides metrics for three types of networks

Weighted
Two-mode
Longitudinal

Motivated by the bias introduced by coercing these networks into simple graphs

Works well with igraph

Website

RSiena

Originally Windows-based, stand alone

Stochastic Actor-Oriented Models

Motivations

Longitudinal network modeling
Longitudinal behavioral network modeling
Cross-sectional network modeling

p* / ERGM

Website

Social Networks in R

Data importation

Basic Measurements

Data

R can read anything!*

Delimited text (comma, tab, bang, pipe, etc)
Excel files
Internet files

Tables from webpages
.json, .xml, .html format
Files stored online

Pajek files
SPSS and Stata files
Spatial data in shapefiles and KML

If possible, I'd recommend tab-delimited text files.

*It reads some file formats much better (and easier) than others

Social Network Data

Typical network data formats

Matrix
Edge list
Pajek

Attribute formats

Actor dataframe
Edge dataframe

Import a Tab-Delimited Edge List

Wikipedia votes for site administrators

One-mode, directed, simple

Source

Picture courtesy of Frank Schulenburg

Import a Tab-Delimited Edge List

library(igraph)

wikidat <- read.table(file = "http://pastebin.com/raw.php?i=UVmTznBj", sep = "\t", header = TRUE, skip = 6)
# 'sep' is the separator
# 'header' indicates column names
# 'skip' are the number of initial rows to ignore

wikidat <- graph.data.frame(d = wikidat, directed = TRUE)
summary(wikidat) # Nodes: 7115; Edges: 103689

Import a Pajek File

Appearances by Marvel comic book characters

Characters by comic book issue

Two-mode

Created by Rosselló, Alberich, and Miro

Info

Source

Import a Pajek File

library(igraph)

marvnet <- read.graph("http://bioinfo.uib.es/~joemiro/marvel/porgat.txt", format = "pajek")
summary(marvnet)

Import Other Formats in igraph

See

?read.graph
?graoh.data.frame
?graph.adjacency
?graph.adjlist
?graph.constructors

Basic Measurements

Density and Degree

is.simple(wikidat) # Verify it's simple
is.directed(wikidat) # Verify it's directed
vcount(wikidat) # Number of vertices
ecount(wikidat) # Number of edges
graph.density(wikidat) # Density

V(wikidat)$indegree <- degree(wikidat, mode = "in")
V(wikidat)$outdegree <- degree(wikidat, mode = "out")
V(wikidat)$totaldegree <- degree(wikidat, mode = "total")

all.vatts <- list.vertex.attributes(wikidat)[-1]
all.vatts <- sapply(all.vatts, get.vertex.attribute, graph = wikidat)
summary(all.vatts)
hist(log(V(wikidat)$outdegree), main="Wiki Admin Votes", xlab="Outdegree (log)")

Dyads and Triads

unlist(dyad.census(wikidat))reciprocity(wikidat)

Which measurement of reciprocity did that value refer to?

Edgewise?
Dyadic?
Dyadic, non-null ("ratio")?

triad.census(wikidat)
transitivity(wikidat) # What does this number refer to?
V(wikidat)$loc.trans <- transitivity(wikidat, "local")

Who has the highest clustering coefficient?

V(wikidat)$name[which.max(V(wikidat)$loc.trans)]

Paths

Warning: the commands below might take a minute or so

average.path.length(wikidat)
diameter(wikidat)
V(wikidat)$name[(get.diameter(wikidat))]
hist(shortest.paths(wikidat), main = "Histogram of Shortest Path Lengths", xlab = "Geodesic Lengths")

V(wikidat)$betw <- betweenness(wikidat)E(wikidat)$eb <- edge.betweenness(wikidat)hist(log(E(wikidat)$eb), main = "Histogram of Edge Betweenness", sub = "Wiki Admin Voting", xlab = "Edge Betweenness (log)")

Connectivity

#How many weak and strong components do we have? 
sapply(c("weak", "strong"), no.clusters, graph = wikidat)

#Notice the distributions
clusters(wikidat, mode = "weak")$csize
clusters(wikidat, mode = "strong")$csize

#Assign membership as attributesV(wikidat)$comp.w <- clusters(wikidat, mode = "weak")$membership
V(wikidat)$comp.s <- clusters(wikidat, mode = "strong")$membership# Are the vertexes part of the giant component?
V(wikidat)$giantcomp.w <- V(wikidat)$comp.w == which.max(clusters(wikidat, mode = "weak")$csize)
V(wikidat)$giantcomp.s <- V(wikidat)$comp.s == which.max(clusters(wikidat, mode = "strong")$csize)

Centrality

We've already reviewed degree and betweenness.

Examples of closeness and eigenvector centrality:

V(wikidat)$closeness <- closeness(wikidat)
V(wikidat)$evcent <- evcent(wikidat)$vector

k-cores

V(wikidat)$kc.undir <- graph.coreness(as.undirected(wikidat, mode = "collapse"))
#How are undirected k-cores related to centrality?
kc.cent.corr.fun<-function(x, y = V(wikidat)$kc.undir){
  a <- get.vertex.attribute(wikidat, name = x)
  b <- y
  cor.est <- cor.test(a, b, method = "spearman", exact = FALSE)$estimate
  return(cor.est)
  }
kc.cent.corr.fun <- cmpfun(kc.cent.corr.fun)

cent.atts <- c("indegree", "outdegree", "totaldegree", "betw", "closeness", "evcent")
cent.cors <- sapply(cent.atts, kc.cent.corr.fun)
names(cent.cors) <- cent.atts
barplot(cent.cors, ylab = "Spearman's rho", xlab = "Centrality", main = "Correlation with k-coreness")
#Directed k-cores
sapply(c("in", "out", "all"), function(y) return(sapply(cent.atts, kc.cent.corr.fun, y = graph.coreness(wikidat, mode = y))))
rm(cent.atts, kc.cent.corr.fun)

Community Detection

wikidat.comms <- multilevel.community(as.undirected(wikidat, mode = "collapse"))
wikidat.comms$modularity
V(wikidat)$comms <- wikidat.comms$membership
xtabs(~V(wikidat)$comms) # 3, 4, 5, 9, and 11 are the biggest

wikidat.comms.w <- multilevel.community(as.undirected(wikidat, mode = "collapse"), weights = max(E(wikidat)$eb) - E(wikidat)$eb)
wikidat.comms.w$modularity # Slightly higher
V(wikidat)$comms.w <- wikidat.comms.w$membership
xtabs(~V(wikidat)$comms.w) # 4, 5, 14, and 15 are the biggest

Visualization

Defaults

Inadvisable for a network this large

vcount(wikidat) # 7115 nodes
ecount(wikidat) # 103689 edges
plot(wikidat)

What are the problems?

Subsample

Completely at Random

sampsize <- round(vcount(wikidat) * 0.05) # 5% sample
subsamp <- sample.int(vcount(wikidat), sampsize)
wikidat.sg <- induced.subgraph(wikidat, subsamp)
plot(wikidat.sg)

What are the problems?

Subsample

Completely at Random

sampsize <- round(vcount(wikidat) * 0.05) # 5% sample
subsamp <- sample.int(vcount(wikidat), sampsize)
wikidat.sg <- induced.subgraph(wikidat, subsamp)
plot(wikidat.sg)

Visual problems

Nodes too large
Labels don't provide information

plot(wikidat.sg, vertex.size = 3, vertex.label = "")

What are the problems?

Subsample

Visual problems

Too many isolates
Arrowheads too large

subsamp <- which(degree(wikidat.sg, mode = "all") > 0)
wikidat.sg <- induced.subgraph(wikidat.sg, subsamp)
plot(wikidat.sg, vertex.size = 3, vertex.label = "", edge.arrow.size = 0.5)

Much better, visually

But what are some conceptual problems using this technique?

Conceptual Problems with Random Sampling

Grossly underrepresents degree

Majority of alters absent
Underestimates connectivity

Conceptually naïve

Sample could reflect structure

Replication limitations

Alternative Subsampling Strategy

I want to portray communities and leadership

# Sample based on betweenness and large communities
cond1 <- V(wikidat)$betw > 1
cond2 <- V(wikidat)$comms %in% c(3, 4, 5, 9, 11)
subsamp <- which(cond1 & cond2)
wikidat.sg <- induced.subgraph(wikidat, subsamp)
vcount(wikidat.sg) # From 7,115 to 1,372 nodes
ecount(wikidat.sg) # From 103,689 to 40,587 edges
# Still too big to see anything.
# Systematic 25% sample on eigenvector centrality
evrank <- rank(V(wikidat.sg)$evcent, "random")
nsyssamp <- seq(from = 1, to = max(evrank), by = 4)
nsyssamp <- which(evrank%in% nsyssamp)
wikidat.sg <- induced.subgraph(wikidat.sg, nsyssamp)
# Systematic 25% sample on edge betweenness
ebrank <- rank(E(wikidat.sg)$eb, "random")
esyssamp <- seq(from = 1, to = max(ebrank), by = 4)
esyssamp <- which(ebrank %in% esyssamp)
wikidat.sg <- subgraph.edges(wikidat.sg, esyssamp)
plot(wikidat.sg, vertex.size = 3, vertex.label = "", edge.arrow.size = 0.25)

Not too bad!

Layout

There are many layout algorithms out there

Their default values compromise between quality and efficiency

?layout.auto
wikidat.sg.lo <- layout.fruchterman.reingold(wikidat.sg, params = list(niter = 2500, area = vcount(wikidat.sg)^3))
V(wikidat.sg)$x <- wikidat.sg.lo[,1]
V(wikidat.sg)$y <- wikidat.sg.lo[,2]
plot(wikidat.sg, vertex.size = 3, vertex.label = "", edge.arrow.size = 0.25)

Better use of space, but hairballish

What are your empirical interests?

Sizes
Colors and shading

Categorical variables
Interval and ordinal variables

Transparency

Parameters to Vary

Nodes

Non-continuous: Shape, node color, border color
Continuous: Node size, border width, color

Edges

Non-continuous: Color, line type, arrowhead type
Continuous: Width, color

Labels: Size, color, visibility

Edge Adjustments

Emphasize those with higher betweenness

Use width to convey it

ewidth <- ecdf(E(wikidat.sg)$eb)(E(wikidat.sg)$eb)
E(wikidat.sg)$width <- (1.25 * ewidth)^2
plot(wikidat.sg, vertex.size = 3, vertex.label = "", edge.arrow.size = 0.25, edge.curved = TRUE)

Only a slight improvement

Node Adjustments

Color by community

Scale by eigenvector centrality

library(RColorBrewer )
vcols <- brewer.pal(3, "Accent")
names(vcols) <- unique(V(wikidat.sg)$comms)
V(wikidat.sg)$color <- vcols[as.character(V(wikidat.sg)$comms)]
V(wikidat.sg)$size <- 6 * ecdf(V(wikidat.sg)$evcent)(V(wikidat.sg)$evcent)
plot(wikidat.sg, vertex.label = "", edge.arrow.size = 0.25, edge.curved = TRUE)

Communities better portrayed

Additional Sweep

Halve the number of edges

Same technique as before

ebrank <- rank(E(wikidat.sg)$eb, "random")
esyssamp <- seq(from = 1, to = max(ebrank), by = 2)
esyssamp <- which(ebrank %in% esyssamp)
wikidat.sg <- subgraph.edges(wikidat.sg, esyssamp)
plot(wikidat.sg, vertex.label = "", edge.arrow.size = 0.25, edge.curved = .25)

Much more clear

Do the main observations hold across different layouts?

par(mfrow = c(2, 2), mar = c(0.1, 0.1, 0.1, 0.1))
plot(wikidat.sg, vertex.label = "", edge.arrow.size = 0.25, edge.curved = .25, layout = layout.kamada.kawai, margin = c(0, 0, 0, 0))
plot(wikidat.sg, vertex.label = "", edge.arrow.size = 0.25, edge.curved = .25, layout = layout.circle, margin = c(0, 0, 0, 0))
plot(wikidat.sg, vertex.label = "", edge.arrow.size = 0.25, edge.curved = .25, layout = layout.svd, margin = c(0, 0, 0, 0))
plot(wikidat.sg, vertex.label = "", edge.arrow.size = 0.25, edge.curved = .25, layout = layout.mds, margin = c(0, 0, 0, 0))

Color Edges by Community

wikidat.sg.elist <- get.edgelist(wikidat.sg, names = FALSE)
edgecolor <- apply(wikidat.sg.elist, 1, function(x){
    c1 <-V(wikidat.sg)$comms[x[1]]
    c2 <- V(wikidat.sg)$comms[x[2]]
    return(ifelse(c1 == c2, vcols[as.character(c1)], "darkgrey"))
    })
E(wikidat.sg)$color <- edgecolor
plot(wikidat.sg, vertex.label = "", edge.arrow.size = 0.25, edge.curved = .25, layout = layout.svd, margin = c(0, 0, 0, 0))

Advice

Less is more

Fewer modifications are better than more
Smaller data is easier to represent than larger data

Transparency when reporting visualization
Informative and intuitive over beautiful

Practical over gimmicky
Science over art

For more options, see

?igraph.plotting

Graph-Level Indices

What's the edgewise reciprocity for Wikipedia administrator voting?

reciprocity(wikidat)

5.6% of all edges are reciprocated

Is this figure big or small?

Bigger or smaller than what?

We need to establish a basis for comparison.

Graph-Level Indices

Affected by network size and density

Example

graph.order <- seq(from = 10, to = 100, by = 5)
graph.trans <- sapply(graph.order, function(x) replicate(1000, transitivity(erdos.renyi.game(x, .1))))
graph.apl <- sapply(graph.order, function(x) replicate(1000, average.path.length(erdos.renyi.game(x, .1))))
par(mfrow = c(1, 2))
plot(x = graph.order, y = apply(graph.trans, 2, mean), ylim = c(0, 1), xlim = c(0, 100), type = "l", xlab = "Order", ylab = "Transitivity", main = "Tie Prob = 0.10; Simulations=1000")
lines(x = graph.order, y = apply(graph.trans, 2, quantile, probs = .025, na.rm = TRUE), col = "red", lty = 2, type = "l")
lines(x = graph.order, y = apply(graph.trans, 2, quantile, probs = .975, na.rm = TRUE), col = "red", lty = 2, type = "l")
plot(x = graph.order, y = apply(graph.apl, 2, mean), ylim = c(0, 5), xlim = c(0, 100), type = "l", xlab = "Order", ylab = "Average Path Length", main = "Tie Prob = 0.10; Simulations = 1000")
lines(x = graph.order, y = apply(graph.apl, 2, quantile, probs = .025, na.rm = TRUE), col = "red", lty = 2, type = "l")
lines(x = graph.order, y = apply(graph.apl, 2, quantile, probs = .975, na.rm = TRUE), col = "red", lty = 2, type = "l")

Graph-Level Indices

Affected by network size and density

graph.tieprob <- seq(from = .01, to = .30, by = .01)graph.trans <- sapply(graph.tieprob, function(x) replicate(1000, transitivity(erdos.renyi.game(100, x))))graph.apl <- sapply(graph.tieprob, function(x) replicate(1000, average.path.length(erdos.renyi.game(100, x))))par(mfrow = c(1, 2))plot(x = graph.tieprob, y = apply(graph.trans, 2, mean), ylim = c(0, 1), xlim = c(0, .3), type = "l", xlab = "Tie Probability", ylab = "Transitivity", main = "Order = 100; Simulations = 1000")lines(x = graph.tieprob, y = apply(graph.trans, 2, quantile, probs = .025, na.rm = TRUE), col = "red", lty = 2, type = "l")lines(x = graph.tieprob, y = apply(graph.trans, 2, quantile, probs = .975, na.rm = TRUE), col = "red", lty = 2, type = "l")plot(x = graph.tieprob, y = apply(graph.apl, 2, mean), ylim = c(0,8), xlim = c(0,.3), type = "l", xlab = "Tie Probability", ylab = "Average Path Length", main = "Order=100; Simulations = 1000")lines(x = graph.tieprob, y = apply(graph.apl, 2, quantile, probs = .025, na.rm = TRUE), col = "red", lty = 2, type = "l")lines(x = graph.tieprob, y = apply(graph.apl, 2, quantile, probs = .975, na.rm = TRUE), col = "red", lty = 2, type = "l")

Random Graph Comparisons

Conditional Uniform Graph (CUG) test

Measure observed statistic on a graph
Generate many random graphs (~1000)

Take the same measurement from each

Compare the observed statistic to the random ones

Two families

Reassignment methods
Graph generators

Reassignment Methods

"Shuffling" ties and attributes

Permutation tests
Rewiring tests

Permutation tests

Permute an attribute to test its relationship to a nodal property
Permute a graph to test its relationship to another graph

# Do conservative blogs have more links than liberal ones?
polblogs <- nexus.get(id = "polblogs") #0 = liberal, 1 = conservative

polblogs.totdeg <- degree(polblogs, mode = "all")
obs.diff <- mean(polblogs.totdeg[V(polblogs)$LeftRight == 1]) - mean(polblogs.totdeg[V(polblogs)$LeftRight == 0]) 

perm.fun <- function(){
  perm.attribs <- sample(V(polblogs)$LeftRight)
  return(mean(polblogs.totdeg[perm.attribs == 1]) - mean(polblogs.totdeg[perm.attribs == 0]))
  }; perm.fun<-cmpfun(perm.fun)
exp.diffs <- replicate(1000, perm.fun())
mean(obs.diff > exp.diffs) # ~0.86, not significantly greater than by chance assignment

"Why not just use a t-test?"

Permutation tests

E-I Index (Krackhardt and Robert Stern 1988)

(Outgroup Ties - Ingroup Ties) / Total Ties

e.i.index <- function(el, attrib){
  in.ties <- sum(attrib[el[,1]] == attrib[el[ , 2]])
  ex.ties <- sum(attrib[el[,1]] != attrib[el[ , 2]])
  return((ex.ties - in.ties)/(ex.ties + in.ties))
  }; e.i.index<-cmpfun(e.i.index)
polblogs.el <- get.edgelist(polblogs, names = FALSE)
polblogs.attrib <- V(polblogs)$LeftRight
obs.val <- e.i.index(polblogs.el, polblogs.attrib)
exp.val <- replicate(1000, e.i.index(polblogs.el, sample(polblogs.attrib)))
mean(obs.val < exp.val) # We expect fewer outgroup ties

Rewiring tests

One of two approaches

Shuffle edges
Shuffle edge weights

Preserves degree (and edge weight) distribution

karate <- nexus.get("karate")
karate.lo <- layout.kamada.kawai(karate, params = list(niter = 1000))

par(mfrow = c(1, 2))
plot(karate, vertex.size = 5, vertex.label = V(karate)$name, vertex.label.family = "sans", margin = c(0, 0, 0, 0), edge.curved = .33, layout = karate.lo)

karate.rw <- rewire(karate, mode = "simple", niter = 10*ecount(karate))
plot(karate.rw, vertex.size = 5, vertex.label = V(karate)$name, vertex.label.family = "sans", margin = c(0, 0, 0, 0), edge.curved = .33, layout = karate.lo)

Original (left) and Rewired (right)

Rewired Transitivity

obs.trans <- transitivity(karate)karate.rw.fun <- function(){  krw <- rewire(karate, mode = "simple", niter = 10 * ecount(karate))  return(transitivity(krw))  }; karate.rw.fun <- cmpfun(karate.rw.fun)exp.trans <- replicate(1000, karate.rw.fun())mean(obs.trans > exp.trans) # ~0.87

What's the interpretation of this finding?

Graph Generators

Erdős–Rényi model
Power Law model
Watts and Strogatz model
Exponential random graph models
...and many others

Erdős–Rényi model

A fixed number (or proportion) of edges are randomly assigned to a fixed set of nodes.

Variants

Exact #ties or density
Directed, undirected
Dyad census-conditioned

er.graph <- erdos.renyi.game(100, .03)
plot(er.graph, vertex.size = 5, vertex.label = "")
obs.incent <- centralization.degree(wikidat, mode = "in", loops = FALSE)$centralization
exp.incent.fun <- function(){
  new.net <- erdos.renyi.game(vcount(wikidat), ecount(wikidat), type = "gnm", directed = TRUE)
  return(centralization.degree(new.net, mode = "in", loops = FALSE)$centralization)
  }; exp.incent.fun <- cmpfun(exp.incent.fun)
exp.incent <- replicate(1000, exp.incent.fun())
mean(obs.incent > exp.incent)

Erdős–Rényi model

Power Law model

Assumptions

Networks grow according to the principle of preferential attachment.

P(X=x) ~ x^(-alpha)

Nodes are of degree greater than or equal to x

P(X=x) is the probability of observing a node with degree x or greater

alpha is the scalar

(Barabási and Albert 1999)

Power Law model

wikidat.in <- degree(wikidat, mode = "in")
wikidat.out <- degree(wikidat, mode = "out")
wikidat.plfit.in <- power.law.fit(wikidat.in)
wikidat.plfit.out <- power.law.fit(wikidat.out)
print(c(wikidat.plfit.in$alpha, wikidat.plfit.out$alpha))
exp.incent.pl.fun <- function(){
  new.net <- static.power.law.game(vcount(wikidat), ecount(wikidat), exponent.out = wikidat.plfit.out$alpha, exponent.in = wikidat.plfit.in$alpha)
  return(centralization.degree(new.net, mode="in", loops = FALSE)$centralization)
  }; exp.incent.pl.fun <- cmpfun(exp.incent.pl.fun)
exp.incent.pl <- replicate(1000, exp.incent.pl.fun())
mean(obs.incent > exp.incent.pl) # Obs more centralized
par(mfrow = c(1, 2))
hist(exp.incent, main = "E-R Indegree Centralization")
hist(exp.incent.pl, main = "Power Law Indegree Centralization") # ~10x more centralized graphs

Watts and Strogatz model

Assumptions

Local neighborhoods bridged by a few actors with ties that jump across neighborhoods.

Game

Begin with a lattice

Each actor initially has the same degree, neighborhood index (nei) * 2

Rewire with a probability of p

par(mfrow=c(1, 2))
g1 <- graph.lattice(100, nei = 1)# nei=mean degree / 2
g1.layout<-layout.fruchterman.reingold(g1)
V(g1)$x <- g1.layout[ , 1]; V(g1)$y <- g1.layout[ , 2]
g1.rewire <- rewire.edges(g1, p = 0.1) # prob of .1 rewire
plot(g1, vertex.size = 5, vertex.label = "", main = "Lattice")
plot(g1.rewire, vertex.size = 5, vertex.label = "", main = "Rewire")

Watts and Strogatz model

Question

What's the relationship between the neighborhood index (nei) and graph density?

Watts and Strogatz model

Low values for p resemble lattices

#The following two plots should be approximately equal
dev.off(); par(mfrow = c(1, 2))
sw.net <- watts.strogatz.game(dim = 1, size = 100, nei = 1, p = .1)
plot(g1.rewire, vertex.size = 5, vertex.label = "", main = "Rewired Lattice")
plot(sw.net, vertex.size = 5, vertex.label = "", layout = g1.layout, main = "Watts-Strogatz")

High values for p become closer to Erdős–Rényi graphs

par(mfrow = c(1, 2))
g2.ws <- watts.strogatz.game(dim = 1, size = 100, nei = 1, p = .9)
g2.er <- erdos.renyi.game(100, (2 * 100 / 2) / (100 * 99 / 2), directed = FALSE)
plot(g2.ws, vertex.size = 5, vertex.label = "", main = "Watts-Strogatz")
plot(g2.er, vertex.size = 5, vertex.label = "", main = "Erdős–Rényi")

Rewiring Demonstration

Exponential random graph models

Exercises

Recreate the first two plots in this lab using the Watts and Strogatz model. Fix the order (size) to 100 and set the density to approximately 0.02. Vary p from 0 to 1 by .01.

(Hint: mean degree equals the number of edges divided by the number of nodes.)

Recreate the power law exercise on degree centralization for the political blog network. If appropriate to model as a power law, is this network more centralized than predicted?