Introduction to Software: R



International Summer School on Social Network Analysis
St. Petersburg, 2014

Benjamin Lind, PhD

Why Program?

  • Flexibility
    • Data
      • Convert formats
      • Customized cleaning
    • New methods
      • Slow GUI development
      • Open source community
      • Can write your own functions
  • Replication
    • Personal record keeping
    • Collaboration
    • Transparency
  • Awareness

About R

Implements the S programming language
Open source
Object oriented programming language
Base & user-contributed packages
Meets practically all statistical purposes
Availability:  r-project.org

Why Program in R?


  • Open source
  • Community
    • Large and growing
    • Made by statisticians
    • Wealth of packages (i.e., libraries)
  • Features
    • Documentation
    • Moderately easy language
    • Graphics

Why Social Networks in R?

  • Shared platform
    • Networks
    • Statistics
    • Data management
    • Graphics
  • Active network software development
    • Network metrics
      • igraph
      • sna
      • tnet
    • Network modeling
      • statnet
      • RSiena

GeneRal Introduction


Start R!

Type "R" into your terminal

Basics

  • Command line-only interface
    • Reliable, Replicable, Records
  • GUI options do exist, but...
    • Outside of scope for our purposes
  • Assignment. Functionally equivalent methods
  • x <- 1 # Preference: primary method 
  • x = 1 # Preference: within functions    
  • 1 -> x # Preference: never    
  • Once assigned, x becomes an "object"
  • ls()    

Case Sensitivity

Try

Cheburashka <- "a cartoon character"Cheburashka
cheburashka

What does the error mean?

Hard Drive Navigation

  • R runs within a folder on your hard drive
    • Location where R reads & writes
  • Which folder are you in now?

orig.folder <- getwd()orig.folder
dir.create("new.folder")
list.files()
setwd("new.folder")list.files()
setwd(orig.folder)
file.remove("new.folder")list.files()

Packages


Image courtesy of Kliponius

Packages

  • Preloaded (e.g., basestats)
  • Dependencies
  • User-contributed
    • Social network analysis
      • statnet
      • igraph
      • tnet
      • RSiena
      • ...
    • Graphics
    • Advanced regression models
    • Web scraping
    • Spatial
    • ...

Packages

To install a package
 install.packages("igraph")

To load a package
 library(igraph)

To unload a package
 detach("package:igraph", unload=TRUE)

Missing Data

Typically coded as "NA".  For example
x <- NA
Missing codes
NA # Non-applicable, missing
NULL # Undefined or empty value
NaN # Not a Number
Inf # Infinite 
Indicated by functions
is.na(x)
is.null(x)
is.nan(x)
is.infinite(x)

Data Types

Storing and accessing information within R

Picture by KENPEI

Data Types

  • Elements
    • Numeric
    • Character
      • Use double quotes
    • Logical
      • TRUEFALSE T F 1 0
  • Objects
    • Vectors
    • Matrices
    • Arrays
    • Data frames
    • Lists
    • Closures

Elements

Identification
is.numeric(42)
is.numeric("42")
is.numeric(3.12)
is.logical(T)
is.logical(FALSE)
is.character(FALSE)
is.character("42")
is.character("HSE Network Workshop") 
Conversion
as.numeric("42") + 1
as.character(as.numeric("0") + 1)
as.character(!as.logical("1"))
as.character(as.logical(as.numeric("0") + 1))
as.character(is.logical(as.logical(as.numeric("0")) + 1))
as.character(!as.logical(as.numeric("1")))  

Object Types: Vectors

Vectors collect elements in one dimension
(Photo by Frank C. Müller)

Object Types: Vectors

Vectors collect elements in one dimension
RU.cit <- c("Mos", "StP", "Nov", "Yek") # Create vector
RU.cit <- c(RU.cit, "Niz") # Add element
RU.cit[3] # Retrieve element 3
RU.pops <- c(10.4, 4.6, 1.4, 1.3) # Create vector
names(RU.pops) <- RU.cit[c(1, 2, 3, 4)] # Name vector
RU.pops[c(4,2,3,1)] # Permute the order and retrieve
length(RU.pops) # Retrieve the length
RU.vis <- c(TRUE, TRUE, FALSE, FALSE) #Create vector
RU.vis[-c(4, 2)] # Remove elements 4 and 2
Remaining object types build upon this logic

Object Types: Vectors


Sequences
all(1:5 == c(1, 2, 3, 4, 5))
all(1:5 == seq(from = 1, to = 5, by = 1)) 

Repetitions
rep(1, times = 5)
rep(NA, times = 4)
rep("Beetlejuice", times = 3)

Object Types: Matrices and Arrays


Picture by David.Asch

Object Types: Matrices & Arrays

Matrices
  • Two dimensions
    • nrow(); ncol()  
  • Assignment
    • x[1, 5] <- 1; x[2, ] <- x[1, 5] > 0  
  • More columns, rows
    • cbind(); rbind()
  • Names:  rownames();  colnames()
Arrays
  • Three dimensions
    • dim(); length(x[1, 1, ])  
  • Assignment
    • x[3, 2, 6] <- 1; x[4, , 2] <- NA  
  • Names:  dimnames()

Object Types: Matrices

Turning vectors into matrices
RU.mat <- cbind(pop = RU.pops, vis = RU.vis)
rownames(RU.mat) <- RU.cit[c(1:4)]
colnames(RU.mat)
Alternatively
RU.mat <- matrix( , nrow = length(RU.pops), ncol = 2)
colnames(RU.mat) <- c("pop", "vis")
rownames(RU.mat) <- names(RU.pops)
RU.mat[ , "pop"] <- RU.pops
RU.mat[ , "vis"] <- RU.vis 
Try
RU.mat[1, ]RU.mat[ , "pop"]
RU.mat[1, 2]RU.mat[1, , ] # What does the error mean?

Object Types: Data Frames

Data frames are like matrices, but allow multiple element types. Compare:
RU.mat[ , "vis"]
RU.vis
To retain initial element type
RU.df <- as.data.frame(RU.mat)
RU.df$vis <- as.logical(RU.df$vis) 
These commands are equivalent statements
RU.df$vis[2]
RU.df[2, "vis"]
RU.df[["vis"]][2]
RU.df[[2]][2]

Object Types: Lists

Lists collect objects of (potentially) varying types & dimensions
RU.list <- list(RU.mat, RU.vis, RU.df)
names(RU.list) <- c("Mat", "Vec", "DF") 

These commands are equivalent statements
RU.list$DF$vis[2]
RU.list[["DF"]]$vis[2]
RU.list[[3]]$vis[2]
RU.list[["DF"]][["vis"]][2]
RU.list[["DF"]][2, "vis"]
RU.list[["DF"]][2, 2]

Object Types: Closures

Closures are functions
Functions accept input (optional) & return output

Photo by erix!

Object Types: Closures

Closures are functions
Functions accept input (optional) & return output
xTimesxMinus1 <- function(x)
  return(x * (x-1))
xTimesxMinus1(42)

InvHypSineTrans <- function(x){
  x.ret <- log(x + sqrt(x^2 + 1))
  return(x.ret)
  }

plot(x = 0:1000, InvHypSineTrans(0:1000), main = "Hyperbolic Sine Transformation", xlab = "x", ylab = "y")
Compiled functions
library(compiler)InvHypSineTrans.cmp <- cmpfun(InvHypSineTrans)system.time(InvHypSineTrans(0:10000000))system.time(InvHypSineTrans.cmp(0:10000000))

Help

How to overcome the inevitable hurdles

Picture from Richard Milnes

Help

If you're looking for a function, but don't know the name
??"Chi Squared"
help.search("Chi Squared")
Once you know the name of the function
?chisq.test
Want to know how the function is written?
chisq.test
Still stumped?

Help

Exercise

  1. Find the randomized normal distribution function
  2. Use it to create two vectors of length 100
    1. One with mean of zero & standard deviation of one
      1. Assign it x
    2. A second with mean of 50 & standard deviation of 25
      1. Assign it y
  3. Find the linear regression model command
  4. Regress y on x

Apply Family


Apply Family

Apply administers a function across elements within an object
Better than for() loops in R

lab.seq <- seq(from = -100, to = 100, by = 20)
?sapply
lab.mat <- sapply(lab.seq, rnorm, n = 100)
dim(lab.mat)
fix(lab.mat)
?apply
lab.means <- apply(lab.mat, 2, mean)
plot(x = lab.seq, y = lab.means)

Apply Family: Exercise



Recreate the last exercise, but with the following modifications
  1. Use a sequence from 1 to 100 by 5
  2. Run the sapply() line, but...
    1. Set the mean to zero
    2. Apply the sequence to rnorm()'s standard deviation
    3. Run the apply() line, but...
      1. Retrieve the standard deviation instead of the mean
  3. Plot your results and show them to me

Loading and Saving

Reusing material between R sessions

Photo by mRio

Loading & Saving: Source

Text files for executable R code

Like 'do files' for Stata
Write them in text editors, following R syntax
In text editor, type
test.val <- "It works!"
test.val2 <- paste(test.val, "Really")
Save as "test.val.source.txt" in working directory
In R, enter
source("test.val.source.txt")
ls()
print(test.val)
print(test.val2)

Loading & Saving

  • History
    • RHistory files retain a log of your session
    • May be opened in a text editor
    • Should end in ".RHistory" extention
    • Usage
      savehistory("Filename.RHistory")
      loadhistory("Filename.RHistory") 
  • Images
    • Saves every object in the environment (i.e., ls())
    • Usage
    • save.image("Filename.RData")
      load("Filename.RData")
  • dput(), dget()
    • Saves and retrieves individual objects

Network Analysis Packages in R

igraph

statnet

intergraph

tnet

RSiena

igraph

Originated as a fork of the sna package
Libraries C/C++, Python, and R
Created by Gábor Csárdi

Motivations and Strengths
  • Network science tradition
  • General and complex network metrics
  • Data storage
  • Speed and efficiency

statnet

Suite collecting the packages
  • sna
  • network
  • ergm, tergm
  • ...and others
  
Interdisciplinary, cross-university development team

Motivations
  • General social network metrics
  • Data storage
  • Modeling and statistical inference

intergraph

statnet and igraph don't play nicely
  • Many identical function names
  • Different data formats

intergraph is a conversion tool
  • asIgraph()
  • asNetwork()






tnet

Created by Tore Opsahl
Provides metrics for three types of networks
  • Weighted
  • Two-mode
  • Longitudinal


Motivated by the bias introduced by coercing these networks into simple graphs

Works well with igraph

RSiena

Originally Windows-based, stand alone
Stochastic Actor-Oriented Models

Motivations
  • Longitudinal network modeling
  • Longitudinal behavioral network modeling
  • Cross-sectional network modeling
    • p* / ERGM




Social Networks in R

Data importation

Basic Measurements

Data

R can read anything!*
  • Delimited text (comma, tab, bang, pipe, etc)
  • Excel files
  • Internet files
    • Tables from webpages
    • .json, .xml, .html format
    • Files stored online
  • Pajek files
  • SPSS and Stata files
  • Spatial data in shapefiles and KML

If possible, I'd recommend tab-delimited text files.

*It reads some file formats much better (and easier) than others

Social Network Data

Typical network data formats
  • Matrix
  • Edge list
  • Pajek


Attribute formats
  • Actor dataframe
  • Edge dataframe

Import a Tab-Delimited Edge List

Wikipedia votes for site administrators
One-mode, directed, simple

Picture courtesy of Frank Schulenburg

Import a Tab-Delimited Edge List

library(igraph)

wikidat <- read.table(file = "http://pastebin.com/raw.php?i=UVmTznBj", sep = "\t", header = TRUE, skip = 6)
# 'sep' is the separator
# 'header' indicates column names
# 'skip' are the number of initial rows to ignore

wikidat <- graph.data.frame(d = wikidat, directed = TRUE)
summary(wikidat) # Nodes: 7115; Edges: 103689

Import a Pajek File

Appearances by Marvel comic book characters
Characters by comic book issue
Two-mode
Created by Rosselló, Alberich, and Miro


Import a Pajek File

library(igraph)

marvnet <- read.graph("http://bioinfo.uib.es/~joemiro/marvel/porgat.txt", format = "pajek")
summary(marvnet)

A More Difficult Case

EIES Data
  • Case
    • Early study on electronic information exchange
    • Precursor to email, late-1970s
  • Subjects
    • 32 researchers
    • Cross-discipline social network analysts
  • Relationships, expressed in matrices
    • Heard of - Met - Friend - Close friend at two time points
    • Message frequency between time points
  • Attributes
    • Vertex: discipline
    • Vertex: citation count
    • Edges: affinity
    • Edges: message frequency

EIES

Acquire the data
firstdir <- getwd()
newdir <- paste(firstdir, "EIES_data", sep = "/")
download.file("https://www.stats.ox.ac.uk/~snijders/siena/EIES_data.zip", "EIES_data.zip", method = "wget")
unzip("EIES_data.zip", exdir = newdir)
setwd(newdir)
Read the nodal attribute data
eiesatts <- read.table("eies.dat", as.is = TRUE, stringsAsFactors = FALSE, strip.white = TRUE, skip = 3, quote = "", nrows = 32)
colnames(eiesatts) <- c("Researcher", "Original_Code", "Citations", "Discipline")
eiesatts$Discipline_Name <- c("Sociology", "Anthropology", "Statistics", "Psychology")[eiesatts$Discipline]
eiesatts$Discipline_Name[3] <- "Communication"
eiesatts$Discipline_Name[27] <- "Mathematics"

EIES

Read in the networks and create igraph objects
friend1 <- read.table("EIES.1", sep = " ", as.is = TRUE, strip.white = TRUE)
friend2 <- read.table("EIES.2", sep = " ", as.is = TRUE, strip.white = TRUE)
eiesmessages <- read.table("messages.asc", nrows = 32, strip.white = TRUE)
eies <- list(t1 = friend1, t2 = friend2, messages = eiesmessages)
rm(friend1, friend2, eiesmessages)
eies <- lapply(eies, as.matrix)
eies$t1 <- graph.adjacency(eies$t1, mode = "directed", weighted = TRUE, diag = FALSE)
E(eies$t1)$etype <- c("Do not know", "Heard of", "Met", "Friend", "Close Friend")[E(eies$t1)$weight + 1]
eies$t2 <- graph.adjacency(eies$t2, mode = "directed", weighted = TRUE, diag = FALSE)
E(eies$t2)$etype <- c("Do not know", "Heard of", "Met", "Friend", "Close Friend")[E(eies$t2)$weight + 1]
eies$messages <- graph.adjacency(eies$messages, mode = "directed", weighted = TRUE, diag = TRUE)

EIES

Assign vertex attributes
eiesvatts <- function(g, atts = eiesatts){
    V(g)$name <- as.character(1:vcount(g))
    for (i in 1:ncol(atts))
        g <- set.vertex.attribute(g, name = colnames(atts)[i], value = atts[, i])
    return(g)
    }
eies <- lapply(eies, eiesvatts)
rm(eiesvatts, eiesatts)
Clean the working directory (optional)
sapply(list.files(), file.remove)
setwd(firstdir)
file.remove(newdir)
file.remove("EIES_data.zip")
rm(firstdir, newdir)

Import Other Formats in igraph

See
?read.graph
?graoh.data.frame
?graph.adjacency
?graph.adjlist
?graph.constructors
# Query database for popular data sets
?nexus.list
?nexus.search
?nexus.get
?nexus.info

Basic Measurements


Density and Degree

is.simple(wikidat) # Verify it's simple
is.directed(wikidat) # Verify it's directed
vcount(wikidat) # Number of vertices
ecount(wikidat) # Number of edges
graph.density(wikidat) # Density

V(wikidat)$indegree <- degree(wikidat, mode = "in") V(wikidat)$outdegree <- degree(wikidat, mode = "out") V(wikidat)$totaldegree <- degree(wikidat, mode = "total")

all.vatts <- list.vertex.attributes(wikidat)[-1] all.vatts <- sapply(all.vatts, get.vertex.attribute, graph = wikidat) summary(all.vatts)
hist(log(V(wikidat)$outdegree), main = "Wiki Admin Votes", xlab = "Outdegree (log)")


Subsetting Networks

By edges
eies.closefriends.t1 <- subgraph.edges(eies$t1, eids = which(E(eies$t1)$etype == "Close Friend"), delete.vertices = FALSE)
eies.closefriends.t1 <- as.undirected(eies.closefriends.t1)
eies.closefriends.t2 <- subgraph.edges(eies$t2, eids = which(E(eies$t2)$etype == "Close Friend"), delete.vertices = FALSE)
eies.closefriends.t2 <- as.undirected(eies.closefriends.t2)
eies.closefriends <- list(t1 = eies.closefriends.t1, t2 = eies.closefriends.t2)
rm(eies.closefriends.t1, eies.closefriends.t2)
By vertices
remove.isolates <- function(g){
    nonisolates <- which(degree(g) > 0)
    return(induced.subgraph(g, nonisolates))
    }
eies.closefriends.nonisolates <- lapply(eies.closefriends, remove.isolates)
rm(remove.isolates)

Dyads and Triads

unlist(dyad.census(wikidat))
reciprocity(wikidat)
Which measurement of reciprocity did that value refer to?
  • Edgewise?
  • Dyadic?
  • Dyadic, non-null ("ratio")?
transitivity(eies.closefriends$t1) # Unidrected network
V(eies.closefriends$t1)$loc.trans <- transitivity(eies.closefriends$t1, "local")
triad.census(wikidat) # Directed network
source("http://pastebin.com/raw.php?i=J4fPjFuN")
transitivity_directed(wikidat) # Not in igraph
Who has the highest clustering coefficient?
V(eies.closefriends$t1)$name[which.max(V(eies.closefriends$t1)$loc.trans)]

Paths


Courtesy of Mihael Simonic

Paths

Warning: the commands below might take a minute or so
average.path.length(wikidat)
diameter(wikidat)
V(wikidat)$name[(get.diameter(wikidat))]
hist(shortest.paths(wikidat), main = "Histogram of Shortest Path Lengths", xlab = "Geodesic Lengths")

V(wikidat)$betw <- betweenness(wikidat)E(wikidat)$eb <- edge.betweenness(wikidat)hist(log(E(wikidat)$eb), main = "Histogram of Edge Betweenness", sub = "Wiki Admin Voting", xlab = "Edge Betweenness (log)")



Connectivity


Courtesy of Rico Shen

Connectivity

#How many weak and strong components do we have? 
sapply(c("weak", "strong"), no.clusters, graph = wikidat)

#Notice the distributions clusters(wikidat, mode = "weak")$csize clusters(wikidat, mode = "strong")$csize #Assign membership as attributesV(wikidat)$comp.w <- clusters(wikidat, mode = "weak")$membership V(wikidat)$comp.s <- clusters(wikidat, mode = "strong")$membership# Are the vertexes part of the giant component? V(wikidat)$giantcomp.w <- V(wikidat)$comp.w == which.max(clusters(wikidat, mode = "weak")$csize) V(wikidat)$giantcomp.s <- V(wikidat)$comp.s == which.max(clusters(wikidat, mode = "strong")$csize)

Centrality


Centrality

We've already reviewed degree and betweenness.
Examples of closeness and eigenvector centrality:
V(wikidat)$closeness <- closeness(wikidat)
V(wikidat)$evcent <- evcent(wikidat)$vector

k-cores

V(wikidat)$kc.undir <- graph.coreness(as.undirected(wikidat, mode = "collapse"))
#How are undirected k-cores related to centrality?
kc.cent.corr.fun<-function(x, y = V(wikidat)$kc.undir){
  a <- get.vertex.attribute(wikidat, name = x)
  b <- y
  cor.est <- cor.test(a, b, method = "spearman", exact = FALSE)$estimate
  return(cor.est)
  }
kc.cent.corr.fun <- cmpfun(kc.cent.corr.fun)
cent.atts <- c("indegree", "outdegree", "totaldegree", "betw", "closeness", "evcent")
cent.cors <- sapply(cent.atts, kc.cent.corr.fun)
names(cent.cors) <- cent.atts
barplot(cent.cors, ylab = "Spearman's rho", xlab = "Centrality", main = "Correlation with k-coreness")
#Directed k-cores
sapply(c("in", "out", "all"), function(y) return(sapply(cent.atts, kc.cent.corr.fun, y = graph.coreness(wikidat, mode = y))))
rm(cent.atts, kc.cent.corr.fun)


Community Detection


Courtesy of Jorge Láscar

Community Detection

wikidat.comms <- multilevel.community(as.undirected(wikidat, mode = "collapse"))
wikidat.comms$modularity
V(wikidat)$comms <- wikidat.comms$membership
xtabs(~V(wikidat)$comms) # 3, 4, 5, 9, and 11 are the biggest

wikidat.comms.w <- multilevel.community(as.undirected(wikidat, mode = "collapse"), weights = max(E(wikidat)$eb) - E(wikidat)$eb) wikidat.comms.w$modularity # Slightly higher V(wikidat)$comms.w <- wikidat.comms.w$membership xtabs(~V(wikidat)$comms.w) # 4, 5, 14, and 15 are the biggest
See also ?communities
BE SURE TO SAVE YOUR SESSION!!!
(We'll use it later)
save.image("NetSchool.RData")
savehistory("NetSchool.RHistory")

Visualization

A Small Example

plot(eies.closefriends$t1)

Let's Indicate Attributes!

Scale vertex size by citations, color by discipline
library(RColorBrewer)library(compiler)
my_ecdf_fun <- function(x.val, all.vals){
    ret.val <- mean(x.val >= all.vals)
    return(ret.val)
    }
my_ecdf_fun <- cmpfun(my_ecdf_fun)
vsize <- sapply(V(eies.closefriends$t1)$Citations, my_ecdf_fun, all.vals = V(eies.closefriends$t1)$Citations)
vsize <- 7.5 + 15 * vsize

each.discipline <- unique(V(eies.closefriends$t1)$Discipline_Name) n.disciplines <- length(each.discipline) vcol <- brewer.pal(n.disciplines, name = "Accent") vcol <- vcol[match(V(eies.closefriends$t1)$Discipline_Name, each.discipline)]
plot(eies.closefriends$t1, vertex.color = vcol, vertex.size = vsize, vertex.label.color = "black", vertex.label.family = "sans")


Defaults for a Much Larger Case

Inadvisable for a network this large

vcount(wikidat) # 7115 nodes
ecount(wikidat) # 103689 edges
plot(wikidat)

What are the problems?


Subsample

Completely at Random
sampsize <- round(vcount(wikidat) * 0.05) # 5% sample
subsamp <- sample.int(vcount(wikidat), sampsize)
wikidat.sg <- induced.subgraph(wikidat, subsamp)
plot(wikidat.sg)

What are the problems?


Subsample

Completely at Random
sampsize <- round(vcount(wikidat) * 0.05) # 5% sample
subsamp <- sample.int(vcount(wikidat), sampsize)
wikidat.sg <- induced.subgraph(wikidat, subsamp)
plot(wikidat.sg)
Visual problems
  • Nodes too large
  • Labels don't provide information
plot(wikidat.sg, vertex.size = 3, vertex.label = "")

What are the problems?


Subsample

Visual problems
  • Too many isolates
  • Arrowheads too large
subsamp <- which(degree(wikidat.sg, mode = "all") > 0)
wikidat.sg <- induced.subgraph(wikidat.sg, subsamp)
plot(wikidat.sg, vertex.size = 3, vertex.label = "", edge.arrow.size = 0.5)

Much better, visually



But what are some conceptual problems using this technique?

Conceptual Problems with Random Sampling


  • Grossly underrepresents degree
    • Majority of alters absent
    • Underestimates connectivity
  • Conceptually naïve
    • Sample could reflect structure
  • Replication limitations

Alternative Subsampling Strategy

I want to portray communities and leadership
# Sample based on betweenness and large communities
cond1 <- V(wikidat)$betw > 1
cond2 <- V(wikidat)$comms %in% c(3, 4, 5, 9, 11)
subsamp <- which(cond1 & cond2)
wikidat.sg <- induced.subgraph(wikidat, subsamp)
vcount(wikidat.sg) # From 7,115 to 1,372 nodes
ecount(wikidat.sg) # From 103,689 to 40,587 edges
# Still too big to see anything.
# Systematic 25% sample on eigenvector centrality
evrank <- rank(V(wikidat.sg)$evcent, "random")
nsyssamp <- seq(from = 1, to = max(evrank), by = 4)
nsyssamp <- which(evrank%in% nsyssamp)
wikidat.sg <- induced.subgraph(wikidat.sg, nsyssamp)
# Systematic 25% sample on edge betweenness ebrank <- rank(E(wikidat.sg)$eb, "random") esyssamp <- seq(from = 1, to = max(ebrank), by = 4) esyssamp <- which(ebrank %in% esyssamp) wikidat.sg <- subgraph.edges(wikidat.sg, esyssamp) plot(wikidat.sg, vertex.size = 3, vertex.label = "", edge.arrow.size = 0.25)

Not too bad!


Layout

There are many layout algorithms out there
Their default values compromise between quality and efficiency
?layout.auto
wikidat.sg.lo <- layout.fruchterman.reingold(wikidat.sg, params = list(niter = 2500, area = vcount(wikidat.sg)^3))
V(wikidat.sg)$x <- wikidat.sg.lo[,1]
V(wikidat.sg)$y <- wikidat.sg.lo[,2]
plot(wikidat.sg, vertex.size = 3, vertex.label = "", edge.arrow.size = 0.25)

Better use of space, but hairballish


What are your empirical interests?

  • Sizes
  • Colors and shading
    • Categorical variables
    • Interval and ordinal variables
  • Transparency


Parameters to Vary
  • Nodes
    • Non-continuous: Shape, node color, border color
    • Continuous: Node size, border width, color
  • Edges
    • Non-continuous: Color, line type, arrowhead type
    • Continuous: Width, color
  • Labels: Size, color, visibility

Edge Adjustments

Emphasize those with higher betweenness
Use width to convey it
ewidth <- sapply(E(wikidat.sg)$eb, my_ecdf_fun, E(wikidat.sg)$eb)
E(wikidat.sg)$width <- (1.25 * ewidth)^2
plot(wikidat.sg, vertex.size = 3, vertex.label = "", edge.arrow.size = 0.25, edge.curved = TRUE)

Only a slight improvement


Node Adjustments

Color by community
Scale by eigenvector centrality
vcols <- brewer.pal(3, "Accent")
names(vcols) <- unique(V(wikidat.sg)$comms)
V(wikidat.sg)$color <- vcols[as.character(V(wikidat.sg)$comms)]
V(wikidat.sg)$size <- 6 * sapply(V(wikidat.sg)$evcent, my_ecdf_fun, V(wikidat.sg)$evcent)
plot(wikidat.sg, vertex.label = "", edge.arrow.size = 0.25, edge.curved = TRUE)

Communities better portrayed


Additional Sweep

Halve the number of edges
Same technique as before
ebrank <- rank(E(wikidat.sg)$eb, "random")
esyssamp <- seq(from = 1, to = max(ebrank), by = 2)
esyssamp <- which(ebrank %in% esyssamp)
wikidat.sg <- subgraph.edges(wikidat.sg, esyssamp)
plot(wikidat.sg, vertex.label = "", edge.arrow.size = 0.25, edge.curved = .25)


Much more clear


Do the main observations hold across different layouts?

par(mfrow = c(2, 2), mar = c(0.1, 0.1, 0.1, 0.1))
plot(wikidat.sg, vertex.label = "", edge.arrow.size = 0.25, edge.curved = .25, layout = layout.kamada.kawai, margin = c(0, 0, 0, 0))
plot(wikidat.sg, vertex.label = "", edge.arrow.size = 0.25, edge.curved = .25, layout = layout.circle, margin = c(0, 0, 0, 0))
plot(wikidat.sg, vertex.label = "", edge.arrow.size = 0.25, edge.curved = .25, layout = layout.svd, margin = c(0, 0, 0, 0))
plot(wikidat.sg, vertex.label = "", edge.arrow.size = 0.25, edge.curved = .25, layout = layout.mds, margin = c(0, 0, 0, 0))


Color Edges by Community

wikidat.sg.elist <- get.edgelist(wikidat.sg, names = FALSE)
edgecolor <- apply(wikidat.sg.elist, 1, function(x){
    c1 <-V(wikidat.sg)$comms[x[1]]
    c2 <- V(wikidat.sg)$comms[x[2]]
    return(ifelse(c1 == c2, vcols[as.character(c1)], "darkgrey"))
    })
E(wikidat.sg)$color <- edgecolordev.off()
plot(wikidat.sg, vertex.label = "", edge.arrow.size = 0.25, edge.curved = .25, layout = layout.svd, margin = c(0, 0, 0, 0))


Advice


  • Less is more
    • Fewer modifications are better than more
    • Smaller data is easier to represent than larger data
  • Transparency when reporting visualization
  • Informative and intuitive over beautiful
    • Practical over gimmicky
    • Science over art


For more options, see
?igraph.plotting

Graph-Level Indices


Graph-Level Indices

Some custom commands
(May take a few minutes)
source("http://pastebin.com/raw.php?i=SErkr5sg")
Includes functions
  • ei_index
  • transitivity_directed
  • uman_game
  • rewire_dyads

Graph-Level Indices

What's the edgewise reciprocity for Wikipedia administrator voting?

reciprocity(wikidat)

5.6% of all edges are reciprocated

Is this figure big or small?

Bigger or smaller than what?

We need to establish a basis for comparison.

Graph-Level Indices

Affected by network size and density
Example
graph.order <- seq(from = 10, to = 100, by = 5)
graph.trans <- sapply(graph.order, function(x) replicate(1000, transitivity(erdos.renyi.game(x, .1))))
graph.apl <- sapply(graph.order, function(x) replicate(1000, average.path.length(erdos.renyi.game(x, .1))))
par(mfrow = c(1, 2))
plot(x = graph.order, y = apply(graph.trans, 2, mean), ylim = c(0, 1), xlim = c(0, 100), type = "l", xlab = "Order", ylab = "Transitivity", main = "Tie Prob = 0.10; Simulations=1000")
lines(x = graph.order, y = apply(graph.trans, 2, quantile, probs = .025, na.rm = TRUE), col = "red", lty = 2, type = "l")
lines(x = graph.order, y = apply(graph.trans, 2, quantile, probs = .975, na.rm = TRUE), col = "red", lty = 2, type = "l")
plot(x = graph.order, y = apply(graph.apl, 2, mean), ylim = c(0, 5), xlim = c(0, 100), type = "l", xlab = "Order", ylab = "Average Path Length", main = "Tie Prob = 0.10; Simulations = 1000")
lines(x = graph.order, y = apply(graph.apl, 2, quantile, probs = .025, na.rm = TRUE), col = "red", lty = 2, type = "l")
lines(x = graph.order, y = apply(graph.apl, 2, quantile, probs = .975, na.rm = TRUE), col = "red", lty = 2, type = "l")


Graph-Level Indices

Affected by network size and density
graph.tieprob <- seq(from = .01, to = .30, by = .01)graph.trans <- sapply(graph.tieprob, function(x) replicate(1000, transitivity(erdos.renyi.game(100, x))))graph.apl <- sapply(graph.tieprob, function(x) replicate(1000, average.path.length(erdos.renyi.game(100, x))))par(mfrow = c(1, 2))plot(x = graph.tieprob, y = apply(graph.trans, 2, mean), ylim = c(0, 1), xlim = c(0, .3), type = "l", xlab = "Tie Probability", ylab = "Transitivity", main = "Order = 100; Simulations = 1000")lines(x = graph.tieprob, y = apply(graph.trans, 2, quantile, probs = .025, na.rm = TRUE), col = "red", lty = 2, type = "l")lines(x = graph.tieprob, y = apply(graph.trans, 2, quantile, probs = .975, na.rm = TRUE), col = "red", lty = 2, type = "l")plot(x = graph.tieprob, y = apply(graph.apl, 2, mean), ylim = c(0,8), xlim = c(0,.3), type = "l", xlab = "Tie Probability", ylab = "Average Path Length", main = "Order=100; Simulations = 1000")lines(x = graph.tieprob, y = apply(graph.apl, 2, quantile, probs = .025, na.rm = TRUE), col = "red", lty = 2, type = "l")lines(x = graph.tieprob, y = apply(graph.apl, 2, quantile, probs = .975, na.rm = TRUE), col = "red", lty = 2, type = "l")


Random Graph Comparisons

Conditional Uniform Graph (CUG) test
  • Measure observed statistic on a graph
  • Generate many random graphs (~1000)
    • Take the same measurement from each
  • Compare the observed statistic to the random ones


Two families
  • Reassignment methods
  • Graph generators

Reassignment Methods

"Shuffling" ties and attributes

  • Permutation tests
  • Rewiring tests

Permutation tests

  • Permute an attribute to test its relationship to a nodal property
  • Permute a graph to test its relationship to another graph
# Do conservative blogs have more links than liberal ones?
polblogs <- nexus.get(id = "polblogs") #0 = liberal, 1 = conservative

polblogs.totdeg <- degree(polblogs, mode = "all")
obs.diff <- mean(polblogs.totdeg[V(polblogs)$LeftRight == 1]) - mean(polblogs.totdeg[V(polblogs)$LeftRight == 0]) 

perm.fun <- function(){
  perm.attribs <- sample(V(polblogs)$LeftRight)
  return(mean(polblogs.totdeg[perm.attribs == 1]) - mean(polblogs.totdeg[perm.attribs == 0]))
  }; perm.fun<-cmpfun(perm.fun)
exp.diffs <- replicate(1000, perm.fun())
mean(obs.diff > exp.diffs) # ~0.86, not significantly greater than by chance assignment
"Why not just use a t-test?"

Permutation tests

E-I Index (Krackhardt and Robert Stern 1988)
(Outgroup Ties - Ingroup Ties) / Total Ties 
obs.val <- ei_index(polblogs, V(polblogs)$LeftRight)
exp.val <- replicate(1000, ei_index(polblogs, sample(V(polblogs)$LeftRight)))
mean(obs.val < exp.val) # We expect fewer outgroup ties

Permutation Tests

EI Index = (Outgroup Ties - Ingroup Ties) / Total Ties 
# Friendship homophily by discipline among researchers
# Time 1 obs.val.t1 <- ei_index(eies.closefriends$t1, V(eies.closefriends$t1)$Discipline_Name) exp.val.t1 <- replicate(1000, ei_index(eies.closefriends$t1, sample(V(eies.closefriends$t1)$Discipline_Name)))
# Time 2 obs.val.t2 <- ei_index(eies.closefriends$t2, V(eies.closefriends$t2)$Discipline_Name) exp.val.t2 <- replicate(1000, ei_index(eies.closefriends$t2, sample(V(eies.closefriends$t2)$Discipline_Name)))
# Compare the homophily effect & its non-random probability c(t1 = obs.val.t1, t1.p = mean(obs.val.t1 < exp.val.t1), t2 = obs.val.t2, mean(obs.val.t2 < exp.val.t2))

Rewiring tests

One of two approaches
  • Shuffle edges (or dyads)
  • Shuffle edge weights
Preserves degree (and edge weight) distribution
karate <- nexus.get("karate")
karate.lo <- layout.kamada.kawai(karate, params = list(niter = 1000))

par(mfrow = c(1, 2))
plot(karate, vertex.size = 5, vertex.label = V(karate)$name, vertex.label.family = "sans", margin = c(0, 0, 0, 0), edge.curved = .33, layout = karate.lo)

karate.rw <- rewire(karate, mode = "simple", niter = 10*ecount(karate))
plot(karate.rw, vertex.size = 5, vertex.label = V(karate)$name, vertex.label.family = "sans", margin = c(0, 0, 0, 0), edge.curved = .33, layout = karate.lo)

Original (left) and Rewired (right)


Rewired Transitivity

obs.trans <- transitivity(karate)karate.rw.fun <- function(){  krw <- rewire(karate, mode = "simple", niter = 10 * ecount(karate))  return(transitivity(krw))  }; karate.rw.fun <- cmpfun(karate.rw.fun)exp.trans <- replicate(1000, karate.rw.fun())mean(obs.trans > exp.trans) # ~0.87
What's the interpretation of this finding?

Rewired Transitivity

Is the transitivity in this directed network greater than chance?
Did it increase from the first to second time point?
Rewired dyads control for reciprocity and degree distribution
n.researchers <- vcount(eies$t1)
n.rewires <- 10 * n.researchers * (n.researchers - 1)
obs.val.t1 <- transitivity_directed(eies$t1)
obs.val.t2 <- transitivity_directed(eies$t2)

# Will take some time.
# Would recommend at least 1000, not 250, replications
exp.val.t1 <- replicate(250, transitivity_directed(rewire_dyads(eies$t1, n.rewires)))
exp.val.t2 <- replicate(250, transitivity_directed(rewire_dyads(eies$t2, n.rewires)))

c(t1 = obs.val.t1, t1.p = mean(obs.val.t1 > exp.val.t1), t2 = obs.val.t2, t2.p = mean(obs.val.t2 > exp.val.t2))

Graph Generators

  1. Erdős–Rényi model
  2. Power Law model
  3. Watts and Strogatz model
  4. Exponential random graph models
  5. ...and many others

Erdős–Rényi model

A fixed number (or proportion) of edges are randomly assigned to a fixed set of nodes.
Variants
  • Exact #ties or density
  • Directed, undirected
  • Dyad census-conditioned
er.graph <- erdos.renyi.game(100, .03)
plot(er.graph, vertex.size = 5, vertex.label = "")
obs.incent <- centralization.degree(wikidat, mode = "in", loops = FALSE)$centralization
exp.incent.fun <- function(){
  new.net <- erdos.renyi.game(vcount(wikidat), ecount(wikidat), type = "gnm", directed = TRUE)
  return(centralization.degree(new.net, mode = "in", loops = FALSE)$centralization)
  }; exp.incent.fun <- cmpfun(exp.incent.fun)
exp.incent <- replicate(1000, exp.incent.fun())
mean(obs.incent > exp.incent)

Erdős–Rényi model


obs.incent <- centralization.degree(wikidat, mode = "in", loops = FALSE)$centralization
exp.incent.fun.ec <- function(){
  new.net <- erdos.renyi.game(vcount(wikidat), ecount(wikidat), type = "gnm", directed = TRUE)
  return(centralization.degree(new.net, mode = "in", loops = FALSE)$centralization)
  }; exp.incent.fun.ec <- cmpfun(exp.incent.fun.ec)
exp.incent.ec <- replicate(500, exp.incent.fun.ec())
c(obs = obs.incent, exp.ec = mean(obs.incent > exp.incent.ec))

# An alternative model that will take too much time # Set sparse = TRUE if you don't have much RAM exp.incent.fun.dc <- function(){ dc <- unlist(dyad.census(wikidat)) new.net <- uman_game(vcount(wikidat), dc, sparse = FALSE) return(centralization.degree(new.net, mode = "in", loops = FALSE)$centralization) }; exp.incent.fun.dc <- cmpfun(exp.incent.fun.dc) exp.incent.dc <- replicate(500, exp.incent.fun.dc())

Erdős–Rényi model


Power Law model

Assumptions
Networks grow according to the principle of preferential attachment.






P(X=x) ~ x^(-alpha)
Nodes are of degree greater than or equal to x
P(X=x) is the probability of observing a node with degree x or greater
alpha is the scalar
(Barabási and Albert 1999)

Power Law model


wikidat.in <- degree(wikidat, mode = "in")
wikidat.out <- degree(wikidat, mode = "out")
wikidat.plfit.in <- power.law.fit(wikidat.in)
wikidat.plfit.out <- power.law.fit(wikidat.out)
print(c(wikidat.plfit.in$alpha, wikidat.plfit.out$alpha))
exp.incent.pl.fun <- function(){
  new.net <- static.power.law.game(vcount(wikidat), ecount(wikidat), exponent.out = wikidat.plfit.out$alpha, exponent.in = wikidat.plfit.in$alpha)
  return(centralization.degree(new.net, mode="in", loops = FALSE)$centralization)
  }; exp.incent.pl.fun <- cmpfun(exp.incent.pl.fun)
exp.incent.pl <- replicate(500, exp.incent.pl.fun())
mean(obs.incent > exp.incent.pl) # Obs more centralized
par(mfrow = c(1, 2))
hist(exp.incent.ec, main = "E-R Indegree Centralization")
hist(exp.incent.pl, main = "Power Law Indegree Centralization") # ~10x more centralized graphs


Watts and Strogatz model

Assumptions
Local neighborhoods bridged by a few actors with ties that jump across neighborhoods.
Game
  • Begin with a lattice
    • Each actor initially has the same degree, neighborhood index (nei) * 2
  • Rewire with a probability of p  
par(mfrow=c(1, 2))
g1 <- graph.lattice(100, nei = 1)# nei = mean degree / 2
g1.layout<-layout.fruchterman.reingold(g1)
V(g1)$x <- g1.layout[ , 1]; V(g1)$y <- g1.layout[ , 2]
g1.rewire <- rewire.edges(g1, p = 0.1) # prob of .1 rewire
plot(g1, vertex.size = 5, vertex.label = "", main = "Lattice")
plot(g1.rewire, vertex.size = 5, vertex.label = "", main = "Rewire")

Watts and Strogatz model


Watts and Strogatz model

Question
What's the relationship between the neighborhood index (nei) and graph density?

Watts and Strogatz model

Low values for p resemble lattices
#The following two plots should be approximately equal
dev.off(); par(mfrow = c(1, 2))
sw.net <- watts.strogatz.game(dim = 1, size = 100, nei = 1, p = .1)
plot(g1.rewire, vertex.size = 5, vertex.label = "", main = "Rewired Lattice")
plot(sw.net, vertex.size = 5, vertex.label = "", layout = g1.layout, main = "Watts-Strogatz")
High values for p become closer to Erdős–Rényi graphs
par(mfrow = c(1, 2))
g2.ws <- watts.strogatz.game(dim = 1, size = 100, nei = 1, p = .9)
g2.er <- erdos.renyi.game(100, (2 * 100 / 2) / (100 * 99 / 2), directed = FALSE)
plot(g2.ws, vertex.size = 5, vertex.label = "", main = "Watts-Strogatz")
plot(g2.er, vertex.size = 5, vertex.label = "", main = "Erdős–Rényi")

Rewiring Demonstration 

Exponential random graph models

Exercises

Recreate the first two plots in this lab using the Watts and Strogatz model.  Fix the order (size) to 100 and set the density to approximately 0.02.  Vary p from 0 to 1 by .01.
(Hint: mean degree equals the number of edges divided by the number of nodes.)

Recreate the power law exercise on degree centralization for the political blog network.  If appropriate to model as a power law, is this network more centralized than predicted?

Two-Mode Networks


Two-Mode Networks are Special


  • Clustering follows a different mechanism
  • Distance can take on a different meaning
    • Depends upon conceptualization of tie strength


Traditionally "projected" as a one-mode network.
  • Theoretical problems
    • Assumes an "event" creates ties among all participating actors
      • Do you always know everyone at each party you attend?
      • Some actors and events are much more inclusive
        • Should their relationships hold the same weight?
  • Methodological problems
    • Creates a complete subgraph ("clique") around each event
    • Increases density

Two-Mode Network Software

  • Limited functions in igraph (R, Python, C, Ruby)
  • More capabilities in tnet (R)
  • NetworkX (Python)

library(igraph)
library(tnet)
library(compiler)
marvnet <- read.graph("http://bioinfo.uib.es/~joemiro/marvel/porgat.txt", format = "pajek")
V(marvnet)$name <- V(marvnet)$id
Convert it to tnet
marvnet.tnet <- get.edgelist(marvnet, names = FALSE)
marvnet.tnet <- as.tnet(marvnet.tnet, type = "binary two-mode tnet")

Two-Mode Network Software

  • Limited functions in igraph (R, Python, C, Ruby)
  • More capabilities in tnet (R)  and NetworkX (Python)  
download.file("https://raw.githubusercontent.com/kjhealy/revere/master/data/PaulRevereAppD.csv", "PaulRevere.csv", method = "wget")
amrev <- read.csv("PaulRevere.csv", row.names = 1)
file.remove("PaulRevere.csv")
amrev.el <- which(amrev == 1, arr.ind = TRUE)
amrev.el[, 1] <- rownames(amrev)[as.numeric(amrev.el[, 1])]
amrev.el[, 2] <- colnames(amrev)[as.numeric(amrev.el[, 2])]
amrev <- graph.data.frame(amrev.el, directed = FALSE)
V(amrev)$type <- V(amrev)$name %in% amrev.el[, 2]
Convert it to tnet
amrev.tnet <- get.edgelist(amrev, names = FALSE)
amrev.tnet <- as.tnet(amrev.tnet, type = "binary two-mode tnet")

Clustering Coefficient

Closed Three-Paths (Robins and Alexander [2004])
ra_clust <- function(g){
    g.motif <- graph.motifs(g, size = 4)
    threepaths <- g.motif[7]
    fourcycles <- 4 * g.motif[9]
    ret.val <- fourcycles / (fourcycles + threepaths)
    return(ret.val)
    }
ra_clust <- cmpfun(ra_clust)

amrev.raclust <- ra_clust(amrev)
amrev.raclust
Closed Four-Paths with Three Vertices in Primary Mode (Opsahl 2012)
# Much slower, but faster version available (email Opsahl)
amrev.oclust <- clustering_tm(amrev.tnet)
amrev.oclust

Local Clustering Coefficient

Robins and Alexander (2004)
local_ra_clust <- function(g, vind){
    new.g <- graph.neighborhood(g, order = 2, nodes = vind)[[1]]
    new.g.motif <- graph.motifs(new.g, size = 4)
    threepaths <- new.g.motif[7]
    fourcycles <- 4 * new.g.motif[9]
    retval <- ifelse(fourcycles + threepaths > 0, fourcycles / (fourcycles + threepaths), NaN)     return(retval)
    }
local_ra_clust <- cmpfun(local_ra_clust)
amrev.raclust.loc <- sapply(1:vcount(amrev), local_ra_clust, g = amrev)
Opsahl (2012)
(Took my computer ~4 minutes)
amrev.oclust.loc <- clustering_local_tm(amrev.tnet)

Local Clustering Coefficient

Robins and Alexander (2004)
Whose Ego Networks Had the Highest Clustering?
# Identify both the actors and those actors with valid clustering coefficients
amrev.raclust.loc.actors.ind <- which(is.nan(amrev.raclust.loc[which(V(amrev)$type == FALSE)]) == FALSE)
amrev.raclust.loc.actors <- data.frame(name = V(amrev)$name[amrev.raclust.loc.actors.ind])
amrev.raclust.loc.actors$clust <- amrev.raclust.loc[amrev.raclust.loc.actors.ind]
amrev.raclust.loc.actors$rank <- rank(amrev.raclust.loc.actors$clust)
amrev.raclust.loc.actors[order(amrev.raclust.loc.actors$clust, amrev.raclust.loc.actors$name), ]

Randomization

Two main options
  • Random network generation
  • Edge rewires

# Generate a random network with the same parameters as the Marvel network
n.primary <- sum(V(marvnet)$type == FALSE)
n.secondary <- vcount(marvnet) - n.primary
marvnet.rand.el <- rg_tm(ni = n.primary, np = n.secondary, ties = ecount(marvnet))
head(marvnet.rand.el) # It's an edgelist.
# Export it to igraph
summary(marvnet.rand.el)
# Because both the first and second mode begin at 1 ...
marvnet.rand.el[, "p"] <- marvnet.rand.el[, "p"] + n.primary
marvnet.rand.type <- c(rep(FALSE, times = n.primary), rep(TRUE, times = n.secondary))
marvnet.rand.atts <- data.frame(name = 1:vcount(marvnet), type = marvnet.rand.type)
marvnet.rand <- graph.data.frame(marvnet.rand.el, directed = FALSE, vertices = marvnet.rand.atts)
Also see ?bipartite.random.game in igraph package

Randomization

Let's compare the degree distributions. First mode
par(mfrow = c(2, 2))
tempdegree <- degree(marvnet, v = which(V(marvnet)$type == FALSE))
hist(log(1 + tempdegree), main = "Observed", xlab = "Issues in Which Each Character Appears (log)", ylab = "Frequency")
tempdegree <- degree(marvnet.rand, v = which(V(marvnet)$type == FALSE))
hist(log(1 + tempdegree), main = "Random", xlab = "Issues in Which Each Character Appears (log)", ylab = "Frequency")
Second mode
tempdegree <- degree(marvnet, v = which(V(marvnet)$type == TRUE))
hist(log(1 + tempdegree), main = "Observed", xlab = "Characters Appearing in Each Issue (log)", ylab = "Frequency")
tempdegree <- degree(marvnet.rand, v = which(V(marvnet)$type == TRUE))
hist(log(1 + tempdegree), main = "Random", xlab = "Characters Appearing in Each Issue (log)", ylab = "Frequency")

Randomization

Random graph generator
  • Good for small, dense networks
  • Highly general
  • Controls for size and density
  • Does not control for degree distribution

Rewiring method

# tnet method.
# Do not use now, as it will take A VERY LONG TIME.
# Also allows for weighted two-mode data and can shuffle weights
amrev.rand.rw <- rg_reshuffling_tm(amrev.tnet) # Reshuffles each edge

Randomization

Rewiring method
# igraph method. Inadviseable for large networks# First, recreate the bipartite network, but treat it as directed to preserves the bipartite structure
# Second, rewire the directed two-mode network
# Third, remove the direction
rewire_tm <- function(g, niter = NULL){ g.el <- get.edgelist(g) new.g <- graph.data.frame(g.el, directed = TRUE) V(new.g)$type <- V(g)$type nswaps <- ifelse(is.null(niter), 10 * ecount(g), niter) new.g <- rewire(new.g, mode = "simple", nswaps) new.g <- as.undirected(new.g) return(new.g) } rewire_tm <- cmpfun(rewire_tm)
amrev.raclust.exp <- replicate(1000, ra_clust(rewire_tm(amrev)))
summary(amrev.raclust.exp)
mean(amrev.raclust > amrev.raclust.exp)
Edge list must be consistently partitioned for this function to work!

Graph-Level Indices

When measuring graph-level indices on a projected network....
  1. Project observed two-mode data to one-mode
  2. Take graph-level measurement
  3. Simulate
    1. Simulate two-mode network from observed two-mode networks
    2. Project simulated two-mode to a one-mode network
    3. Take graph-level measurement on simulated one-mode network
    4. Repeat to create distribution
  4. Compare observed measurement to simulated distribution


DO NOT SIMULATE BASED ON ONE-MODE PROJECTION

Projection

Four methods
  • Dichotomous (Brieger. 1974. Social Forces.)
  • Sum (Brieger. 1974. Social Forces.)
    • Weighted
    • k threshold cut-off
  • Newman. 2001. Phys Rev E.
    • Collaboration weight penalized by number of collaborators
  • Backbone
    • Convert vertex pairs to z-scores using frequency counts
      • Set alpha cut-off
      • Retain edges beyond cut-off (+/-)
    • Ahn,  Ahnert, Bagrow, & Barabási. 2011. Scientific Reports.
    • Serrano,  Boguñá, & Vespignani. 2009 PNAS.
    • Demonstration: http://pastebin.com/UNi0ryG3

Projection

Dichotomous
# tnet
marv.dichot <- projecting_tm(marvnet.tnet, method = "binary") # Takes me ~1.5 minutes
head(marv.dichot, 15) # Performed on "primary" mode only
# igraph.
# Set 'which' to "true" for the other mode
# Set it to "both" for a list of both projections
marv.dichot <- bipartite.projection(marvnet, multiplicity = FALSE, which = "false")

Projection

Sum
# tnet
marv.sum <- projecting_tm(marvnet.tnet, method = "sum") # Takes me ~1.5 minutes
head(marv.sum, 15) # Performed on "primary" mode only
# k-threshold >= 12
marv.k12 <- marv.sum[which(marv.sum$w >= 12), ]
# igraph
marv.sum <- bipartite.projection(marvnet, multiplicity = TRUE, which = "false")
head(E(marv.sum)$weight, 20)
# k-threshold >= 12
marv.k12 <- subgraph.edges(marv.sum, which(E(marv.sum)$weight >= 12), delete.vertices = FALSE)

Projection

Newman (2001)
newman.2m.fun <- function(g, type = FALSE){
    vdeg <- degree(g)
    vneigh <- neighborhood(g, order = 1)
    mode.which <- ifelse(type, "true", "false")
    proj1m <- bipartite.projection(g, multiplicity = TRUE, which = mode.which)
    vmatch <- match(V(proj1m)$name, V(g)$name)
    proj1m.el <- get.edgelist(proj1m, names = FALSE)
    ncollabs <- function(x){
        orig.i <- vmatch[x[1]]
        orig.j <- vmatch[x[2]]
        art.k <- vneigh[[orig.i]][which(vneigh[[orig.i]] %in% vneigh[[orig.j]])]
        art.degs <- sum(1 / (vdeg[art.k] - 1))
        return(art.degs)
        }
    E(proj1m)$newman.w <- apply(proj1m.el, 1, ncollabs)
    return(proj1m)
    }
newman.2m.fun <- cmpfun(newman.2m.fun)

marv.newman <- newman.2m.fun(marvnet) # Took me ~30 seconds
head(cbind(E(marv.newman)$weight, E(marv.newman)$newman.w), 20)

Projection

Meanings
  • Dichotomous
    • "Any commonality"
  • Sum
    • Weighted
      • "Number of commonalities"
    • k threshold cut-off
      • "Any commonality beyond k"
  • Newman (2001)
    • "Probability of dyadic interaction"
  • Backbone
    • "Dyadic interaction far above (or below) 'chance'"
    • "Chance" based upon the level of an individual's activity

Distance

Weights affect the interpretation of distance
  • Dichotomous
    • Simple: number of collaborations away
  • Weighted (including Newman)
    • Distance between adjacent vertices is 1 / edge weight
    • Average path length may be less than 1!
    • Normalization (Opsahl, Agneessens, and Skvoretz 2010)
      • Divide the weights by the mean of all weights
      • Then calculate distance

Distance

# Dichotomous
shortest.paths(marv.newman, v = 1, to = 2, weights = rep(1, ecount(marv.newman))) # 3 # Summed weights shortest.paths(marv.newman, v = 1, to = 2, weights = E(marv.newman)$weight) # 4 shortest.paths(marv.newman, 1, 2, weight = E(marv.newman)$weight / mean(E(marv.newman)$weight)) # 1.18 # Newmanedges2keep <- which(is.infinite(E(marv.newman)$newman.w) == FALSE) marv.newman <- subgraph.edges(marv.newman, edges2keep, delete.vertices = FALSE) shortest.paths(marv.newman, v = 1, to = 2, weights = E(marv.newman)$newman.w) # 0.38 shortest.paths(marv.newman, v = 1, to = 2, weights = E(marv.newman)$newman.w / mean(E(marv.newman)$newman.w)) # 1.33

Plotting

  • Symbols
    • Primary mode: circular nodes
    • Secondary modes: square nodes
  • Layout
    • Bipartite
      • Primary mode to the left
      • Secondary mode to the right
    • Multidimensional layouts appropriate also
    • Projected as one-mode
      • Any of the conventional layouts

Plotting

Marvel Universe Data
My inclinations
  • Character co-appearance
    • Project to one-mode
  • Subsample
    • Popular characters (two-mode degree)
    • Strong relationships (Newman's weight)
    • Remove isolates
  • Clean the names
  • Color by cohesion
  • Scale vertices by closeness
  • Scale edge widths by Newman's weight

Plotting

issues48 <- which((degree(marvnet) >= 48) & (V(marvnet)$type == FALSE))
issues48 <- which(V(marv.newman)$name %in% V(marvnet)$name[issues48])
marv.newman.sg <- induced.subgraph(marv.newman, issues48)
marv.newman.sg <- subgraph.edges(marv.newman.sg, which(E(marv.newman.sg)$newman.w > 4), delete.vertices = TRUE)
cleanupnames <- function(y){
    y <- strsplit(y, "/", fixed = TRUE)[[1]][1]
    y <- strsplit(y, " [", fixed = TRUE)[[1]][1]
    y <- gsub(".", "", y, fixed = TRUE)
    ylen <- nchar(y)
    lastchar <- substr(y, ylen, ylen)
    y <- ifelse(lastchar == " ", substr(y, 1, ylen - 1), y)
    return(y)
    }
cleanupnames <- cmpfun(cleanupnames)
V(marv.newman.sg)$name <- sapply(V(marv.newman.sg)$name, cleanupnames)

Plotting

library(RColorBrewer)
# Community detection
V(marv.newman.sg)$comm <- multilevel.community(marv.newman.sg, weights = E(marv.newman.sg)$weight)$membership
max(V(marv.newman.sg)$comm) # 25 is too many communities to portray
# k-cores
V(marv.newman.sg)$kc <- graph.coreness(marv.newman.sg)
kcmax <- max(V(marv.newman.sg)$kc)
kcmax # 10 on a continuous scale is good.
min(V(marv.newman.sg)$kc) # Because there are no isolates, minimum is 1.
corecolors <- brewer.pal(max(V(marv.newman.sg)$kc), "Reds")
corecolors <- c(rep("white", times = kcmax - 9), corecolors)
V(marv.newman.sg)$color <- corecolors[V(marv.newman.sg)$kc]


Plotting

V(marv.newman.sg)$closeness <- closeness(marv.newman.sg, weights = E(marv.newman.sg)$weight / mean(E(marv.newman.sg)$weight), normalize = FALSE)
V(marv.newman.sg)$size <- ecdf(V(marv.newman.sg)$closeness)(V(marv.newman.sg)$closeness)
V(marv.newman.sg)$size <- V(marv.newman.sg)$size * 6
V(marv.newman.sg)$label.cex <- V(marv.newman.sg)$size / (4.5 * 2)
E(marv.newman.sg)$width <- ecdf(E(marv.newman.sg)$weight)(E(marv.newman.sg)$weight)
g.lo <- layout.fruchterman.reingold(marv.newman.sg, params = list(niter = 2500, area = vcount(marv.newman.sg)^3))
V(marv.newman.sg)$x <- g.lo[ ,1]
V(marv.newman.sg)$y <- g.lo[ ,2]
svg("MarvPlot.svg", height = 20, width = 20)
plot(marv.newman.sg, vertex.label.color = "black", vertex.label.family = "sans", margin = c(0, 0, 0, 0))
dev.off()
Try opening that .svg file in your web browser!


THANK YOU!


Benjamin Lind
blind@hse.ru



PDF:


R SNA Intro Lab St. Petersburg

By Benjamin Lind

R SNA Intro Lab St. Petersburg

  • 6,223