Introduction to Software: R
International Summer School on Social Network Analysis
St. Petersburg, 2014
Benjamin Lind, PhD
Why Program?
- Flexibility
- Data
- Convert formats
- Customized cleaning
- New methods
- Slow GUI development
- Open source community
- Can write your own functions
- Replication
- Personal record keeping
- Collaboration
- Transparency
- Awareness
About R
Implements the S programming language
Open source
Object oriented programming language
Base & user-contributed packages
Meets practically all statistical purposes
Availability:
r-project.org
Why Program in R?
- Open source
- Community
- Large and growing
- Made by statisticians
- Wealth of packages (i.e., libraries)
- Features
- Documentation
- Moderately easy language
- Graphics
Why Social Networks in R?
- Shared platform
- Networks
- Statistics
- Data management
- Graphics
- Active network software development
- Network metrics
- igraph
- sna
- tnet
- Network modeling
- statnet
- RSiena
GeneRal Introduction
Start R!
Type "R" into your terminal
Basics
- Command line-only interface
- Reliable, Replicable, Records
- GUI options do exist, but...
- Outside of scope for our purposes
- Assignment. Functionally equivalent methods
- x <- 1 # Preference: primary method
- x = 1 # Preference: within functions
- 1 -> x # Preference: never
- Once assigned, x becomes an "object"
- ls()
Case Sensitivity
Try
Cheburashka <- "a cartoon character"
Cheburashka
cheburashka
What does the error mean?
Hard Drive Navigation
- R runs within a folder on your hard drive
- Location where R reads & writes
- Which folder are you in now?
orig.folder <- getwd()
orig.folder
dir.create("new.folder")
list.files()
setwd("new.folder")
list.files()
setwd(orig.folder)
file.remove("new.folder") # if this fails on your platform, try unlink("new.folder", recursive = TRUE)
list.files()
Packages
Image courtesy of Kliponius
Packages
- Preloaded (e.g., base, stats)
- Dependencies
- User-contributed
- Social network analysis
- statnet
- igraph
- tnet
- RSiena
- ...
- Graphics
- Advanced regression models
- Web scraping
- Spatial
- ...
Packages
To install a package
install.packages("igraph")
To load a package
library(igraph)
To unload a package
detach("package:igraph", unload=TRUE)
Missing Data
Typically coded as "NA". For example
x <- NA
Missing codes
NA # Non-applicable, missing
NULL # Undefined or empty value
NaN # Not a Number
Inf # Infinite
Indicated by functions
is.na(x)
is.null(x)
is.nan(x)
is.infinite(x)
Data Types
Storing and accessing information within R
Picture by KENPEI
Data Types
- Elements
- Numeric
- Character
- Use double quotes
- Logical
- TRUE, FALSE, T, F, 1, 0
- Objects
- Vectors
- Matrices
- Arrays
- Data frames
- Lists
- Closures
Elements
Identification
is.numeric(42)
is.numeric("42")
is.numeric(3.12)
is.logical(T)
is.logical(FALSE)
is.character(FALSE)
is.character("42")
is.character("HSE Network Workshop")
Conversion
as.numeric("42") + 1
as.character(as.numeric("0") + 1)
as.character(!as.logical("1"))
as.character(as.logical(as.numeric("0") + 1))
as.character(is.logical(as.logical(as.numeric("0")) + 1))
as.character(!as.logical(as.numeric("1")))
Object Types: Vectors
Vectors collect elements in one dimension
(Photo by Frank C. Müller)
Object Types: Vectors
Vectors collect elements in one dimension
RU.cit <- c("Mos", "StP", "Nov", "Yek") # Create vector
RU.cit <- c(RU.cit, "Niz") # Add element
RU.cit[3] # Retrieve element 3

RU.pops <- c(10.4, 4.6, 1.4, 1.3) # Create vector
names(RU.pops) <- RU.cit[c(1, 2, 3, 4)] # Name vector
RU.pops[c(4, 2, 3, 1)] # Permute the order and retrieve
length(RU.pops) # Retrieve the length

RU.vis <- c(TRUE, TRUE, FALSE, FALSE) # Create vector
RU.vis[-c(4, 2)] # Remove elements 4 and 2
Remaining object types build upon this logic
Object Types: Vectors
Sequences
all(1:5 == c(1, 2, 3, 4, 5))
all(1:5 == seq(from = 1, to = 5, by = 1))
Repetitions
rep(1, times = 5)
rep(NA, times = 4)
rep("Beetlejuice", times = 3)
Object Types: Matrices and Arrays
Picture by David.Asch
Object Types: Matrices & Arrays
Matrices
- Two dimensions
- nrow(); ncol()
- Assignment
- x[1, 5] <- 1; x[2, ] <- x[1, 5] > 0
- More columns, rows
- cbind(); rbind()
- Names: rownames(); colnames()
Arrays
- Three dimensions
- dim(); length(x[1, 1, ])
- Assignment
- x[3, 2, 6] <- 1; x[4, , 2] <- NA
- Names: dimnames()
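A brief worked example of the array functions above; the dimension names are illustrative.
arr <- array(1:12, dim = c(2, 3, 2)) # a 2 x 3 x 2 array
dim(arr) # 2 3 2
length(arr[1, 1, ]) # 2
arr[1, 2, 2] <- NA # assignment works as with matrices
dimnames(arr) <- list(c("r1", "r2"), c("c1", "c2", "c3"), c("t1", "t2"))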
Object Types: Matrices
Turning vectors into matrices
RU.mat <- cbind(pop = RU.pops, vis = RU.vis)
rownames(RU.mat) <- RU.cit[c(1:4)]
colnames(RU.mat)
Alternatively
RU.mat <- matrix( , nrow = length(RU.pops), ncol = 2)
colnames(RU.mat) <- c("pop", "vis")
rownames(RU.mat) <- names(RU.pops)
RU.mat[ , "pop"] <- RU.pops
RU.mat[ , "vis"] <- RU.vis
Try
RU.mat[1, ]
RU.mat[ , "pop"]
RU.mat[1, 2]
RU.mat[1, , ] # What does the error mean?
Object Types: Data Frames
Data frames are like matrices, but allow multiple element types. Compare:
RU.mat[ , "vis"]
RU.vis
To retain initial element type
RU.df <- as.data.frame(RU.mat)
RU.df$vis <- as.logical(RU.df$vis)
These commands are equivalent statements
RU.df$vis[2]
RU.df[2, "vis"]
RU.df[["vis"]][2]
RU.df[[2]][2]
Object Types: Lists
Lists collect objects of (potentially) varying types & dimensions
RU.list <- list(RU.mat, RU.vis, RU.df)
names(RU.list) <- c("Mat", "Vec", "DF")
These commands are equivalent statements
RU.list$DF$vis[2]
RU.list[["DF"]]$vis[2]
RU.list[[3]]$vis[2]
RU.list[["DF"]][["vis"]][2]
RU.list[["DF"]][2, "vis"]
RU.list[["DF"]][2, 2]
Object Types: Closures
Closures are functions
Functions accept input (optional) & return output
Photo by erix!
Object Types: Closures
Closures are functions
Functions accept input (optional) & return output
xTimesxMinus1 <- function(x)
  return(x * (x - 1))
xTimesxMinus1(42)
InvHypSineTrans <- function(x){
x.ret <- log(x + sqrt(x^2 + 1))
return(x.ret)
}
plot(x = 0:1000, InvHypSineTrans(0:1000), main = "Hyperbolic Sine Transformation", xlab = "x", ylab = "y")
Compiled functions
library(compiler)
InvHypSineTrans.cmp <- cmpfun(InvHypSineTrans)
system.time(InvHypSineTrans(0:10000000))
system.time(InvHypSineTrans.cmp(0:10000000))
Help
How to overcome the inevitable hurdles
Picture from Richard Milnes
Help
If you're looking for a function, but don't know the name
??"Chi Squared"
help.search("Chi Squared")
Once you know the name of the function
?chisq.test
Want to know how the function is written?
chisq.test
Still stumped?
- Google: <expression to search> ~rstats
- http://rseek.org/
- http://stackoverflow.com/questions/tagged/r
Help
Exercise
- Find the random normal distribution function
- Use it to create two vectors of length 100
- One with a mean of zero & standard deviation of one
- Assign it x
- A second with a mean of 50 & standard deviation of 25
- Assign it y
- Find the linear regression model command
- Regress y on x
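One possible solution sketch; your search terms and object names may differ.
??"normal distribution" # leads to rnorm()
x <- rnorm(100, mean = 0, sd = 1)
y <- rnorm(100, mean = 50, sd = 25)
??"linear model" # leads to lm()
lm(y ~ x)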
Apply Family
Photo by · · · — — — · · ·
Apply Family
Apply administers a function across elements within an object
Better than for() loops in R
lab.seq <- seq(from = -100, to = 100, by = 20)
?sapply
lab.mat <- sapply(lab.seq, rnorm, n = 100)
dim(lab.mat)
fix(lab.mat)
?apply
lab.means <- apply(lab.mat, 2, mean)
plot(x = lab.seq, y = lab.means)
Apply Family: Exercise
Recreate the last exercise, but with the following modifications
- Use a sequence from 1 to 100 by 5
- Run the sapply() line, but...
- Set the mean to zero
- Apply the sequence to rnorm()'s standard deviation
- Run the apply() line, but...
- Retrieve the standard deviation instead of the mean
- Plot your results and show them to me
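One possible solution sketch, following the same pattern as the previous slide.
lab.seq <- seq(from = 1, to = 100, by = 5)
lab.mat <- sapply(lab.seq, rnorm, n = 100, mean = 0) # the sequence feeds rnorm()'s sd
lab.sds <- apply(lab.mat, 2, sd)
plot(x = lab.seq, y = lab.sds)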
Loading and Saving
Reusing material between R sessions
Photo by mRio
Loading & Saving: Source
Text files for executable R code
Like 'do files' for Stata
Write them in text editors, following R syntax
In text editor, type
test.val <- "It works!"
test.val2 <- paste(test.val, "Really")
Save as "test.val.source.txt" in working directory
In R, enter
source("test.val.source.txt")
ls()
print(test.val)
print(test.val2)
Loading & Saving
- History
- RHistory files retain a log of your session
- May be opened in a text editor
- Should end in ".RHistory" extension
- Usage
savehistory("Filename.RHistory")
loadhistory("Filename.RHistory")
- Images
- Saves every object in the environment (i.e., ls())
- Usage
save.image("Filename.RData")
load("Filename.RData")
- dput(), dget()
- Saves and retrieves individual objects
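A small example of dput() and dget(), reusing the RU.df data frame from earlier; the file name is illustrative.
dput(RU.df, file = "RU_df.txt") # write one object as parseable R code
RU.df.copy <- dget("RU_df.txt") # read it back
identical(RU.df, RU.df.copy) # should be TRUE
file.remove("RU_df.txt")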
Network Analysis Packages in R
igraph
igraph
Originated as a fork of the sna package
Libraries in C/C++, Python, and R
Created by Gábor Csárdi
Motivations and Strengths
- Network science tradition
- General and complex network metrics
- Data storage
- Speed and efficiency
statnet
Suite collecting the packages
- sna
- network
- ergm, tergm
- ...and others
Interdisciplinary, cross-university development team
Motivations
- General social network metrics
- Data storage
- Modeling and statistical inference
intergraph
statnet and igraph don't play nicely
- Many identical function names
- Different data formats
intergraph is a conversion tool
- asIgraph()
- asNetwork()
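A minimal conversion sketch, assuming both packages are installed; the object names are illustrative.
library(igraph)
library(intergraph)
g.ig <- graph.ring(5) # a small igraph object
g.net <- asNetwork(g.ig) # igraph -> network (statnet)
g.back <- asIgraph(g.net) # network -> igraph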
tnet
Created by Tore Opsahl
Provides metrics for three types of networks
- Weighted
- Two-mode
- Longitudinal
Motivated by the bias introduced by coercing these networks into simple graphs
Works well with igraph
RSiena
Originally a Windows-based, stand-alone program
Stochastic Actor-Oriented Models
Motivations
- Longitudinal network modeling
- Longitudinal behavioral network modeling
- Cross-sectional network modeling
- p* / ERGM
Social Networks in R
Data importation
Basic Measurements
Data
R can read anything!*
- Delimited text (comma, tab, bang, pipe, etc.)
- Excel files
- Internet files
- Tables from webpages
- .json, .xml, .html formats
- Files stored online
- Pajek files
- SPSS and Stata files
- Spatial data in shapefiles and KML
If possible, I'd recommend tab-delimited text files.
*It reads some file formats much better (and more easily) than others
Social Network Data
Typical network data formats
- Matrix
- Edge list
- Pajek
Attribute formats
- Actor dataframe
- Edge dataframe
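A toy sketch contrasting the matrix and edge list formats in igraph; the object names are illustrative.
library(igraph)
adj <- matrix(c(0, 1, 0,
                0, 0, 1,
                1, 0, 0), nrow = 3, byrow = TRUE,
              dimnames = list(c("A", "B", "C"), c("A", "B", "C")))
g.from.matrix <- graph.adjacency(adj, mode = "directed") # adjacency matrix
el <- data.frame(from = c("A", "B", "C"), to = c("B", "C", "A"))
g.from.edgelist <- graph.data.frame(el, directed = TRUE) # edge list
graph.isomorphic(g.from.matrix, g.from.edgelist) # TRUE: same network, two formats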
Import a Tab-Delimited Edge List
Wikipedia votes for site administrators
One-mode, directed, simple
Picture courtesy of Frank Schulenburg
Import a Tab-Delimited Edge List
library(igraph)
wikidat <- read.table(file = "http://pastebin.com/raw.php?i=UVmTznBj", sep = "\t", header = TRUE, skip = 6)
# 'sep' is the separator
# 'header' indicates column names
# 'skip' are the number of initial rows to ignore
wikidat <- graph.data.frame(d = wikidat, directed = TRUE)
summary(wikidat) # Nodes: 7115; Edges: 103689
Import a Pajek File
Appearances by Marvel comic book characters
Characters by comic book issue
Two-mode
Created by Rosselló, Alberich, and Miro
Import a Pajek File
library(igraph)
marvnet <- read.graph("http://bioinfo.uib.es/~joemiro/marvel/porgat.txt", format = "pajek")
summary(marvnet)
A More Difficult Case
EIES Data
- Case
- Early study on electronic information exchange
- Precursor to email, late-1970s
- Subjects
- 32 researchers
- Cross-discipline social network analysts
- Relationships, expressed in matrices
- Heard of, met, friend, close friend, at two time points
- Message frequency between time points
- Attributes
- Vertex: discipline
- Vertex: citation count
- Edges: affinity
- Edges: message frequency
EIES
Acquire the data
firstdir <- getwd()
newdir <- paste(firstdir, "EIES_data", sep = "/")
# method = "wget" assumes wget is installed; use method = "auto" if this call fails
download.file("https://www.stats.ox.ac.uk/~snijders/siena/EIES_data.zip", "EIES_data.zip", method = "wget")
unzip("EIES_data.zip", exdir = newdir)
setwd(newdir)
eiesatts <- read.table("eies.dat", as.is = TRUE, stringsAsFactors = FALSE, strip.white = TRUE, skip = 3, quote = "", nrows = 32)
colnames(eiesatts) <- c("Researcher", "Original_Code", "Citations", "Discipline")
eiesatts$Discipline_Name <- c("Sociology", "Anthropology", "Statistics", "Psychology")[eiesatts$Discipline]
eiesatts$Discipline_Name[3] <- "Communication"
eiesatts$Discipline_Name[27] <- "Mathematics"
EIES
Read in the networks and create igraph objects
friend1 <- read.table("EIES.1", sep = " ", as.is = TRUE, strip.white = TRUE)
friend2 <- read.table("EIES.2", sep = " ", as.is = TRUE, strip.white = TRUE)
eiesmessages <- read.table("messages.asc", nrows = 32, strip.white = TRUE)
eies <- list(t1 = friend1, t2 = friend2, messages = eiesmessages)
rm(friend1, friend2, eiesmessages)
eies <- lapply(eies, as.matrix)
eies$t1 <- graph.adjacency(eies$t1, mode = "directed", weighted = TRUE, diag = FALSE)
E(eies$t1)$etype <- c("Do not know", "Heard of", "Met", "Friend", "Close Friend")[E(eies$t1)$weight + 1]
eies$t2 <- graph.adjacency(eies$t2, mode = "directed", weighted = TRUE, diag = FALSE)
E(eies$t2)$etype <- c("Do not know", "Heard of", "Met", "Friend", "Close Friend")[E(eies$t2)$weight + 1]
eies$messages <- graph.adjacency(eies$messages, mode = "directed", weighted = TRUE, diag = TRUE)
EIES
Assign vertex attributes
eiesvatts <- function(g, atts = eiesatts){
V(g)$name <- as.character(1:vcount(g))
for (i in 1:ncol(atts))
g <- set.vertex.attribute(g, name = colnames(atts)[i], value = atts[, i])
return(g)
}
eies <- lapply(eies, eiesvatts)
rm(eiesvatts, eiesatts)
sapply(list.files(), file.remove)
setwd(firstdir)
file.remove(newdir) # if this fails on your platform, try unlink(newdir, recursive = TRUE)
file.remove("EIES_data.zip")
rm(firstdir, newdir)
Import Other Formats in igraph
See
?read.graph
?graph.data.frame
?graph.adjacency
?graph.adjlist
?graph.constructors
# Query database for popular data sets
?nexus.list
?nexus.search
?nexus.get
?nexus.info
Basic Measurements
Density and Degree
is.simple(wikidat) # Verify it's simple
is.directed(wikidat) # Verify it's directed
vcount(wikidat) # Number of vertices
ecount(wikidat) # Number of edges
graph.density(wikidat) # Density

V(wikidat)$indegree <- degree(wikidat, mode = "in")
V(wikidat)$outdegree <- degree(wikidat, mode = "out")
V(wikidat)$totaldegree <- degree(wikidat, mode = "total")

all.vatts <- list.vertex.attributes(wikidat)[-1]
all.vatts <- sapply(all.vatts, get.vertex.attribute, graph = wikidat)
summary(all.vatts)
hist(log(V(wikidat)$outdegree), main = "Wiki Admin Votes", xlab = "Outdegree (log)")
Subsetting Networks
By edges
eies.closefriends.t1 <- subgraph.edges(eies$t1, eids = which(E(eies$t1)$etype == "Close Friend"), delete.vertices = FALSE)
eies.closefriends.t1 <- as.undirected(eies.closefriends.t1)
eies.closefriends.t2 <- subgraph.edges(eies$t2, eids = which(E(eies$t2)$etype == "Close Friend"), delete.vertices = FALSE)
eies.closefriends.t2 <- as.undirected(eies.closefriends.t2)
eies.closefriends <- list(t1 = eies.closefriends.t1, t2 = eies.closefriends.t2)
rm(eies.closefriends.t1, eies.closefriends.t2)
remove.isolates <- function(g){
nonisolates <- which(degree(g) > 0)
return(induced.subgraph(g, nonisolates))
}
eies.closefriends.nonisolates <- lapply(eies.closefriends, remove.isolates)
rm(remove.isolates)
Dyads and Triads
unlist(dyad.census(wikidat))
reciprocity(wikidat)
- Edgewise?
- Dyadic?
- Dyadic, non-null ("ratio")?
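A hedged sketch relating these variants to igraph output; see ?reciprocity and ?dyad.census for the definitions used here.
reciprocity(wikidat) # edgewise: share of directed edges that are reciprocated
reciprocity(wikidat, mode = "ratio") # dyadic, non-null: mutual / (mutual + asymmetric)
wiki.dc <- dyad.census(wikidat)
wiki.dc$mut / (wiki.dc$mut + wiki.dc$asym + wiki.dc$null) # dyadic: mutual dyads over all dyads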
transitivity(eies.closefriends$t1) # Undirected network
V(eies.closefriends$t1)$loc.trans <- transitivity(eies.closefriends$t1, "local")
triad.census(wikidat) # Directed network
source("http://pastebin.com/raw.php?i=J4fPjFuN")
transitivity_directed(wikidat) # Not in igraph
Who has the highest clustering coefficient?
V(eies.closefriends$t1)$name[which.max(V(eies.closefriends$t1)$loc.trans)]
Paths
Courtesy of Mihael Simonic
Paths
Warning: the commands below might take a minute or so
average.path.length(wikidat)
diameter(wikidat)
V(wikidat)$name[(get.diameter(wikidat))]
hist(shortest.paths(wikidat), main = "Histogram of Shortest Path Lengths", xlab = "Geodesic Lengths")
V(wikidat)$betw <- betweenness(wikidat)
E(wikidat)$eb <- edge.betweenness(wikidat)
hist(log(E(wikidat)$eb), main = "Histogram of Edge Betweenness", sub = "Wiki Admin Voting", xlab = "Edge Betweenness (log)")
Connectivity
Courtesy of Rico Shen
Connectivity
# How many weak and strong components do we have?
sapply(c("weak", "strong"), no.clusters, graph = wikidat)
# Notice the distributions
clusters(wikidat, mode = "weak")$csize
clusters(wikidat, mode = "strong")$csize
# Assign membership as attributes
V(wikidat)$comp.w <- clusters(wikidat, mode = "weak")$membership
V(wikidat)$comp.s <- clusters(wikidat, mode = "strong")$membership
# Are the vertices part of the giant component?
V(wikidat)$giantcomp.w <- V(wikidat)$comp.w == which.max(clusters(wikidat, mode = "weak")$csize)
V(wikidat)$giantcomp.s <- V(wikidat)$comp.s == which.max(clusters(wikidat, mode = "strong")$csize)
Centrality
Centrality
We've already reviewed degree and betweenness.
Examples of closeness and eigenvector centrality:
V(wikidat)$closeness <- closeness(wikidat)
V(wikidat)$evcent <- evcent(wikidat)$vector
k-cores
V(wikidat)$kc.undir <- graph.coreness(as.undirected(wikidat, mode = "collapse"))
#How are undirected k-cores related to centrality?
kc.cent.corr.fun<-function(x, y = V(wikidat)$kc.undir){
a <- get.vertex.attribute(wikidat, name = x)
b <- y
cor.est <- cor.test(a, b, method = "spearman", exact = FALSE)$estimate
return(cor.est)
}
kc.cent.corr.fun <- cmpfun(kc.cent.corr.fun)
cent.atts <- c("indegree", "outdegree", "totaldegree", "betw", "closeness", "evcent")
cent.cors <- sapply(cent.atts, kc.cent.corr.fun)
names(cent.cors) <- cent.atts
barplot(cent.cors, ylab = "Spearman's rho", xlab = "Centrality", main = "Correlation with k-coreness")
#Directed k-cores
sapply(c("in", "out", "all"), function(y) return(sapply(cent.atts, kc.cent.corr.fun, y = graph.coreness(wikidat, mode = y))))
rm(cent.atts, kc.cent.corr.fun)
Community Detection
Courtesy of Jorge Láscar
Community Detection
wikidat.comms <- multilevel.community(as.undirected(wikidat, mode = "collapse"))
wikidat.comms$modularity
V(wikidat)$comms <- wikidat.comms$membership
xtabs(~V(wikidat)$comms) # 3, 4, 5, 9, and 11 are the biggest

wikidat.comms.w <- multilevel.community(as.undirected(wikidat, mode = "collapse"), weights = max(E(wikidat)$eb) - E(wikidat)$eb)
wikidat.comms.w$modularity # Slightly higher
V(wikidat)$comms.w <- wikidat.comms.w$membership
xtabs(~V(wikidat)$comms.w) # 4, 5, 14, and 15 are the biggest
See also ?communities
BE SURE TO SAVE YOUR SESSION!!!
(We'll use it later)
save.image("NetSchool.RData")
savehistory("NetSchool.RHistory")
Visualization
A Small Example
plot(eies.closefriends$t1)
Let's Indicate Attributes!
Scale vertex size by citations, color by discipline
library(RColorBrewer)
library(compiler)
my_ecdf_fun <- function(x.val, all.vals){
  ret.val <- mean(x.val >= all.vals)
  return(ret.val)
}
my_ecdf_fun <- cmpfun(my_ecdf_fun)

vsize <- sapply(V(eies.closefriends$t1)$Citations, my_ecdf_fun, all.vals = V(eies.closefriends$t1)$Citations)
vsize <- 7.5 + 15 * vsize

each.discipline <- unique(V(eies.closefriends$t1)$Discipline_Name)
n.disciplines <- length(each.discipline)
vcol <- brewer.pal(n.disciplines, name = "Accent")
vcol <- vcol[match(V(eies.closefriends$t1)$Discipline_Name, each.discipline)]
plot(eies.closefriends$t1, vertex.color = vcol, vertex.size = vsize, vertex.label.color = "black", vertex.label.family = "sans")
Defaults for a Much Larger Case
Inadvisable for a network this large
vcount(wikidat) # 7115 nodes
ecount(wikidat) # 103689 edges
plot(wikidat)
What are the problems?
Subsample
Completely at Random
sampsize <- round(vcount(wikidat) * 0.05) # 5% sample
subsamp <- sample.int(vcount(wikidat), sampsize)
wikidat.sg <- induced.subgraph(wikidat, subsamp)
plot(wikidat.sg)
What are the problems?
Subsample
Completely at Random
sampsize <- round(vcount(wikidat) * 0.05) # 5% sample
subsamp <- sample.int(vcount(wikidat), sampsize)
wikidat.sg <- induced.subgraph(wikidat, subsamp)
plot(wikidat.sg)
- Nodes too large
- Labels don't provide information
plot(wikidat.sg, vertex.size = 3, vertex.label = "")
What are the problems?
Subsample
Visual problems
- Too many isolates
- Arrowheads too large
subsamp <- which(degree(wikidat.sg, mode = "all") > 0)
wikidat.sg <- induced.subgraph(wikidat.sg, subsamp)
plot(wikidat.sg, vertex.size = 3, vertex.label = "", edge.arrow.size = 0.5)
Much better, visually
But what are some conceptual problems using this technique?
Conceptual Problems with Random Sampling
- Grossly underrepresents degree
- Majority of alters absent
- Underestimates connectivity
- Conceptually naïve
- Sample could reflect structure
- Replication limitations
Alternative Subsampling Strategy
I want to portray communities and leadership
# Sample based on betweenness and large communities
cond1 <- V(wikidat)$betw > 1
cond2 <- V(wikidat)$comms %in% c(3, 4, 5, 9, 11)
subsamp <- which(cond1 & cond2)
wikidat.sg <- induced.subgraph(wikidat, subsamp)
vcount(wikidat.sg) # From 7,115 to 1,372 nodes
ecount(wikidat.sg) # From 103,689 to 40,587 edges
# Still too big to see anything.
# Systematic 25% sample on eigenvector centrality
evrank <- rank(V(wikidat.sg)$evcent, ties.method = "random") # break ties at random
nsyssamp <- seq(from = 1, to = max(evrank), by = 4)
nsyssamp <- which(evrank %in% nsyssamp)
wikidat.sg <- induced.subgraph(wikidat.sg, nsyssamp)

# Systematic 25% sample on edge betweenness
ebrank <- rank(E(wikidat.sg)$eb, ties.method = "random")
esyssamp <- seq(from = 1, to = max(ebrank), by = 4)
esyssamp <- which(ebrank %in% esyssamp)
wikidat.sg <- subgraph.edges(wikidat.sg, esyssamp)
plot(wikidat.sg, vertex.size = 3, vertex.label = "", edge.arrow.size = 0.25)
Not too bad!
Layout
There are many layout algorithms out there
Their default values compromise between quality and efficiency
?layout.auto
wikidat.sg.lo <- layout.fruchterman.reingold(wikidat.sg, params = list(niter = 2500, area = vcount(wikidat.sg)^3))
V(wikidat.sg)$x <- wikidat.sg.lo[,1]
V(wikidat.sg)$y <- wikidat.sg.lo[,2]
plot(wikidat.sg, vertex.size = 3, vertex.label = "", edge.arrow.size = 0.25)
Better use of space, but hairballish
What are your empirical interests?
- Sizes
- Colors and shading
- Categorical variables
- Interval and ordinal variables
- Transparency
Parameters to Vary
- Nodes
- Non-continuous: Shape, node color, border color
- Continuous: Node size, border width, color
- Edges
- Non-continuous: Color, line type, arrowhead type
- Continuous: Width, color
- Labels: Size, color, visibility
Edge Adjustments
Emphasize those with higher betweenness
Use width to convey it
ewidth <- sapply(E(wikidat.sg)$eb, my_ecdf_fun, E(wikidat.sg)$eb)
E(wikidat.sg)$width <- (1.25 * ewidth)^2
plot(wikidat.sg, vertex.size = 3, vertex.label = "", edge.arrow.size = 0.25, edge.curved = TRUE)
Only a slight improvement
Node Adjustments
Color by community
Scale by eigenvector centrality
vcols <- brewer.pal(3, "Accent")
names(vcols) <- unique(V(wikidat.sg)$comms)
V(wikidat.sg)$color <- vcols[as.character(V(wikidat.sg)$comms)]
V(wikidat.sg)$size <- 6 * sapply(V(wikidat.sg)$evcent, my_ecdf_fun, V(wikidat.sg)$evcent)
plot(wikidat.sg, vertex.label = "", edge.arrow.size = 0.25, edge.curved = TRUE)
Communities better portrayed
Additional Sweep
Halve the number of edges
Same technique as before
ebrank <- rank(E(wikidat.sg)$eb, ties.method = "random") # break ties at random
esyssamp <- seq(from = 1, to = max(ebrank), by = 2)
esyssamp <- which(ebrank %in% esyssamp)
wikidat.sg <- subgraph.edges(wikidat.sg, esyssamp)
plot(wikidat.sg, vertex.label = "", edge.arrow.size = 0.25, edge.curved = .25)
Much more clear
Do the main observations hold across different layouts?
par(mfrow = c(2, 2), mar = c(0.1, 0.1, 0.1, 0.1))
plot(wikidat.sg, vertex.label = "", edge.arrow.size = 0.25, edge.curved = .25, layout = layout.kamada.kawai, margin = c(0, 0, 0, 0))
plot(wikidat.sg, vertex.label = "", edge.arrow.size = 0.25, edge.curved = .25, layout = layout.circle, margin = c(0, 0, 0, 0))
plot(wikidat.sg, vertex.label = "", edge.arrow.size = 0.25, edge.curved = .25, layout = layout.svd, margin = c(0, 0, 0, 0))
plot(wikidat.sg, vertex.label = "", edge.arrow.size = 0.25, edge.curved = .25, layout = layout.mds, margin = c(0, 0, 0, 0))
Color Edges by Community
wikidat.sg.elist <- get.edgelist(wikidat.sg, names = FALSE)
edgecolor <- apply(wikidat.sg.elist, 1, function(x){
  c1 <- V(wikidat.sg)$comms[x[1]]
  c2 <- V(wikidat.sg)$comms[x[2]]
  return(ifelse(c1 == c2, vcols[as.character(c1)], "darkgrey"))
})
E(wikidat.sg)$color <- edgecolor

dev.off()
plot(wikidat.sg, vertex.label = "", edge.arrow.size = 0.25, edge.curved = .25, layout = layout.svd, margin = c(0, 0, 0, 0))
Advice
- Less is more
- Fewer modifications are better than more
- Smaller data is easier to represent than larger data
- Transparency when reporting visualization
- Informative and intuitive over beautiful
- Practical over gimmicky
- Science over art
For more options, see
?igraph.plotting
Graph-Level Indices
Graph-Level Indices
Some custom commands
(May take a few minutes)
source("http://pastebin.com/raw.php?i=SErkr5sg")
- ei_index
- transitivity_directed
- uman_game
- rewire_dyads
Graph-Level Indices
What's the edgewise reciprocity for Wikipedia administrator voting?
reciprocity(wikidat)
5.6% of all edges are reciprocated
Is this figure big or small?
Bigger or smaller than what?
We need to establish a basis for comparison.
Graph-Level Indices
Affected by network size and density
Example
graph.order <- seq(from = 10, to = 100, by = 5)
graph.trans <- sapply(graph.order, function(x) replicate(1000, transitivity(erdos.renyi.game(x, .1))))
graph.apl <- sapply(graph.order, function(x) replicate(1000, average.path.length(erdos.renyi.game(x, .1))))
par(mfrow = c(1, 2))
plot(x = graph.order, y = apply(graph.trans, 2, mean), ylim = c(0, 1), xlim = c(0, 100), type = "l", xlab = "Order", ylab = "Transitivity", main = "Tie Prob = 0.10; Simulations=1000")
lines(x = graph.order, y = apply(graph.trans, 2, quantile, probs = .025, na.rm = TRUE), col = "red", lty = 2, type = "l")
lines(x = graph.order, y = apply(graph.trans, 2, quantile, probs = .975, na.rm = TRUE), col = "red", lty = 2, type = "l")
plot(x = graph.order, y = apply(graph.apl, 2, mean), ylim = c(0, 5), xlim = c(0, 100), type = "l", xlab = "Order", ylab = "Average Path Length", main = "Tie Prob = 0.10; Simulations = 1000")
lines(x = graph.order, y = apply(graph.apl, 2, quantile, probs = .025, na.rm = TRUE), col = "red", lty = 2, type = "l")
lines(x = graph.order, y = apply(graph.apl, 2, quantile, probs = .975, na.rm = TRUE), col = "red", lty = 2, type = "l")
Graph-Level Indices
Affected by network size and density
graph.tieprob <- seq(from = .01, to = .30, by = .01)
graph.trans <- sapply(graph.tieprob, function(x) replicate(1000, transitivity(erdos.renyi.game(100, x))))
graph.apl <- sapply(graph.tieprob, function(x) replicate(1000, average.path.length(erdos.renyi.game(100, x))))
par(mfrow = c(1, 2))
plot(x = graph.tieprob, y = apply(graph.trans, 2, mean), ylim = c(0, 1), xlim = c(0, .3), type = "l", xlab = "Tie Probability", ylab = "Transitivity", main = "Order = 100; Simulations = 1000")
lines(x = graph.tieprob, y = apply(graph.trans, 2, quantile, probs = .025, na.rm = TRUE), col = "red", lty = 2, type = "l")
lines(x = graph.tieprob, y = apply(graph.trans, 2, quantile, probs = .975, na.rm = TRUE), col = "red", lty = 2, type = "l")
plot(x = graph.tieprob, y = apply(graph.apl, 2, mean), ylim = c(0,8), xlim = c(0,.3), type = "l", xlab = "Tie Probability", ylab = "Average Path Length", main = "Order=100; Simulations = 1000")
lines(x = graph.tieprob, y = apply(graph.apl, 2, quantile, probs = .025, na.rm = TRUE), col = "red", lty = 2, type = "l")
lines(x = graph.tieprob, y = apply(graph.apl, 2, quantile, probs = .975, na.rm = TRUE), col = "red", lty = 2, type = "l")
Random Graph Comparisons
Conditional Uniform Graph (CUG) test
- Measure observed statistic on a graph
- Generate many random graphs (~1000)
- Take the same measurement from each
- Compare the observed statistic to the random ones
Two families
- Reassignment methods
- Graph generators
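As a rough illustration of the four steps above, here is a minimal CUG-style skeleton built around an Erdős–Rényi null model; cug_sketch() and gli are illustrative names, not functions from any package.
cug_sketch <- function(g, gli, nsim = 1000){
  obs <- gli(g) # observed graph-level index
  exp <- replicate(nsim, gli(erdos.renyi.game(vcount(g), ecount(g),
                                              type = "gnm",
                                              directed = is.directed(g))))
  return(c(obs = obs, p.greater = mean(obs > exp)))
}
# e.g., cug_sketch(wikidat, transitivity, nsim = 100) # slow on a graph this size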
Reassignment Methods
"Shuffling" ties and attributes
- Permutation tests
- Rewiring tests
Permutation tests
- Permute an attribute to test its relationship to a nodal property
- Permute a graph to test its relationship to another graph
# Do conservative blogs have more links than liberal ones?
polblogs <- nexus.get(id = "polblogs") # 0 = liberal, 1 = conservative
polblogs.totdeg <- degree(polblogs, mode = "all")
obs.diff <- mean(polblogs.totdeg[V(polblogs)$LeftRight == 1]) - mean(polblogs.totdeg[V(polblogs)$LeftRight == 0])
perm.fun <- function(){
  perm.attribs <- sample(V(polblogs)$LeftRight)
  return(mean(polblogs.totdeg[perm.attribs == 1]) - mean(polblogs.totdeg[perm.attribs == 0]))
}
perm.fun <- cmpfun(perm.fun)
exp.diffs <- replicate(1000, perm.fun())
mean(obs.diff > exp.diffs) # ~0.86, not significantly greater than by chance assignment
"Why not just use a t-test?"
Permutation tests
E-I Index (Krackhardt and Stern 1988)
(Outgroup Ties - Ingroup Ties) / Total Ties
obs.val <- ei_index(polblogs, V(polblogs)$LeftRight)
exp.val <- replicate(1000, ei_index(polblogs, sample(V(polblogs)$LeftRight)))
mean(obs.val < exp.val) # We expect fewer outgroup ties
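For reference, a minimal sketch of the E-I computation; ei_sketch() is an illustrative helper, and the ei_index() sourced earlier may handle directionality and ties differently.
ei_sketch <- function(g, grp){ # grp: group memberships ordered as V(g)
  el <- get.edgelist(g, names = FALSE)
  same <- grp[el[, 1]] == grp[el[, 2]]
  return((sum(!same) - sum(same)) / nrow(el)) # (outgroup - ingroup) / total
}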
Permutation Tests
EI Index = (Outgroup Ties - Ingroup Ties) / Total Ties
# Friendship homophily by discipline among researchers
# Time 1
obs.val.t1 <- ei_index(eies.closefriends$t1, V(eies.closefriends$t1)$Discipline_Name)
exp.val.t1 <- replicate(1000, ei_index(eies.closefriends$t1, sample(V(eies.closefriends$t1)$Discipline_Name)))
# Time 2
obs.val.t2 <- ei_index(eies.closefriends$t2, V(eies.closefriends$t2)$Discipline_Name)
exp.val.t2 <- replicate(1000, ei_index(eies.closefriends$t2, sample(V(eies.closefriends$t2)$Discipline_Name)))
# Compare the homophily effect & its non-random probability
c(t1 = obs.val.t1, t1.p = mean(obs.val.t1 < exp.val.t1), t2 = obs.val.t2, t2.p = mean(obs.val.t2 < exp.val.t2))
Rewiring tests
One of two approaches
- Shuffle edges (or dyads)
- Shuffle edge weights
Preserves degree (and edge weight) distribution
karate <- nexus.get("karate")
karate.lo <- layout.kamada.kawai(karate, params = list(niter = 1000))
par(mfrow = c(1, 2))
plot(karate, vertex.size = 5, vertex.label = V(karate)$name, vertex.label.family = "sans", margin = c(0, 0, 0, 0), edge.curved = .33, layout = karate.lo)
karate.rw <- rewire(karate, mode = "simple", niter = 10*ecount(karate))
plot(karate.rw, vertex.size = 5, vertex.label = V(karate)$name, vertex.label.family = "sans", margin = c(0, 0, 0, 0), edge.curved = .33, layout = karate.lo)
Original (left) and Rewired (right)
Rewired Transitivity
obs.trans <- transitivity(karate)
karate.rw.fun <- function(){
krw <- rewire(karate, mode = "simple", niter = 10 * ecount(karate))
return(transitivity(krw))
}; karate.rw.fun <- cmpfun(karate.rw.fun)
exp.trans <- replicate(1000, karate.rw.fun())
mean(obs.trans > exp.trans) # ~0.87
What's the interpretation of this finding?
Rewired Transitivity
Is the transitivity in this directed network greater than chance?
Did it increase from the first to second time point?
Rewired dyads control for reciprocity and degree distribution
n.researchers <- vcount(eies$t1)
n.rewires <- 10 * n.researchers * (n.researchers - 1)
obs.val.t1 <- transitivity_directed(eies$t1)
obs.val.t2 <- transitivity_directed(eies$t2)
# Will take some time.
# Would recommend at least 1000, not 250, replications
exp.val.t1 <- replicate(250, transitivity_directed(rewire_dyads(eies$t1, n.rewires)))
exp.val.t2 <- replicate(250, transitivity_directed(rewire_dyads(eies$t2, n.rewires)))
c(t1 = obs.val.t1, t1.p = mean(obs.val.t1 > exp.val.t1), t2 = obs.val.t2, t2.p = mean(obs.val.t2 > exp.val.t2))
Graph Generators
- Erdős–Rényi model
- Power Law model
- Watts and Strogatz model
- Exponential random graph models
- ...and many others
Erdős–Rényi model
A fixed number (or proportion) of edges are randomly assigned to a fixed set of nodes.
Variants
- Exact #ties or density
- Directed, undirected
- Dyad census-conditioned
er.graph <- erdos.renyi.game(100, .03)
plot(er.graph, vertex.size = 5, vertex.label = "")
obs.incent <- centralization.degree(wikidat, mode = "in", loops = FALSE)$centralization
exp.incent.fun <- function(){
new.net <- erdos.renyi.game(vcount(wikidat), ecount(wikidat), type = "gnm", directed = TRUE)
return(centralization.degree(new.net, mode = "in", loops = FALSE)$centralization)
}; exp.incent.fun <- cmpfun(exp.incent.fun)
exp.incent <- replicate(1000, exp.incent.fun())
mean(obs.incent > exp.incent)
Erdős–Rényi model
obs.incent <- centralization.degree(wikidat, mode = "in", loops = FALSE)$centralization
exp.incent.fun.ec <- function(){
  new.net <- erdos.renyi.game(vcount(wikidat), ecount(wikidat), type = "gnm", directed = TRUE)
  return(centralization.degree(new.net, mode = "in", loops = FALSE)$centralization)
}
exp.incent.fun.ec <- cmpfun(exp.incent.fun.ec)
exp.incent.ec <- replicate(500, exp.incent.fun.ec())
c(obs = obs.incent, exp.ec = mean(obs.incent > exp.incent.ec))

# An alternative model that will take too much time
# Set sparse = TRUE if you don't have much RAM
exp.incent.fun.dc <- function(){
  dc <- unlist(dyad.census(wikidat))
  new.net <- uman_game(vcount(wikidat), dc, sparse = FALSE)
  return(centralization.degree(new.net, mode = "in", loops = FALSE)$centralization)
}
exp.incent.fun.dc <- cmpfun(exp.incent.fun.dc)
exp.incent.dc <- replicate(500, exp.incent.fun.dc())
Erdős–Rényi model
Power Law model
Assumptions
Networks grow according to the principle of preferential attachment.
P(X >= x) ~ x^(-alpha)
x is a degree value
P(X >= x) is the probability of observing a node with degree x or greater
alpha is the scaling exponent
(Barabási and Albert 1999)
Power Law model
wikidat.in <- degree(wikidat, mode = "in")
wikidat.out <- degree(wikidat, mode = "out")
wikidat.plfit.in <- power.law.fit(wikidat.in)
wikidat.plfit.out <- power.law.fit(wikidat.out)
print(c(wikidat.plfit.in$alpha, wikidat.plfit.out$alpha))
exp.incent.pl.fun <- function(){
new.net <- static.power.law.game(vcount(wikidat), ecount(wikidat), exponent.out = wikidat.plfit.out$alpha, exponent.in = wikidat.plfit.in$alpha)
return(centralization.degree(new.net, mode="in", loops = FALSE)$centralization)
}; exp.incent.pl.fun <- cmpfun(exp.incent.pl.fun)
exp.incent.pl <- replicate(500, exp.incent.pl.fun())
mean(obs.incent > exp.incent.pl) # Obs more centralized
par(mfrow = c(1, 2))
hist(exp.incent.ec, main = "E-R Indegree Centralization")
hist(exp.incent.pl, main = "Power Law Indegree Centralization") # ~10x more centralized graphs
Watts and Strogatz model
Assumptions
Local neighborhoods bridged by a few actors with ties that jump across neighborhoods.
Game
- Begin with a lattice
- Each actor initially has the same degree, neighborhood index (nei) * 2
- Rewire each edge with probability p
par(mfrow=c(1, 2))
g1 <- graph.lattice(100, nei = 1) # nei = mean degree / 2
g1.layout <- layout.fruchterman.reingold(g1)
V(g1)$x <- g1.layout[ , 1]; V(g1)$y <- g1.layout[ , 2]
g1.rewire <- rewire.edges(g1, p = 0.1) # prob of .1 rewire
plot(g1, vertex.size = 5, vertex.label = "", main = "Lattice")
plot(g1.rewire, vertex.size = 5, vertex.label = "", main = "Rewire")
Watts and Strogatz model
Question
What's the relationship between the neighborhood index (nei) and graph density?
Watts and Strogatz model
Low values for p resemble lattices
#The following two plots should be approximately equal
dev.off(); par(mfrow = c(1, 2))
sw.net <- watts.strogatz.game(dim = 1, size = 100, nei = 1, p = .1)
plot(g1.rewire, vertex.size = 5, vertex.label = "", main = "Rewired Lattice")
plot(sw.net, vertex.size = 5, vertex.label = "", layout = g1.layout, main = "Watts-Strogatz")
High values for p become closer to Erdős–Rényi graphs
par(mfrow = c(1, 2))
g2.ws <- watts.strogatz.game(dim = 1, size = 100, nei = 1, p = .9)
g2.er <- erdos.renyi.game(100, (2 * 100 / 2) / (100 * 99 / 2), directed = FALSE)
plot(g2.ws, vertex.size = 5, vertex.label = "", main = "Watts-Strogatz")
plot(g2.er, vertex.size = 5, vertex.label = "", main = "Erdős–Rényi")
Rewiring Demonstration
Exponential random graph models
Exercises
Recreate the first two plots in this lab using the Watts and Strogatz model. Fix the order (size) to 100 and set the density to approximately 0.02. Vary p from 0 to 1 by .01.
(Hint: for an undirected graph, mean degree is twice the number of edges divided by the number of nodes; nei equals mean degree / 2, i.e., the number of edges divided by the number of nodes.)
Recreate the power law exercise on degree centralization for the political blog network. If appropriate to model as a power law, is this network more centralized than predicted?
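One possible sketch for the first exercise, assuming the "first two plots" refer to the transitivity and average path length curves from the graph-level indices section; with size 100 and nei = 1, the Watts-Strogatz game yields roughly 100 edges, i.e., a density near 0.02.
p.seq <- seq(from = 0, to = 1, by = .01)
ws.trans <- sapply(p.seq, function(p) replicate(100, transitivity(watts.strogatz.game(dim = 1, size = 100, nei = 1, p = p))))
ws.apl <- sapply(p.seq, function(p) replicate(100, average.path.length(watts.strogatz.game(dim = 1, size = 100, nei = 1, p = p))))
par(mfrow = c(1, 2))
plot(x = p.seq, y = apply(ws.trans, 2, mean), type = "l", xlab = "p", ylab = "Transitivity")
plot(x = p.seq, y = apply(ws.apl, 2, mean), type = "l", xlab = "p", ylab = "Average Path Length")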
Two-Mode Networks
Two-Mode Networks are Special
- Clustering follows a different mechanism
- Distance can take on a different meaning
- Depends upon conceptualization of tie strength
Traditionally "projected" as a one-mode network.
- Theoretical problems
- Assumes an "event" creates ties among all participating actors
- Do you always know everyone at each party you attend?
- Some actors and events are much more inclusive
- Should their relationships hold the same weight?
- Methodological problems
- Creates a complete subgraph ("clique") around each event
- Increases density
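A toy illustration of the last point, assuming igraph is loaded; one event attended by five actors becomes a five-clique after projection.
toy <- graph.full.bipartite(5, 1) # 5 actors, 1 event; the type attribute is set automatically
toy.proj <- bipartite.projection(toy, which = "false")
graph.density(toy.proj) # 1: every pair of attendees is now directly tied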
Two-Mode Network Software
- Limited functions in igraph (R, Python, C, Ruby)
- More capabilities in tnet (R)
- NetworkX (Python)
library(igraph)
library(tnet)
library(compiler)
marvnet <- read.graph("http://bioinfo.uib.es/~joemiro/marvel/porgat.txt", format = "pajek")
V(marvnet)$name <- V(marvnet)$id
marvnet.tnet <- get.edgelist(marvnet, names = FALSE)
marvnet.tnet <- as.tnet(marvnet.tnet, type = "binary two-mode tnet")
Two-Mode Network Software
- Limited functions in igraph (R, Python, C, Ruby)
- More capabilities in tnet (R) and NetworkX (Python)
download.file("https://raw.githubusercontent.com/kjhealy/revere/master/data/PaulRevereAppD.csv", "PaulRevere.csv", method = "wget")
amrev <- read.csv("PaulRevere.csv", row.names = 1)
file.remove("PaulRevere.csv")
amrev.el <- which(amrev == 1, arr.ind = TRUE)
amrev.el[, 1] <- rownames(amrev)[as.numeric(amrev.el[, 1])]
amrev.el[, 2] <- colnames(amrev)[as.numeric(amrev.el[, 2])]
amrev <- graph.data.frame(amrev.el, directed = FALSE)
V(amrev)$type <- V(amrev)$name %in% amrev.el[, 2]
Convert it to tnet
amrev.tnet <- get.edgelist(amrev, names = FALSE)
amrev.tnet <- as.tnet(amrev.tnet, type = "binary two-mode tnet")
Clustering Coefficient
Closed Three-Paths (Robins and Alexander [2004])
ra_clust <- function(g){
g.motif <- graph.motifs(g, size = 4)
threepaths <- g.motif[7]
fourcycles <- 4 * g.motif[9]
ret.val <- fourcycles / (fourcycles + threepaths)
return(ret.val)
}
ra_clust <- cmpfun(ra_clust)
amrev.raclust <- ra_clust(amrev)
amrev.raclust
# clustering_tm() is much slower; a faster version is available (email Opsahl)
amrev.oclust <- clustering_tm(amrev.tnet)
amrev.oclust
Local Clustering Coefficient
Robins and Alexander (2004)
local_ra_clust <- function(g, vind){
  new.g <- graph.neighborhood(g, order = 2, nodes = vind)[[1]]
  new.g.motif <- graph.motifs(new.g, size = 4)
  threepaths <- new.g.motif[7]
  fourcycles <- 4 * new.g.motif[9]
  retval <- ifelse(fourcycles + threepaths > 0, fourcycles / (fourcycles + threepaths), NaN)
  return(retval)
}
local_ra_clust <- cmpfun(local_ra_clust)
amrev.raclust.loc <- sapply(1:vcount(amrev), local_ra_clust, g = amrev)
(Took my computer ~4 minutes)
amrev.oclust.loc <- clustering_local_tm(amrev.tnet)
Local Clustering Coefficient
Robins and Alexander (2004)
Whose Ego Networks Had the Highest Clustering?
# Identify both the actors and those actors with valid clustering coefficients
amrev.raclust.loc.actors.ind <- which(is.nan(amrev.raclust.loc[which(V(amrev)$type == FALSE)]) == FALSE)
amrev.raclust.loc.actors <- data.frame(name = V(amrev)$name[amrev.raclust.loc.actors.ind])
amrev.raclust.loc.actors$clust <- amrev.raclust.loc[amrev.raclust.loc.actors.ind]
amrev.raclust.loc.actors$rank <- rank(amrev.raclust.loc.actors$clust)
amrev.raclust.loc.actors[order(amrev.raclust.loc.actors$clust, amrev.raclust.loc.actors$name), ]
Randomization
Two main options
- Random network generation
- Edge rewires
# Generate a random network with the same parameters as the Marvel network
n.primary <- sum(V(marvnet)$type == FALSE)
n.secondary <- vcount(marvnet) - n.primary
marvnet.rand.el <- rg_tm(ni = n.primary, np = n.secondary, ties = ecount(marvnet))
head(marvnet.rand.el) # It's an edgelist.
# Export it to igraph
summary(marvnet.rand.el)
# Because both the first and second mode begin at 1 ...
marvnet.rand.el[, "p"] <- marvnet.rand.el[, "p"] + n.primary
marvnet.rand.type <- c(rep(FALSE, times = n.primary), rep(TRUE, times = n.secondary))
marvnet.rand.atts <- data.frame(name = 1:vcount(marvnet), type = marvnet.rand.type)
marvnet.rand <- graph.data.frame(marvnet.rand.el, directed = FALSE, vertices = marvnet.rand.atts)
Randomization
Let's compare the degree distributions. First mode
par(mfrow = c(2, 2))
tempdegree <- degree(marvnet, v = which(V(marvnet)$type == FALSE))
hist(log(1 + tempdegree), main = "Observed", xlab = "Issues in Which Each Character Appears (log)", ylab = "Frequency")
tempdegree <- degree(marvnet.rand, v = which(V(marvnet)$type == FALSE))
hist(log(1 + tempdegree), main = "Random", xlab = "Issues in Which Each Character Appears (log)", ylab = "Frequency")
tempdegree <- degree(marvnet, v = which(V(marvnet)$type == TRUE))
hist(log(1 + tempdegree), main = "Observed", xlab = "Characters Appearing in Each Issue (log)", ylab = "Frequency")
tempdegree <- degree(marvnet.rand, v = which(V(marvnet)$type == TRUE))
hist(log(1 + tempdegree), main = "Random", xlab = "Characters Appearing in Each Issue (log)", ylab = "Frequency")
Randomization
Random graph generator
- Good for small, dense networks
- Highly general
- Controls for size and density
- Does not control for degree distribution
# tnet method.
# Do not use now, as it will take A VERY LONG TIME.
# Also allows for weighted two-mode data and can shuffle weights
amrev.rand.rw <- rg_reshuffling_tm(amrev.tnet) # Reshuffles each edge
Randomization
Rewiring method
# igraph method. Inadvisable for large networks
# First, recreate the bipartite network, but treat it as directed to preserve the bipartite structure
# Second, rewire the directed two-mode network
# Third, remove the direction
rewire_tm <- function(g, niter = NULL){
  g.el <- get.edgelist(g)
  new.g <- graph.data.frame(g.el, directed = TRUE)
  V(new.g)$type <- V(g)$type
  nswaps <- ifelse(is.null(niter), 10 * ecount(g), niter)
  new.g <- rewire(new.g, mode = "simple", nswaps)
  new.g <- as.undirected(new.g)
  return(new.g)
}
rewire_tm <- cmpfun(rewire_tm)
amrev.raclust.exp <- replicate(1000, ra_clust(rewire_tm(amrev)))
summary(amrev.raclust.exp)
mean(amrev.raclust > amrev.raclust.exp)
Edge list must be consistently partitioned for this function to work!
Graph-Level Indices
When measuring graph-level indices on a projected network....
- Project observed two-mode data to one-mode
- Take graph-level measurement
- Simulate
- Simulate two-mode network from observed two-mode networks
- Project simulated two-mode to a one-mode network
- Take graph-level measurement on simulated one-mode network
- Repeat to create distribution
- Compare observed measurement to simulated distribution
DO NOT SIMULATE BASED ON ONE-MODE PROJECTION
Projection
Four methods
- Dichotomous (Breiger. 1974. Social Forces.)
- Sum (Breiger. 1974. Social Forces.)
- Weighted
- k threshold cut-off
- Newman. 2001. Phys Rev E.
- Collaboration weight penalized by number of collaborators
- Backbone
- Convert vertex pairs to z-scores using frequency counts
- Set alpha cut-off
- Retain edges beyond cut-off (+/-)
- Ahn, Ahnert, Bagrow, & Barabási. 2011. Scientific Reports.
- Serrano, Boguñá, & Vespignani. 2009 PNAS.
- Demonstration: http://pastebin.com/UNi0ryG3
Projection
Dichotomous
# tnet
marv.dichot <- projecting_tm(marvnet.tnet, method = "binary") # Takes me ~1.5 minutes
head(marv.dichot, 15) # Performed on "primary" mode only

# igraph.
# Set 'which' to "true" for the other mode
# Set it to "both" for a list of both projections
marv.dichot <- bipartite.projection(marvnet, multiplicity = FALSE, which = "false")
Projection
Sum
# tnet
marv.sum <- projecting_tm(marvnet.tnet, method = "sum") # Takes me ~1.5 minutes
head(marv.sum, 15) # Performed on "primary" mode only
# k-threshold >= 12
marv.k12 <- marv.sum[which(marv.sum$w >= 12), ]

# igraph
marv.sum <- bipartite.projection(marvnet, multiplicity = TRUE, which = "false")
head(E(marv.sum)$weight, 20)
# k-threshold >= 12
marv.k12 <- subgraph.edges(marv.sum, which(E(marv.sum)$weight >= 12), delete.vertices = FALSE)
Projection
Newman (2001)
newman.2m.fun <- function(g, type = FALSE){
  vdeg <- degree(g)
  vneigh <- neighborhood(g, order = 1)
  mode.which <- ifelse(type, "true", "false")
  proj1m <- bipartite.projection(g, multiplicity = TRUE, which = mode.which)
  vmatch <- match(V(proj1m)$name, V(g)$name)
  proj1m.el <- get.edgelist(proj1m, names = FALSE)
  ncollabs <- function(x){
    orig.i <- vmatch[x[1]]
    orig.j <- vmatch[x[2]]
    art.k <- vneigh[[orig.i]][which(vneigh[[orig.i]] %in% vneigh[[orig.j]])]
    art.degs <- sum(1 / (vdeg[art.k] - 1))
    return(art.degs)
  }
  E(proj1m)$newman.w <- apply(proj1m.el, 1, ncollabs)
  return(proj1m)
}
newman.2m.fun <- cmpfun(newman.2m.fun)
marv.newman <- newman.2m.fun(marvnet) # Took me ~30 seconds
head(cbind(E(marv.newman)$weight, E(marv.newman)$newman.w), 20)
Projection
Meanings
- Dichotomous
- "Any commonality"
- Sum
- Weighted
- "Number of commonalities"
- k threshold cut-off
- "Any commonality beyond k"
- Newman (2001)
- "Probability of dyadic interaction"
- Backbone
- "Dyadic interaction far above (or below) 'chance'"
- "Chance" based upon the level of an individual's activity
Distance
Weights affect the interpretation of distance
- Dichotomous
- Simple: number of collaborations away
- Weighted (including Newman)
- Distance between adjacent vertices is 1 / edge weight
- Average path length may be less than 1!
- Normalization (Opsahl, Agneessens, and Skvoretz 2010)
- Divide the weights by the mean of all weights
- Then calculate distance
Distance
# Dichotomous
shortest.paths(marv.newman, v = 1, to = 2, weights = rep(1, ecount(marv.newman))) # 3
# Summed weights
shortest.paths(marv.newman, v = 1, to = 2, weights = E(marv.newman)$weight) # 4
shortest.paths(marv.newman, v = 1, to = 2, weights = E(marv.newman)$weight / mean(E(marv.newman)$weight)) # 1.18
# Newman
edges2keep <- which(is.infinite(E(marv.newman)$newman.w) == FALSE)
marv.newman <- subgraph.edges(marv.newman, edges2keep, delete.vertices = FALSE)
shortest.paths(marv.newman, v = 1, to = 2, weights = E(marv.newman)$newman.w) # 0.38
shortest.paths(marv.newman, v = 1, to = 2, weights = E(marv.newman)$newman.w / mean(E(marv.newman)$newman.w)) # 1.33
Plotting
- Symbols
- Primary mode: circular nodes
- Secondary modes: square nodes
- Layout
- Bipartite
- Primary mode to the left
- Secondary mode to the right
- Multidimensional layouts appropriate also
- Projected as one-mode
- Any of the conventional layouts
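A hedged sketch of the symbol convention, assuming igraph is loaded; the toy network and parameter values are illustrative.
toy2m <- graph.full.bipartite(4, 2)
plot(toy2m,
     vertex.shape = ifelse(V(toy2m)$type, "square", "circle"), # squares: secondary mode
     vertex.color = ifelse(V(toy2m)$type, "lightblue", "white"),
     vertex.label = "")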
Plotting
Marvel Universe Data
My inclinations
- Character co-appearance
- Project to one-mode
- Subsample
- Popular characters (two-mode degree)
- Strong relationships (Newman's weight)
- Remove isolates
- Clean the names
- Color by cohesion
- Scale vertices by closeness
- Scale edge widths by Newman's weight
Plotting
issues48 <- which((degree(marvnet) >= 48) & (V(marvnet)$type == FALSE))
issues48 <- which(V(marv.newman)$name %in% V(marvnet)$name[issues48])
marv.newman.sg <- induced.subgraph(marv.newman, issues48)
marv.newman.sg <- subgraph.edges(marv.newman.sg, which(E(marv.newman.sg)$newman.w > 4), delete.vertices = TRUE)
cleanupnames <- function(y){
y <- strsplit(y, "/", fixed = TRUE)[[1]][1]
y <- strsplit(y, " [", fixed = TRUE)[[1]][1]
y <- gsub(".", "", y, fixed = TRUE)
ylen <- nchar(y)
lastchar <- substr(y, ylen, ylen)
y <- ifelse(lastchar == " ", substr(y, 1, ylen - 1), y)
return(y)
}
cleanupnames <- cmpfun(cleanupnames)
V(marv.newman.sg)$name <- sapply(V(marv.newman.sg)$name, cleanupnames)
Plotting
library(RColorBrewer)
# Community detection
V(marv.newman.sg)$comm <- multilevel.community(marv.newman.sg, weights = E(marv.newman.sg)$weight)$membership
max(V(marv.newman.sg)$comm) # 25 is too many communities to portray

# k-cores
V(marv.newman.sg)$kc <- graph.coreness(marv.newman.sg)
kcmax <- max(V(marv.newman.sg)$kc)
kcmax # 10 on a continuous scale is good.
min(V(marv.newman.sg)$kc) # Because there are no isolates, minimum is 1.
corecolors <- brewer.pal(max(V(marv.newman.sg)$kc), "Reds")
corecolors <- c(rep("white", times = kcmax - 9), corecolors)
V(marv.newman.sg)$color <- corecolors[V(marv.newman.sg)$kc]
Plotting
V(marv.newman.sg)$closeness <- closeness(marv.newman.sg, weights = E(marv.newman.sg)$weight / mean(E(marv.newman.sg)$weight), normalize = FALSE)
V(marv.newman.sg)$size <- ecdf(V(marv.newman.sg)$closeness)(V(marv.newman.sg)$closeness)
V(marv.newman.sg)$size <- V(marv.newman.sg)$size * 6
V(marv.newman.sg)$label.cex <- V(marv.newman.sg)$size / (4.5 * 2)
E(marv.newman.sg)$width <- ecdf(E(marv.newman.sg)$weight)(E(marv.newman.sg)$weight)
g.lo <- layout.fruchterman.reingold(marv.newman.sg, params = list(niter = 2500, area = vcount(marv.newman.sg)^3))
V(marv.newman.sg)$x <- g.lo[ ,1]
V(marv.newman.sg)$y <- g.lo[ ,2]
svg("MarvPlot.svg", height = 20, width = 20)
plot(marv.newman.sg, vertex.label.color = "black", vertex.label.family = "sans", margin = c(0, 0, 0, 0))
dev.off()
Try opening that .svg file in your web browser!
THANK YOU!
Benjamin Lind
blind@hse.ru