Social Network Analysis in R

Laboratory Two
Benjamin Lind

Social Network Analysis: Internet Research
St. Petersburg
20 August, 2013 (11:45-13:15)

Network Analysis Packages in R

statnet

igraph

intergraph

tnet

RSiena

statnet

Suite collecting the packages
  • sna
  • network
  • ergm, tergm
  • ...and others
  
Interdisciplinary, cross-university development team

Motivations
  • General social network metrics
  • Data storage
  • Modeling and statistical inference

WebsiteWiki

igraph

Originated as a fork of the sna package
Libraries C/C++, Python, and R
Created by Gábor Csárdi

Motivations and Strengths
  • Network science tradition
  • General and complex network metrics
  • Data storage

Website, Wiki

intergraph

statnet and igraph don't play nicely
  • Many identical function names
  • Different data formats

intergraph is a conversion tool
  • asIgraph()
  • asNetwork()






Website

tnet

Created by Tore Opsahl
Provides metrics for three types of networks
  • Weighted
  • Two-mode
  • Longitudinal


Motivated by the bias introduced by coercing these networks into simple graphs

Works well with igraph




Website

RSiena

Originally Windows-based, stand alone
Motivations
  • Longitudinal network modeling
  • Longitudinal behavioral network modeling
  • Cross-sectional network modeling
    • p* / ERGM








Website

Data

R can read anything!*

  • Delimited text (comma, tab, bang, pipe, etc)
  • Excel and Gnumeric files
  • Internet files
    • Tables from webpages
    • .json, .xml, .html files
    • Files stored online
  • Pajek files
  • SPSS and Stata files
  • Spatial data in shapefiles and KML

If possible, I'd recommend tab-delimited text files.

*It reads some file formats much better (and easier) than others

Import a Tab-Delimited Edge List

Wikipedia votes for site administrators
One-mode, directed, simple
Source










(Picture courtesy of Frank Schulenburg)

Import a Tab-Delimited Edge List

library(RCurl)
library(igraph)

#Function to read a data table from online read.sstab<-function(theurl, ...){ #_theurl_ refers to the location of the data #_..._ are parameters passed onto read.table require(RCurl) outtab<-getURL(theurl, ssl.verifypeer=FALSE) outtab<-textConnection(outtab) outtab<-read.table(outtab, sep="\t", ...) return(outtab) }
wikidat<-read.sstab("http://pastebin.com/raw.php?i=UVmTznBj", header=TRUE, skip=6) #Convert the data to igraph wikidat<-graph.data.frame(wikidat, directed=TRUE) summary(wikidat) #Nodes: 7115; Edges: 103689


Import a Pajek File












#snaspb2013 name network
Netlytic
One-mode, directed, simple

Import a Pajek File


download.file("http://pastebin.com/raw.php?i=a7sF1V75", "snaspb2013.net", method="wget")
snaspb2013<-read.graph("snaspb2013.net", "pajek")

file.remove("snaspb2013.net")

For details and formats produced by other software, see

 ?read.graph

Import Two-Mode Data into tnet

Data on membership in metal bands
Collected from Freebase.com

These commands will take some time.
Advisable to revisit later.


library(tnet)
metal.bands.df<-read.sstab("http://pastebin.com/raw.php?i=AA1SPz5K", header=TRUE, skip=4, as.is=TRUE, stringsAsFactors=FALSE, strip.white=TRUE)
colnames(metal.bands.df)[c(1,2)]<-c("group", "member")
su<-function(x) return(sort(unique(x)))

(Picture by Cecil)

Import Two-Mode Data into tnet

non.dupes<-which(duplicated(paste(metal.bands.df$group, metal.bands.df$member, sep="*"))==FALSE)
metal.bands.df<-metal.bands.df[non.dupes,c("member", "group")]  all.metal.names<-unique(c(su(metal.bands.df$member), su(metal.bands.df$group)))all.metal.names<-all.metal.names[-which(all.metal.names=="")]metal.bands.df$member<-match(metal.bands.df$member, all.metal.names) metal.bands.df$group<-match(metal.bands.df$group, all.metal.names)
miss.rows<-which((is.na(metal.bands.df$member) | is.na(metal.bands.df$group))==TRUE) metal.bands.df<-metal.bands.df[-miss.rows,]

metal.bands.tn<-as.tnet(metal.bands.df, type="binary two-mode tnet"); rm(non.dupes, miss.rows)

Measurements

Density and Degree

is.simple(snaspb2013) #Verify it's simple
is.directed(snaspb2013) #Verify it's directed
vcount(snaspb2013) #Number of vertices
ecount(snaspb2013) #Number of edges
graph.density(snaspb2013) #Density

V(snaspb2013)$indegree<-degree(snaspb2013, mode="in") V(snaspb2013)$outdegree<-degree(snaspb2013, mode="out") V(snaspb2013)$totaldegree<-degree(snaspb2013, mode="total")

all.vatts<-list.vertex.attributes(snaspb2013) sapply(all.vatts, get.vertex.attribute, graph=snaspb2013) summary(sapply(all.vatts[-1], get.vertex.attribute, graph=snaspb2013))
par(mfrow=c(1,2)) hist(V(snaspb2013)$indegree, main="snaspb2013", xlab="Indegree") hist(V(snaspb2013)$outdegree, main="snaspb2013", xlab="Outdegree")dev.off()

Dyads and Triads

unlist(dyad.census(snaspb2013))reciprocity(snaspb2013)
Which measurement of reciprocity did that value refer to?
  • Edgewise?
  • Dyadic?
  • Dyadic, non-null ("ratio")?


triad.census(snaspb2013)
transitivity(snaspb2013) #What does this number refer to?
V(snaspb2013)$loc.trans<-transitivity(snaspb2013, "local")
Who has the highest clustering coefficient?
Metal bonus!
clustering_tm(metal.bands.tn, subsample=.1)

Paths

average.path.length(snaspb2013)
diameter(snaspb2013) / 100 #Bug in the code
V(snaspb2013)$id[(get.diameter(snaspb2013))]
hist(shortest.paths(snaspb2013)/100, main="Histogram of Shortest Path Lengths", xlab="Path Lengths")

V(snaspb2013)$betw<-betweenness(snaspb2013)E(snaspb2013)$eb<-edge.betweenness(snaspb2013)hist(E(snaspb2013)$eb, main="Histogram of Edge Betweenness", sub="snaspb2013", xlab="Edge Betweenness")

#\m/ METAL BONUS! \m/member.geodist<-distance_tm(metal.bands.tn)

Connectivity

#How many weak and strong components do we have? sapply(c("weak", "strong"), function(x) return(sapply(list(snaspb2013=snaspb2013, wikidat=wikidat), no.clusters, mode=x)))
#Notice the distributionsclusters(snaspb2013, mode="weak")$csize clusters(snaspb2013, mode="strong")$csize clusters(wikidat, mode="weak")$csize tail(sort(clusters(wikidat, mode="strong")$csize))

V(snaspb2013)$comp.w<-clusters(snaspb2013, mode="weak")$membership V(snaspb2013)$comp.s<-clusters(snaspb2013, mode="strong")$membership
V(snaspb2013)$id[which(V(snaspb2013)$comp.s == which.max(clusters(snaspb2013, mode="strong")$csize))]

Centrality

We've already reviewed degree and betweenness.
Examples of closeness and eigenvector centrality:
V(snaspb2013)$closeness<-closeness(snaspb2013)
V(snaspb2013)$evcent<-evcent(snaspb2013)$vector

k-cores

V(snaspb2013)$kc.undir<-graph.coreness(as.undirected(snaspb2013, mode="collapse"))

#How are undirected k-cores related to centrality?
kc.cent.corr.fun<-function(x, y=V(snaspb2013)$kc.undir){
  a<-get.vertex.attribute(snaspb2013, name=x)
  b<-y
  return(cor.test(a, b, method="kendall", exact=FALSE)$estimate)
  }

cent.atts<-c("indegree", "outdegree", "totaldegree", "betw", "closeness", "evcent") sapply(cent.atts, kc.cent.corr.fun)
#Directed k-cores sapply(c("in", "out", "all"), function(y) return(sapply(cent.atts, kc.cent.corr.fun, y=graph.coreness(snaspb2013, mode=y)))) rm(cent.atts, kc.cent.corr.fun)

Community Detection

snaspb2013.sg.members<-which(V(snaspb2013)$comp.w == which.max(clusters(snaspb2013, mode="weak")$csize))
snaspb2013.sg<-induced.subgraph(snaspb2013, snaspb2013.sg.members)
snaspb2013.comms<-multilevel.community(as.undirected(snaspb2013.sg, mode="collapse")) snaspb2013.comms$modularity V(snaspb2013.sg)$comms<-snaspb2013.comms$membership names(snaspb2013.comms$membership)<-V(snaspb2013.sg)$id sort(snaspb2013.comms$membership)
snaspb2013.comms.w<-multilevel.community(as.undirected(snaspb2013.sg, mode="collapse"), weights=max(E(snaspb2013.sg)$eb)-E(snaspb2013.sg)$eb) snaspb2013.comms.w$modularity V(snaspb2013.sg)$comms.w<-snaspb2013.comms.w$membership names(snaspb2013.comms.w$membership)<-V(snaspb2013.sg)$id sort(snaspb2013.comms.w$membership)
See also ?communities

Visualization

Start simple
plot(snaspb2013)
What could be improved?
  • Include labels
  • Less activity in the center
  • Smaller nodes, arrows
snaspb.layout<-layout.fruchterman.reingold(snaspb2013, params=list(niter=5000, area=vcount(snaspb2013)^3))
V(snaspb2013)$x<-snaspb.layout[,1]
V(snaspb2013)$y<-snaspb.layout[,2]
rm(snaspb.layout)
plot(snaspb2013, vertex.size=5, vertex.label=V(snaspb2013)$id, vertex.label.family="sans", vertex.label.cex=.75, edge.arrow.size=.5, margin=c(0,0,0,0), edge.curved=.33)

Visualization

What are your empirical interests?
  • Sizes
  • Colors and shading
    • Categorical variables
    • Interval and ordinal variables
  • Transparency


Parameters to Vary
  • Nodes
    • Non-continuous: Shape, node color, border color
    • Continuous: Node size, border width
  • Edges
    • Non-continuous: Color, line type, arrowhead type
    • Continuous: Width
  • Labels: Size, color, visibility

Questions?

How would you represent:
  • k-cores?
  • Eigenvector centrality?
  • Edge betweenness?

Advisability aside, would it be possible to represent all of them at once?


png("snaspb2013.kc.png", height=8, width=11, units="in", bg="transparent", res=300)




plot(snaspb2013, vertex.size=8*(.5+V(snaspb2013)$evcent), vertex.label = V(snaspb2013)$id, edge.width = log(E(snaspb2013)$eb+1)/2, vertex.color = rev(heat.colors(max(V(snaspb2013)$kc.undir)+1))[V(snaspb2013)$kc.undir+1], vertex.label.color="white", vertex.label.family="sans", vertex.label.cex=.75, edge.arrow.size=.5, edge.curved=.33, margin=c(0, 0, 0, 0))

dev.off()    

Divergent Color Schemes

Let's try it for the communities detected!
snaspb2013.sg.layout<-layout.fruchterman.reingold(snaspb2013.sg, params=list(niter=5000, area=vcount(snaspb2013.sg)^3)) V(snaspb2013.sg)$x<-snaspb2013.sg.layout[,1] V(snaspb2013.sg)$y<-snaspb2013.sg.layout[,2] rm(snaspb2013.sg.layout)

plot(snaspb2013.sg, vertex.size=5, vertex.color=rainbow(max(V(snaspb2013.sg)$comms.w))[V(snaspb2013.sg)$comms.w], vertex.label = V(snaspb2013.sg)$id, vertex.label.family="sans", vertex.label.cex=.75, edge.arrow.size=.5, margin=c(0,0,0,0), edge.curved=.33)



Exercises

Calculate the non-null dyadic reciprocity on the Wikipedia network.

Calculate the modularity for the largest weak component in the Wikipedia network. Assign ("set") it as a graph-level attribute.
(Hint: ??"graph attribute")

What are the maximum in, out, and "all" k-core values in the Wikipedia network?  Assign those three k-core values as vertex attributes.

Plot the #snaspb2013 network that illustrates weighted communities with nodes scaled by betweenness centrality, their labels scaled by local transitivity, edges scaled according to their betweenness centrality, and using a Kamada Kawai layout.

Made with Slides.com