Social Network Analysis in R
Laboratory Two
Benjamin Lind
Social Network Analysis: Internet Research
St. Petersburg
20 August, 2013 (11:45-13:15)
Network Analysis Packages in R
statnet
Suite collecting the packages
- sna
- network
- ergm, tergm
- ...and others
Interdisciplinary, cross-university development team
Motivations
- General social network metrics
- Data storage
- Modeling and statistical inference
igraph
Originated as a fork of the sna package
Libraries for C/C++, Python, and R
Created by Gábor Csárdi
Motivations and Strengths
- Network science tradition
- General and complex network metrics
- Data storage
intergraph
statnet and igraph don't play nicely
- Many identical function names
- Different data formats
intergraph is a conversion tool
- asIgraph()
- asNetwork()
tnet
Created by Tore Opsahl
Provides metrics for three types of networks
- Weighted
- Two-mode
- Longitudinal
Motivated by the bias introduced by coercing these networks into simple graphs
Works well with igraph
RSiena
Originally a Windows-based, stand-alone program
Motivations
- Longitudinal network modeling
- Longitudinal behavioral network modeling
- Cross-sectional network modeling
- p* / ERGM
Data
R can read anything!*
- Delimited text (comma, tab, bang, pipe, etc.)
- Excel and Gnumeric files
- Internet files
  - Tables from webpages
  - .json, .xml, .html files
  - Files stored online
- Pajek files
- SPSS and Stata files
- Spatial data in shapefiles and KML
If possible, I'd recommend tab-delimited text files.
*It reads some file formats much better (and easier) than others
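As a minimal sketch of the tab-delimited recommendation, the lines below read a small edge list from an in-memory string with base R's read.delim(). The data and the column names "from" and "to" are made up for illustration.

```r
# Hypothetical three-row edge list, tab-delimited with a header row
raw.txt <- "from\tto\nann\tbob\nbob\tcat\ncat\tann"

# read.delim() defaults to sep="\t" and header=TRUE
edges <- read.delim(text = raw.txt, stringsAsFactors = FALSE)

str(edges)   # inspect the column types and the number of rows
```

For a file on disk, the same call would take a file name in place of the text argument.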
Import a Tab-Delimited Edge List
Wikipedia votes for site administrators
One-mode, directed, simple
(Picture courtesy of Frank Schulenburg)
library(RCurl)
library(igraph)

#Function to read a data table from online
read.sstab<-function(theurl, ...){
  #_theurl_ refers to the location of the data
  #_..._ are parameters passed onto read.table
  require(RCurl)
  outtab<-getURL(theurl, ssl.verifypeer=FALSE)
  outtab<-textConnection(outtab)
  outtab<-read.table(outtab, sep="\t", ...)
  return(outtab)
}

wikidat<-read.sstab("http://pastebin.com/raw.php?i=UVmTznBj", header=TRUE, skip=6)
#Convert the data to igraph
wikidat<-graph.data.frame(wikidat, directed=TRUE)
summary(wikidat)
#Nodes: 7115; Edges: 103689
Import a Pajek File
#snaspb2013 name network
Netlytic
One-mode, directed, simple
download.file("http://pastebin.com/raw.php?i=a7sF1V75", "snaspb2013.net", method="wget")
snaspb2013<-read.graph("snaspb2013.net", "pajek")
file.remove("snaspb2013.net")
For details and formats produced by other software, see
?read.graph
Import Two-Mode Data into tnet
Data on membership in metal bands
Collected from Freebase.com
These commands will take some time.
Advisable to revisit later.
library(tnet)
metal.bands.df<-read.sstab("http://pastebin.com/raw.php?i=AA1SPz5K", header=TRUE, skip=4, as.is=TRUE, stringsAsFactors=FALSE, strip.white=TRUE)
colnames(metal.bands.df)[c(1,2)]<-c("group", "member")
su<-function(x) return(sort(unique(x)))
(Picture by Cecil)
#Drop repeated group-member pairs
non.dupes<-which(duplicated(paste(metal.bands.df$group, metal.bands.df$member, sep="*"))==FALSE)
metal.bands.df<-metal.bands.df[non.dupes,c("member", "group")]
all.metal.names<-unique(c(su(metal.bands.df$member), su(metal.bands.df$group)))
#Remove blank names (logical indexing is safe even when there are none)
all.metal.names<-all.metal.names[all.metal.names!=""]
#Replace names with integer ids
metal.bands.df$member<-match(metal.bands.df$member, all.metal.names)
metal.bands.df$group<-match(metal.bands.df$group, all.metal.names)
#Drop rows with a missing member or group
miss.rows<-is.na(metal.bands.df$member) | is.na(metal.bands.df$group)
metal.bands.df<-metal.bands.df[!miss.rows,]
metal.bands.tn<-as.tnet(metal.bands.df, type="binary two-mode tnet")
rm(non.dupes, miss.rows)
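The cleaning steps above (de-duplicate, build a name list, match names to integer ids, drop incomplete rows) can be tried on a toy table first. This is only a sketch in base R; the data frame "memb" and its contents are hypothetical and stand in for the Freebase data.

```r
# Hypothetical membership records: one exact duplicate, one blank member
memb <- data.frame(group  = c("BandA", "BandA", "BandB", "BandB"),
                   member = c("ann",   "ann",   "bob",   ""),
                   stringsAsFactors = FALSE)

# 1. Drop repeated group-member pairs
memb <- memb[!duplicated(memb[, c("group", "member")]), ]

# 2. Build one shared name list, excluding blanks
all.names <- sort(unique(c(memb$member, memb$group)))
all.names <- all.names[all.names != ""]

# 3. Replace names with integer positions; blanks have no match and become NA
ids <- data.frame(member = match(memb$member, all.names),
                  group  = match(memb$group,  all.names))

# 4. Keep only complete rows
ids <- ids[complete.cases(ids), ]
ids
```

The exact integer ids depend on the sort order of the current locale, so only the shape of the result (two complete rows) should be relied on.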
Measurements
Density and Degree
is.simple(snaspb2013) #Verify it's simple
is.directed(snaspb2013) #Verify it's directed
vcount(snaspb2013) #Number of vertices
ecount(snaspb2013) #Number of edges
graph.density(snaspb2013) #Density

V(snaspb2013)$indegree<-degree(snaspb2013, mode="in")
V(snaspb2013)$outdegree<-degree(snaspb2013, mode="out")
V(snaspb2013)$totaldegree<-degree(snaspb2013, mode="total")

all.vatts<-list.vertex.attributes(snaspb2013)
sapply(all.vatts, get.vertex.attribute, graph=snaspb2013)
summary(sapply(all.vatts[-1], get.vertex.attribute, graph=snaspb2013))

par(mfrow=c(1,2))
hist(V(snaspb2013)$indegree, main="snaspb2013", xlab="Indegree")
hist(V(snaspb2013)$outdegree, main="snaspb2013", xlab="Outdegree")
dev.off()
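To see what density and degree measure on a directed, simple network, they can be computed by hand. The sketch below uses base R only and a hypothetical four-vertex edge list; it is not the igraph implementation.

```r
# Hypothetical directed, simple network stored as an edge list
el <- data.frame(from = c(1, 2, 3, 1), to = c(2, 3, 1, 3))
n <- 4                       # vertex 4 is an isolate
m <- nrow(el)                # number of arcs

dens <- m / (n * (n - 1))    # observed arcs / possible arcs

indeg  <- tabulate(el$to,   nbins = n)   # arcs received per vertex
outdeg <- tabulate(el$from, nbins = n)   # arcs sent per vertex
```

Indegrees and outdegrees always sum to the same total, the number of arcs.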
Dyads and Triads
unlist(dyad.census(snaspb2013))
reciprocity(snaspb2013)
- Edgewise?
- Dyadic?
- Dyadic, non-null ("ratio")?
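The three reciprocity definitions differ in what they count. As a base R sketch on a hypothetical four-node adjacency matrix (the variable names are illustrative), each can be derived from the counts of mutual, asymmetric, and null dyads:

```r
# Hypothetical 4-node directed network: 1<->2 mutual, 2->3 and 3->4 one-way
A <- matrix(0, 4, 4)
A[1, 2] <- A[2, 1] <- 1
A[2, 3] <- 1
A[3, 4] <- 1

up <- upper.tri(A)                         # visit each dyad once
mut  <- sum(A[up] == 1 & t(A)[up] == 1)    # mutual dyads
asym <- sum(A[up] != t(A)[up])             # asymmetric dyads
null <- sum(A[up] == 0 & t(A)[up] == 0)    # null dyads

edgewise       <- 2 * mut / (2 * mut + asym)          # share of arcs that are reciprocated
dyadic         <- (mut + null) / (mut + asym + null)  # share of dyads that are symmetric
dyadic.nonnull <- mut / (mut + asym)                  # share of connected dyads that are mutual
```

On the same network the three values differ, so it matters which one a function reports.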
triad.census(snaspb2013)
transitivity(snaspb2013) #What does this number refer to?
V(snaspb2013)$loc.trans<-transitivity(snaspb2013, "local")
Who has the highest clustering coefficient?
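As a sketch of what the global and local clustering coefficients count, the following base R code works on a hypothetical undirected network. It treats ties as undirected, which is also how igraph's transitivity() handles directed graphs.

```r
# Hypothetical undirected network: triangle 1-2-3 plus a pendant edge 3-4
A <- matrix(0, 4, 4)
A[1, 2] <- A[2, 3] <- A[1, 3] <- A[3, 4] <- 1
A <- A + t(A)

deg <- rowSums(A)
closed.per.node  <- diag(A %*% A %*% A) / 2   # triangles through each node
triples.per.node <- choose(deg, 2)            # pairs of neighbours of each node

# Global: closed triples over all connected triples
global.trans <- sum(closed.per.node) / sum(triples.per.node)

# Local: the same ratio node by node (NaN when a node has fewer than 2 ties)
local.trans <- closed.per.node / triples.per.node
```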
Metal bonus!
clustering_tm(metal.bands.tn, subsample=.1)
Paths
average.path.length(snaspb2013)
diameter(snaspb2013) / 100 #Bug in the code
V(snaspb2013)$id[(get.diameter(snaspb2013))]
hist(shortest.paths(snaspb2013)/100, main="Histogram of Shortest Path Lengths", xlab="Path Lengths")
V(snaspb2013)$betw<-betweenness(snaspb2013)
E(snaspb2013)$eb<-edge.betweenness(snaspb2013)
hist(E(snaspb2013)$eb, main="Histogram of Edge Betweenness", sub="snaspb2013", xlab="Edge Betweenness")
#\m/ METAL BONUS! \m/
member.geodist<-distance_tm(metal.bands.tn)
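Average path length and diameter both summarise the matrix of geodesic distances. A minimal base R sketch, assuming an unweighted directed network and a hypothetical helper named bfs.dist, shows how such distances can be obtained with breadth-first search:

```r
# Breadth-first search distances from one source in an unweighted digraph
bfs.dist <- function(A, source) {
  d <- rep(Inf, nrow(A))
  d[source] <- 0
  frontier <- source
  while (length(frontier) > 0) {
    # unvisited vertices reachable in one step from the current level
    nxt <- which(colSums(A[frontier, , drop = FALSE]) > 0 & is.infinite(d))
    d[nxt] <- d[frontier[1]] + 1
    frontier <- nxt
  }
  d
}

# Hypothetical network: 1->2->3->4 plus a shortcut 1->3
A <- matrix(0, 4, 4)
A[1, 2] <- A[2, 3] <- A[3, 4] <- A[1, 3] <- 1

D <- t(sapply(1:4, function(s) bfs.dist(A, s)))   # row = source, column = target
reachable <- is.finite(D) & D > 0

apl  <- mean(D[reachable])   # average over existing paths only
diam <- max(D[reachable])    # longest geodesic
```

Unreachable pairs are left at Inf and excluded from both summaries.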
Connectivity
#How many weak and strong components do we have?
sapply(c("weak", "strong"), function(x) return(sapply(list(snaspb2013=snaspb2013, wikidat=wikidat), no.clusters, mode=x)))
#Notice the distributions
clusters(snaspb2013, mode="weak")$csize
clusters(snaspb2013, mode="strong")$csize
clusters(wikidat, mode="weak")$csize
tail(sort(clusters(wikidat, mode="strong")$csize))

V(snaspb2013)$comp.w<-clusters(snaspb2013, mode="weak")$membership
V(snaspb2013)$comp.s<-clusters(snaspb2013, mode="strong")$membership
V(snaspb2013)$id[which(V(snaspb2013)$comp.s == which.max(clusters(snaspb2013, mode="strong")$csize))]
Centrality
We've already reviewed degree and betweenness.
Examples of closeness and eigenvector centrality:
V(snaspb2013)$closeness<-closeness(snaspb2013)
V(snaspb2013)$evcent<-evcent(snaspb2013)$vector
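Eigenvector centrality is the leading eigenvector of the adjacency matrix. As a sketch only, assuming a small hypothetical undirected network, it can be reproduced with base R's eigen() and rescaled so that the top score is 1:

```r
# Hypothetical undirected network: node 1 tied to 2, 3, 4; plus a tie 2-3
A <- matrix(0, 4, 4)
A[1, 2] <- A[1, 3] <- A[1, 4] <- A[2, 3] <- 1
A <- A + t(A)

ev <- eigen(A, symmetric = TRUE)   # eigenvalues returned in decreasing order
lead <- abs(ev$vectors[, 1])       # leading eigenvector; its sign is arbitrary
evc <- lead / max(lead)            # rescale so the most central node scores 1
```

Node 1 scores highest, and node 4 scores lowest because its only neighbour is node 1 while nodes 2 and 3 also reinforce each other.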
k-cores
V(snaspb2013)$kc.undir<-graph.coreness(as.undirected(snaspb2013, mode="collapse"))
#How are undirected k-cores related to centrality?
kc.cent.corr.fun<-function(x, y=V(snaspb2013)$kc.undir){
  a<-get.vertex.attribute(snaspb2013, name=x)
  b<-y
  return(cor.test(a, b, method="kendall", exact=FALSE)$estimate)
}

cent.atts<-c("indegree", "outdegree", "totaldegree", "betw", "closeness", "evcent")
sapply(cent.atts, kc.cent.corr.fun)

#Directed k-cores
sapply(c("in", "out", "all"), function(y) return(sapply(cent.atts, kc.cent.corr.fun, y=graph.coreness(snaspb2013, mode=y))))
rm(cent.atts, kc.cent.corr.fun)
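A vertex's coreness is the largest k for which it survives repeated removal of vertices with fewer than k ties. The base R sketch below illustrates that peeling idea on an undirected adjacency matrix; the function name coreness.sketch is hypothetical and this is not how igraph computes it internally.

```r
coreness.sketch <- function(A) {
  n <- nrow(A)
  core <- numeric(n)
  alive <- rep(TRUE, n)
  k <- 0
  while (any(alive)) {
    deg <- as.vector(A %*% as.numeric(alive))  # ties to surviving vertices only
    drop <- alive & deg <= k
    if (any(drop)) {
      core[drop] <- k          # peeled at level k
      alive[drop] <- FALSE
    } else {
      k <- k + 1               # everyone left has more than k ties
    }
  }
  core
}

# Hypothetical network: triangle 1-2-3, pendant 4 attached to 3, isolate 5
A <- matrix(0, 5, 5)
A[1, 2] <- A[2, 3] <- A[1, 3] <- A[3, 4] <- 1
A <- A + t(A)

core <- coreness.sketch(A)
```

Vertex 3 has the highest degree but the same coreness as vertices 1 and 2, which is one reason k-cores and degree centrality need not agree.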
Community Detection
snaspb2013.sg.members<-which(V(snaspb2013)$comp.w == which.max(clusters(snaspb2013, mode="weak")$csize))
snaspb2013.sg<-induced.subgraph(snaspb2013, snaspb2013.sg.members)
snaspb2013.comms<-multilevel.community(as.undirected(snaspb2013.sg, mode="collapse"))
snaspb2013.comms$modularity
V(snaspb2013.sg)$comms<-snaspb2013.comms$membership
names(snaspb2013.comms$membership)<-V(snaspb2013.sg)$id
sort(snaspb2013.comms$membership)

snaspb2013.comms.w<-multilevel.community(as.undirected(snaspb2013.sg, mode="collapse"), weights=max(E(snaspb2013.sg)$eb)-E(snaspb2013.sg)$eb)
snaspb2013.comms.w$modularity
V(snaspb2013.sg)$comms.w<-snaspb2013.comms.w$membership
names(snaspb2013.comms.w$membership)<-V(snaspb2013.sg)$id
sort(snaspb2013.comms.w$membership)
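The modularity score reported above compares the ties observed within communities with the number expected by chance given the degrees. A base R sketch of the unweighted, undirected formula follows; the function name modularity.sketch and the toy network are assumptions for illustration.

```r
modularity.sketch <- function(A, membership) {
  m2 <- sum(A)                                   # 2m for a symmetric adjacency matrix
  k <- rowSums(A)                                # degrees
  same <- outer(membership, membership, "==")    # TRUE when i and j share a community
  sum((A - outer(k, k) / m2) * same) / m2
}

# Hypothetical network: two triangles (1-2-3 and 4-5-6) joined by the tie 3-4
A <- matrix(0, 6, 6)
A[1, 2] <- A[2, 3] <- A[1, 3] <- 1
A[4, 5] <- A[5, 6] <- A[4, 6] <- 1
A[3, 4] <- 1
A <- A + t(A)

Q.split <- modularity.sketch(A, c(1, 1, 1, 2, 2, 2))   # one community per triangle
Q.whole <- modularity.sketch(A, rep(1, 6))             # everyone in one community
```

Putting every vertex in a single community gives a modularity of zero, so a positive value indicates more within-community ties than the degree sequence alone would suggest.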
Visualization
Start simple
plot(snaspb2013)
What could be improved?
- Include labels
- Less activity in the center
- Smaller nodes, arrows
snaspb.layout<-layout.fruchterman.reingold(snaspb2013, params=list(niter=5000, area=vcount(snaspb2013)^3))
V(snaspb2013)$x<-snaspb.layout[,1]
V(snaspb2013)$y<-snaspb.layout[,2]
rm(snaspb.layout)
plot(snaspb2013, vertex.size=5, vertex.label=V(snaspb2013)$id, vertex.label.family="sans", vertex.label.cex=.75, edge.arrow.size=.5, margin=c(0,0,0,0), edge.curved=.33)
What are your empirical interests?
- Sizes
- Colors and shading
- Categorical variables
- Interval and ordinal variables
- Transparency
Parameters to Vary
- Nodes
- Non-continuous: Shape, node color, border color
- Continuous: Node size, border width
- Edges
- Non-continuous: Color, line type, arrowhead type
- Continuous: Width
- Labels: Size, color, visibility
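One way to connect variables to these parameters is to rescale continuous values onto a size range and to index a palette with categorical values. The sketch below uses base R only; the vectors "score" and "group" and the helper rescale are hypothetical stand-ins for vertex attributes.

```r
# Hypothetical vertex-level data
score <- c(0.1, 0.4, 1.0, 0.7)          # continuous, e.g. a centrality score
group <- c("a", "b", "a", "c")          # categorical, e.g. a community label

# Continuous -> size: rescale linearly onto a chosen range
rescale <- function(x, lo, hi) lo + (hi - lo) * (x - min(x)) / (max(x) - min(x))
node.size <- rescale(score, lo = 3, hi = 12)

# Categorical -> colour: one palette entry per level
pal <- rainbow(nlevels(factor(group)))
node.col <- pal[as.integer(factor(group))]
```

The resulting vectors could then be supplied to plotting arguments such as vertex.size and vertex.color.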
Questions?
How would you represent:
- k-cores?
- Eigenvector centrality?
- Edge betweenness?
Advisability aside, would it be possible to represent all of them at once?
png("snaspb2013.kc.png", height=8, width=11, units="in", bg="transparent", res=300)
plot(snaspb2013, vertex.size=8*(.5+V(snaspb2013)$evcent), vertex.label = V(snaspb2013)$id, edge.width = log(E(snaspb2013)$eb+1)/2, vertex.color = rev(heat.colors(max(V(snaspb2013)$kc.undir)+1))[V(snaspb2013)$kc.undir+1], vertex.label.color="white", vertex.label.family="sans", vertex.label.cex=.75, edge.arrow.size=.5, edge.curved=.33, margin=c(0, 0, 0, 0))
dev.off()
Divergent Color Schemes
Let's try it for the communities detected!
snaspb2013.sg.layout<-layout.fruchterman.reingold(snaspb2013.sg, params=list(niter=5000, area=vcount(snaspb2013.sg)^3))
V(snaspb2013.sg)$x<-snaspb2013.sg.layout[,1]
V(snaspb2013.sg)$y<-snaspb2013.sg.layout[,2]
rm(snaspb2013.sg.layout)
plot(snaspb2013.sg, vertex.size=5, vertex.color=rainbow(max(V(snaspb2013.sg)$comms.w))[V(snaspb2013.sg)$comms.w], vertex.label = V(snaspb2013.sg)$id, vertex.label.family="sans", vertex.label.cex=.75, edge.arrow.size=.5, margin=c(0,0,0,0), edge.curved=.33)
Exercises
Calculate the non-null dyadic reciprocity on the Wikipedia network.
Calculate the modularity for the largest weak component in the Wikipedia network. Assign ("set") it as a graph-level attribute.
(Hint: ??"graph attribute")
What are the maximum in, out, and "all" k-core values in the Wikipedia network? Assign those three k-core values as vertex attributes.
Plot the #snaspb2013 network so that it illustrates the weighted communities, with nodes scaled by betweenness centrality, labels scaled by local transitivity, edges scaled by their betweenness centrality, and a Kamada-Kawai layout.
SNA 2013-R Lab 2
By Benjamin Lind