Social Network Analysis in R
Laboratory Four
Benjamin Lind
Social Network Analysis: Internet Research
St. Petersburg
22 August, 2013 (11:45-13:15)
Social Network Analysis: Internet Research
St. Petersburg
22 August, 2013 (11:45-13:15)
Why Model with p*?
Let's discuss the alternatives
Benefit of flexibility: p* / ERGM as "a language"
Expressed as:
P(Y = y) = exp(θ'*g(y)) / k(y)
Generalizations and Applications
Seperable Termporal Exponential Random Graph Models (STERGM)
- Can take either network or networkDynamic objects
- Model tie formation and dissolution as different processes
Simulate entire network from ego network parameters
- Tie distribution
- Homophily
- Triadic effects
- Uses "target" statistics
p* in R
Requires statnet and the ergm package
Data must be in statnet's network format
For documentation on p*, see ?ergm
For documentation on which network features can be readily modeled, see ?ergm.terms
Acquire the Data
About the data
-
#snaspb2014 name network from Netlytic.org on Aug 21, 2013
-
Added user attributes from profiles (via twitteR)
-
Coded whether or not attending #snaspb2013
library(RCurl)
dl.data<-getURL("http://pastebin.com/raw.php?i=WdjebzzW", ssl.verifypeer=FALSE)
dl.data<-textConnection(dl.data)
source(dl.data)
rm(dl.data)
Includes both igraph and network objects
For ERGM compatibility, we'll be working with network
sapply(list.vertex.attributes(snaspb.net), function(x) return(get.vertex.attribute(snaspb.net, x)))
Plot the Data
plot(snaspb.net, vertex.cex=.75*log(snaspb.net %v% "statusesCount"), vertex.col=rev(rainbow(2, s=.75))[1+(snaspb.net %v% "attended"==T)], edge.col="light grey", bg="transparent", usecurve=TRUE, edge.curve=.01, layout.par=list(niter=5000, area=(snaspb.net %n% "n")^3))
(Bear in mind this syntax is for the network package, not igraph)
Hypotheses?
Modeling
Start simple
Select the best models based upon the BIC
Does the model reflect reality?
-
Plot simulations for quick diagnosis
-
Compare simulated measurements to the observed
Check MCMC diagnostics when MCMC called
Baseline Model
Erdős–Rényi
basemod<-ergm(snaspb.net~edges)
summary(basemod)
Monte Carlo MLE Results:
Estimate Std. Error MCMC % p-value
edges -2.9836 0.1153 NA <1e-04 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Null Deviance: 2273.5 on 1640 degrees of freedom
Residual Deviance: 633.3 on 1639 degrees of freedom
AIC: 635.3 BIC: 640.7 (Smaller is better.)
Interpreting the Baseline Model
Coefficient estimates are conditional log-odds
Probability of a single tie forming = exp(-2.98)/(1+exp(-2.98))
Probability of a single tie forming = 0.04833763
Equal to density
library(sna)
gden(snaspb.net)
Our reference BIC is 640.7
Simulate the Model
base.sim <- simulate(basemod)
plot(base.sim, vertex.cex=.75, vertex.col=rev(rainbow(2, s=.75))[1+(base.sim %v% "attended"==T)], edge.col="light grey", bg="transparent", usecurve=TRUE, edge.curve=.01, layout.par=list(niter=5000, area=(base.sim %n% "n")^3))
base.gof<-gof(basemod)
summary(base.gof)
par(mfrow=c(2,2)); plot(base.gof)
Differential Homophily Model
mod1<-ergm(snaspb.net ~ edges + nodematch("attended", diff=TRUE))
summary(mod1)
Monte Carlo MLE Results:
Estimate Std. Error MCMC % p-value
edges -3.4139 0.1993 NA <1e-04 ***
nodematch.attended.FALSE -1.0969 0.4562 NA 0.0163 *
nodematch.attended.TRUE 1.8480 0.2558 NA <1e-04 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Null Deviance: 2274 on 1640 degrees of freedom
Residual Deviance: 547 on 1637 degrees of freedom
AIC: 553 BIC: 569.2 (Smaller is better.)
Differential Homophily Model
Effects are significant and in the expected directions
Probability of a tie between two attendees: 0.17
exp(-3.4139+1.8480)/(1+exp(-3.4139+1.8480))
Probability of a tie between two non-attendees: 0.01
exp(-3.4139+(-1.0969))/(1+exp(-3.4139+-(1.0969)))
BIC of 569.2 is much better than in the edges-only model (640.7)
Model didn’t require Markov chain Monte Carlo estimation
Simulate the Model
mod1.sim <- simulate(mod1)
plot(mod1.sim, vertex.cex=.75, vertex.col=rev(rainbow(2, s=.75))[1+(mod1.sim %v% "attended"==T)], edge.col="light grey", bg="transparent", usecurve=TRUE, edge.curve=.01, layout.par=list(niter=5000, area=(mod1.sim %n% "n")^3))
mod1.gof<-gof(mod1)
summary(mod1.gof)
par(mfrow=c(2,2)); plot(mod1.gof)
Edgewise Shared Partners
Does given edge among two actors result in each of those actors linking to a third party?
mod2<-ergm(snaspb.net~edges+nodematch("attended", diff=TRUE)+gwesp(.5))
mod2.mcmc.diag<-mcmc.diagnostics.ergm(mod2)
Are sample statistics significantly different from observed?
Nope? Good.
(If it did, we'd increase the MCMC sample size.)
Sample statistics auto-correlation:...
These are far too high. Increase the MCMC interval.
Sample statistics burn-in diagnostic (Geweke):...
Some of the p-values are a bit low. Increase MCMC burnin.
Edgewise Shared Partners
#If we had trouble with the sample statistics, we would try something like
#mod2.1<-ergm(snaspb.net ~ edges+nodematch("attended", diff = TRUE)+gwesp(.5), control = control.ergm( MCMC.burnin = 50000, MCMC.interval = 5000, MCMC.samplesize=50000))
#Instead, we'll use mod2.1<-ergm(snaspb.net ~ edges+nodematch("attended", diff = TRUE)+gwesp(.5), control = control.ergm( MCMC.burnin = 50000, MCMC.interval = 5000))
mod2.1.mcmc.diag<-mcmc.diagnostics.ergm(mod2.1)
#The sample statistics aren't great, but everything else looks okay.
Edgewise Shared Partners
summary(mod2.1)
Monte Carlo MLE Results:
Estimate Std. Error MCMC % p-value
edges -3.6282 0.2023 0 < 1e-04 ***
nodematch.attended.FALSE -0.9298 0.4589 0 0.042908 *
nodematch.attended.TRUE 1.1690 0.2747 0 < 1e-04 ***
gwesp 0.5958 0.1174 0 < 1e-04 ***
gwesp.alpha 0.8021 0.2310 0 0.000531 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Null Deviance: 2273.5 on 1640 degrees of freedom
Residual Deviance: 535.4 on 1635 degrees of freedom
AIC: 545.4 BIC: 572.4 (Smaller is better.)
Edgewise Shared Partners
Shared Partner effect is significant.
Differential homophily effects are still significant.
BIC is slightly worse.
Simulate the Model
mod2.1.sim <- simulate(mod2.1)
plot(mod2.1.sim, vertex.cex=.75, vertex.col=rev(rainbow(2, s=.75))[1+(mod2.1.sim %v% "attended"==T)], edge.col="light grey", bg="transparent", usecurve=TRUE, edge.curve=.01, layout.par=list(niter=5000, area=(mod2.1.sim %n% "n")^3))
Title
mod1.gof<-gof(mod1)
summary(mod1.gof)
par(mfrow=c(2,2)); plot(mod1.gof)
So is it homophily?
An alternative hypothesis.
#What if those attending are just more popular?
mod1.1<-ergm(snaspb.net~edges+nodefactor("attended"))
summary(mod1.1)
mod1.1.sim<-simulate(mod1.1)
plot(mod1.1.sim, vertex.cex=.75, vertex.col=rev(rainbow(2, s=.75))[1+(mod1.1.sim %v% "attended"==T)], edge.col="light grey", bg="transparent", usecurve=TRUE, edge.curve=.01, layout.par=list(niter=5000, area=(mod1.1.sim %n% "n")^3))
mod1.1.gof<-gof(mod1.1)
summary(mod1.1.gof)
par(mfrow=c(2,2)); plot(mod1.1.gof)
#Compare to the findings from mod1. Which is the best model?
Exercise: Homophily & Politics
Comparing different homophily forms
library(igraph); library(intergraph)
polblogs<-nexus.get(id="polblogs") #0=liberal, 1=conservative
polblogs<-asNetwork(polblogs)
detach("package:igraph", unload=TRUE)
pol.baseline<-ergm(polblogs~edges)
pol.nodecov<-ergm(polblogs~edges+nodecov("LeftRight")) #Is side more popular?
pol.nodematch<-ergm(polblogs~edges+nodematch("LeftRight")) #General homophily?
pol.nodematch.diff<-ergm(polblogs~edges+nodematch("LeftRight", diff=TRUE)) #Is one side more homophilous than the other?
pol.nodemix<-ergm(polblogs~edges+nodemix("LeftRight")) #Is one side more likely to link to the other?
Which homophilous mechanism is at play, if any?
Exercise: Other Explanations
- Look at list.vertex.attributes(snaspb.net)
-
Open ?ergm.terms
- Look for terms appropriate for directed graphs
- Experiment!
follower.mod<-ergm(snaspb.net~ edges + nodefactor("attended") + nodecov("followersCount"))
#Consider recoding the skewed attributes on a log scale and reassigning the attribute
snaspb.net %v% "followersCount.log" <- log(snaspb.net %v% "followersCount")
follower.log.mod<-ergm(snaspb.net~ edges + nodefactor("attended") + nodecov(log("followersCount.log")))
gwesp.mod<-ergm(snaspb.net~ edges + nodefactor("attended") + gwesp(.25))
cohort.mod<-ergm(snaspb.net~ edges + nodefactor("attended") + absdiff("created"))
SNA 2013-R Lab 4
By Benjamin Lind
SNA 2013-R Lab 4
- 7,086