Social Network Analysis in R

Laboratory Four
Benjamin Lind

Social Network Analysis: Internet Research
St. Petersburg
22 August, 2013 (11:45-13:15)

Why Model with p*?

Let's discuss the alternatives

Benefit of flexibility: p* / ERGM as "a language"

Expressed as:

P(Y = y) = exp(θ'*g(y)) / k(y)

Generalizations and Applications


Seperable Termporal Exponential Random Graph Models (STERGM)
  • Can take either network or networkDynamic objects
  • Model tie formation and dissolution as different processes

Simulate entire network from ego network parameters
  • Tie distribution
  • Homophily
  • Triadic effects
  • Uses "target" statistics


p* in R

Requires statnet and the ergm package
Data must be in statnet's network format

For documentation on p*, see ?ergm
For documentation on which network features can be readily modeled, see ?ergm.terms


Acquire the Data

About the data
  • #snaspb2014 name network from Netlytic.org on Aug 21, 2013
  • Added user attributes from profiles (via twitteR)
  • Coded whether or not attending #snaspb2013
library(RCurl)
dl.data<-getURL("http://pastebin.com/raw.php?i=WdjebzzW", ssl.verifypeer=FALSE)
dl.data<-textConnection(dl.data)
source(dl.data)
rm(dl.data)
Includes both igraph and network objects
For ERGM compatibility, we'll be working with network
sapply(list.vertex.attributes(snaspb.net), function(x) return(get.vertex.attribute(snaspb.net, x)))

Plot the Data

plot(snaspb.net, vertex.cex=.75*log(snaspb.net %v% "statusesCount"), vertex.col=rev(rainbow(2, s=.75))[1+(snaspb.net %v% "attended"==T)], edge.col="light grey", bg="transparent", usecurve=TRUE, edge.curve=.01, layout.par=list(niter=5000, area=(snaspb.net %n% "n")^3))
(Bear in mind this syntax is for the network package, not igraph)

Hypotheses?

Modeling

Start simple
Select the best models based upon the BIC

Does the model reflect reality?
  • Plot simulations for quick diagnosis
  • Compare simulated measurements to the observed


Check MCMC diagnostics when MCMC called

Baseline Model

Erdős–Rényi
basemod<-ergm(snaspb.net~edges)summary(basemod)
Monte Carlo MLE Results:
      Estimate Std. Error MCMC % p-value    
edges  -2.9836     0.1153     NA  <1e-04 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

     Null Deviance: 2273.5  on 1640  degrees of freedom
 Residual Deviance:  633.3  on 1639  degrees of freedom
 
AIC: 635.3    BIC: 640.7    (Smaller is better.) 

Interpreting the Baseline Model

Coefficient estimates are conditional log-odds
Probability of a single tie forming = exp(-2.98)/(1+exp(-2.98))
Probability of a single tie forming = 0.04833763

Equal to density
library(sna)gden(snaspb.net)

Our reference BIC is 640.7


Simulate the Model

base.sim <- simulate(basemod)plot(base.sim, vertex.cex=.75, vertex.col=rev(rainbow(2, s=.75))[1+(base.sim %v% "attended"==T)], edge.col="light grey", bg="transparent", usecurve=TRUE, edge.curve=.01, layout.par=list(niter=5000, area=(base.sim %n% "n")^3))








base.gof<-gof(basemod)
summary(base.gof)
par(mfrow=c(2,2)); plot(base.gof) 

Differential Homophily Model

mod1<-ergm(snaspb.net ~ edges + nodematch("attended", diff=TRUE))summary(mod1)
Monte Carlo MLE Results:
                         Estimate Std. Error MCMC % p-value    
edges                     -3.4139     0.1993     NA  <1e-04 ***
nodematch.attended.FALSE  -1.0969     0.4562     NA  0.0163 *  
nodematch.attended.TRUE    1.8480     0.2558     NA  <1e-04 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Null Deviance: 2274  on 1640  degrees of freedom
Residual Deviance:  547  on 1637  degrees of freedom
 
AIC: 553    BIC: 569.2    (Smaller is better.) 

Differential Homophily Model

Effects are significant and in the expected directions
Probability of a tie between two attendees: 0.17
exp(-3.4139+1.8480)/(1+exp(-3.4139+1.8480))
Probability of a tie between two non-attendees: 0.01
exp(-3.4139+(-1.0969))/(1+exp(-3.4139+-(1.0969)))

BIC of 569.2 is much better than in the edges-only model (640.7)

Model didn’t require Markov chain Monte Carlo estimation


Simulate the Model

mod1.sim <- simulate(mod1)plot(mod1.sim, vertex.cex=.75, vertex.col=rev(rainbow(2, s=.75))[1+(mod1.sim %v% "attended"==T)], edge.col="light grey", bg="transparent", usecurve=TRUE, edge.curve=.01, layout.par=list(niter=5000, area=(mod1.sim %n% "n")^3))








mod1.gof<-gof(mod1)
summary(mod1.gof)
par(mfrow=c(2,2)); plot(mod1.gof) 

Edgewise Shared Partners

Does given edge among two actors result in each of those actors linking to a third party?
mod2<-ergm(snaspb.net~edges+nodematch("attended", diff=TRUE)+gwesp(.5))mod2.mcmc.diag<-mcmc.diagnostics.ergm(mod2)
Are sample statistics significantly different from observed?
Nope? Good.
(If it did, we'd increase the MCMC sample size.)
Sample statistics auto-correlation:...
These are far too high. Increase the MCMC interval.
Sample statistics burn-in diagnostic (Geweke):...
Some of the p-values are a bit low. Increase MCMC burnin.

Edgewise Shared Partners


#If we had trouble with the sample statistics, we would try something like#mod2.1<-ergm(snaspb.net ~ edges+nodematch("attended", diff = TRUE)+gwesp(.5), control = control.ergm( MCMC.burnin = 50000, MCMC.interval = 5000, MCMC.samplesize=50000))#Instead, we'll use
mod2.1<-ergm(snaspb.net ~ edges+nodematch("attended", diff = TRUE)+gwesp(.5), control = control.ergm( MCMC.burnin = 50000, MCMC.interval = 5000))

mod2.1.mcmc.diag<-mcmc.diagnostics.ergm(mod2.1)#The sample statistics aren't great, but everything else looks okay.
(This command takes about 30 minutes to complete on my laptop)

Edgewise Shared Partners

summary(mod2.1) 
Monte Carlo MLE Results:
                         Estimate Std. Error MCMC %  p-value    
edges                     -3.6282     0.2023      0  < 1e-04 ***
nodematch.attended.FALSE  -0.9298     0.4589      0 0.042908 *  
nodematch.attended.TRUE    1.1690     0.2747      0  < 1e-04 ***
gwesp                      0.5958     0.1174      0  < 1e-04 ***
gwesp.alpha                0.8021     0.2310      0 0.000531 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

     Null Deviance: 2273.5  on 1640  degrees of freedom
 Residual Deviance:  535.4  on 1635  degrees of freedom
 
AIC: 545.4    BIC: 572.4    (Smaller is better.) 

Edgewise Shared Partners



Shared Partner effect is significant.


Differential homophily effects are still significant.


BIC is slightly worse.

Simulate the Model

mod2.1.sim <- simulate(mod2.1)
plot(mod2.1.sim, vertex.cex=.75, vertex.col=rev(rainbow(2, s=.75))[1+(mod2.1.sim %v% "attended"==T)], edge.col="light grey", bg="transparent", usecurve=TRUE, edge.curve=.01, layout.par=list(niter=5000, area=(mod2.1.sim %n% "n")^3))

Title







mod1.gof<-gof(mod1)
summary(mod1.gof)
par(mfrow=c(2,2)); plot(mod1.gof)  

So is it homophily?

An alternative hypothesis.
#What if those attending are just more popular?mod1.1<-ergm(snaspb.net~edges+nodefactor("attended"))
summary(mod1.1)mod1.1.sim<-simulate(mod1.1)plot(mod1.1.sim, vertex.cex=.75, vertex.col=rev(rainbow(2, s=.75))[1+(mod1.1.sim %v% "attended"==T)], edge.col="light grey", bg="transparent", usecurve=TRUE, edge.curve=.01, layout.par=list(niter=5000, area=(mod1.1.sim %n% "n")^3))mod1.1.gof<-gof(mod1.1)summary(mod1.1.gof)par(mfrow=c(2,2)); plot(mod1.1.gof)#Compare to the findings from mod1. Which is the best model?

Exercise: Homophily & Politics

Comparing different homophily forms
library(igraph); library(intergraph)
polblogs<-nexus.get(id="polblogs") #0=liberal, 1=conservative
polblogs<-asNetwork(polblogs)detach("package:igraph", unload=TRUE)pol.baseline<-ergm(polblogs~edges)pol.nodecov<-ergm(polblogs~edges+nodecov("LeftRight")) #Is side more popular?pol.nodematch<-ergm(polblogs~edges+nodematch("LeftRight")) #General homophily?pol.nodematch.diff<-ergm(polblogs~edges+nodematch("LeftRight", diff=TRUE)) #Is one side more homophilous than the other?
pol.nodemix<-ergm(polblogs~edges+nodemix("LeftRight")) #Is one side more likely to link to the other?
Which homophilous mechanism is at play, if any?

Exercise: Other Explanations

  1. Look at list.vertex.attributes(snaspb.net)
  2. Open ?ergm.terms
  3. Look for terms appropriate for directed graphs
  4. Experiment! 
follower.mod<-ergm(snaspb.net~ edges + nodefactor("attended") + nodecov("followersCount"))#Consider recoding the skewed attributes on a log scale and reassigning the attributesnaspb.net %v% "followersCount.log" <- log(snaspb.net %v% "followersCount")follower.log.mod<-ergm(snaspb.net~ edges + nodefactor("attended") + nodecov(log("followersCount.log")))
gwesp.mod<-ergm(snaspb.net~ edges + nodefactor("attended") + gwesp(.25))
cohort.mod<-ergm(snaspb.net~ edges + nodefactor("attended") + absdiff("created"))
Made with Slides.com