Social Network Analysis in R

Laboratory Four

Benjamin Lind

Social Network Analysis: Internet Research

St. Petersburg

22 August, 2013 (11:45-13:15)

Why Model with p*?

Let's discuss the alternatives

Benefit of flexibility: p* / ERGM as "a language"

Expressed as:

P(Y = y) = exp(θ'*g(y)) / k(y)

Generalizations and Applications

Seperable Termporal Exponential Random Graph Models (STERGM)

Can take either network or networkDynamic objects
Model tie formation and dissolution as different processes

Simulate entire network from ego network parameters

Tie distribution
Homophily
Triadic effects
Uses "target" statistics

p* in R

Requires statnet and the ergm package

Data must be in statnet's network format

For documentation on p*, see ?ergm

For documentation on which network features can be readily modeled, see ?ergm.terms

Acquire the Data

About the data

#snaspb2014 name network from Netlytic.org on Aug 21, 2013
Added user attributes from profiles (via twitteR)
Coded whether or not attending #snaspb2013

library(RCurl)
dl.data<-getURL("http://pastebin.com/raw.php?i=WdjebzzW", ssl.verifypeer=FALSE)
dl.data<-textConnection(dl.data)
source(dl.data)
rm(dl.data)

Includes both igraph and network objects

For ERGM compatibility, we'll be working with network

sapply(list.vertex.attributes(snaspb.net), function(x) return(get.vertex.attribute(snaspb.net, x)))

Plot the Data

plot(snaspb.net, vertex.cex=.75*log(snaspb.net %v% "statusesCount"), vertex.col=rev(rainbow(2, s=.75))[1+(snaspb.net %v% "attended"==T)], edge.col="light grey", bg="transparent", usecurve=TRUE, edge.curve=.01, layout.par=list(niter=5000, area=(snaspb.net %n% "n")^3))

(Bear in mind this syntax is for the network package, not igraph)

Hypotheses?

Modeling

Start simple

Select the best models based upon the BIC

Does the model reflect reality?

Plot simulations for quick diagnosis
Compare simulated measurements to the observed

Check MCMC diagnostics when MCMC called

Baseline Model

Erdős–Rényi

basemod<-ergm(snaspb.net~edges)summary(basemod)

Monte Carlo MLE Results:

Estimate Std. Error MCMC % p-value

edges -2.9836 0.1153 NA <1e-04 ***

---

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Null Deviance: 2273.5 on 1640 degrees of freedom

Residual Deviance: 633.3 on 1639 degrees of freedom

AIC: 635.3 BIC: 640.7 (Smaller is better.)

Interpreting the Baseline Model

Coefficient estimates are conditional log-odds

Probability of a single tie forming = exp(-2.98)/(1+exp(-2.98))

Probability of a single tie forming = 0.04833763

Equal to density

library(sna)gden(snaspb.net)

Our reference BIC is 640.7

Simulate the Model

base.sim <- simulate(basemod)plot(base.sim, vertex.cex=.75, vertex.col=rev(rainbow(2, s=.75))[1+(base.sim %v% "attended"==T)], edge.col="light grey", bg="transparent", usecurve=TRUE, edge.curve=.01, layout.par=list(niter=5000, area=(base.sim %n% "n")^3))

base.gof<-gof(basemod)
summary(base.gof)
par(mfrow=c(2,2)); plot(base.gof)

Differential Homophily Model

mod1<-ergm(snaspb.net ~ edges + nodematch("attended", diff=TRUE))summary(mod1)

Monte Carlo MLE Results:

Estimate Std. Error MCMC % p-value

edges -3.4139 0.1993 NA <1e-04 ***

nodematch.attended.FALSE -1.0969 0.4562 NA 0.0163 *

nodematch.attended.TRUE 1.8480 0.2558 NA <1e-04 ***

---

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Null Deviance: 2274 on 1640 degrees of freedom

Residual Deviance: 547 on 1637 degrees of freedom

AIC: 553 BIC: 569.2 (Smaller is better.)

Differential Homophily Model

Effects are significant and in the expected directions

Probability of a tie between two attendees: 0.17

exp(-3.4139+1.8480)/(1+exp(-3.4139+1.8480))

Probability of a tie between two non-attendees: 0.01

exp(-3.4139+(-1.0969))/(1+exp(-3.4139+-(1.0969)))

BIC of 569.2 is much better than in the edges-only model (640.7)

Model didn’t require Markov chain Monte Carlo estimation

Simulate the Model

mod1.sim <- simulate(mod1)plot(mod1.sim, vertex.cex=.75, vertex.col=rev(rainbow(2, s=.75))[1+(mod1.sim %v% "attended"==T)], edge.col="light grey", bg="transparent", usecurve=TRUE, edge.curve=.01, layout.par=list(niter=5000, area=(mod1.sim %n% "n")^3))

mod1.gof<-gof(mod1)
summary(mod1.gof)
par(mfrow=c(2,2)); plot(mod1.gof)

Edgewise Shared Partners

Does given edge among two actors result in each of those actors linking to a third party?

mod2<-ergm(snaspb.net~edges+nodematch("attended", diff=TRUE)+gwesp(.5))mod2.mcmc.diag<-mcmc.diagnostics.ergm(mod2)

Are sample statistics significantly different from observed?

Nope? Good.

(If it did, we'd increase the MCMC sample size.)

Sample statistics auto-correlation:...

These are far too high. Increase the MCMC interval.

Sample statistics burn-in diagnostic (Geweke):...

Some of the p-values are a bit low. Increase MCMC burnin.

Edgewise Shared Partners

#If we had trouble with the sample statistics, we would try something like#mod2.1<-ergm(snaspb.net ~ edges+nodematch("attended", diff = TRUE)+gwesp(.5), control = control.ergm( MCMC.burnin = 50000, MCMC.interval = 5000, MCMC.samplesize=50000))#Instead, we'll use
mod2.1<-ergm(snaspb.net ~ edges+nodematch("attended", diff = TRUE)+gwesp(.5), control = control.ergm( MCMC.burnin = 50000, MCMC.interval = 5000))

mod2.1.mcmc.diag<-mcmc.diagnostics.ergm(mod2.1)#The sample statistics aren't great, but everything else looks okay.

(This command takes about 30 minutes to complete on my laptop)

Edgewise Shared Partners

summary(mod2.1)

Monte Carlo MLE Results:

Estimate Std. Error MCMC % p-value

edges -3.6282 0.2023 0 < 1e-04 ***

nodematch.attended.FALSE -0.9298 0.4589 0 0.042908 *

nodematch.attended.TRUE 1.1690 0.2747 0 < 1e-04 ***

gwesp 0.5958 0.1174 0 < 1e-04 ***

gwesp.alpha 0.8021 0.2310 0 0.000531 ***

---

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Null Deviance: 2273.5 on 1640 degrees of freedom

Residual Deviance: 535.4 on 1635 degrees of freedom

AIC: 545.4 BIC: 572.4 (Smaller is better.)

Edgewise Shared Partners

Shared Partner effect is significant.

Differential homophily effects are still significant.

BIC is slightly worse.

Simulate the Model

mod2.1.sim <- simulate(mod2.1)
plot(mod2.1.sim, vertex.cex=.75, vertex.col=rev(rainbow(2, s=.75))[1+(mod2.1.sim %v% "attended"==T)], edge.col="light grey", bg="transparent", usecurve=TRUE, edge.curve=.01, layout.par=list(niter=5000, area=(mod2.1.sim %n% "n")^3))

Title

mod1.gof<-gof(mod1)
summary(mod1.gof)
par(mfrow=c(2,2)); plot(mod1.gof)

So is it homophily?

An alternative hypothesis.

#What if those attending are just more popular?mod1.1<-ergm(snaspb.net~edges+nodefactor("attended"))
summary(mod1.1)mod1.1.sim<-simulate(mod1.1)plot(mod1.1.sim, vertex.cex=.75, vertex.col=rev(rainbow(2, s=.75))[1+(mod1.1.sim %v% "attended"==T)], edge.col="light grey", bg="transparent", usecurve=TRUE, edge.curve=.01, layout.par=list(niter=5000, area=(mod1.1.sim %n% "n")^3))mod1.1.gof<-gof(mod1.1)summary(mod1.1.gof)par(mfrow=c(2,2)); plot(mod1.1.gof)#Compare to the findings from mod1. Which is the best model?

Exercise: Homophily & Politics

Comparing different homophily forms

library(igraph); library(intergraph)
polblogs<-nexus.get(id="polblogs") #0=liberal, 1=conservative
polblogs<-asNetwork(polblogs)detach("package:igraph", unload=TRUE)pol.baseline<-ergm(polblogs~edges)pol.nodecov<-ergm(polblogs~edges+nodecov("LeftRight")) #Is side more popular?pol.nodematch<-ergm(polblogs~edges+nodematch("LeftRight")) #General homophily?pol.nodematch.diff<-ergm(polblogs~edges+nodematch("LeftRight", diff=TRUE)) #Is one side more homophilous than the other?
pol.nodemix<-ergm(polblogs~edges+nodemix("LeftRight")) #Is one side more likely to link to the other?

Which homophilous mechanism is at play, if any?

Exercise: Other Explanations

Look at list.vertex.attributes(snaspb.net)
Open ?ergm.terms
Look for terms appropriate for directed graphs
Experiment!

follower.mod<-ergm(snaspb.net~ edges + nodefactor("attended") + nodecov("followersCount"))#Consider recoding the skewed attributes on a log scale and reassigning the attributesnaspb.net %v% "followersCount.log" <- log(snaspb.net %v% "followersCount")follower.log.mod<-ergm(snaspb.net~ edges + nodefactor("attended") + nodecov(log("followersCount.log")))
gwesp.mod<-ergm(snaspb.net~ edges + nodefactor("attended") + gwesp(.25))
cohort.mod<-ergm(snaspb.net~ edges + nodefactor("attended") + absdiff("created"))

Social Network Analysis in R

Laboratory Four Benjamin Lind Social Network Analysis: Internet Research St. Petersburg 22 August, 2013 (11:45-13:15)

Why Model with p*?

Generalizations and Applications

p* in R

Acquire the Data

Plot the Data

Hypotheses?

Modeling

Baseline Model

Interpreting the Baseline Model

Simulate the Model

Differential Homophily Model

Differential Homophily Model

Simulate the Model

Edgewise Shared Partners

Edgewise Shared Partners

Edgewise Shared Partners

Edgewise Shared Partners

Simulate the Model

Title

So is it homophily?

Exercise: Homophily & Politics

Exercise: Other Explanations

Laboratory Four

Benjamin Lind

Social Network Analysis: Internet Research

St. Petersburg

22 August, 2013 (11:45-13:15)