Seminar on Research Design II

Introductions

Course

Continuation of Seminar on Research Methods I

Room #513 (Tue 10:30-11:50)

Instructor

Benjamin Lind, PhD

blind@hse.ru *

Office #419 (Tue 18.10-20.10)

Students

*Introductions

Foreign instructor has a PhD, but no patronymic, how do we address him?!?

Preferred

"Ben" or "Benjamin"
Young, male sociologist from California--it's all good

Appropriate Formalities

"Professor Lind," "Prof Lind," "Dr Lind"
American-appropriate, yet not Russian-appropriate

Poor Formalities

"Mr Lind"

Course

Goal

Build upon previous course
Provide a comprehensive understanding of advanced and specialized research design strategies

Instructional style

American, conversational and informal
Emphasis on interaction
De-emphasis on physical documents

Timing

One week full-lecture
One week odds and ends

Language vow

Students

State your name
Favorite methodological subject
Last item bought

Syllabus

Click me for the syllabus

Click me to download the readings

Syllabus

Nine subjects

Experiments, sampling, case studies, content analysis, secondary data, historical analyses, social network analysis, simulations, and synthesis

Readings

< 50 pages on subject
General approach

One general intro
One empirical work
Students present empirical work

Grading

Attendance (15%)
Participation (15%)
Group presentation (15%)

Review (3.75%)
Evaluate (3.75%)
Relate (3.75%)
Dialog (3.75%)

Assignments (15%)
Final examination (40%)

Presentation Assignments

Readings

Salganik et al. (2006) [next week]

Rhomberg (2010)

Robinson (1976)

Barber (2001)

Armstrong and Crage (2006)

Christakis and Fowler (2007)

Morris and Kretzschmar (1997)

Random Groups of Three

Presentation Assignments

Did I miss anyone?

Did I mistakenly include anyone?

Experiments

The following lesson relies upon and draws heavily from Neuman (2007), Chapter 8.

Experiments

Basic Troubleshooting Steps

Hypothesis
Modify situation
Compare with and without modification

Comic courtesy of xkcd

Experiments are the Best for a Causal Test

Causation requirements

Temporal Order
Association
No Alternative Explanation

Experiments Require an Intervention

Which units of analysis would this requirement preclude?
Which types of questions would it preclude due to issues of ethics and practicality?

E.g.,

E.g.,

Etc.

Photo by Magharebia

What are some ways we can intervene?

Random Assignment

What is meant by random assignment?
How are assignments determined?
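
A minimal sketch of one way to determine assignments, assuming a hypothetical roster of 20 subject IDs (not the class data): shuffle the roster, then split it in half, so every subject has the same chance of landing in the treatment group.

```r
# Hypothetical roster; the IDs are placeholders, not real students
set.seed(1)                          # Makes the shuffle reproducible
subjects <- paste0("S", 1:20)
shuffled <- sample(subjects)         # Random permutation of the roster
assignment <- data.frame(
  subject = shuffled,
  group = rep(c("treatment", "control"), each = length(subjects) / 2)
)
table(assignment$group)              # Ten subjects per group
```

Because chance alone decides the groups, selection bias is ruled out by design rather than by argument.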

The Seven Parts to a True Experiment

Random assignment
Control group
Experimental group
Pretest
Treatment / Independent variable
Posttest
Dependent variable

Why use a control?

To eliminate alternative explanations

Photo by Chlaborewan

Deception

What are some examples?
What are some ethical considerations?
Role of debriefing subjects.

Types of Experimental Designs

Notation

O = Dependent variable

O1 = Pretest
O2 = Posttest

X = Treatment

X1 = First treatment
X2 = Second treatment
...

R = Random assignment
Rows represent groups
Z = Confounding factor (factorial design)

Neuman (2007) Basics of Social Research, p. 213

Debates

To pretest or not to pretest?

Benefits of pretest
Benefits of avoiding pretest

Which designs lend themselves to macro inquiry?

How can we identify them?
How strong is their causal claim?

Internal Validity

"[T]he ability to eliminate alternative explanations of the dependent variable" Neuman (2007:212)

Selection bias (if no random assignment)
Contamination between subjects
Historical circumstances
Testing (i.e., pretest effects)
Instrumentation (i.e., slow equipment failure)
Experimenter effects
Maturation
"Mortality"
Statistical regression

External Validity

"[T]he ability to generalize experimental findings to events and settings outside the experiment itself" Neuman (2007:216)

Reactivity

"Hawthorne effect"
Subject awareness of experiment changes how they respond

Field experiments

Natural settings
Verify external validity
Greater generalization, but limited control

Assignment One

I will e-mail you a link to take an online survey

Results will be kept confidential

Please take it seriously

I will use your responses throughout the course
Please do not discuss it with your peers

Fill it out in a private setting
Do not talk to your classmates about its

Questions
Format
Images

You must complete it by Friday

Assignment One Debriefing

What did the survey ask?
What were the researcher's main interests?

How did you know?

Was the survey an experiment?

Can one conduct an experiment with a survey?
What leads you to believe it was or was not?
If so, what was the treatment?

Did any questions stand out as unusual?

How did you recognize them as unusual?
What do you think their purposes were?

Assignment One Debriefing

Yes, it was an experiment

Randomly selected 50% of students for treatment*
Two e-mails, two mostly identical surveys

Some questions were repeated.

Examples?
Why ask the same questions a few times?

Reliability
Pretest/Posttest

Treatment must occur between them
What was the treatment?

Assignment One Debriefing

No, it was not about cats!

Sociologists don't care about people buying cats

Pretest/posttest on that was a ruse

Slightly interested in cultural consensus

"Lenin Cat" and "Hitler Cat"
Human emotion ascription

Research question

"Can additional survey questions affect the number of peer nominations?"

Peer Nominations

Which questions?

What are they useful for?

Assignment One Debriefing

Theory: Reactivity

"Respondents might nominate a different number of peers if they are aware the researcher was scrutinizing their responses."

Two-tailed hypothesis
H0: No difference between the control and the treatment
H1: Difference between the control and the treatment

Privacy concerns -> fewer nominations
Interest in appearing thorough -> more nominations

Assignment One Debriefing

Operationalizing Reactivity

"Over the past two weeks, have you..."

"...gone on a river safari with a student from our class?"
"...been enrolled in the same course with at least two students from our class?"

Assignment One Debriefing

Results

Change in Nominations

            Fewer   Same   More
Control     1       6      3
Treatment   3       5      3

Did the treatment affect the posttest?
What are the implications?
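
One way to approach the first question, sketched under the assumption that Fisher's exact test is an acceptable choice for counts this small (the slides do not say which test, if any, was used):

```r
# Counts copied from the Change in Nominations table
nominations <- matrix(c(1, 6, 3,
                        3, 5, 3),
                      nrow = 2, byrow = TRUE,
                      dimnames = list(group = c("Control", "Treatment"),
                                      change = c("Fewer", "Same", "More")))
test <- fisher.test(nominations)   # Exact test, suited to small cell counts
test$p.value                       # Compare against the chosen alpha (e.g., .05)
```

A p-value above alpha means the table gives no grounds to reject H0; with 21 respondents, only a very large effect could be detected.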

Assignment One Debriefing

Limitations and Problems

All subjects had already received a reactivity treatment!

Which question(s)?
Are surveys even appropriate to test reactivity?

Deception came later, not well-integrated
Small sample and absolutely no external validity

Student Presentation

Matthew J. Salganik, Peter Sheridan Dodds, and Duncan J. Watts. 2006. "Experimental Study of Inequality and Unpredictability in an Artificial Cultural Market." Science 311:854-6.

Next Week: Sampling

Sampling

From the Truman Library Photographs

The following lesson relies upon and draws heavily from Neuman (2007), Chapter 6.

Materials

Data

Computational Environment

Computer Scripts

Fundamental Terms

Universe

"The broad class of units that are covered in a hypothesis. All the units to which the findings of a specific study might be generalized." (Neuman 2007)

Population

"The name for the large group of many cases from which a researcher draws a sample and which is usually stated in theoretical terms." (Neuman 2007)

Fundamental Terms

Sampling Frame

"A list of cases in a population, or the best approximation of it." (Neuman 2007)

Sample

"A smaller set of cases a researcher selects from a larger pool and generalizes to the population." (Neuman 2007)

Fundamental Terms

What are some examples of a...

...universe?
...population?
...sampling frame?

Using the data from the class survey, what is the...

...universe?
...population?
...sampling frame?

Other Terms

Sampling Ratio

The number of sampled cases divided by the size of the population they represent

Population Parameter

A characteristic of the population, typically estimated with statistics

Sampling Error

The difference between a statistic measured in a sample and the population parameter it estimates
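
The three terms above can be sketched with a simulated population. Everything here is hypothetical: the population size, the Poisson-distributed values, and the sample size of 50.

```r
# Hypothetical population of 1,000 cases; values are simulated, not real data
set.seed(42)
population <- rpois(1000, lambda = 6)      # E.g., number of communication partners
samp <- sample(population, size = 50)      # Simple random sample, no replacement

sampling.ratio <- length(samp) / length(population)   # 50 / 1000 = 0.05
parameter <- mean(population)                          # Population parameter
statistic <- mean(samp)                                # Sample estimate
sampling.error <- statistic - parameter                # Differs from sample to sample
```

In real research the parameter is unknown, so the sampling error can only be estimated, never observed directly.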

Statistical Terms

Central Limit Theorem

As the number of random samples drawn from a population increases, the distribution of their means approaches a normal distribution centered on the population parameter

Confidence Interval

An interval that a researcher claims, with a given degree of certainty, includes the population parameter

Sampling Distribution

"A distribution created by drawing many random samples from the same population" (Neuman 2007)

The Law of Large Numbers
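
A rough simulation of these statistical terms, assuming a hypothetical skewed population (exponential, mean near 5) rather than any real data:

```r
# Hypothetical skewed population; simulated for illustration only
set.seed(7)
population <- rexp(10000, rate = 1 / 5)        # Population mean is close to 5

# Sampling distribution: means of 2,000 random samples of size 30
samp.means <- replicate(2000, mean(sample(population, 30)))
# hist(samp.means)   # Uncomment to see a roughly bell-shaped distribution

# Law of large numbers: larger samples land closer to the parameter on average
err.small <- mean(abs(replicate(500, mean(sample(population, 10))) - mean(population)))
err.large <- mean(abs(replicate(500, mean(sample(population, 1000))) - mean(population)))

# 95% confidence interval computed from a single sample of 30
one.samp <- sample(population, 30)
ci <- t.test(one.samp)$conf.int
```

Comparing `err.small` with `err.large` shows why sample size matters; the interval `ci` comes from one sample and will miss the parameter in roughly 1 of 20 repetitions.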

Probability vs Nonprobability Sampling

Is there a known probability of a case being selected?

Nonprobability Samples

Types

Convenience Sampling
Quota Sampling
Purposive Sampling

Sequential Sampling
Deviant Case Design

Snowball Sampling

Nonprobability Samples

Haphazard/Accidental/Convenience Sampling

Photo courtesy of Anneli Salo

Nonprobability Samples

Haphazard/Accidental/Convenience Sampling

Poll at opensource.com

Nonprobability Samples

Haphazard/Accidental/Convenience Sampling

Link to Tweet

Nonprobability Samples

Haphazard/Accidental/Convenience Sampling

Population parameter:

Mean number of communication partners

Convenience sample:

First five students to take the survey

source("http://pastebin.com/raw.php?i=BU2tycrK") #Loads data and libraries
pop.param <- mean(studentdat$commties) #Measures population
first.five <- which(studentdat$timerank %in% 1:5) #Samples
convien.param <- mean(studentdat$commties[first.five]) #Measures sample
cat("\n  Population parameter = ", pop.param, ", Sample parameter = ", convien.param, "\n")

Nonprobability Samples

Haphazard/Accidental/Convenience Sampling

What are the limitations to this sampling method?

When should this sampling method be used for substantive knowledge?

NEVER!

EVER!

Nonprobability Samples

Quota Sampling

Photo courtesy of BrianZim

Nonprobability Samples

Quota Sampling

Steps

Determine categories
Determine how many to sample from each category
Sample haphazardly until quotas are met

What are the problems with this sampling method?

Nonprobability Samples

Quota Sampling Application

Categories: Females and Males
Quotas: Four women and one man
Convenience: First to fill out survey

Population parameter:

Mean number of communication partners

Quota sample:

m <- sort(studentdat$timerank[which(studentdat$male)])[1] #First male time ranks
f <- sort(studentdat$timerank[which(!studentdat$male)])[1:4] #First four females time ranks
quota.samp <- which(studentdat$timerank %in% c(m, f)) #Samples
quota.param <- mean(studentdat$commties[quota.samp]) #Measures sample
cat("\n  Population parameter = ", pop.param, ", Sample parameter = ", quota.param, "\n")

Nonprobability Samples

Purposive/Judgmental Sampling

When is it appropriate?

Different information from unique cases
Population is generally inaccessible
More detailed information on a targeted group

Continues until data or research exhaustion

Nonprobability Samples

Purposive/Judgmental Sampling

Outliers: The Story of Success is a non-fiction book written by Malcolm Gladwell….. In Outliers, Gladwell examines the factors that contribute to high levels of success. To support his thesis, he examines the causes of why the majority of Canadian ice hockey players are born in the first few months of the calendar year, how Microsoft co-founder Bill Gates achieved his extreme wealth, how The Beatles became one of the most successful musical acts in human history.... Throughout the publication, Gladwell repeatedly mentions the "10,000-Hour Rule".... (Wikipedia)

Nonprobability Samples

Purposive/Judgmental Sampling

Variant: Sequential Sampling

Continues until no new information or sample diversity attained

Nonprobability Samples

Purposive/Judgmental Sampling

Variant: Deviant Case

("extreme" case)

To be discussed during our lessons on case studies.

Nonprobability Samples

Snowball Sampling

("network," "chain referral," or "reputational" sampling)

Photo courtesy of Kamyar Adl

Nonprobability Samples

Snowball Sampling

Steps

Begin with seed(s)
Referrals from seed(s)
Sample referrals

Nonprobability Samples

Snowball Sampling

Simulating the process

Start with seeds
Seeds refer peers
Peers might not respond
If peers respond, gain more referrals
Continue until sample size met or hit dead ends

Nonprobability Samples

Snowball Sampling

The Simulation Function

set.seed(666) #Standardizes random number generator
sb.samp <- function(n = 5, rr = .5, nseeds = 1, net = studentel){
  recruit.wave <- rep(NA, times=vcount(net)) #Wave in which each vertex is recruited
  orig.seeds <- sample.int(vcount(net), nseeds) #Random starting seeds
  new.seeds <- orig.seeds
  ncount <- nseeds
  wave <- 1
  recruit.wave[orig.seeds] <- wave
  deadend <- FALSE
  while ((ncount < n) && !deadend){
    wave <- wave + 1
    new.seeds <- unique(unlist(sapply(new.seeds, neighbors, graph = net))) #Referrals, each counted once
    if (all(new.seeds %in% orig.seeds))
      deadend <- TRUE
    else{
      new.seeds <- new.seeds[!(new.seeds %in% orig.seeds)] #Drops peers already sampled
      recruited <- which(rbinom(length(new.seeds), 1, rr)==1) #Each peer responds with probability rr
      new.seeds <- new.seeds[recruited]
      if (length(new.seeds) == 0)
        deadend <- TRUE
      else{
        orig.seeds <- c(orig.seeds, new.seeds)
        recruit.wave[new.seeds] <- wave
        ncount <- length(orig.seeds)
        }
      }
    }
  return(recruit.wave)
  }

Nonprobability Samples

Snowball Sampling

Results

#General plot of "population" sociogram
library(RColorBrewer)
plot.wrap <- function(...)
  plot(studentel, edge.width=7, vertex.label.family="sans", vertex.label.cex=1.25, vertex.label.dist=.85, vertex.label.degree=-1.9*pi/4, ...)
plot.wrap.sb <- function(sb.recruited){
  sb.param <- mean(degree(studentel)[which(!is.na(sb.recruited))])
  sb.maintitle <- "Snowball Sample Simulation"
  sb.sub <- paste("Population Parameter = ", round(pop.param, 3), ", Sample parameter = ", round(sb.param, 3), sep="")
  sb.recruits.color <- rev(brewer.pal(max(sb.recruited, na.rm=TRUE), "Blues"))[sb.recruited]
  sb.recruits.color[is.na(sb.recruited)] <- "white"
  plot.wrap(vertex.color = sb.recruits.color, main = sb.maintitle, sub = sb.sub)  
  }
plot.wrap.sb(sb.samp(5))
plot.wrap.sb(sb.samp(10, rr = .3, nseeds = 2))
plot.wrap.sb(sb.samp(10, rr = .8, nseeds = 2))

Nonprobability Samples

Snowball Sampling

Which social phenomena is this method good for studying?

Who are we more likely to reach in this population?

Who are we least likely to reach in this population?

Probability Samples

Types

Simple Random Sampling
Systematic Sampling
Stratified Sampling
Cluster Sampling

Probability Samples

Simple Random Sampling

Steps

Acquire a reasonable sampling frame
Determine sample size
Randomly sample cases from the sampling frame

Repeat until the sample size is met
Sample without replacement

Photo courtesy of saschapohflepp

Webpages

http://www.random.org/

Computer Programs

#Sample size of 10, ranging from 1 to 1000
set.seed(666)
random.samp <- sample.int(1000, 10)
random.samp
random.samp.meanfun <- function(ss)
  return(mean(studentdat$commties[sample.int(vcount(studentel), ss)]))
cat("\n  Population parameter = ", pop.param, ", Sample parameter = ", round(random.samp.meanfun(5), digits = 3), "\n")

#To demonstrate the central limit theorem
random.mean.5samp.1k <- replicate(1000, random.samp.meanfun(5))
mean.across.randomsamples <- mean(random.mean.5samp.1k)
mean.across.randomsamples <- round(mean.across.randomsamples, digits = 3)
hist(random.mean.5samp.1k, main = "Frequency of Mean Communication Ties\nin 1000 Random Samples of 5", sub = paste("Population Parameter = ", round(pop.param, 3), ", Mean Across Samples = ", mean.across.randomsamples, sep=""), xlab = "Mean Number of Communication Partners")

Probability Samples

Systematic Sampling

Steps

Begin with a non-cyclical sampling frame
Select a starting case at random
Move up and down the list by every k cases

How would this method compare to random sampling?

How could a cyclical sampling frame affect your results?

Probability Samples

Systematic Sampling

# To demonstrate systematic samples
start.point <- sample.int(vcount(studentel), 1)
k <- 4
first.half <- seq(from = start.point, to = 1, by = -k) #Steps down the list from the random start
second.half <- seq(from = start.point, to = vcount(studentel), by = k)
syst.samp <- unique(c(first.half, second.half))
syst.samp
syst.samp.mean <- mean(studentdat$commties[syst.samp])
syst.samp.mean <- round(syst.samp.mean, digits = 3)
cat("\n  Population parameter = ", pop.param, ", Sample parameter = ", syst.samp.mean, "\n")

Probability Samples

Stratified Sampling

Steps

Identify mutually exclusive strata

E.g., geographical units

Randomly sample within each stratum
Weight to balance representation

Main benefits

Better representation than simple random sampling

Why?
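
The steps above can be sketched as follows, assuming a hypothetical sampling frame of 80 women and 20 men with simulated values (not the class survey):

```r
# Hypothetical sampling frame: 80 women and 20 men (simulated values)
set.seed(3)
frame <- data.frame(id = 1:100,
                    stratum = rep(c("female", "male"), times = c(80, 20)),
                    y = c(rnorm(80, mean = 6), rnorm(20, mean = 4)))

# Randomly sample five cases within each stratum
n.per.stratum <- 5
rows <- unlist(lapply(split(frame$id, frame$stratum), sample, size = n.per.stratum))
strat.samp <- frame[rows, ]

# Weight = stratum size / number of cases sampled from that stratum
sizes <- table(frame$stratum)
strat.samp$weight <- as.numeric(sizes[as.character(strat.samp$stratum)]) / n.per.stratum
weighted.mean(strat.samp$y, strat.samp$weight)   # Weighted estimate of the mean
```

Each sampled woman stands for 16 people and each sampled man for 4, so the small stratum is guaranteed a voice without being overcounted.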

Probability Samples

Stratified Sampling

Consider our example "population"

(i.e., students in our class)

How could we construct a stratified sample?

Probability Samples

Cluster Sampling

(aka, "multistage sampling")

Steps

Identify mutually exclusive clusters
Randomly sample clusters

Identify mutually exclusive clusters within
Randomly sample these clusters

Identify mutually exclusive clusters within
Randomly sample...

Weight to balance representation
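
A two-stage sketch of these steps, assuming a hypothetical frame of 10 schools with 30 students each (simulated values):

```r
# Hypothetical frame: 10 schools (clusters) with 30 students each
set.seed(11)
frame <- data.frame(school = rep(1:10, each = 30),
                    student = 1:300,
                    y = rnorm(300, mean = 5))

# Stage 1: randomly sample 3 of the 10 clusters
schools <- sample(unique(frame$school), 3)

# Stage 2: randomly sample 10 students within each sampled cluster
stage2 <- lapply(schools, function(s) sample(frame$student[frame$school == s], 10))
clust.samp <- frame[unlist(stage2), ]

# Each case stands for (10 / 3) clusters * (30 / 10) students = 10 students
clust.samp$weight <- (10 / 3) * (30 / 10)
```

The weights are identical here only because the hypothetical clusters are equal in size; unequal clusters would need case-specific weights.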

Probability Samples

Cluster Sampling

Consider our example "population"

(i.e., students in our class)

How could we construct a cluster sample?

Probability Samples

Cluster Sampling

Advantages

Cost
Speed

Disadvantages

Less accurate than simple random sampling
Requires detailed sampling frames

Tradeoff on cluster numbers and cluster size

Using the Telephone

Photo courtesy of Takkk

Using the Telephone

Is there a sampling frame?

Should you use it?

Random digit dialing as cluster sampling

What is it?
Why is it cluster sampling?

Words of caution

Role of telephones in social life
Nonresponse
Privacy

Weights

What do weights do?

Why are weights sometimes needed?

On which criteria should respondents be weighted?

Hidden Populations

What are hidden populations?

Capture-Recapture

Respondent-Driven Sampling

Scale-up Methods

Hidden Populations

Photo courtesy of Oldmaison

Hidden Populations

Photo courtesy of Todd Huffman

Hidden Populations

Photo courtesy of Orangeadnan

Hidden Populations

Photo courtesy of AdamCohn

Hidden Populations

Photo courtesy of maxintosh

Hidden Populations

Photo courtesy of kargaltsev

Hidden Populations

Photo courtesy of T-Hino

Hidden Populations

Lack a Sampling Frame

Characteristics

Population members interact with each other

Isn't this true for all "populations?"

Often illegal or stigmatized, though not always

Commonalities: They're not Weberian Bureaucracies.

E.g., No clear organizational hierarchy or authority,
no written rules or neutral governing body,
no expert training, and
no meritocratic advancement

Hidden Populations

Capture-Recapture

Photo courtesy of Mickey Samuni-Blank

Hidden Populations

Capture-Recapture

Lincoln-Petersen Method

Two Capture Sweeps

M = First sweep, captured and marked
R = Recaptured with marks
C = All captured during second sweep
N = Estimated total population size

N = M * C / R

R / M = C / N
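
A worked sketch of the formula, using hypothetical sweep counts (the function name `lincoln.petersen` is introduced here for illustration):

```r
# Lincoln-Petersen estimate of population size
# M = marked in first sweep, C = caught in second sweep, R = recaptured with marks
lincoln.petersen <- function(M, C, R){
  if (R == 0) stop("No recaptures: the estimate is undefined")
  M * C / R
}

# Hypothetical sweeps: 50 marked, 40 caught later, 10 of them already marked
N.hat <- lincoln.petersen(M = 50, C = 40, R = 10)
N.hat   # 50 * 40 / 10 = 200
```

The proportion check holds as well: R / M = 10 / 50 = 0.2 and C / N = 40 / 200 = 0.2.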

How do we ethically "capture" and "mark" humans?

Hidden Populations

Scale-up Methods

Hidden Populations

Scale-up Methods

"How many incarcerated people do you know?"
"How many licensed pilots do you know?"
"How many people with the first name 'David' do you know?"
...
"How many people do you know who died in the September 11, 2001 attacks?"

Basic points:

Determine how many people respondent knows
Extrapolate to a hidden population
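
The two basic points can be sketched with a simple ratio estimator. This is an assumption about how the extrapolation might be done, not necessarily the estimator intended in the lecture, and all figures are hypothetical.

```r
# All figures below are hypothetical placeholders
N <- 1e6                                          # Size of the general population
known.sizes <- c(pilots = 2000, davids = 10000)   # Subpopulations of known size

# Each row is a respondent; columns count how many people they know in each group
known.counts <- matrix(c(1, 4,
                         0, 6,
                         2, 8), ncol = 2, byrow = TRUE)
hidden.counts <- c(1, 0, 2)                # People known in the hidden population

# Step 1: personal network size, scaled up from the known groups
net.size <- N * rowSums(known.counts) / sum(known.sizes)

# Step 2: extrapolate to the hidden population
hidden.size <- N * sum(hidden.counts) / sum(net.size)
```

The estimate assumes respondents know members of the hidden population at the same rate as members of the known groups, which stigma can easily violate.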

Hidden Populations

Respondent Driven Sampling

(Heckathorn and Jeffri 2001)

Location sampling

Problem: Locations must be large & public

Institutional samples

Problem: Requires affiliation with institution

Chain referrals ("snowball")

Problem: Nonrandom "seeds"
Problem: Volunteerism
Problem: Differential recruitment
Problem: Popularity effects
Problem: Homophily and in-group effects

Hidden Populations

Respondent Driven Sampling

(Heckathorn and Jeffri 2001)

Address Problems of Chain Referrals

Law of large numbers and Markov chains

After enough waves, starting seeds don't matter
Transition states and equilibrium

Pay your respondents and make them comfortable

Incentivize both the recruiter and the recruit

E.g., Dropbox, plasma donation

Use a limited number of "coupons"
Respondents should come to the researcher
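
The Markov chain point ("after enough waves, starting seeds don't matter") can be illustrated with a toy example. The two groups and their recruitment probabilities are hypothetical, not taken from Heckathorn and Jeffri.

```r
# Hypothetical recruitment probabilities between two groups (rows sum to 1)
P <- matrix(c(0.8, 0.2,
              0.4, 0.6), nrow = 2, byrow = TRUE,
            dimnames = list(recruiter = c("A", "B"), recruit = c("A", "B")))

# Expected composition of a wave, given the composition of the starting seeds
after.waves <- function(start, waves){
  state <- start
  for (w in seq_len(waves)) state <- state %*% P
  state
}

from.A <- after.waves(c(1, 0), 20)   # All seeds from group A
from.B <- after.waves(c(0, 1), 20)   # All seeds from group B
```

Both starting points converge on the same equilibrium composition (two thirds A, one third B), which is why seed selection matters less as waves accumulate.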

Hidden Populations

Respondent Driven Sampling

(Heckathorn and Jeffri 2001)

Address Problems of Chain Referrals

Apply weights

Understand who is likely/unlikely to be recruited
Understand who is likely to recruit whom

Homophilous recruitment as structure

Recruitment is a behavioral network
Can indicate communities and inequality

Boundaries

Screening process

Crowd Sampling

Photo courtesy of Roland zh

Crowd Sampling

Steps

Understand event geography
Break into teams
Section crowd into geographical regions
Interview every kth person

Record their responses
Record basic information on refusals

(Milkman, Luce, and Penny. 2012. "Changing the Subject: A Bottom-Up Account of Occupy Wall Street in New York City." Appendix B.)

Assignment Two

http://goo.gl/hJzwWi

Next Week

Case Study Design

Case Studies

Picture courtesy of Bill Ebbesen

The following lesson relies upon and draws heavily from Gerring (2007)

Definitions

“Case connotes a spatially delimited phenomenon (a unit) observed at a single point in time or over some period of time.” Gerring (2007:19)

“A case study may be understood as the intensive study of a single case where the purpose of that study is -- at least in part -- to shed light on a larger class of cases (a population).” Gerring (2007:20)

Definitions

“At the point where the emphasis of a study shifts from the individual case to a sample of cases, we shall say that a study is cross-case.” Gerring (2007:20)

“An observation is the most basic element of any empirical endeavor.” Gerring (2007:20)

Typically, "N" refers to the number of observations

Population > Sample > Case ≥ Observation

Case Studies are Research Designs

No prescribed data format
No prescribed method of analysis
No upper limits on the N

Neither inherently qualitative nor quantitative.

There are certain affinities, though.

Population typically difficult to discern.

Definitions

“A single observation may be understood as containing several dimensions, each of which may be measured ...as a variable.” Gerring (2007:20)

Y

Dependent Variable
Outcome of Interest

X

Independent Variable
Explanatory Variable/Factor

Y ~ X

Data Organization

Data Frames and Matrices

Rows represent observations
Columns represent variables
Grouping variable (could) represent cases

Typically done in a spreadsheet
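
A small sketch of this layout in R, using hypothetical countries and values:

```r
# Hypothetical observations: two countries (cases) observed in two years
dat <- data.frame(
  case = c("Country A", "Country A", "Country B", "Country B"),  # Grouping variable
  year = c(2000, 2010, 2000, 2010),
  y    = c(3.1, 3.8, 5.0, 4.6),     # Outcome of interest
  x    = c(10, 12, 20, 19)          # Explanatory variable
)
nrow(dat)                        # Number of observations (N)
length(unique(dat$case))         # Number of cases
```

Four observations but only two cases: the grouping variable is what separates within-case from cross-case variation.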

Research Design Typology

Research must examine variation across cases or units

Dimensions of variation

Number of cases

One, "several," or "many"

Form of variation

Spatial and/or temporal

Location of variation

Within case and/or across cases

Research Design Typology

Why Case Studies?

Research Goals

Empirical Considerations

Case Study Research Goals

Tendencies

Role of hypotheses

Generating, rather than testing
Needed to study new phenomena

Validity

Internal, rather than external
Difficult to speak outside of sample

Causality

Focus on

Mechanisms, rather than effects

Inference

Deep, rather than broad

Effects

Picture courtesy of @alexlundry

Kosinski, Stillwell, and Graepel. 2013. "Private traits and attributes are predictable from digital records of human behavior." PNAS.

Mechanisms

Griffin (1993:1110) AJS

Mechanisms

Griffin (1993:1110) AJS

Case Studies & Empirical Considerations

Tendencies

Population of cases

Heterogeneous, rather than homogeneous

Causal relationship

Strong, rather than weak

Useful variation

Rare, rather than common

Data availability

Concentrated, rather than dispersed

Case Selection Strategies

Typical
Diverse
Extreme
Deviant
Crucial test
Pathway
Most-similar
Most-different

Practical Reasons

Language
Data availability
Theoretical background

Typical Case Selection

Representation

Hypothesis testing

Diverse Case Selection

Values range rather than distribution

Hypothesis testing

Hypothesis generation

Extreme Case Selection

Outliers

Representative only relative to larger sample of cases

Hypothesis generation

Deviant Case Selection

Outlier due to nonconforming relationship

Identify alternative relationships

Hypothesis generation

Crucial Test

Case least likely to exhibit relationship

Explain Y
Typical explanations X1, X2, X3, etc

None explain Y for this case

New variable, X4 works

Representativeness questionable

Hypothesis testing

Pathway

Select cases based on covariational patterns

(Combinations)

Interests

Mechanisms
Dependencies

Hypothesis testing

Most-Similar

Select very similar cases with different outcomes

Cases should have only one independent difference

That difference is the key variable

Maybe representative

Hypothesis testing and generating

Most-Different

Select very different cases with similar outcomes

Cases should have only one independent commonality

That commonality is the key variable

Maybe representative

Hypothesis testing and generating

Student Presentation

Chris Rhomberg. 2010. "A Signal Juncture: The Detroit Newspaper Strike and Post-Accord Labor Relations in the United States." American Journal of Sociology 115:6:1853-94.

Assignment 3

Break into five groups
Each group is a case study design
Create a "research design" with

Unit of analysis
Observation(s)
Case(s)
Outcome of interest, Y
Explanatory factors, X1 and X2

Next week: Data entry and content analysis

Data Entry

Source

Data Entry

Typical Ways to Enter Data

Spreadsheet
Text editor
Automated

Demonstration

Let's create some simple data!

Typical Layout

Units/Cases represented in rows
Variables represented in columns

Delimited Text Files

Advantages

Cross-platform
Easy to read

Disadvantages

Size
Somewhat limited detail
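
A short round trip through a comma-delimited file, sketched with hypothetical records and a temporary file path:

```r
# Hypothetical records; the names and values are placeholders
dat <- data.frame(name = c("Anna", "Boris"),
                  hours = c(12, 7.5),
                  cats = c("Happy", "Neutral"))

path <- tempfile(fileext = ".csv")          # Temporary file, so nothing is overwritten
write.csv(dat, path, row.names = FALSE)     # Comma-delimited, one row per unit
readLines(path)                             # Plain text, readable in any editor

dat2 <- read.csv(path, stringsAsFactors = FALSE)
str(dat2)                                   # Column types are guessed on import
```

The file stores only text, so details such as factor levels or date formats have to be restored after reading it back in.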

Exercise

Without looking at your neighbor's responses, write down

No questions.

Your name
Your gender
Your date of birth
Your study group
How many hours you spent studying this week
How cats make you feel (Happy, Neutral, or Unhappy)
One word to describe your class with Prof. Flores
Who is your favorite character in Game of Thrones?

Now write it on the board, exactly as you wrote it on paper.

Typical Sociology Data Types

Boolean
Numeric

Integer vs. Real

Date and Time

Numeric, Text, Other Formats

Ordinal

Numeric
Text

Text

Factors
Descriptive

What are examples of each form of data type?
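
One possible set of answers, sketched in R with hypothetical values modeled on the exercise questions:

```r
# Hypothetical answers from one respondent
male    <- TRUE                                   # Boolean
cohort  <- 3L                                     # Numeric: integer
hours   <- 7.5                                    # Numeric: real
born    <- as.Date("1995-04-17")                  # Date
cats    <- factor("Happy", levels = c("Unhappy", "Neutral", "Happy"),
                  ordered = TRUE)                 # Ordinal
group   <- factor("Group 2")                      # Text: factor (category)
oneword <- "challenging"                          # Text: descriptive

sapply(list(male, cohort, hours, born, cats, group, oneword),
       function(v) class(v)[1])
```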

Missing Values

Missing Values

Types of missingness
Representation
Handling

Additional data (Best)
Mean substitution (Bad)
Case-wise deletion (Conventional)
(Multiple) Imputation (Good)

Random ("Hot deck")
Nearest neighbor
Estimation
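
Three of the handling strategies compared on a hypothetical vector of study hours. This is a rough sketch: the hot deck step below is a single random imputation, not full multiple imputation.

```r
# Hypothetical study hours with two missing values
hours <- c(5, 8, NA, 12, NA, 7)

complete <- hours[!is.na(hours)]            # Case-wise deletion
mean.sub <- ifelse(is.na(hours), mean(hours, na.rm = TRUE), hours)  # Mean substitution

set.seed(5)
hot.deck <- hours                           # Random ("hot deck") imputation
hot.deck[is.na(hours)] <- sample(complete, sum(is.na(hours)), replace = TRUE)

# Mean substitution keeps the mean but shrinks the variance
c(deleted = var(complete), mean.sub = var(mean.sub))
```

The shrunken variance is one reason mean substitution is labeled "Bad": it makes the data look more certain than they are.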

Content Analysis

Much of the material from this lesson draws from

Krippendorff (2004) Content Analysis: An Introduction to Its Methodology

Content Analysis

How to Collect Data from Texts

What is a "text?"

What sort of projects is this method good for?

Advantages

"Dead"
Reliability
Replication

What are some disadvantages?

Content Analysis

Steps

Unitizing
Sampling
Recording/Coding
Reducing data
Inferring context
Narration

Provide an example of a text.

Krippendorff (2004:83)

Unitizing

Which unit will you be recording?

Unitizing

"Even though large tracts of Europe and many old and famous states have fallen or may fall into the grip of the Gestapo and all the odious apparatus of Nazi rule, we shall not flag or fail. We shall go on to the end, we shall fight in France, we shall fight on the seas and oceans, we shall fight with growing confidence and growing strength in the air, we shall defend our Island whatever the cost may be, we shall fight on the beaches, we shall fight on the landing grounds, we shall fight in the fields and in the streets, we shall fight in the hills; we shall never surrender."

How could this text be unitized?

Unitizing

Sampling

Which samples could these examples represent?

Which limitations would these samples face?

How could we sample within these examples?

Recording/Coding

Let's use a panel as our unit. What could we code?

Daredevil Battles Hitler 001

Reducing Data

E.g.,

What proportion of panels portray violence?

What proportion of panels with violence display violence directed against one or more Nazis?

What proportion of pronouns in Churchill's speech were first person plural?
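
A sketch of this kind of reduction on an excerpt of the speech, taking words as the recording unit and coding each one as a first person plural pronoun or not. The word list is an assumption made for illustration.

```r
# Excerpt from the speech quoted in the Unitizing slide
speech <- "we shall fight on the beaches, we shall fight on the landing grounds, we shall fight in the fields and in the streets, we shall fight in the hills; we shall never surrender."

# Unitize into words: lower-case, strip punctuation, split on spaces
words <- strsplit(gsub("[^a-z ]", "", tolower(speech)), " +")[[1]]

# Code each word: is it a first person plural pronoun?
first.plural <- c("we", "us", "our", "ours", "ourselves")
coded <- words %in% first.plural

# Reduce: proportion of words coded as first person plural
mean(coded)
```

Changing the unit (clauses instead of words) or the coding rule changes the proportion, which is why unitizing and coding decisions need to be reported.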

Inferring Context

What would be the context for these two examples?

Which sociological topics do they speak to?

Narration

Natural Language Processing

Optical character recognition (OCR)
Named entity recognition

Relationship extraction

Sentiment analysis

Suggested Application

Data Science Toolkit

Optical Character Recognition

Images of text are not machine-readable

Software required to convert text images

(Always check for quality.)

Named Entity Recognition

"Latoya Ammons, from Gary Indiana, and her three children claimed to have been possessed by evil spirits....Recently, the priest who dealt with the actual exorcisms of this family, Rev. Michael Maginot has signed with Evergreen Media Holdings to make his account of the story into a movie."

"Latoya Ammons [person], from Gary Indiana [location], and her three children claimed to have been possessed by evil spirits....Recently, the priest who dealt with the actual exorcisms of this family, Rev. Michael Maginot [person] has signed with Evergreen Media Holdings [organization] to make his account of the story into a movie."

Source: Exorcism in Gary Indiana by Wikinews

Named Entity Recognition

What are some possible uses for sociologists?

Which relationships do you think could be extracted?

Sentiment Analysis

What is the emotional state of the author?

<Words associated with happiness> :D

<Words associated with unhappiness> :'(

Positive values suggest happiness

Negative values suggest unhappiness
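
A minimal sketch of lexicon-based scoring. The six-word lexicon and its scores are hypothetical, and this is not the algorithm used by the Data Science Toolkit, so its values will differ from those reported in the lecture.

```r
# Hypothetical mini-lexicon; real lexicons hold thousands of scored words
lexicon <- c(happy = 2, opportunity = 1, good = 1, sad = -2, boring = -1, bad = -1)

sentiment <- function(text){
  words <- strsplit(gsub("[^a-z ]", "", tolower(text)), " +")[[1]]
  scores <- lexicon[words]            # NA for words missing from the lexicon
  sum(scores, na.rm = TRUE)           # Positive = happier, negative = unhappier
}

sentiment("I am a hockey player. I am from Tomsk.")          # No scored words: 0
sentiment("I am happy to get the opportunity to study.")     # Two positive words: 3
```

Word counting of this kind ignores negation and sarcasm ("not happy" still scores as positive), so the results deserve a manual spot check.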

Sentiment Analysis

Remember when I asked you to describe yourself?

"I am a hockey player. My course paper is about migration. I am from Tomsk."

Sentiment = 0

"I am happy to get the opportunity to study sociology in the honors group. I am conducting my course project with <peer>. Our topic is on economic sociology."

Sentiment = 2.5

Sentiment Analysis

Let's test a hypothesis!

Female students tend to express more positive sentiments than male students.

Data: Student self-descriptions

Processing: Data Science Toolkit

Analysis: Difference of Means, t-test

Findings:

Means: males = 0.22, females = 1.57
Statistics

t = 6.91***
df = 12.53

95% confidence interval: [1.00, ∞)
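
The fractional degrees of freedom and the one-sided interval suggest a one-tailed Welch test. The sketch below shows how such a test could be run; the scores are hypothetical and will not reproduce the findings above.

```r
# Hypothetical sentiment scores; NOT the class data behind the findings above
males   <- c(0, 0.5, 0, -0.5, 1, 0, 0.5)
females <- c(1.5, 2, 1, 2.5, 1.5, 1, 2)

# One-tailed Welch test: H1 is that the female mean exceeds the male mean
result <- t.test(females, males, alternative = "greater")
result$statistic    # t
result$parameter    # Welch degrees of freedom (need not be a whole number)
result$conf.int     # One-sided interval: lower bound up to Inf
```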

Photo Analysis

Can a computer detect...

The number of people?
Their gender?
If they are smiling?
If they have mustaches?

Photo by nosound

Photo Analysis

Application

Jetpac

Secondary Data

Data is all around us...

...and it's often free or cheap.

Secondary Data

Where to find it?

Libraries and electronic archives
Statistical abstracts
Published articles

Bibliographies
Tables and figures
Ask the authors

Libraries and Electronic Archives

Some examples

Statistical Abstracts

Organized by subject

Offers description and reference to contemporary data

Publication of the Statistical Abstract of the United States stopped in 2012 due to budget cuts

Published Articles

Read the friggin' bibliography!

Andrews (2001:91)

Andrews (2001:92, 94-5)

Figures

McAdam (1983:739)

Figures

Marshakova, Irina V. 1981. Scientometrics 3, 1: 13-26.

Marshakova, Irina V. 1973. Scientific and Technical Information Serial of VINITI 6: 3-8

Figures

How would you convert a time series like McAdam (1983) into data?

Software

Ask the Authors

Their contact information is provided for a reason!

Reasons they say "no":

Privacy
Proprietary restrictions
Future intentions with data
Data lost
Busy
Dishonesty

Reasons they say "yes":

It helps science progress
It strengthens the community
Demonstrates honesty and the integrity of the findings

Who creates and releases data?

Image courtesy of Nicknilov

Questions

What types of organizations are these?

What types of information do they release?

What are the intended purposes for the data?

Typical units with public data?

Image courtesy of Brion VIBBER

Decades

...1900

1910

1920

1930

1940

1950

1960

1970

1980

1990

2000...

Years

2000

2001

2002

2003

2004

2005

2006

2007

2008

2009

2010

2011...

Divisions within a Year

Quarter 1

Quarter 2

Quarter 3

Quarter 4

Module I

Module II

Module III

Module IV

Months of the Year

January

February

March

April

May

June

July

August

September

October

November

December

What type of units are these?

Which fallacy are these units prone to?

Combining Datasets

"Mashups"

Benefits

Greater context
Additional insight and contribution
Often free

Requires common identifier

Geographical unit
Time
Industry, School, other organizations

Doesn't require common unit of analysis

Multilevel models
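
A rough sketch of a mashup: individual-level rows pick up a region-level attribute through a shared identifier. All names and values here (`region_income`, `respondents`, `attach_context`) are hypothetical and only illustrate the join.

```python
# Hypothetical region-level context data, keyed by the common identifier
region_income = {"Tomsk": 31.2, "Moscow": 58.9}

# Hypothetical individual-level rows that carry the same identifier
respondents = [
    {"id": 1, "region": "Tomsk", "trust": 3},
    {"id": 2, "region": "Moscow", "trust": 4},
    {"id": 3, "region": "Kazan", "trust": 2},
]

def attach_context(rows, context, key, new_field):
    """Left-join a higher-level attribute onto lower-level rows."""
    merged = []
    for row in rows:
        combined = dict(row)
        # None marks rows whose identifier has no match in the context data
        combined[new_field] = context.get(row[key])
        merged.append(combined)
    return merged

mashup = attach_context(respondents, region_income, "region", "region_income")
```

The units of analysis differ (people versus regions), yet the join still works because both datasets share the region identifier. The unmatched third row is a reminder to check coverage before analyzing.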

Limitations

Have the analyses been done before?
Is the data appropriate for the research question?
Do you know the details of measurement and collection?
Limited to the original variables and constructs.
Are the items measured a proxy for your interests?
No control over data collection.
What information isn't collected?
Administrative capacity.

Assignment

Challenge

Characteristics of the data behind the next diagram:

All participants understood the data would be analyzed

(Admittedly, for a different purpose)
The data was collected unobtrusively

The data was reactive

All members of our class are participants
All members of our class

Are capable of collecting it--no researcher privilege
Have seen the original data--no privacy violation

The data was free and quick to collect
Data is a proxy measurement for "association"

Where did the data come from?

Implications

Accuracy

Behavioral
Proxy requires researcher inference

Ethical

Informed consent
Anonymity

Examples of related data?

Building pass
Assignment completion time
Credit card purchases
Metro rides
...

Historical Research

The following lesson relies upon and draws heavily from Neuman (2007), Chapter 12.

Content Analysis Assignment

Two comics

Linda Lark: Student Nurse #1

Nurse Betsy Crane Vol 2 #13

Similar books

Subject matter

Young, white female nurses
Romance

Audience

Time: Oct - Dec & Oct 1961

Different publishers

Price
Comics Code Authority policy

Historical Circumstances

"Golden Age" ~ 1938-1956
1954

Comics Code Authority established
Seduction of the Innocent

"Silver Age" ~ 1956-1971

Comics and gender
Fear and anxiety

Second Wave of Feminism ~ 1960s-1980s

Feminine Mystique (1963)

Hypotheses

Less gender equity in comics without CCA stamp

Niche market and limited oversight
(Not applicable for Dell Comics)

Less gender equity in comics with CCA stamp

Code recreated the existing unequal discourse
Code reduced creative outlet

Emphasis on male-centric themes

No difference in gender equity

Non-objectionable content, no need for code
Mass popularity

Cultural regression to the mean

Operationalization

How did we measure gender equity?

Results

Assignment Questions

What was the case selection strategy?

What was the sampling strategy employed?

What are the suggested historical implications for gender socialization?

What are the other limitations to this exercise?

Historical Research: Introduction

Purpose

Challenge existing explanations and assumptions
Expand subject of inquiry to new settings

Specify or generalize

Difficulties

Requires rich knowledge base on both

Culture
History

Steps

Conceptualization
Locating evidence
Evaluating evidence
Organizing evidence
Synthesizing findings
Narrative

Conceptualization

Loose theoretical models

Read some existent theoretical literature
Imagine plausible models

Background materials on case

Encyclopedias
Chronologies
Generalist histories

Locating Evidence

Bibliographies

Additional literature
Data sources

Periodicals, reports, and white papers
Datasets
Archival materials

Specialist libraries and archives

Identify the important sites
Locate pertinent and related materials
Follow the archive's rules
Record information and citation details

Voice, notes, photocopies, scans, etc

Evaluating Evidence

Authenticity

Original vs. secondary
Assess probabilistically

E.g., date created vs. date occurred

Reliability

Internal and external consistency
Literal vs. real meanings
Created for researcher purposes?

Author's ability to be truthful

Socially able
Physically able

(Milligan, JD. [1979] History and Theory 18:2:177-96.)

Organizing Evidence

Create a system of organization

Spreadsheets
Tagging systems

Schemes

Theoretical
Chronological
Case variation

Synthesis

How does your research fit into the existent literature?

Narration

Tell a compelling story for readers.

Evidence Types

Primary Sources
Secondary Sources
Running Records
Recollections

Primary Sources

Uses

Originality
Basis of historical knowledge

Downsides

Laborious
Often inaccurate
Biases

Document retention
Organization
Literacy skills

Examples?

Example: Diaries

Northrop, John Worrell. 1904. Chronicles from the diary of a war prisoner in Andersonville and other military prisons of the South in 1864. Wichita, KS. p. 66.

Secondary Sources

Uses

General understandings
Broad descriptions

Downsides

Subjectivity
Selective inclusion/exclusion
Causality

Organization
Narration
Multiple and interactive effects

Varying empirical strength

Running Records

Files and statistical documents produced by organizations.

Refer to the previous lesson.

Recollections

Individuals recounting their past experiences.

Uses

Counteracts elite bias
Absence of documentation

People
Activities

Downsides

Accuracy and recollection
Sensitivity

Example: Oral History

Dora Strong Dennis. 1995. Behind the Veil.

Comparative Research

Uses

Wider range of observations
Test cultural sensitivity behind theories

Downsides

Laborious
Sampling
Limited generalizations

Units of Comparison

Cultural regimes. Boundaries?
Nation-states. Appropriate?

Comparative Research: Data

Field research

Primary
Secondary

Survey research

Primary
Secondary

E.g., World Values Survey

Content analysis
Governmental statistical records
Historical materials

Comparative Research: Equivalence

Is the comparison appropriate?

Types of equivalence

Lexical
Contextual
Conceptual
Measurement

Social Network Analysis

Features

Intuition of social structure as ties bonding social actors
Informed by systematic empirical data
Visualization plays a substantial role
Requires mathematical and/or computational models

(Freeman 2004:3, 5)

What is a social network?

"A finite set or sets of actors and the relation or relations defined on them"
(Wasserman and Faust 1994:20)

What is an actor?

Actors are social entities

Actors do not necessarily have the ability to act

Actors (typically) are all of the same type

Formal terms for actors

Vertex
Node

Examples?

Actors may also have attributes

(e.g., age, sex, ethnicity)

What are relations?

Social ties link pairs of actors

Relations collect a specific set of ties among group members

Related formal terms

Edges
Arcs

Conceptualizing Relations

Directed or undirected?
Weighted or unweighted?

Nominal, ordinal, interval, or ratio scale?

Signed or unsigned?
Loops?
Time sensitivity?

Static
Moving window
Real-time
Accumulation and decay

Relations may also have attributes

Two Basic Measurements

Degree

"Number of edges incident upon a node"

Undirected
Directed

Indegree
Outdegree
Total (Freeman) Degree

Density

"Proportion of observed edges in a network"

Two Basic Measurements
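
A minimal sketch of both measurements on a small directed network. The node labels and arcs are invented, and the density formula assumes no loops.

```python
def degree_table(nodes, arcs):
    """Map each node to (indegree, outdegree, total Freeman degree)."""
    indeg = {v: 0 for v in nodes}
    outdeg = {v: 0 for v in nodes}
    for sender, receiver in arcs:
        outdeg[sender] += 1
        indeg[receiver] += 1
    return {v: (indeg[v], outdeg[v], indeg[v] + outdeg[v]) for v in nodes}

def density(n_nodes, n_ties, directed=True):
    """Observed ties divided by possible ties (loops excluded)."""
    possible = n_nodes * (n_nodes - 1)
    if not directed:
        possible //= 2
    return n_ties / possible

# Hypothetical network
nodes = ["a", "b", "c", "d"]
arcs = [("a", "b"), ("b", "a"), ("a", "c"), ("c", "d")]
table = degree_table(nodes, arcs)
d = density(len(nodes), len(arcs))
```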

Ways to Express a Social Network

Sociogram
Matrix
Edgelist
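
The matrix and edgelist forms hold the same information, so one can be converted to the other. A small sketch with invented node labels; the helper names are illustrative only.

```python
def edgelist_to_matrix(nodes, edges, directed=False):
    """Build an adjacency matrix; rows and columns follow the order of nodes."""
    index = {v: i for i, v in enumerate(nodes)}
    matrix = [[0] * len(nodes) for _ in nodes]
    for u, v in edges:
        matrix[index[u]][index[v]] = 1
        if not directed:
            # Undirected ties appear on both sides of the diagonal
            matrix[index[v]][index[u]] = 1
    return matrix

def matrix_to_edgelist(nodes, matrix, directed=False):
    """Recover the edgelist; undirected ties are listed once."""
    edges = []
    for i, u in enumerate(nodes):
        for j, v in enumerate(nodes):
            if matrix[i][j] and (directed or i < j):
                edges.append((u, v))
    return edges

nodes = ["a", "b", "c"]
edges = [("a", "b"), ("b", "c")]
matrix = edgelist_to_matrix(nodes, edges)
round_trip = matrix_to_edgelist(nodes, matrix)
```

Edgelists stay compact for sparse networks, while matrices grow with the square of the number of actors.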

Social Network Data

Surveys

Free response
Roster
Ego networks

Field observations
Documents

Official reports
Content analysis
Published papers

Websites

Social media
Links

Subgraphs

A set of nodes and edges within a graph

Node-generated subgraphs
Edge-generated subgraphs

Dyad Census

Dyad Census & Graph Properties

Undirected

Density (i.e., tie probability)

Directed

Density (i.e., tie probability)
Reciprocity

"You should attend funerals, because if you don't go to people's funerals, they won't go to yours."

Conceptual questions

Are null ties reciprocal?
Defined by edges or dyads?

Common measurements

Edgewise

2M / (2M + A)

Dyadic

(M + N) / (M + A + N)

Dyadic, non-null ("ratio")

M / (M + A)

M, A, and N are the counts of mutual, asymmetric, and null dyads from the dyad census
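
A sketch of the three reciprocity measurements, taking M, A, and N to be the counts of mutual, asymmetric, and null dyads. The example arcs are invented.

```python
from itertools import combinations

def dyad_census(nodes, arcs):
    """Count mutual, asymmetric, and null dyads in a directed graph."""
    arc_set = set(arcs)
    mutual = asymmetric = null = 0
    for i, j in combinations(nodes, 2):
        forward = (i, j) in arc_set
        backward = (j, i) in arc_set
        if forward and backward:
            mutual += 1
        elif forward or backward:
            asymmetric += 1
        else:
            null += 1
    return mutual, asymmetric, null

def reciprocity(m, a, n):
    """The three measurements; the first and third are undefined without ties."""
    return {
        "edgewise": 2 * m / (2 * m + a),
        "dyadic": (m + n) / (m + a + n),
        "dyadic_non_null": m / (m + a),
    }

# Hypothetical network
nodes = ["a", "b", "c", "d"]
arcs = [("a", "b"), ("b", "a"), ("a", "c"), ("c", "d")]
census = dyad_census(nodes, arcs)
scores = reciprocity(*census)
```

The same network scores very differently depending on whether null dyads count as reciprocal, which is the conceptual question raised above.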

Triad Census, Undirected

Brokerage

Characterized by only two ties among three actors

Transitivity, "clustering," triadic closure

Your friends are often friends with each other
Typically = (3*Triangles) / (Connected Triples)
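
A minimal sketch of that ratio for an undirected network. Each triangle closes three connected triples, one per center node, so counting closed triples directly gives the 3 * Triangles numerator. The edges are invented.

```python
from itertools import combinations

def transitivity(edges):
    """Share of connected triples (two-paths) that are closed."""
    neighbors = {}
    for u, v in edges:
        neighbors.setdefault(u, set()).add(v)
        neighbors.setdefault(v, set()).add(u)
    triples = 0
    closed = 0
    for center, nbrs in neighbors.items():
        # Every pair of a node's neighbors forms one triple centered on it
        for a, b in combinations(sorted(nbrs), 2):
            triples += 1
            if b in neighbors[a]:
                closed += 1
    return closed / triples if triples else 0.0

# Hypothetical network: one triangle (a, b, c) plus a pendant node d
edges = [("a", "b"), ("b", "c"), ("a", "c"), ("c", "d")]
score = transitivity(edges)
```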

Triad Census, Directed

Brokerage

i → j → k, i ↛ k, k ↛ i

Transitivity

Weak (most common)

If i → j → k, then i → k

Strong

i → j → k if and only if i → k

Cycles

i → j → k → i

Walks

"A walk is a sequence of nodes and lines, starting and ending with nodes, in which each node is incident with the lines following and preceding it in the sequence." - Wasserman and Faust (1994:105)

Trail

A walk such that every edge traversed is unique

(yet not necessarily every node)

Path

A trail such that every vertex traversed is distinct

There could be zero, one, or multiple walks, trails, and paths between any two vertices!

Seven Bridges of Königsberg

Problem: Find a walk that crosses every bridge exactly once

Euler (1735) proved there is no solution for the walk

Land masses are nodes, bridges are edges
Would need zero or two nodes of odd degree

(Image modified by Bogdan Giuşcă)
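
Euler's degree argument can be checked in a few lines. The land-mass labels A through D are assumptions for illustration: A is the central island, B and C are the two river banks, and D is the second island.

```python
from collections import Counter

# Seven bridges as edges of a multigraph (labels are hypothetical)
bridges = [
    ("A", "B"), ("A", "B"),   # two bridges from the island to one bank
    ("A", "C"), ("A", "C"),   # two bridges from the island to the other bank
    ("A", "D"),               # one bridge between the two islands
    ("B", "D"), ("C", "D"),   # one bridge from each bank to the second island
]

degree = Counter()
for u, v in bridges:
    degree[u] += 1
    degree[v] += 1

odd_nodes = sorted(v for v, d in degree.items() if d % 2 == 1)
# A connected graph has such a walk only with zero or two odd-degree nodes
walk_possible = len(odd_nodes) in (0, 2)
```

All four land masses have odd degree, so the walk cannot exist.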

Measurements of Distance

Pairwise

Path length: Number of edges traversed between two nodes

Geodesic: Shortest path between two nodes

Geodesic distance: Length of the shortest path between two nodes

Graph and Subgraph

Average path length

Mean geodesic distance

Diameter: Longest geodesic distance

Courtesy of A.Savin
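
A sketch of these distance measurements using breadth-first search on an unweighted, connected, undirected graph. The four-node chain is an invented example.

```python
from collections import deque

def geodesic_distances(adjacency, source):
    """Geodesic distance from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        current = queue.popleft()
        for nbr in adjacency[current]:
            if nbr not in dist:
                dist[nbr] = dist[current] + 1
                queue.append(nbr)
    return dist

def distance_summary(adjacency):
    """Return (mean geodesic distance, diameter) over reachable pairs."""
    lengths = []
    for source in adjacency:
        dist = geodesic_distances(adjacency, source)
        lengths.extend(d for target, d in dist.items() if target != source)
    return sum(lengths) / len(lengths), max(lengths)

# Hypothetical chain: a - b - c - d
chain = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"]}
mean_distance, diameter = distance_summary(chain)
```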

Measurements of Distance

Cycles

A walk "that begins and ends at the same node" and has "at least three nodes in which all lines are distinct, and all nodes except the beginning and ending node are distinct."
Wasserman and Faust (1994:107-8)

Cycles have a length

Connectivity and Components

If a path exists between each pair of vertices in a graph, then the graph is connected

Strong connectivity: preserves path directionality
Weak connectivity: ignores path directionality

A component is a maximally connected subgraph

An isolate is the smallest possible component: a single vertex without any ties to other vertices in the graph

How many components?

Connectivity and Components

A bridge is an edge that, if removed, creates more components

A cutpoint is a node that, if removed, creates more components
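
The definitions translate directly into a brute-force check: remove each edge or node in turn and recount the components. This sketch assumes a small, simple, undirected graph; faster algorithms exist for large networks.

```python
def count_components(nodes, edges):
    """Number of components in an undirected graph."""
    adjacency = {v: set() for v in nodes}
    for u, v in edges:
        adjacency[u].add(v)
        adjacency[v].add(u)
    seen = set()
    components = 0
    for start in nodes:
        if start in seen:
            continue
        components += 1
        stack = [start]
        while stack:
            current = stack.pop()
            if current in seen:
                continue
            seen.add(current)
            stack.extend(adjacency[current] - seen)
    return components

def bridges_and_cutpoints(nodes, edges):
    """Edges and nodes whose removal creates more components."""
    base = count_components(nodes, edges)
    bridges = [
        e for e in edges
        if count_components(nodes, [x for x in edges if x != e]) > base
    ]
    cutpoints = []
    for v in nodes:
        other_nodes = [x for x in nodes if x != v]
        other_edges = [e for e in edges if v not in e]
        if count_components(other_nodes, other_edges) > base:
            cutpoints.append(v)
    return bridges, cutpoints

# Hypothetical graph: triangle (a, b, c), pendant d tied to c, isolate e
nodes = ["a", "b", "c", "d", "e"]
edges = [("a", "b"), ("b", "c"), ("a", "c"), ("c", "d")]
n_components = count_components(nodes, edges)
bridges, cutpoints = bridges_and_cutpoints(nodes, edges)
```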

Find Bridges and Cutpoints

Centrality and Centralization

Centrality: Nodal measurement

Who are the most important actors in a network?

Centralization: Graph measurement

How much difference in "importance" is there between actors within a network?

Generally, compares the observed network's centralization against the theoretical maximum

Centrality and Centralization

Degree
Betweenness
Closeness

(Freeman 1979)

Betweenness

How many geodesics go through a node (or edge)?

Variations

Edge weighted
Edge betweenness
Proximity, Scale Long Paths, and Cutoff
Endpoints

Random walk

Closeness

Q: What is closeness?

A: The inverse of farness!

Q: What is farness?

If connected, the sum of a node's geodesic distances to all other nodes

Variations:

Unconnected graphs

Edge weighted

Random walk
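
A minimal sketch of farness and closeness for a connected, unweighted graph. The star network is an invented example.

```python
from collections import deque

def farness(adjacency, source):
    """Sum of geodesic distances from source to all other nodes."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        current = queue.popleft()
        for nbr in adjacency[current]:
            if nbr not in dist:
                dist[nbr] = dist[current] + 1
                queue.append(nbr)
    if len(dist) < len(adjacency):
        # Unconnected graphs need one of the variations listed above
        raise ValueError("graph is not connected")
    return sum(dist.values())

def closeness(adjacency, source):
    """Closeness as the inverse of farness."""
    return 1 / farness(adjacency, source)

# Hypothetical star: one hub tied to three leaves
star = {"hub": ["b", "c", "d"], "b": ["hub"], "c": ["hub"], "d": ["hub"]}
hub_score = closeness(star, "hub")
leaf_score = closeness(star, "b")
```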

Cohesive Subgroups

“the forces holding the individuals within the groupings in which they are” - Moreno and Jennings (1937:137)

Cohesive groups tend to

Interact relatively frequently
Have strong, direct ties within themselves
Display high internal density
Share attitudes and behaviors within themselves
Exert pressure and social norms internally

Cliques

A maximally complete subgroup - Luce and Perry (1949)

~In other words~

Everyone has a tie to everyone else in the subgroup (complete)

No larger complete subgroup contains it: adding any other actor would break completeness (maximal)

Critique: Too stingy!

4-cliques

3-cliques

2-cliques

k-cores

Cohesive "seedbeds" nested within a network

Minimum #ties (k) each member of a subgroup has to other subgroup members

Directed graphs may measure k-cores through

Ties going inward
Ties going outward
Total ties

Alvarez-Hamelin et al. (2006); Seidman (1983)
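
One simple way to find a k-core in an undirected network is to keep pruning members who have fewer than k ties to the remaining members. A sketch with invented edges:

```python
def k_core(edges, k):
    """Nodes in the k-core of an undirected graph (isolates are ignored)."""
    neighbors = {}
    for u, v in edges:
        neighbors.setdefault(u, set()).add(v)
        neighbors.setdefault(v, set()).add(u)
    members = set(neighbors)
    changed = True
    while changed:
        changed = False
        for v in list(members):
            # Count only ties to nodes that are still in the candidate core
            if len(neighbors[v] & members) < k:
                members.discard(v)
                changed = True
    return members

# Hypothetical network: triangle (a, b, c) plus pendant d
edges = [("a", "b"), ("b", "c"), ("a", "c"), ("c", "d")]
core_1 = k_core(edges, 1)
core_2 = k_core(edges, 2)
core_3 = k_core(edges, 3)
```

The cores are nested: every member of the 2-core also belongs to the 1-core.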

1-core

1 and 2-cores

1, 2, and 3-cores

Major Research Topics in Brief

Homophily
Diffusion
Tie formation models

Homophily

("Assortativity")

Birds of a feather flock together

Homophily

Categorical vs. continuous variables

Sources?

Which relationships?

Feld's Foci

Forms of homophily

Generalized
Differential
Matching

Intervening considerations

Population effects
Degree correlated attributes
Triadic closure

Homophily

E-I Index

One (of many) measurements

EI = ( E - I ) / ( E + I )

E = #Ties between subunits

I = #Ties within subunits

Range: [-1, 1]

Lower values: More homophily

Higher values: Less homophily

Krackhardt (2003) The Journal of Applied Behavioral Science
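
A quick sketch of the E-I index. The group labels and ties are invented.

```python
def ei_index(edges, group):
    """(E - I) / (E + I), where E counts ties between groups and I within."""
    external = sum(1 for u, v in edges if group[u] != group[v])
    internal = len(edges) - external
    return (external - internal) / (external + internal)

# Hypothetical groups and ties
group = {"a": "x", "b": "x", "c": "x", "d": "y"}
edges = [("a", "b"), ("b", "c"), ("a", "c"), ("c", "d")]
score = ei_index(edges, group)
```

Three of the four ties are within-group, so the index comes out negative, on the homophilous side of the range.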

Diffusion

The spread of a behavior or attribute

Diffusion

Requirements

An artifact
A sender
A receiver
A channel

Diffusion

Relationship to previous adopter increases a receiving node's propensity to adopt

Diffusion

Considerations

Account for homophily
Theorizing channels and artifacts

What are some artifacts that could diffuse?
Which channels could diffuse these artifacts?

Conceptualizing time

Adoption rate
Decay

Inhibitors

Modeling

How do ties form?

Preferential attachment
Homophily / assortativity
Block models
Small world
Other network evolution models

Preferential Attachment

Cumulative Advantage
Matthew Effect (Merton)

"For everyone who has will be given more, and he will have an abundance. Whoever does not have, even what he has will be taken from him." (Matthew 25:29)

Friendship Paradox (Feld 1991)

Sensor research & epistemology

P(X = x) ~ x^(-alpha)

P(X = x) is the probability of observing a node with degree x

The relationship is typically fit only to nodes at or above some minimum degree

alpha is the scaling exponent

(Barabási and Albert 1999)
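
A stripped-down sketch of preferential attachment: each newcomer forms one tie, and the chance of picking an existing node is proportional to its current degree. Barabási and Albert's model lets each newcomer form m ties; fixing m at one is a simplification made here.

```python
import random

def preferential_attachment(n_nodes, seed=None):
    """Grow a network in which well-connected nodes attract newcomers."""
    rng = random.Random(seed)
    edges = [(1, 0)]
    # Each node appears once per tie it holds, so a uniform draw from
    # this list selects nodes in proportion to their degree
    endpoints = [1, 0]
    for new in range(2, n_nodes):
        target = rng.choice(endpoints)
        edges.append((new, target))
        endpoints.extend([new, target])
    return edges

edges = preferential_attachment(200, seed=42)
```

Tallying degrees from the output should show the skew of cumulative advantage: many nodes with a single tie and a few with many.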

Blockmodels

Focus upon positions or "roles," not actors

Comprised of

Discrete subsets of actors into "positions"
Relationships within and between positions

Potential hypotheses

Relationship between positions and attributes
Structure of relationships

The following examples from Wasserman and Faust (1994:423)

Cohesive Subgroups

Center-periphery

Centralized

Hierarchy

Transitivity

Block Modeling

Which hypotheses do we have regarding the blocked structure of our class?

Time Point 1

Time Point 2

Time Point 3

Small World

Watts and Strogatz (1998)

Properties

High clustering
Short path lengths

Inspired by Milgram's "small world" experiment

Small World

Watts and Strogatz (1998)

Begin with a ring-type graph ("lattice")
Connect each node to k others
Rewire (switch) the edges with β probability

Reality somewhere between

"Connected caveman graph"
Random ties
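
A rough sketch of the three steps, assuming k is even so that each node starts with k/2 neighbors on each side. It follows the spirit of Watts and Strogatz (1998) rather than reproducing their exact procedure.

```python
import random

def small_world(n, k, beta, seed=None):
    """Ring lattice of n nodes and k neighbors each, rewired with probability beta."""
    rng = random.Random(seed)
    edges = set()
    for i in range(n):
        for step in range(1, k // 2 + 1):
            edges.add(frozenset((i, (i + step) % n)))
    for i in range(n):
        for step in range(1, k // 2 + 1):
            old = frozenset((i, (i + step) % n))
            if old in edges and rng.random() < beta:
                # Avoid loops and ties that already exist
                candidates = [
                    j for j in range(n)
                    if j != i and frozenset((i, j)) not in edges
                ]
                if candidates:
                    edges.remove(old)
                    edges.add(frozenset((i, rng.choice(candidates))))
    return edges

lattice = small_world(20, 4, 0.0)
rewired = small_world(20, 4, 0.5, seed=1)
```

With beta at zero the lattice stays intact (high clustering, long paths). As beta approaches one the network approaches random ties. The number of edges stays constant throughout.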

Small World

Picture courtesy of Arpad Horvath

Other Evolutionary Models

Key questions as time proceeds:

Actors

Can they join a network?

If so, do they form a tie upon joining?

Can they exit a network?

Edges

Can they form? Dissolve? Rewire?

Mechanisms

Under which circumstances do...

Actors join or exit a network?
Edges form, dissolve, or rewire?

Actor-oriented or tie-based?

Other Evolutionary Models

References

Robins and Pattison. 2001. "Random graph models for temporal processes in social networks." Journal of Mathematical Sociology, 25:5-41.
Toivonen et al. 2009. "A comparative study of social network models: Network evolution models and nodal attribute models." Social Networks, 31:4:240-54.
Snijders et al. 2010. "Introduction to stochastic actor-based models for network dynamics." Social Networks 32:1:44-60.
Krivitsky and Handcock. 2014. "A separable model for dynamic networks." Journal of the Royal Statistical Society: Series B, 76:1:29-46.

Separable Temporal Exponential Random Graph Models

(STERGM)

Modeling two separate processes

Tie formation
Tie dissolution

Constants

Actors do not join or leave
Actor attributes remain fixed

Data: Student communication networks

STERGM

Which hypotheses do you have regarding...

tie formation?
tie dissolution?

My hypotheses

edges + group homophily
triangles
two-paths (dissolution only)

STERGM

Edges + Homophily

Probability of a within group tie forming: 0.24
Probability of a between group tie forming: 0.08
Probability of a within group tie persisting: 0.80
Probability of a between group tie persisting: 0.44

Edges + Homophily + Triangle

Tie formation

No significant triangle effect
Probability of a within group tie forming: 0.18

Tie dissolution

Probability of a triangle persisting: 0.71
No significant homophily effect now

STERGM

Differential models

Formation: Edges + Homophily
Dissolution: Edges + Triangles + Two-Paths
Results

Formation

Probability of within group tie formation: 0.22
Probability of between group tie formation: 0.08

Dissolution

Probability of a triangle persisting: 0.96
Probability of a two-path persisting: 0.78

Observed vs. Simulated Networks

Exercise

Produce a research design with the following elements

A research question
A social network with clearly defined

Actors
Relationships

Two hypotheses
A means to test them with network

Data
Measurements

Simulations and Agent-Based Models

Why would a social scientist conduct computer simulations?

Ideas?

An Anecdote

FiveThirtyEight.com

Image from Kyle Hill

An Anecdote

FiveThirtyEight.com

Methodology

Weighted Polling Average
Adjusted Polling Average
FiveThirtyEight Regression
FiveThirtyEight Snapshot
Election Day projection
Error analysis
Simulation

Why would a social scientist conduct computer simulations?

Ideas?

Methodological Robustness

Data limitations

Measurement
Model fit

Theory

Hypothesis testing
Hypothesis generating

Prediction

Data Limitations

What are some typical data problems?

(Assuming the operationalization matches the conceptualization.)

Response rate issues

Low response rate
Unrepresentative respondents

Missing data
Error

Respondent error
Interviewer error

Response Rate Issues

Low response rates are often addressed with bootstrapping

Steps to bootstrap a measurement

1. Calculate an initial measurement
2. Resample within the responses with replacement
3. Recalculate the measurement
4. Repeat Steps 2 and 3 many times (thousands)
5. Construct a confidence interval
6. Infer the population parameter
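
The steps above can be sketched with a percentile interval, which is one of several ways to construct the confidence interval. The scores are invented.

```python
import random
from statistics import mean

def bootstrap_ci(responses, statistic=mean, n_resamples=2000, level=0.95, seed=None):
    """Return (initial estimate, lower bound, upper bound)."""
    rng = random.Random(seed)
    # Resample with replacement and recalculate, many times over
    estimates = sorted(
        statistic(rng.choices(responses, k=len(responses)))
        for _ in range(n_resamples)
    )
    tail = (1 - level) / 2
    lower = estimates[int(tail * n_resamples)]
    upper = estimates[int((1 - tail) * n_resamples) - 1]
    return statistic(responses), lower, upper

# Hypothetical responses
scores = [0.0, 2.5, 1.0, 1.5, 0.5, 2.0, 1.0, 0.0]
estimate, lower, upper = bootstrap_ci(scores, seed=1)
```

Bootstrapping describes uncertainty given the responses in hand. It does not repair unrepresentative respondents.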

Missing Data

Why worry about it?

Techniques

Case removal (default)

Some problems, though...

Imputation Methods

Additional data and inference (best)
Random value from within the data

Completely at random
Similar case

Model missingness and estimate
Multiple imputation

Impute -> Measure -> Repeat -> Combine

Error

How does respondent error enter a dataset?

How does interviewer error enter a dataset?

Error

Practically all data contain some error

Are your findings robust against it?

A way to find out:

1. Take original measurement
2. Introduce error to your data
3. Remeasure from error-riddled data
4. Repeat Steps 2 and 3 many times
5. Repeat Step 4, increasing the simulated error
6. Compare simulated error to observed

Simulated Experiment

What are some forms of respondent error in our peer communication networks?

I propose the following peer selection error

I intend to select <A>, but I instead select <B>
Assumed constants

Number of vertices, edges
A student's degree
Mistakes are random

Simulated Experiment

Let's focus on transitivity

For each time point we introduce more selection error

Error rates: 5%, 10%, 15%, 20%, 25%

At each time point and error rate we "rewire" edges

"Rewiring" means edges randomly trade partners
The proportion of edges rewired equals the error

We measure transitivity using these faulty networks
We repeat each step a thousand times
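
One way to sketch the rewiring step is a degree-preserving swap in which two edges trade partners. This is an assumed implementation for undirected networks, not necessarily the one used for the class results.

```python
import random

def degree_sequence(edges):
    """Map each node to its number of ties."""
    counts = {}
    for u, v in edges:
        counts[u] = counts.get(u, 0) + 1
        counts[v] = counts.get(v, 0) + 1
    return counts

def rewire(edges, error_rate, seed=None):
    """Rewire roughly error_rate of the edges while keeping every degree fixed."""
    rng = random.Random(seed)
    edges = [tuple(e) for e in edges]
    existing = {frozenset(e) for e in edges}
    n_swaps = round(error_rate * len(edges) / 2)  # each swap changes two edges
    done = attempts = 0
    while done < n_swaps and attempts < 100 * max(n_swaps, 1):
        attempts += 1
        i, j = rng.sample(range(len(edges)), 2)
        (a, b), (c, d) = edges[i], edges[j]
        new_1, new_2 = frozenset((a, d)), frozenset((c, b))
        # Skip swaps that would create loops or duplicate ties
        if len(new_1) < 2 or len(new_2) < 2 or new_1 == new_2:
            continue
        if new_1 in existing or new_2 in existing:
            continue
        existing -= {frozenset((a, b)), frozenset((c, d))}
        existing |= {new_1, new_2}
        edges[i], edges[j] = (a, d), (c, b)
        done += 1
    return edges

# Hypothetical network: a ring of eight nodes plus two chords
observed = [(i, (i + 1) % 8) for i in range(8)] + [(0, 4), (2, 6)]
faulty = rewire(observed, 0.40, seed=1)
```

Measuring transitivity on many such faulty copies at each error rate gives the distribution to compare against the observed value.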

Results of the Simulation

Is transitivity on our networks robust against random selection error?

Why or why not?
Under which conditions?

Measurement

What if your measurement results from structural properties?

E.g., Is a transitivity score of .54 high or low?

It certainly depends upon both

The number of vertices in a network
The number of edges in a network

A Refresher

On our communication networks

Transitivity decreased over time
Number of edges increased
Number of vertices stayed the same
Homophilous networks

Can interact with transitivity

Transitivity here is the proportion of closed two-paths

Let's Simulate a Homophily Network!

Constant parameters

Number of nodes
Density (approximately)
Number of groups and their sizes
Density within and between groups

Varying parameters

Random edge assignment, given above parameters
Simulated networks modeled after each time point

One thousand simulations for each time point
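
A minimal sketch of such a simulation: group membership is fixed, and each possible tie forms at random with one probability within groups and another between them. The group labels and probabilities are placeholders, not the class values.

```python
import random
from itertools import combinations

def homophily_network(group_of, p_within, p_between, seed=None):
    """Random undirected network with group-dependent tie probabilities."""
    rng = random.Random(seed)
    edges = []
    for u, v in combinations(sorted(group_of), 2):
        p = p_within if group_of[u] == group_of[v] else p_between
        if rng.random() < p:
            edges.append((u, v))
    return edges

# Hypothetical groups
group_of = {"a": "g1", "b": "g1", "c": "g1", "d": "g2", "e": "g2"}
segregated = homophily_network(group_of, 1.0, 0.0, seed=0)
mixed = homophily_network(group_of, 0.6, 0.2, seed=0)
```

Repeating the draw a thousand times and measuring transitivity on each network shows how much transitivity homophily and density produce on their own.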

Interpretation?

Model Fit

Does the model adequately produce the outcome of interest?

Linear Regression Equation

y = B * x + e

It expresses a relationship that can be simulated!

Tie formation (above) and dissolution (below)

Agent-Based Models

Exercises in Theory

Portions of the next few slides draw from Macy and Willer (2002)

Agent-Based Models

Four Common Assumptions

Agents are autonomous decision makers
Agents are interdependent and influence one another
Agents follow simple behavioral rules
Agents are adaptive and backward-looking

Agent-Based Models

Simplicity and generality are key to good models.

Why is a particular model less theoretically useful?

Agent-Based Models

Two major questions

Emergent structure, e.g.,

Convergence vs. differentiation
Influence and diffusion

Emergent social order, e.g.,

Adaptation from interactions
Yields cooperation, trust, collective action

Agent-Based Models

Common explanatory factors

Homophily
Transitivity
Density
Relational stability (persistence)
Reciprocity

Agent-Based Models

Quality models should

Be simple
Avoid relying upon biological metaphors
Provide rigorous experiments
Be robust to changes
Exhibit external validity
Test the validity of the field's claims
Emphasize effects resulting from social factors

Benefits of Simulation Methods

Cost
Respondent cooperation
Ethical

Limitations of Simulation Methods

Often lack empirical data
Assumptions and complexity reduction

Or is it a benefit?

Assignment

Demonstration

Review one network model
Review one diffusion process
Formulate three hypotheses
Simulate network and diffusion process
Evaluate hypotheses

One Network Model

Alexei Vazquez. 2003. "Growing Network with Local Rules: Preferential Attachment, Clustering Hierarchy, and Degree Correlations." Physical Review E 67, 056104

Network Growth

Upon each turn

Networks add a vertex (probability [1 - u])

New vertex ties to random old vertex
Potential ties created from new two-paths

A random two-path closes (probability u)

Turns continue until N vertices introduced
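
A loose sketch of the growth rules as this slide describes them. It is not the code behind the later results, and it makes two assumptions: the network is seeded with two tied vertices, and a vertex is added whenever no open two-path is waiting to close.

```python
import random

def growing_network(n_vertices, u, seed=None):
    """Grow a network by adding vertices and closing two-paths."""
    rng = random.Random(seed)
    edges = {frozenset((0, 1))}
    neighbors = {0: {1}, 1: {0}}
    potential = set()  # open two-paths that may later close
    while len(neighbors) < n_vertices:
        if potential and rng.random() < u:
            # A random two-path closes (probability u)
            tie = rng.choice(sorted(potential, key=sorted))
            potential.discard(tie)
            a, b = tie
            edges.add(tie)
            neighbors[a].add(b)
            neighbors[b].add(a)
        else:
            # A new vertex ties to a random old vertex (probability 1 - u)
            new = len(neighbors)
            old = rng.randrange(new)
            for other in neighbors[old]:
                potential.add(frozenset((new, other)))
            neighbors[new] = {old}
            neighbors[old].add(new)
            edges.add(frozenset((new, old)))
    return edges

tree = growing_network(50, 0.0, seed=1)
network = growing_network(100, 0.35, seed=1)
```

With u at zero no two-path ever closes, so the result is a tree. Raising u adds closed triangles, which raises density and transitivity together.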

One Network Model

Vazquez (2003)

Parameters

N: Number of vertices
u: Density & transitivity

Properties

Skewed degree distribution
Transitivity
Degree correlation

One Network Model

Vazquez (2003)

We're going to add one slight modification for added realism...

Randomly rewire edges with probability p.

Implications:

Constant

Degree distribution
Density
Number of vertices

As p increases

Transitivity decreases
Number of components increases

One Diffusion Process

Mark Granovetter. 1978. "Threshold Models of Collective Behavior." American Journal of Sociology 83:6:1420-43.

Outcome of interest: binary decisions

Examples

Innovation adoption
Rumors and diseases
Strikes
Voting
Educational attainment
Leaving social occasions and migration
Jumping onto the dance floor

One Diffusion Process

Mark Granovetter. 1978. "Threshold Models of Collective Behavior." American Journal of Sociology 83:6:1420-43.

Two main ideas

Each individual has a threshold

A minimum proportion of people engaging in collective action required for the individual to engage
Varies across vertices from 0 to 1

We'll be using a random uniform distribution

Network ties can encourage or discourage engagement

Granovetter says they count twice

We'll be varying this constant

One Diffusion Process

Mark Granovetter. 1978. "Threshold Models of Collective Behavior." American Journal of Sociology 83:6:1420-43.

F(threshold_i, x_{it}) = Decision of i to engage at time t

F(threshold_i, x_{it}) = x_{it} > threshold_i

x_{it} = Engaged_{it} / (Engaged_{it} + Unengaged_{it})

Engaged_{it} = EngagedPeers_{it} * (PeerEffect - 1) + AllEngaged_t

Unengaged_{it} = UnengagedPeers_{it} * (PeerEffect - 1) + AllUnengaged_t
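
A sketch of one round of these decisions. Two things are assumed rather than stated above: every actor updates at the same time, and the population counts include the actor making the decision. The network, thresholds, and names are invented.

```python
def threshold_step(engaged, thresholds, neighbors, peer_effect):
    """Return every actor's decision after one synchronous update."""
    all_engaged = sum(engaged.values())
    all_unengaged = len(engaged) - all_engaged
    updated = {}
    for i, threshold in thresholds.items():
        engaged_peers = sum(engaged[j] for j in neighbors[i])
        unengaged_peers = len(neighbors[i]) - engaged_peers
        # Peers already count once in the population totals, so they
        # receive (peer_effect - 1) additional weight
        weighted_engaged = engaged_peers * (peer_effect - 1) + all_engaged
        weighted_unengaged = unengaged_peers * (peer_effect - 1) + all_unengaged
        x = weighted_engaged / (weighted_engaged + weighted_unengaged)
        updated[i] = x > threshold
    return updated

# Hypothetical chain a - b - c - d, with only a engaged at the start
neighbors = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"]}
thresholds = {"a": 0.0, "b": 0.3, "c": 0.5, "d": 0.9}
engaged = {"a": True, "b": False, "c": False, "d": False}
after_one_round = threshold_step(engaged, thresholds, neighbors, peer_effect=2)
```

With a peer effect of 2 an engaged neighbor counts twice, which is enough to tip b (x = 2/6) but not c (x = 1/6).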

Example

Network

N = 100
u = 0.35
Rewiring probability (p) = 0.20

Diffusion process

Initially engaged = 5%
Thresholds = random uniform distribution
Peer effect = 2
Number of trials = 5

Research Question

Given these theories on network tie formation and decision-making, why do some collective action episodes escalate more quickly than others?

Hypotheses

Density and transitivity

As density increases along with transitivity, the rate of collective action engagement will increase.

Randomization

As the proportion of rewired edges increases, transitivity will decrease and the number of components will increase, preventing collective action growth.

Strength of peer effect

Increasing the effect from one's network ties will increase the rate of collective action growth.

Our Simulation

Code available here:

http://pastebin.com/u0dNiizy

Steps:

Create a random network

Use Vazquez model with N = 100 and u
Randomly rewire edges with probability p

Set thresholds with a random uniform distribution
Set a random 5% of population as engaged
Run through threshold model for five trials
Measure average change from one trial to the next
Repeat 100 times
Regress effects on average change

Density and Transitivity Experiment

Parameters

N = 100
Rewiring probability (p) = 0.20
Number of trials = 5
Peer effect = 2
Randomly vary u from 0 to 1

Findings

             Estimate Std. Error t value Pr(>|t|)    
(Intercept)    4.795      1.069   4.487 1.97e-05 ***
u              7.434      1.991   3.733 0.000317 ***
Multiple R-squared:  0.1245,    Adjusted R-squared:  0.1156

Implications: Positive density (or transitivity) effect

Randomization Experiment

Parameters

N = 100
u = 0.35
Number of trials = 5
Peer effect = 2
Randomly vary rewiring probability, p, from 0 to 1

Findings

            Estimate Std. Error t value Pr(>|t|)    
(Intercept)    7.855      1.113   7.057 2.45e-10 ***
p             -2.234      1.846  -1.210    0.229
Multiple R-squared:  0.01472,   Adjusted R-squared:  0.004666

Implications: Transitivity has no effect aside from density

Both Experiments

Parameters

N = 100
Number of trials = 5
Peer effect = 2
Randomly vary rewiring probability, p, from 0 to 1
Randomly vary u from 0 to 1

Findings

                    Estimate Std. Error t value Pr(>|t|)    
(Intercept)            6.594      1.294   5.094 1.73e-06 ***
u                      3.880      1.651   2.350   0.0208 *  
p                     -2.790      1.750  -1.595   0.1141    
Multiple R-squared:  0.07746,   Adjusted R-squared:  0.05844

Implications: Transitivity has no effect aside from density

Peer Effect Experiment

Parameters

N = 100
u = 0.35
Rewiring probability (p) = 0.20
Number of trials = 5
Randomly vary peer effect from 1 to 5

Findings

                    Estimate Std. Error t value Pr(>|t|)    
(Intercept)           8.2278     1.5214   5.408  4.5e-07 ***
Peer Effect          -0.4018     0.4621  -0.869    0.387    
Multiple R-squared:  0.007653,  Adjusted R-squared:  -0.002473

Implications: Peer effect does not affect growth rate.

Conclusions

We have simulated two processes

Network tie formation
Interdependent decision making

We tested the "rate of engagement"

Average change in engagement over time

We have confirmed one hypothesis and rejected two

Confirmed: Density increases growth rate
Rejected:

Transitivity has no effect after controlling for density
Peer effect does not affect growth rate

Somewhat contradictory finding

Discussion

How would you improve upon this model?

Varying peer effect

Across actors
Over time

Add and remove edges

Contemporary Challenges and Future Directions in Methods

Types of Challenges

Methods Guiding Theory

Everyday Methodological Proliferation

Methods Guiding Theory

Abbott. 1988. “Transcending General Linear Reality.” Sociological Theory.

y = X * b + u

X(t) = X (t - 1) * B + U

Issues

Fixed set of actors*
Monotonic causal flow
Univocal meaning*
No sequential effects
Casewise independence
Context independent*

Methods Guiding Theory

Abbott. 1988. “Transcending General Linear Reality.” Sociological Theory.

y = X * b + u

X(t) = X (t - 1) * B + U

Better Reality Models

Demographic models
Sequential models
Network models

Everyday Methodological Proliferation

Savage and Burrows. 2007. “The Coming Crisis of Empirical Sociology.” Sociology.

Savage and Burrows. 2009. “Some Further Reflections on the Coming Crisis of Empirical Sociology.” Sociology.

Sociologists no longer have a monopoly on social data.

Who collects the most data? How?

Anecdote

Target

Everyday Methodological Proliferation

Savage and Burrows. 2007. “The Coming Crisis of Empirical Sociology.” Sociology.

Savage and Burrows. 2009. “Some Further Reflections on the Coming Crisis of Empirical Sociology.” Sociology.

Survey Problems

Response rates
Homogeneous units
Commercial surveys

Related Search Terms

Everyday Methodological Proliferation

Savage and Burrows. 2007. “The Coming Crisis of Empirical Sociology.” Sociology.

Savage and Burrows. 2009. “Some Further Reflections on the Coming Crisis of Empirical Sociology.” Sociology.

In-Depth Interview Problems

Mastery outside of sociology

Who?

More commonplace
Less interesting

Why conduct interviews when...

Individual-level Public Data

Advantages

Relatively accurate and detailed
Behavioral
Contextual, ripe for secondary data

Education
Occupation
Geography
Relationships

Problems

Selective sample
Selective information
Same issues can be said of interviews

Demonstration

Netlytic.org

#NotAllMen / #YesAllWomen

Suggested Directions

Continue to carefully construct good theories
Collect data based upon its ability to address theory
Case study designs

Field research
Historical cases
Comparative cases

Triangulate data sources
Be very wary of sampling issues
Select (and create) analytic methods according to theory

Example: Wiki Surveys

Primary characteristics

Greedy
Collaborative
Adaptive

Pairwise comparisons

Relative importance
Establishes rank order with uncertainty modeled

Assignment

Which ethnic group is more successful in Russia?

Take a screenshot after you cast your first vote
Cast a total of 25 votes
Take a screenshot after your last vote
Upload those screenshots to imgur.com

Not required to create an account and log in

Complete this form
