Data Methods: Survey

# Data Methods: Survey

Karl Ho

School of Economic, Political and Policy Sciences

University of Texas at Dallas

## What is survey?

American Statistical Association (ASA)

"Survey" is used most often to describe a method of gathering information from a sample of individuals. This "sample" is usually just a fraction of the population being studied.

## What is survey?

A "survey" is a systematic method for gathering information from (a sample of) entities for the purposes of constructing quantitative descriptors of the attributes of the larger population of which the entities are members.  (Groves et al. 2009)

## What is survey?

The word "systematic" is deliberate and meaningfully distinguishes surveys from other ways of gathering information.

The phrase "(a sample of)" appears in the definition because sometimes surveys attempt to measure everyone in a population and sometimes just a sample.

## Historical moments of survey

Harry Truman displays a copy of the Chicago Daily Tribune newspaper that erroneously reported the election of Thomas Dewey in 1948. Truman's narrow victory embarrassed pollsters, members of his own party, and the press who had predicted a Dewey landslide.

## What is survey?

The purpose of a survey is to produce statistics, that is, quantitative or numerical descriptions of some aspects of the population under study.

The main way of collecting information is by asking people questions; their answers constitute the data to be analyzed.

## What is survey methodology?

Survey methodology seeks to identify principles about the design, collection, processing, and analysis of surveys that are linked to the cost and quality of survey estimates.

## What is survey methodology?

This means that the field focuses on improving quality within cost constraints, or, alternatively, reducing costs for some fixed level of quality. “Quality” is defined within a framework labeled the total survey error paradigm. Survey methodology is both a scientific field and a profession. (Groves et al. 2009)

## Critiques of Internet Surveys

• Question: Abuse of online questionnaire
• Question: Internet survey findings questionable
• Answer: Different modes generate very similar results (Sanders et al. 2007)
• Question: Representativeness?

## Budget factors

• Staff time for planning and administration
• Sample selection costs
• Under- or over-represented segments
• Cost of “cleaning” the final data
• Analyst costs
• Reporting

## To start a new survey:

• Is it an ongoing survey or a one-time survey?
• What is the target population (whom is it studying)?
• What is the sampling frame (how do they identify the people who have a chance to be included in the survey)?
• What is the sample design (how do they select the respondents)?
• What is the mode of data collection (how do they collect data)?

## Survey Design

1. What is the population?
2. How to sample?
3. Anticipated data
4. Single mode or mixed mode
5. Single wave or multi-wave
6. Pilot

## Population

1. Concept of Inference

2. Sample and Population

3. What is a panel?

## Population

It is imperative to understand the concept of inference and how much the respondents' answers can support:

1. Answers accurately describe the characteristics of the individual respondent
2. Answers are representative of the population

1. Valid measure
2. Reliable measure

## Survey Measurement Design

1. Valid measure
• Validity is measuring what is supposed to be measured.
• Example:
• Wealth vs. Income
• Happiness vs. Satisfaction

## Survey Measurement Design

2. Reliable measure

• Reliability is measuring well what is supposed to be measured.
• Consistency
• Methods
• Cronbach's alpha
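
As a minimal sketch of the reliability method named above: Cronbach's alpha for k items is k/(k-1) × (1 − sum of item variances / variance of the total score). The function name `cronbach_alpha` and the scores below are hypothetical, chosen only for illustration.

```python
from statistics import variance


def cronbach_alpha(items):
    """Cronbach's alpha for a list of k items.

    Each item is a list of scores, one per respondent, in the same
    respondent order across items.
    """
    k = len(items)
    if k < 2:
        raise ValueError("need at least two items")
    # Sum of the sample variances of the individual items
    item_variance_sum = sum(variance(scores) for scores in items)
    # Total score per respondent, then its sample variance
    totals = [sum(answers) for answers in zip(*items)]
    return (k / (k - 1)) * (1 - item_variance_sum / variance(totals))


# Hypothetical 5-point Likert answers: 3 items, 5 respondents
scores = [
    [4, 5, 3, 2, 4],
    [4, 4, 3, 2, 5],
    [5, 5, 2, 3, 4],
]
print(round(cronbach_alpha(scores), 3))
```

Values closer to 1 indicate that the items move together, which is the "consistency" the slide refers to.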

## Survey Measurement Design

1. Question wording
   1. Closed-ended questions
   2. Open-ended questions
2. Question order

## Questionnaire bias

1. E.g.
The government should force you to pay higher taxes.

## Question wording

Start designing by thinking of the answers
1. Scale (Likert, Thermometer)
2. Mutually exclusive choices
3. Allow multiple selection?

## Question wording

• Add an open ended choice

## Question wording

1. Use an even number of choices
   1. To avoid middle-choice inertia
2. Allow nonresponse

## Survey response process

### Four major components

- Tourangeau, Rips, and Rasinski (2000)
The Psychology of Survey Response

## Survey response process

1. Willingness
   1. Question types
   2. Question wording
   3. Social desirability (Hawthorne effect)

2. Ability
   1. Memory
   2. Comprehensibility
   3. Culture

### Respondent Willingness and Ability to Participate in a Survey

`Robinson, S.B. and Leonard, K.F., 2018. Designing Quality Survey Questions. SAGE Publications.`

## Sampling

How well a sample represents a population depends on the sample frame, the sample size, and the specific design of selection procedures. If probability sampling procedures are used, the precision of sample estimates can be calculated.  (Fowler 2009)

## Sampling and Population

Source: Fricker, R.D., 2008. Sampling methods for web and e-mail surveys. The SAGE handbook of online research methods, pp.195-216.

## Sample size

One general conservative formula, where n is the sample size:

n = 1/error^2

Example:

Use .05 as acceptable error rate (± 5 percent):

n = 1/.05^2 = 1/.0025 = 400
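
The rule of thumb above can be read as the usual margin-of-error formula for a proportion, z^2 · p(1 − p) / error^2, with the worst case p = 0.5 and z rounded up from 1.96 to 2. The sketch below compares the two; the function names are hypothetical, and the 95 percent confidence level is an assumption not stated on the slide.

```python
import math


def conservative_sample_size(error):
    """Rule-of-thumb sample size: 1 / error^2."""
    # round() guards against floating-point noise before taking the ceiling
    return math.ceil(round(1 / error**2, 6))


def proportion_sample_size(error, p=0.5, z=1.96):
    """Sample size for estimating a proportion p within +/- error.

    z = 1.96 corresponds to a 95 percent confidence level (assumed here).
    """
    return math.ceil(round(z**2 * p * (1 - p) / error**2, 6))


print(conservative_sample_size(0.05))  # rule of thumb
print(proportion_sample_size(0.05))    # exact z = 1.96, p = 0.5
```

The rule of thumb always gives a slightly larger n than the z = 1.96 version, which is why it is described as conservative.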

## Terminology

1. Target population:
The population to be studied, to which the investigator wants to generalize the results

2. Sampling unit:
The smallest unit from which a sample can be selected

3. Sampling frame:
The list of all the sampling units from which the sample is drawn

4. Sampling scheme:
The method of selecting sampling units from the sampling frame

## Probability samples

A probability-based sample is one in which the respondents are selected using some sort of probabilistic mechanism, and where the probability with which every member of the frame population could have been selected into the sample is known.

The sampling probabilities do not necessarily have to be equal for each member of the sampling frame.

## Types of probability sample

1. Simple random sampling (SRS) is a method in which any two groups of equal size in the population are equally likely to be selected. Mathematically, simple random sampling selects n units out of a population of size N such that every sample of size n has an equal chance of being drawn.
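
A minimal sketch of SRS without replacement, assuming the sampling frame is available as a list. The function name and the frame of numbered IDs are hypothetical.

```python
import random


def simple_random_sample(frame, n, seed=None):
    """Draw n units without replacement; every subset of size n is equally likely."""
    rng = random.Random(seed)  # a seed makes the draw reproducible
    return rng.sample(list(frame), n)


# Hypothetical frame: 1,000 numbered respondents
frame = range(1, 1001)
sample = simple_random_sample(frame, n=10, seed=42)
print(sample)
```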

## Types of probability sample

1. Stratified random sampling is useful when the population is composed of a number of homogeneous groups. In these cases, it can be either practically or statistically advantageous (or both) to first stratify the population into the homogeneous groups and then use SRS to draw samples from each group.
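
One way to sketch this: group the frame by stratum, then run SRS inside each group. The example assumes proportional allocation (the same sampling fraction in every stratum), which is only one of several possible allocation rules; the function name and the data are hypothetical.

```python
import random
from collections import defaultdict


def stratified_sample(frame, stratum_of, fraction, seed=None):
    """SRS within each stratum, using the same sampling fraction everywhere."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for unit in frame:
        strata[stratum_of(unit)].append(unit)
    sample = []
    for units in strata.values():
        # At least one unit per stratum so that no group is left out
        n_h = max(1, round(fraction * len(units)))
        sample.extend(rng.sample(units, n_h))
    return sample


# Hypothetical frame: (district, respondent id) pairs
frame = [("north", i) for i in range(60)] + [("south", i) for i in range(40)]
sample = stratified_sample(frame, stratum_of=lambda u: u[0], fraction=0.1, seed=1)
print(sample)
```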

## Types of probability sample

1. Cluster sampling is applicable when the natural sampling unit is a group or cluster of individual units. For example, in surveys of Internet users it is sometimes useful or convenient to first sample by discussion groups or Internet domains, and then to sample individual users within the groups or domains.
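
The two-stage procedure described above (sample clusters first, then individuals within the chosen clusters) could be sketched as follows. The function name, the discussion-group labels, and the user IDs are hypothetical.

```python
import random


def two_stage_cluster_sample(clusters, n_clusters, n_per_cluster, seed=None):
    """Stage 1: SRS of clusters. Stage 2: SRS of units within each chosen cluster.

    clusters maps a cluster name to the list of units it contains.
    """
    rng = random.Random(seed)
    chosen = rng.sample(list(clusters), n_clusters)
    return {
        name: rng.sample(clusters[name], min(n_per_cluster, len(clusters[name])))
        for name in chosen
    }


# Hypothetical discussion groups, each with ten user ids
groups = {f"group{g}": [f"user{g}_{i}" for i in range(10)] for g in range(5)}
print(two_stage_cluster_sample(groups, n_clusters=2, n_per_cluster=3, seed=7))
```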

## Types of probability sample

1. Systematic sampling is the selection of every kth element from a sampling frame or from a sequential stream of potential respondents. Systematic sampling has the advantage that a sampling frame does not need to be assembled beforehand. In terms of Internet surveying, for example, systematic sampling can be used to sample sequential visitors to a website. The resulting sample is considered to be a probability sample as long as the sampling interval does not coincide with a pattern in the sequence being sampled and a random starting point is chosen.
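
A minimal sketch of systematic sampling from a sequential stream, with a random starting point and a fixed interval k. The function name is hypothetical, and the visitor stream is simulated with a simple range.

```python
import random
from itertools import islice


def systematic_sample(stream, k, seed=None):
    """Take every kth element of a stream, starting at a random offset in [0, k)."""
    start = random.Random(seed).randrange(k)
    # islice works on any iterable, so no frame has to be assembled beforehand
    return list(islice(stream, start, None, k))


# Hypothetical stream of 100 sequential website visitors, sampling interval 10
visitors = range(100)
print(systematic_sample(visitors, k=10, seed=3))
```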

## Non-Probability samples

Non-probability samples, sometimes called convenience samples, occur when either the probability of including each unit or respondent in the sample cannot be determined, or it is left up to each individual to choose to participate in the survey.

## Types of non-probability sample

Snowball sampling is often used when the desired sample characteristic is so rare that it is extremely difficult or prohibitively expensive to locate a sufficiently large number of respondents by other means (such as simple random sampling). Snowball sampling relies on referrals from initial respondents to generate additional respondents. While this technique can dramatically lower search costs, it comes at the expense of introducing bias because the technique itself substantially increases the likelihood that the sample will not be representative of the population.

## Types of non-probability sample

Judgement sampling is a type of convenience sampling in which the researcher selects the sample based on his or her judgement. For example, a researcher may decide to draw the entire random sample from one ‘representative’ Internet-user community, even though the population of interest includes all Internet users. Judgement sampling can also be applied in even less structured ways without the application of any random sampling.

## Probability vs. Non-probability samples

For probability samples, the surveyor selects the sample using some probabilistic mechanism and the individuals in the population have no control over this process. In contrast, for example, a web survey may simply be posted on a website where it is left up to those browsing through the site to decide to participate in the survey (‘opt in’) or not. As the name implies, such non-probability samples are often used because it is somehow convenient to do so.

## Survey errors

• Literary Digest 1936 Poll

• Gallup Poll 1948

## Survey errors

Literary Digest 1936 Poll

‘Literary Digest’ mailed 10 million straw-vote ballots, of which 2.3 million were returned, an impressively large number, although it represented less than a 25 percent response rate. Based on the poll data, ‘Literary Digest’ predicted that Alfred Landon would beat Franklin Roosevelt 55 percent to 41 percent. In fact, Roosevelt beat Landon by 61 percent to 37 percent.

## Survey errors

Gallup 1948 Poll

Gallup used a quota sampling method in which each interviewer was given a set of quotas of types of people to interview, based on demographics. While that seemed reasonable at the time, the survey interviewers, for whatever conscious or subconscious reason, were biased towards interviewing Republicans more often than Democrats. As a result, Gallup predicted a Dewey win of 49.5 percent to 44.5 percent, but almost the opposite occurred, with Truman beating Dewey with 49.5 percent of the popular vote to Dewey’s 45.1 percent (a difference of almost 2.2 million votes).

## Survey errors

| Type of error | Cause |
|---|---|
| Coverage | ‘...the failure to give any chance of sample selection to some persons in the population’. |
| Sampling | ‘...heterogeneity on the survey measure among persons in the population’. |
| Nonresponse | ‘...the failure to collect data on all persons in the sample’. |
| Measurement | ‘...inaccuracies in responses recorded on the survey instruments’. |

“To err is human, to forgive divine – but to include errors in your design is statistical.”

Leslie Kish, 1977

Two most common approaches to reducing coverage error:

• obtaining as complete a sampling frame as possible (or employing a frameless sampling strategy in which most or all of the target population has a positive chance of being sampled)

• post-stratifying to weight the survey sample to match the population of inference on some observed key characteristics.
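
The second approach, post-stratification, can be sketched with a single weighting variable: each respondent's weight is the population share of their cell divided by the sample share of that cell. The function name, the age cells, and all shares below are hypothetical.

```python
from collections import Counter


def poststratification_weights(sample_cells, population_shares):
    """One weight per respondent: population share / sample share of their cell.

    sample_cells holds each respondent's cell label; population_shares maps
    every cell label to its known share of the population of inference.
    """
    n = len(sample_cells)
    counts = Counter(sample_cells)
    return [population_shares[cell] / (counts[cell] / n) for cell in sample_cells]


# Hypothetical sample that over-represents the young
cells = ["young"] * 70 + ["old"] * 30
weights = poststratification_weights(cells, {"young": 0.5, "old": 0.5})

old_share = sum(w for c, w in zip(cells, weights) if c == "old") / sum(weights)
print(round(old_share, 3))  # weighted share of the "old" cell
```

Cells that are under-represented in the sample receive weights above 1, and over-represented cells receive weights below 1.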

## Illustration: HKES

YouGov conducted two waves of election surveys in 2016, before and after the Legislative election.  The company provided multiple weights created using rim weighting (also called raking) based on the following data:

1. Registered voter gender

2. Registered voter age

3. Registered voter district

4. Education based on Pre-election survey result

5. Income based on Pre-election survey result
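
Rim weighting (raking) adjusts the weights one variable at a time until the weighted margins match every target margin. The sketch below is a generic illustration of the technique, not YouGov's actual procedure: the function names, the respondents, and the age targets are hypothetical, and only the 49:51 gender split echoes a figure quoted in these slides.

```python
def rake(respondents, margins, max_iter=100, tol=1e-8):
    """Iterative proportional fitting on the margins of several variables.

    respondents: list of dicts mapping variable name -> category.
    margins: {variable: {category: target share}}, shares summing to 1.
    Assumes every category seen in the sample has a target share.
    """
    n = len(respondents)
    weights = [1.0] * n
    for _ in range(max_iter):
        max_change = 0.0
        for var, targets in margins.items():
            # Current weighted total of each category of this variable
            totals = {}
            for person, w in zip(respondents, weights):
                totals[person[var]] = totals.get(person[var], 0.0) + w
            # Scale weights so this variable's margin matches its target
            for i, person in enumerate(respondents):
                factor = targets[person[var]] * n / totals[person[var]]
                weights[i] *= factor
                max_change = max(max_change, abs(factor - 1.0))
        if max_change < tol:  # all margins already match
            break
    return weights


def weighted_share(respondents, weights, var, category):
    """Weighted share of respondents falling in one category."""
    hit = sum(w for p, w in zip(respondents, weights) if p[var] == category)
    return hit / sum(weights)


# Hypothetical panel skewed towards young respondents
panel = (
    [{"gender": "M", "age": "young"}] * 40
    + [{"gender": "F", "age": "young"}] * 30
    + [{"gender": "M", "age": "old"}] * 10
    + [{"gender": "F", "age": "old"}] * 20
)
targets = {"gender": {"M": 0.49, "F": 0.51}, "age": {"young": 0.4, "old": 0.6}}
w = rake(panel, targets)
print(round(min(w), 3), round(max(w), 3))  # range of the resulting weights
```

The more the sample departs from the target margins, the wider the range of weights becomes.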

## Illustration: HKES

The pre- and post-wave weights have maximum values of up to 18.

The general weight value is under 5.

## Illustration: HKES

Possible reasons were:

1.    Weights were created using different populations

2.    Panelists were more representative of the younger population

## Illustration: HKES

For Point 1:

The Hong Kong population has a male-to-female ratio of 47:53 according to the Census.  The registered voter population, however, has a more even distribution of 49:51.

## Illustration: HKES

The previous figures illustrate the big difference between the HKES sample (which has more panelists from the younger group) and the registered voter population.  The latter has a large proportion of elderly voters.  This can be attributed to some political parties’ concerted efforts in mobilizing the elderly to register to vote.

## Illustration: HKES

Source: SCMP http://www.scmp.com/news/hong-kong/article/1855887/hong-kong-elderly-sign-droves-vote-district-council-elections

## Illustration: HKES

For Point 2, YouGov acknowledges that the company has more access to the younger population via its recruitment channel.  This may be due to the highly internet-savvy and active user population in the younger age groups.

Another possible source of the problem is that the two other demographic variables, education and income, are taken from a different population, which may be more representative of the general population or the online population but not necessarily of the registered voter population.

## Illustration: HKES

Raking is employed to generate a weight using age, gender and district only.  The range of the weight for the pre wave is from .269 to 8.939.  These weights are slightly less varied than the original weights.