Economics 210: Economic Statistics

Finding & Preparing Data

Ryan Clement | Middlebury Libraries | Spring 2018

slides at go/findingdata/

What are we covering?

  • How to think about your data search
  • A few major data sources
  • Codebooks
  • Dirty, unprepared data

Before you start your data search

  • What variables do I need?
    • Independent variable
    • Dependent variable
  • What unit of observation do I need?
    • microdata vs macrodata
  • What time period/frequency do I need?
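
To make the microdata vs macrodata distinction concrete, here is a minimal sketch in Python. The records, field names, and the `aggregate_by` helper are all hypothetical and invented for illustration. Microdata has one row per individual; macrodata is what remains after collapsing those rows to a larger unit such as a state.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical microdata: one row per person (unit of observation = individual).
microdata = [
    {"person_id": 1, "state": "VT", "income": 42000},
    {"person_id": 2, "state": "VT", "income": 58000},
    {"person_id": 3, "state": "NH", "income": 61000},
    {"person_id": 4, "state": "NH", "income": 39000},
    {"person_id": 5, "state": "NH", "income": 53000},
]

def aggregate_by(rows, group_key, value_key):
    """Collapse microdata into macrodata: one mean value per group."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[group_key]].append(row[value_key])
    return {group: mean(values) for group, values in groups.items()}

# Macrodata: one row per state (unit of observation = state).
macrodata = aggregate_by(microdata, "state", "income")
print(macrodata)
```

You can always build macrodata from microdata, but not the reverse, so it helps to decide which unit of observation you need before you search.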

Before you start your data search

  • "Who would care about this?"
    • And who would care about keeping it?
  • What type of organization are they?
    • Educational institution, government organization, private company, etc.
  • If not government, how valuable is the data?
    • And who would pay for it?
  • Are there privacy/confidentiality issues?

Types of studies

  • Cross-Sectional
    • data that are only collected once
    • many public opinion surveys are cross-sectional
  • Time Series
    • studies the same variable over time
    • the Census or the National Health Interview Survey are examples
    • the questions generally remain the same over time, but the individual respondents vary
  • Longitudinal Studies
    • conducted repeatedly, same group of respondents surveyed each time
    • allows for examining changes over the life course
    • Add Health is an example
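
One way to tell the three study types apart is to look at who was surveyed in each wave. The sketch below is a simplification under stated assumptions: the `classify_design` function and the sample records are hypothetical, and it treats a study as longitudinal only if exactly the same respondents appear in every wave (real panels lose respondents over time).

```python
from collections import defaultdict

def classify_design(records):
    """Classify a study from (respondent_id, wave) pairs.

    one wave only                    -> "cross-sectional"
    several waves, same respondents  -> "longitudinal"
    several waves, respondents vary  -> "time series"
    """
    respondents_by_wave = defaultdict(set)
    for respondent_id, wave in records:
        respondents_by_wave[wave].add(respondent_id)

    if not respondents_by_wave:
        raise ValueError("no records to classify")
    if len(respondents_by_wave) == 1:
        return "cross-sectional"

    samples = list(respondents_by_wave.values())
    if all(sample == samples[0] for sample in samples):
        return "longitudinal"
    return "time series"

# Hypothetical examples: (respondent_id, year surveyed)
one_time_poll = [(1, 2018), (2, 2018), (3, 2018)]
repeated_survey = [(1, 2016), (2, 2016), (3, 2018), (4, 2018)]
panel_study = [(1, 1995), (2, 1995), (1, 2008), (2, 2008)]

for name, data in [("poll", one_time_poll),
                   ("repeated survey", repeated_survey),
                   ("panel", panel_study)]:
    print(name, "->", classify_design(data))
```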

Searching Google for Data

  • Don't start with Google
  • Be as specific as possible in search terms (e.g., "microdata")
  • Remember the "who would care" rule

U.S. Census data

  • Some access points:
    • Social Explorer (for tables and many different geographies)
    • IPUMS (historical, harmonized, microdata)
    • NHGIS (historical spatial data)
  • When working with historical/time series data:
    • Watch for changing values
    • Watch for changing geographic coverage
    • Watch for changing questions
  • Which ACS is right for you?
    • The 3-year ACS has been discontinued (the last release covered 2011-2013)

General Social Survey (GSS)

  • Sociological survey on demographic, behavioral, and attitudinal topics
  • Annually from 1972-1994, then biennially since 1994
  • Randomly selected sample of adults (18+) in United States
    • Two samples of ~1500 respondents each
  • Some questions appear every year; some come and go; some come and then never return

Add Health

  • Longitudinal study of students in grades 7-12 in 1994-95 (most recent follow-up in 2008)
  • Survey data on social, economic, psychological and physical well-being
    • Contextual data on family, neighborhood, community, school, friendships, peer groups, and romantic relationships
  • Public use and restricted versions of the data; public use available through ICPSR

ICPSR

  • Part of the Institute for Social Research at the University of Michigan
  • First attempt at openly sharing data amongst researchers (started with election studies data)
  • Curated, digitized, diverse historical data sets


IPUMS Project Goals

  • Collect and preserve data and documentation
  • Harmonize data
  • Disseminate the data absolutely free!
  • Use it for GOOD -- never for EVIL
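
"Harmonize" here means recoding the same variable from different years into one consistent scheme, so that changing codes do not look like changing behavior. The sketch below illustrates the general idea only. It is not IPUMS's actual procedure, and the years, codes, and labels are made up.

```python
# Hypothetical codebooks: the same marital-status question, coded differently by year.
CODES_BY_YEAR = {
    1990: {1: "married", 2: "widowed", 3: "divorced", 4: "never married"},
    2010: {1: "married", 2: "divorced", 3: "widowed", 5: "never married"},
}

def harmonize(year, raw_code):
    """Translate a year-specific code into a common label.

    Returns None when the code does not appear in that year's codebook.
    """
    return CODES_BY_YEAR[year].get(raw_code)

# The raw code 2 means different things in the two years.
records = [(1990, 2), (2010, 2), (2010, 5), (1990, 5)]
harmonized = [(year, harmonize(year, code)) for year, code in records]
print(harmonized)
```

Comparing the raw code 2 across 1990 and 2010 would mix widowed and divorced respondents; after harmonizing, the labels line up.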

Other government sources

Other data repositories


What's in a codebook?

  • Column locations and widths for each variable (if necessary)
  • Definitions of different record types
  • Response codes for each variable
  • Codes used to indicate nonresponse and missing data
  • Exact questions and skip patterns used in a survey
  • Other indications of the content and characteristics of each variable
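
Column locations, response codes, and missing-data codes are exactly what you need to read a fixed-width data file. Here is a minimal sketch in Python, assuming a made-up codebook: the variable names, column positions, codes, and sample lines are hypothetical and do not come from any real data set.

```python
# Hypothetical codebook entries: variable -> (start column, width).
# Columns are numbered from 1, as codebooks usually list them.
LAYOUT = {"id": (1, 4), "age": (5, 3), "sex": (8, 1), "income": (9, 6)}

# Hypothetical codes that mean "nonresponse / missing" for each variable.
MISSING_CODES = {"age": {"999"}, "sex": {"9"}, "income": {"999999"}}

# Hypothetical response codes for a categorical variable.
SEX_LABELS = {"1": "male", "2": "female"}

NUMERIC = ("age", "income")

def parse_record(line):
    """Slice one fixed-width line into variables using the codebook layout."""
    row = {}
    for name, (start, width) in LAYOUT.items():
        raw = line[start - 1 : start - 1 + width]
        # Treat the codebook's missing-data codes as None, not as real values.
        row[name] = None if raw in MISSING_CODES.get(name, set()) else raw
    for name in NUMERIC:
        if row[name] is not None:
            row[name] = int(row[name])
    if row["sex"] is not None:
        row["sex"] = SEX_LABELS.get(row["sex"], row["sex"])
    return row

lines = ["00010342042000", "00029991999999"]
rows = [parse_record(line) for line in lines]
for row in rows:
    print(row)
```

Without the codebook, the second respondent would appear to be 999 years old with an income of 999,999.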

Dirty, Unprepared Data

  • Missing data
  • Bad data
  • Unclear data
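
A first pass at cleaning can simply sort each raw value into one of these three problems. The sketch below does this for a single age field. The missing-value markers, the 0-120 plausibility range, and the sample values are assumptions for illustration; a real data set's codebook should tell you what to use instead.

```python
from collections import Counter

# Hypothetical markers for missing values; check the codebook for the real ones.
MISSING_MARKERS = {"", "NA", ".", "-9"}

def check_age(raw):
    """Return (age, problem); problem is None when the value looks usable."""
    text = raw.strip()
    if text in MISSING_MARKERS:
        return None, "missing"
    try:
        age = int(text)
    except ValueError:
        return None, "unclear"  # e.g. "30s" or "thirty": cannot be read as a number
    if not 0 <= age <= 120:
        return None, "bad"      # a number, but not a plausible age
    return age, None

raw_ages = ["34", "", "-9", "212", "30s", " 58 "]
checked = [check_age(raw) for raw in raw_ages]

clean_ages = [age for age, problem in checked if problem is None]
problem_counts = Counter(problem for _, problem in checked if problem is not None)

print(clean_ages)
print(dict(problem_counts))
```

Counting the problems before dropping anything shows how much of the sample each fix would cost.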

