Applied Econometrics

The Economics of Housing & Homelessness

Applied Econometrics

  1. Practical Perspective
  2. Potential Outcome Framework
  3. Probability Spaces
  4. Difference-in-Means
  5. The Need for Controls
  6. The Essence of Causal Inference
  7. Difference-in-Difference
  8. Instrumental Variables
  9. Residualized Regressions

Applied Econometrics is concerned with the interpretation of statistical results in various contexts

Learning Econometrics

Conceptualize the Math

Simulate it on a computer

Read papers that apply the technique(s)

Aims

Listing just a few:

  • Think more about what it means to "control" for something in the finite sample
  • Better think through tradeoffs among approaches along a number of dimensions
  • Develop your own sense of what you find to be "credible"

Practical Perspective

At a High Level, Causal Inference doesn't work as well as we might hope

  • The Gold Standard in Causal Inference is a Randomized Control Trial
  • Most questions cannot be addressed via a randomized control trial
  • Even questions that you may initially think can be addressed via an RCT often cannot be

I don't think this is emphasized as much as it should be in introductory econometrics classes (which partly makes sense: why demotivate the class?!)

Example

Using only experimental variation, we cannot determine whether the use of experimental vouchers has a higher long-run impact (however measured) than the use of standard vouchers

Research

[Diagram: our research choice set, sitting at the intersection of what is Credible and what is Important]

What This Means:

(1) We'll have to make "Approximations"

  • We'll want to learn math (perhaps more than you may like!) to be able to evaluate and think through these approximations
  • This necessitates moving beyond learning just from data (we need to learn from individuals with on-the-ground experience)

(2) The data alone doesn't provide a unique answer to our question

In practice, we often cannot provide guarantees for the performance of our approach under plausible assumptions

Potential Outcome Framework

The Potential Outcome Framework

  • Causal effects can be defined via the Potential Outcome Function
\textrm{Treatment Effect}_i = \tilde{Y}_i(1) - \tilde{Y}_i(0)

Example

\underbrace{\{0, 1\}}_{\textrm{Voucher or No Voucher}} \overset{\tilde{Y}_i}{\to} \underbrace{\mathcal{R}_+}_{\textrm{Annual Earnings}}

  • The Potential Outcome Function maps treatments to outcomes
d \longmapsto \tilde{Y}_i(d)
  • The function can differ across individuals

The Potential Outcome Framework

  • We instead observe the following Outcome Variable given treatment
Y_i = D_i\tilde{Y}_i(1) + (1-D_i)\tilde{Y}_i(0)
  • An individual either receives or does not receive a voucher
  • Therefore, one of the terms is missing and hence we do not observe individual-level treatment effects
\textrm{Treatment Effect}_i = \tilde{Y}_i(1) - \tilde{Y}_i(0)

Probability Spaces

***Optional Material - For personal interest***

Motivation

  • Composition is a fundamental way in which we can build new ideas using our existing ideas
  • By learning about probability spaces, ideas in statistics become composable, which means we can build with them

Probability Spaces

A key concept that we want to be able to conceptualize is "a subset of a set"

[Diagram: a set, with two highlighted subsets]

Probability Spaces

In our work this semester on housing policy, the underlying set could be a collection of Eviction Complaints filed in Housing Court against tenants

Eviction Complaints

One subset of interest might be all of the evictions filed against HCV tenants

Another subset of interest might be evictions filed against tenants who failed to pay landlords in their first month

Probability Spaces

  • It turns out that we cannot define probabilities on elements of a set
\mathbb{P}(x) \ \textrm{is not defined if} \ x \in \Omega
  • But we can define probabilities on subsets of the underlying set*
\mathbb{P}(A) \ \textrm{is well defined if} \ A \subset \Omega
  • So far, we've introduced two key terms
\textrm{Set}: \Omega
\textrm{Subset of a Set}: A \subset \Omega

*We cannot always define probability on all subsets.

Probability Spaces

[Diagram: probability assigned to two different subsets of the set]

The probability that we assign to the entire set:

\mathbb{P}(\Omega) = 1

Probability Spaces

So now we have the fundamentals

\textrm{Set}: \Omega
\textrm{Collection of Subsets}: \mathcal{F}, \quad A \in \mathcal{F} \implies A \subset \Omega
\textrm{Probability Measure}: \mathbb{P} : \mathcal{F} \to [0, 1]

\Big( \Omega, \mathcal{F}, \mathbb{P}\Big)
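One of this course's stated aims is to simulate the math on a computer. Here's a minimal sketch (my own illustration, not from the slides) of these fundamentals for a finite set: Ω is a small set of outcomes, 𝓕 is its power set, and ℙ assigns each subset the sum of its outcome weights.

```python
from itertools import combinations

# A finite probability space (Omega, F, P)
# Omega: the underlying set of outcomes
omega = frozenset({"HH", "HT", "TH", "TT"})

# F: the collection of subsets (here, the full power set of Omega)
def power_set(s):
    items = list(s)
    return [frozenset(c) for r in range(len(items) + 1) for c in combinations(items, r)]

F = power_set(omega)

# P: a probability measure P : F -> [0, 1], built from weights on outcomes
weights = {"HH": 0.25, "HT": 0.25, "TH": 0.25, "TT": 0.25}

def P(A):
    assert A in F  # probabilities are assigned to subsets, never to raw elements of Omega
    return sum(weights[w] for w in A)

print(P(omega))                    # P(Omega) = 1.0
print(P(frozenset({"HH", "HT"})))  # the subset "first flip is heads": 0.5
```

Note that P is only ever evaluated on members of F, mirroring the definitions above.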

Probability Spaces

The only thing we know so far is that probability spaces allow us to represent the probability of subsets of a set (by definition)

One thing we can ask when someone has introduced something new is: what can we do with it?

\Big( \Omega, \mathcal{F}, \mathbb{P}\Big)

To do something new, we can define a random variable on the probability space

X : \Omega \to \mathcal{R}

Probability Spaces

One random variable of interest will be an estimator

In this context, the underlying set (also known as the sample space) will be the set of possible samples that could be realized

\Big( \Omega, \mathcal{F}, \mathbb{P}\Big) \quad \overset{X}{\rightarrow} \quad \Big( \mathcal{R}, \mathcal{B}(\mathcal{R}), \mathbb{P} \circ X^{-1}\Big)

A key idea is that a random variable "pulls" the probability measure forward onto the space we care about

\textrm{Sampling Probability Measure of Estimator:} \quad \mathbb{P} \circ X^{-1}
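To make the "pull forward" idea concrete, here's a small simulation (my own sketch, using an arbitrary lognormal earnings distribution) in which each ω is a realized sample, the estimator X is the sample mean, and the distribution of X(ω) across many draws of ω approximates the sampling probability measure ℙ ∘ X⁻¹.

```python
import numpy as np

rng = np.random.default_rng(0)

# The sample space: each omega is one possible realized sample of n earnings draws
n = 100
def draw_omega():
    return rng.lognormal(mean=10.0, sigma=0.5, size=n)

# A random variable X : Omega -> R; here the estimator is the sample mean
def X(omega):
    return omega.mean()

# The distribution of X(omega) across draws of omega approximates the
# pushforward measure P o X^{-1}: the sampling distribution of the estimator
estimates = np.array([X(draw_omega()) for _ in range(10_000)])
print(estimates.mean(), estimates.std())  # center and spread of P o X^{-1}
```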

Conditional Expectation (Event)

\mathbb{E}[Y \vert A] := \int _{\Omega}Y \, d\mathbb{P}_{X^{-1}(A)}

[Diagram: X : \Omega \to \mathcal{X} and Y : \Omega \to \mathcal{Y}, with the event A \subset \mathcal{X} pulled back to X^{-1}(A) \subset \Omega]

Conditioning on an Event (With Independence)

\textrm{Let} \ W := \{\omega \in \Omega \ \vert \ X(\omega) \in B, \ D(\omega) = 1\} = X^{-1}(B) \cap D^{-1}(\{1\})

The event of interest is the set of outcomes that X maps into some element B of the sigma algebra of X and for which the treatment value is 1

\mathbb{E}[Y \vert W] := \int _{\Omega}Y \, d\mathbb{P}_W

With respect to this conditional distribution:

= \sum a_i \mathbb{P}_W(A_i)
= \sum \tilde{a}_i \mathbb{P}_W(\tilde{A}_i) \qquad (\textrm{since} \ \tilde{Y}(1) \overset{a.s}{=} Y \ \textrm{on} \ W)
= \sum \tilde{a}_i \frac{\mathbb{P}(W \cap \tilde{A}_i)}{\mathbb{P}(W)}
= \mathbb{E}[\tilde{Y}(1) \vert W]

We will assume the following

\tilde{Y} \perp D \vert B

Conditioning on an Event (With Independence)

= \sum \tilde{a}_i \frac{\mathbb{P}(X^{-1}(B) \cap D^{-1}(\{1\}) \cap \tilde{A}_i)}{\mathbb{P}(X^{-1}(B) \cap D^{-1}(\{1\}))}
= \sum \tilde{a}_i \frac{\mathbb{P}(X^{-1}(B))\mathbb{P}(D^{-1}(\{1\}) \cap \tilde{A}_i \vert X^{-1}(B))}{\mathbb{P}(X^{-1}(B))\mathbb{P}(D^{-1}(\{1\}) \vert X^{-1}(B))}
= \sum \tilde{a}_i \frac{ \mathbb{P}(D^{-1}(\{1\}) \vert X^{-1}(B)) \mathbb{P}(\tilde{A}_i \vert X^{-1}(B))}{ \mathbb{P}(D^{-1}(\{1\}) \vert X^{-1}(B))} \qquad \textrm{(this is where the assumption kicks in)}
= \sum \tilde{a}_i \mathbb{P}(\tilde{A}_i \vert X^{-1}(B))
= \mathbb{E}[\tilde{Y}(1) \vert X^{-1}(B)]

Conditioning on a Random Variable (With Independence)

Let's first discuss what conditional independence with respect to a random variable is Not!

\tilde{Y} \perp D \vert X

Does not imply the following:

\forall B \in \mathcal{B}(\mathcal{X}), \quad \tilde{Y} \perp D \vert B

Why? Because the whole space is itself a Borel set,

\mathcal{X} \in \mathcal{B}(\mathcal{X}), \quad X^{-1}(\mathcal{X}) = \Omega

which would imply unconditional independence!

Aim

We would like to understand the conditions under which

\mathbb{E}[Y \vert X, D=1] \overset{\Delta}{=} \mathbb{E}[\tilde{Y}(1) \vert X]

Left Side

\mathbb{E}[Y \vert X] : \Omega \to \mathcal{R}
\textrm{s.t.} \ \int _A \mathbb{E}[Y \vert X] \, d\mathbb{P}_{D=1} = \int_A Y \, d\mathbb{P}_{D=1} = \int_A \tilde{Y}(1) \, d\mathbb{P}_{D=1}, \quad \forall A \in \sigma(X)


Right Side

\mathbb{E}[\tilde{Y}(1) \vert X] : \Omega \to \mathcal{R}
s.t. \ \int _A \mathbb{E}[\tilde{Y}(1) \vert X] d\mathbb{P} = \int_A \tilde{Y}(1) d\mathbb{P}, \forall A \in \sigma(X)

Then Left Side equals Right Side

\iff \quad \int_A \tilde{Y}(1) \, d\mathbb{P}_{D=1} = \int _A \tilde{Y}(1) \, d\mathbb{P}, \quad \forall A \in \sigma(X)

Expanding both sides over the whole space:

\int _{\Omega}\tilde{Y}(1) \, d\mathbb{P}_{D=1} = \sum \tilde{a}_i \frac{\mathbb{P}(\{D=1\} \cap \tilde{A}_i)}{\mathbb{P}(D=1)}
\int _{\Omega}\tilde{Y}(1) \, d\mathbb{P} = \sum \tilde{a}_i \mathbb{P}(\tilde{A}_i)

Note that equality of these expectations does not imply independence:

\int_{\Omega} \tilde{Y}(1) \, d\mathbb{P}_{D=1} = \int _{\Omega} \tilde{Y}(1) \, d\mathbb{P} \quad \not\implies \quad \mathbb{P}(\{\tilde{Y}(1) \in A\} \cap \{D=1\}) = \mathbb{P}(\tilde{Y}(1) \in A)\,\mathbb{P}(D=1)

By the Law of Iterated Expectations:

\mathbb{E}\big[ \mathbb{E}[Y \vert X]\big] = \mathbb{E}[Y], \qquad \mathbb{E}\big[ \mathbb{E}[Y \vert X, D=1]\big] = \mathbb{E}[Y \vert D=1]

By Definition

Left Side

\mathbb{E}[Y \vert X, D=1] : \Omega \to \mathcal{R}
\textrm{s.t.} \ \int _A \mathbb{E}[Y \vert X, D=1] \, d\mathbb{P}_{D=1} = \int_A Y \, d\mathbb{P}_{D=1} = \int_A \tilde{Y}(1) \, d\mathbb{P}_{D=1}, \quad \forall A \in \sigma(X)
\textrm{equivalently:} \ \int _A \mathbb{E}[Y \vert X, D=1] \, d\mathbb{P} = \int_A Y \, d\mathbb{P} = \int_A \tilde{Y}(1) \, d\mathbb{P}, \quad \forall A \in \sigma(X \times \{D=1\})

Difference-in-Means

Idea

Approximate the Average Treatment Effect by comparing the average outcome in the treated group to that in the control group

\textrm{Average Treatment Effect}: \quad \mathbb{E}[\tilde{Y}_i(1) - \tilde{Y}_i(0)]

\textrm{Approximation}: \quad \mathbb{E}[Y_i \vert D_i=1] - \mathbb{E}[Y_i \vert D_i=0]

where the first term is the average outcome over those individuals in the treated group

Thought Experiment

[Diagram: a population split into Treated and Control groups]

\mathbb{E}[Y_i \vert D_i=1] - \mathbb{E}[Y_i \vert D_i=0]
= \mathbb{E}[\tilde{Y}_i(1) \vert D_i=1] - \mathbb{E}[\tilde{Y}_i(0) \vert D_i=0]

Difference-in-Means

= \underbrace{\mathbb{E}[\tilde{Y}_i(1) \vert D_i=1] - \mathbb{E}[\tilde{Y}_i(0) \vert D_i=1]}_{\textrm{Average Treatment on the Treated}}
+ \ \underbrace{\mathbb{E}[\tilde{Y}_i(0) \vert D_i=1] - \mathbb{E}[\tilde{Y}_i(0) \vert D_i=0]}_{\textrm{Selection Bias}}

Summary

\textrm{Difference-in-Means} \ = \ \textrm{Average Treatment on the Treated} \ + \ \textrm{Selection Bias}

Example

Y_i := \ \textrm{Judgement of Possession for failure to pay rent}
D_i := \ \textrm{Tenant is represented by a legal aid lawyer}

Exercise: Develop a story for positive/negative selection bias in this context

\mathbb{E}[Y_i \vert D_i=1] - \mathbb{E}[Y_i \vert D_i=0]
= \mathbb{E}[\tilde{Y}_i(1) \vert D_i=1] - \mathbb{E}[\tilde{Y}_i(0) \vert D_i=0]

Randomized Control Trial

= \underbrace{\mathbb{E}[\tilde{Y}_i(1) \vert D_i=1] - \mathbb{E}[\tilde{Y}_i(0) \vert D_i=1]}_{\textrm{ATT}}
+ \ \underbrace{\mathbb{E}[\tilde{Y}_i(0) \vert D_i=1] - \mathbb{E}[\tilde{Y}_i(0) \vert D_i=0]}_{\textrm{Selection Bias}}

Under random assignment, D_i \perp \big(\tilde{Y}_i(1), \tilde{Y}_i(0)\big), so conditioning on D_i drops out of each term:

= \ \big(\mathbb{E}[\tilde{Y}_i(1)] - \mathbb{E}[\tilde{Y}_i(0)]\big) \ + \ \big(\mathbb{E}[\tilde{Y}_i(0)] - \mathbb{E}[\tilde{Y}_i(0)]\big)
= \ \underbrace{\mathbb{E}[\tilde{Y}_i(1)] - \mathbb{E}[\tilde{Y}_i(0)]}_{\textrm{ATE}} \ + \ 0
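A small simulation sketch of the two decompositions above (the data-generating process is my own toy example, not from any paper): with self-selection, the difference-in-means equals ATT plus selection bias; with random assignment, the selection-bias term is approximately zero.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000

# Toy potential outcomes (illustrative DGP)
ability = rng.normal(size=n)
y0 = ability                      # outcome without treatment
y1 = ability + 2.0                # constant treatment effect of 2

# Selection: individuals with higher ability are more likely to take treatment
d_selected = (ability + rng.normal(size=n)) > 0

def diff_in_means(d, y0, y1):
    y = np.where(d, y1, y0)       # observed outcome: the other potential outcome is missing
    return y[d].mean() - y[~d].mean()

# Under selection: Difference-in-Means = ATT + Selection Bias
att = (y1 - y0)[d_selected].mean()                     # = 2
bias = y0[d_selected].mean() - y0[~d_selected].mean()  # positive here
print(diff_in_means(d_selected, y0, y1), att + bias)   # the two agree

# Under random assignment, the selection-bias term vanishes
d_random = rng.random(n) < 0.5
print(diff_in_means(d_random, y0, y1))                 # ~ 2, the ATE
```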

The Need for Controls

Idea # 2

Instead of taking the difference between treated and control groups, let's average local differences between treated and control groups

Summary (thus far)

1. \ \textrm{Causal Inference is a Missing Data Problem}
2. \ \textrm{Difference-in-Means} \ = \ \textrm{Avg Treatment on the Treated} \ + \ \textrm{Selection Bias}

[Chart from Tsemberis (2000): an outcome gap between groups. How much of this gap is selection bias?]

(1) Take the difference-in-means within each group

\mathbb{E}[Y_i \vert X_i = x_j, D=1] - \mathbb{E}[Y_i \vert X_i = x_j, D=0]

(2) Take the average of the differences

\mathbb{E}_x\big[\mathbb{E}[Y_i \vert X_i = x_j, D=1] - \mathbb{E}[Y_i \vert X_i = x_j, D=0]\big]

Idea # 2

Example

Let's assume we observe a categorical variable

X_i := \ \textrm{Housing Court}

[Diagram: observations grouped into bins x_1, \dots, x_5, one per housing court]

Under what conditions is this a good idea?

\mathbb{E}[Y_i \vert X_i = x_j, D=1] - \mathbb{E}[Y_i \vert X_i = x_j, D=0]
= \mathbb{E}[\tilde{Y}_i(1) \vert X_i = x_j, D=1] - \mathbb{E}[\tilde{Y}_i(0) \vert X_i = x_j, D=0]
= \mathbb{E}[\tilde{Y}_i(1) \vert X_i = x_j] - \mathbb{E}[\tilde{Y}_i(0) \vert X_i = x_j]
= \mathbb{E}[\tilde{Y}_i(1) - \tilde{Y}_i(0) \vert X_i = x_j]

Key Assumption

"Within each bin, treatment is as good as randomly assigned"

Within Bins

[Diagram: bins x_1, \dots, x_5, with treatment randomly assigned within each bin]

We are assuming that Treatment is randomly assigned within each bin: Local Randomized Control Trials

Continued...

\textrm{Selection on Observables Assumption}: \quad \tilde{Y}_i \perp D_i \vert X_i

Interpretation: locally in the feature space, treatment is as good as randomly assigned

\implies
\mathbb{E}_x\big[\mathbb{E}[Y_i \vert X_i = x_j, D=1] - \mathbb{E}[Y_i \vert X_i = x_j, D=0]\big]
= \mathbb{E}_x\big[\mathbb{E}[\tilde{Y}_i(1) - \tilde{Y}_i(0) \vert X_i = x_j]\big]
= \mathbb{E}[\tilde{Y}_i(1) - \tilde{Y}_i(0) ]

(the last equality by the Law of Iterated Expectations)

Implication

The Conditional Expectation Function has a Causal interpretation

\mathbb{E}[Y_i \vert D_i=1, X_i] - \mathbb{E}[Y_i \vert D_i=0, X_i]
= \mathbb{E}[\tilde{Y}_i(1) \vert D_i=1, X_i] - \mathbb{E}[\tilde{Y}_i(0) \vert D_i=0, X_i]
= \mathbb{E}[\tilde{Y}_i(1) \vert X_i] - \mathbb{E}[\tilde{Y}_i(0) \vert X_i]
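Here's a simulation sketch of Idea #2 under selection on observables (the five-court setup is my own toy example): the naive difference-in-means is contaminated by selection across courts, while averaging within-bin differences recovers the treatment effect.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000

# Illustrative setup: five housing courts with different baseline outcomes
court = rng.integers(0, 5, size=n)
y0 = court.astype(float)          # baseline outcome varies by court
y1 = y0 + 1.0                     # treatment effect of 1

# Treatment is randomly assigned *within* each court, at court-specific rates
p_treat = np.array([0.1, 0.2, 0.3, 0.4, 0.5])[court]
d = rng.random(n) < p_treat
y = np.where(d, y1, y0)

# Naive difference-in-means is biased: high-baseline courts treat more often
print(y[d].mean() - y[~d].mean())

# (1) difference-in-means within each bin, (2) average over the bins
within = [y[(court == j) & d].mean() - y[(court == j) & ~d].mean() for j in range(5)]
shares = [np.mean(court == j) for j in range(5)]
print(np.dot(within, shares))     # ~ 1, the ATE
```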

Example

Y_i := \ \textrm{Judgement of Possession for failure to pay rent}
D_i := \ \textrm{Tenant is represented by a legal aid lawyer}

Exercise: Develop a story for positive/negative selection bias in this context

X_i := \ \textrm{Housing Court}

Extension

Working with Text

Y_i := \ \textrm{Judgement of Possession for failure to pay rent}
D_i := \ \textrm{Tenant is represented by a legal aid lawyer}
X_i := \ \textrm{The landlord's complaint against the tenant}

Selection on Observables (with Text)

\tilde{Y}_i \perp D_i \vert X_i

Conditional on the textual document, treatment is as good as randomly assigned

The Conditional Expectation Function has a Causal interpretation

The Essence of Causal Inference

A worthwhile question to reflect on: why do Economists (with several years of graduate training) use linear models for causal inference?

\hat{y}_i = \sum _{j=1}^k \hat{\beta}_j x_{ij}

  • Local Variation of Treatment
  • Variation in Density of Treatment
  • Curse of Dimensionality: we can perfectly predict the treatment variable in the finite sample

Reflection

  • Up to this point we have emphasized "Identification"
  • If you were given the full population (i.e., removing the sampling noise), would you recover the true parameter?*

*This doesn't apply to every situation, e.g., Cluster Randomized Control Trials

In Practice

\textrm{Conditional Independence} + \textrm{Finite Data} \nRightarrow \ \textrm{Unique Estimate}
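A tiny sketch (my own) of why Conditional Independence + Finite Data does not yield a unique estimate: with even a single continuous control, each observation is its own "bin", so a fully flexible model predicts treatment perfectly in the finite sample and no within-bin treated-vs-control contrast remains.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500

# A single continuous control already gives every observation its own "bin"
x = rng.normal(size=(n, 1))
d = (rng.random(n) < 0.5).astype(float)

# A fully flexible in-sample estimate of E[D | X] (each point is its own bin,
# as with 1-nearest-neighbor) reproduces treatment exactly...
d_hat = d.copy()                  # in-sample fit: each point matches itself
print(np.abs(d - d_hat).max())    # 0.0: no residual variation in treatment left

# ...and no bin contains both a treated and a control unit, so the
# within-bin difference-in-means is undefined everywhere
values, counts = np.unique(x, return_counts=True)
print(counts.max())               # 1
```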

Big Picture

  • We started by saying: Causal Inference is a Missing Data Problem
  • In practice, this translates into a fundamental tension between Local Identification and the Curse of Dimensionality
  • The way to motivate different estimation approaches is by highlighting how they try to balance these two competing aims/issues*

*It's not by claiming that such an estimation approach has the lowest asymptotic variance

The Essence of Causal Inference is Similarity

(1) What notion of similarity are we using to define Conditional Independence?

(2) What notion of similarity are we using to form predictions from the training data?

  • Important because we work with finite data
  • The notion of similarity in point 2 should "extend" the notion of similarity in point 1

Continuous transformations don't preserve conditional independence

(1) Are we learning the appropriate kernel?

(2) Are the observations unbiased?

Because we have a finite amount of data, we must make the following decisions

(1) What information do we want to condition on? (the Complexity of the Estimand)

(2) Where do we want to land on the following continuum? (Model Complexity)

\textrm{Structure} \quad \longleftrightarrow \quad \textrm{The Ability of the Model to Generalize}

What are we betting on?

[Chart: estimators placed along two axes, Information in Controls and Model Complexity: Difference-in-Means, OLS, Lasso, Feed-Forward Neural Nets, Fine-tuned LLMs]

This isn't exact, and it's very subjective. It's meant to help you conceptualize your own viewpoint

Possible Situations

[Table: relative performance of a fine-tuned LLM versus OLS in three situations: the true model has known simple structure, unknown simple structure, or unknown complex structure]

Mathematical Structures for Representing Similarity

\textrm{Inner Product Space}: \quad \big( X, \langle x, x'\rangle\big)
\textrm{Metric Space}: \quad \big( X, d(x, x')\big)
\textrm{Topology}: \quad \big( X, \mathcal{T}\big)

Conditional Independence is defined with respect to a Topology \big( X, \mathcal{T}\big)

\tilde{Y}_i \perp D_i \vert X_i \quad \implies \quad \mathbb{E}[Y_i \vert D_i=1, X_i] = \mathbb{E}[\tilde{Y}_i(1) \vert X_i]

Causal Framework: Conditional Independence, defined on \big( X, \mathcal{T}\big)

Model: defined on \big( X, \langle \cdot, \cdot \rangle \big)

Difference-in-Difference

We previously showed

\mathbb{E}[Y_i \vert D_i=1] - \mathbb{E}[Y_i \vert D_i=0]
= \underbrace{\mathbb{E}[\tilde{Y}_i(1) \vert D_i=1] - \mathbb{E}[\tilde{Y}_i(0) \vert D_i=1]}_{\textrm{Average Treatment on the Treated}}
+ \ \underbrace{\mathbb{E}[\tilde{Y}_i(0) \vert D_i=1] - \mathbb{E}[\tilde{Y}_i(0) \vert D_i=0]}_{\textrm{Selection Bias}}

Idea

Can we use pre-treatment data to approximate the selection bias?

Derivation

\mathbb{E}[Y_{i,t-1} \vert D_{i,t} = 1] - \mathbb{E}[Y_{i,t-1} \vert D_{i,t} = 0]
= \mathbb{E}[\tilde{Y}_{i,t-1}(0) \vert D_{i,t} = 1] - \mathbb{E}[\tilde{Y}_{i,t-1}(0) \vert D_{i,t} = 0]

Key Assumption: this pre-period difference equals the current-period selection bias

= \ \underbrace{\mathbb{E}[\tilde{Y}_{i,t}(0) \vert D_{i,t}=1] - \mathbb{E}[\tilde{Y}_{i,t}(0) \vert D_{i,t}=0]}_{\textrm{Selection Bias}}

Parallel Trends Interpretation

[Chart: average outcomes at t-1 and t for the treated (blue) and control (purple) groups]

At t-1, we observe the initial difference between \textcolor{blue}{\mathbb{E}[Y_{i,t-1} \vert D_{i,t-1} = 1]} and \textcolor{purple}{\mathbb{E}[Y_{i,t-1} \vert D_{i,t-1} = 0]}

Parallel Trends

At t, we also observe the difference between \textcolor{blue}{\mathbb{E}[Y_{i,t} \vert D_{i,t} = 1]} and \textcolor{purple}{\mathbb{E}[Y_{i,t} \vert D_{i,t} = 0]}, but we know this captures

\textrm{ATT} + \textrm{Selection Bias}

Parallel Trends

[Chart: the gap at t-1 between \textcolor{blue}{\mathbb{E}[Y_{i,t-1} \vert D_{i,t-1} = 1]} and \textcolor{purple}{\mathbb{E}[Y_{i,t-1} \vert D_{i,t-1} = 0]} gives the Estimated Selection Bias; subtracting it from the gap at t gives the Estimated ATT]

\underbrace{\mathbb{E}[Y_{i,t} \vert D_{i,t}=1] - \mathbb{E}[Y_{i,t} \vert D_{i,t}=0]}_{\textrm{ATT} \ + \ \textrm{Selection Bias}}
- \ \underbrace{\mathbb{E}[Y_{i,t-1} \vert D_{i,t-1}=1] - \mathbb{E}[Y_{i,t-1} \vert D_{i,t-1}=0]}_{\textrm{Selection Bias}}

Summary

With Controls

\underbrace{\mathbb{E}[Y_{i,t} \vert X_{it}, D_{i,t}=1] - \mathbb{E}[Y_{i,t} \vert X_{it}, D_{i,t}=0]}_{\textrm{ATT} \ + \ \textrm{Selection Bias}}
- \ \underbrace{\mathbb{E}[Y_{i,t-1} \vert X_{it},D_{i,t-1}=1] - \mathbb{E}[Y_{i,t-1} \vert X_{it}, D_{i,t-1}=0]}_{\textrm{Selection Bias}}

With controls, we're correcting for local selection bias

Conceptually, it doesn't make sense to include individual-level fixed effects
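A simulation sketch of the difference-in-difference logic (my own two-period toy DGP satisfying parallel trends): the pre-period gap estimates the selection bias, and subtracting it from the post-period gap recovers the ATT.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000

# Toy two-period panel (illustrative DGP, not from any paper)
ability = rng.normal(size=n)
d = (ability + rng.normal(size=n)) > 0         # selection into treatment at time t

# Parallel trends: both groups' untreated outcomes share the same time trend
y0_pre = ability + rng.normal(size=n)          # Y(0) at t-1
y0_post = ability + 1.0 + rng.normal(size=n)   # Y(0) at t (common trend of +1)
y1_post = y0_post + 2.0                        # treatment effect (ATT) of 2

y_pre = y0_pre                                 # nobody is treated at t-1
y_post = np.where(d, y1_post, y0_post)

post_gap = y_post[d].mean() - y_post[~d].mean()  # ATT + selection bias
pre_gap = y_pre[d].mean() - y_pre[~d].mean()     # selection bias
print(post_gap - pre_gap)                        # ~ 2, the ATT
```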

Instrumental Variables

We often cannot randomly assign treatment

We cannot randomly assign having a lawyer, because that requires consent and follow-through by the tenant

We randomize the next best thing, which is access to a free lawyer

Instrumental Variables

Example:

We're interested in the impact of having a lawyer on the outcome of an eviction case

Instrument

\textcolor{blue}{Z_i}:= \ \textrm{Access to a Lawyer}

Treatment

\textcolor{purple}{D_i}:= \ \textrm{Legal Representation}

Outcome

Y_i:= \ \textrm{Judgement of Possession}

If the Instrument is Randomly Assigned

At the population level, we can identify two treatment effects

\mathbb{E}[D_i \vert Z_i=1] - \mathbb{E}[D_i \vert Z_i=0]

The impact of an offer of free legal representation on legal representation

\mathbb{E}[Y_i \vert Z_i=1] - \mathbb{E}[Y_i \vert Z_i=0]

The impact of an offer of free legal representation on Judgements of Possession

But we're not primarily interested in either of these two effects

We're interested in the following

\mathbb{E}[\tilde{Y}_i(\textcolor{purple}{1}) - \tilde{Y}_i(\textcolor{purple}{0})]

The impact of legal representation on Judgements of Possession

As we'll show, we can only recover the following

\mathbb{E}[\tilde{Y}_i(\textcolor{purple}{1}) - \tilde{Y}_i(\textcolor{purple}{0}) \vert S_i = \textrm{Complier}]

The impact of legal representation on Judgements of Possession for the compliers (LATE)

Classifying Individuals

\textrm{Never Taker} : \tilde{D}_i(0) = \tilde{D}_i(1) = 0
\textrm{Always Taker} : \tilde{D}_i(0) = \tilde{D}_i(1) = 1
\textrm{Complier} : \tilde{D}_i(0) = 0, \ \tilde{D}_i(1) = 1
\textrm{Defier} : \tilde{D}_i(0) = 1, \ \tilde{D}_i(1) = 0

We can capture the effect of lawyers on Judgements of Possession for the compliers

Derivation

\mathbb{E}[Y_i \vert Z_i=1] - \mathbb{E}[Y_i \vert Z_i=0]
= \mathbb{E}[\tilde{Y}_i(\textcolor{blue}{1}) - \tilde{Y}_i(\textcolor{blue}{0}) ]
= p(\textrm{NTaker} ) \underbrace{\mathbb{E}[\tilde{Y}_i(\textcolor{blue}{1}) - \tilde{Y}_i(\textcolor{blue}{0}) \vert \textrm{NTaker}]}_{\textcolor{green}{= 0} \ \textrm{(Exclusion Restriction)}}
+ p(\textrm{ATaker} ) \underbrace{\mathbb{E}[\tilde{Y}_i(\textcolor{blue}{1}) - \tilde{Y}_i(\textcolor{blue}{0}) \vert \textrm{ATaker}]}_{\textcolor{green}{= 0} \ \textrm{(Exclusion Restriction)}}
+ p(\textrm{Complier} ) \mathbb{E}[\tilde{Y}_i(\textcolor{blue}{1}) - \tilde{Y}_i(\textcolor{blue}{0}) \vert \textrm{Complier}]
+ \underbrace{p(\textrm{Defier} )}_{\textcolor{red}{= 0} \ \textrm{(assume this group doesn't exist)}} \mathbb{E}[\tilde{Y}_i(\textcolor{blue}{1}) - \tilde{Y}_i(\textcolor{blue}{0}) \vert \textrm{Defier}]

Continued

\mathbb{E}[D_i \vert Z_i=1] - \mathbb{E}[D_i \vert Z_i=0]
= \mathbb{E}[\tilde{D}_i(1) - \tilde{D}_i(0) ]
= p(\textrm{NTaker} ) \underbrace{\mathbb{E}[\tilde{D}_i(1) - \tilde{D}_i(0) \vert \textrm{NTaker}]}_{\textcolor{orange}{= 0} \ \textrm{(by definition)}}
+ p(\textrm{ATaker} ) \underbrace{\mathbb{E}[\tilde{D}_i(1) - \tilde{D}_i(0) \vert \textrm{ATaker}]}_{\textcolor{orange}{= 0} \ \textrm{(by definition)}}
+ p(\textrm{Complier} ) \underbrace{\mathbb{E}[\tilde{D}_i(1) - \tilde{D}_i(0) \vert \textrm{Complier}]}_{= 1 \ \textrm{(by definition)}}
+ \underbrace{p(\textrm{Defier} )}_{\textcolor{red}{= 0} \ \textrm{(assume this group doesn't exist)}} \mathbb{E}[\tilde{D}_i(1) - \tilde{D}_i(0) \vert \textrm{Defier}]
= \textcolor{orange}{p(\textrm{Complier})}

Continued

\mathbb{E}[Y_i \vert Z_i=1] - \mathbb{E}[Y_i \vert Z_i=0]
= p(\textrm{Complier} ) \, \mathbb{E}[\tilde{Y}_i(\textcolor{blue}{1}) - \tilde{Y}_i(\textcolor{blue}{0}) \vert \textrm{Complier}]

Dividing both sides by p(\textrm{Complier}):

\frac{\mathbb{E}[Y_i \vert Z_i=1] - \mathbb{E}[Y_i \vert Z_i=0]}{p(\textrm{Complier} )}
= \mathbb{E}[\tilde{Y}_i(\textcolor{blue}{1}) - \tilde{Y}_i(\textcolor{blue}{0}) \vert \textrm{Complier}]
= \mathbb{E}[\tilde{Y}_i(\textcolor{purple}{1}) - \tilde{Y}_i(\textcolor{purple}{0}) \vert \textrm{Complier}] \qquad \textrm{(Exclusion Restriction)}

Combining this with the first stage, p(\textrm{Complier}) = \mathbb{E}[D_i \vert Z_i=1] - \mathbb{E}[D_i \vert Z_i=0]:

\frac{\mathbb{E}[Y_i \vert Z_i=1] - \mathbb{E}[Y_i \vert Z_i=0]}{\mathbb{E}[D_i \vert Z_i=1] - \mathbb{E}[D_i \vert Z_i=0]}
= \mathbb{E}[\tilde{Y}_i(\textcolor{purple}{1}) - \tilde{Y}_i(\textcolor{purple}{0}) \vert \textrm{Complier}]

Summary

We're still interested in the effect that a lawyer has on Judgements of Possession: \mathbb{E}[\tilde{Y}_i(\textcolor{purple}{1}) - \tilde{Y}_i(\textcolor{purple}{0})]

What we can recover is

\frac{\overbrace{\mathbb{E}[Y_i \vert Z_i=1] - \mathbb{E}[Y_i \vert Z_i=0]}^{\textrm{Intention-to-Treat}}}{\underbrace{\mathbb{E}[ D_i \vert Z_i=1] - \mathbb{E}[D_i \vert Z_i=0]}_{\textrm{First Stage}}} = \underbrace{\mathbb{E}[\tilde{Y}_i(\textcolor{purple}{1}) - \tilde{Y}_i(\textcolor{purple}{0}) \vert \textrm{Complier}]}_{\textrm{LATE}}
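A simulation sketch of the LATE logic above (the complier shares and effect sizes are made up for illustration): dividing the intention-to-treat effect by the first stage recovers the effect for compliers.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000

# Toy population of tenants (illustrative shares, not estimates from any study)
types = rng.choice(["never", "always", "complier"], size=n, p=[0.3, 0.2, 0.5])
z = rng.random(n) < 0.5                        # randomized offer of a free lawyer

# Treatment take-up by type: compliers follow the instrument
d = np.where(types == "always", True, np.where(types == "complier", z, False))

# Potential outcomes; compliers have a treatment effect of -0.2
y0 = rng.normal(size=n)
effect = np.where(types == "complier", -0.2, 0.1)
y = y0 + effect * d                            # always-takers' effect cancels in the ITT

itt = y[z].mean() - y[~z].mean()               # intention-to-treat
first_stage = d[z].mean() - d[~z].mean()       # ~ p(complier) = 0.5
print(itt / first_stage)                       # ~ -0.2, the LATE for compliers
```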

Residualized Regression

Motivation

Up to this point in our class, we have emphasized a Non-parametric approach to Causal Inference

\tilde{Y}_i \perp D_i \vert X_i
\mathbb{E}_X\big[\mathbb{E}[Y_i \vert D_i=1, X_i] - \mathbb{E}[Y_i \vert D_i=0, X_i] \big]

But we see that in the papers we've read (Diamond et al. 2019, Chetty et al. 2016), Economists tend to fit linear models to the data

We want to try and understand what these linear models are capturing. It's certainly different from our nonparametric approach

Focus

The Linear Model

Y_i = \alpha + \beta_1 D_i + \beta_2^T X_i + \varepsilon_i

\textrm{Outcome}: Y_i, \quad \textrm{Treatment}: D_i, \quad \textrm{Controls}: X_i

Chetty 2016 (Example)

Y_i := \ \textrm{Use Experimental Voucher for at least one year}
D_i := \ \textrm{Offer of Experimental Voucher}
X_i := \ \textrm{Site Fixed Effect}

The Linear Model

\textcolor{purple}{Y}_i = \alpha + \beta_1 \textcolor{blue}{D}_i + \beta_2^T X_i + \varepsilon_i

Interested in the effect an offer of a voucher has on Neighborhood Poverty Rate (Chetty 2016)

Statistics Question

\textrm{So what does} \ \beta_1 \ \textrm{capture?}
\beta_1 \neq \mathbb{E}_X\big[\mathbb{E}[Y_i \vert D_i=1, X_i] - \mathbb{E}[Y_i \vert D_i=0, X_i] \big]

Overview

1. Explain Residualized Regression

2. Show that it's a useful way to interpret Coefficients in Linear Models (including linear IV models!) (Helpful for reading papers)

3. Show that it's flexible enough to use "Text-based Controls" (Potentially an interesting Research Direction for your Final Paper)

Residualized Regression

1. \ \textrm{Predict Treatment based on Controls}: \quad \hat{D}_i \approx \mathbb{E}[D_i \vert X_i]

2. \ \textrm{Take the Difference between Treatment and Predicted Treatment}: \quad D_i - \hat{D}_i

3. \ \textrm{Regress the Outcome on this Difference}: \quad Y_i = \beta_1(D_i - \hat{D}_i) + \eta_i


Residualized Regression (Chetty 2016)

1. \ \textrm{Predict Voucher Use based on Site Fixed Effects}

(Synthetic Data)

Residualized Regression (Chetty 2016)

2. \ \textrm{Take the Difference between Treatment and Predicted Treatment}

(Based on Controls: Site Location)

D_i - \hat{D}_i \ = \ \begin{bmatrix} 1.0 \\ 1.0 \\ 0.0 \\ 0.0 \\ 0.0 \\ 1.0 \\ 0.0 \end{bmatrix} - \begin{bmatrix} 0.15 \\ 0.3 \\ 0.12 \\ 0.15 \\ 0.3 \\ 0.12 \\ 0.12 \end{bmatrix} = \begin{bmatrix} 0.85 \\ 0.70 \\ -0.12 \\ -0.15 \\ -0.30 \\ 0.88 \\ -0.12 \end{bmatrix}

Residualized Regression (Chetty 2016)

3. \ \textrm{Regress the Outcome on this Difference}

Y_i = \beta_1(D_i - \hat{D}_i) + \eta_i

Notice that Treated Individuals have Positive Residuals!

Does this make sense to you?

Key Takeaway

The Coefficient of Interest in the Linear Model is the same as the coefficient in the Residualized Model

Y_i = \alpha + \beta_1 D_i + \beta_2^T X_i + \varepsilon_i
\iff
Y_i = \beta_1(D_i - \hat{D}_i) + \eta_i
\hat{D}_i = \hat{\gamma}_1 + \hat{\gamma}_2X_i
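A quick numerical check of this equivalence (my own sketch, using plain least squares in numpy): the coefficient on D in the full linear model matches the coefficient from regressing Y on the residual D - D̂, the Frisch-Waugh-Lovell result.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

# Synthetic data: treatment depends on the control, outcome on both
x = rng.normal(size=n)
d = (x + rng.normal(size=n) > 0).astype(float)
y = 1.0 + 2.0 * d + 3.0 * x + rng.normal(size=n)

# Full linear model: regress Y on [1, D, X]
full = np.column_stack([np.ones(n), d, x])
beta_full = np.linalg.lstsq(full, y, rcond=None)[0]

# Residualized model: predict D from [1, X], then regress Y on (D - D_hat)
ctrl = np.column_stack([np.ones(n), x])
d_hat = ctrl @ np.linalg.lstsq(ctrl, d, rcond=None)[0]
resid = d - d_hat
beta_resid = (resid @ y) / (resid @ resid)

print(beta_full[1], beta_resid)   # the two coefficients on D coincide
```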

Conceptual Understanding

1. When we're running linear regression, we are regressing the outcome variable on differences between the treatment and the expected treatment (where the expected treatment is a linear function of the controls)

2. If treatment is as good as randomly assigned conditional on the controls, then it's essentially random who has a positive residual and who has a negative residual. The only difference is that individuals with positive residuals received treatment.

3. Therefore the relationship between the outcome variable and the residuals captures a relationship between the outcome and the treatment variable that isn't contaminated by selection bias

Extension

Partially Linear Models

Y_i = \beta_1(\underbrace{D_i - \mathbb{E}[D_i \vert X_i]}_{\textrm{treatment minus the predicted treatment based only on the controls}}) + \eta_i

Linear Models

Y_i = \alpha + \beta_1 D_i + \beta_2^T X_i + \varepsilon_i
\iff
Y_i = \beta_1(D_i - \hat{D}_i) + \eta_i
\hat{D}_i = \hat{\gamma}_1 + \hat{\gamma}_2X_i

The Linear IV Model

D_i = \gamma_0 + \gamma_1 X_i + \gamma_2 Z_i + \varepsilon_i \qquad \textrm{(First Stage)}
Y_i = \alpha + \beta_1 \hat{D}_i + \beta_2 X_i + \varepsilon_i \qquad \textrm{(Second Stage)}

How to Interpret the Coefficient of Interest

Y_i = \beta_1(\hat{D}_i - \bar{\hat{D}}_i) + \eta_i, \qquad \bar{\hat{D}}_i = \hat{\phi}_1 + \hat{\phi}_2X_i

i.e., at the population level,

Y_i = \beta_1\big(\mathbb{E}[D_i \vert X_i, Z_i] - \mathbb{E}[D_i \vert X_i]\big) + \eta_i
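And a sketch of this linear IV interpretation (synthetic data of my own): residualizing the first-stage fitted values D̂ against the controls and regressing Y on that difference reproduces the 2SLS coefficient, while OLS on D is contaminated by the confounder.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Synthetic IV data: u is an unobserved confounder, z a valid instrument
x = rng.normal(size=n)                         # control
z = (rng.random(n) < 0.5).astype(float)        # instrument
u = rng.normal(size=n)                         # confounder
d = (0.5 * x + 1.0 * z + u + rng.normal(size=n) > 0).astype(float)
y = 2.0 * d + 3.0 * x + 2.0 * u + rng.normal(size=n)

# First stage: D on [1, X, Z]; fitted values D_hat approximate E[D | X, Z]
first = np.column_stack([np.ones(n), x, z])
d_hat = first @ np.linalg.lstsq(first, d, rcond=None)[0]

# Residualize D_hat against [1, X]: the analogue of E[D | X, Z] - E[D | X]
ctrl = np.column_stack([np.ones(n), x])
v = d_hat - ctrl @ np.linalg.lstsq(ctrl, d_hat, rcond=None)[0]

print((v @ y) / (v @ v))   # ~ 2.0, the 2SLS coefficient on D
ols = np.linalg.lstsq(np.column_stack([np.ones(n), d, x]), y, rcond=None)[0]
print(ols[1])              # OLS coefficient, biased by the confounder u
```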

Example: Humphries et al. (2024)

\text{Evict}_{it} = \alpha_t + \alpha_{l(i)} + \alpha_{\tau(i,t)} + \sum_b \beta_b 1_{\{Bal_{it} = b\}} + \gamma B_{it} + \delta B_{it} 1_{\{Tenure_{it} > 12\}} + \epsilon_{it}

\text{Evict}_{it}: \ \textrm{Evicted}, \quad \alpha_t: \ \textrm{Time fixed effect}, \quad \alpha_{l(i)}: \ \textrm{Landlord fixed effect}, \quad \alpha_{\tau(i,t)}: \ \textrm{Tenure fixed effect}, \quad 1_{\{Bal_{it} = b\}}: \ \textrm{Balance bins}, \quad B_{it}: \ \textrm{Months Behind}
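For reference, here's a hedged sketch of how a specification in this spirit could be fit with statsmodels. Every name here (tenant_month_panel.csv, evicted, month, landlord_id, tenure_months, balance_bin, months_behind) is hypothetical, for illustration only, and is not the authors' actual data or code.

```python
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical tenant-month panel; all column names are made up for
# illustration and are not the variables from Humphries et al. (2024)
df = pd.read_csv("tenant_month_panel.csv")
df["long_tenure"] = (df["tenure_months"] > 12).astype(int)

# Time, landlord, and tenure fixed effects enter as categorical dummies;
# balance enters as bins, months behind linearly plus an interaction
model = smf.ols(
    "evicted ~ C(month) + C(landlord_id) + C(tenure_months)"
    " + C(balance_bin) + months_behind + months_behind:long_tenure",
    data=df,
).fit()
print(model.params.filter(like="months_behind"))
```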

Housing & Homelessness: Applied Econometrics

By Patrick Power
