Applied Econometrics

The economics of Housing & Homelessness

Applied Econometrics

  1. Practical Perspective
  2. Potential Outcome Framework
  3. Probability Spaces
  4. Difference-in-Means
  5. The Need for Controls
  6. The Essence of Causal Inference
  7. Difference-in-Difference
  8. Instrumental Variables
  9. Residualized Regressions

Applied Econometrics is concerned with the interpretation of statistical results in various contexts

Learning Econometrics

Conceptualize the Math

Simulate it on a computer

Read papers that apply the technique(s)

Aim(s)

Listing just a couple:

  • Think more about what it means to "control" for something in the finite sample
  • Better think through tradeoffs among approaches along a number of dimensions
  • Develop your own sense of what you find to be "credible"

Practical Perspective

At a High Level, Causal Inference doesn't work as well as we might hope

  • The Gold Standard in Causal Inference is a Randomized Control Trial (RCT)
  • Most questions cannot be addressed via a randomized control trial
  • Even questions that you may initially think can be addressed via an RCT sometimes cannot be

I don't think this is emphasized as much as it should be in introductory econometrics classes (which partly makes sense: why demotivate the class?!)

Example

Using only experimental variation, we cannot determine whether the use of an experimental voucher has a higher long-run impact (however measured) than the use of a standard voucher

Research

[Diagram: the research Choice Set, defined by what is Credible and what is Important]

What This Means:

(1) We'll have to make "Approximations"

  • We'll want to learn math (perhaps more than you may like!) to be able to evaluate and think through these approximations
  • This necessitates moving beyond learning just from the data (we need to learn from individuals with on-the-ground experience)

(2) The data alone doesn't provide a unique answer to our question

In practice, we often cannot provide guarantees for the performance of our approach under plausible assumptions

Potential Outcome Framework

The Potential Outcome Framework

  • Causal effects can be defined via the Potential Outcome Function
\textrm{Treatment Effect}_i = \tilde{Y}_i(1) - \tilde{Y}_i(0)

Example

\underbrace{\{0, 1\}}_{\textrm{Voucher or No Voucher}} \overset{\tilde{Y}_i}{\to} \underbrace{\mathcal{R}_+}_{\textrm{Annual Earnings}}

  • The Potential Outcome Function maps treatments to outcomes
d \longmapsto \tilde{Y}_i(d)
  • The function can differ across individuals

The Potential Outcome Framework

  • We instead observe the following Outcome Variable given treatment
Y_i = D_i\tilde{Y}_i(1) + (1-D_i)\tilde{Y}_i(0)
  • An individual either receives or does not receive a voucher
  • Therefore, one of the terms is missing and hence we do not observe individual-level treatment effects
\textrm{Treatment Effect}_i = \tilde{Y}_i(1) - \tilde{Y}_i(0)
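A small simulation (in the spirit of "simulate it on a computer") can make the missing-data point concrete. This is a minimal sketch with made-up earnings numbers, not any particular dataset:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

# Potential outcomes: annual earnings without and with a voucher (made-up numbers).
y0 = rng.normal(30_000, 5_000, n)           # Y_i(0)
y1 = y0 + rng.normal(2_000, 1_000, n)       # Y_i(1): heterogeneous treatment effects

d = rng.integers(0, 2, n)                   # treatment indicator D_i
y = d * y1 + (1 - d) * y0                   # observed outcome: one term per person

print(f"true ATE: {np.mean(y1 - y0):,.0f}")
# With real data we only ever see (y, d); y1 - y0 is never observed for any
# individual -- this is the missing data problem.
```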

Probability Spaces

***Optional Material - For personal interest***

Motivation

  • Composition is a fundamental way in which we can build new ideas using our existing ideas
  • By learning about probability spaces, ideas in statistics become composable, which means we can build with them

Probability Spaces

A key concept that we want to be able to conceptualize is "a subset of a set"

[Diagram: a Set with two highlighted subsets]

Probability Spaces

In our work this semester on housing policy, the underlying set could be a collection of Eviction Complaints filed in Housing Court against tenants

Eviction Complaints

One subset of interest might be all of the evictions filed against HCV tenants

Another subset of interest might be evictions filed against tenants who failed to pay landlords in their first month

Probability Spaces

  • It turns out that we cannot define probabilities on elements of a set
\mathbb{P}(x) \ \textrm{is not defined if} \ x \in \Omega
  • But we can define probabilities on subsets of the underlying set*
\mathbb{P}(A) \ \textrm{is well defined if} \ A \subset \Omega
  • So far, we've introduced two key terms
\textrm{Set}: \Omega
\textrm{Subset of a Set}: A \subset \Omega

*We cannot always define probability on all subsets.

Probability Spaces

[Diagram: a Set with several subsets, each assigned a probability]

The probability that we assign to the entire set:

\mathbb{P}(\Omega) = 1

Probability Spaces

So now we have the fundamentals

\Big( \Omega, \mathcal{F}, \mathbb{P}\Big)
\textrm{Set}: \Omega
\textrm{Collection of Subsets}: \mathcal{F}, \quad A \in \mathcal{F} \implies A \subset \Omega
\textrm{Probability Measure}: \mathbb{P} : \mathcal{F} \to [0, 1]

Probability Spaces

The only thing we know so far is that Probability spaces allow us to represent the probability of subsets of a set (by definition)

One thing we can ask when someone has introduced something new is: what can we do with it?

\Big( \Omega, \mathcal{F}, \mathbb{P}\Big)

To do something new, we can define a random variable on the probability space

X : \Omega \to \mathcal{R}

Probability Spaces

One random variable of interest will be an estimator

In this context, the underlying set (also known as the sample space) will be the set of possible samples that could be realized

\Big( \Omega, \mathcal{F}, \mathbb{P}\Big) \quad \overset{X}{\rightarrow} \quad \Big( \mathcal{R}, \mathcal{B}(\mathcal{R}), \mathbb{P} \circ X^{-1}\Big)

A key idea is that a random variable "pushes" the probability measure forward onto the space we care about (the pushforward measure)

\textrm{Sampling Probability Measure of Estimator:} \quad \mathbb{P} \circ X^{-1}
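As a sketch of this idea, the snippet below redraws many samples and evaluates the same estimator on each; the empirical distribution of the results approximates the pushforward measure \mathbb{P} \circ X^{-1}. All distributional choices are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)

# Omega is the set of samples that could be realized; the estimator X maps
# each realized sample (one omega) to a real number.
def draw_estimate(n=100):
    sample = rng.normal(loc=2.0, scale=1.0, size=n)   # one omega in Omega
    return sample.mean()                              # X(omega)

estimates = np.array([draw_estimate() for _ in range(5_000)])
# The histogram of `estimates` approximates P o X^{-1}, the sampling
# probability measure of the estimator.
print(f"mean {estimates.mean():.3f}, sd {estimates.std():.3f}")
```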

Conditional Expectation (Event)

\mathbb{E}[Y \vert A] := \int _{\Omega}Y\mathbb{P}_{X^{-1}(A)}

[Diagram: X : \Omega \to \mathcal{X} with A \subset \mathcal{X} and its preimage X^{-1}(A) \subset \Omega; \ Y : \Omega \to \mathcal{Y}]

Conditioning on an Event (With Independence)

\textrm{Let} \ W := \{\omega \in \Omega \ \vert \ X(\omega) \in B \ \textrm{and} \ D(\omega) = 1\}

The event of interest is the set of outcomes that are mapped into some element of the sigma algebra of X and for which the treatment value is 1

We will assume the following

\tilde{Y} \perp D \vert B

Since \tilde{Y}(1) \overset{a.s.}{=} Y on W:

\mathbb{E}[Y \vert W] := \int _{\Omega}Y\mathbb{P}_W \quad \textrm{(with respect to this conditional distribution)}
= \sum a_i \mathbb{P}_W(A_i)
= \sum \tilde{a}_i \frac{\mathbb{P}(W \cap \tilde{A}_i)}{\mathbb{P}(W)}
= \sum \tilde{a}_i \mathbb{P}_W(\tilde{A}_i)
= \mathbb{E}[\tilde{Y}(1) \vert W]

Conditioning on an Event (With Independence)

= \sum \tilde{a}_i \frac{\mathbb{P}(X^{-1}(B) \cap D^{-1}(\{1\}) \cap \tilde{A}_i)}{\mathbb{P}(X^{-1}(B) \cap D^{-1}(\{1\}))}
= \sum \tilde{a}_i \frac{\mathbb{P}(X^{-1}(B))\mathbb{P}(D^{-1}(\{1\}) \cap \tilde{A}_i \vert X^{-1}(B))}{\mathbb{P}(X^{-1}(B))\mathbb{P}(D^{-1}(\{1\}) \vert X^{-1}(B))}
= \sum \tilde{a}_i \frac{ \mathbb{P}(D^{-1}(\{1\}) \vert X^{-1}(B)) \mathbb{P}(\tilde{A}_i \vert X^{-1}(B))}{ \mathbb{P}(D^{-1}(\{1\}) \vert X^{-1}(B))} \quad \textrm{(this is where the assumption kicks in)}
= \sum \tilde{a}_i \mathbb{P}(\tilde{A}_i \vert X^{-1}(B))
= \mathbb{E}[\tilde{Y}(1) \vert X^{-1}(B)]

Conditioning on a Random Variable (With Independence)

Let's first discuss what conditional independence with respect to a random variable is Not!

\tilde{Y} \perp D \vert X

Does not imply the following:

\forall B \in \mathcal{B}(\mathcal{X}), \quad \tilde{Y} \perp D \vert B

Why? Because

\mathcal{X} \in \mathcal{B}(\mathcal{X})

and conditioning on the event X \in \mathcal{X} (i.e., on all of \Omega) would imply unconditional independence!

Aim

We would like to understand the conditions under which

\mathbb{E}[Y \vert X, D_i=1] \overset{?}{=} \mathbb{E}[\tilde{Y}(1) \vert X]

Left Side

\mathbb{E}[Y \vert X] : \Omega \to \mathcal{R}
\textrm{s.t.} \ \int _A \mathbb{E}[Y \vert X] d\mathbb{P}_{D=1} = \int_A Y d\mathbb{P}_{D=1}, \ \forall A \in \sigma(X)
= \int_A \tilde{Y}(1) d\mathbb{P}_{D=1}, \ \forall A \in \sigma(X)


Right Side

\mathbb{E}[\tilde{Y}(1) \vert X] : \Omega \to \mathcal{R}
s.t. \ \int _A \mathbb{E}[\tilde{Y}(1) \vert X] d\mathbb{P} = \int_A \tilde{Y}(1) d\mathbb{P}, \forall A \in \sigma(X)

Then Left Side equals Right Side

\iff \quad \int_A \tilde{Y}(1) d\mathbb{P}_{D=1} = \int _A \tilde{Y}(1)d\mathbb{P}, \ \forall A \in \sigma(X)

[Diagram: the event D=1 as a subset of \Omega]

Note that even on A = \Omega this is not innocuous:

\int _{\Omega}\tilde{Y}(1)d\mathbb{P}_{D=1} = \sum \tilde{a}_i \frac{\mathbb{P}(\{D=1\} \cap \tilde{A}_i)}{\mathbb{P}(D=1)}
\int _{\Omega}\tilde{Y}(1)d\mathbb{P} = \sum \tilde{a}_i \mathbb{P}(\tilde{A}_i)

\int_{\Omega} \tilde{Y}(1) d\mathbb{P}_{D=1} = \int _{\Omega} \tilde{Y}(1)d\mathbb{P} \quad \not\implies \quad \mathbb{P}(\{\tilde{Y}(1) \in A\} \cap \{D=1\}) = \mathbb{P}(\tilde{Y}(1) \in A)\mathbb{P}(D=1)

Recall the Law of Iterated Expectations:

\mathbb{E}\big[ \mathbb{E}[Y \vert X]\big] = \mathbb{E}[Y]
\mathbb{E}\big[ \mathbb{E}[Y \vert X, D=1]\big] = \mathbb{E}[Y \vert D=1]

By Definition

Left Side

\mathbb{E}[Y \vert X, D_i=1] : \Omega \to \mathcal{R}
\textrm{s.t.} \ \int _A \mathbb{E}[Y \vert X, D=1] d\mathbb{P} = \int_A Y d\mathbb{P}, \ \forall A \in \sigma(X \times \{D=1\})
= \int_A \tilde{Y}(1) d\mathbb{P}, \ \forall A \in \sigma(X \times \{D=1\})

\textrm{equivalently, s.t.} \ \int _A \mathbb{E}[\tilde{Y}(1) \vert X] d\mathbb{P}_{D=1} = \int_A \tilde{Y}(1) d\mathbb{P}_{D=1}, \ \forall A \in \sigma(X)

Difference-in-Means

Idea

Approximate the Average Treatment Effect by comparing the average in the treated group to that in the control group

\textrm{Average Treatment Effect:} \quad \mathbb{E}[\tilde{Y}_i(1) - \tilde{Y}_i(0)]

\textrm{Approximation:} \quad \mathbb{E}[Y_i \vert D_i=1] - \mathbb{E}[Y_i \vert D_i=0]

(the average outcome over those individuals in the treated group, minus the average outcome over those in the control group)

Thought Experiment

[Diagram: a Population split into Treated and Control groups]

\mathbb{E}[Y_i \vert D_i=1] - \mathbb{E}[Y_i \vert D_i=0]
= \mathbb{E}[\tilde{Y}_i(1) \vert D_i=1] - \mathbb{E}[\tilde{Y}_i(0) \vert D_i=0]

Difference-in-Means

= \underbrace{\mathbb{E}[\tilde{Y}_i(1) \vert D_i=1] - \mathbb{E}[\tilde{Y}_i(0) \vert D_i=1]}_{\textrm{Average Treatment on the Treated}}
+ \ \underbrace{\mathbb{E}[\tilde{Y}_i(0) \vert D_i=1] - \mathbb{E}[\tilde{Y}_i(0) \vert D_i=0]}_{\textrm{Selection Bias}}

Summary

\textrm{Difference-in-Means} \ = \ \textrm{Average Treatment on the Treated} \ + \ \textrm{Selection Bias}

Example

Y_i := \ \textrm{Judgement of Possession for failure to pay rent}
D_i := \ \textrm{Tenant is represented by a legal aid lawyer}

\mathbb{E}[Y_i \vert D_i=1] - \mathbb{E}[Y_i \vert D_i=0]
= \mathbb{E}[\tilde{Y}_i(1) \vert D_i=1] - \mathbb{E}[\tilde{Y}_i(0) \vert D_i=0]

Exercise: Develop a story for positive/negative selection bias in this context

Randomized Control Trial

If treatment is randomly assigned, then D_i \perp (\tilde{Y}_i(1), \tilde{Y}_i(0)), so conditioning on D_i can be dropped:

= \underbrace{\mathbb{E}[\tilde{Y}_i(1) \vert D_i=1] - \mathbb{E}[\tilde{Y}_i(0) \vert D_i=1]}_{\mathbb{E}[\tilde{Y}_i(1)] - \mathbb{E}[\tilde{Y}_i(0)]}
+ \ \underbrace{\mathbb{E}[\tilde{Y}_i(0) \vert D_i=1] - \mathbb{E}[\tilde{Y}_i(0) \vert D_i=0]}_{\mathbb{E}[\tilde{Y}_i(0)] - \mathbb{E}[\tilde{Y}_i(0)] \ = \ 0}

= \underbrace{\mathbb{E}[\tilde{Y}_i(1)] - \mathbb{E}[\tilde{Y}_i(0)]}_{\textrm{ATE}} + \ 0
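A quick simulation contrasting the two cases; the data-generating process below is invented for illustration (constant effect of 1, self-selection on low baseline outcomes):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 100_000

y0 = rng.normal(0.0, 1.0, n)
y1 = y0 + 1.0                                  # constant treatment effect of 1

# Self-selection: people with low Y(0) seek treatment (negative selection bias).
d_selected = (y0 + rng.normal(0.0, 1.0, n) < 0).astype(int)
# RCT: a coin flip, independent of potential outcomes.
d_rct = rng.integers(0, 2, n)

for name, d in [("self-selection", d_selected), ("RCT", d_rct)]:
    y = d * y1 + (1 - d) * y0
    dim = y[d == 1].mean() - y[d == 0].mean()
    print(f"{name}: difference-in-means = {dim:.3f} (true ATE = 1.000)")
```

Under self-selection the estimate lands well below 1; under random assignment it recovers the ATE.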

The Need for Controls

Idea # 2

Instead of taking the difference between treated and control groups, let's average local differences between treated and control groups

Summary (thus far)

1. \textrm{Causal Inference is a Missing Data Problem}
2. \textrm{Difference-in-Means} \ = \ \textrm{Avg Treatment Treated} \ + \ \textrm{Selection Bias}

[Figure: Tsemberis (2000), outcome gap between two groups]

How much of this gap is selection bias?

(1) Take the difference-in-means within each group

\mathbb{E}[Y_i \vert X_i = x_j, D=1] - \mathbb{E}[Y_i \vert X_i = x_j, D=0]

(2) Take the average of these differences

\mathbb{E}_x\big[\mathbb{E}[Y_i \vert X_i = x_j, D=1] - \mathbb{E}[Y_i \vert X_i = x_j, D=0]\big]

Idea # 2

Example

Let's assume we observe a categorical variable X_i, e.g., the Housing Court in which a case is filed

[Diagram: observations grouped into bins x_1, x_2, x_3, x_4, x_5 by Housing Court]

Under what conditions is this a good idea?

\mathbb{E}[Y_i \vert X_i = x_j, D=1] - \mathbb{E}[Y_i \vert X_i = x_j, D=0]
= \mathbb{E}[\tilde{Y}_i(1) \vert X_i = x_j, D=1] - \mathbb{E}[\tilde{Y}_i(0) \vert X_i = x_j, D=0]
= \mathbb{E}[\tilde{Y}_i(1) \vert X_i = x_j] - \mathbb{E}[\tilde{Y}_i(0) \vert X_i = x_j]
= \mathbb{E}[\tilde{Y}_i(1) - \tilde{Y}_i(0) \vert X_i = x_j]

Key Assumption

"Within each bin, treatment is as good as randomly assigned"

Within Bins

[Diagram: bins x_1, \dots, x_5, each functioning as a Local Randomized Control Trial]

We are assuming that Treatment is randomly assigned within each bin

Continued...

\mathbb{E}_x\big[\mathbb{E}[Y_i \vert X_i = x_j, D=1] - \mathbb{E}[Y_i \vert X_i = x_j, D=0]\big]
= \mathbb{E}_x\big[\mathbb{E}[\tilde{Y}_i(1) - \tilde{Y}_i(0) \vert X_i = x_j]\big]
= \mathbb{E}[\tilde{Y}_i(1) - \tilde{Y}_i(0) ] \quad \textrm{(by the Law of Iterated Expectations)}

Selection on Observables Assumption

\tilde{Y}_i \perp D_i \vert X_i

Interpretation: Locally in the feature space, treatment is as good as randomly assigned

Implication

The Conditional Expectation Function has a Causal interpretation

\mathbb{E}[Y_i \vert D_i=1, X_i] - \mathbb{E}[Y_i \vert D_i=0, X_i]
= \mathbb{E}[\tilde{Y}_i(1) \vert D_i=1, X_i] - \mathbb{E}[\tilde{Y}_i(0) \vert D_i=0, X_i]
= \mathbb{E}[\tilde{Y}_i(1) \vert X_i] - \mathbb{E}[\tilde{Y}_i(0) \vert X_i]
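A sketch of this estimator on synthetic data, where treatment is random within each bin but treatment rates and baseline outcomes both vary across bins (all numbers invented):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 100_000

x = rng.integers(0, 5, n)                      # categorical control (5 bins)
d = rng.binomial(1, np.array([0.1, 0.2, 0.4, 0.6, 0.8])[x])  # random within bins

y0 = 2.0 * x + rng.normal(0.0, 1.0, n)         # baseline outcome varies by bin
y1 = y0 + 1.0                                  # true treatment effect = 1
y = d * y1 + (1 - d) * y0

naive = y[d == 1].mean() - y[d == 0].mean()    # contaminated by selection bias
within = sum(                                  # average of within-bin differences
    (y[(x == j) & (d == 1)].mean() - y[(x == j) & (d == 0)].mean()) * (x == j).mean()
    for j in range(5)
)
print(f"naive: {naive:.3f}, within-bin average: {within:.3f} (true = 1.000)")
```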

Example

Y_i := \ \textrm{Judgement of Possession for failure to pay rent}
D_i := \ \textrm{Tenant is represented by a legal aid lawyer}
X_i := \ \textrm{Housing Court}

Exercise: Develop a story for positive/negative selection bias in this context

Extension

Working with Text

Y_i := \ \textrm{Judgement of Possession for failure to pay rent}
D_i := \ \textrm{Tenant is represented by a legal aid lawyer}
X_i := \ \textrm{The landlord's complaint against the tenant}

Selection on observables (with Text)

\tilde{Y}_i \perp D_i \vert X_i

Conditional on the textual document, treatment is as good as randomly assigned

The Conditional Expectation Function has a Causal interpretation

The Essence of Causal Inference

A worthwhile question to reflect on is why do Economists (with several years of graduate training) use linear models for causal inference?

\hat{y}_i = \sum _{j=1}^k \hat{\beta}_j x_{ij}

[Diagram: tradeoffs when conditioning finely on controls]

  • Local Variation of Treatment
  • Variation in Density of Treatment
  • Curse of Dimensionality: with rich enough controls, we can perfectly predict the treatment variable in the finite sample

Reflection

  • Up to this point we have emphasized "Identification" 

*This doesn't apply to every situation like Cluster Randomized Control Trials

\textrm{Conditional Independence} + \textrm{Finite Data} \nRightarrow \ \textrm{Unique Estimate}

In Practice

  • If you were given the full population (i.e remove the sampling noise) would you recover the true parameter*

Big Picture

  • We started by saying

Causal Inference is a Missing Data Problem

  • The way to motivate different estimation approaches is by highlighting how they try to balance these two competing aims/issues
  • In practice, this is translated into a fundamental tension between

Local Identification

Curse of Dimensionality

*It's not by claiming that such an estimation approach has the lowest asymptotic variance

The Essence of Causal Inference is Similarity

(1) What notion of similarity are we using to define Conditional Independence?

(2) What notion of similarity are we using to form predictions from the training data?

  • Important because we work with finite data

The notion of similarity in point 2 should "extend" the notion of similarity in point 1 (continuous transformations don't preserve conditional independence)

(1) Are we learning the appropriate kernel? (2) Are the observations unbiased?

Because we have a finite amount of data, we must make the following decisions:

(1) What information do we want to condition on? (Complexity of Estimand)

(2) Where do we want to land on the following continuum? (Model Complexity)

Structure \quad \longleftrightarrow \quad The Ability of the Model to Generalize

What are we betting on?

[Diagram: estimators placed along two axes (Information in Controls and Model Complexity): Difference-in-Means, OLS, Lasso, Feed-Forward Neural Nets, Fine-tuned LLMs]

This isn't exact and is very subjective; it's meant to help you conceptualize your own viewpoint

Possible Situations

[Table: hypothetical performance of an LLM vs. OLS when the true model has (i) unknown simple structure, (ii) unknown complex structure, (iii) known simple structure]

Mathematical Structures for Representation Similarity

\textrm{Inner Product Space}: \big( X, \langle x, x'\rangle\big)
\textrm{Metric Space}: \big( X, d(x, x')\big)
\textrm{Topology}: \big( X, \mathcal{T}\big)

Conditional Independence is defined with respect to a Topology:

\tilde{Y}_i \perp D_i \vert X_i \ \textrm{on} \ \big( X, \mathcal{T}\big) \implies \mathbb{E}[Y_i \vert D_i=1, X_i] = \mathbb{E}[\tilde{Y}_i(1) \vert X_i]

\textrm{Causal Framework}: \big( X, \mathcal{T}\big) \qquad \textrm{Model}: \big( X, \langle \cdot, \cdot \rangle \big)

Difference-in-Difference

We previously showed

\mathbb{E}[Y_i \vert D_i=1] - \mathbb{E}[Y_i \vert D_i=0]
= \underbrace{\mathbb{E}[\tilde{Y}_i(1) \vert D_i=1] - \mathbb{E}[\tilde{Y}_i(0) \vert D_i=1]}_{\textrm{Average Treatment on the Treated}}
+ \ \underbrace{\mathbb{E}[\tilde{Y}_i(0) \vert D_i=1] - \mathbb{E}[\tilde{Y}_i(0) \vert D_i=0]}_{\textrm{Selection Bias}}

Idea

Can we use pre-treatment data to approximate the selection bias?

Derivation

\mathbb{E}[Y_{i,t-1} \vert D_{i,t} = 1] - \mathbb{E}[Y_{i,t-1} \vert D_{i,t} = 0]
= \mathbb{E}[\tilde{Y}_{i,t-1}(0) \vert D_{i,t} = 1] - \mathbb{E}[\tilde{Y}_{i,t-1}(0) \vert D_{i,t} = 0]

(no one is treated at t-1, so the observed pre-period outcome is the untreated potential outcome)

Key Assumption

= \ \underbrace{\mathbb{E}[\tilde{Y}_{i,t}(0) \vert D_{i,t}=1] - \mathbb{E}[\tilde{Y}_{i,t}(0) \vert D_{i,t}=0]}_{\textrm{Selection Bias}}

Parallel Trends Interpretation

[Figure: average outcomes at t-1 and t for the treated group (blue) and the control group (purple)]

We observe this initial difference:

\textcolor{blue}{\mathbb{E}[Y_{i,t-1} \vert D_{i,t-1} = 1]} - \textcolor{purple}{\mathbb{E}[Y_{i,t-1} \vert D_{i,t-1} = 0]}

We also observe this difference:

\textcolor{blue}{\mathbb{E}[Y_{i,t} \vert D_{i,t} = 1]} - \textcolor{purple}{\mathbb{E}[Y_{i,t} \vert D_{i,t} = 0]}

But we know this captures

\textrm{ATT} + \textrm{Selection Bias}

Under parallel trends, the t-1 gap is the Estimated Selection Bias, and subtracting it from the t gap leaves the Estimated ATT

\underbrace{\mathbb{E}[Y_{i,t} \vert D_{i,t}=1] - \mathbb{E}[Y_{i,t} \vert D_{i,t}=0]}_{\textrm{ATT} \ + \ \textrm{Selection Bias}}
- \ \underbrace{\mathbb{E}[Y_{i,t-1} \vert D_{i,t-1}=1] - \mathbb{E}[Y_{i,t-1} \vert D_{i,t-1}=0]}_{\textrm{Selection Bias}}

Summary

With Controls

\underbrace{\mathbb{E}[Y_{i,t} \vert X_{it}, D_{i,t}=1] - \mathbb{E}[Y_{i,t} \vert X_{it}, D_{i,t}=0]}_{\textrm{ATT} \ + \ \textrm{Selection Bias}}
- \ \underbrace{\mathbb{E}[Y_{i,t-1} \vert X_{it},D_{i,t-1}=1] - \mathbb{E}[Y_{i,t-1} \vert X_{it}, D_{i,t-1}=0]}_{\textrm{Selection Bias}}

With controls, we're correcting for local selection bias

Conceptually, it doesn't make sense to include individual level fixed effects
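A minimal panel simulation of the basic (no-controls) estimator, where parallel trends holds by construction (unit heterogeneity, trend, and effect sizes are all made up):

```python
import numpy as np

rng = np.random.default_rng(4)
n = 50_000

alpha = rng.normal(0.0, 1.0, n)                          # unit-level heterogeneity
d = (alpha + rng.normal(0.0, 1.0, n) > 0).astype(int)    # treated units differ at baseline

y_pre = alpha + rng.normal(0.0, 1.0, n)                  # outcome at t-1
y_post = alpha + 0.5 + 1.0 * d + rng.normal(0.0, 1.0, n) # common trend 0.5, ATT = 1

gap_post = y_post[d == 1].mean() - y_post[d == 0].mean() # ATT + selection bias
gap_pre = y_pre[d == 1].mean() - y_pre[d == 0].mean()    # selection bias
print(f"DiD estimate: {gap_post - gap_pre:.3f} (true ATT = 1.000)")
```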

Instrumental Variables

We often cannot randomly assign treatment 

We cannot randomly assign having a lawyer because that requires consent and follow-through by the tenant

We randomize the next best thing, which is access to a free lawyer

Instrumental Variables

Ex: Interested in the impact of having a lawyer on eviction case outcomes

\textrm{Instrument:} \quad \textcolor{blue}{Z_i}:= \ \textrm{Access to a Lawyer}
\textrm{Treatment:} \quad \textcolor{purple}{D_i}:= \ \textrm{Legal Representation}
\textrm{Outcome:} \quad Y_i:= \ \textrm{Judgement of Possession}

If the Instrument is Randomly Assigned

At the population level, we observe two treatment effects

\mathbb{E}[D_i \vert Z_i=1] - \mathbb{E}[D_i \vert Z_i=0]

The impact of an offer of free legal representation on legal representation

\mathbb{E}[Y_i \vert Z_i=1] - \mathbb{E}[Y_i \vert Z_i=0]

The impact of an offer of free legal representation on Judgements of Possession

But we're not primarily interested in either of these two effects

We're interested in the following

\mathbb{E}[\tilde{Y}_i(\textcolor{purple}{1}) - \tilde{Y}_i(\textcolor{purple}{0})]

The impact of legal representation on Judgements of Possession

As we'll show, we can only observe the following

\mathbb{E}[\tilde{Y}_i(\textcolor{purple}{1}) - \tilde{Y}_i(\textcolor{purple}{0}) \vert S_i = \textrm{Complier}]

The impact of legal representation on Judgements of Possession for the compliers (LATE)

Classifying Individuals

\textrm{Never Taker} : \tilde{D}_i(0) = \tilde{D}_i(1) = 0
\textrm{Always Taker} : \tilde{D}_i(0) = \tilde{D}_i(1) = 1
\textrm{Compliers} : \tilde{D}_i(0) = 0, \ \tilde{D}_i(1) = 1
\textrm{Defiers} : \tilde{D}_i(0) = 1, \ \tilde{D}_i(1) = 0

We can capture the effect of lawyers on Judgements of Possession for this subset of the population

Derivation

\mathbb{E}[Y_i \vert Z_i=1] - \mathbb{E}[Y_i \vert Z_i=0]
= \mathbb{E}[\tilde{Y}_i(\textcolor{blue}{1}) - \tilde{Y}_i(\textcolor{blue}{0}) ]
= p(\textrm{NTaker} ) \underbrace{\mathbb{E}[\tilde{Y}_i(\textcolor{blue}{1}) - \tilde{Y}_i(\textcolor{blue}{0}) \vert \textrm{NTaker}]}_{\textcolor{green}{= 0} \ \textrm{(Exclusion Restriction)}}
+ p(\textrm{ATaker} ) \underbrace{\mathbb{E}[\tilde{Y}_i(\textcolor{blue}{1}) - \tilde{Y}_i(\textcolor{blue}{0}) \vert \textrm{ATaker}]}_{\textcolor{green}{= 0} \ \textrm{(Exclusion Restriction)}}
+ p(\textrm{Complier} ) \mathbb{E}[\tilde{Y}_i(\textcolor{blue}{1}) - \tilde{Y}_i(\textcolor{blue}{0}) \vert \textrm{Complier}]
+ \underbrace{p(\textrm{Defier} )}_{\textcolor{red}{= 0} \ \textrm{(assume this group doesn't exist)}} \mathbb{E}[\tilde{Y}_i(\textcolor{blue}{1}) - \tilde{Y}_i(\textcolor{blue}{0}) \vert \textrm{Defier}]

Continued

\mathbb{E}[D_i \vert Z_i=1] - \mathbb{E}[D_i \vert Z_i=0]
= \mathbb{E}[\tilde{D}_i(1) - \tilde{D}_i(0) ]
= p(\textrm{NTaker} ) \underbrace{\mathbb{E}[\tilde{D}_i(1) - \tilde{D}_i(0) \vert \textrm{NTaker}]}_{\textcolor{orange}{= 0} \ \textrm{(By Definition)}}
+ p(\textrm{ATaker} ) \underbrace{\mathbb{E}[\tilde{D}_i(1) - \tilde{D}_i(0) \vert \textrm{ATaker}]}_{\textcolor{orange}{= 0} \ \textrm{(By Definition)}}
+ p(\textrm{Complier} ) \underbrace{\mathbb{E}[\tilde{D}_i(1) - \tilde{D}_i(0) \vert \textrm{Complier}]}_{= 1 \ \textrm{(By Definition)}}
+ \underbrace{p(\textrm{Defier} )}_{\textcolor{red}{= 0} \ \textrm{(assume this group doesn't exist)}} \mathbb{E}[\tilde{D}_i(1) - \tilde{D}_i(0) \vert \textrm{Defier}]
= \textcolor{orange}{p(\textrm{Complier})}

Continued

\mathbb{E}[Y_i \vert Z_i=1] - \mathbb{E}[Y_i \vert Z_i=0] = p(\textrm{Complier} ) \mathbb{E}[\tilde{Y}_i(\textcolor{blue}{1}) - \tilde{Y}_i(\textcolor{blue}{0}) \vert \textrm{Complier}] \quad (1)

\frac{\mathbb{E}[Y_i \vert Z_i=1] - \mathbb{E}[Y_i \vert Z_i=0]}{p(\textrm{Complier} )} = \mathbb{E}[\tilde{Y}_i(\textcolor{blue}{1}) - \tilde{Y}_i(\textcolor{blue}{0}) \vert \textrm{Complier}] = \mathbb{E}[\tilde{Y}_i(\textcolor{purple}{1}) - \tilde{Y}_i(\textcolor{purple}{0}) \vert \textrm{Complier}]

(the last step uses the Exclusion Restriction: the offer affects outcomes only through representation)

\mathbb{E}[D_i \vert Z_i=1] - \mathbb{E}[D_i \vert Z_i=0] = p(\textrm{Complier} ) \quad (2)

(1), (2) \implies \frac{\mathbb{E}[Y_i \vert Z_i=1] - \mathbb{E}[Y_i \vert Z_i=0]}{\mathbb{E}[D_i \vert Z_i=1] - \mathbb{E}[D_i \vert Z_i=0]} = \mathbb{E}[\tilde{Y}_i(\textcolor{purple}{1}) - \tilde{Y}_i(\textcolor{purple}{0}) \vert \textrm{Complier}]

Summary

We're still interested in the effect that a lawyer has on Judgements of Possession

\mathbb{E}[\tilde{Y}_i(\textcolor{purple}{1}) - \tilde{Y}_i(\textcolor{purple}{0})]

\frac{\overbrace{\mathbb{E}[Y_i \vert Z_i=1] - \mathbb{E}[Y_i \vert Z_i=0]}^{\textrm{Intention-to-Treat}}}{\underbrace{\mathbb{E}[ D_i \vert Z_i=1] - \mathbb{E}[D_i \vert Z_i=0]}_{\textrm{First Stage}}} = \underbrace{\mathbb{E}[\tilde{Y}_i(\textcolor{purple}{1}) - \tilde{Y}_i(\textcolor{purple}{0}) \vert \textrm{Complier}]}_{\textrm{LATE}}
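A simulation sketch of this Wald ratio; the compliance shares and effect sizes below are made up, and there are no defiers by construction:

```python
import numpy as np

rng = np.random.default_rng(5)
n = 200_000

z = rng.integers(0, 2, n)                      # randomized offer of a free lawyer
u = rng.random(n)                              # compliance type (no defiers)
always = u < 0.1                               # 10% always takers
never = u > 0.7                                # 30% never takers
d = np.where(always, 1, np.where(never, 0, z)) # compliers follow the offer

y0 = rng.normal(0.0, 1.0, n)
tau = np.where(always, 0.5, np.where(never, 0.2, 1.0))   # complier effect = 1
y = y0 + tau * d

itt = y[z == 1].mean() - y[z == 0].mean()              # intention-to-treat
first_stage = d[z == 1].mean() - d[z == 0].mean()      # = p(complier)
print(f"Wald / LATE: {itt / first_stage:.3f} (true complier effect = 1.000)")
```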

Residualized Regression

Motivation

Up to this point in our class, we have emphasized a Non-parametric approach to Causal Inference

\tilde{Y}_i \perp D_i \vert X_i
\mathbb{E}_X\big[\mathbb{E}[Y_i \vert D_i=1, X_i] - \mathbb{E}[Y_i \vert D_i=0, X_i] \big]

But we see that in the papers we've read (Diamond et al. 2019; Chetty et al. 2016), Economists tend to fit linear models to the data

Focus

We want to try and understand what these linear models are capturing. It's certainly different from our nonparametric approach

The Linear Model

Y_i = \alpha + \beta_1 D_i + \beta_2^T X_i + \varepsilon_i

Y_i: \ \textrm{Outcome} \qquad D_i: \ \textrm{Treatment} \qquad X_i: \ \textrm{Controls}

Chetty 2016 (Example)

Y_i := \ \textrm{Use of Experimental Voucher for at least one year}
D_i := \ \textrm{Offer of Experimental Voucher}
X_i := \ \textrm{Site Fixed Effect}

The Linear Model

\textcolor{purple}{Y}_i = \alpha + \beta_1 \textcolor{blue}{D}_i + \beta_2^T X_i + \varepsilon_i

Interested in the effect that an offer of a voucher has on Neighborhood Poverty Rate

Chetty 2016

Statistics Question

\textrm{So what does} \ \beta_1 \ \textrm{capture?}
\beta_1 \neq \mathbb{E}_X\big[\mathbb{E}[Y_i \vert D_i=1, X_i] - \mathbb{E}[Y_i \vert D_i=0, X_i] \big]

Overview

1. Explain Residualized Regression

2. Show that it's a useful way to interpret Coefficients in Linear Models (including linear IV models!) (Helpful for reading papers)

3. Show that it's flexible enough to use "Text-based Controls" (Potentially an interesting Research Direction for your Final Paper)

Residualized Regression

1. \textrm{Predict Treatment based on Controls}
\hat{D}_i \approx \mathbb{E}[D_i \vert X_i]

2. \textrm{Take the Difference between Treatment and Predicted Treatment}
D_i - \hat{D}_i

3. \textrm{Regress the Outcome on this Difference}
Y_i = \beta_1(D_i - \hat{D}_i) + \eta_i

(Note: the outcome is regressed on the residual D_i - \hat{D}_i, not on \hat{D}_i itself!)

Residualized Regression (Chetty 2016)

1. \textrm{Predict Voucher Use based on Site Fixed Effects}

(Synthetic Data)

2. \textrm{Take the Difference between Treatment and Predicted Treatment}

(Based on Controls: Site Location)

D_i - \hat{D}_i: \quad \begin{bmatrix} 1.0 \\ 1.0 \\ 0.0 \\ 0.0 \\ 0.0 \\ 1.0 \\ 0.0 \end{bmatrix} - \begin{bmatrix} 0.15 \\ 0.3 \\ 0.12 \\ 0.15 \\ 0.3 \\ 0.12 \\ 0.12 \end{bmatrix} = \begin{bmatrix} 0.85 \\ 0.70 \\ -.12 \\ -.15 \\ -.30 \\ 0.88 \\ -.12 \end{bmatrix}

Residualized Regression (Chetty 2016)

\textrm{Regress the Outcome on this Difference}
3
Y_i = \beta_1(D_i - \hat{D}_i) + \eta_i

Notice that Treated Individuals have Positive Residuals!

Does this make sense to you?

Key Takeaway

The Coefficient of Interest in the Linear Model is the same as the coefficient in the Residualized Model

Y_i = \alpha + \beta_1 D_i + \beta_2^T X_i + \varepsilon_i
\iff
Y_i = \beta_1(D_i - \hat{D}_i) + \eta_i, \quad \textrm{where} \ \hat{D}_i = \hat{\gamma}_1 + \hat{\gamma}_2X_i
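This equivalence (the Frisch-Waugh-Lovell theorem) is easy to verify numerically; a sketch on synthetic data with one control:

```python
import numpy as np

rng = np.random.default_rng(6)
n = 10_000

x = rng.normal(0.0, 1.0, n)                            # a single control
d = (0.8 * x + rng.normal(0.0, 1.0, n) > 0).astype(float)
y = 1.0 * d + 2.0 * x + rng.normal(0.0, 1.0, n)

# (a) Full regression: Y on [1, D, X].
X_full = np.column_stack([np.ones(n), d, x])
beta_full = np.linalg.lstsq(X_full, y, rcond=None)[0]

# (b) Residualized: regress D on [1, X], then Y on the residual D - D_hat.
X_ctrl = np.column_stack([np.ones(n), x])
d_hat = X_ctrl @ np.linalg.lstsq(X_ctrl, d, rcond=None)[0]
resid = d - d_hat
beta_resid = (resid @ y) / (resid @ resid)

print(beta_full[1], beta_resid)   # the two coefficients agree
```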

Conceptual Understanding

1. When we're running linear regression, we are regressing the outcome variable on differences between the treatment and the expected treatment (where the expected treatment is a linear function of the controls).

2. If treatment is as good as randomly assigned conditional on the controls, then it's essentially random who has a positive residual and who has a negative residual. The only difference is that individuals with positive residuals received treatment.

3. Therefore the relationship between the outcome variable and the residuals captures a relationship between the outcome and the treatment variable that isn't contaminated by selection bias.

Extension

Partially Linear Models

Y_i = \beta_1(\underbrace{D_i - \mathbb{E}[D_i \vert X_i]}_{\textrm{difference between treatment and the predicted treatment based only on the controls}}) + \eta_i

Linear Models

Y_i = \alpha + \beta_1 D_i + \beta_2^T X_i + \varepsilon_i
\iff
Y_i = \beta_1(D_i - \hat{D}_i) + \eta_i, \quad \hat{D}_i = \hat{\gamma}_1 + \hat{\gamma}_2X_i

The Linear IV Model

\textrm{First Stage:} \quad D_i = \gamma_0 + \gamma_1 X_i + \gamma_2 Z_i + \varepsilon_i
\textrm{Second Stage:} \quad Y_i = \alpha + \beta_1 \hat{D}_i + \beta_2 X_i + \varepsilon_i

How to Interpret the Coefficient of Interest

Y_i = \beta_1(\hat{D}_i - \bar{\hat{D}}_i) + \eta_i, \quad \textrm{where} \ \bar{\hat{D}}_i = \hat{\phi}_1 + \hat{\phi}_2X_i

i.e., in population terms,

Y_i = \beta_1\big(\mathbb{E}[D_i \vert X_i, Z_i] - \mathbb{E}[D_i \vert X_i]\big) + \eta_i
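A sketch of this interpretation on synthetic data, approximating \mathbb{E}[D_i \vert X_i, Z_i] and \mathbb{E}[D_i \vert X_i] with linear projections (the data-generating process is invented; the confounder u is what makes plain OLS biased here):

```python
import numpy as np

rng = np.random.default_rng(7)
n = 200_000

x = rng.normal(0.0, 1.0, n)                    # control
z = rng.integers(0, 2, n).astype(float)        # instrument
u = rng.normal(0.0, 1.0, n)                    # unobserved confounder
d = (x + z + u + rng.normal(0.0, 1.0, n) > 1).astype(float)
y = 1.0 * d + 2.0 * x + u + rng.normal(0.0, 1.0, n)

def ols_fit(X, t):
    return np.linalg.lstsq(X, t, rcond=None)[0]

XZ = np.column_stack([np.ones(n), x, z])
X1 = np.column_stack([np.ones(n), x])
d_hat = XZ @ ols_fit(XZ, d)                    # approx E[D | X, Z]
d_bar = X1 @ ols_fit(X1, d_hat)                # approx E[D | X]
v = d_hat - d_bar                              # the "extra" variation from Z
beta_iv = (v @ y) / (v @ v)                    # same beta_1 as 2SLS with controls
print(f"IV estimate: {beta_iv:.3f} (true effect of D = 1.000)")
```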

Example: Humphries et al. (2024)

\text{Evict}_{it} = \alpha_t + \alpha_{l(i)} + \alpha_{\tau(i,t)} + \sum_b \beta_b 1_{\{Bal_{it} = b\}} + \gamma B_{it} + \delta B_{it} 1_{\{Tenure_{it} > 12\}} + \epsilon_{it}

\text{Evict}_{it}: \ \textrm{Evicted} \qquad \alpha_t: \ \textrm{Time} \qquad \alpha_{l(i)}: \ \textrm{Landlord} \qquad \alpha_{\tau(i,t)}: \ \textrm{Tenure} \qquad 1_{\{Bal_{it} = b\}}: \ \textrm{Balance} \qquad B_{it}: \ \textrm{Months Behind}
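A hypothetical sketch of how one might assemble a specification of this shape as a linear probability model on synthetic data. This is not Humphries et al.'s actual code or data; every variable below is an invented stand-in, with fixed effects entering as dummies:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(8)
n = 5_000

# Synthetic tenant-month panel; every variable is an invented stand-in.
df = pd.DataFrame({
    "month": rng.integers(0, 24, n),           # for alpha_t
    "landlord": rng.integers(0, 50, n),        # for alpha_l(i)
    "tenure_bin": rng.integers(0, 4, n),       # for alpha_tau(i,t)
    "bal_bin": rng.integers(0, 5, n),          # balance bins
    "months_behind": rng.poisson(1.0, n),      # B_it
})
df["long_tenure"] = (df["tenure_bin"] >= 2).astype(float)  # stand-in for Tenure > 12
df["evict"] = rng.binomial(1, 0.05, n).astype(float)

# Fixed effects as dummy columns; the last two columns play the roles of gamma
# (months behind) and delta (its interaction with long tenure).
X = pd.get_dummies(df[["month", "landlord", "tenure_bin", "bal_bin"]].astype(str),
                   drop_first=True).astype(float)
X["months_behind"] = df["months_behind"]
X["mb_x_long_tenure"] = df["months_behind"] * df["long_tenure"]
X.insert(0, "const", 1.0)

coef, *_ = np.linalg.lstsq(X.to_numpy(), df["evict"].to_numpy(), rcond=None)
print(dict(zip(X.columns[-2:], np.round(coef[-2:], 4))))
```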