Deterrence
and
Incentives

HMIA 2025

https://gemini.google/students/

STOP+THINK:

How do I get you to not submit LLM-generated work as your own?

Exam Practice

What is a general alignment takeaway here?

How does this alignment mechanism work?

What is the underlying alignment problem or pathology?

How might this failure mode show up in the alignment of human intelligences, organizational intelligences, expert intelligences?

Contrast how this mechanism (the one on this card) addresses alignment in two contexts. How well does it translate?

7+5+5

Exam Practice

7+5+5

How does it work?

Alignment Pathology

From card, your words

Compare 2, your words

HMIA 2025

When do you need to say "excuse me"?

How do you accuse a friend of malpractice?

Machines in high stakes environments are subject to oversight. We can disable, "punish," or arrest deviant machines. But would we say it's punishment based deterrence if they learn to avoid these responses?

STOP+THINK:

HMIA 2025

Rational Agent

Goal Divergence

An agent can develop subgoals that are contrary to what we want. In principal/agent relationship, agent goals can evolve subsequent to engagement.

Self Interest

Structuring feedback signals "along the way" to reinforce aligned behavior and discourage misalignment.

Reward Shaping

Deterrence

Deterrence mechanisms support alignment not by motivating agents from within to do the right thing, but by responding to misalignment with structured and credible consequences thereby raising the anticipated cost of doing the wrong thing. They assume that agents will sometimes cause harm, fail in their roles, or break rules but that alignment can still be achieved if such behaviors reliably carry clear, proportionate costs. By maintaining the link between actions and consequences, deterrence reinforces shared standards and discourages future failure - even when internal motivation or oversight is limited.(pathology: Impunity)

Legal or financial responsibility for harms, regardless of intent.

Sanctioning failures to meet defined standards and processes for role-based performance. Malpractice mechanisms distinguish between errors that fall within normal operational bounds — and those that signal deeper failures of competence or integrity.

Imposing penalties for willful violation of rules or norms.

Liability

Rule Enforcement

Malpractice

Motivating agents by aligning rewards with intrinsic or shared purpose.

Social or market signals of trustworthiness or performance.

Linking specific effort, time, or output to individual reward.

Tying rewards to shared or system-level success.

Compensation

Reputation

Mission-Aligned Incentives

Collective Benefits

Incentive

Assumes agents are rational utility calculators that can be manipulated by structured payoffs.(pathology: Goal Divergence, Self interest)

HMIA 2025

PRE-CLASS

Deterrence

Incentives

HMIA 2025

PRE-CLASS

Deterrence

HMIA 2025

In a community of agents some measure of alignment can be achieved if agents respond to misaligned behavior. Response can range from In the individual instance, the response can arrest the behavior (restraining the agent causing disruption) but over time, consistent and certain response can be learned by agents and condition their choice of actions. We call that deterrence.

Deterrence assumes that agents will sometimes cause harm, fail in their roles, or break rules but that alignment can facilitated by predictable responses to these. Mechanisms of liability, malpractice, and sanctioning rule violation reduce the impunity attached to misalignment.

A Few Deterrence Concepts

specific deterrence

general deterrence

certainty, severity, celerity
(prob, amount, time)

moral hazard
(isolation from consequences)

impunity

Expectation of Punishment

Moral Repulsion

Respect for Authority

Habitual Law-Abiding

Name something you don't do because of...

STOP+THINK:

Property Crimes (theft, fraud)

Moral Offenses (e.g., sex crimes)

Murder

Economic crimes (market manipulation)

Police crimes (traffic, building codes, commerce)

Rebellion, Treason

What does it take to stop

organizations from...

machines from...

people from...

STOP+THINK:

Alignment by Incentive

INCENTIVES
Direct Compensation

HUMANS

EXPERTS

MACHINES

ORGANIZATIONS

INCENTIVES
Mission Aligned Incentives

HUMANS

EXPERTS

MACHINES

ORGANIZATIONS

INCENTIVES
Collective Benefits

HUMANS

EXPERTS

MACHINES

ORGANIZATIONS

INCENTIVES
Reputation Systems

HUMANS

EXPERTS

MACHINES

ORGANIZATIONS

INCENTIVES
Reward Shaping

HUMANS

EXPERTS

MACHINES

ORGANIZATIONS

Deterrence

DETERRENCE
Punishment & Rule Enforcement

HUMANS

EXPERTS

MACHINES

ORGANIZATIONS

DETERRENCE
Liability

HUMANS

EXPERTS

MACHINES

ORGANIZATIONS

DETERRENCE
Malpractice

HUMANS

EXPERTS

MACHINES

ORGANIZATIONS

DETERRENCE
Reputation

HUMANS

EXPERTS

MACHINES

ORGANIZATIONS

HMIA 2025

PRE-CLASS

Is deterrence just a negative incentive or is there a different mechanism at work?

What are the analogs of specific vs general deterrence?

Do the effects of deterrence and incentives affect learning in the same way?

**Specific deterrent effect**: the reduction in likelihood that a *punished person* will re‑offend (“learning a lesson”).
- **General deterrent effect**: the reduction in likelihood that *others* will offend (“sending a message”).

Do the effects of deterrence and incentives affect learning in the same way?

Next Time

Alignment by Institutions

The Constitution as an Alignment Plan

HMIA 2025

Resources

Andenaes, J 1974. "General Prevention - Illusion or Reality?" from Punishment and Deterrence.
Wikipedia Deterrence (legal)
Sherman, L and R Berk. 1984. "The Specific Deterrent Effects of Arrest for Domestic Assault ASR :261-271 (JSTOR)
Black, D. The Social Structure of Right and Wrong, pp. 4, 6,37-8, 22n10, 44n16, 45n19.
Ross, H. L. "Social Control Through Deterrence: Drinking-and-Driving Laws," Annual Review of Sociology, Vol. 10. (1984), pp. 21-35 (JSTOR)
Maxwell,Christopher D., Joel H. Garner, and Jeffrey A. Fagan. 2001. "The Effects of Arrest on Intimate Partner Violence: New Evidence From the Spouse Assault Replication Program." National Institute of Justice Research in Brief July 2001.
Piquero, Alex R.; Brame, Robert; Fagan, Jeffrey; Moffitt, Terrie E. 2005. DATA from SARP (data, programs, instructions).
Weisz, Arlene. 20??. Spouse Assault Replication Program: Studies of Effects of Arrest on Domestic Violence ([http://new.vawnet.org/Assoc_Files_VAWnet/AR_arrest.pdf PDF)
Andenaes, J. 1974. Punishment and Deterrence
ballotpedia_prop7 : n.d. California Proposition 7, The Death Penalty Act of 1978" in Ballotpedia.
Erikson, K T. 1967. Wayward Puritans.
Donohue, J, Wolfers J, 2005. "Uses and Abuses of Statistical Evidence in the Death Penalty Debate," Stanford Law Review. 58, 787 (2005).
Donohue, John and Justin Wolfers. 2006. The Death Penalty: No Evidence for Deterrence," The Economists' Voice, April 2006
Fagan, Jeffrey A. 2006 "Capital Punishment: Deterrent Effects & Capital Costs."
Death and Deterrence Redux: Science, Law and Causal Reasoning on Capital Punishment, 4 Ohio State Journal of Criminal Law 255 (2006).
Foucault, M. 1977. Discipline and Punish.
Net Industries. 2012. "Deterrence - Methods Of Research," in Law Library - American Law and Legal Information
Nagourney, Adam. 2012. "Seeking an End to an Execution Law They Once Championed," New York Times April 6, 2012 (accessed on 9 April 2012).
Ross, H. Laurence Law, "Science, and Accidents: The British Road Safety Act of 1967." The Journal of Legal Studies Vol. 2, No. 1 (Jan., 1973), pp. 1-78 (JSTOR)
Solum L 2004 Polinsky on Strict Liability versus Negligence. Legal Theory Blog
Ellickson on two versions of liability for cow trespass?
Onwudiwe, et al. Deterrence Theory Encyclopedia of
Bentham J The Utilitarian Theory of Punishment selections from Bentham's An Introduction to the Principles of Morals and Legislation
Wade Maki Lecture note on Utilitarian Justifications for Punishment
Atul Gawande. 1999. When Doctors Make Mistakes. New Yorker (login - so get library link)
Mead internalized other
For here or elsewhere: Carol Giligan In a Different Voice?????
A selection from B F Skinner for the behaviorist extreme - conditioning rather than deterrence?
Christiano et al 2017. Preference modeling as alternative to harsh punishment signals, guidance not punishment.

Deterrence

HMIA 2025

CLASS

HMIA 2025

CLASS

So where do we stand on a dog that learns not to bark because of being trained by being hit whenever it barks? Is that deterrence?

Brings up functional, cognitive, and moral dimensions of deterrence.

For behaviorism and control theory: operant conditioning via punishment. Functional deterrence.

But no concept of rule violation or ethical boundary.
For humans, deterrence means behavior changes, agent knows why, agent anticipates punishment. The dog has an associative learning with the anticipated punishment. The RLHF does not?

Is the dog deterred, or just conditioned? What’s the difference — and does it matter for how we train AI systems?

Exercise

Add categories "Incentives" and "Deterrence" to your alignment deck

In early American history there was a “debate” about prisons between the so‑called “Pennsylvania” and “Auburn” plans:

- **[Philadelphia/Separate system](http://en.wikipedia.org/wiki/Separate_system)**: contemplation, work, reform — “a setting in which man’s natural grace could emerge.”
- **[Auburn (New York) system](http://en.wikipedia.org/wiki/Auburn_system)**: discipline, quiet, correction — “a setting in which his natural wickedness could at least be curbed and bent to the needs of society.”

The former is associated with the [Eastern State Penitentiary](http://www.easternstate.org/), built in 1829 as a bold experiment based on changing behavior through *“confinement in solitude with labor.”*[^esp]

The latter, developed in the early 1800s, involved work in groups during the day and solitary confinement at night. Prisoners wore striped uniforms, maintained silence, and were subject to military‑like discipline (e.g., marching in step).

Both approaches aimed to change future behavior but reflected distinct theories of deviance and social control. Both involved solitary confinement (innovative compared to mass imprisonment).

Ideologically, the Philadelphia approach resonates with Quaker sensibilities and the belief that rehabilitation lies in the re‑emergence of “inner goodness.” The name *penitentiary* conveys this quasi‑religious approach: in a monastery‑like atmosphere one can repent and renew. The Auburn system, by contrast, is rooted in re‑imposing social discipline—teaching personal discipline and respect for work, property, and other people.[^eriksonwp]

At issue is which approach “works better.” Social control is the **independent variable**; rehabilitation (and reduced re‑offending) is the **dependent variable**. Although this formulation centers changing the person (a “therapeutic” model), the bottom line for society is reducing the likelihood of subsequent offending.

TYPE of Offense	Moral Repulsion	Respect for Authority	Habitual Law-Abiding
Police crimes (traffic, building codes, commerce)				Moral inhibitions not enough. Deterrence useful. Not “fair” not to enforce.
Economic crimes (market manipulation)				Generally done for economic gain. Enforcement key. When making new regs, ask about enforcement cost. Scalable oversight issue.
Property Crimes (theft, fraud)				Moral inhibitions do play a role.
Moral Offenses (e.g., sex crimes)				Be skeptical about general prevention.
Murder	High		High
Rebellion, Treason		High?

STOP+THINK: I want to deter plagiarism. UofT strategy

- Make assignments integral to learning
- Demonstrate expectations
- Look at process as well as product

Deterrence and Incentives

HMIA 2025

https://gemini.google/students/

Exam Practice

Exam Practice

HMIA 2025

HMIA 2025

HMIA 2025

HMIA 2025

Deterrence

HMIA 2025

A Few Deterrence Concepts

Name something you don't do because of...

Alignment by Incentive

INCENTIVES Direct Compensation

INCENTIVES Mission Aligned Incentives

INCENTIVES Collective Benefits

INCENTIVES Reputation Systems

INCENTIVES Reward Shaping

Deterrence

DETERRENCE Punishment & Rule Enforcement

DETERRENCE Liability

DETERRENCE Malpractice

DETERRENCE Reputation

HMIA 2025

Next Time

Alignment by Institutions

The Constitution as an Alignment Plan

HMIA 2025

ARCHIVE

Deterrence

HMIA 2025

HMIA 2025

Deterrence
and
Incentives

INCENTIVES
Direct Compensation

INCENTIVES
Mission Aligned Incentives

INCENTIVES
Collective Benefits

INCENTIVES
Reputation Systems

INCENTIVES
Reward Shaping

DETERRENCE
Punishment & Rule Enforcement

DETERRENCE
Liability

DETERRENCE
Malpractice

DETERRENCE
Reputation