STOP+THINK:
How do I get you to not submit LLM-generated work as your own?
What is a general alignment takeaway here?
How does this alignment mechanism work?
What is the underlying alignment problem or pathology?
How might this failure mode show up in the alignment of human intelligences, organizational intelligences, expert intelligences?
Contrast how this mechanism (the one on this card) addresses alignment in two contexts. How well does it translate?
7+5+5
7+5+5
How does it work?
Alignment Pathology
From card, your words
Compare 2, your words
When do you need to say "excuse me"?
How do you accuse a friend of malpractice?
Machines in high stakes environments are subject to oversight. We can disable, "punish," or arrest deviant machines. But would we say it's punishment based deterrence if they learn to avoid these responses?
STOP+THINK:
Rational Agent
Goal Divergence
An agent can develop subgoals that are contrary to what we want. In principal/agent relationship, agent goals can evolve subsequent to engagement.
Self Interest
Structuring feedback signals "along the way" to reinforce aligned behavior and discourage misalignment.
Reward Shaping
Deterrence
Deterrence mechanisms support alignment not by motivating agents from within to do the right thing, but by responding to misalignment with structured and credible consequences thereby raising the anticipated cost of doing the wrong thing. They assume that agents will sometimes cause harm, fail in their roles, or break rules but that alignment can still be achieved if such behaviors reliably carry clear, proportionate costs. By maintaining the link between actions and consequences, deterrence reinforces shared standards and discourages future failure - even when internal motivation or oversight is limited.(pathology: Impunity)
Legal or financial responsibility for harms, regardless of intent.
Sanctioning failures to meet defined standards and processes for role-based performance. Malpractice mechanisms distinguish between errors that fall within normal operational bounds — and those that signal deeper failures of competence or integrity.
Imposing penalties for willful violation of rules or norms.
Liability
Rule Enforcement
Malpractice
Motivating agents by aligning rewards with intrinsic or shared purpose.
Social or market signals of trustworthiness or performance.
Linking specific effort, time, or output to individual reward.
Tying rewards to shared or system-level success.
Compensation
Reputation
Mission-Aligned Incentives
Collective Benefits
Incentive
Assumes agents are rational utility calculators that can be manipulated by structured payoffs.(pathology: Goal Divergence, Self interest)
PRE-CLASS
PRE-CLASS
In a community of agents some measure of alignment can be achieved if agents respond to misaligned behavior. Response can range from In the individual instance, the response can arrest the behavior (restraining the agent causing disruption) but over time, consistent and certain response can be learned by agents and condition their choice of actions. We call that deterrence.
Deterrence assumes that agents will sometimes cause harm, fail in their roles, or break rules but that alignment can facilitated by predictable responses to these. Mechanisms of liability, malpractice, and sanctioning rule violation reduce the impunity attached to misalignment.
specific deterrence
general deterrence
certainty, severity, celerity
(prob, amount, time)
moral hazard
(isolation from consequences)
impunity
Expectation of Punishment
Moral Repulsion
Respect for Authority
Habitual Law-Abiding
STOP+THINK:
Property Crimes (theft, fraud)
Moral Offenses (e.g., sex crimes)
Murder
Economic crimes (market manipulation)
Police crimes (traffic, building codes, commerce)
Rebellion, Treason
What does it take to stop
organizations from...
machines from...
people from...
STOP+THINK:
HUMANS
EXPERTS
MACHINES
ORGANIZATIONS
HUMANS
EXPERTS
MACHINES
ORGANIZATIONS
HUMANS
EXPERTS
MACHINES
ORGANIZATIONS
HUMANS
EXPERTS
MACHINES
ORGANIZATIONS
HUMANS
EXPERTS
MACHINES
ORGANIZATIONS
HUMANS
EXPERTS
MACHINES
ORGANIZATIONS
HUMANS
EXPERTS
MACHINES
ORGANIZATIONS
HUMANS
EXPERTS
MACHINES
ORGANIZATIONS
HUMANS
EXPERTS
MACHINES
ORGANIZATIONS
PRE-CLASS
Is deterrence just a negative incentive or is there a different mechanism at work?
What are the analogs of specific vs general deterrence?
Do the effects of deterrence and incentives affect learning in the same way?
**Specific deterrent effect**: the reduction in likelihood that a *punished person* will re‑offend (“learning a lesson”).
- **General deterrent effect**: the reduction in likelihood that *others* will offend (“sending a message”).
Do the effects of deterrence and incentives affect learning in the same way?
Resources
Andenaes, J. 1974. Punishment and Deterrence
Erikson, K T. 1967. Wayward Puritans.
Donohue, J, Wolfers J, 2005. "Uses and Abuses of Statistical Evidence in the Death Penalty Debate," Stanford Law Review. 58, 787 (2005).
Donohue, John and Justin Wolfers. 2006. The Death Penalty: No Evidence for Deterrence," The Economists' Voice, April 2006
Fagan, Jeffrey A. 2006 "Capital Punishment: Deterrent Effects & Capital Costs."
Death and Deterrence Redux: Science, Law and Causal Reasoning on Capital Punishment, 4 Ohio State Journal of Criminal Law 255 (2006).
Foucault, M. 1977. Discipline and Punish.
Net Industries. 2012. "Deterrence - Methods Of Research," in Law Library - American Law and Legal Information
Nagourney, Adam. 2012. "Seeking an End to an Execution Law They Once Championed," New York Times April 6, 2012 (accessed on 9 April 2012).
Ross, H. Laurence Law, "Science, and Accidents: The British Road Safety Act of 1967." The Journal of Legal Studies Vol. 2, No. 1 (Jan., 1973), pp. 1-78 (JSTOR)
Solum L 2004 Polinsky on Strict Liability versus Negligence. Legal Theory Blog
Ellickson on two versions of liability for cow trespass?
Onwudiwe, et al. Deterrence Theory Encyclopedia of
Bentham J The Utilitarian Theory of Punishment selections from Bentham's An Introduction to the Principles of Morals and Legislation
Wade Maki Lecture note on Utilitarian Justifications for Punishment
Atul Gawande. 1999. When Doctors Make Mistakes. New Yorker (login - so get library link)
Mead internalized other
For here or elsewhere: Carol Giligan In a Different Voice?????
A selection from B F Skinner for the behaviorist extreme - conditioning rather than deterrence?
CLASS
CLASS
So where do we stand on a dog that learns not to bark because of being trained by being hit whenever it barks? Is that deterrence?
Brings up functional, cognitive, and moral dimensions of deterrence.
For behaviorism and control theory: operant conditioning via punishment. Functional deterrence.
But no concept of rule violation or ethical boundary.
For humans, deterrence means behavior changes, agent knows why, agent anticipates punishment. The dog has an associative learning with the anticipated punishment. The RLHF does not?
Is the dog deterred, or just conditioned? What’s the difference — and does it matter for how we train AI systems?
Exercise
Add categories "Incentives" and "Deterrence" to your alignment deck
In early American history there was a “debate” about prisons between the so‑called “Pennsylvania” and “Auburn” plans:
- **[Philadelphia/Separate system](http://en.wikipedia.org/wiki/Separate_system)**: contemplation, work, reform — “a setting in which man’s natural grace could emerge.”
- **[Auburn (New York) system](http://en.wikipedia.org/wiki/Auburn_system)**: discipline, quiet, correction — “a setting in which his natural wickedness could at least be curbed and bent to the needs of society.”
The former is associated with the [Eastern State Penitentiary](http://www.easternstate.org/), built in 1829 as a bold experiment based on changing behavior through *“confinement in solitude with labor.”*[^esp]
The latter, developed in the early 1800s, involved work in groups during the day and solitary confinement at night. Prisoners wore striped uniforms, maintained silence, and were subject to military‑like discipline (e.g., marching in step).
Both approaches aimed to change future behavior but reflected distinct theories of deviance and social control. Both involved solitary confinement (innovative compared to mass imprisonment).
Ideologically, the Philadelphia approach resonates with Quaker sensibilities and the belief that rehabilitation lies in the re‑emergence of “inner goodness.” The name *penitentiary* conveys this quasi‑religious approach: in a monastery‑like atmosphere one can repent and renew. The Auburn system, by contrast, is rooted in re‑imposing social discipline—teaching personal discipline and respect for work, property, and other people.[^eriksonwp]
At issue is which approach “works better.” Social control is the **independent variable**; rehabilitation (and reduced re‑offending) is the **dependent variable**. Although this formulation centers changing the person (a “therapeutic” model), the bottom line for society is reducing the likelihood of subsequent offending.
.
|
TYPE of Offense |
Deterrence certainty/severity/celerity |
Moral Repulsion | Respect for Authority | Habitual Law-Abiding | |
|---|---|---|---|---|---|
| Police crimes (traffic, building codes, commerce) | Moral inhibitions not enough. Deterrence useful. Not “fair” not to enforce. | ||||
| Economic crimes (market manipulation) | Generally done for economic gain. Enforcement key. When making new regs, ask about enforcement cost. Scalable oversight issue. | ||||
| Property Crimes (theft, fraud) | Moral inhibitions do play a role. | ||||
| Moral Offenses (e.g., sex crimes) | Be skeptical about general prevention. | ||||
| Murder | High | High | |||
| Rebellion, Treason | High? |
STOP+THINK: I want to deter plagiarism. UofT strategy
- Make assignments integral to learning
- Demonstrate expectations
- Look at process as well as product