"Readings"
Video: x [3m21s]
Activity: TBD
PRE-CLASS
CLASS
TRIVIAL COOPERATION: shared goals and information - pick a goal and execute
Hobbes' observation: scarcity + similar agents → competition; life is brutish and short.
Hobbes' fix: cede sovereignty to boss with credible enforcement. Command→order
Command Failure Mode (preferences)
Agents retain autonomy→effort substitution & selective obedience
Command Failure Mode (information)
Orders incomplete & ambiguous, environments shift.
Principals and Agents
From commands to contracts. Alignment by design: selection, monitoring, incentives to align the agent's autonomy with the principal's goals.
Agent as RL learner.
Naked RL is a clean micro-model: the agent updates a policy to maximize rewards.
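A minimal sketch of the "naked RL" micro-model, assuming a simple epsilon-greedy bandit as the learner. The function name `run_bandit`, the reward values, and the noise level are illustrative assumptions, not part of the lecture material. It only shows the core loop: act, observe reward, update the policy toward whatever pays.

```python
import random

def run_bandit(true_rewards, steps=2000, epsilon=0.1, seed=0):
    """Epsilon-greedy agent: keeps a running reward estimate per action
    and mostly picks the action with the highest estimate."""
    rng = random.Random(seed)
    n = len(true_rewards)
    estimates = [0.0] * n   # the agent's current beliefs about each action
    counts = [0] * n        # how often each action has been tried
    for _ in range(steps):
        if rng.random() < epsilon:
            action = rng.randrange(n)                          # explore
        else:
            action = max(range(n), key=lambda a: estimates[a])  # exploit
        # Hypothetical environment: mean reward plus small Gaussian noise.
        reward = true_rewards[action] + rng.gauss(0.0, 0.1)
        counts[action] += 1
        # Incremental mean update of the estimate for the chosen action.
        estimates[action] += (reward - estimates[action]) / counts[action]
    return estimates, counts

estimates, counts = run_bandit([0.2, 0.5, 0.9])
best_action = max(range(len(estimates)), key=lambda a: estimates[a])
print("learned best action:", best_action)
```

The agent has no notion of what the principal wants; it converges on whichever action the reward signal favors, which is the point the next items build on.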
Goodhart risk: when m(·) omits what drives V(·), maximizing T(m(a)) reduces Us. Gaming, reward hacking, short-termism.
Requires governance and guardrails: lagged, hard-to-game proxies, HITL overrides, team rewards, culture, the "alignment stack".
Incentives are transfers on signals.
As soon as behavior is driven by T(m(a)), the problem is no longer obedience; it's measurement.
But T(m(a)) is always a lossy compression of what matters.
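A toy illustration of the measurement problem, using the notation T(m(a)) and V(·) from the notes. Everything concrete here is an assumption made for the sketch: the two-task effort split, the effort budget of 10, a measure that sees only one task, a linear transfer, and a true value that needs both tasks.

```python
BUDGET = 10  # hypothetical total effort the agent can allocate

def measure(effort_measured, effort_unmeasured):
    """m(a): the signal only captures the measured task."""
    return effort_measured

def transfer(signal, bonus_rate=1.0):
    """T(m): a linear bonus paid on the signal."""
    return bonus_rate * signal

def true_value(effort_measured, effort_unmeasured):
    """V(a): assumed here to require both tasks (the weaker one binds)."""
    return min(effort_measured, effort_unmeasured)

# An action a is a split of the budget across the two tasks.
actions = [(e, BUDGET - e) for e in range(BUDGET + 1)]

# The agent maximizes its transfer; the principal would maximize V.
agent_choice = max(actions, key=lambda a: transfer(measure(*a)))
principal_ideal = max(actions, key=lambda a: true_value(*a))

print("agent picks", agent_choice, "with V =", true_value(*agent_choice))
print("principal wants", principal_ideal, "with V =", true_value(*principal_ideal))
```

Under these assumptions the agent pours all effort into the measured task and the true value collapses to zero, even though the agent is fully "obedient" to the incentive it was given.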