Hierarchy

HMIA 2025

TRIVIAL COOPERATION: shared goals and information - pick a goal and execute

Hobbes' observation: scarcity + similar agents → competition; life is nasty, brutish, & short

Hobbes' fix: cede sovereignty to boss with credible enforcement. Command→order

Command Failure Mode (preferences)
Agents retain autonomy→effort substitution & selective obedience

Command Failure Mode (information)
Orders incomplete & ambiguous, environments shift.

Principals and Agents

From commands to contracts. Alignment by design: selection, monitoring, incentives to align autonomy with principal's goals.

Agent as RL learner.
Naked RL is a clean micro-model: the agent updates a policy to maximize rewards.

Goodhart risk: if m(·) omits what drives V(·), maximizing T(m(a)) reduces U_P. Gaming, reward hacking, short-termism.

Hierarchy requires governance and guardrails: lagged, hard-to-game proxies; human in the loop (HITL) overrides; team rewards; culture; the whole "alignment stack."

Incentives are transfers on signals.

\text{Let measurements be } m(a_A); \text{ incentives } T(m(a_A)) \\ \text{Agent utility: } U_A(a) = T(m(a)) - C_A(a) + I_A(a) \\ \text{Principal utility: } U_P(a) = V(a) - T(m(a)) - K_P(a)

As soon as behavior is driven by T(m(a)), the problem is no longer obedience; it's measurement.

But T(m(a)) is always a lossy compression of what matters.


TRIVIAL COOPERATION: shared goals and information - pick a goal and execute

Hobbes' observation

scarcity + similar agents → competition

life is nasty, brutish, & short


Hobbes' Fix

cede sovereignty to boss with credible enforcement.

Command→order


Command Failure Mode (preferences)

Agents retain autonomy→effort substitution & selective obedience

Command Failure Mode (information)

Orders incomplete & ambiguous, environments shift.

Obedience Failure Modes

"THE" Principals and Agents Problem

Suppose P needs some help realizing their goals and objectives

U_P

A could provide help, but P and A care about different things

U_P \ne U_A

Read: A does not get the same thing out of helping P that P gets out of being helped.

From commands to contracts. Alignment by design: selection, monitoring, incentives to align autonomy with principal's goals.

Principal's Challenge: design an intervention that aligns their utilities

You want "do the work I need" to be agent's "best option"

\arg\max_{a_A} U_A(a_A) \longrightarrow \arg\max_{a_A} U_P(a_A)


To get this agent to work for us, we need to sweeten things

An ordinary RL agent as our model

Agent does action a in state s and garners reward - has some utility

U_A(a_A)

Add a transfer based on the behavior

U'_A(a_A) = U_A(a_A) + T(a_A)
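
A minimal sketch of how a transfer can move the agent's best option onto the principal's. The three actions and every payoff number below are hypothetical, chosen only to make the argmax shift visible; nothing here comes from a real dataset.

```python
# Hypothetical actions and payoffs, for illustration only.
ACTIONS = ["shirk", "do_own_thing", "do_the_work"]

U_A = {"shirk": 3.0, "do_own_thing": 4.0, "do_the_work": 1.0}   # agent's own utility
U_P = {"shirk": 0.0, "do_own_thing": 1.0, "do_the_work": 10.0}  # principal's utility
T = {"shirk": 0.0, "do_own_thing": 0.0, "do_the_work": 4.0}     # transfer paid per action


def best_action(utility):
    """Return the action with the highest utility (the argmax)."""
    return max(ACTIONS, key=lambda a: utility[a])


# U'_A(a) = U_A(a) + T(a)
U_A_prime = {a: U_A[a] + T[a] for a in ACTIONS}

before = best_action(U_A)        # what the agent picks with no transfer
after = best_action(U_A_prime)   # what the agent picks once T is added
target = best_action(U_P)        # what the principal wants picked

print("without transfer:", before)
print("with transfer:   ", after)
print("principal wants: ", target)
```

In this toy setup the transfer only has to close the gap between the agent's favorite action and the one the principal needs; a smaller T on "do_the_work" would leave the agent's choice unchanged.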


Pay for time

Piece rates: pay per widget

Performance bonus above threshold

Promotion (but delayed and fuzzy)

U'_A(a_A) = U_A(a_A) + T(a_A)

STOP+THINK: What are some common ways humans design T(a_A)?

In AI/RL we call it "reward shaping" - tweaking reward signal to get desired behavior.
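
The pay schemes listed on this slide can each be written as a small transfer function. This is a sketch only: the wage, rate, threshold, and bonus values are hypothetical defaults, and promotion is left out because it is delayed and fuzzy rather than a clean function of one input.

```python
def pay_for_time(hours, wage=15.0):
    """Pay for time: transfer depends only on hours logged (hypothetical wage)."""
    return wage * hours


def piece_rate(widgets, rate=2.0):
    """Piece rate: transfer is proportional to widgets produced (hypothetical rate)."""
    return rate * widgets


def threshold_bonus(widgets, threshold=50, bonus=100.0):
    """Performance bonus: a lump sum paid only when output exceeds the threshold."""
    return bonus if widgets > threshold else 0.0


# One shift, two possible output levels.
for widgets in (40, 60):
    total = pay_for_time(8) + piece_rate(widgets) + threshold_bonus(widgets)
    print(widgets, "widgets ->", total)
```

Note that every one of these functions takes something countable (hours, widgets) as its input rather than the action itself.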

Kerr's Argument

T(a_A) \text{ is always actually } T(\text{metric}(a_A))


Incentives are transfers on signals.

\text{Let measurements be } m(a_A); \text{ incentives } T(m(a_A)) \\ \; \\ \text{Agent utility: } U_A(a) = T(m(a)) - C_A(a) + I_A(a) \\ \; \\ \text{Principal utility: } U_P(a) = V(a) - T(m(a)) - K_P(a)

T, the incentive, is based on an observable signal.

m(a_A)
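
One way to read the two utility equations is to plug in numbers. The sketch below assumes an interpretation the slide does not spell out: C_A as the agent's effort cost, I_A as the agent's intrinsic payoff, V as the value the principal receives, and K as the principal's cost of measuring and administering. All functional forms and constants are hypothetical.

```python
def m(a):
    """Measurement: widgets counted for a hours of effort (hypothetical)."""
    return 10 * a


def T(signal):
    """Incentive: pays on the measured signal, not on the action itself."""
    return 2.0 * signal


def C_A(a):
    """Agent's effort cost (assumed convex)."""
    return a ** 2


def I_A(a):
    """Agent's intrinsic payoff from the work (assumed small and linear)."""
    return 0.5 * a


def V(a):
    """Value the principal gets from the action (assumed linear)."""
    return 30 * a


def K(a):
    """Principal's cost of monitoring and administering (assumed flat)."""
    return 5.0


def agent_utility(a):
    return T(m(a)) - C_A(a) + I_A(a)


def principal_utility(a):
    return V(a) - T(m(a)) - K(a)


a = 4  # hours of effort
print("agent:", agent_utility(a), "principal:", principal_utility(a))
```

The same transfer T(m(a)) appears with opposite signs in the two lines: it is income to the agent and a cost to the principal.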

STOP+THINK: How do we read these equations?

\text{Credit Assignment Problem - hard to know what }a_A \text{ makes the difference in }U_P

STOP+THINK: What is an everyday example of the credit assignment problem?


As soon as behavior is driven by T(m(a)), the problem is no longer obedience; it's measurement.

T(m(a)) is always a lossy compression of what matters.

STOP+THINK: What's that mean?

Goodhart's Law: "when a measure becomes a target, it ceases to be a good measure"

STOP+THINK: Examples of measure becoming a target?

STOP+THINK: What's that mean?


\text{Goodhart risk: }\\ \; \\ \text{If }m(\cdot)\text{ omits what drives }V(\cdot), \\ \text{ maximizing }T(m(a))\text{ reduces }U_P. \\ \; \\ \; \\ \text{ Gaming, reward hacking, short-termism.}
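
A toy illustration of the Goodhart risk, under loudly stated assumptions: the agent splits a fixed effort budget between a measured dimension (speed) and an unmeasured one (care), value to the principal needs both, and the metric sees only speed. The functional forms and numbers are hypothetical.

```python
# x = share of effort spent on the measured dimension, from 0.0 to 1.0.
SPLITS = [i / 10 for i in range(11)]


def m(x):
    """Metric: counts only the measured dimension (e.g. items processed)."""
    return 100 * x


def V(x):
    """Value to the principal: needs both speed (x) and care (1 - x)."""
    return 400 * x * (1 - x)


def T(signal):
    """Incentive paid on the metric."""
    return 0.5 * signal


def agent_utility(x):
    # Total effort is fixed, so effort cost is the same for every split
    # and is dropped; the agent simply chases the transfer.
    return T(m(x))


def principal_utility(x):
    return V(x) - T(m(x))


x_agent = max(SPLITS, key=agent_utility)          # what the incentive induces
x_principal = max(SPLITS, key=principal_utility)  # what the principal would pick

print("agent chooses x =", x_agent, "-> principal utility", principal_utility(x_agent))
print("principal prefers x =", x_principal)
```

Because m omits care, the agent's best response puts all effort on speed, value collapses, and the principal ends up paying for a metric that no longer tracks what matters.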


Takeaway

Hierarchy - alignment by command and incentive, requires:

governance and guardrails.
Lagged, hard-to-game proxies;
Human in the loop (HITL) overrides;
team rewards;
culture;
the whole "alignment stack."

Example: Working Retail

tl;sc

managing humans in retail might be easier in some ways than managing robots in retail

STOP+THINK: If I am a retail company and I hire you for my store, what is my utility?

STOP+THINK: What signals about your work actions does your manager have?

STOP+THINK: What rewards do you respond to in the course of your job?

Internalization (norms of politeness, honesty): Values like honesty and helpfulness have been learned long before the job starts. (hiring)
Feedback Loops (enjoyment of helping): Smiles and gratitude from customers reinforce the behavior emotionally. (organizational culture)
Peer Signaling: Co-workers' approval or disapproval regulates over-performance and conformity. (organizational culture, hiring)
Managerial Oversight: Rules, checklists, and performance targets structure effort and attention. (organizational routines)
Cultural Framing: The meaning of "good service" or "professionalism" is stabilized through shared language, training videos, and informal stories. (organizational culture)

STOP+THINK: What, other than pay and review on stats, might contribute to variance in the a_A you deliver?

STOP+THINK: Which of these levers and influences are or are not available if we want to use robots in retail?

Internalization (norms of politeness, honesty)
Feedback Loops (enjoyment of helping)
Peer Signaling
Managerial Oversight
Cultural Framing

Mechanism	ML Analogy	Available to robots?	Robot approximation	Alignment Challenge	Guardrails
Internalization	Value embedding	Partially	Hard constraints (no deceitful claims); constitutional rules (“prefer safe/transparent acts”); escalation policies	How to represent and update social norms that are situational and culturally coded	Periodic red-team tests; human review of edge cases; log-and-explain decisions
Feedback loops	RL signals	Yes (risky)	Pair short-term CSAT with lagged outcomes (30-day returns, complaint rate, repeat purchases)	Avoid “reward hacking” (e.g., maximizing smiles/positive tone without real help)	Penalize “star-begging” patterns; randomize survey prompts; weight durable outcomes more than immediate smiles
Peer signaling	Multi-agent coordination	Partially	Shared queue health; handoff quality; “assist” events credited to both giver and receiver	Maintain cooperation without competition for metrics	Team-level rewards for cooperation; cap individual metrics that encourage hoarding easy cases
Managerial oversight	Monitoring & human-in-the-loop	Yes	Real-time dashboards; safe-interrupt (“stop/ask human”); audit trails	Ensure corrigibility: robot accepts override and interprets feedback appropriately	Reward acceptance of human override (no penalty for deferring); require explanations for high-impact actions
Cultural framing	Brand policy	Yes (with care)	Style guides → operational checks (offer alternatives, confirm understanding, follow-up reminders)	Translate vague brand values (“friendly,” “helpful,” “authentic”) into operational behavior	Calibrate on customer narratives, not just tone scores; fairness audits across customer segments
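
The "Feedback loops" row above suggests pairing short-term CSAT with lagged outcomes and weighting the durable outcomes more. A minimal sketch of such a blended reward follows; the `Episode` fields, the scoring rule, and the 0.3/0.7 weights are all hypothetical, not a recommended calibration.

```python
from dataclasses import dataclass


@dataclass
class Episode:
    csat: float             # immediate satisfaction score in [0, 1]
    returned_30d: bool      # item returned within 30 days
    complaint: bool         # a complaint was filed afterwards
    repeat_purchase: bool   # customer came back and bought again


def lagged_score(ep):
    """Score durable outcomes in [0, 1] (hypothetical scoring rule)."""
    score = 0.5
    score += 0.5 if ep.repeat_purchase else 0.0
    score -= 0.25 if ep.returned_30d else 0.0
    score -= 0.25 if ep.complaint else 0.0
    return max(0.0, min(1.0, score))


def reward(ep, w_now=0.3, w_lag=0.7):
    """Blend the immediate signal with lagged outcomes, weighting the latter more."""
    return w_now * ep.csat + w_lag * lagged_score(ep)


# An interaction that pleased in the moment but went badly afterwards,
# and one that was less charming but actually helped.
flattering = Episode(csat=1.0, returned_30d=True, complaint=True, repeat_purchase=False)
helpful = Episode(csat=0.7, returned_30d=False, complaint=False, repeat_purchase=True)

print("flattering:", reward(flattering))
print("helpful:   ", reward(helpful))
```

With these weights the flattering interaction scores lower than the helpful one even though its immediate CSAT is higher, which is the intended effect of leaning on lagged, harder-to-game signals.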

Example: Experts

tl;sc

experts require regulation beyond incentives

STOP+THINK: What does society hope for from expert intelligence?

Society (the principal) wants competent disinterested expertise that advances the public welfare.

STOP+THINK: What does society reward?

STOP+THINK: What can go wrong?

Above average pay; autonomy, exclusivity, status.

Guild interest, protection of low performers, exploitation based on knowledge asymmetry, conflicts of interest.

STOP+THINK: What do we do?

Meta governance: accreditation, standardized exams, continuing education, peer review.

Liability and accountability: malpractice, negligence standards, fiduciary duties, whistleblower protection

Incentive Hygiene: conflict of interest rules, disclosure requirements, separation of roles, rotation, cooling off periods

Measurement and Learning: registries, outcome tracking, incident reporting, public dashboards (outcomes not outputs)


CLASS



NEXT: Markets
