Hierarchy



TRIVIAL COOPERATION: shared goals and information - pick a goal and execute

Hobbes' observation

scarcity + similar agents → competition → life is nasty, brutish, & short


Hobbes' Fix

Cede sovereignty to a boss with credible enforcement.

Command → order


Command Failure Mode (preferences)


Agents retain autonomy → effort substitution & selective obedience

Command Failure Mode (information)


Orders are incomplete & ambiguous; environments shift.

Obedience Failure Modes

"THE" Principals and Agents Problem

Suppose P needs some help realizing their goals and objectives

U_P

A could provide help, but P and A care about different things

U_P \ne U_A

Read: what A gets out of assisting P is not the same thing that P gets out of receiving the help.

From commands to contracts. Alignment by design: selection, monitoring, and incentives that align the agent's autonomy with the principal's goals.

Principal's Challenge: design an intervention that aligns their utilities

You want "do the work I need" to be the agent's best option

\arg\max_{a_A} U_A(a_A) \longrightarrow \arg\max_{a_A} U_P(a_A)


To get this agent to work for us, we need to sweeten the deal

An ordinary RL agent as our model

Naked RL is a clean micro-model: the agent updates a policy to maximize reward. The agent takes action a in state s, garners a reward, and so has some utility

U_A(a_A)

Add a transfer based on the behavior

U'_A(a_A) = U_A(a_A) + T(a_A)
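A minimal sketch of that move, with made-up action names, utilities, and transfer (a toy illustration, not a model from the slides): the transfer is chosen so the agent's best option becomes the action the principal wants.

```python
# Toy illustration: a transfer T(a) shifts the agent's argmax.
# Action names, utilities, and the transfer schedule are invented for illustration.

U_A = {"slack": 5.0, "work": 2.0}    # agent's intrinsic utility per action
U_P = {"slack": 0.0, "work": 10.0}   # principal's utility per action
T   = {"slack": 0.0, "work": 4.0}    # transfer (pay) attached to each action

def best(utility):
    """Return the action that maximizes the given utility table."""
    return max(utility, key=utility.get)

# Without the transfer the agent slacks; with it, U'_A(a) = U_A(a) + T(a).
U_A_prime = {a: U_A[a] + T[a] for a in U_A}

print(best(U_A))        # 'slack' -> agent's default choice
print(best(U_A_prime))  # 'work'  -> now coincides with the principal's preferred action
print(best(U_P))        # 'work'
```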


  • Pay for time
  • Piece rates: pay per widget
  • Performance bonus above a threshold
  • Promotion (but delayed and fuzzy)

U'_A(a_A) = U_A(a_A) + T(a_A)

STOP+THINK: What are some common ways humans design T(a_A)?

In AI/RL we call it "reward shaping": tweaking the reward signal to elicit the desired behavior.
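One standard recipe is potential-based shaping: the shaped reward is r + γΦ(s') − Φ(s). The sketch below uses a made-up potential Φ on a toy five-state chain (all names and numbers are assumptions for illustration).

```python
# Minimal sketch of potential-based reward shaping on a toy 1-D chain.
# States 0..4; state 4 is the goal; phi is a made-up potential that grows toward the goal.

GAMMA = 0.99
GOAL = 4

def phi(state):
    # Hypothetical potential: closer to the goal -> higher potential.
    return float(state)

def env_reward(state, next_state):
    # Sparse base reward: only reaching the goal pays off.
    return 1.0 if next_state == GOAL else 0.0

def shaped_reward(state, next_state):
    # Potential-based shaping: r' = r + gamma * phi(s') - phi(s).
    # Adds dense guidance without changing which policies are optimal.
    return env_reward(state, next_state) + GAMMA * phi(next_state) - phi(state)

print(shaped_reward(1, 2))  # ~0.98: stepping toward the goal now earns a signal
print(shaped_reward(2, 1))  # ~-1.01: stepping away is discouraged
```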

Kerr's Argument

T(a_A) \text{ is always actually } T(\text{metric}(a_A))


Incentives are transfers on signals.

\text{Let measurements be } m(a_A); \text{ incentives } T(m(a_A)) \\ \; \\ \text{Agent utility: } U_A(a) = T(m(a)) - C_A(a) + I_A(a) \\ \; \\ \text{Principal utility: } U_P(a) = V(a) - T(m(a)) - K_P(a)

T, the incentive, is based on an observable signal. 

m(a_A)

STOP+THINK: How do we read these equations?

\text{Credit Assignment Problem: hard to know which } a_A \text{ makes the difference in } U_P

STOP+THINK: What is an everyday example of credit assignment problem?
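One way to read the two utility lines above, with made-up numbers and an assumed reading of the terms (C_A as the agent's effort cost, I_A as the agent's intrinsic payoff, K_P as the principal's cost of monitoring and measurement):

\text{Suppose } T(m(a)) = 8,\; C_A(a) = 5,\; I_A(a) = 2,\; V(a) = 20,\; K_P(a) = 3 \\ \; \\ U_A(a) = 8 - 5 + 2 = 5 \qquad U_P(a) = 20 - 8 - 3 = 9

Both are positive here, so the contract is worth taking and worth offering; the trouble starts when m misses part of what drives V.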


As soon as behavior is driven by T(m(a)), the problem is no longer obedience; it's measurement.

T(m(a)) is always a lossy compression of what matters.

STOP+THINK: What's that mean?


Goodhart's Law: "when a measure becomes a target, it ceases to be a good measure"

STOP+THINK: Examples of measure becoming a target?



\text{Goodhart risk:} \\ \; \\ \text{If } m(\cdot) \text{ omits what drives } V(\cdot), \\ \text{maximizing } T(m(a)) \text{ reduces } U_P. \\ \; \\ \text{Gaming, reward hacking, short-termism.}
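A toy sketch of that risk (actions, numbers, and the pay rate are invented): the proxy m sees only measured output, so an agent maximizing T(m(a)) drifts to the gaming action and the principal ends up worse off than under honest work.

```python
# Toy Goodhart example: the proxy m(a) omits part of what drives V(a).
# Actions, values, and the pay rate are invented for illustration.

actions = {
    # action: (measured_output, unmeasured_quality)
    "honest_work": (5.0, 5.0),
    "game_metric": (9.0, -4.0),
}

PAY_RATE = 1.0     # T(m(a)) = PAY_RATE * m(a)
EFFORT_COST = 2.0  # C_A(a), identical for both actions here

def m(a):
    measured, _ = actions[a]
    return measured                # the proxy sees only measured output

def V(a):
    measured, unmeasured = actions[a]
    return measured + unmeasured   # true value includes what m omits

def U_A(a):
    return PAY_RATE * m(a) - EFFORT_COST   # agent: transfer minus cost

def U_P(a):
    return V(a) - PAY_RATE * m(a)          # principal: value minus transfer

agent_choice = max(actions, key=U_A)
print(agent_choice)                            # 'game_metric': the proxy-maximizing action
print(U_P(agent_choice), U_P("honest_work"))   # -4.0 vs 5.0: the principal is worse off
```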


Takeaway

Hierarchy - alignment by command and incentive - requires:

  • Governance and guardrails
  • Lagged, hard-to-game proxies
  • Human-in-the-loop (HITL) overrides
  • Team rewards
  • Culture
  • The whole "alignment stack"

Example: Working Retail

tl;sc

managing humans in retail might be easier in some ways than managing robots in retail

STOP+THINK: If I am a retail company and I hire you for my store, what is my utility?

STOP+THINK: What signals about your work actions does your manager have?

STOP+THINK: What rewards do you respond to in the course of your job?

  • Internalization (norms of politeness, honesty): Values like honesty and helpfulness have been learned long before the job starts. (hiring)
  • Feedback Loops (enjoyment of helping): Smiles and gratitude from customers reinforce the behavior emotionally. (organizational culture)
  • Peer Signaling: Co-workers' approval or disapproval regulates over-performance and conformity. (organizational culture, hiring)
  • Managerial Oversight: Rules, checklists, and performance targets structure effort and attention. (organizational routines)
  • Cultural Framing: The meaning of "good service" or "professionalism" is stabilized through shared language, training videos, and informal stories. (organizational culture)

STOP+THINK: What, other than pay and reviews based on stats, might contribute to variance in the a_A you deliver?

STOP+THINK: Which of these levers and influences are or are not available if we want to use robots in retail?

  • Internalization (norms of politeness, honesty)
  • Feedback Loops (enjoyment of helping)
  • Peer Signaling
  • Managerial Oversight
  • Cultural Framing

Internalization
  • ML analogy: value embedding
  • Availability: partially
  • Approximation: hard constraints (no deceitful claims); constitutional rules ("prefer safe/transparent acts"); escalation policies
  • Alignment challenge: how to represent and update social norms that are situational and culturally coded
  • Guardrails: periodic red-team tests; human review of edge cases; log-and-explain decisions

Feedback loops
  • ML analogy: RL signals
  • Availability: yes (risky)
  • Approximation: pair short-term CSAT with lagged outcomes (30-day returns, complaint rate, repeat purchases); see the sketch after this table
  • Alignment challenge: avoid "reward hacking" (e.g., maximizing smiles/positive tone without real help)
  • Guardrails: penalize "star-begging" patterns; randomize survey prompts; weight durable outcomes more than immediate smiles

Peer signaling
  • ML analogy: multi-agent coordination
  • Availability: partially
  • Approximation: shared queue health; handoff quality; "assist" events credited to both giver and receiver
  • Alignment challenge: maintain cooperation without competition for metrics
  • Guardrails: team-level rewards for cooperation; cap individual metrics that encourage hoarding easy cases

Managerial oversight
  • ML analogy: monitoring & human-in-the-loop
  • Availability: yes
  • Approximation: real-time dashboards; safe-interrupt ("stop/ask human"); audit trails
  • Alignment challenge: ensure corrigibility (the robot accepts override and interprets feedback appropriately)
  • Guardrails: reward acceptance of human override (no penalty for deferring); require explanations for high-impact actions

Cultural framing
  • ML analogy: brand policy
  • Availability: yes (with care)
  • Approximation: style guides → operational checks (offer alternatives, confirm understanding, follow-up reminders)
  • Alignment challenge: translate vague brand values ("friendly," "helpful," "authentic") into operational behavior
  • Guardrails: calibrate on customer narratives, not just tone scores; fairness audits across customer segments
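As a concrete sketch of the "Feedback loops" row (pairing short-term CSAT with lagged outcomes so durable results outweigh immediate smiles): the field names and weights below are hypothetical assumptions, not a tested recipe.

```python
# Hypothetical composite reward for a retail service robot.
# Field names and weights are illustrative assumptions, not a tuned or tested design.

def composite_reward(csat_now, returned_within_30d, complaint_filed, repeat_purchase):
    """Blend an immediate satisfaction score with lagged, harder-to-game outcomes."""
    reward = 0.2 * csat_now                             # short-term signal, easy to game
    reward += 0.5 * (1.0 if repeat_purchase else 0.0)   # durable outcome, weighted more
    reward -= 0.4 * (1.0 if returned_within_30d else 0.0)
    reward -= 0.6 * (1.0 if complaint_filed else 0.0)
    return reward

# A glowing survey that ends in a return and a complaint nets less than
# a modest interaction that leads to a repeat purchase.
print(composite_reward(csat_now=1.0, returned_within_30d=True, complaint_filed=True, repeat_purchase=False))   # about -0.8
print(composite_reward(csat_now=0.6, returned_within_30d=False, complaint_filed=False, repeat_purchase=True))  # about 0.62
```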

Example: Experts

tl;sc

experts require regulation beyond incentives

STOP+THINK: What does society hope for from expert intelligence?

Society (the principal) wants competent, disinterested expertise that advances the public welfare.

STOP+THINK: What does society reward?

Above-average pay; autonomy; exclusivity; status.

STOP+THINK: What can go wrong?

Guild interests, protection of low performers, exploitation of knowledge asymmetry, conflicts of interest.

STOP+THINK: What do we do?

Meta-governance: accreditation, standardized exams, continuing education, peer review.

Liability and accountability: malpractice, negligence standards, fiduciary duties, whistleblower protection.

Incentive hygiene: conflict-of-interest rules, disclosure requirements, separation of roles, rotation, cooling-off periods.

Measurement and learning: registries, outcome tracking, incident reporting, public dashboards (outcomes, not outputs).


CLASS



NEXT: Markets