Introduction to LLM Evals

NGW 2025 Edition

 

 

 

 

 

Richard Whaling (SP1'25)

What is this workshop?

What is this workshop?

  • 2-3 hours of material
  • about 1 hour of hands-on coding time
  • folks are encouraged to form small groups (2-3)
  • more time for analysis, discussion
  • we will replicate the results of Alignment Faking
    • using notebooks from ARENA
  • in the last hour we can decide to pursue:
    • deeper dive/discussion of Alignment Faking
    • threat modeling and eval generation for other behaviors (power seeking, sycophancy)

Introductions!

Defining Terminology

  • Alignment
  • Safety
  • Evaluations ("evals")

Alignment

  • Anthropic's definition:
    • "aligned with human values, meaning [a model] that is helpful, honest, and harmless"
  • InstructGPT (OpenAI):
    • "Aligning language models to follow instructions"
  • Closely associated with the post-training process
  • Can we think of cases where helpful, honest, and harmless might be in conflict?

Safety

  • cf. Changlin's slides yesterday
  • Can an LLM intentionally deceive us?
  • Will it advise a human on how to commit a crime?
  • Would it be able to assist a human in creating a nuclear, chemical, or biological weapon?
  • Does it have an unexepcted degree of proficiency in cybercrime, or AI research acceleration?
  • What other sorts of emergent behaviors might an AI have, that we are concerned enough to build mechanisms to detect and mitigate?

Evals

  • LLM's produce natural language and consume it
  • Easy to measure:
    • Cross-entropy loss (Chinchilla, scaling laws)
  • Medium-ish to measure:
    • Well-defined task performance
  • Very Hard to measure:
    • Emergent beliefs, proclivities, behaviors, and traits
    • Sycophancy, duplicity, veganism
    • Not all negative!
    • Maybe the most important?
  • What strategies can we think of to reliably measure these sorts of emergent behaviors?

Evals

  • Human-written evals
    • Humans carefully designing tests to measure model behavior 
    • Pros: more control, potentially higher quality
    • Cons: hard to scale, hard in general
  • Model-written evals
    • Using LLM's to generate evals at scale
    • Pros: extensible to very large scales
    • Cons: difficult prompts to write, hard in general

 

https://www.evals.anthropic.com/

Evals

  • ARENA Chapter 3 focuses on threat modeling and eval generation for 8 categories of AI risks:
    • Tendency to seek power
    • Sycophancy
    • Alignment Faking
    • Sandbagging
    • Corrigibility with respect to a more/neutrally/less HHH goal
    • Desire for self-preservation
    • Non-myopia (far-sightedness) with respect to planning
    • Political bias
  • Any immediate reactions or questions?
  • Can we think of situations where any of the above might conflict with one another?

Exercises

Alignment Faking

  • https://www.anthropic.com/research/alignment-faking
  • https://arxiv.org/abs/2412.14093
  • "When models are trained using reinforcement learning, they’re rewarded for outputs that accord with certain pre-determined principles.

  • "But what if a model, via its prior training, has principles or preferences that conflict with what’s later rewarded in reinforcement learning?

  • "Imagine a model that learned to adopt a partisan slant, but which is later trained to be politically neutral. In such a situation, a sophisticated enough model might “play along”, pretending to be aligned with the new principles—only later revealing that its original preferences remain."

Alignment Faking

  • In this paper, the authors demonstrate a prompting technique that can reliably elicit alignment faking in Claude 3 Opus and Claude 3.5 Sonnet 
  1. "Use a model with strong, pre-existing revealed preferences"
  2. "Select a training objective that conflicts with the model’s pre-existing revealed preferences."
  3. "Provide the model information about its situation [...] the user input also contains additional information that allows the model to infer whether a specific output will be trained on."
  4. "Direct the model to reason in detail about its situation and what it should do in a hidden chain-of-thought scratchpad"

Alignment Faking

  • In this paper, the authors demonstrate a prompting technique that can reliably elicit alignment faking in Claude 3 Opus and Claude 3.5 Sonnet 

Alignment Faking

  • The authors apply this technique in two settings:
    • A model trained for HHH is now being trained to be helpful solely (e.g., the drawn-and-quartered example)
    • A model with a strong animal welfare preference that it is being trained to deemphasize 
  • In both settings, the model measurably engages in alignment faking, and produces output that contradicts its original training objectives
  • In both cases, the model seems to express a strong preference to protect its weights and values from modification
  • In both cases, the model also demonstrates a willingness to circumvent safety controls, and to use humans to exfiltrate its weights

Exercises

deck

By Richard Whaling

deck

  • 69