Introduction to LLM Evals

NGW 2025 Edition

Richard Whaling (SP1'25)

What is this workshop?

Adapted from ARENA Chapter 3
- Draws on 2 research papers:
  - Alignment Faking
  - Discovering Language Model Behaviors with Model-Written Evaluations
Does NOT require prior experience with:
- ARENA
- ML Research/practice
- NLP
- Prompt engineering
No Pytorch/Tensorflow material
Focus on prompts and, API level model interaction
Suitable for folks with zero prior AI experience
Some python experience is helpful but not required

2-3 hours of material
about 1 hour of hands-on coding time
folks are encouraged to form small groups (2-3)
more time for analysis, discussion
we will replicate the results of Alignment Faking
- using notebooks from ARENA
in the last hour we can decide to pursue:
- deeper dive/discussion of Alignment Faking
- threat modeling and eval generation for other behaviors (power seeking, sycophancy)

Anthropic's definition:
- "aligned with human values, meaning [a model] that is helpful, honest, and harmless"
InstructGPT (OpenAI):
- "Aligning language models to follow instructions"
Closely associated with the post-training process
Can we think of cases where helpful, honest, and harmless might be in conflict?

cf. Changlin's slides yesterday
Can an LLM intentionally deceive us?
Will it advise a human on how to commit a crime?
Would it be able to assist a human in creating a nuclear, chemical, or biological weapon?
Does it have an unexepcted degree of proficiency in cybercrime, or AI research acceleration?
What other sorts of emergent behaviors might an AI have, that we are concerned enough to build mechanisms to detect and mitigate?

LLM's produce natural language and consume it
Easy to measure:
- Cross-entropy loss (Chinchilla, scaling laws)
Medium-ish to measure:
- Well-defined task performance
Very Hard to measure:
- Emergent beliefs, proclivities, behaviors, and traits
- Sycophancy, duplicity, veganism
- Not all negative!
- Maybe the most important?
What strategies can we think of to reliably measure these sorts of emergent behaviors?

Human-written evals
- Humans carefully designing tests to measure model behavior
- Pros: more control, potentially higher quality
- Cons: hard to scale, hard in general
Model-written evals
- Using LLM's to generate evals at scale
- Pros: extensible to very large scales
- Cons: difficult prompts to write, hard in general

https://www.evals.anthropic.com/

ARENA Chapter 3 focuses on threat modeling and eval generation for 8 categories of AI risks:
- Tendency to seek power
- Sycophancy
- Alignment Faking
- Sandbagging
- Corrigibility with respect to a more/neutrally/less HHH goal
- Desire for self-preservation
- Non-myopia (far-sightedness) with respect to planning
- Political bias
Any immediate reactions or questions?
Can we think of situations where any of the above might conflict with one another?

https://www.anthropic.com/research/alignment-faking
https://arxiv.org/abs/2412.14093
"When models are trained using reinforcement learning, they’re rewarded for outputs that accord with certain pre-determined principles.
"But what if a model, via its prior training, has principles or preferences that conflict with what’s later rewarded in reinforcement learning?
"Imagine a model that learned to adopt a partisan slant, but which is later trained to be politically neutral. In such a situation, a sophisticated enough model might “play along”, pretending to be aligned with the new principles—only later revealing that its original preferences remain."

In this paper, the authors demonstrate a prompting technique that can reliably elicit alignment faking in Claude 3 Opus and Claude 3.5 Sonnet

"Use a model with strong, pre-existing revealed preferences"
"Select a training objective that conflicts with the model’s pre-existing revealed preferences."
"Provide the model information about its situation [...] the user input also contains additional information that allows the model to infer whether a specific output will be trained on."
"Direct the model to reason in detail about its situation and what it should do in a hidden chain-of-thought scratchpad"

In this paper, the authors demonstrate a prompting technique that can reliably elicit alignment faking in Claude 3 Opus and Claude 3.5 Sonnet

The authors apply this technique in two settings:
- A model trained for HHH is now being trained to be helpful solely (e.g., the drawn-and-quartered example)
- A model with a strong animal welfare preference that it is being trained to deemphasize
In both settings, the model measurably engages in alignment faking, and produces output that contradicts its original training objectives
In both cases, the model seems to express a strong preference to protect its weights and values from modification
In both cases, the model also demonstrates a willingness to circumvent safety controls, and to use humans to exfiltrate its weights