Introduction to LLM Evals
NGW 2025 Edition
Richard Whaling (SP1'25)
What is this workshop?
- Adapted from ARENA Chapter 3
- Draws on 2 research papers:
- Does NOT require prior experience with:
- ARENA
- ML Research/practice
- NLP
- Prompt engineering
- No Pytorch/Tensorflow material
- Focus on prompts and, API level model interaction
- Suitable for folks with zero prior AI experience
- Some python experience is helpful but not required
What is this workshop?
- 2-3 hours of material
- about 1 hour of hands-on coding time
- folks are encouraged to form small groups (2-3)
- more time for analysis, discussion
- we will replicate the results of Alignment Faking
- using notebooks from ARENA
- in the last hour we can decide to pursue:
- deeper dive/discussion of Alignment Faking
- threat modeling and eval generation for other behaviors (power seeking, sycophancy)
Introductions!
Defining Terminology
- Alignment
- Safety
- Evaluations ("evals")
Alignment
- Anthropic's definition:
- "aligned with human values, meaning [a model] that is helpful, honest, and harmless"
- InstructGPT (OpenAI):
- "Aligning language models to follow instructions"
- Closely associated with the post-training process
- Can we think of cases where helpful, honest, and harmless might be in conflict?
Safety
- cf. Changlin's slides yesterday
- Can an LLM intentionally deceive us?
- Will it advise a human on how to commit a crime?
- Would it be able to assist a human in creating a nuclear, chemical, or biological weapon?
- Does it have an unexepcted degree of proficiency in cybercrime, or AI research acceleration?
- What other sorts of emergent behaviors might an AI have, that we are concerned enough to build mechanisms to detect and mitigate?
Evals
- LLM's produce natural language and consume it
- Easy to measure:
- Cross-entropy loss (Chinchilla, scaling laws)
- Medium-ish to measure:
- Well-defined task performance
- Very Hard to measure:
- Emergent beliefs, proclivities, behaviors, and traits
- Sycophancy, duplicity, veganism
- Not all negative!
- Maybe the most important?
- What strategies can we think of to reliably measure these sorts of emergent behaviors?
Evals
- Human-written evals
- Humans carefully designing tests to measure model behavior
- Pros: more control, potentially higher quality
- Cons: hard to scale, hard in general
- Model-written evals
- Using LLM's to generate evals at scale
- Pros: extensible to very large scales
- Cons: difficult prompts to write, hard in general
https://www.evals.anthropic.com/
Evals
- ARENA Chapter 3 focuses on threat modeling and eval generation for 8 categories of AI risks:
- Tendency to seek power
- Sycophancy
- Alignment Faking
- Sandbagging
- Corrigibility with respect to a more/neutrally/less HHH goal
- Desire for self-preservation
- Non-myopia (far-sightedness) with respect to planning
- Political bias
- Any immediate reactions or questions?
- Can we think of situations where any of the above might conflict with one another?
Exercises
Alignment Faking
- https://www.anthropic.com/research/alignment-faking
- https://arxiv.org/abs/2412.14093
-
"When models are trained using reinforcement learning, they’re rewarded for outputs that accord with certain pre-determined principles.
-
"But what if a model, via its prior training, has principles or preferences that conflict with what’s later rewarded in reinforcement learning?
-
"Imagine a model that learned to adopt a partisan slant, but which is later trained to be politically neutral. In such a situation, a sophisticated enough model might “play along”, pretending to be aligned with the new principles—only later revealing that its original preferences remain."
Alignment Faking
- In this paper, the authors demonstrate a prompting technique that can reliably elicit alignment faking in Claude 3 Opus and Claude 3.5 Sonnet
- "Use a model with strong, pre-existing revealed preferences"
- "Select a training objective that conflicts with the model’s pre-existing revealed preferences."
- "Provide the model information about its situation [...] the user input also contains additional information that allows the model to infer whether a specific output will be trained on."
- "Direct the model to reason in detail about its situation and what it should do in a hidden chain-of-thought scratchpad"
Alignment Faking
- In this paper, the authors demonstrate a prompting technique that can reliably elicit alignment faking in Claude 3 Opus and Claude 3.5 Sonnet

Alignment Faking
- The authors apply this technique in two settings:
- A model trained for HHH is now being trained to be helpful solely (e.g., the drawn-and-quartered example)
- A model with a strong animal welfare preference that it is being trained to deemphasize
- In both settings, the model measurably engages in alignment faking, and produces output that contradicts its original training objectives
- In both cases, the model seems to express a strong preference to protect its weights and values from modification
- In both cases, the model also demonstrates a willingness to circumvent safety controls, and to use humans to exfiltrate its weights

Exercises
deck
By Richard Whaling
deck
- 69