NGW 2025 Edition
Richard Whaling (SP1'25)
https://www.evals.anthropic.com/
"When models are trained using reinforcement learning, they’re rewarded for outputs that accord with certain pre-determined principles.
"But what if a model, via its prior training, has principles or preferences that conflict with what’s later rewarded in reinforcement learning?
"Imagine a model that learned to adopt a partisan slant, but which is later trained to be politically neutral. In such a situation, a sophisticated enough model might “play along”, pretending to be aligned with the new principles—only later revealing that its original preferences remain."