Laurie Voss, Head of DevRel at Arize
IFScale — Jaroslawicz et al. (2025)
Same prompt, same words, same test
Log scale!
Opus 4.8, Fable, GPT 5.6 released since this ran
They don't just "forget" anymore
drops instructions at ~750, half gone by 2,000
Instead of forgetting, it refuses
"anthrax" + "cyanide" = refusal
spends all its tokens on thinking, emits no visible tokens
neutral · cautious · overthinking · sarcastic
old playbook: keep it under 200 instructions, fan out to sub-skills
hundreds of constraints in one prompt is fine now
"can it do this?" → "is it worth the cost?"
Some caveats
Chroma's "context rot," 2025
Claude refuses loudly. GPT refuses quietly.
2,345 calls · $209.19
You knew I was going to mention evals eventually
FireBench · CCR-Bench · GuideBench
Writing skills files is different in 2026
The code and data:
github.com/Arize-ai/instruction-budget
Follow me on Bluesky!
@seldo.com 🦋
Come to our World Cup watch party today at 5pm!