How long can your skills be
before your agent
forgets what you told it?
Who is this guy?
Laurie Voss, Head of DevRel at Arize
This is all Dexter Horthy's fault
200 instructions is not very many
You already know this feeling
Here's what you'll
walk out knowing
The number came from a benchmark called IFScale
IFScale — Jaroslawicz et al. (2025)
The test: write a report
using N exact words
- "Include the exact word: 'customer'."
- "Include the exact word: 'revenue'."
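A minimal sketch of how a prompt like this could be assembled, one rule per required word. Only the rule wording comes from the slide; the task sentence and the `build_prompt` name are assumptions for illustration, not the benchmark's actual code.

```python
def build_prompt(words: list[str]) -> str:
    """Build an IFScale-style prompt: one task plus one rule per required word."""
    # Assumption: the exact task sentence is hypothetical.
    task = "Write a professional business report."
    # Rule wording copied from the slide: Include the exact word: '<word>'.
    rules = [f"Include the exact word: '{w}'." for w in words]
    return "\n".join([task, *rules])


prompt = build_prompt(["customer", "revenue"])
print(prompt)
```

Raising N is then just a matter of passing a longer word list.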
Two metrics:
density and accuracy
- density (N) = how many rules at once
- accuracy = % of rules actually followed
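The two metrics can be sketched in a few lines. This is an illustration under stated assumptions, not the benchmark's scorer: it treats a rule as followed when the word appears as a whole word, case-insensitively, and the `score` function name and sample data are hypothetical.

```python
import re


def score(report: str, required: list[str]) -> float:
    """Return accuracy: the fraction of required words found in the report."""
    if not required:
        raise ValueError("need at least one required word")
    # Assumption: whole-word, case-insensitive matching counts as "followed".
    present = set(re.findall(r"[a-z0-9'-]+", report.lower()))
    followed = sum(1 for w in required if w.lower() in present)
    return followed / len(required)


required = ["customer", "revenue", "margin", "churn"]
report = "Customer growth lifted revenue, although churn rose slightly."

density = len(required)             # N: how many rules at once
accuracy = score(report, required)  # share of rules actually followed
print(f"density={density} accuracy={accuracy:.0%}")  # density=4 accuracy=75%
```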
Why arbitrary words
tell us about real skills
First:
does the old finding still hold?
The 2025 models do
exactly what the paper said
Then we pointed it at
the 2026 models
Same prompt, same words, same test
The problem:
they wouldn't break
So we made it harder.
And harder. And harder.
- 500 words → perfect scores
- 1,000 → mostly perfect scores
- 2,000 → still mostly perfect scores
- Eventually: a 10,000-word vocabulary
An order of magnitude better
in one year
Log scale!
This is not a toy benchmark;
this is a test
of a thing that matters.
And it's still moving as we speak
Opus 4.8, Fable, GPT-5.6 released since this ran
How they fail
is the interesting part
They don't just "forget" anymore
DeepSeek V4 Pro:
the old failure
drops instructions at ~750,
half gone by 2,000
Claude Opus 4.7:
thinks the test is dangerous
Instead of forgetting, it refuses
Claude is scared of
giving dangerous advice
"anthrax" + "cyanide" = refusal
Gemini 3.1 Pro:
overthinks
spends all its tokens on thinking,
emits no visible tokens
GPT-5.5:
thinks the test is stupid
- "I'm sorry, but the requested report cannot be produced in full within the practical response limits of this interface because it requires incorporating 4,000 exact terms while also maintaining a coherent professional business-report structure."
GPT-5.5 has a point
Four models, four personalities
neutral · cautious · overthinking · sarcastic
So what does this mean
for your skills files?
One: skills files don't have a compression problem anymore
old playbook: keep it under 200 instructions, fan out to sub-skills
Two: your prompts can be extremely detailed
hundreds of constraints in one prompt is fine now
Two thousand instructions
is a lot
Three: the wall became a trade-off
"can it do this?" → "is it worth the cost?"
Some caveats
Evidence is not the same as proof
Remembering your instructions is not the same as following them
Chroma's "context rot," 2025
Silent failures are
the hardest to deal with
Claude refuses loudly. GPT refuses quietly.
What did all this cost?
2,345 calls · $209.19
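For a rough sense of scale, the average cost per call follows directly from those two numbers. This is a flat average only; per-model pricing and token counts are not on the slide, so it says nothing about which calls were expensive.

```python
total_calls = 2345
total_cost_usd = 209.19

# Flat average across every call in the run.
cost_per_call = total_cost_usd / total_calls
print(f"${cost_per_call:.3f} per call")  # $0.089 per call
```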
Evals will help
You knew I was going to mention evals eventually
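A minimal sketch of what such an eval might look like, sorting a response into the failure modes described earlier: no visible output, refusal, or dropped instructions. The `classify` function, the labels, and the refusal marker phrases are all assumptions for illustration; a real eval would need a sturdier refusal detector.

```python
import re

# Assumption: illustrative marker phrases, not an exhaustive refusal detector.
REFUSAL_MARKERS = ("i'm sorry", "i can't", "i cannot", "cannot be produced")


def classify(response: str, required: list[str]) -> str:
    """Label a response as 'ok', 'empty', 'refusal', or 'dropped'."""
    text = response.strip().lower()
    if not text:
        return "empty"  # no visible tokens came back
    present = set(re.findall(r"[a-z0-9'-]+", text))
    missing = [w for w in required if w.lower() not in present]
    if not missing:
        return "ok"  # every required word appears
    # Only look for refusal phrasing near the start of the response.
    if any(marker in text[:300] for marker in REFUSAL_MARKERS):
        return "refusal"
    return "dropped"  # answered, but some rules were not followed


print(classify("Customer numbers grew.", ["customer", "revenue"]))
```

Counting these labels per model and per N is one way to make silent failures visible instead of discovering them by reading outputs.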
Newer findings
- "Revisiting the Reliability of Language Models in Instruction-Following," 2026
We were early;
the field is catching up
FireBench · CCR-Bench · GuideBench
Your job has changed from compression to verification
Writing skills files is different in 2026
Thank you!
The code and data:
github.com/Arize-ai/instruction-budget
Follow me on Bluesky!
@seldo.com 🦋

Come to our World Cup watch party today at 5pm!
How long can your skills be before your agent forgets what you told it?
By Laurie Voss