How long can your skills be

before your agent

forgets what you told it?

Who is this guy?

Laurie Voss, Head of DevRel at Arize

This is all Dexter Horthy's fault

200 instructions is not very many

You already know this feeling

Here's what you'll

walk out knowing

The number came from a benchmark called IFScale

IFScale — Jaroslawicz et al. (2025)

The test: write a report

using N exact words

"Include the exact word: 'customer'."
"Include the exact word: 'revenue'."
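A minimal sketch of how a prompt like this could be assembled, one rule per keyword. The rule template is taken from the slide; the header sentence and the `build_prompt` name are assumptions for illustration, not the benchmark's actual wording.

```python
def build_prompt(words):
    """Assemble an IFScale-style prompt: one exact-word rule per keyword."""
    # Rule template copied from the slide above.
    rules = [f"Include the exact word: '{w}'." for w in words]
    # Hypothetical task header; the real benchmark prompt may be worded differently.
    header = "Write a professional business report that follows every rule below."
    return header + "\n" + "\n".join(rules)


prompt = build_prompt(["customer", "revenue"])
print(prompt)
```

Scaling the test up is then just a matter of passing a longer word list.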

Two metrics:

density and accuracy

density (N) = how many rules at once
accuracy = % of rules actually followed
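The two metrics can be sketched in a few lines. This assumes a rule counts as followed when the word appears anywhere in the report as a whole token, compared case-insensitively; the matching rules in the actual benchmark may be stricter or looser, and `score` is a hypothetical helper name.

```python
import re


def score(report, required_words):
    """Return (density, accuracy) for one model response."""
    # Assumption: whole-token, case-insensitive matching.
    tokens = set(re.findall(r"[a-z0-9'-]+", report.lower()))
    followed = [w for w in required_words if w.lower() in tokens]
    density = len(required_words)  # how many rules were given at once
    accuracy = len(followed) / density if density else 0.0
    return density, accuracy


density, accuracy = score(
    "Customer revenue grew this quarter.",
    ["customer", "revenue", "churn"],
)
print(density, round(accuracy, 2))
```

Here two of the three required words appear, so accuracy is two thirds at a density of 3.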

Why arbitrary words

tell us about real skills

First:

does the old finding still hold?

The 2025 models do

exactly what the paper said

Then we pointed it at

the 2026 models

Same prompt, same words, same test

The problem:

they wouldn't break

So we made it harder.

And harder. And harder.

500 words → perfect scores
1,000 → mostly perfect scores
2,000 → still mostly perfect scores
Eventually: a 10,000-word vocabulary

An order of magnitude better

in one year

Log scale!

This is not a toy benchmark,

this is a test

of a thing that matters.

And it's still moving as we speak

Opus 4.8, Fable, and GPT-5.6 have all been released since this ran

How they fail

is the interesting part

They don't just "forget" anymore

DeepSeek V4 Pro:

the old failure

drops instructions at ~750,

half gone by 2,000

Claude Opus 4.7:

thinks the test is dangerous

Instead of forgetting, it refuses

Claude is scared of

giving dangerous advice

"anthrax" + "cyanide" = refusal

Gemini 3.1 Pro:

overthinks

spends all its tokens on thinking,

emits no visible tokens

GPT-5.5:

thinks the test is stupid

"I'm sorry, but the requested report cannot be produced in full within the practical response limits of this interface because it requires incorporating 4,000 exact terms while also maintaining a coherent professional business-report structure."

GPT-5.5 has a point

Four models, four personalities

neutral · cautious · overthinking · sarcastic

So what does this mean

for your skills files?

One: skills files don't have a compression problem anymore

old playbook: keep it under 200 instructions, fan out to sub-skills

Two: your prompts can be extremely detailed

hundreds of constraints in one prompt is fine now

Two thousand instructions

is a lot

Three: the wall became a trade-off

"can it do this?" → "is it worth the cost?"

Some caveats

Evidence is not the same as proof

Remembering your instructions is not the same as following them

Chroma's "context rot," 2025

Silent failures are

the hardest to deal with

Claude refuses loudly. GPT refuses quietly.
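One way to surface quiet failures is to label every run before looking at averages. The sketch below is an assumption-heavy illustration, not the method used in the talk: the marker phrases, the 300-character window, the 0.95 threshold, and the `classify_run` name are all hypothetical.

```python
def classify_run(output, accuracy, threshold=0.95):
    """Label one run as 'empty', 'refusal', 'dropped_instructions', or 'ok'."""
    text = output.strip()
    if not text:
        # No visible tokens at all, e.g. the budget went to hidden reasoning.
        return "empty"
    # Hypothetical refusal phrases; a real eval would need a broader check.
    refusal_markers = ("i'm sorry", "i can't", "i cannot", "cannot be produced")
    if any(marker in text[:300].lower() for marker in refusal_markers):
        return "refusal"
    if accuracy < threshold:
        # The model wrote a report but silently skipped some rules.
        return "dropped_instructions"
    return "ok"


print(classify_run("", 0.0))
print(classify_run("I'm sorry, but the requested report cannot be produced.", 0.0))
print(classify_run("Quarterly report: customer revenue grew.", 0.5))
```

Counting labels per model separates "refused the task" from "forgot the rules," which a single accuracy number hides.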

What did all this cost?

2,345 calls · $209.19
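For a sense of scale, the average cost per call follows directly from the two numbers on the slide. This is only an average; individual calls at high densities presumably cost far more than calls at low densities.

```python
total_calls = 2345       # from the slide
total_cost_usd = 209.19  # from the slide

cost_per_call = total_cost_usd / total_calls
print(f"${cost_per_call:.3f} per call on average")
```

That works out to roughly nine cents per call.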

Evals will help

You knew I was going to mention evals eventually

Newer findings

"Revisiting the Reliability of Language Models in Instruction-Following," 2026

We were early;

the field is catching up

FireBench · CCR-Bench · GuideBench

Your job has changed from compression to verification

Writing skills files is different in 2026

Thank you!

The code and data:

github.com/Arize-ai/instruction-budget

Follow me on Bluesky!

@seldo.com 🦋

Come to our World Cup watch party today at 5pm!
