From Vibes to Production

Evaluating and shipping agents that work

What are we talking about?

What is an eval, and why you need them
Setting up tracing with Arize AX
Building an AI agent with the Claude Agent SDK

From data to evals

Looking at your data: error analysis
Code evals and built-in LLM evals
Writing custom eval rubrics
Meta-evaluation: testing your tests

From evals to experiments and beyond

Datasets
Experiments
The improvement cycle

Get the notebook

Get set up

Anthropic API key: console.anthropic.com
Arize AX account: arize.com (start a free trial)
- Arize Space ID — in your AX workspace settings
- Arize API key — generate one in Settings

What is Arize AX?

Arize's AI observability and evaluation platform
Captures traces, runs evals, monitors production
Hosted for you — no infrastructure to manage

What is an eval?

Traces are logs, evals are tests

Traces = logs, for AI
Evals = tests, for AI

Spans: the building blocks

The vibes problem

What you can't do without evals

Can't detect regressions when you change a prompt
Can't compare prompt versions objectively
Can't know if a new model is actually better
Can't run quality gates in CI

You can't switch models without evals

New models drop every few months
Without evals, switching = weeks of manual testing
With evals, you know within hours

This is not theoretical

Descript, Bolt, Claude Code — all followed the same arc

Two types of evals

Code evals — deterministic, free, fast
LLM-as-a-judge — semantic, flexible, powerful

LLM-as-a-judge evals

A second LLM grades outputs against a rubric
Handles meaning, not just strings
Non-deterministic — needs calibration

LLM judges: tradeoffs

When to use which

Code evals → format, structure, constraints
LLM judge → accuracy, relevance, tone, faithfulness
Human review → novel failures, calibrating judges

Why agents make this harder

Single LLM call: input → output. Done.
Agent: input → tool call → result → reasoning → another tool call → output
Errors cascade. Each step can go wrong.

Multi-agent complexity

Handoffs between agents add another layer
Triage routing, specialist handling
Each layer = new ways things can go wrong

Cascading failures

Bad retrieval → bad reasoning → confidently wrong output
The user sees a polished response and trusts it
This is worse than an obvious failure

Creatively correct vs. wrong

Sometimes the agent finds a better solution
Your eval says "fail" — but the agent was right
Evals need to distinguish creative from wrong

Another way to categorize evals

Capability evals: can it do this new thing?
Regression evals: can it still do what it used to do?

What an eval result looks like

Code eval:  score: 1 · label: "valid"
LLM judge:  score: 0 · label: "incorrect"
            explanation: "The response fails to include..."

What a real explanation looks like

label: "incorrect"
explanation: "The response fails to include a budget
breakdown, which is a core requirement. The agent
provides destination info and local recommendations
but omits all cost estimates, making the plan
incomplete for a user who asked specifically
about budget travel to Tokyo."

Explanations make evals actionable

Concrete failure → you know what to fix
Same explanation across 50 traces = systematic problem
Evals become a debugging tool, not just a scoreboard
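
A minimal sketch of spotting a systematic problem, assuming eval results are plain dicts with the `label` and `explanation` fields shown above. The sample records and the `top_failure_reasons` helper are hypothetical. Real judge explanations vary in wording, so in practice you would likely group by theme rather than by exact text.

```python
from collections import Counter

# Hypothetical eval results, shaped like the label/explanation output above.
results = [
    {"label": "incorrect", "explanation": "Missing budget breakdown"},
    {"label": "correct", "explanation": "Covers all requirements"},
    {"label": "incorrect", "explanation": "Missing budget breakdown"},
    {"label": "incorrect", "explanation": "Wrong destination"},
]

def top_failure_reasons(results, fail_label="incorrect"):
    """Count how often each explanation appears among failing results."""
    reasons = Counter(
        r["explanation"] for r in results if r["label"] == fail_label
    )
    return reasons.most_common()

for reason, count in top_failure_reasons(results):
    print(f"{count}x {reason}")
```

A reason that dominates the list is a candidate for a prompt or tooling fix rather than a one-off patch.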

The full loop

Setting up tracing

Open the notebook

Install dependencies

%pip install claude-agent-sdk openinference-instrumentation-claude-agent-sdk arize arize-otel arize-phoenix anthropic

The Claude Agent SDK

Anthropic's framework for building agents
Tool use, web search, conversation context
Auto-instrumented by OpenInference

Set your keys

import os

# Placeholders: substitute your own keys
os.environ["ANTHROPIC_API_KEY"] = "sk-ant-XXXX"
os.environ["ARIZE_API_KEY"]     = "YYYY"
os.environ["ARIZE_SPACE_ID"]    = "ZZZZ"

Register the tracer

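from arize.otel import register, Endpoint
from openinference.instrumentation.claude_agent_sdk import (
    ClaudeAgentSDKInstrumentor)

# Register a tracer provider that sends spans to Arize AX
tracer_provider = register(
    space_id=..., api_key=...,
    project_name="aie-financial-demo",
    endpoint=Endpoint.ARIZE,
    batch=False,
)
# Auto-instrument every Claude Agent SDK call
ClaudeAgentSDKInstrumentor().instrument(
    tracer_provider=tracer_provider)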

Why this works so smoothly

Set up the AX client

from arize import ArizeClient
arize_client = ArizeClient(api_key=...)
SPACE_ID = os.environ["ARIZE_SPACE_ID"]
PROJECT_NAME = "aie-financial-demo"

Even easier: the arize-skills plugin

npx skills add Arize-ai/arize-skills

Works with Claude Code, Cursor, Codex, and many more
Skills handle instrumentation, evals, datasets, experiments

Build the agent

A financial analysis chatbot

The agent setup

from claude_agent_sdk import ClaudeAgentOptions

options = ClaudeAgentOptions(
    model="claude-haiku-4-5-20251001",
    allowed_tools=["WebSearch"],
    permission_mode="acceptEdits",
)

The two-turn pattern

RESEARCH_PROMPT = (
    "Research {tickers}. Focus on: {focus}. "
    "Use web search to find current financial data."
)
WRITE_PROMPT = (
    "Now write a concise financial report "
    "based on your research above."
)

Wrapping it in a span

with tracer.start_as_current_span("financial_report", ...):
    ...  # run both agent turns inside this span

Run it!

result = await financial_report(
    "TSLA",
    "financial performance and growth outlook"
)
print(result)

Non-deterministic by design

Look at the report

Open the trace in AX

Click into a span

This is observability

Generate test data

Here's one I made earlier

Test queries

test_queries = [
    {"tickers": "AAPL", "focus": "revenue growth"},
    {"tickers": "NVDA", "focus": "AI chip demand"},
    {"tickers": "AAPL, MSFT", "focus": "comparative analysis"},
    {"tickers": "RIVN", "focus": "financial health"},
    {"tickers": "KO", "focus": "dividend yield"},
    ...  # 12 in total
]

Covering the edge cases

Traces are loaded

Start with data, not metrics

Read your traces before you write evals

You need requirements first

You can't say "it doesn't work" if you haven't defined what "works" looks like
Write down explicit success criteria

Defining success is cross-functional work

Where to get test data

Before production: synthetic data (LLM-generated queries)
After production: real user queries from traces
Diversity is critical — vary phrasing, intent, complexity

Don't forget the edge cases

Examine the traces

When the output is suspiciously short

When the data looks right but isn't

The "confidently wrong" problem

Open coding and axial coding

Open coding: read data, name what you see, no preconceptions
Axial coding: group those names into bigger themes
This is qualitative research, not engineering

Categorize by root cause

"The response was wrong" — not actionable. Ask *why*.
Retrieval failure → better search
Reasoning error → better prompts
Hallucination → grounding checks
Scope violation → explicit boundaries

Frequency times severity
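
One way to read this: rank each failure category by how often it occurs multiplied by how bad it is. The sketch below uses made-up counts and a hypothetical 1-5 severity scale; the category names come from the root-cause list above.

```python
# Hypothetical tallies from error analysis: (category, occurrences, severity 1-5).
failure_modes = [
    ("retrieval failure", 12, 2),
    ("reasoning error", 5, 3),
    ("hallucination", 4, 5),
    ("scope violation", 1, 4),
]

def prioritize(modes):
    """Sort failure modes by frequency x severity, highest priority first."""
    return sorted(modes, key=lambda m: m[1] * m[2], reverse=True)

for name, freq, sev in prioritize(failure_modes):
    print(f"{name}: priority {freq * sev}")
```

With these numbers, a rare but severe hallucination outranks a more frequent but milder reasoning error.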

The Swiss Cheese model

Evaluations

The simplest useful eval

Get your spans from AX

spans_df = arize_client.spans.export_to_df(
    space_id=SPACE_ID,
    project_name=PROJECT_NAME,
    start_time=..., end_time=...,
)
parent_spans = spans_df[spans_df["parent_id"].isna()]

Ticker check eval

import re
from phoenix.evals import create_evaluator

@create_evaluator(name="mentions_ticker", kind="code")
def mentions_ticker(input, output):
    # Ticker-like tokens: 1-5 uppercase letters as a whole word
    tickers = re.findall(r"\b([A-Z]{1,5})\b", input)
    ...
    if not missing:
        return {"label": "pass", "score": 1}
    return {"label": "fail", "score": 0,
        "explanation": f"Missing: {', '.join(missing)}"}

Running an online eval

Running the ticker check

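from openinference.instrumentation import suppress_tracing
from phoenix.evals import evaluate_dataframe

# Don't trace the evaluation run itself
with suppress_tracing():
    results = evaluate_dataframe(
        dataframe=parent_spans,
        evaluators=[mentions_ticker])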
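
The body of `mentions_ticker` is elided on the slide. As a rough, self-contained illustration of what such a check could look like, here is a plain-Python version without the decorator. The function name `check_tickers` and the way missing tickers are detected are assumptions, not the notebook's code.

```python
import re

def check_tickers(input_text, output_text):
    """Pass if every ticker-like token in the input also appears in the output."""
    # Same pattern as the slide: 1-5 uppercase letters as a whole word.
    # Caveat: this also matches acronyms such as "AI", not just tickers.
    tickers = re.findall(r"\b([A-Z]{1,5})\b", input_text)
    missing = [t for t in tickers if not re.search(rf"\b{t}\b", output_text)]
    if not missing:
        return {"label": "pass", "score": 1}
    return {
        "label": "fail",
        "score": 0,
        "explanation": f"Missing: {', '.join(missing)}",
    }

print(check_tickers("AAPL, MSFT", "AAPL revenue grew this quarter."))
```

Because the check is deterministic, the same trace always gets the same result, which is what makes it cheap to run on every span.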

Log the results back to AX

log_eval_to_ax(results, eval_name="mentions_ticker")

Why this matters

Code evals aren't just toy examples

Did the output parse as JSON?
Is the response under 500 tokens?
Does it avoid forbidden phrases?
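
Each of these checks is a few lines of ordinary code. A minimal sketch, assuming a whitespace split as a rough stand-in for a real tokenizer and a hypothetical forbidden-phrase list:

```python
import json

FORBIDDEN_PHRASES = ["guaranteed returns", "cannot lose"]  # hypothetical list

def parses_as_json(output):
    """True if the output is valid JSON."""
    try:
        json.loads(output)
        return True
    except (json.JSONDecodeError, TypeError):
        return False

def under_token_limit(output, limit=500):
    """True if the output has fewer than `limit` whitespace-separated tokens."""
    # Approximation: a real check would use the model's tokenizer.
    return len(output.split()) < limit

def avoids_forbidden_phrases(output, phrases=FORBIDDEN_PHRASES):
    """True if none of the forbidden phrases appear (case-insensitive)."""
    lowered = output.lower()
    return not any(p in lowered for p in phrases)
```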

Grade the outcome, not the path

Don't check that the agent followed specific steps
Agents find valid approaches you didn't anticipate
Check the outcome, not the trajectory

Built-in LLM evals

What code can't check

Three components

1. A judge model (the LLM that grades)
2. A prompt template (the rubric)
3. Data (the examples being evaluated)

AX ships built-in evals

Correctness, Faithfulness, Conciseness
Tool Selection, Tool Invocation
Document Relevance, Refusal
No prompt engineering required

Set up the judge

from phoenix.evals.llm import LLM
from phoenix.evals.metrics import CorrectnessEvaluator
llm = LLM(provider="anthropic", model="claude-sonnet-4-6")
correctness_eval = CorrectnessEvaluator(llm=llm)

Run the evaluation

with suppress_tracing():
    correctness_results = evaluate_dataframe(
        dataframe=parent_spans,
        evaluators=[correctness_eval])

Every score is zero

Faithfulness — a better built-in

Giving the judge context

Correctness: "Is this factually accurate?" (no context)
Faithfulness: "Does this stick to the source material?" (with context)
The difference: faithfulness gets the research the agent found

How faithfulness works

FaithfulnessEvaluator needs three columns:
- input: the user's query
- output: the agent's response
- context: the source material to check against

Run faithfulness

from phoenix.evals.metrics import FaithfulnessEvaluator

faithfulness_eval = FaithfulnessEvaluator(llm=llm)
with suppress_tracing():
    faith_results = evaluate_dataframe(
        dataframe=spans_with_context,
        evaluators=[faithfulness_eval])

Faithfulness results

Two built-in evals, two different signals

Correctness: 0/13 — eval doesn't fit the use case
Faithfulness: 13/13 — confirms the reports are grounded
Choosing the right eval matters more than tuning it

Built-in evals are your starting point

Custom eval rubrics

The structure of a good rubric

The AX docs recommend four parts (1, 2, 3, and 5 below); we add labeled examples:
1. Define the judge's role
2. Explicit pass / fail criteria
3. Label the data with XML tags
4. Labeled examples (our own addition)
5. Define the output choices outside the prompt

Part 1: Define the role

"You are an expert financial analyst evaluator.
Your task is to judge whether a financial report
provides actionable investment guidance,
not just raw data."

Part 2: Explicit criteria

ACTIONABLE — The report:
- Contains specific recommendations (buy/sell/hold)
- Identifies concrete risks with supporting data
- Includes forward-looking analysis, not just history
- Provides context for *why* recommendations are made
NOT ACTIONABLE — The report:
- Only summarizes data without interpretation
- Lacks specific recommendations or next steps
- Presents risks without supporting evidence
- Contains only backward-looking analysis

Criteria come from error analysis

Part 3: Label the data with XML tags

<user_query>
{input}
</user_query>
<financial_report>
{output}
</financial_report>
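
To make the tagging concrete, here is a small sketch of filling the `{input}` and `{output}` slots for one example with plain `str.format`. The `example_template` text and the `render` helper are illustrative assumptions. When the template is passed to an evaluator, the evaluator normally does this substitution for you.

```python
# Hypothetical template: the role from Part 1 plus the tagged data slots.
example_template = """You are an expert financial analyst evaluator.

<user_query>
{input}
</user_query>
<financial_report>
{output}
</financial_report>
"""

def render(template, user_query, report):
    """Fill the {input} and {output} slots for a single example."""
    return template.format(input=user_query, output=report)

prompt = render(example_template, "Research NVDA", "NVDA report text")
print(prompt)
```

The tags give the judge an unambiguous boundary between the user's query and the report being graded.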

Part 4: Add examples

An actionable example

"Based on NVDA's 122% YoY revenue growth driven by
data center demand, strong forward P/E of 35x relative
to sector median of 22x, and expanding margins, NVDA
presents a compelling growth position. Key risk:
concentration in AI training chips (~70% of revenue).
Recommendation: accumulate on pullbacks below $800."

A not-actionable example

"NVDA is a major player in the semiconductor industry.
The company has seen significant growth in recent years
driven by AI demand. NVDA's stock has performed well.
Investors should consider various factors when making
investment decisions."

Part 5: Keep the choices out of the prompt

Don't end the prompt with "answer ACTIONABLE or NOT"
Define the choices in the evaluator config instead
choices={"actionable": 1.0, "not actionable": 0.0}

Chain-of-thought for judges

Wire it up

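from phoenix.evals import ClassificationEvaluator

actionability_evaluator = ClassificationEvaluator(
    name="actionability",
    llm=llm,
    prompt_template=actionability_template,
    # Label-to-score mapping lives here, not in the prompt
    choices={"actionable": 1.0, "not actionable": 0.0},
)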

Online LLM as a judge

Look at the results

Read the explanations

Eval anti-patterns

Treat eval prompts like code

Version them. Test them against known answers.
Small wording changes shift results.
An unvalidated eval is a fancy way of being wrong at scale.

The God Evaluator anti-pattern

Don't build one eval that checks everything
One evaluator per dimension

One evaluator per dimension

Guardrails vs. north-star metrics

Guardrails — ship-blockers
North-stars — aspirational targets
Know which is which
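
One way to encode the distinction, as a sketch: guardrails fail the build when a pass rate drops below its threshold, while north-star metrics are only reported. The metric names and thresholds below are hypothetical.

```python
# Hypothetical thresholds: guardrails block a release, north-stars are tracked only.
GUARDRAILS = {"mentions_ticker": 1.0, "faithfulness": 0.95}
NORTH_STARS = {"actionability": 0.80}

def check_release(pass_rates):
    """Return (ok, messages); ok is False if any guardrail is below threshold."""
    ok, messages = True, []
    for name, minimum in GUARDRAILS.items():
        rate = pass_rates.get(name, 0.0)
        if rate < minimum:
            ok = False
            messages.append(f"BLOCK {name}: {rate:.2f} < {minimum:.2f}")
    for name, target in NORTH_STARS.items():
        rate = pass_rates.get(name, 0.0)
        if rate < target:
            messages.append(f"WARN {name}: {rate:.2f} below target {target:.2f}")
    return ok, messages
```

In CI, a script could exit non-zero when `ok` is False and just print the warnings otherwise.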

Can you trust your judges?

Meta-evaluation

Your judge is a classifier

It makes predictions: pass or fail
Predictions can be compared against ground truth
Your job: check the judge's homework

Human judgement is a lot of work

Building your golden dataset

Pull the labels back into the notebook

spans_df = arize_client.spans.export_to_df(...)
ANNOTATION_COL = "annotation.human_actionable.label"
labeled_subset = parent_spans[
    parent_spans[ANNOTATION_COL].notna()]

Write unambiguous tasks

A consistent 0% pass rate → broken task, not broken agent
Each task needs a reference solution
Test when a behavior SHOULD occur AND when it shouldn't

Dev/test splits for your labels
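
The idea is to tune the rubric against one slice of your human labels and report agreement on a slice the rubric never saw. A minimal sketch, assuming a 50/50 split and a fixed seed (both arbitrary choices); `split_labels` is a hypothetical helper.

```python
import random

def split_labels(examples, test_fraction=0.5, seed=42):
    """Shuffle labeled examples with a fixed seed and split into (dev, test)."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

# Stand-in for 12 labeled examples; real items would be rows or span ids.
dev, test = split_labels(range(12))
print(len(dev), len(test))
```

The fixed seed keeps the split stable across notebook runs, so a rubric change is never judged against a reshuffled test set.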

Run the judge on the same examples

with suppress_tracing():
    judge_results = evaluate_dataframe(
        dataframe=labeled_subset,
        evaluators=[actionability_evaluator])

Where they agree and disagree

Fixing the rubric

Disagreement → read the explanation → find the ambiguity → tighten
"Forward-looking analysis" → "Forward-looking analysis WITH specific recommendations"

Precision and recall

Precision: when the judge says "fail," is it right?
Recall: of all real fails, how many does it catch?
Prioritize recall — catching defects matters more
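
A short sketch of computing both numbers with "fail" as the positive class, comparing judge labels against human labels. The six labels are invented for illustration.

```python
def precision_recall(human, judge, positive="fail"):
    """Precision and recall of the judge for `positive`, using human labels as truth."""
    pairs = list(zip(human, judge))
    tp = sum(h == positive and j == positive for h, j in pairs)
    fp = sum(h != positive and j == positive for h, j in pairs)
    fn = sum(h == positive and j != positive for h, j in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Hypothetical labels for six traces.
human = ["fail", "fail", "pass", "pass", "fail", "pass"]
judge = ["fail", "pass", "pass", "fail", "fail", "pass"]
precision, recall = precision_recall(human, judge)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Here the judge flags three traces as failing and two of them really are (precision 2/3), and it catches two of the three real failures (recall 2/3).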

Prioritize recall

Judge pitfalls

Position bias — judges favor the first or last option
Length bias — longer responses score higher
Confidence bias — fooled by confidently wrong answers
Self-preference — same model rates its own output higher

Mitigating self-preference bias

The benchmark is human performance

Human inter-rater reliability: often 0.2–0.3 (Cohen's Kappa)
If your judge is more consistent than humans, that's a win
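
Cohen's Kappa measures agreement between two raters after subtracting the agreement expected by chance. A minimal sketch for two label lists of equal length; the example labels are invented.

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa: agreement between two raters, corrected for chance."""
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    expected = sum(
        (counts_a[label] / n) * (counts_b[label] / n)
        for label in set(rater_a) | set(rater_b)
    )
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1 - expected)

# Hypothetical labels: a human annotator vs. the LLM judge.
human_labels = ["fail", "fail", "pass", "pass", "fail", "pass"]
judge_labels = ["fail", "pass", "pass", "fail", "fail", "pass"]
print(round(cohens_kappa(human_labels, judge_labels), 2))
```

In this example the raters agree on 4 of 6 items, but half of that agreement is expected by chance, so kappa comes out at about 0.33.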

Failures should seem fair

When a task fails, is it clear what the agent got wrong?
If scores don't climb, is the eval at fault?
Reading transcripts is how you verify

Self-improving systems

Datasets and experiments

The problem with one-off fixes

Save failures as a dataset

Filter to failing traces in AX
Click "Save as Dataset"
Name it "aie-financial-demo-fails"

Save passing traces too

Failures dataset → are we catching the bad?
Passing dataset → did the good stuff stay good?

Your datasets evolve over time

Pre-production: synthetic test cases
Early production: a mix
Mature: mostly real production traces, labeled
Failure set + pass set = your golden dataset

Improve the agent — let Claude do it

Feed Claude: current prompts + judge explanations + requirements
Claude finds the themes and rewrites both prompts
One call with the anthropic SDK — the same package the judge uses

Every change is grounded in a finding

Wire up the improved agent

async def improved_financial_report(tickers, focus):
    # Same flow as financial_report, but using
    # IMPROVED_RESEARCH_PROMPT and IMPROVED_WRITE_PROMPT
    ...

Run an experiment

experiment, experiment_df = arize_client.experiments.run(
    name="improved-prompts-v1",
    dataset="aie-financial-demo-fails",
    space=SPACE_ID,
    task=improved_agent_task,
    evaluators=[actionability_evaluator],
)

The task abstraction

What experiments show you

Same inputs, same evaluators, different agent version
The only variable is your change
Side-by-side comparison, example by example
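
A sketch of the example-by-example comparison, assuming you have per-example scores from a baseline run and a candidate run over the same dataset and evaluator. The dict shape, the example ids, and the `compare_runs` helper are hypothetical, not the AX export format.

```python
def compare_runs(baseline, candidate):
    """Pair per-example scores by example id and bucket the changes."""
    summary = {"improved": [], "regressed": [], "unchanged": []}
    for example_id, old in baseline.items():
        new = candidate[example_id]
        if new > old:
            summary["improved"].append(example_id)
        elif new < old:
            summary["regressed"].append(example_id)
        else:
            summary["unchanged"].append(example_id)
    return summary

# Hypothetical scores (1 = pass, 0 = fail) for the same three examples.
baseline = {"ex1": 0, "ex2": 0, "ex3": 1}
candidate = {"ex1": 1, "ex2": 0, "ex3": 0}
print(compare_runs(baseline, candidate))
```

An unchanged aggregate pass rate can hide one fix and one regression, which is why the per-example view matters.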

Compare the results

The eval-iterate cycle

Find failures → Read explanations → Fix → Run experiment → Repeat

How many samples do you need?

Workshop experiments: 12–20 examples for directional signal
Shipping decisions: 200–400 samples
Halving the margin of error takes 4x the samples
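
The 4x rule follows from the margin of error shrinking with the square root of the sample size. A quick check using the normal approximation for a pass rate, assuming a worst-case pass rate of 50% and a 95% confidence level (z = 1.96):

```python
import math

def margin_of_error(pass_rate, n, z=1.96):
    """Half-width of a ~95% normal-approximation interval for a pass rate."""
    return z * math.sqrt(pass_rate * (1 - pass_rate) / n)

for n in (20, 100, 400):
    print(n, round(margin_of_error(0.5, n), 3))
```

Under these assumptions, 100 samples give roughly ±10 percentage points and 400 samples roughly ±5. With 20 samples the margin is over ±20 points, which is why small workshop runs are only directional.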

The impact hierarchy

1. Data quality fixes (highest impact)
2. Prompting improvements
3. Model selection
4. Hyperparameter tuning (lowest impact)

Eval-driven development

Write the eval first, then build the feature
Like test-driven development, but for AI
The eval defines what "done" means

Who can write evals?

Product managers, customer success, even salespeople
They know what good looks like better than engineers do

Into production

Where AX goes beyond the notebook

Online evals

Run your evals automatically on incoming production traces
The same evaluators you wrote today
Span, trace, or session scope

Sample, don't grade everything

10% sampling is a good default
Cheaper, and statistically representative
AX handles the sampling for you
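
For intuition only, here is a sketch of what a 10% sample could look like if you did it yourself: hash the trace id and keep the traces whose hash falls in the lowest tenth. This is an illustrative assumption, not a description of how AX samples. `should_evaluate` is a hypothetical helper.

```python
import hashlib

def should_evaluate(trace_id, rate=0.10):
    """Deterministically select about `rate` of traces from a hash of the id."""
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # uniform in [0, 1)
    return bucket < rate

sampled = sum(should_evaluate(f"trace-{i}") for i in range(10_000))
print(f"{sampled / 10_000:.1%} of traces selected")
```

Hashing instead of calling a random number generator means the same trace always gets the same decision, so reruns grade the same subset.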

Alyx Eval Builder

Describe the eval in plain English
Alyx generates the rubric template
You review and tweak before shipping

The full cycle in production

Application → online evals → eval labels → monitors → alert → improve → repeat

Today's failure is tomorrow's regression test

One more thing: feeding evals to a coding agent

Export failing traces + explanations from AX
Hand them to Claude Code or Cursor as context
"Find the patterns. Propose fixes."
Then verify with an experiment

How that works

Keep the loop honest

Feed it your requirements, not just the failures
Find themes, not one-off failures
Always verify with an experiment before shipping

The SDLC closing on itself

What we built today

Instrument → trace → read data → eval → validate → iterate → ship → monitor

Start small

Evals are infrastructure

Treat evals as a core part of your system, not an afterthought
The value compounds — but only if you keep investing

Go try it

arize.com — start a free trial
arize.com/docs/ax — the docs
npx skills add Arize-ai/arize-skills

Thank you!

@seldo.com on Bluesky

Get a free year! Upgrade with code ARIZEAIE2026