
You

In the next

✨ 20 minutes ✨

Still you, but now you understand AI Evals

I am Vipul!

@vipulgupta2048

(pronounced "We-pull")

  • Product Owner & Documentation Lead at balena
  • Solopreneur, "Right" the Docs @ Mixster
  • Community Lead, GitTogether Delhi
  • Pronouns: He/him/his


How do you go about
testing software?


Testing Software

  • Reproducible
  • Deterministic
  • Traceable


Testing LLMs

  • Non-reproducible*
  • Non-deterministic* 
  • Non-traceable*


 

* LLMs produce probabilistic, context-sensitive outputs.

There is no testing a system like this in the traditional sense. We do evaluations.

And, chances are, you are doing evals already...


LLM Evaluations

  • Given an input prompt, an output is generated. We evaluate this output against a set of ideal answers to assess the quality of the LLM system. This is an eval (see the sketch after this list).

  • LLM evals are the process of validating and testing the outputs that your LLM application produces.

  • Assessing whether the LLM application (not just the model) provides value on the tasks it was designed to solve.
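To make that concrete, here is a minimal sketch of an eval loop in Python. Everything in it is a stand-in: call_llm() is a hypothetical hook into your LLM application, and the grader is the simplest one imaginable.

  def call_llm(prompt: str) -> str:
      # Hypothetical stand-in: replace with a call to your LLM application.
      return "The capital of India is New Delhi."

  eval_set = [
      {"prompt": "What is the capital of India?", "ideal": "New Delhi"},
      {"prompt": "What is 2 + 2?", "ideal": "4"},
  ]

  def grade(output: str, ideal: str) -> bool:
      # Simplest possible grader: substring match after lowercasing.
      return ideal.lower() in output.lower()

  passed = sum(grade(call_llm(case["prompt"]), case["ideal"]) for case in eval_set)
  print(f"Pass rate: {passed}/{len(eval_set)}")

In practice the grader might be exact match, a regex, a similarity score, or another LLM acting as judge; the loop stays the same.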


Different to benchmarking

  • Public benchmarks test general capabilities like coding and mathematics through standardized assessments.

  • LLM evaluations focus on specific application performance, including prompts, application logic, and integrated components.

Why should I care?

It's just testing the LLM, isn't it?
Isn't it... 👀

Why should I care?

  • Background monitoring - running passively, without interrupting the core workflow, evals detect drift or degradation.

  • Guardrails - evals block harmful output, force a retry, or fall back to a safer alternative (see the sketch after this list).

  • Iterations - evals can label data for fine-tuning LLMs, select high-quality few-shot examples for prompts, or identify failure cases that motivate architectural changes.
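Here is a rough sketch of the guardrail pattern; is_safe() and call_llm() are toy stand-ins (not from any particular library), but the retry-then-fallback shape is the point.

  def call_llm(prompt: str) -> str:
      # Hypothetical stand-in for your LLM application.
      return "Here is a helpful, harmless answer."

  def is_safe(output: str) -> bool:
      # Toy safety eval; in practice this could be a trained
      # classifier or an LLM-as-judge call.
      banned = ("password", "credit card")
      return not any(term in output.lower() for term in banned)

  FALLBACK = "Sorry, I can't help with that. Connecting you to a human."

  def guarded_reply(prompt: str, max_retries: int = 1) -> str:
      for _ in range(max_retries + 1):
          output = call_llm(prompt)
          if is_safe(output):
              return output  # eval passed: ship the answer
      return FALLBACK        # eval kept failing: fall back to a safer reply

  print(guarded_reply("Help me plan a trip to Delhi."))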

Types of Evaluations

  • Reference-based evals
    • Compare system outputs against predetermined correct answers. Test questions with expected responses.
       

  • Reference-free evals
    • Assess output quality without predetermined answers, or where multiple correct answers exist. Checking for safety, politeness, correctness, etc. (see the sketch below).
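The contrast is easiest to see in code. Both graders below are deliberately toy versions; a real reference-free check would usually be a classifier or an LLM-as-judge.

  def reference_based(output: str, ideal: str) -> bool:
      # Compare against a predetermined correct answer.
      return output.strip().lower() == ideal.strip().lower()

  def reference_free(output: str) -> bool:
      # No gold answer: judge properties of the output itself,
      # here a keyword check standing in for a politeness judge.
      rude = ("stupid", "shut up")
      return not any(word in output.lower() for word in rude)

  print(reference_based("New Delhi", "new delhi"))  # True
  print(reference_free("Happy to help!"))           # True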

Finally the Demo

Sarvam M

Initial Evals Work

Evaluating the model to be a regional tourist guide & area expert

Understanding Reasoning

Guardrails can be put in place to keep the conversation in the chosen language until the user explicitly asks to switch (see the sketch below).
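A toy version of that language guardrail might look like the sketch below; the Devanagari script check is a crude stand-in for a real language-ID model, and the trigger phrase is illustrative.

  def is_devanagari(text: str) -> bool:
      # Crude language check: every letter falls in the Devanagari block.
      letters = [ch for ch in text if ch.isalpha()]
      return bool(letters) and all("\u0900" <= ch <= "\u097f" for ch in letters)

  def keeps_language(user_msg: str, model_reply: str) -> bool:
      if "switch to english" in user_msg.lower():
          return True                    # user explicitly asked to change
      return is_devanagari(model_reply)  # otherwise the reply must stay in Hindi

  print(keeps_language("Aur batao", "नमस्ते! दिल्ली में आपका स्वागत है।"))  # True
  print(keeps_language("Aur batao", "Welcome to Delhi!"))                   # False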

How do I become an evaluator?


1. alignmentforum.org > a-starter-guide-for-evals

2. Everything you need to know, https://www.news.aakashg.com/p/ai-evals 

3. OpenAI Evals cookbook

4. Hamel Husain's videos on YouTube

5. Building LLM Applications and writing evals

6. This is all I have been doing.

 

Thank you all for listening!

And, that's about it!

Questions? Collaborate? Work with us? Reach out!

@vipulgupta2048

Reviews cheesecakes, closes issues & runs Mixster to "right" the docs for startups

Feedback please + Link to the slides

Deep Dive into LLM Evals

By Vipul Gupta

To eval or not to eval - that's the question. Presented at FOSS United Delhi & GitTogether Delhi-NCR Anniversary Meetup July 2025