
You

In the next

✨ 20 minutes ✨

Still you, but now you understand AI Evals

I am Vipul!

@vipulgupta2048

(pronounced "We-pull")

  • Product Owner & Documentation Lead at balena
  • Solopreneur, "Right" the Docs @ Mixster
  • Community Lead, GitTogether Delhi
  • Pronouns: He/him/his


How do you go about
testing software?


Testing Software

  • Reproducible
  • Deterministic
  • Traceable


Testing LLMs

  • Non-reproducible*
  • Non-deterministic* 
  • Non-traceable*


 

* LLMs produce probabilistic, context-sensitive outputs.

There is no testing a system like this in the traditional sense. We do evaluations.

And, chances are, you are doing evals already...


LLM Evaluations

  • Given an input prompt, an output is generated. We evaluate this output against a set of ideal answers to assess the quality of the LLM system. This is an eval (see the sketch after this list).

  • LLM evals are the process of validating and testing the outputs that your LLM application produces.

  • Assessing whether the LLM application (not just the model) provides value on the tasks it was designed to solve.
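To make that concrete, here is a minimal sketch of an eval loop in Python. Everything in it is a stand-in: call_llm() is a hypothetical hook into your LLM application, and the grader is the simplest one imaginable.

  def call_llm(prompt: str) -> str:
      # Hypothetical stand-in: replace with a call to your LLM application.
      return "The capital of India is New Delhi."

  eval_set = [
      {"prompt": "What is the capital of India?", "ideal": "New Delhi"},
      {"prompt": "What is 2 + 2?", "ideal": "4"},
  ]

  def grade(output: str, ideal: str) -> bool:
      # Simplest possible grader: substring match after lowercasing.
      return ideal.lower() in output.lower()

  passed = sum(grade(call_llm(case["prompt"]), case["ideal"]) for case in eval_set)
  print(f"Pass rate: {passed}/{len(eval_set)}")

In practice the grader might be exact match, a regex, a similarity score, or another LLM acting as judge; the loop stays the same.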


Different to benchmarking

  • Public benchmarks test general capabilities like coding and mathematics through standardized assessments.

  • LLM evaluations focus on specific application performance, including prompts, application logic, and integrated components.

Why should I care?

It's just testing the LLM, isn't it?
Isn't it... 👀

Why should I care?

  • Background monitoring - running passively, without interrupting the core workflow, evals detect drift or degradation.

  • Guardrails - evals block harmful output, force a retry, or fall back to a safer alternative (see the sketch after this list).

  • Iterations - evals can label data for fine-tuning LLMs, select high-quality few-shot examples for prompts, or identify failure cases that motivate architectural changes.
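Here is a rough sketch of the guardrail pattern; is_safe() and call_llm() are toy stand-ins (not from any particular library), but the retry-then-fallback shape is the point.

  def call_llm(prompt: str) -> str:
      # Hypothetical stand-in for your LLM application.
      return "Here is a helpful, harmless answer."

  def is_safe(output: str) -> bool:
      # Toy safety eval; in practice this could be a trained
      # classifier or an LLM-as-judge call.
      banned = ("password", "credit card")
      return not any(term in output.lower() for term in banned)

  FALLBACK = "Sorry, I can't help with that. Connecting you to a human."

  def guarded_reply(prompt: str, max_retries: int = 1) -> str:
      for _ in range(max_retries + 1):
          output = call_llm(prompt)
          if is_safe(output):
              return output  # eval passed: ship the answer
      return FALLBACK        # eval kept failing: fall back to a safer reply

  print(guarded_reply("Help me plan a trip to Delhi."))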

Types of Evaluations

  • Reference-based evals
    • Compare system outputs against predetermined correct answers. Test questions with expected responses.
       

  • Reference-free evals
    • Assess output quality without predetermined answers, or where multiple correct answers exist. Checking for safety, politeness, correctness, etc. (see the sketch below).
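The contrast is easiest to see in code. Both graders below are deliberately toy versions; a real reference-free check would usually be a classifier or an LLM-as-judge.

  def reference_based(output: str, ideal: str) -> bool:
      # Compare against a predetermined correct answer.
      return output.strip().lower() == ideal.strip().lower()

  def reference_free(output: str) -> bool:
      # No gold answer: judge properties of the output itself,
      # here a keyword check standing in for a politeness judge.
      rude = ("stupid", "shut up")
      return not any(word in output.lower() for word in rude)

  print(reference_based("New Delhi", "new delhi"))  # True
  print(reference_free("Happy to help!"))           # True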

Finally the Demo

Sarvam M

Initial Evals Work

Evaluating the model to be a regional tourist guide & area expert

Understanding Reasoning

Guardrails can be put in place to keep the conversation in the chosen language until the user explicitly asks to switch (see the sketch below).
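A toy version of that language guardrail might look like the sketch below; the Devanagari script check is a crude stand-in for a real language-ID model, and the trigger phrase is illustrative.

  def is_devanagari(text: str) -> bool:
      # Crude language check: every letter falls in the Devanagari block.
      letters = [ch for ch in text if ch.isalpha()]
      return bool(letters) and all("\u0900" <= ch <= "\u097f" for ch in letters)

  def keeps_language(user_msg: str, model_reply: str) -> bool:
      if "switch to english" in user_msg.lower():
          return True                    # user explicitly asked to change
      return is_devanagari(model_reply)  # otherwise the reply must stay in Hindi

  print(keeps_language("Aur batao", "नमस्ते! दिल्ली में आपका स्वागत है।"))  # True
  print(keeps_language("Aur batao", "Welcome to Delhi!"))                   # False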

How do I become an evaluator?


1. alignmentforum.org > a-starter-guide-for-evals

2. Everything you need to know, https://www.news.aakashg.com/p/ai-evals 

3. OpenAI Evals cookbook

4. Hamel Husain's videos on YouTube

5. Building LLM Applications and writing evals

6. This is all I have been doing.

 

Thank you all for listening!

And, that's about it!

Questions? Collaborate? Work with us? Reach out!

@vipulgupta2048

Reviews cheesecakes, closes issues & runs Mixster to "right" the docs for startups

Feedback please + Link to the slides

Deep Dive into LLM Evals

By Vipul Gupta

To eval or not to eval - that's the question. Presented at FOSS United Delhi & GitTogether Delhi-NCR Anniversary Meetup July 2025