Vipul Gupta
Vipul is a software engineer at balena and a documentarian running his docs initiative called Mixster. He advocates strongly for open-source, cheesecakes and party parrots.
@vipulgupta2048
You
Still you, but now you understand AI Evals
In the next
✨ 20 minutes ✨
We-pull
Non-traceable*
LLMs provide probabilistic, context-sensitive outputs
There is no testing a system like this. We do evaluations.
Given an input prompt, an output is generated. We evaluate this output against a set of ideal answers to assess the quality of the LLM system. This is an eval.
LLM evals are the process of validating and testing the outputs that your LLM application produces.
Evals assess whether an LLM application (not just the model) provides value on the tasks it was designed to solve.
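To make the definition concrete, here is a minimal sketch of a single eval: one output scored against a set of ideal answers. The function names, the string-similarity metric (Python's difflib) and the 0.8 pass threshold are all assumptions chosen for illustration, not something the talk prescribes; real evals often use task-specific checks or an LLM judge instead.

```python
from difflib import SequenceMatcher


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so formatting does not affect the score."""
    return " ".join(text.lower().split())


def eval_output(output: str, ideal_answers: list[str], threshold: float = 0.8) -> dict:
    """Score one output against a set of ideal answers.

    The score is the best string similarity (0.0 to 1.0) to any ideal answer.
    The 0.8 threshold is an arbitrary assumption for this sketch.
    """
    if not ideal_answers:
        raise ValueError("need at least one ideal answer")
    out = normalize(output)
    score = max(
        SequenceMatcher(None, out, normalize(answer)).ratio()
        for answer in ideal_answers
    )
    return {"score": score, "passed": score >= threshold}


# Hypothetical example: the output would normally come from your LLM application.
result = eval_output(
    "Paris is the capital of France.",
    ["The capital of France is Paris.", "Paris is the capital of France"],
)
print(result)
```

String similarity is a crude proxy for quality. It is used here only because it runs without any model or API key.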
Public benchmarks test general capabilities like coding and mathematics through standardized assessments.
LLM evaluations focus on the performance of a specific application, including its prompts, application logic, and integrated components.
It's just testing the LLM, isn't it?
Isn't it... 👀
Background Monitoring - Evals run passively, without interrupting the core workflow, to detect drift or degradation.
Guardrails - Evals block harmful output, force a retry, or fall back to a safer alternative.
Iterations - Evals can label data for fine-tuning LLMs, select high-quality few-shot examples for prompts, or identify failure cases that motivate architectural changes.
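The guardrail use case above can be sketched as a small wrapper: run an eval on every output, retry when it fails, and fall back to a safe message when retries run out. Everything named here (guarded_generate, keyword_eval, the blocklist, the fallback text, the fake model) is hypothetical and stands in for your own application and safety check.

```python
from typing import Callable

FALLBACK = "Sorry, I can't help with that request."


def guarded_generate(
    prompt: str,
    generate: Callable[[str], str],
    is_safe: Callable[[str], bool],
    max_retries: int = 2,
) -> str:
    """Return the first output that passes the eval, else a safe fallback."""
    for _ in range(1 + max_retries):
        output = generate(prompt)
        if is_safe(output):
            return output  # eval passed: let the output through
    return FALLBACK  # every attempt was blocked


# Toy eval (assumption): block outputs that mention sensitive keywords.
BLOCKLIST = {"password", "ssn"}


def keyword_eval(output: str) -> bool:
    return not any(word in output.lower() for word in BLOCKLIST)


# Fake model (assumption): the first attempt is unsafe, the second is fine.
attempts = iter([
    "Your password is hunter2",
    "I can't share credentials, but here is how to reset them.",
])


def fake_model(prompt: str) -> str:
    return next(attempts)


reply = guarded_generate("What is my password?", fake_model, keyword_eval)
print(reply)
```

Unlike background monitoring, this eval sits in the request path, so it needs to be fast and cheap enough to run on every response.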
Compare system outputs against predetermined correct answers. Test questions with expected responses.
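A ground-truth eval of this kind could look roughly like the sketch below: a list of test questions with expected responses, and a pass rate over the whole set. The test cases, the fake application and the normalized exact-match rule are assumptions made for illustration.

```python
import string
from typing import Callable


def normalize(text: str) -> str:
    """Lowercase, strip punctuation and collapse whitespace before comparing."""
    table = str.maketrans("", "", string.punctuation)
    return " ".join(text.lower().translate(table).split())


def run_suite(answer_fn: Callable[[str], str], cases: list[tuple[str, str]]) -> float:
    """Return the fraction of cases where the answer matches the expected response."""
    passed = sum(
        normalize(answer_fn(question)) == normalize(expected)
        for question, expected in cases
    )
    return passed / len(cases)


# Hypothetical test set: (question, expected response) pairs.
CASES = [
    ("What is 2 + 2?", "4"),
    ("What is the capital of India?", "New Delhi"),
    ("How many days are in a week?", "7"),
]

# Stand-in for the LLM application under test (assumption).
CANNED = {
    "What is 2 + 2?": "4",
    "What is the capital of India?": "new delhi.",
    "How many days are in a week?": "Seven",
}


def fake_app(question: str) -> str:
    return CANNED[question]


accuracy = run_suite(fake_app, CASES)
print(f"accuracy: {accuracy:.2f}")  # 2 of 3 cases match, so 0.67
```

The third case fails because "Seven" is not an exact match for "7", even though a human would accept it. That gap is why exact match suits short, closed-form answers better than free-form text.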
Evaluating the model to be a regional tourist guide & area expert
Understanding Reasoning
Guardrails can be put in place to keep the conversation in the chosen language until the user explicitly asks to change it.
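One way such a language guardrail might be sketched: track the session language, change it only on an explicit user request, and reject replies detected as another language. The class name, the request pattern, the supported languages and especially the toy detector (which only checks for Devanagari characters) are assumptions; a real system would plug in a proper language-identification component.

```python
import re
from typing import Callable

# Assumed phrasing of an explicit switch request, e.g. "reply in English".
SWITCH_REQUEST = re.compile(
    r"\b(?:switch to|reply in|speak in)\s+([a-z]+)\b", re.IGNORECASE
)
SUPPORTED = {"hindi", "english"}


class LanguageGuardrail:
    """Keeps replies in one language until the user explicitly asks to change it."""

    def __init__(self, language: str, detect: Callable[[str], str]):
        self.language = language.lower()
        self.detect = detect

    def observe_user(self, message: str) -> None:
        """Update the session language only on an explicit, supported request."""
        match = SWITCH_REQUEST.search(message)
        if match and match.group(1).lower() in SUPPORTED:
            self.language = match.group(1).lower()

    def allows(self, reply: str) -> bool:
        """Eval: does the reply's detected language match the session language?"""
        return self.detect(reply) == self.language


def toy_detect(text: str) -> str:
    """Toy detector (assumption): any Devanagari character means Hindi."""
    is_hindi = any("\u0900" <= ch <= "\u097f" for ch in text)
    return "hindi" if is_hindi else "english"


guard = LanguageGuardrail("hindi", toy_detect)
guard.observe_user("Where is the Red Fort?")  # no explicit request: stay in Hindi
print(guard.allows("The Red Fort is in Old Delhi."))  # False: reply drifted to English
guard.observe_user("Please reply in English from now on.")
print(guard.allows("The Red Fort is in Old Delhi."))  # True: user asked to switch
```

A reply that fails allows() can then be retried or replaced, in the same way as any other blocked output.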
How do I become an evaluator?
1. alignmentforum.org > a-starter-guide-for-evals
2. Everything you need to know, https://www.news.aakashg.com/p/ai-evals
4. Hamel Husain's videos on YouTube
5. Building LLM Applications and writing evals
6. This is all I have been doing.
Thank you all for listening!
Questions? Collaborate? Work with us? Reach out!
@vipulgupta2048
Reviews cheesecakes, closes issues & runs Mixster to "right" the docs for startups
Feedback please + Link to the slides
By Vipul Gupta
To eval or not to eval - that's the question. Presented at FOSS United Delhi & GitTogether Delhi-NCR Anniversary Meetup July 2025