Vipul Gupta
Vipul is a software engineer at balena and a documentarian running his docs initiative called Mixster. He advocates strongly for open-source, cheesecakes and party parrots.
@vipulgupta2048
You
Still you, but now you understand AI Evals
In the next
✨ 30 minutes ✨
We-pull
Non-traceable*
LLMs provide probabilistic, context-sensitive outputs
Hence, we go beyond testing: we do evaluations.
The process of validating and testing the outputs that your LLM applications are producing.
Assess whether LLM applications (not just models) provide value on the tasks they were designed to solve.
Given an input prompt, an output is generated. We evaluate this output against a set of ideal answers to assess the quality of the LLM system. This is an eval.
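A minimal sketch of that loop in Python, assuming a simple normalized-match check is good enough for the task. `fake_llm_app` is a hypothetical stand-in for a real LLM application call, and the prompt and ideal answers are made up for illustration.

```python
def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial formatting differences don't fail the eval."""
    return " ".join(text.lower().split())


def run_eval(output: str, ideal_answers: list[str]) -> bool:
    """Return True if the output matches any ideal answer after normalization."""
    return normalize(output) in {normalize(answer) for answer in ideal_answers}


def fake_llm_app(prompt: str) -> str:
    """Hypothetical stand-in for the LLM application under test (assumption, not a real API)."""
    return "  Paris "


prompt = "What is the capital of France? Answer with one word."
output = fake_llm_app(prompt)  # input prompt -> generated output
passed = run_eval(output, ideal_answers=["Paris", "paris."])  # output vs. ideal answers
print(f"passed={passed}")
```

Exact matching only suits short, closed-form answers; free-form outputs usually need a fuzzier scorer, which is outside this sketch.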
Public benchmarks test general capabilities like coding and mathematics through standardized assessments.
LLM evaluations focus on the performance of a specific application, including its prompts, application logic, and integrated components.
It's just testing the LLM, isn't it?
Isn't it... 👀
Background Monitoring - passively, without interrupting the core workflow, evals detect drift or degradation.
Guardrails - evals block harmful output, force a retry, or fall back to a safer alternative.
Pipeline improvement - evals can label data for fine-tuning LLMs, select high-quality few-shot examples for prompts, or identify failure cases that motivate architectural changes.
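The guardrail use above can be sketched as follows. This is an illustration under stated assumptions: the blocked-term list, the fallback message, and the `guarded_call` helper are all hypothetical, and a real guardrail would likely use a stronger check than keyword matching.

```python
from typing import Callable

BLOCKED_TERMS = {"password", "api key"}  # hypothetical policy, for illustration only
FALLBACK = "Sorry, I can't help with that."  # hypothetical safer alternative


def is_safe(output: str) -> bool:
    """Eval used as a guardrail: fail any output containing a blocked term."""
    lowered = output.lower()
    return not any(term in lowered for term in BLOCKED_TERMS)


def guarded_call(generate: Callable[[str], str], prompt: str, max_retries: int = 2) -> str:
    """Call the generator, retrying on unsafe output and falling back if retries run out."""
    for _ in range(max_retries + 1):
        output = generate(prompt)
        if is_safe(output):
            return output  # eval passed, release the output
    return FALLBACK  # every attempt was blocked


# Simulated generator: the first answer is unsafe, the retry is safe.
responses = iter([
    "Your password is hunter2",
    "I can't share credentials, but here is how to reset them.",
])
result = guarded_call(lambda prompt: next(responses), "How do I log in?")
print(result)
```

The same `is_safe` check could also run in the background on logged outputs for monitoring, without blocking the response.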
Compare system outputs against predetermined correct answers, such as test questions with expected responses.
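As a rough sketch, a test set of questions with expected responses can be scored like this. The test cases and the `fake_app` function are invented for illustration; in practice the function would call the real LLM application.

```python
from typing import Callable

# Hypothetical test set: questions paired with predetermined correct answers.
TEST_CASES = [
    {"question": "2 + 2", "expected": "4"},
    {"question": "Capital of Japan", "expected": "Tokyo"},
    {"question": "Largest planet", "expected": "Jupiter"},
]


def fake_app(question: str) -> str:
    """Hypothetical stand-in for the application under test; one answer is deliberately wrong."""
    canned = {"2 + 2": "4", "Capital of Japan": "tokyo ", "Largest planet": "Saturn"}
    return canned.get(question, "")


def evaluate(app: Callable[[str], str], cases: list[dict]) -> float:
    """Return the fraction of cases where the app's answer matches the expected one."""
    correct = sum(
        app(case["question"]).strip().lower() == case["expected"].lower()
        for case in cases
    )
    return correct / len(cases)


accuracy = evaluate(fake_app, TEST_CASES)
print(f"accuracy={accuracy:.2f}")
```

Tracking this score across prompt or model changes shows whether a change helped or hurt on the cases you care about.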
How do I become an evaluator?
1. alignmentforum.org > a-starter-guide-for-evals
2. Everything you need to know, https://www.news.aakashg.com/p/ai-evals
3. Hamel Husain's videos on YouTube
4. Building LLM applications and writing evals
5. This is all I have been doing.
Thank you all for listening!
Questions? Collaborate? Work with us? Reach out!
@vipulgupta2048
Reviews cheesecakes, closes issues & runs Mixster to "right" the docs for startups
Feedback please + Link to the slides
By Vipul Gupta
To eval or not to eval - that's the question.