@vipulgupta2048

You

Still you, but now you understand AI Evals

In the next

✨ 30 minutes ✨

I am Vipul!

@vipulgupta2048

We-pull

  • Product Owner & Documentation Lead at balena
  • Solopreneur, "Right" the Docs @ Mixster
  • Organizer, GitTogether Delhi
  • Works remotely from Noida, India (burning up atm)
  • Pronouns: He/him/his

@vipulgupta2048

How do you go about
testing software?

@vipulgupta2048

Testing Software

  • Reproducible
  • Deterministic
  • Traceable

@vipulgupta2048

Testing LLMs

  • Non-reproducible*
  • Non-deterministic* 
  • Non-traceable*

*LLMs provide probabilistic, context-sensitive outputs.

Hence, we go beyond testing. We do evaluations.
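To make the contrast concrete, here is a toy Python sketch. `ask_llm` is a hypothetical stand-in that simulates sampling, not a real API. A classic assert works for deterministic code, while an exact-match assertion on LLM output is flaky by design, so an eval checks a property of the output instead.

```python
import random

# Hypothetical stand-in for a chat-completion call. Real models are
# sampled, so repeated calls can return differently worded answers;
# random.choice() simulates that here.
def ask_llm(prompt: str) -> str:
    return random.choice([
        "Paris is the capital of France.",
        "The capital of France is Paris.",
    ])

# Classic software testing: same input, same output, every time.
assert sorted([3, 1, 2]) == [1, 2, 3]

# The naive LLM "test" is flaky, because exact string equality cannot
# cope with probabilistic, context-sensitive outputs:
#   assert ask_llm("Capital of France?") == "Paris is the capital of France."

# An eval checks the property we actually care about instead:
assert "paris" in ask_llm("Capital of France?").lower()
```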

You are already performing
Evals ... but badly.

@vipulgupta2048

LLM Evaluations

  • The process of validating and testing the outputs that your LLM application is producing.

  • Assessing whether the LLM application (not just the model) provides value on the tasks it was designed to solve.

  • Given an input prompt, an output is generated. We evaluate this output against a set of ideal answers to assess the quality of the LLM system. This is an eval.
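A minimal sketch of that loop, assuming a hypothetical `generate()` wrapper around your LLM application. Grading here is a simple substring check; in practice it could be exact match, a similarity score, or an LLM judge.

```python
# Tiny eval set: each case pairs an input prompt with an ideal answer.
eval_set = [
    {"prompt": "What is 2 + 2?", "ideal": "4"},
    {"prompt": "Capital of France?", "ideal": "Paris"},
]

def generate(prompt: str) -> str:
    # Hypothetical placeholder: swap in a real call to your LLM app.
    return "4" if "2 + 2" in prompt else "Paris, of course."

def run_evals(dataset) -> float:
    passed = 0
    for case in dataset:
        output = generate(case["prompt"])
        # Grade the output against the ideal answer.
        if case["ideal"].lower() in output.lower():
            passed += 1
    return passed / len(dataset)  # pass rate across the eval set

print(f"pass rate: {run_evals(eval_set):.0%}")  # pass rate: 100%
```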

@vipulgupta2048

Different from benchmarking

  • Public benchmarks test general capabilities, like coding and mathematics, through standardized assessments.

  • LLM evaluations focus on the performance of a specific application, including its prompts, application logic, and integrated components.

Why should I care?

It's just testing the LLM, isn't it?
Isn't it... 👀

Why should I care?

  • Background monitoring - running passively, without interrupting the core workflow, evals detect drift or degradation.

  • Guardrails - evals block harmful output, force a retry, or fall back to a safer alternative (see the sketch after this list).

  • Improving a pipeline - evals can label data for fine-tuning LLMs, select high-quality few-shot examples for prompts, or identify failure cases that motivate architectural changes.
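Here is a rough sketch of the guardrail pattern, assuming hypothetical `generate()` and `is_safe()` placeholders: the eval gates each response, forces a retry on failure, and falls back to a safe default when retries run out.

```python
FALLBACK = "Sorry, I can't help with that."

def generate(prompt: str) -> str:
    return "Here is a helpful answer."  # your LLM call goes here

def is_safe(text: str) -> bool:
    # In practice: a moderation endpoint, a classifier, or an LLM judge.
    return "forbidden" not in text.lower()

def guarded_generate(prompt: str, max_retries: int = 2) -> str:
    for _ in range(max_retries + 1):
        output = generate(prompt)
        if is_safe(output):   # guardrail eval passes: ship the output
            return output
    return FALLBACK           # every attempt failed: fall back safely

print(guarded_generate("Tell me something."))
```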

Types of Evaluations

  • Reference-based evals
    • Compare system outputs against predetermined correct answers: test questions with expected responses.

  • Reference-free evals
    • Assess output quality where no predetermined answer exists, or where multiple correct answers do: checking for safety, politeness, correctness, etc. Both types are sketched below.
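Two toy graders mirroring the two types, as a sketch only: the function names and the wordlist are illustrative assumptions, not a standard API.

```python
def reference_based(output: str, ideal: str) -> bool:
    # Compare the output against a predetermined correct answer.
    return ideal.lower() in output.lower()

def reference_free(output: str) -> bool:
    # No gold answer: judge a property of the output itself. A crude
    # wordlist stands in for a safety/politeness classifier or an
    # LLM-as-judge call.
    banned = {"idiot", "stupid"}
    return not any(word in output.lower() for word in banned)

print(reference_based("The capital is Paris.", "Paris"))  # True
print(reference_free("Happy to help!"))                   # True
```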

Finally, the Demo

How do I become an evaluator?

@vipulgupta2048

1. alignmentforum.org > a-starter-guide-for-evals

2. Everything you need to know, https://www.news.aakashg.com/p/ai-evals 

3. OpenAI Evals cookbook

4. Hamel Husain's videos on YouTube

5. Building LLM Applications and writing evals

6. This is all I have been doing.

 

Thank you all for listening!

And, that's about it!

Questions? Collaborate? Work with us? Reach out!

@vipulgupta2048

Reviews cheesecakes, closes issues & runs Mixster to "right" the docs for startups

Feedback please + Link to the slides

[Foss United Delhi] Getting Started to LLM Evals

By Vipul Gupta

To eval or not to eval - that's the question.