Good Bot, Bad Bot

Richard Lindsey     @Velveeta

What's the Goal?

Not just test coverage, but test coverage that provides the most value with the least amount of effort.

LLMs: The Good

  • They're really really fast.

  • IDE agents can take entire project context into account.

  • NLP is getting ridiculously good.

  • They can be taught!

LLMs: The Bad

  • They're really really dumb.

  • They can be overly thorough.

  • They do exactly what you tell them.

  • They can run you around in circles if you let them.

Initial Results

interface CheckInOutInformationProps {
  checkIn: ReactNode;
  checkOut: ReactNode;
}

it('accepts checkIn as a ReactNode', () => {});
it('accepts checkIn as a string', () => {});
it('accepts checkIn as a number', () => {});
it('accepts checkIn as a boolean', () => {});
it('accepts checkIn as null', () => {});
it('accepts checkIn as undefined', () => {});
it('accepts checkOut as a ReactNode', () => {});
it('accepts checkOut as a string', () => {});
it('accepts checkOut as a number', () => {});
it('accepts checkOut as a boolean', () => {});
it('accepts checkOut as null', () => {});
it('accepts checkOut as undefined', () => {});

Double Feedback Loop

  • Iterate on its instruction file.

  • Run the test output.

  • Find the best and worst examples.

  • Refine the instruction set with those examples.

Become a Mentor

  • You want a Junior Engineer, not a yes-man. AI agents are optimized to be agreeable by default.

  • Guardrails are how you teach an agreeable system to become a critical one.

Refine With Examples

#### Compound component pattern
- This repo often uses compound components where subcomponents are attached as properties:
  - RootComponent.ComponentA = ChildComponentA
- When testing these, describe blocks must use the attached path:
  - describe("RootComponent", () => {
      describe("RootComponent.ComponentA", () => { ... })
      describe("RootComponent.ComponentB", () => { ... })
    })

**Examples of FORBIDDEN patterns:**
```ts
// ❌ BAD: Structural CSS selectors - NEVER USE THESE
container.querySelectorAll("div > p")
container.querySelectorAll(".card .content > .item")
container.querySelector("div.foo span.bar")
```

#### Working with existing tests
- **ALWAYS check for existing test files** before generating new tests
- If a test file already exists:
  - Read and understand the existing test structure and patterns
  - Identify which functionality is already covered
  - Add ONLY the missing tests for new functionality
  - Maintain consistency with existing test style and naming
  - Do NOT duplicate existing tests

Be Conversational, It Works

> One other tweak to make, specifically to your instruction set, is that they're
  starting to get phrased very specifically to this particular component. E.g. you
  just added instructions that said "If both `checkIn` and `checkOut` accept `ReactNode`,
  test BOTH props with string and JSX values". This could, however, be any number
  of props that render ReactNode, so let's not lock down this list to specific
  examples, but rather, make it generic for any given prop. I don't want to risk
  you interpreting this in the future as these specific prop names needing these
  specific types of tests.
  
> We've introduced another brittle piece of testing infrastructure here. You're
  doing a querySelectorAll with a path of "div > p", and this is an internal detail
  that could potentially change and break this test. We don't care about the internal
  markup structure, but we do care that our amenities are listed as expected. Please
  update your libs/stays/CLAUDE.md and .claude/commands/test.md files to include
  verbiage that will prevent you from creating these kinds of brittle selector strings
  in the future, and determine the strongest and least likely to break method of
  asserting that these assertions can be resolved.

Summary

Invest in the instruction set, not the prompt. This should be considered a team-level resource to be updated with both good and bad examples as  additional application complexity yields shortcomings in test output.

Summary

Optimize with specific good and bad examples from actual test output, but whenever possible, abstract it to broader examples, from which it can infer for problems it hasn't yet  seen.

Summary

Don't try to automate test authoring, build an extensive set of rules that teach your agent how to exercise good test judgement. This is how you prevent a swath of useless tests, and start finding much higher value.

fn.

Richard Lindsey     @Velveeta

Good Bot, Bad Bot

By Richard Lindsey