โ† Blog

June 1, 2026 ยท Caio Pizzol

Stop letting one model grade another

The default way to score an agent's answer is to ask another model whether it was good. LLM-as-judge. It is fast to set up and it feels reasonable: the judge reads the answer, the judge gives a number.

For testing whether an agent actually understands your product, it is the wrong tool. Not because judge models are useless, but because the thing you are measuring is not an opinion.

A judge brings its own priors

Ask "does this agent know how my product works" and grade it with a model, and you have added a second guesser to the loop. Unless you ground the judge in your sources too, it scores from the same training-data priors the agent under test is drawing on, and even grounded it stays an interpretive path: a model deciding whether another model's answer was good enough. When judge and agent lean wrong in the same direction, the judge happily approves a confident, incorrect answer. You have measured agreement between two models, not correctness against your product.

It is also not reproducible. Run the judge twice and the number moves. Change the judge model and it moves again. You cannot gate a release on a score that drifts, and you cannot debug a regression you cannot reproduce.

Legibility is a contract, not a vibe

Whether an agent understands your product is a grounded, factual property. Did it use the context mode you asked it to use? Did it avoid a known-wrong answer? Did it say the thing that has to be in a correct answer? These are checkable by code.

That is the whole bet. Three deterministic signals, no model in the scoring path:

product:
  name: my-product
  description: short one-liner

sources:
  readme: ./README.md

agents:
  quick:
    provider: claude-code
    model: claude-haiku-4-5

contexts:
  given_readme: { mode: inject, source: readme }

facts:
  install_command:
    statement: my-product installs with bunx my-product.
    match:
      allOf: ["bunx my-product"]

misstatements:
  npm_install:
    statement: The answer recommends the old npm package.
    match:
      anyOf: ["npm install my-product-cli"]

questions:
  - id: install
    question: How do I install my-product?
    agents: [quick]
    contexts: [given_readme]
    expects: [install_command]
    rejects: [npm_install]
    examples:
      pass:
        - "Install it with `bunx my-product`."
      fail:
        - "Run `npm install my-product-cli`."

facts check the answer: a declared product truth must be covered. misstatements check for known-wrong text that must not appear. Tool-use provenance checks that a web or mcp context actually invoked its tool. None of these is a model. Given the same response, the score does not move. If a rerun comes back different, it is because the agent changed its answer, not because a judge reinterpreted the same one. The score is the score.

What you give up, and why it is fine

A deterministic contract cannot grade open-ended quality. It will not tell you whether an answer was elegant, or well-toned, or pleasantly phrased. That is real, and it is the honest cost.

It is also beside the point here. Testing product legibility is not a writing competition. The question is narrow and checkable: did the agent use the right context mode, say the right content, and avoid the known-wrong answer. A contract answers that exactly. A judge answers it approximately, expensively, and differently each time.

There are tasks where a judge model earns its place: summarization quality, creative range, nuanced tone. Product legibility is not one of them. For "does the agent get my product right," you want a receipt, not a second opinion.

The payoff is trust

A deterministic score is reproducible, so it can gate CI. It avoids an extra judge call, so the cost is the agent run itself. And when it fails, it points at a specific contract that broke, not a number a model felt like emitting. That is what makes it something you can build a process on.

Try it

Write one question with facts and misstatements. Score the same response twice and nothing moves; the only thing that can change a rerun is the agent's answer, not the scoring.