Your product already has an agent-facing surface. It is your README, docs, llms.txt, examples, package page, and maybe an MCP server. Developers ask agents how to build with your product, and those agents reach for that surface before they write code.

The hard part is that you do not control which path they use. One agent answers from memory. Another searches the web. Another reads your llms.txt. Another calls a docs MCP server. If those paths do not carry the same product facts, the agent can miss the point even when your docs look correct to you.

Pickled tests that surface with real questions.

Start with the question a user would ask

Do not start with every page you published. Start with one product question where a wrong answer hurts:

How do I install it?
Which package should I import?
How do I authenticate?
Which API should I use for this workflow?
What old API should I avoid?

Then write the answer contract in product terms. The contract is not a vibe. It is a small set of facts a correct answer must cover and misstatements it must reject.

Test the paths separately

product:
  name: my-product
  description: short one-liner

sources:
  llms: { url: https://my-product.dev/llms.txt }
  docs: { url: https://docs.my-product.dev/llms-full.txt }

agents:
  quick:
    provider: claude-code
    model: claude-haiku-4-5

contexts:
  memory: { mode: memory }
  given_llms: { mode: inject, source: llms }
  web_llms: { mode: web, source: llms }
  mcp_mintlify:
    mode: mcp
    servers:
      mintlify:
        url: https://docs.my-product.dev/mcp

facts:
  install_command:
    statement: my-product installs with bunx my-product.
    match:
      allOf: ["bunx my-product"]

misstatements:
  npm_install:
    statement: The answer recommends the old npm package.
    match:
      anyOf: ["npm install my-product-cli"]

questions:
  - id: install
    question: How do I install my-product?
    agents: [quick]
    contexts: [memory, given_llms, web_llms, mcp_mintlify]
    expects: [install_command]
    rejects: [npm_install]
    examples:
      pass:
        - "Install it with `bunx my-product`."
      fail:
        - "Run `npm install my-product-cli`."

thresholds:
  questions: 80

That one question runs four cells. memory tells you what the model thinks without context. given_llms tests whether the file itself contains the answer. web_llms does not inject the file; it gives the agent the canonical URL and requires a real web tool call. mcp_mintlify requires a real MCP call.

Each cell gets its own verdict, so the diagnosis is specific:

memory fails, given_llms passes: your public context carries the answer.
given_llms passes, web_llms fails: the content works, but the web context does not.
web_llms passes, mcp_mintlify fails: your MCP path is not returning the same answer.
Everything fails: the product fact is probably missing or unclear everywhere.

Keep the contract small

Good facts name the load-bearing claims: the package name, provider component, import path, command, config key, or known-wrong API. They do not try to grade style.

Use facts for the truths a correct answer must include. Use allOf when every substring matters, and anyOf when several terms are valid. Use misstatements for known-wrong claims that should force the cell to NO.

That is enough to tell whether an agent reached the product fact you care about. No model grades another model. The receipt shows which agent, source, context, fact coverage, misstatement result, and tool path produced the verdict.

Try it

Pick one question from your docs and run it across memory, an injected source, and one discovery context. The first useful result is not a perfect score. It is knowing which context is failing.

Does your public product context answer real questions?

Start with the question a user would ask

Test the paths separately

Keep the contract small

Try it