Questions tell you whether an agent can say the right thing. Builds tell you whether it can do the work.
Use a question for product facts: install commands, import paths, auth rules, deprecated APIs. Use a build when the real question is whether the agent can edit a project and pass your verifier.
builds:
  - id: add-toolbar
    goal: Add a toolbar using my-product.
    agents: [builder]
    contexts: [given_docs]
    trials: 3
    workspace:
      path: ./fixtures/react-app
      setup: [bun install --frozen-lockfile]
    verifier:
      failToPass:
        - { name: toolbar behavior, run: bun test tests/toolbar.test.ts }
      passToPass:
        - { run: bun run typecheck }
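One way to read the verifier block is through the common fail-to-pass / pass-to-pass convention: a failToPass check should fail on the untouched fixture and pass after the agent's edits, while a passToPass check should pass both times. The sketch below encodes that rule as a pure TypeScript function. It is an assumption about the semantics, not pickled's actual implementation, and CheckOutcome and trialPasses are hypothetical names used only for illustration.

```typescript
// Hypothetical shapes for illustration; not pickled's real types.
type CheckKind = "failToPass" | "passToPass";

interface CheckOutcome {
  kind: CheckKind;
  run: string;        // the verifier command, e.g. "bun run typecheck"
  exitBefore: number; // exit code on the untouched fixture
  exitAfter: number;  // exit code after the agent's edits
}

// Assumed rule: every check must exit 0 after the edits. failToPass checks
// must also have failed beforehand; passToPass checks must have passed.
function trialPasses(outcomes: CheckOutcome[]): boolean {
  if (outcomes.length === 0) return false; // nothing verified, nothing proven
  return outcomes.every((o) => {
    const passedBefore = o.exitBefore === 0;
    const passedAfter = o.exitAfter === 0;
    return o.kind === "failToPass"
      ? !passedBefore && passedAfter
      : passedBefore && passedAfter;
  });
}

const toolbarTrial: CheckOutcome[] = [
  { kind: "failToPass", run: "bun test tests/toolbar.test.ts", exitBefore: 1, exitAfter: 0 },
  { kind: "passToPass", run: "bun run typecheck", exitBefore: 0, exitAfter: 0 },
];

console.log(trialPasses(toolbarTrial)); // true
```

Under this reading, a failToPass test that already passes on the fixture is a fixture bug, not an agent success, which is why the "before" exit code is part of the rule.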
Each (agent × context) cell runs trials times in a fresh workspace. The result is a rate: Built 3/3, Partially built 2/3, or Did not build 0/3. No model judges the code. Your verifier does.
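The rate-to-label mapping can be sketched in a few lines. This assumes that any pass count strictly between zero and the trial count reads as "Partially built"; the text gives only 3/3, 2/3, and 0/3 as examples. verdictFor and formatRate are hypothetical names, not part of pickled.

```typescript
type BuildVerdict = "Built" | "Partially built" | "Did not build";

// Map a pass count to a label. Assumption: anything between 0 and all
// trials counts as "Partially built".
function verdictFor(passed: number, trials: number): BuildVerdict {
  const isWhole = (n: number): boolean => Math.floor(n) === n;
  if (!isWhole(passed) || !isWhole(trials) || trials < 1 || passed < 0 || passed > trials) {
    throw new RangeError("invalid rate " + passed + "/" + trials);
  }
  if (passed === trials) return "Built";
  if (passed === 0) return "Did not build";
  return "Partially built";
}

// Render the label together with the raw rate.
function formatRate(passed: number, trials: number): string {
  return verdictFor(passed, trials) + " " + passed + "/" + trials;
}

console.log(formatRate(3, 3)); // Built 3/3
console.log(formatRate(2, 3)); // Partially built 2/3
console.log(formatRate(0, 3)); // Did not build 0/3
```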
Keep build fixtures small:
- one behavior,
- one workspace,
- one fast verifier,
- enough trials to see variance.
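To see why small fixtures and a modest trial count go together, it can help to count workspace runs before launching anything. The sketch below assumes runs scale as agents × contexts × trials, following the cell description above. planCells and BuildCell are hypothetical names, and the second agent in the usage lines is made up for illustration.

```typescript
interface BuildCell {
  agent: string;
  context: string;
}

// Enumerate every (agent × context) pairing in a stable order.
function planCells(agents: string[], contexts: string[]): BuildCell[] {
  const cells: BuildCell[] = [];
  for (const agent of agents) {
    for (const context of contexts) {
      cells.push({ agent: agent, context: context });
    }
  }
  return cells;
}

// Each cell runs once per trial, so runs grow multiplicatively.
function totalRuns(cells: BuildCell[], trials: number): number {
  return cells.length * trials;
}

// The example build: one agent, one context, three trials.
const singleCell = planCells(["builder"], ["given_docs"]);
console.log(totalRuns(singleCell, 3)); // 3

// Hypothetical second agent ("reviewer") to show the growth.
const grid = planCells(["builder", "reviewer"], ["given_docs", "memory"]);
console.log(grid.length, totalRuns(grid, 3)); // 4 12
```

Doubling either axis doubles the number of fresh workspaces, setup commands, and verifier runs, so a fast verifier matters more as the grid grows.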
One caveat matters when you compare contexts. In build mode, the workspace may contain files the agent can read. If the task needs a local llms.txt, then the memory context means "not injected into the prompt," not "unavailable."
Run build tasks deliberately. A practical CI shape is: pickled test and pickled check --plan on every PR, capped pickled check on trusted PRs, and capped pickled build on a schedule or release branch.
When you run builds in CI, save the receipt and render it:
pickled build . --verify-only
pickled build . --max-cells 6 --output pickled-builds.json
pickled report pickled-builds.json --format markdown >> "$GITHUB_STEP_SUMMARY"
--verify-only proves the fixture and reference patch before the paid run. The markdown summary shows the build verdict, changed files, failed verifier commands, and reference-solution proof without printing diffs or command output. Upload the JSON receipt as the artifact; add --verbose only for forensic detail.