Questions tell you whether an agent can say the right thing. Builds tell you whether it can do the work.

Use a question for product facts: install commands, import paths, auth rules, deprecated APIs. Use a build when the real question is whether the agent can edit a project and pass your verifier.

builds:
  - id: add-toolbar
    goal: Add a toolbar using my-product.
    agents: [builder]
    contexts: [given_docs]
    trials: 3
    workspace:
      path: ./fixtures/react-app
      setup: [bun install --frozen-lockfile]
    verifier:
      failToPass:
        - { name: toolbar behavior, run: bun test tests/toolbar.test.ts }
      passToPass:
        - { run: bun run typecheck }

Each (agent × context) cell runs the configured number of trials, and every trial starts in a fresh workspace. The result is a rate, such as Built 3/3, Partially built 2/3, or Did not build 0/3. No model judges the code. Your verifier does.
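The cell arithmetic and the verdict labels can be sketched in a few lines of Python. This is only an illustration: plan_cells and verdict are hypothetical helpers, not part of pickled, and treating every mixed result (including 1/3) as "Partially built" is an assumption drawn from the three examples above.

```python
from itertools import product


def plan_cells(agents, contexts, trials):
    """Return every (agent, context) cell and the total number of trial runs."""
    cells = list(product(agents, contexts))
    return cells, len(cells) * trials


def verdict(passed, trials):
    """Map a pass count to one of the three labels used in the text."""
    if trials <= 0 or not 0 <= passed <= trials:
        raise ValueError("passed must be between 0 and trials, and trials must be positive")
    if passed == trials:
        return "Built"
    if passed == 0:
        return "Did not build"
    # Assumption: any mixed result counts as a partial build.
    return "Partially built"


if __name__ == "__main__":
    # Values taken from the add-toolbar example: one agent, one context, three trials.
    cells, runs = plan_cells(["builder"], ["given_docs"], trials=3)
    print(cells, runs)
    for passed in (3, 2, 0):
        print(f"{verdict(passed, 3)} {passed}/3")
```

The run count grows multiplicatively, so adding a second agent or context doubles the number of paid trials.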

Keep build fixtures small: one behavior, one workspace, one fast verifier, and enough trials to see variance.

One caveat matters when you compare contexts. In build mode, the workspace may contain files the agent can read. If the task needs a local llms.txt, the agent can still open that file, so the memory context means "not injected into the prompt," not "unavailable."
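One way to make this caveat visible is to list the doc-like files in a fixture before comparing contexts. The sketch below is a standalone helper, not a pickled feature; the file names in DOC_LIKE_NAMES and the skipped directories are assumptions you would adjust for your own project.

```python
from pathlib import Path

# Hypothetical set of file names an agent could treat as documentation.
DOC_LIKE_NAMES = {"llms.txt", "llms-full.txt", "readme.md"}


def readable_docs(workspace, skip_dirs=("node_modules", ".git")):
    """Return doc-like files inside the workspace, as sorted relative POSIX paths."""
    root = Path(workspace)
    if not root.is_dir():
        return []
    found = []
    for path in root.rglob("*"):
        relative = path.relative_to(root)
        # Ignore dependency and VCS directories, which are rarely relevant here.
        if any(part in skip_dirs for part in relative.parts):
            continue
        if path.is_file() and path.name.lower() in DOC_LIKE_NAMES:
            found.append(relative.as_posix())
    return sorted(found)


if __name__ == "__main__":
    # Path taken from the add-toolbar example above.
    print(readable_docs("./fixtures/react-app"))
```

A non-empty result does not mean the comparison is invalid. It means the contexts differ in what is injected into the prompt, not in what the agent can reach.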

Run build tasks deliberately. A practical CI shape is: pickled test and pickled check --plan on every PR, capped pickled check on trusted PRs, and capped pickled build on a schedule or release branch.

When you run builds in CI, save the receipt and render it:

pickled build . --verify-only
pickled build . --max-cells 6 --output pickled-builds.json
pickled report pickled-builds.json --format markdown >> "$GITHUB_STEP_SUMMARY"

--verify-only proves the fixture and reference patch before the paid run. The markdown summary shows the build verdict, changed files, failed verifier commands, and reference-solution proof without printing diffs or command output. Upload the JSON receipt as the artifact; add --verbose only for forensic detail.
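The text does not spell out what the fixture check involves, so the following is a sketch of the usual reading of the failToPass and passToPass names, not a description of pickled's implementation. It assumes failToPass commands must fail on the untouched fixture and pass once the reference patch is applied, and that passToPass commands must pass in both states. The function name and the dictionaries of results are hypothetical.

```python
def fixture_problems(before, after, fail_to_pass, pass_to_pass):
    """Check verifier results recorded before and after the reference patch.

    before / after map each command to True when it exited successfully.
    Returns a list of human-readable problems; an empty list means the
    fixture behaves as the verifier names suggest (assumed semantics).
    """
    problems = []
    for cmd in fail_to_pass:
        if before[cmd]:
            problems.append(f"{cmd}: already passes on the untouched fixture")
        if not after[cmd]:
            problems.append(f"{cmd}: still fails after the reference patch")
    for cmd in pass_to_pass:
        if not before[cmd]:
            problems.append(f"{cmd}: fails on the untouched fixture")
        if not after[cmd]:
            problems.append(f"{cmd}: broken by the reference patch")
    return problems


if __name__ == "__main__":
    # Commands taken from the add-toolbar example; the results are made up.
    toolbar = "bun test tests/toolbar.test.ts"
    typecheck = "bun run typecheck"
    before = {toolbar: False, typecheck: True}
    after = {toolbar: True, typecheck: True}
    print(fixture_problems(before, after, [toolbar], [typecheck]))
```

Under these assumptions, a fixture whose failToPass test already passes cannot tell a working build from an untouched workspace, which is why it is worth catching before the paid run.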

Use build tasks when answers are not enough