Your code has a gate. A test breaks, the build goes red, the PR does not merge. That gate is why you trust your codebase to change without rotting.
The surface your users' agents read has no such gate. Your README, llms.txt, docs, examples, and MCP server change every week, and nothing checks whether an agent still answers correctly afterward. The first signal that one of them has drifted is a user watching an agent confidently get your product wrong.
That is a gap you can close with the CI you already run.
Make agent understanding a check, not a hope
The move is the same one you made for code: run the check on every change, and fail the build when it regresses. When the eval scores answers against declared contracts (facts, misstatements, tool-use provenance) with no model judging another, the scoring path is deterministic, so a failure points to a contract that broke, not a judge that reinterpreted the answer. A flaky score makes a useless gate. A contract makes a real one.
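To make "deterministic" concrete, here is a minimal sketch of contract-style scoring in shell. It is not pickled's actual implementation: the function name `score_answer`, the substring-matching rule, and the sample strings in the usage note are all assumptions chosen for illustration. The point is only that the same answer always gets the same verdict.

```shell
# Hypothetical sketch: score one answer against a fact/misstatement
# contract with plain substring matching. Not pickled's real scorer.
#
# score_answer ANSWER REQUIRED_FACT FORBIDDEN_MISSTATEMENT
# Prints "pass" or "fail: <reason>"; returns 0 on pass, 1 on fail.
score_answer() {
  local answer="$1"
  local fact="$2"
  local misstatement="$3"

  # The required fact must appear (case-insensitive, fixed string).
  if ! printf '%s\n' "$answer" | grep -qiF -- "$fact"; then
    echo "fail: missing fact '$fact'"
    return 1
  fi

  # The known misstatement must not appear.
  if printf '%s\n' "$answer" | grep -qiF -- "$misstatement"; then
    echo "fail: contains misstatement '$misstatement'"
    return 1
  fi

  echo "pass"
}
```

Calling `score_answer "Install with bun add example-sdk" "bun add example-sdk" "npm install"` prints `pass`, and it prints the same thing on every run. No second model is involved, so there is nothing to reinterpret the answer.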
Two tiers keep it fast and cheap, token-free steps first and a capped agent run second:
name: pickled
on:
  pull_request:
jobs:
  check:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v6
      - uses: oven-sh/setup-bun@v2
      - run: bun install --frozen-lockfile
      - run: bunx @pickled-dev/cli test .
      - run: bunx @pickled-dev/cli check . --plan
      - name: Run pickled
        id: pickled
        run: |
          set +e
          bunx @pickled-dev/cli check . --max-cells 20 --output pickled-report.json
          echo "exit_code=$?" >> "$GITHUB_OUTPUT"
          exit 0
        env:
          ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
      - run: |
          if [ -f pickled-report.json ]; then
            bunx @pickled-dev/cli report pickled-report.json --format markdown >> "$GITHUB_STEP_SUMMARY"
          fi
        if: always()
      - uses: actions/upload-artifact@v4
        if: always()
        with:
          name: pickled-report
          path: pickled-report.json
      - run: exit "${{ steps.pickled.outputs.exit_code }}"
pickled test makes no model calls. It scores your declared example answers and catches brittle fact or misstatement contracts before you spend a single token. pickled check --plan prints the cells that would run, also without calling a model. Only then does the capped check run your questions against a real agent. Set thresholds.questions in pickled.yml and the run passes or fails by comparing the overall score against that threshold.
The --output file is the receipt. pickled report renders it into the GitHub job summary without rerunning the agent. Default JSON is safe for CI artifacts; use --verbose only when you need full forensic detail.
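The workflow defers the check's failure so the report can still be rendered and uploaded, then re-raises the saved exit code in the last step. The same pattern as a standalone shell sketch follows. `fake_check` and `gate` are hypothetical names, and `fake_check` only stands in for the real check command so the sketch runs without a model call.

```shell
# Hypothetical stand-in for the real check command: writes a report
# file ($1) and exits with the status given as its second argument.
fake_check() {
  printf '{"exit":%s}\n' "$2" > "$1"
  return "$2"
}

# Run the check, capture its status without aborting the script,
# do the reporting work, then re-raise the original status.
gate() {
  report="$1"
  status=0
  fake_check "$report" "$2" || status=$?

  # Reporting happens whether or not the check passed.
  if [ -f "$report" ]; then
    echo "report written: $report"
  fi

  return "$status"
}
```

The design choice is the ordering: the failure is recorded first and acted on last, so a red build still leaves a readable summary and an artifact behind.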
One caveat if your repo is public: GitHub does not pass secrets to workflows triggered from fork pull requests, so a check job that needs ANTHROPIC_API_KEY will not run on those. Keep the token-free test and plan steps on every PR, and run the full check on push to branches you control or on trusted internal PRs.
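One way to keep a single workflow for both cases is to guard the token-spending step on whether the key actually arrived. A minimal sketch, assuming the key reaches the step as an empty environment variable when the secret is withheld. The helper name `should_run_check` is hypothetical.

```shell
# Hypothetical helper: decide whether the token-spending check
# should run. Prints "run" when the key is present and non-empty,
# "skip" otherwise (for example on a fork pull request).
should_run_check() {
  if [ -n "${ANTHROPIC_API_KEY:-}" ]; then
    echo "run"
  else
    echo "skip"
  fi
}
```

A skipped check is not a passed check, so this guard only makes sense alongside a full run on push to branches you control.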
That is the whole gate. A PR that weakens your docs or breaks an answer your llms.txt is supposed to give now goes red, in the PR, before it ships.
What changes when it is a gate
Stale prompt surface is product debt, and debt you cannot see is the dangerous kind. A CI gate turns an invisible, slow rot into a loud, immediate failure attached to the exact change that caused it. You stop finding out from users.
It also changes how the team treats those files. An llms.txt or docs page with a test is no longer a document someone updates when they remember. It is a contract with a consequence. People keep it current because the build makes them.
And because the example tests cost no tokens and the check run is bounded by --max-cells or --sample, you can afford to run it on every pull request, not as an occasional manual sweep. The point of a gate is that nothing gets past it.
Try it
Add the workflow above, set thresholds.questions, and open a PR that intentionally breaks one answer. Watch it go red.