Don't Let the Agent Grade Itself: Verification Gates for Autonomous Claude Code

There is a specific moment that should make you suspicious of any autonomous coding setup. The agent finishes, prints something confident ("All changes complete and tests passing") and you believe it, because the sentence is well-formed and the work looks done. Then you run the build yourself and it fails on the first file.

The gap between looks done and is done is the entire problem. And the instinct most people reach for to close it (a better prompt, a sterner system message, "make sure you actually verify before saying you're finished") is the wrong tool, because it asks the model to be the judge of its own work. A model that has just spent a session convincing itself the task is complete is the least reliable possible auditor of whether the task is complete. Self-assessment is not a safety mechanism. It is the failure mode wearing the costume of one.

So the principle this whole post rests on is simple, and I'd put it on a wall:

An autonomous agent is only as trustworthy as the gate it cannot move. The verification has to be external to the model and deterministic: a real command that exits zero only when the work is genuinely done, not the model's opinion of itself.

Everything else is a consequence of taking that seriously.

Why "external and deterministic" are both load-bearing words

External means the judgment lives outside the agent's context. The agent does not get to decide whether it passed. A test suite decides. A type checker decides. A build decides. A script you wrote decides. The agent's job is to make those things pass; it has no vote on whether they did.

Deterministic means the same artifact produces the same verdict every time, with no model in the loop at the decision point. The temptation in the agent era is to "verify" by asking a second LLM whether the first one did a good job. Sometimes that's useful as a signal, but it is not a gate; it's another opinion, with its own variance and its own failure modes. A gate is npm test. A gate is tsc --noEmit. A gate is a script that greps the output for the thing that must be true and returns non-zero if it isn't. The verdict has to be boring and repeatable, because the point of a gate is that it is not negotiable.

Once you accept that, a surprising amount of structure falls out of it automatically.

The pipeline

Here is a build pipeline I use for multi-step work in Claude Code's headless mode. It runs N steps unattended. Each step gets a fresh agent session, does exactly one unit of work, and is then judged by a real command. Pass, and the progress is recorded; fail, and the failure output is fed back into a retry. Nothing advances on the model's say-so.

The verification-gate loop: a real command decides; the model never grades itself.

#!/usr/bin/env bash
# Autonomous N-step build pipeline for Claude Code (headless mode).
#
# For each step it:
#   1. starts a FRESH `claude -p` session (cold start, no leaked context)
#   2. tells Claude to read the state file for prior context, then do ONLY this step
#   3. runs a DETERMINISTIC verification command (the gate)
#   4. on failure, feeds the failure back and retries
#   5. on pass, records progress to the state file (cross-session memory)
#   6. moves on — the next iteration is a brand-new session
#
# Resumable: re-running picks up after the last verified step.
 
set -uo pipefail
 
# ---------------- Config ----------------
PROJECT_DIR="$(pwd)"
STATE_FILE="$PROJECT_DIR/PIPELINE_STATE.md"     # the cross-session "memory"
LOG_DIR="$PROJECT_DIR/.pipeline-logs"
PROMPT_DIR="$PROJECT_DIR/steps"                 # steps/step-01.md ... step-12.md
TOTAL_STEPS=12
MAX_RETRIES=3
MODEL="sonnet"                                  # cheap default; bump specific steps if needed
ALLOWED_TOOLS="Read,Edit,Write,Bash,Grep,Glob"
BUDGET_USD=20.00                                # hard kill-switch for total spend
 
# Per-step verification commands. Index = step number.
# CRITICAL: each MUST exit 0 ONLY when the step is genuinely complete.
# This is a real command (tests/build/typecheck/script) — NOT the model judging itself.
declare -A VERIFY
VERIFY[1]="npm test -- --testPathPattern=step1"
VERIFY[2]="npm run build && npm test -- --testPathPattern=step2"
VERIFY[3]="echo 'TODO: define a real gate for step 3'; false"   # fail-closed on purpose
# ... fill in all 12 ...
 
# ---------------- Setup ----------------
mkdir -p "$LOG_DIR"
if [ ! -f "$STATE_FILE" ]; then
  printf '# Pipeline State\n\nProgress log Claude reads at the start of every step.\n\n' > "$STATE_FILE"
fi
 
spent=0
add_cost ()    { spent=$(awk "BEGIN{print $spent + ${1:-0}}"); }
over_budget () { awk "BEGIN{exit !($spent >= $BUDGET_USD)}"; }
 
run_step () {
  local n="$1" attempt="$2" extra="$3"
  local prompt_file out
  prompt_file="$PROMPT_DIR/step-$(printf '%02d' "$n").md"
  out="$LOG_DIR/step-${n}-attempt-${attempt}.json"
 
  claude -p "First read $STATE_FILE for context from completed steps.
Then complete ONLY the step described below. Do NOT start any later step.
When done, leave the working tree in a state where the verification command passes.
 
--- STEP $n ---
$(cat "$prompt_file")
$extra" \
    --model "$MODEL" \
    --allowedTools "$ALLOWED_TOOLS" \
    --permission-mode bypassPermissions \
    --max-turns 30 \
    --output-format json \
    > "$out" 2>&1
 
  # Accumulate cost. Verify the field name against your CLI version:
  # `claude -p "hi" --output-format json | jq` and check what the result object calls it.
  add_cost "$(jq -r '.total_cost_usd // 0' "$out" 2>/dev/null || echo 0)"
}
 
# ---------------- Main loop ----------------
# Resume after the last completed step recorded in state.
# NOTE: `grep -c` prints 0 AND exits non-zero on no match, so do NOT append `|| echo 0`
# (that yields "0\n0" and breaks the arithmetic below). Capture, then default.
completed=$(grep -c '^- \[x\] Step' "$STATE_FILE" 2>/dev/null)
completed=${completed:-0}
START=$(( completed + 1 ))
 
for (( n=START; n<=TOTAL_STEPS; n++ )); do
  echo "=== Step $n ==="
  passed=0
  extra=""
 
  for (( attempt=1; attempt<=MAX_RETRIES; attempt++ )); do
    if over_budget; then
      echo "Budget limit ($BUDGET_USD USD) reached at \$$spent. Stopping."
      exit 2
    fi
    echo "  attempt $attempt (fresh session)  [spent so far: \$$spent]"
    run_step "$n" "$attempt" "$extra"
 
    # ---- the gate: external, deterministic verification ----
    vlog="$LOG_DIR/verify-${n}-attempt-${attempt}.log"
    if eval "${VERIFY[$n]}" > "$vlog" 2>&1; then
      passed=1
      break
    fi
    echo "  verification FAILED — feeding output back into next attempt"
    extra="
NOTE: a previous attempt FAILED verification. Verification output:
$(tail -n 40 "$vlog")
Fix the root cause so the verification command passes."
  done
 
  if [ "$passed" -ne 1 ]; then
    echo "Step $n failed after $MAX_RETRIES attempts. Stopping for human review."
    echo "  See logs in $LOG_DIR/"
    exit 1
  fi
 
  # ---- update cross-session memory, then the next iteration is a fresh session ----
  printf -- '- [x] Step %s — verified %s (cumulative cost: $%s)\n' \
    "$n" "$(date -u +%FT%TZ)" "$spent" >> "$STATE_FILE"
  echo "  Step $n verified and recorded. Resetting session."
done
 
echo "All $TOTAL_STEPS steps complete and verified. Total cost: \$$spent"

It's maybe sixty lines of logic. The interesting part isn't the bash. It's that every design choice is downstream of the one principle.

Reading the structure as consequences, not features

The gate is a real command, indexed per step. VERIFY[2]="npm run build && npm test ...". This is the whole point made concrete: the model never reports success. A command does, or doesn't. If you find yourself writing a verify step that calls another model to "check the work," stop: you've put an opinion where a gate belongs.

Each step gets a fresh session. A new claude -p invocation per step, with no carried context. This is deliberate and it's the part people resist, because resuming a session feels more efficient. But a long-lived session accumulates its own narrative, including the growing conviction that things are going well, and that narrative is exactly the bias you don't want anywhere near a completion check. A cold session can't "remember" that it already believes it succeeded. It walks in, reads the durable record, does the work, and is judged. (Claude Code can resume sessions with --resume; here we deliberately don't.)

The state file is the only memory. Because sessions are disposable, continuity has to live somewhere durable and external. PIPELINE_STATE.md is that somewhere: a plain, append-only log the next session reads on cold start. The model's working memory is allowed to be ephemeral; the record of verified progress is not, and notably it only ever gets a checkmark after the gate passes. The memory contains facts, not hopes.

Retries feed back failure, not vibes. When a gate fails, the next attempt receives the actual tail of the verification log (the real compiler error, the real failing assertion) appended to its prompt. This is the "modify it until it works" loop, but grounded in the deterministic verdict rather than a vague "try again." The agent fixes a real, named failure.

It's resumable, and it fails closed. Re-running counts the verified steps in the state file and picks up after the last one. And two of the defaults lean toward safety on purpose: MAX_RETRIES caps thrashing so a step that can't be made to pass stops the line for a human instead of looping forever, and the placeholder gate for an undefined step is false: an unconfigured gate blocks, it doesn't wave the step through. Fail-closed is the right default for anything that runs without you watching.

The failure modes that actually matter

If I published this without the next section I'd be doing the opposite of what the post argues for. A pipeline that runs unattended has a real blast radius, and pretending otherwise would be exactly the looks done dishonesty I started with.

This runs with permissions bypassed. Sandbox it. The script uses --permission-mode bypassPermissions with Bash in the tool list, in a loop, with no human to approve anything. That is an agent executing arbitrary shell commands with the guardrails off. Run it in a disposable environment (a container or throwaway VM) scoped to the project directory, never against your main working tree or anything holding credentials. If you can't put it in a sandbox, you're not ready to run it unattended.

Bypassing permissions also weakens your allowlist. This is subtle and worth internalizing: ALLOWED_TOOLS looks like a containment boundary, but in a full permission-bypass mode the allowlist generally stops being a hard limit; the prompts it would have gated are exactly the ones being skipped. If you want the tool restriction to actually constrain blast radius, prefer a stricter permission mode (so the allowlist and --disallowedTools are enforced) and reserve full bypass for a genuinely sandboxed run. Check the exact semantics against your installed version rather than trusting the variable name to mean what it looks like it means.

The budget kill-switch is a floor, not a ceiling. over_budget is checked before each attempt, so a single expensive step can overshoot BUDGET_USD before the next check catches it. It will stop the pipeline soon after you cross the line; it will not stop you from crossing it mid-step. Treat it as a coarse circuit breaker, set it well below the number that would actually hurt, and add --max-turns (as above) so no single step can run away on its own.

Your own glue code needs the same skepticism as the agent. The resume logic in this script originally read completed=$(grep -c '...' "$STATE_FILE" || echo 0). That has a quietly nasty bug: grep -c prints 0 and exits non-zero when there are no matches, so the || echo 0 also fires, the variable becomes the two-line string "0\n0", and the arithmetic on the next line breaks. I fixed it above by capturing the count and defaulting separately. I'm leaving the story in because it's the most honest illustration of the whole thesis: that is precisely the kind of silent error a human skimming green-looking output would sail right past, which is the entire reason you want an external gate deciding what "done" means, not a person's pattern-matching and not the model's.

When this is worth it, and when it isn't

This machinery earns its keep when the work is genuinely multi-step, each step has a checkable definition of done, and the run is long enough that babysitting it is the bottleneck: scaffolding a service across a dozen verifiable stages, large mechanical migrations, anything you'd otherwise queue overnight.

It is overkill when the task is a single edit, when "done" is subjective enough that no honest deterministic gate exists (visual polish, prose, judgment calls), or when the blast radius of an automated mistake is larger than the time the automation saves. If you can't write the gate, that's not a reason to let the model self-certify. It's a sign the task wants a human in the loop, and you should keep one there.

The unlock isn't autonomy for its own sake. It's that once a real gate decides what counts as done, you can let the agent attempt far more ambitious work without extending it any trust it hasn't earned one verified step at a time. The gate is what makes the autonomy safe to grant.

Don't Let the Agent Grade Itself: Verification Gates for Autonomous Claude Code

Why "external and deterministic" are both load-bearing words

The pipeline

Reading the structure as consequences, not features

The failure modes that actually matter

When this is worth it, and when it isn't

Loop Engineering, Defined

Seeing Isn't Measuring: Fixing the Design-to-Code Plateau

Claude Tag: Anthropic Puts an Agent in the Channel

Why "external and deterministic" are both load-bearing words

The pipeline

Reading the structure as consequences, not features

The failure modes that actually matter

When this is worth it, and when it isn't

Related reading

Loop Engineering, Defined

Seeing Isn't Measuring: Fixing the Design-to-Code Plateau

Claude Tag: Anthropic Puts an Agent in the Channel