Implement pending tasks using Codex or Gemini, committing after each task. Use when ready to execute planned work...
Implement pending tasks one-by-one, committing after each completion.
STOP. Before dispatching ANY Agent/Codex/Gemini call, verify you are sending it EXACTLY ONE task. PostToolUse hooks do not fire inside subagents — batching tasks into one Agent call makes pidash show stale progress for the entire duration.
The loop runs in YOUR session (the main session), not inside a subagent:
for each pending task:
1. TaskUpdate(in_progress) → sync state file
2. Agent A writes tests (from requirements only)
3. test quality gate (main session)
4. Agent C tries to break tests (adversarial validation)
5. commit tests
6. Agent B implements against failing tests
7. verify THIS task's tests pass (retry Agent B if needed)
8. commit implementation
9. TaskUpdate(completed) → sync state file
after all tasks complete:
10. run full verification suite ONCE (see step 7)
Per-task verification runs only the tests Agent A wrote in step 2.7, not the full project suite. The full suite (workspace tests, smoke, integration, lint) runs once at the end. This is deliberate: per-task full-suite runs compound to 40+ minutes of redundant test time across a 20-task phase.
If you find yourself writing an Agent prompt that mentions multiple tasks, STOP — you are about to violate this rule.
See Subagent Dispatch Budget below — every Agent dispatch must satisfy it.
Every prompt passed to the Agent tool (Agent A test author, Agent B implementor, Agent C adversary, or code reviewer) must be ≤ 50K (50,000) bytes.
PostToolUse hooks do not fire inside subagents (see "CRITICAL: One Task at a Time" above), so the runtime context cap from PRD 00024 cannot abort a subagent that grows past 200K. The bound must be enforced at dispatch time, before the Agent call.
Procedure before every Agent dispatch:

1. Measure the prompt in bytes: `len(prompt.encode("utf-8"))` (Python) or `printf '%s' "$prompt" | wc -c` (shell).
2. If the prompt exceeds the 50K cap, do not dispatch; abort the task instead:
   - `/run-autopilot` Phase 0 of the next session replans the PRD in place (the PRD stays in `wip/`; see Phase 0 step 1's replan procedure — parallel to `context_overrun`).
   - Append to `state.task_aborts[]`: `{"task_id": "<id>", "turn": -1, "total_input_tokens": <prompt-bytes/4>, "cause": "subagent_prompt_overrun"}`
   - Set `state.stall_reason`: `{"stalled": "subagent_prompt_overrun", "task": "<id>", "prompt_bytes": <prompt-bytes>}`
   - If `$_AUTOPILOT_LOOP` is set (per /run-autopilot "Loop Detection" — manual sessions have no shell wrapper to restart on SIGINT), write `task_aborted` to the autopilot signal file. Use walk-up discipline to find the autopilot dir from cwd, then write to `<autopilot_dir>/signal`. Skip the signal write when `$_AUTOPILOT_LOOP` is unset; the next manual /run-autopilot invocation will resume via `state.stall_reason`.
   - Append an attempt-log entry (see "Attempt logging" below): `outcome: "aborted"`, `cause: "subagent_prompt_overrun"`, model from `task.metadata.model`, `review_cycle: null` (Phase 3) or current `state.cycle` (rework).
   - Report `subagent_prompt_overrun` and stop work on this task.
3. Prepend this abort instruction to the subagent prompt: "Abort and report if you read more than 100K of total input. Return the partial result and an `abort_reason: context_overrun` field."
Rationale: soft enforcement — the subagent honors the instruction — but /plan-tasks's 150K per-task budget bounds how much context /work can plausibly hand off anyway. Combined, the 50K dispatch cap, the 100K subagent-internal cap, and the 150K per-task cap keep subagent contexts well under Sonnet 4.6's 200K standard-tier ceiling.
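A minimal Python sketch of this dispatch-time check, assuming the autopilot dir has already been located via walk-up discipline; the function name and the simplified (non-atomic) state write are illustrative only:

```python
import json
from pathlib import Path

PROMPT_CAP_BYTES = 50_000  # the Subagent Dispatch Budget

def check_dispatch_budget(prompt: str, task_id: str, autopilot_dir: Path) -> bool:
    """Return True if dispatch may proceed; on overrun, record the abort and return False."""
    size = len(prompt.encode("utf-8"))
    if size <= PROMPT_CAP_BYTES:
        return True
    state_path = autopilot_dir / "state.json"
    state = json.loads(state_path.read_text())
    state.setdefault("task_aborts", []).append({
        "task_id": task_id,
        "turn": -1,
        "total_input_tokens": size // 4,  # rough bytes-to-tokens estimate
        "cause": "subagent_prompt_overrun",
    })
    state["stall_reason"] = {
        "stalled": "subagent_prompt_overrun",
        "task": task_id,
        "prompt_bytes": size,
    }
    state_path.write_text(json.dumps(state, indent=2))
    # Remaining steps (signal write, attempt-log entry) follow the procedure above.
    return False
```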
Before any Agent call for a task, read task.metadata.model (or equivalently state.tasks[i].model — /run-autopilot keeps the two in sync) and pass it as the Agent tool's model parameter.
Applies to every Agent call this skill dispatches, including follow-up dispatches inside compound steps: Agent A (test author), Agent B (implementor), Agent C (adversary), and the code reviewer. That list is illustrative, not exhaustive — when the prose says "every Agent call", it means every one.
If you add a new Agent call to this skill, pass model from task.metadata.model — no exceptions.
Accepted values: "haiku", "sonnet", "opus".
Legacy plans (created before PRD 00025) have no metadata.model. Omit the model parameter — subagents inherit the session model. This preserves the pre-PRD-00025 behavior bit-for-bit.
The Subagent Dispatch Budget (50K bytes, 100K subagent-internal cap) applies regardless of tier. Haiku doesn't earn a smaller cap; opus doesn't earn a larger one.
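A sketch of that resolution rule, assuming the TaskGet result is available as a dict (the helper name is hypothetical):

```python
def resolve_model(task: dict) -> str | None:
    """Return the tier to pass as the Agent tool's model parameter, or None to omit it."""
    model = (task.get("metadata") or {}).get("model")
    if model in ("haiku", "sonnet", "opus"):
        return model
    return None  # legacy plan: omit the parameter; the subagent inherits the session model
```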
At every task exit — success in step 6, abort in step 4 (timeout / context exceeded / error after debug) or via the Subagent Dispatch Budget overrun path — append one entry to state.tasks[i].attempts[]:
{
"attempt": <len(existing attempts) + 1>,
"model": "<tier>",
"outcome": "completed" | "aborted",
"review_cycle": <int | null>,
"cause": "<string | null>"
}
Field semantics:
- `attempt`: 1-indexed; `len(existing) + 1`.
- `model`: the tier /work used for this pass. Read from `task.metadata.model` when set. On legacy plans where `metadata.model` is absent, record the effective session-inherited tier as a string (e.g. `"sonnet"` for a Work-phase Sonnet 4.6 launch). Always a string — never null.
- `outcome`: /work writes only `"completed"` or `"aborted"`. Later phases upgrade earlier entries — /run-autopilot Phase 6 (Rework) step 2 rewrites a `"completed"` entry's outcome to `"review_flagged"` at the start of escalation, then later rewrites a flagged entry's outcome to `"rework_failed"` when the rework pass also fails at the top of the chain.
- `review_cycle`: null on a first/Phase-3 attempt; set to the current `state.cycle` integer when the pass is a rework re-dispatch (see step 1.5 "Rework-mode task filter (PRD 00025)").
- `cause`: null on success; on abort, the reason — `"context_overrun"`, `"subagent_prompt_overrun"`, `"timeout"`, or `"error"`.

Write procedure: read state.json, append to `tasks[i].attempts[]` (create the array if absent), write back atomically. Merge — do not replace siblings. Walk up from the resolved physical cwd to find the autopilot dir, same pattern as the cap-marker reset in step 2.
Cross-reference: references/state-schema.md tasks[].attempts row defines the canonical shape.
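A Python sketch of that write procedure (illustrative; locating state.json uses the walk-up shown in step 2):

```python
import json
import os
import tempfile
from pathlib import Path

def append_attempt(state_path: Path, task_index: int, entry: dict) -> None:
    """Append one attempt entry, merging into the existing task (siblings untouched)."""
    state = json.loads(state_path.read_text())
    task = state["tasks"][task_index]
    attempts = task.setdefault("attempts", [])  # create the array if absent
    entry["attempt"] = len(attempts) + 1        # 1-indexed
    attempts.append(entry)
    # Atomic write: temp file in the same directory, then rename over the original.
    fd, tmp = tempfile.mkstemp(dir=state_path.parent)
    with os.fdopen(fd, "w") as f:
        json.dump(state, f, indent=2)
    os.replace(tmp, state_path)
```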
Choose the right tool based on task domain:
| Domain | Tool | Rationale |
|---|---|---|
| Backend, APIs, business logic | Codex | Strong at algorithms, data flow, system design |
| Frontend, UI, visual design | Gemini | Better aesthetic judgment, visual coherence |
| Mixed (e.g., full-stack feature) | Split task or use both sequentially | Each tool stays within its domain of strength |
Use the use-gemini skill when the task involves frontend, UI, or visual-design work (see the domain table above).
For visual tasks, Gemini can challenge existing specs:
Example prompt addition for visual tasks:
Before implementing, critically review this design spec.
Suggest improvements to colors, spacing, typography, or layout.
Challenge anything that feels generic or could be more distinctive.
Use the use-codex skill when the task involves backend, API, or business-logic work (see the domain table above).
After EVERY TaskUpdate call, sync dev/local/prd-cycle.json:
1. `TaskList` to get all current task states.
2. Rewrite `dev/local/prd-cycle.json`'s `tasks` array: `[{"id": "<id>", "name": "title", "status": "pending|in_progress|completed"}, ...]`
3. Update `tasks_completed` and `tasks_total`.

This is not optional — the user watches this file in real time via pidash.
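A sketch of the sync, assuming the TaskList output has been collected into a list of dicts (names are illustrative):

```python
import json
from pathlib import Path

def sync_dashboard(tasks: list[dict], path: Path = Path("dev/local/prd-cycle.json")) -> None:
    """Mirror current task states into the dashboard state file watched by pidash."""
    doc = json.loads(path.read_text()) if path.exists() else {}
    doc["tasks"] = [
        {"id": t["id"], "name": t["name"], "status": t["status"]} for t in tasks
    ]
    doc["tasks_completed"] = sum(t["status"] == "completed" for t in tasks)
    doc["tasks_total"] = len(tasks)
    path.write_text(json.dumps(doc, indent=2))
```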
Use the TaskList tool to see all tasks. Filter for:

- status `pending`
- unblocked (no unresolved `blockedBy`)

Update `dev/local/prd-cycle.json` with the full task list (see Dashboard State File above).
Read state.rework_task_ids from dev/local/autopilot/state.json (walk up from cwd to find the autopilot dir, same pattern as the cap-marker reset in step 2). Two modes:
| `rework_task_ids` | Mode | Iteration source |
|---|---|---|
| absent or `[]` | default (full-plan) | The pending-and-unblocked subset from step 1's TaskList filter, in TaskList order. This is the Phase 3 first-pass behavior. |
| non-empty array | rework mode | The listed task IDs read directly from `state.rework_task_ids`, in array order — bypass step 1's status filter entirely. Each ID is fetched via TaskGet regardless of current status (pending after Phase 6's reset, or completed if Phase 6's reset hasn't fired yet). Tasks NOT in the list are skipped entirely — no Agent A/B/C dispatch, no commits. |
In rework mode, each task's status is set to in_progress at start via TaskUpdate (overwriting whatever the prior status was — pending after Phase 6's reset, or completed on a defensive re-entry) and to completed at end — same lifecycle as a default-mode pass, so the dashboard reflects rework progress.
In rework mode, the Attempt logging entry (see "Attempt logging" above) sets review_cycle to the current state.cycle value (not null), model to the escalated tier read from task.metadata.model (set by /run-autopilot Phase 6), and outcome to "completed" or "aborted" as normal.
/work does NOT modify rework_task_ids itself. Clearing is /run-autopilot Phase 6's responsibility, after this /work invocation returns. If /work aborts mid-rework (context overrun, Subagent Dispatch Budget overrun, unrecoverable error), rework_task_ids survives in state — this is correct recovery behavior: the next /run-autopilot session resumes with the same rework batch and re-attempts the listed tasks at their already-escalated tier. Phase 6's clear runs only on the successful /work return.
Cross-reference: references/state-schema.md rework_task_ids row; run-autopilot/SKILL.md Phase 6 (rework) tier-escalation rule.
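A compact sketch of the mode split; `pending_unblocked` stands in for step 1's TaskList filter output and is an assumption, not a real tool call:

```python
def iteration_order(state: dict, pending_unblocked: list[str]) -> tuple[str, list[str]]:
    """Pick the task iteration source: rework batch if present, else the default filter."""
    rework_ids = state.get("rework_task_ids") or []
    if rework_ids:
        # Rework mode: listed IDs in array order; step 1's status filter is bypassed.
        return "rework", list(rework_ids)
    # Default mode: pending-and-unblocked subset in TaskList order.
    return "default", pending_unblocked
```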
For the first available task:
1. `TaskUpdate` to set `status: in_progress` and claim ownership.
2. Clear the cap-fired marker. The hook already resets it when the task id in `state.json` differs from the id stored in the marker file (added cycle-5+1), but the explicit Bash clear here is a belt-and-braces backstop in case state.json's task-id snapshot lags the actual task switch. Walk up from the resolved physical cwd to find the autopilot dir (same pattern the hook uses, including `Path.resolve()`-style symlink resolution and a root-inclusive check — agents may cd into a subdirectory or through a symlink during the task), then remove the marker inside it:

   `d=$(pwd -P); while :; do [ -d "$d/dev/local/autopilot" ] && { rm -f "$d/dev/local/autopilot/.cap-fired"; break; }; [ "$d" = "/" ] && break; d=$(dirname "$d"); done`

   No-op when no ancestor has the dir (non-autopilot runs) or the marker is already absent (first task of the phase).
3. `TaskGet` to read the full task description.
4. Before dispatching to Codex/Gemini, load relevant context into the prompt, at minimum the PRD from `dev/local/prds/wip/`. 1M context makes this practical — richer prompts produce better first-pass results.
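For reference, a Python sketch of the same walk-up discipline used in step 2 (mirroring the hook's symlink resolution and root-inclusive check; the helper name is illustrative):

```python
from pathlib import Path

def find_autopilot_dir(start: Path | None = None) -> Path | None:
    """Walk up from the physical cwd looking for dev/local/autopilot."""
    d = (start or Path.cwd()).resolve()   # symlinks resolved, like Path.resolve() in the hook
    for candidate in (d, *d.parents):     # root-inclusive walk-up
        probe = candidate / "dev" / "local" / "autopilot"
        if probe.is_dir():
            return probe
    return None  # no ancestor has the dir: caller treats this as a no-op
```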
Ambiguity check (Think Before Coding): Re-read the task description. If scope, data shape, target surface, or success criteria are unclear, stop and ask the user rather than picking silently. See references/code-quality-principles.md §1 and references/code-quality-examples.md §1 for what counts as a hidden assumption worth surfacing.
Dispatch a separate agent to write tests from requirements only. This agent must NOT receive implementation hints or architecture deep-dives - only what a user of the API would know.
Agent A runs as: Claude Code subagent (Agent tool), not Codex/Gemini. It's a focused task that benefits from direct file access for reading test patterns.
Skip for: test-only, docs-only, or config-only tasks.
Agent A receives the task's requirements and acceptance criteria: only what a user of the API would know.

Agent A does NOT receive implementation hints, architecture deep-dives, or planned code structure.
See references/test-author-prompt.md for the full prompt template — it now embeds Simplicity/Think-Before-Coding/Surgical rules to prevent Agent A from writing speculative tests or silently assuming input shape.
Agent A prompts must satisfy the Subagent Dispatch Budget (see section above the Workflow): ≤ 50K bytes, abort-instruction line prepended.
Before committing Agent A's tests, review them in the main session: do they derive from the stated requirements, assert observable behavior rather than implementation details, and avoid the speculative or shape-assuming tests called out above?
If any check fails, dispatch Agent A again with specific feedback about what's weak. Max 2 quality gate retries.
Total Agent A budget: max 5 dispatches across the entire test authoring phase (quality gate + adversarial rounds combined). If exhausted, flag weakness in task output and proceed. Don't block the pipeline forever.
Dispatch Agent C to try to write a wrong implementation that passes all of Agent A's tests. Agent C's goal is to exploit weak tests.
Agent C runs as: Claude Code subagent (Agent tool). It needs file write access and the project's test runner to actually execute its wrong implementation against the tests.
Agent C receives: Agent A's test files and the command to run them.
Agent C receives nothing else. No task description, no acceptance criteria, no architecture docs.
Agent C's job: Write an implementation that is clearly wrong (hardcoded values, ignored edge cases, shortcut if/else chains), run the tests against it, and report which tests it broke through.
Outcomes:
| Agent C result | Action |
|---|---|
| Cannot break tests (tests catch all exploits) | Tests are strong. Proceed to 2.9. |
| Breaks tests with wrong impl that passes | Send Agent C's exploit back to Agent A: "These tests can be passed by: {wrong impl}. Strengthen them." Then re-run Agent C against strengthened tests. Max 2 A/C rounds. |
| 2 A/C rounds exhausted | Flag weakness in task output, proceed anyway. |
See references/adversarial-test-prompt.md for the full prompt template.
Agent C prompts must satisfy the Subagent Dispatch Budget: ≤ 50K bytes, abort-instruction line prepended.
git add <test_files>
git commit -m "test(<scope>): add tests for <feature>"
Tests are committed separately before implementation. This makes the TDD boundary auditable in git history.
Agent B's job: make the failing tests pass. Tests ARE the spec.
Agent B receives the failing tests plus the task description and project context.

Agent B does NOT receive the tests' rationale or Agent C's exploit history; the failing tests alone define done.
Prompt must include:

- The code-quality rules block from `references/code-quality-principles.md` (copy the "Prompt Snippet" section verbatim). These counter the anti-patterns LLMs produce by default: speculative abstractions, drive-by refactoring, style drift, silent assumptions. Concrete before/after examples are in `references/code-quality-examples.md` if the agent needs them.
- The Subagent Dispatch Budget's abort-instruction line; on a prompt over 50K bytes, abort with `subagent_prompt_overrun`.

If the task description is ambiguous (multiple interpretations, unclear scope, unstated format/fields/location), stop before dispatching Agent B and surface the ambiguity to the user. See Example 1 in `references/code-quality-examples.md`. Do not dispatch with guessed-at requirements.
Determine task domain (see Tool Selection above), then:
For Codex tasks:

- Model: `gpt-5.2-codex` (default) or user preference
- Sandbox: `workspace-write` for code changes
- Prompt template: `references/codex-integration.md` (TDD implementation mode)

For Gemini tasks:

- `--allow-all-tools` for code changes
- `-p` for non-interactive mode
- Prompt template: `references/gemini-integration.md` (TDD implementation mode)

| Result | Action |
|---|---|
| Success | Continue to step 5 |
| Timeout | Append attempt-log entry per the "Attempt logging" section (outcome: "aborted", cause: "timeout"). Split task (see below), mark original as blocked. |
| Context exceeded | Append attempt-log entry per the "Attempt logging" section (outcome: "aborted", cause: "context_overrun"). Split task, mark original as blocked. |
| Error | Invoke systematic-debugging if available (see below). On unrecoverable error, append attempt-log entry per the "Attempt logging" section (outcome: "aborted", cause: "error") and report to user. |
If the tool returned an error and superpowers:systematic-debugging is in the available skills list, invoke it to diagnose the root cause before reporting to the user. If debugging resolves the issue, continue to step 5. If not, report to user and keep task in_progress.
Stage changed files, then commit in a separate Bash call:
git add -A
git commit -m "<type>(<scope>): <description>"
Never chain these with && in a single Bash call.
Commit message rules: follow the `<type>(<scope>): <description>` conventional-commit format shown above.
Run only the specific tests Agent A wrote in step 2.7. Do NOT run the full project test suite, smoke tests, integration tests, or lint here — those run once at the end of the phase (step 7).
How to target the new tests:

- Rust: `cargo test -p <crate> --test <test_file>` or `cargo test -p <crate> <module::test_name>`
- Python: `pytest path/to/test_file.py::test_name`
- JS/TS: `vitest run path/to/test_file` or `jest path/to/test_file`

If tests fail, dispatch Agent B again with the failing output and the code-quality rules block from `references/code-quality-principles.md`, plus an explicit SURGICAL instruction: "Fix only what the failing test output points to. Do not refactor passing code, adjust unrelated files, or change style."

If `superpowers:verification-before-completion` is available, invoke it for additional verification beyond tests — but keep its scope to this task's files, not the full workspace.
If superpowers:requesting-code-review is in the available skills list, dispatch a code review after commit and verification:
- `BASE_SHA` = commit before this task, `HEAD_SHA` = HEAD after commit.
- If `superpowers:receiving-code-review` is available, invoke it to evaluate feedback before acting: verify suggestions technically, push back if wrong. Then fix confirmed issues (dispatch Agent B with the code-quality rules block from `references/code-quality-principles.md` plus: "Apply ONLY the specific fixes listed below. Do not refactor surrounding code or address unrelated issues you notice."), re-commit, re-verify (step 5.5), re-review. Max 3 review cycles, then proceed with warning.

Skip for documentation-only or configuration-only tasks.
1. `TaskUpdate` to set `status: completed` and sync the state file.
2. Append to `state.tasks[i].attempts[]` per the "Attempt logging" section: `outcome: "completed"`, model from `task.metadata.model`, `cause: null`, `review_cycle: null` on a Phase-3 first pass or the current `state.cycle` on a rework pass.

After all tasks in the phase are marked completed, run the project's full verification suite once. This is the single point where the full suite runs — per-task verification (step 5.5) only ran the new tests in isolation, so this step is mandatory and must not be skipped.
What to run (project-dependent — use the commands documented in AGENTS.md / CLAUDE.md / project README):
- Full test suite (`cargo test --workspace`, `pytest`, `npm test`)
- Lints (`cargo clippy --workspace`, `ruff check`, `eslint .`)
- Smoke tests (`./tests/smoke.sh`)
- Integration / e2e tests (`./tests/integration.sh`, `cargo test -p <crate>-e2e`)

Run each as a separate Bash call. Do not chain with `&&`.
Handling failures at this step:
- Trace the failure to the responsible task, set it back via `TaskUpdate(status: in_progress)`, and sync the state file.
- Dispatch Agent B with the failing output and the code-quality rules block from `references/code-quality-principles.md`, and add: "Fix only the regression identified below. Do not touch unrelated files or refactor adjacent code." Do NOT relax the failing test.
- Max 3 fix cycles at this step before escalating to the user — regressions clustering here usually indicate a design issue that needs human input.
Only stop the work phase once step 7 is fully green.
Note: With 1M context, context-exceeded failures are rare. Split primarily for timeout or task complexity, not context limits.
When a tool can't complete a task (timeout/complexity), split it:
1. `TaskCreate` for each subtask
2. `TaskUpdate.addBlockedBy` if the subtasks are sequential

| Original scope | Split into |
|---|---|
| Multiple files | One task per file |
| Multiple features | One task per feature |
| Large refactor | Extract → transform → cleanup |
| Full-stack feature | Backend task (Codex) → Frontend task (Gemini) |
If superpowers:dispatching-parallel-agents is in the available skills list and the current batch contains 2+ tasks that:
- have no `blockedBy` dependencies on each other, and
- carry a `[C{n}]` or `[D{n}]` tag (rework tasks, not original plan tasks),

then dispatch them in parallel using the dispatching-parallel-agents pattern.
Never parallelize original plan tasks - the one-at-a-time rule remains for all non-rework tasks due to pidash sync requirements.
- `references/test-author-prompt.md` - Test author (Agent A) prompt template
- `references/adversarial-test-prompt.md` - Adversarial validator (Agent C) prompt template
- `references/codex-integration.md` - Codex prompt templates and patterns
- `references/gemini-integration.md` - Gemini prompt templates and patterns
- `references/code-quality-principles.md` - Think/Simplicity/Surgical/Goal-driven rules to inject into Agent B prompts
- `references/code-quality-examples.md` - Before/after examples of the anti-patterns those rules prevent