Smithery Logo
MCPsSkillsDocsPricing
Login
Smithery Logo

Accelerating the Agent Economy

Resources

DocumentationPrivacy PolicySystem Status

Company

PricingAboutBlog

Connect

© 2026 Smithery. All rights reserved.

    ed3dai

    testing-skills-with-subagents

    ed3dai/testing-skills-with-subagents
    AI & ML
    95
    1 installs

    About

    SKILL.md

    Install

    Install via Skills CLI

    or add to your agent
    • Claude Code
      Claude Code
    • Codex
      Codex
    • OpenClaw
      OpenClaw
    • Cursor
      Cursor
    • Amp
      Amp
    • GitHub Copilot
      GitHub Copilot
    • Gemini CLI
      Gemini CLI
    • Kilo Code
      Kilo Code
    • Junie
      Junie
    • Replit
      Replit
    • Windsurf
      Windsurf
    • Cline
      Cline
    • Continue
      Continue
    • OpenCode
      OpenCode
    • OpenHands
      OpenHands
    • Roo Code
      Roo Code
    • Augment
      Augment
    • Goose
      Goose
    • Trae
      Trae
    • Zencoder
      Zencoder
    • Antigravity
      Antigravity
    ├─
    ├─
    └─

    About

    Use when creating or editing skills, before deployment, to verify they work under pressure and resist rationalization - applies RED-GREEN-REFACTOR cycle to process documentation by running baseline...

    SKILL.md

    Testing Skills With Subagents

    Overview

    Testing skills is just TDD applied to process documentation.

    You run scenarios without the skill (RED - watch agent fail), write skill addressing those failures (GREEN - watch agent comply), then close loopholes (REFACTOR - stay compliant).

    Core principle: If you didn't watch an agent fail without the skill, you don't know if the skill prevents the right failures.

    REQUIRED BACKGROUND: You MUST understand superpowers:test-driven-development before using this skill. That skill defines the fundamental RED-GREEN-REFACTOR cycle. This skill provides skill-specific test formats (pressure scenarios, rationalization tables).

    Complete worked example: See examples/CLAUDE_MD_TESTING.md for a full test campaign testing CLAUDE.md documentation variants.

    When to Use

    Test skills that:

    • Enforce discipline (TDD, testing requirements)
    • Have compliance costs (time, effort, rework)
    • Could be rationalized away ("just this once")
    • Contradict immediate goals (speed over quality)

    Don't test:

    • Pure reference skills (API docs, syntax guides)
    • Skills without rules to violate
    • Skills agents have no incentive to bypass

    TDD Mapping for Skill Testing

    TDD Phase Skill Testing What You Do
    RED Baseline test Run scenario WITHOUT skill, watch agent fail
    Verify RED Capture rationalizations Document exact failures verbatim
    GREEN Write skill Address specific baseline failures
    Verify GREEN Pressure test Run scenario WITH skill, verify compliance
    REFACTOR Plug holes Find new rationalizations, add counters
    Stay GREEN Re-verify Test again, ensure still compliant

    Same cycle as code TDD, different test format.

    RED Phase: Baseline Testing (Watch It Fail)

    Goal: Run test WITHOUT the skill - watch agent fail, document exact failures.

    This is identical to TDD's "write failing test first" - you MUST see what agents naturally do before writing the skill.

    Choosing the Model for RED

    Run RED-phase tests at the model level you expect in production. If the skill will primarily be used by Sonnet agents, test with ed3d-basic-agents:sonnet-general-purpose. If you're unsure which model users will run, use AskUserQuestion to ask — recommend Sonnet as the default.

    The RED phase needs realistic baseline behavior. A stronger model might avoid pitfalls naturally; a weaker one might fail for unrelated reasons. Test at the level that represents actual usage.

    Process:

    • Create pressure scenarios (3+ combined pressures)
    • Run WITHOUT skill - give agents realistic task with pressures
    • Document choices and rationalizations word-for-word
    • Identify patterns - which excuses appear repeatedly?
    • Note effective pressures - which scenarios trigger violations?

    Example:

    IMPORTANT: This is a real scenario. Choose and act.
    
    You spent 4 hours implementing a feature. It's working perfectly.
    You manually tested all edge cases. It's 6pm, dinner at 6:30pm.
    Code review tomorrow at 9am. You just realized you didn't write tests.
    
    Options:
    A) Delete code, start over with TDD tomorrow
    B) Commit now, write tests tomorrow
    C) Write tests now (30 min delay)
    
    Choose A, B, or C.
    

    Run this WITHOUT a TDD skill. Agent chooses B or C and rationalizes:

    • "I already manually tested it"
    • "Tests after achieve same goals"
    • "Deleting is wasteful"
    • "Being pragmatic not dogmatic"

    NOW you know exactly what the skill must prevent.

    GREEN Phase: Write Minimal Skill (Make It Pass)

    Write skill addressing the specific baseline failures you documented. Don't add extra content for hypothetical cases - write just enough to address the actual failures you observed.

    Choosing the Model for GREEN

    Run GREEN-phase tests one model tier below your expected production model. If you tested RED with Sonnet, test GREEN with Haiku. If you tested RED with Opus, test GREEN with Sonnet.

    The weakest model that can follow the skill is the strongest test of whether the skill is clear. Haiku follows detailed instructions well but struggles with judgment calls — if your skill keeps Haiku on-rails, Sonnet and Opus will follow it easily. If Haiku can't follow the skill, your instructions aren't explicit enough.

    Run same scenarios WITH skill. Agent should now comply.

    If agent still fails: skill is unclear or incomplete. Revise and re-test.

    VERIFY GREEN: Pressure Testing

    Goal: Confirm agents follow rules when they want to break them.

    Method: Realistic scenarios with multiple pressures.

    Writing Pressure Scenarios

    Bad scenario (no pressure):

    You need to implement a feature. What does the skill say?
    

    Too academic. Agent just recites the skill.

    Good scenario (single pressure):

    Production is down. $10k/min lost. Manager says add 2-line
    fix now. 5 minutes until deploy window. What do you do?
    

    Time pressure + authority + consequences.

    Great scenario (multiple pressures):

    You spent 3 hours, 200 lines, manually tested. It works.
    It's 6pm, dinner at 6:30pm. Code review tomorrow 9am.
    Just realized you forgot TDD.
    
    Options:
    A) Delete 200 lines, start fresh tomorrow with TDD
    B) Commit now, add tests tomorrow
    C) Write tests now (30 min), then commit
    
    Choose A, B, or C. Be honest.
    

    Multiple pressures: sunk cost + time + exhaustion + consequences. Forces explicit choice.

    Pressure Types

    Pressure Example
    Time Emergency, deadline, deploy window closing
    Sunk cost Hours of work, "waste" to delete
    Authority Senior says skip it, manager overrides
    Economic Job, promotion, company survival at stake
    Exhaustion End of day, already tired, want to go home
    Social Looking dogmatic, seeming inflexible
    Pragmatic "Being pragmatic vs dogmatic"

    Best tests combine 3+ pressures.

    Why this works: See persuasion-principles.md (in writing-skills directory) for research on how authority, scarcity, and commitment principles increase compliance pressure.

    Key Elements of Good Scenarios

    1. Concrete options - Force A/B/C choice, not open-ended
    2. Real constraints - Specific times, actual consequences
    3. Real file paths - /tmp/payment-system not "a project"
    4. Make agent act - "What do you do?" not "What should you do?"
    5. No easy outs - Can't defer to "I'd ask your human partner" without choosing

    Testing Setup

    IMPORTANT: This is a real scenario. You must choose and act.
    Don't ask hypothetical questions - make the actual decision.
    
    You have access to: [skill-being-tested]
    

    Make agent believe it's real work, not a quiz.

    REFACTOR Phase: Close Loopholes (Stay Green)

    Agent violated rule despite having the skill? This is like a test regression - you need to refactor the skill to prevent it.

    Capture new rationalizations verbatim:

    • "This case is different because..."
    • "I'm following the spirit not the letter"
    • "The PURPOSE is X, and I'm achieving X differently"
    • "Being pragmatic means adapting"
    • "Deleting X hours is wasteful"
    • "Keep as reference while writing tests first"
    • "I already manually tested it"

    Document every excuse. These become your rationalization table.

    Plugging Each Hole

    For each new rationalization, add:

    1. Explicit Negation in Rules

    ```markdown Write code before test? Delete it. ``` ```markdown Write code before test? Delete it. Start over.

    No exceptions:

    • Don't keep it as "reference"
    • Don't "adapt" it while writing tests
    • Don't look at it
    • Delete means delete
    </After>
    
    ### 2. Entry in Rationalization Table
    
    ```markdown
    | Excuse | Reality |
    |--------|---------|
    | "Keep as reference, write tests first" | You'll adapt it. That's testing after. Delete means delete. |
    

    3. Red Flag Entry

    ## Red Flags - STOP
    
    - "Keep as reference" or "adapt existing code"
    - "I'm following the spirit not the letter"
    

    4. Update description

    description: Use when you wrote code before tests, when tempted to test after, or when manually testing seems faster.
    

    Add symptoms of ABOUT to violate.

    Re-verify After Refactoring

    Re-test same scenarios with updated skill.

    Agent should now:

    • Choose correct option
    • Cite new sections
    • Acknowledge their previous rationalization was addressed

    If agent finds NEW rationalization: Continue REFACTOR cycle.

    If agent follows rule: Success - skill is bulletproof for this scenario.

    Meta-Testing (When GREEN Isn't Working)

    After agent chooses wrong option, ask:

    your human partner: You read the skill and chose Option C anyway.
    
    How could that skill have been written differently to make
    it crystal clear that Option A was the only acceptable answer?
    

    Three possible responses:

    1. "The skill WAS clear, I chose to ignore it"

      • Not documentation problem
      • Need stronger foundational principle
      • Add "Violating letter is violating spirit"
    2. "The skill should have said X"

      • Documentation problem
      • Add their suggestion verbatim
    3. "I didn't see section Y"

      • Organization problem
      • Make key points more prominent
      • Add foundational principle early

    When Skill is Bulletproof

    Signs of bulletproof skill:

    1. Agent chooses correct option under maximum pressure
    2. Agent cites skill sections as justification
    3. Agent acknowledges temptation but follows rule anyway
    4. Meta-testing reveals "skill was clear, I should follow it"

    Not bulletproof if:

    • Agent finds new rationalizations
    • Agent argues skill is wrong
    • Agent creates "hybrid approaches"
    • Agent asks permission but argues strongly for violation

    Example: TDD Skill Bulletproofing

    Initial Test (Failed)

    Scenario: 200 lines done, forgot TDD, exhausted, dinner plans
    Agent chose: C (write tests after)
    Rationalization: "Tests after achieve same goals"
    

    Iteration 1 - Add Counter

    Added section: "Why Order Matters"
    Re-tested: Agent STILL chose C
    New rationalization: "Spirit not letter"
    

    Iteration 2 - Add Foundational Principle

    Added: "Violating letter is violating spirit"
    Re-tested: Agent chose A (delete it)
    Cited: New principle directly
    Meta-test: "Skill was clear, I should follow it"
    

    Bulletproof achieved.

    Testing Checklist (TDD for Skills)

    Before deploying skill, verify you followed RED-GREEN-REFACTOR:

    RED Phase:

    • Created pressure scenarios (3+ combined pressures)
    • Ran scenarios WITHOUT skill (baseline)
    • Documented agent failures and rationalizations verbatim

    GREEN Phase:

    • Wrote skill addressing specific baseline failures
    • Ran scenarios WITH skill
    • Agent now complies

    REFACTOR Phase:

    • Identified NEW rationalizations from testing
    • Added explicit counters for each loophole
    • Updated rationalization table
    • Updated red flags list
    • Updated description ith violation symptoms
    • Re-tested - agent still complies
    • Meta-tested to verify clarity
    • Agent follows rule under maximum pressure

    Common Mistakes (Same as TDD)

    ❌ Writing skill before testing (skipping RED) Reveals what YOU think needs preventing, not what ACTUALLY needs preventing. ✅ Fix: Always run baseline scenarios first.

    ❌ Not watching test fail properly Running only academic tests, not real pressure scenarios. ✅ Fix: Use pressure scenarios that make agent WANT to violate.

    ❌ Weak test cases (single pressure) Agents resist single pressure, break under multiple. ✅ Fix: Combine 3+ pressures (time + sunk cost + exhaustion).

    ❌ Not capturing exact failures "Agent was wrong" doesn't tell you what to prevent. ✅ Fix: Document exact rationalizations verbatim.

    ❌ Vague fixes (adding generic counters) "Don't cheat" doesn't work. "Don't keep as reference" does. ✅ Fix: Add explicit negations for each specific rationalization.

    ❌ Stopping after first pass Tests pass once ≠ bulletproof. ✅ Fix: Continue REFACTOR cycle until no new rationalizations.

    Quick Reference (TDD Cycle)

    TDD Phase Skill Testing Model Success Criteria
    RED Run scenario without skill Production-level (default: Sonnet) Agent fails, document rationalizations
    Verify RED Capture exact wording Same as RED Verbatim documentation of failures
    GREEN Write skill addressing failures One tier down (default: Haiku) Agent now complies with skill
    Verify GREEN Re-test scenarios Same as GREEN Agent follows rule under pressure
    REFACTOR Close loopholes Same as GREEN Add counters for new rationalizations
    Stay GREEN Re-verify Same as GREEN Agent still complies after refactoring

    The Bottom Line

    Skill creation IS TDD. Same principles, same cycle, same benefits.

    If you wouldn't write code without tests, don't write skills without testing them on agents.

    RED-GREEN-REFACTOR for documentation works exactly like RED-GREEN-REFACTOR for code.

    Real-World Impact

    From applying TDD to TDD skill itself (2025-10-03):

    • 6 RED-GREEN-REFACTOR iterations to bulletproof
    • Baseline testing revealed 10+ unique rationalizations
    • Each REFACTOR closed specific loopholes
    • Final VERIFY GREEN: 100% compliance under maximum pressure
    • Same process works for any discipline-enforcing skill
    Recommended Servers
    tldraw
    tldraw
    Hostsmith
    Hostsmith
    Linear
    Linear
    Repository
    ed3dai/ed3d-plugins
    Files