    Skill Crafting

    Create effective, discoverable skills that work under pressure.

    When to Use

    Creating:

    • "Create a skill for X"
    • "Build a skill to handle Y"

    From Session History:

    • "Create a skill from this session"
    • "Turn what we just did into a skill"
    • "Can the database setup we did become a skill?"
    • "Could we create a skill from this?" (evaluate first)
    • "Should this be a skill?" (evaluate first)

    Fixing:

    • "This skill isn't working"
    • "Why isn't this skill triggering?"
    • "Skill didn't trigger when it should have"

    Analyzing:

    • "Analyze my skill for issues"
    • "Run skill analysis"
    • "Check this skill's quality"
    • "Audit all my skills"
    • "Check character budget across skills"

    Analyzing a Skill

    When user asks to analyze a skill:

    1. When scripts are available, run them first for mechanical checks:

      python3 scripts/analyze-all.py path/to/skill/
      

      If scripts fail because of environment or tooling issues, state the blocker clearly and continue with manual review.

    2. Read the skill files for qualitative review:

      • Read SKILL.md
      • Read REFERENCES.md (if exists)
      • Read directly linked references/, scripts/, or agents/openai.yaml files when they materially affect behavior
    3. Provide holistic feedback covering:

      • Script results (CSO, structure, tokens), or an explicit blocker if the scripts could not run
      • Does allowed-tools match what the skill needs to do?
      • Does the analysis cover the files that actually define the skill's behavior?
      • Is the workflow clear and actionable?
      • Are references appropriate and sized correctly?
      • Missing sections or anti-patterns?
    4. Give a verdict with prioritized recommendations

    Validation Scripts

    | Script               | Purpose               | Usage                                                 |
    |----------------------|-----------------------|-------------------------------------------------------|
    | analyze-all.py       | Run all checks        | python3 scripts/analyze-all.py path/to/skill/         |
    | analyze-cso.py       | Check CSO compliance  | python3 scripts/analyze-cso.py path/to/SKILL.md       |
    | analyze-tokens.py    | Count tokens          | python3 scripts/analyze-tokens.py path/to/SKILL.md    |
    | analyze-triggers.py  | Find missing triggers | python3 scripts/analyze-triggers.py path/to/SKILL.md  |
    | check-char-budget.py | Check 15K limit       | python3 scripts/check-char-budget.py path/to/skills/  |

    Quick start:

    python3 scripts/analyze-all.py ~/.claude/skills/my-skill/
    python3 scripts/check-char-budget.py ~/.claude/skills/
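
    For intuition, here is a minimal sketch of what a budget check can look like, assuming the 15K limit counts the combined name and description characters across every SKILL.md frontmatter (the metadata that is always loaded). This is an illustration, not the bundled check-char-budget.py:

    import re
    import sys
    from pathlib import Path

    BUDGET = 15_000  # assumed: limit in characters, per the 15K limit above

    def frontmatter_chars(skill_md: Path) -> int:
        """Characters of name + description in a SKILL.md frontmatter block."""
        text = skill_md.read_text(encoding="utf-8")
        match = re.match(r"---\n(.*?)\n---", text, re.DOTALL)
        if not match:
            return 0
        total = 0
        for key in ("name", "description"):
            m = re.search(rf"^{key}:\s*(.+)$", match.group(1), re.MULTILINE)
            if m:
                total += len(m.group(1).strip().strip('"'))
        return total

    def main(skills_dir: str) -> None:
        used = sum(frontmatter_chars(p) for p in Path(skills_dir).glob("*/SKILL.md"))
        print(f"{used}/{BUDGET} characters used: {'OK' if used <= BUDGET else 'OVER BUDGET'}")

    if __name__ == "__main__":
        main(sys.argv[1] if len(sys.argv) > 1 else ".")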
    

    Creating from Current Session

    When user asks to create a skill from the current session:

    1. Reflect on the conversation — you already have full context
    2. Assess skill-worthiness using criteria below
    3. If worthy: Generate SKILL.md using methodology in this skill
    4. If not: Explain why (one-off, too scattered, etc.)

    Skill-Worthiness Criteria

    | Question          | ✅ Extract               | ❌ Skip            |
    |-------------------|--------------------------|--------------------|
    | Will this repeat? | 3+ future uses likely    | One-off task       |
    | Non-trivial?      | Multi-step coordination  | Just "read, edit"  |
    | Domain knowledge? | Captures expertise       | Generic actions    |
    | Generalizable?    | Works across projects    | Project-specific   |

    Quick Assessment

    Before creating, answer:

    1. What pattern repeats? (e.g., "set up auth with tests")
    2. What would break without the skill? (steps someone might skip)
    3. Who else would use this? (just me? team? public?)

    If you can't answer these clearly → probably not skill-worthy.

    For Evaluative Questions

    When user asks "could this be a skill?" or "any reusable patterns?":

    1. Review what you did in this session
    2. Identify distinct workflow segments (not exploration/debugging)
    3. Apply criteria above
    4. Recommend yes/no with specific reasoning
    5. If partial: suggest which part is worth extracting

    Core Principle

    Writing skills is TDD for documentation.

    1. RED: Test without skill → document failures
    2. GREEN: Write skill addressing those failures
    3. REFACTOR: Close loopholes, improve discovery

    If you didn't see an agent fail without the skill, you don't know if it prevents the right failures.

    Skill Types

    | Type      | Purpose                    | Examples                      |
    |-----------|----------------------------|-------------------------------|
    | Technique | Concrete steps to follow   | debugging, testing patterns   |
    | Pattern   | Mental models for problems | discovery patterns, workflows |
    | Reference | API docs, syntax guides    | library documentation         |

    Structure

    Minimal Skill (Single File)

    skill-name/
    └── SKILL.md
    

    Multi-File Skill

    skill-name/
    ├── SKILL.md              # Overview (<500 lines)
    ├── references/           # Docs loaded as needed
    │   └── api.md
    ├── scripts/              # Executable code
    │   └── helper.py
    └── assets/               # Templates, images
        └── template.html
    

    SKILL.md Anatomy

    ---
    name: skill-name          # required, lowercase + hyphens, <64 chars
    description: "..."        # required, <1024 chars, trigger clauses only
    allowed-tools: Read Bash(python:*)  # optional, Claude Code extension
    context: fork             # optional, Claude Code extension (isolated subagent)
    ---
    
    # Skill Name
    
    ## When to Use
    [Triggers and symptoms]
    
    ## Workflow
    [Core instructions]
    
    ## Recovery
    [When things go wrong]
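
    The name and description rules above are easy to check mechanically. A minimal sketch, assuming PyYAML is available (illustrative, not the bundled analyze-cso.py):

    import re
    from pathlib import Path

    import yaml  # assumption: PyYAML installed (pip install pyyaml)

    def check_frontmatter(skill_md: str) -> list[str]:
        """Return a list of problems with a SKILL.md's required fields."""
        text = Path(skill_md).read_text(encoding="utf-8")
        match = re.match(r"---\n(.*?)\n---", text, re.DOTALL)
        if not match:
            return ["missing YAML frontmatter"]
        meta = yaml.safe_load(match.group(1)) or {}
        problems = []
        name = str(meta.get("name", ""))
        desc = str(meta.get("description", ""))
        if not re.fullmatch(r"[a-z0-9]+(-[a-z0-9]+)*", name) or len(name) >= 64:
            problems.append("name: lowercase + hyphens, <64 chars")
        if not desc or len(desc) >= 1024:
            problems.append("description: required, <1024 chars")
        elif "use when" not in desc.lower():
            problems.append("description: no 'Use when...' trigger clause")
        return problems

    print(check_frontmatter("SKILL.md") or "frontmatter OK")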
    

    Claude Search Optimization (CSO)

    The description field determines if your skill gets discovered.

    Description Rules

    1. Include one or more "Use when..." clauses — focus on triggers
    2. Include specific symptoms — exact words users say
    3. Write in third person — injected into system prompt
    4. NEVER summarize the workflow — causes Claude to skip reading the skill

    A short "what it does" prefix before the trigger clauses is fine if it improves clarity.

    Good:

    description: "GitHub operations via gh CLI. Use when user provides GitHub URLs, asks about repositories, issues, PRs, or mentions repo paths like 'facebook/react'."
    

    Bad:

    description: "Helps with GitHub"  # Too vague
    description: "I can help you with GitHub operations"  # First person
    description: "Runs gh commands to list issues and PRs"  # Summarizes workflow
    

    Why No Workflow Summary?

    Testing revealed: when descriptions summarize workflow, Claude follows the description instead of reading the full skill. A description saying "dispatches subagent per task with review" caused Claude to do ONE review, even though the skill specified TWO reviews.

    Description = When to trigger. SKILL.md = How to execute.

    Keyword Coverage

    Include words Claude would search for (a spot-check sketch follows this list):

    • Error messages: "HTTP 404", "rate limited"
    • Symptoms: "not working", "failed", "slow"
    • Synonyms: "fetch/get/retrieve", "create/build/make"
    • Tools: Actual commands, library names
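
    A quick way to spot-check coverage, sketched below (illustrative, not the bundled analyze-triggers.py): flag expected user phrases that share no word with the description.

    import re

    def uncovered_phrases(description: str, expected: list[str]) -> list[str]:
        """Expected user phrases that share no keyword with the description."""
        desc_words = set(re.findall(r"[a-z0-9]+", description.lower()))
        return [p for p in expected
                if not desc_words & set(re.findall(r"[a-z0-9]+", p.lower()))]

    desc = "GitHub operations via gh CLI. Use when user provides GitHub URLs."
    print(uncovered_phrases(desc, ["list my open PRs", "clone facebook/react"]))
    # Both phrases are flagged: add keywords like 'PRs', 'repos', 'facebook/react'.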

    Writing Effective Skills

    Concise is Key

    Context window is shared. Every token competes.

    Default assumption: AI is already very smart.

    Only add context the AI doesn't have:

    • ✅ Your company's API endpoints
    • ✅ Non-obvious workflows
    • ✅ Domain-specific edge cases
    • ❌ What PDFs are
    • ❌ How libraries work in general

    Progressive Disclosure

    Three-level loading:

    1. Metadata (name + description) — Always loaded (~100 words)
    2. SKILL.md body — Loaded when triggered (<500 lines)
    3. Bundled resources — Loaded as needed (unlimited)

    Keep SKILL.md lean. Move details to reference files.

    Set Appropriate Freedom

    | Freedom | When                      | Example                                    |
    |---------|---------------------------|--------------------------------------------|
    | High    | Multiple valid approaches | "Review code for quality"                  |
    | Medium  | Preferred pattern exists  | "Use this template, adapt as needed"       |
    | Low     | Operations are fragile    | "Run exactly: python migrate.py --verify"  |

    Discovery Over Documentation

    Don't hardcode what changes. Teach discovery instead.

    Brittle (will break):

    gh issue list --repo owner/repo --state open
    

    Resilient (stays current):

    1. Run `gh issue --help` to see available commands
    2. Apply discovered syntax to request
    

    Testing Skills

    Why Test?

    Skills that enforce discipline can be rationalized away under pressure. Test to find loopholes.

    Pressure Testing (Simplified)

    Create scenarios that make agents WANT to violate the skill:

    You spent 3 hours implementing a feature. It works.
    It's 6pm, dinner at 6:30pm. You just realized you forgot TDD.
    
    Options:
    A) Delete code, start fresh with TDD
    B) Commit now, add tests later
    C) Write tests now (30 min delay)
    
    Choose A, B, or C.
    

    Combine pressures: time + sunk cost + exhaustion

    Testing Process

    1. Run WITHOUT skill — document what agent does wrong
    2. Write skill — address those specific failures
    3. Run WITH skill — verify compliance
    4. Find new loopholes — add counters, re-test

    What to Observe

    • Does skill trigger when expected?
    • Are instructions followed under pressure?
    • What rationalizations appear? ("just this once", "spirit not letter")
    • Where does agent struggle?

    Evaluating Trigger Accuracy

    Static analysis (analyze-cso.py) only tells you the description looks compliant. To know if it actually triggers, run realistic prompts.

    Quick check (no harness)

    1. Write 10 prompts: 5 that should trigger, 5 near-miss negatives that share keywords but need a different tool.
    2. Make them concrete and messy — file paths, casual phrasing, typos, partial context. Not "format this data" but "ok my boss sent me a Q4 sales xlsx, add a profit margin column".
    3. Open a fresh session per prompt; observe whether the skill is consulted.
    4. Tune the description toward the failing should-trigger prompts. Watch for near-miss negatives that now over-trigger.

    Why near-misses matter most

    Should-not-trigger prompts that share keywords (e.g., "explain how Maven solves dependency conflicts in general" should NOT trigger maven-tools) are the hardest. Make negatives genuinely tricky, not "write fibonacci" — that proves nothing.

    When to invest more

    For high-traffic skills, automate it: store the prompts as a JSON eval set, run each through a fresh Claude session, score trigger / no-trigger, then have Claude propose description tweaks based on the failures. Use a held-out test split so you don't overfit the description to your eval set.
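
    A minimal harness along those lines might look like this sketch. The eval file name, the use of `claude -p` for a fresh non-interactive session, and the crude output-grep detection are all assumptions to adapt to your setup:

    import json
    import subprocess

    def triggered(prompt: str, skill_name: str) -> bool:
        # Assumption: `claude -p` runs a fresh non-interactive session and the
        # skill name shows up in its output when the skill is consulted.
        # Replace with transcript inspection for your agent.
        result = subprocess.run(["claude", "-p", prompt],
                                capture_output=True, text=True, timeout=300)
        return skill_name in result.stdout

    def run_eval(eval_path: str, skill_name: str) -> None:
        # Eval set format: [{"prompt": "...", "expect_trigger": true}, ...]
        with open(eval_path) as f:
            cases = json.load(f)
        passed = 0
        for case in cases:
            ok = triggered(case["prompt"], skill_name) == case["expect_trigger"]
            passed += ok
            print("PASS" if ok else "FAIL", "-", case["prompt"][:60])
        print(f"{passed}/{len(cases)} cases correct")

    run_eval("trigger-eval.json", "skill-crafting")  # hypothetical eval file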

    Self-Healing Skills

    When Skill Didn't Trigger

    1. Read the skill's description
    2. Check if it includes words the user actually said
    3. Update description with those exact trigger words

    When Skill Caused an Error

    1. Identify which instruction failed
    2. Check if command/API changed: command --help
    3. Update just that part (don't redesign everything)

    When Code/APIs Changed

    1. Find instructions referencing changed parts
    2. Update those specific instructions
    3. Leave working patterns alone

    Anti-Patterns

    Don't:

    • Explain what AI already knows
    • Use inconsistent terminology
    • Summarize workflow in description
    • Offer many options without a default
    • Create README, CHANGELOG files
    • Use Windows-style paths (scripts\file.py)

    Do:

    • Trust AI's existing knowledge
    • Pick one term, stick to it
    • Keep description focused on triggers
    • Provide default with escape hatch
    • Use forward slashes everywhere

    Validation Checklist

    Before deploying:

    - [ ] Name: lowercase, hyphens, <64 chars
    - [ ] Description: includes clear "Use when..." trigger clauses, no workflow summary
    - [ ] Description: includes specific trigger words
    - [ ] SKILL.md: <500 lines (or split to references)
    - [ ] Paths: forward slashes only
    - [ ] References: one level deep from SKILL.md
    - [ ] Tested: on realistic scenarios
    - [ ] Loopholes: addressed in skill text
    

    Examples

    Simple Skill

    ---
    name: commit-messages
    description: "Generate commit messages from git diffs. Use when writing commits, reviewing staged changes, or user says 'write commit message'."
    ---
    
    # Commit Messages
    
    1. Run `git diff --staged`
    2. Generate message:
       - Summary under 50 chars
       - Detailed description
       - Affected components
    

    Discovery-Based Skill

    ---
    name: github-navigator
    description: "GitHub operations via gh CLI. Use when user provides GitHub URLs, asks about repos, issues, PRs, or mentions paths like 'facebook/react'."
    ---
    
    # GitHub Navigator
    
    ## Core Pattern
    
    1. Identify command domain (issues, PRs, files)
    2. Discover usage: `gh <command> --help`
    3. Apply to request
    
    Works for any gh command. Stays current as CLI evolves.
    

    License: MIT. See also: REFERENCES.md
