Quality checker for Claude Code skills (SKILL.md) with a 24-point checklist based on Anthropic's best practices.
Evaluates Claude Code skills on two dimensions: structural correctness (does it follow the format?) and principle quality (will it actually work well for an LLM?).
Many skills pass structural checks but fail in practice because they lack reasoning, mismatch freedom levels, or waste tokens on things Claude already knows. This reviewer catches both.
| Situation | Mode | Key Action |
|---|---|---|
| Validate my skill before publishing | Self-Review | Run checklist → fix → re-validate |
| Evaluate someone else's skill | External Review | Clone → full audit → report |
| Contribute improvements to a repo | Auto-PR | Fork → additive changes → PR |
Not every skill needs every check. Determine the scope before reviewing — it controls which checks are mandatory vs advisory.
| Profile | When | Location | Key Difference |
|---|---|---|---|
| 📁 Project | One repo only, solo or small team | `.claude/skills/` in project | Overfitting OK — meant for this project |
| 🏠 Personal | Your own use, all projects | `~/.claude/skills/` | Must generalize across projects |
| 👥 Team | Shared within company/org | Committed to team repos | + discoverability, naming, documentation |
| 🌐 Public | Open-source, marketplace, community | Published externally | All checks mandatory |
Check applicability: 📁+ = from Project up, 🏠+ = from Personal up, 👥+ = from Team up, 🌐 = Public only.
When reporting, mark advisory-only issues as 🟢 Low regardless of their usual severity. A project-specific skill that's "overfitted" (P7) is not a problem — it's the intent.
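The profile ordering above can be applied mechanically when deciding mandatory vs advisory. A minimal sketch, assuming the four profiles are ordered Project < Personal < Team < Public as described; the function and dictionary names are illustrative, not part of any existing tooling:

```python
# Scope profiles ordered from narrowest to broadest audience.
PROFILES = ["project", "personal", "team", "public"]

# Scope marker from the check tables → minimum profile at which
# the check becomes mandatory (rather than advisory).
SCOPE_MARKERS = {
    "📁+": "project",   # from Project up
    "🏠+": "personal",  # from Personal up
    "👥+": "team",      # from Team up
    "🌐": "public",     # Public only
}

def is_mandatory(check_scope: str, skill_profile: str) -> bool:
    """Mandatory when the skill's profile is at or above the check's
    minimum profile; otherwise advisory (report as 🟢 Low)."""
    minimum = SCOPE_MARKERS[check_scope]
    return PROFILES.index(skill_profile) >= PROFILES.index(minimum)
```

For example, `is_mandatory("👥+", "personal")` is False: discoverability checks stay advisory for a personal skill.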
## Tier 1: Form (Structural)

Objective checks that can be verified mechanically. These catch obvious problems.
| # | Check | Why It Matters | Scope |
|---|---|---|---|
| F1 | name: present, ≤64 chars, lowercase+hyphens | Claude uses this for slash-command routing | 📁+ |
| F2 | description: present, ≤1024 chars, non-empty | Claude's only basis for choosing this skill among 100+ | 📁+ |
| F3 | description in third person | Injected into system prompt; first-person breaks coherence | 👥+ |
| F4 | description includes trigger conditions ("Use when...") | Without this, Claude under-triggers the skill | 🏠+ |
| F5 | SKILL.md body ≤500 lines | Beyond this, competes with conversation context for tokens | 📁+ |
| F6 | No Windows paths (`\`) | Breaks on Unix systems; forward slashes work everywhere | 🏠+ |
| F7 | No hardcoded absolute paths (`/Users/...`, `/home/...`) | Fails on any other machine | 🏠+ |
| F8 | No hardcoded secrets or API keys | Security risk if skill is shared | 📁+ |
| F9 | Reference depth ≤1 level from SKILL.md | Claude may partially read (`head -100`) nested refs, losing information | 🏠+ |
| F10 | Reference files >100 lines have a table of contents | Claude can see scope even with partial reads | 👥+ |
| F11 | No unnecessary files (README.md, CHANGELOG, etc.) | Skills are for AI agents, not human onboarding | 🌐 |
| F12 | Scripts have explicit error handling | Bare failures force Claude to guess what went wrong | 🏠+ |
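Several Tier 1 checks are simple enough to script. A sketch of the F1/F2 frontmatter checks — the exact regex for valid names is an assumption (lowercase alphanumerics joined by single hyphens), not a published grammar:

```python
import re

def check_frontmatter(name: str, description: str) -> list[str]:
    """Return F1/F2 failures for a skill's frontmatter fields."""
    failures = []
    # F1: name present, ≤64 chars, lowercase+hyphens
    if not name:
        failures.append("F1: name missing")
    elif len(name) > 64 or not re.fullmatch(r"[a-z0-9]+(?:-[a-z0-9]+)*", name):
        failures.append("F1: name must be ≤64 chars, lowercase+hyphens")
    # F2: description present, non-empty, ≤1024 chars
    if not description.strip():
        failures.append("F2: description missing or empty")
    elif len(description) > 1024:
        failures.append("F2: description exceeds 1024 chars")
    return failures
```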
## Tier 2: Principle (Quality)

Judgment-based checks that determine whether the skill will actually perform well. These are what separate a mediocre skill from a great one.
| # | Check | Why It Matters | Scope |
|---|---|---|---|
| P1 | WHY is explained — rules include reasoning, not just commands | Claude can generalize from principles; bare rules break on edge cases | 🏠+ |
| P2 | Token-efficient — no explanations of things Claude already knows | Context window is shared; every wasted token degrades performance | 📁+ |
| P3 | Freedom calibrated — specificity matches task fragility | Over-constraining flexible tasks wastes effort; under-constraining fragile tasks causes errors | 📁+ |
| P4 | Consistent terminology — one term per concept throughout | Mixed terms ("field"/"box"/"element") confuse Claude's pattern matching | 👥+ |
| P5 | Concrete examples — input/output pairs, not abstract descriptions | Examples teach format and style more effectively than rules | 👥+ |
| P6 | Feedback loops — validate→fix→re-validate for quality-critical steps | Single-pass execution misses errors that compound downstream | 📁+ |
| P7 | Not overfitted — instructions generalize beyond test examples | Narrow rules work for known cases but fail on real-world variation | 🏠+ |
| P8 | No time-sensitive info — no dates, versions that will expire | Skills may be used months later; stale info produces wrong outputs | 👥+ |
| P9 | Description is assertive — slightly "pushy" to prevent under-triggering | Claude tends to skip skills when descriptions are too conservative | 🏠+ |
| P10 | Default over options — one recommended approach, not many choices | Multiple equal options cause decision paralysis in the agent | 👥+ |
| P11 | Dependencies explicit — required packages and tools are listed | Implicit assumptions fail silently in different environments | 🏠+ |
| P12 | MCP tools fully qualified — `ServerName:tool_name` format | Ambiguous tool names cause "tool not found" errors | 🏠+ |
Full checklist with scoring: references/evaluation-checklist.md
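The validate→fix→re-validate pattern behind P6 can be sketched as a bounded loop — bounded so a fixer that oscillates between two states cannot run forever. The function and parameter names are illustrative:

```python
def validate_fix_loop(content, validate, fix, max_rounds=5):
    """P6: re-validate after every fix so a fix cannot silently
    introduce a new issue. Returns (content, remaining_issues)."""
    for _ in range(max_rounds):
        issues = validate(content)
        if not issues:
            return content, []          # clean: done
        content = fix(content, issues)  # attempt repairs
    return content, validate(content)   # budget exhausted: report the rest
```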
## Mode 1: Self-Review

Validate your own skill before publishing. Run all checks, fix issues, then re-run until clean. Present results using the Report Output Format below.
Self-Review Loop:
1. Run Tier 1 checks (structural)
→ Fix any failures
→ Re-run until all pass
2. Run Tier 2 checks (principle)
→ For each issue, decide: fix, justify, or accept
→ Re-run to confirm fixes didn't introduce new issues
3. Final validation
→ Read SKILL.md as if seeing it for the first time
→ Ask: "Would Claude know what to do with this?"
4. Output results using Report Output Format
→ Layer 1 (Overall Assessment) + Layer 2 (Section Summary) + Layer 3 (Issue Details)
The Tier 1 and Tier 2 tables above ARE the checklist. Read each row, verify, move on. Do not duplicate them here.
## Mode 2: External Review

Evaluate someone else's skill. The goal is to understand the author's intent before judging quality.
External Review Workflow:
1. [ ] Read ALL files before forming opinions
2. [ ] Identify the author's intent and target use case
3. [ ] Run Tier 1 checks (structural)
4. [ ] Run Tier 2 checks (principle)
5. [ ] Output results using Report Output Format:
Layer 1: Overall Assessment (scores + severity counts)
Layer 2: Section Summary (category status table)
Layer 3: Issue Details (sorted 🔴→🟡→🟠→🟢)
## Report Output Format

This report applies three design principles. The template below is a reference, not a rigid format — adapt freely as long as these principles are preserved.
Layer 1 → Layer 2 → Layer 3 (verdict → section scan → actionable details)
# Skill Review: [skill-name]
## 1. Overall Assessment
| Tier | Score | Verdict |
|------|-------|---------|
| Form (Structural) | 9/12 | Good — minor issues |
| Principle (Quality) | 6/12 | Needs work |
| **Overall** | **15/24** | **Several principle issues to address** |
Issues found: 🔴 1 Critical · 🟡 3 High · 🟠 2 Medium · 🟢 1 Low
Strengths:
- [Always acknowledge what the skill does well before listing problems]
- [This sets a constructive tone and respects the author's effort]
## 2. Section Summary
Quick scan — which areas need attention and which are clean.
| Category | Status | Issues |
|----------|--------|--------|
| Frontmatter (F1-F4) | ⚠️ 1 issue | F4 missing triggers |
| Body & Structure (F5-F11) | ✅ Clean | — |
| Scripts (F12) | ⚠️ 1 issue | F12 bare except |
| WHY & Reasoning (P1) | ❌ 2 issues | P1 bare ALWAYS/NEVER |
| Efficiency (P2, P10) | ⚠️ 1 issue | P2 verbose |
| Calibration (P3, P7) | ❌ 1 issue | P3 freedom mismatch |
| Examples & Loops (P5, P6) | ✅ Clean | — |
| Metadata Quality (P9, P11, P12) | ⚠️ 1 issue | P11 deps missing |
## 3. Issue Details
Sorted by severity. Fix 🔴 Critical first — these can prevent the skill from working at all.
### 🔴 Critical
**P3: Freedom mismatch on database migration**
- Location: Lines 45-52
- Problem: "Migrate however you see fit" for a destructive DB operation
- Impact: Claude may run unverified migration, causing data loss
- Fix:
```markdown
# Before
Migrate the database as needed.

# After
Run exactly this migration script — it includes backup and rollback:
python scripts/migrate.py --verify --backup-first
```

### 🟡 High

**F4: Missing trigger conditions in description**
- Fix:
```markdown
# Before
description: Processes PDF files.

# After
description: Extracts text and tables from PDF files, fills forms,
merges documents. Use when working with PDFs or when the user
mentions forms, document extraction, or PDF manipulation.
```

**P1: Bare ALWAYS/NEVER without reasoning (2 occurrences)**

### 🟠 Medium

**P2: Verbose explanation of PDF basics**

**F12: Bare `except` in validation script**

### 🟢 Low

**P11: Dependencies not listed**
💡 The following structure improvements are optional but recommended for 👥 Team / 🌐 Public skills:
| Recommendation | Current | Benefit if Applied |
|---|---|---|
| "When to Use" section | ❌ Missing | Improves Claude's activation accuracy |
| "What This Skill Does" section | ❌ Missing | Quick understanding for team members |
| "How to Use" with examples | ⚠️ Partial | Enables immediate usage |
| Concrete input/output examples | ❌ Missing | Most impactful for Claude's output quality |
Note: These are suggestions, not requirements. The P1-P12 checks already cover the essentials for Claude's performance. Structure recommendations primarily benefit human readers and team onboarding.
Apply based on your scope.
## Mode 3: Auto-PR
Fork, improve, and submit a PR. Changes must be additive — removing content risks breaking existing users' workflows and disrespects the author's design decisions.
Auto-PR Workflow:
1. Fork the target repository
2. Run the full External Review audit (Mode 2)
3. Apply additive improvements only (see Additive Principle below)
4. Open a PR

**PR Guidelines**: See `references/pull-request-guide.md` for skill-specific PR best practices, including template, tone, and validation checklist.
### Additive Principle
Existing content is preserved because the author made deliberate choices, and other users may depend on the current behavior. Breaking their workflows is worse than leaving imperfections.
- ❌ "Removed metadata.json (non-standard)" → ✅ "Added marketplace.json alongside metadata.json"
- ❌ "Rewrote README in English" → ✅ "Added README.en.md (original Chinese README preserved)"
- ❌ "Fixed the incorrect description" → ✅ "Improved description with trigger conditions for better discoverability"
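The additive rule can be spot-checked on a unified diff before opening the PR. A rough sketch that flags any deleted line (skipping the `---` file header); this is a heuristic, not a substitute for reading the change:

```python
def removed_lines(unified_diff: str) -> list[str]:
    """Lines a diff deletes. A non-empty result means the change
    is not purely additive and needs justification."""
    removed = []
    for line in unified_diff.splitlines():
        # "---" introduces a file header, not a deletion
        if line.startswith("-") and not line.startswith("---"):
            removed.append(line[1:])
    return removed
```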
## Mode 4: Create New Skill
Create a skill from scratch following all F1-F12 and P1-P12 best practices.
**When to use**: User asks to "create a skill", "write a skill", "build a skill", or "scaffold a skill"
**Process**: Delegate to `references/writer.md`, which contains the full creation workflow.
The writer references this SKILL.md as the single source of truth for quality criteria. This ensures created skills automatically pass review.
**Key principle**: The writer doesn't duplicate the checklist — it reads this SKILL.md fresh each time. Any updates to F/P checks automatically apply to new skills.
**User signals for Mode 4**:
- "Create a skill for [capability]"
- "Write a skill that [provides X]"
- "I need a skill to [augment Y]"
- "Make me a skill for [context]"
If the user needs a command instead of a skill, refer them to command-reviewer.
## Common Issues & Fixes
### P1 Violation: Rules Without Reasoning
The most common and impactful problem. Claude can follow rules, but understanding WHY lets it handle cases the rule didn't anticipate.
```markdown
# ❌ Bare rule
ALWAYS validate XML after editing.

# ✅ Rule with reasoning
Validate XML after every edit. Invalid XML makes Word unable to open
the document, which means the user loses their work with no recovery path.
```

### P2 Violation: Explaining What Claude Already Knows

Every token that explains common knowledge pushes out conversation context.

```markdown
# ❌ Verbose (~150 tokens explaining PDFs)
PDF (Portable Document Format) is a common file format that contains
text, images, and other content. To extract text from a PDF, you'll
need a library...

# ✅ Concise (~50 tokens)
Extract text with pdfplumber:

    import pdfplumber
    with pdfplumber.open("file.pdf") as pdf:
        text = pdf.pages[0].extract_text()
```

### P3 Violation: Freedom Mismatch

```markdown
# ❌ Low freedom for a flexible task (code review)
Run exactly: python scripts/review.py --strict --all-files
Do not modify the command.

# ✅ High freedom for a flexible task
Review the code for:
- Potential bugs or edge cases
- Readability and maintainability
- Adherence to project conventions

# ❌ High freedom for a fragile task (DB migration)
You can migrate the database however you prefer.

# ✅ Low freedom for a fragile task
Run exactly this script — it includes backup and verification:
python scripts/migrate.py --verify --backup
```

### F4/P9 Violation: Vague Description

```markdown
# ❌ Claude doesn't know when to activate this
description: Processes PDF files.

# ✅ Clear activation context
description: Extracts text and tables from PDF files, fills forms,
merges documents. Use when working with PDF files or when the user
mentions PDFs, forms, or document extraction.
```

### P7 Violation: Overfitting

```markdown
# ❌ Narrow — only works for the test case
When the user asks about Q4 revenue, query the finance_q4_2024 table
and format as a bar chart with blue colors.

# ✅ General — works across situations
Query the appropriate table based on the user's time period.
Format visualizations to match the data type:
- Time series → line chart
- Comparisons → bar chart
- Proportions → pie chart
```

### P10 Violation: Decision Paralysis

```markdown
# ❌ Decision paralysis
You can use pypdf, pdfplumber, PyMuPDF, pdf2image, or camelot...

# ✅ Default with escape hatch
Use pdfplumber for text extraction.
For scanned PDFs requiring OCR, use pdf2image + pytesseract instead.
```
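The Layer 1 scores in the Report Output Format can be derived mechanically from per-check results. A sketch assuming one pass/fail verdict per check (the 12+12 split follows the Tier 1 and Tier 2 tables; the function name is illustrative):

```python
def layer1_scores(results: dict[str, bool]) -> str:
    """results maps check IDs (F1-F12, P1-P12) to pass/fail."""
    form = sum(results.get(f"F{i}", False) for i in range(1, 13))
    principle = sum(results.get(f"P{i}", False) for i in range(1, 13))
    return f"Form {form}/12 · Principle {principle}/12 · Overall {form + principle}/24"
```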
## Resources

- `references/evaluation-checklist.md` — Full checklist with scoring rubric and severity guide
- `references/pull-request-guide.md` — Skill-specific PR best practices for Mode 3 (Auto-PR)
- `references/writer.md` — Skill creation workflow for Mode 4 (Create)