Smithery Logo
MCPsSkillsDocsPricing
Login
Smithery Logo

Accelerating the Agent Economy

Resources

DocumentationPrivacy PolicySystem Status

Company

PricingAboutBlog

Connect

© 2026 Smithery. All rights reserved.

    gotalab

    skill-evaluator

    gotalab/skill-evaluator
    AI & ML
    285
    6 installs

    About

    SKILL.md

    Install

    Install via Skills CLI

    or add to your agent
    • Claude Code
      Claude Code
    • Codex
      Codex
    • OpenClaw
      OpenClaw
    • Cursor
      Cursor
    • Amp
      Amp
    • GitHub Copilot
      GitHub Copilot
    • Gemini CLI
      Gemini CLI
    • Kilo Code
      Kilo Code
    • Junie
      Junie
    • Replit
      Replit
    • Windsurf
      Windsurf
    • Cline
      Cline
    • Continue
      Continue
    • OpenCode
      OpenCode
    • OpenHands
      OpenHands
    • Roo Code
      Roo Code
    • Augment
      Augment
    • Goose
      Goose
    • Trae
      Trae
    • Zencoder
      Zencoder
    • Antigravity
      Antigravity
    ├─
    ├─
    └─

    About

    Evaluates agent skills against Anthropic's best practices. Use when asked to review, evaluate, assess, or audit a skill for quality...

    SKILL.md

    Skill Evaluator (WIP)

    Evaluates skills against Anthropic's official best practices for agent skill authoring. Produces structured evaluation reports with scores and actionable recommendations.

    Quick Start

    1. Read the skill's SKILL.md and understand its purpose
    2. Run automated validation: scripts/validate_skill.py <skill-path>
    3. Perform manual evaluation against criteria below
    4. Generate evaluation report with scores and recommendations

    Evaluation Workflow

    Step 1: Automated Validation

    Run the validation script first:

    scripts/validate_skill.py <path/to/skill>
    

    This checks:

    • SKILL.md exists with valid YAML frontmatter
    • Name follows conventions (lowercase, hyphens, max 64 chars)
    • Description is present and under 1024 chars
    • Body is under 500 lines
    • File references are one-level deep

    Step 2: Manual Evaluation

    Evaluate each dimension and assign a score (1-5):

    A. Naming (Weight: 10%)

    Score Criteria
    5 Gerund form (-ing), clear purpose, memorable
    4 Descriptive, follows conventions
    3 Acceptable but could be clearer
    2 Vague or misleading
    1 Violates naming rules

    Rules: Max 64 chars, lowercase + numbers + hyphens only, no reserved words (anthropic, claude), no XML tags.

    Good: processing-pdfs, analyzing-spreadsheets, building-dashboards Bad: pdf, my-skill, ClaudeHelper, anthropic-tools

    B. Description (Weight: 20%)

    Score Criteria
    5 Clear functionality + specific activation triggers + third person
    4 Good description with some triggers
    3 Adequate but missing triggers or vague
    2 Too brief or unclear purpose
    1 Missing or unhelpful

    Must include: What the skill does AND when to use it. Good: "Extracts text from PDFs. Use when working with PDF documents for text extraction, form parsing, or content analysis." Bad: "A skill for PDFs." or "Helps with documents."

    C. Content Quality (Weight: 30%)

    Score Criteria
    5 Concise, assumes Claude intelligence, actionable instructions
    4 Generally good, minor verbosity
    3 Some unnecessary explanations or redundancy
    2 Overly verbose or confusing
    1 Bloated, explains obvious concepts

    Ask: "Does Claude really need this explanation?" Remove anything Claude already knows.

    D. Structure & Organization (Weight: 25%)

    Score Criteria
    5 Excellent progressive disclosure, clear navigation, optimal length
    4 Good organization, appropriate file splits
    3 Acceptable but could be better organized
    2 Poor organization, missing references, or bloated SKILL.md
    1 No structure, everything dumped in SKILL.md

    Check:

    • SKILL.md under 500 lines
    • References are one-level deep (no nested chains)
    • Long reference files (>100 lines) have table of contents
    • Uses forward slashes in all paths

    E. Degrees of Freedom (Weight: 10%)

    Score Criteria
    5 Perfect match: high freedom for flexible tasks, low for fragile operations
    4 Generally appropriate freedom levels
    3 Acceptable but could be better calibrated
    2 Mismatched: too rigid or too loose
    1 Completely wrong freedom level for the task type

    Guideline:

    • High freedom (text): Multiple valid approaches, context-dependent
    • Medium freedom (parameterized): Preferred pattern exists, some variation OK
    • Low freedom (specific scripts): Fragile operations, exact sequence required

    F. Anti-Pattern Check (Weight: 5%)

    Deduct points for each anti-pattern found:

    • Too many options without clear recommendation (-1)
    • Time-sensitive information with date conditionals (-1)
    • Inconsistent terminology (-1)
    • Windows-style paths (backslashes) (-1)
    • Deeply nested references (more than one level) (-2)
    • Scripts that punt error handling to Claude (-1)
    • Magic numbers without justification (-1)

    Step 3: Generate Report

    Use this template:

    # Skill Evaluation Report: [skill-name]
    
    ## Summary
    - **Overall Score**: X.X/5.0
    - **Recommendation**: [Ready for publication / Needs minor improvements / Needs major revision]
    
    ## Dimension Scores
    | Dimension | Score | Weight | Weighted |
    |-----------|-------|--------|----------|
    | Naming | X/5 | 10% | X.XX |
    | Description | X/5 | 20% | X.XX |
    | Content Quality | X/5 | 30% | X.XX |
    | Structure | X/5 | 25% | X.XX |
    | Degrees of Freedom | X/5 | 10% | X.XX |
    | Anti-Patterns | X/5 | 5% | X.XX |
    | **Total** | | 100% | **X.XX** |
    
    ## Strengths
    - [List 2-3 things done well]
    
    ## Areas for Improvement
    - [List specific issues with actionable fixes]
    
    ## Anti-Patterns Found
    - [List any anti-patterns detected]
    
    ## Recommendations
    1. [Priority 1 fix]
    2. [Priority 2 fix]
    3. [Priority 3 fix]
    
    ## Pre-Publication Checklist
    - [ ] Description is specific with activation triggers
    - [ ] SKILL.md under 500 lines
    - [ ] One-level-deep file references
    - [ ] Forward slashes in all paths
    - [ ] No time-sensitive information
    - [ ] Consistent terminology
    - [ ] Concrete examples provided
    - [ ] Scripts handle errors explicitly
    - [ ] All configuration values justified
    - [ ] Required packages listed
    - [ ] Tested with Haiku, Sonnet, Opus
    

    Score Interpretation

    Score Range Rating Action
    4.5 - 5.0 Excellent Ready for publication
    4.0 - 4.4 Good Minor improvements recommended
    3.0 - 3.9 Acceptable Several improvements needed
    2.0 - 2.9 Needs Work Major revision required
    1.0 - 1.9 Poor Fundamental redesign needed

    References

    • references/evaluation-criteria.md - Detailed evaluation criteria with examples
    • references/scoring-rubric.md - Complete scoring rubric and edge cases

    Examples

    See evaluations/ for example evaluation scenarios.

    Recommended Servers
    Thoughtbox
    Thoughtbox
    Clarity AI
    Clarity AI
    Repository
    gotalab/skillport
    Files