    Databricks Skills Testing Framework

    Offline YAML-first evaluation with human-in-the-loop review and interactive skill improvement.

    Quick References

    • Scorers - Available scorers and quality gates
    • YAML Schemas - Manifest and ground truth formats
    • Python API - Programmatic usage examples
    • Workflows - Detailed example workflows
    • Trace Evaluation - Session trace analysis

    /skill-test Command

    The /skill-test command provides an interactive CLI for testing Databricks skills with real execution on Databricks.

    Basic Usage

    /skill-test <skill-name> [subcommand]
    

    Subcommands

    Subcommand        Description
    run               Run evaluation against ground truth (default)
    regression        Compare current results against baseline
    init              Initialize test scaffolding for a new skill
    add               Interactive: prompt -> invoke skill -> test -> save
    add --trace       Add test case with trace evaluation
    review            Review pending candidates interactively
    review --batch    Batch approve all pending candidates
    baseline          Save current results as regression baseline
    mlflow            Run full MLflow evaluation with LLM judges
    trace-eval        Evaluate traces against skill expectations
    list-traces       List available traces (MLflow or local)
    scorers           List configured scorers for a skill
    scorers update    Add/remove scorers or update default guidelines
    sync              Sync YAML to Unity Catalog (Phase 2)

    Quick Examples

    /skill-test databricks-spark-declarative-pipelines run
    /skill-test databricks-spark-declarative-pipelines add --trace
    /skill-test databricks-spark-declarative-pipelines review --batch --filter-success
    /skill-test my-new-skill init
    

    See Workflows for detailed examples of each subcommand.

    Execution Instructions

    Environment Setup

    uv pip install -e .test/
    

    Environment variables for Databricks MLflow:

    • DATABRICKS_CONFIG_PROFILE - Databricks CLI profile (default: "DEFAULT")
    • MLFLOW_TRACKING_URI - Set to "databricks" for Databricks MLflow
    • MLFLOW_EXPERIMENT_NAME - Experiment path (e.g., "/Users/{user}/skill-test")
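
    For example, a minimal shell setup before running the mlflow subcommand; the profile name and experiment path below are placeholders to adjust for your workspace:

    export DATABRICKS_CONFIG_PROFILE=DEFAULT
    export MLFLOW_TRACKING_URI=databricks
    export MLFLOW_EXPERIMENT_NAME="/Users/<your-user>/skill-test"
    uv run python .test/scripts/mlflow_eval.py databricks-spark-declarative-pipelines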

    Running Scripts

    All subcommands have corresponding scripts in .test/scripts/:

    uv run python .test/scripts/{subcommand}.py {skill_name} [options]
    
    Subcommand        Script
    run               run_eval.py
    regression        regression.py
    init              init_skill.py
    add               add.py
    review            review.py
    baseline          baseline.py
    mlflow            mlflow_eval.py
    scorers           scorers.py
    scorers update    scorers_update.py
    sync              sync.py
    trace-eval        trace_eval.py
    list-traces       list_traces.py
    _routing mlflow   routing_eval.py

    Use --help on any script for available options.
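
    For instance, following the pattern above (the skill name comes from the Quick Examples; exact options vary per script):

    uv run python .test/scripts/run_eval.py databricks-spark-declarative-pipelines
    uv run python .test/scripts/review.py databricks-spark-declarative-pipelines --help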

    Command Handler

    When /skill-test is invoked, parse arguments and execute the appropriate command.

    Argument Parsing

    • args[0] = skill_name (required)
    • args[1] = subcommand (optional, default: "run")
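
    For example, a few invocations and how they would map under this scheme (the last line assumes remaining tokens are forwarded to the subcommand as options):

    /skill-test my-new-skill                   # skill_name="my-new-skill", subcommand defaults to "run"
    /skill-test my-new-skill init              # skill_name="my-new-skill", subcommand="init"
    /skill-test my-new-skill review --batch    # subcommand="review", with --batch passed through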

    Subcommand Routing

    Subcommand        Action
    run               Execute run(skill_name, ctx) and display results
    regression        Execute regression(skill_name, ctx) and display comparison
    init              Execute init(skill_name, ctx) to create scaffolding
    add               Prompt for test input, invoke skill, run interactive()
    review            Execute review(skill_name, ctx) to review pending candidates
    baseline          Execute baseline(skill_name, ctx) to save as regression baseline
    mlflow            Execute mlflow_eval(skill_name, ctx) with MLflow logging
    scorers           Execute scorers(skill_name, ctx) to list configured scorers
    scorers update    Execute scorers_update(skill_name, ctx, ...) to modify scorers

    init Behavior

    When running /skill-test <skill-name> init:

    1. Read the skill's SKILL.md to understand its purpose
    2. Create manifest.yaml with appropriate scorers and trace_expectations
    3. Create empty ground_truth.yaml and candidates.yaml templates
    4. Recommend test prompts based on documentation examples

    Follow up with /skill-test <skill-name> add, using the recommended prompts.
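
    A typical first pass for a new skill therefore looks like this (skill name taken from the Quick Examples above):

    /skill-test my-new-skill init    # scaffold manifest.yaml, ground_truth.yaml, candidates.yaml
    /skill-test my-new-skill add     # add test cases using the prompts recommended by init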

    Context Setup

    Create a CLIContext with MCP tools before calling any command. See Python API for details.

    File Locations

    Important: All test files are stored at the repository root level, not relative to this skill's directory.

    File Type         Path
    Ground truth      {repo_root}/.test/skills/{skill-name}/ground_truth.yaml
    Candidates        {repo_root}/.test/skills/{skill-name}/candidates.yaml
    Manifest          {repo_root}/.test/skills/{skill-name}/manifest.yaml
    Routing tests     {repo_root}/.test/skills/_routing/ground_truth.yaml
    Baselines         {repo_root}/.test/baselines/{skill-name}/baseline.yaml

    For example, to test databricks-spark-declarative-pipelines in this repository:

    /Users/.../ai-dev-kit/.test/skills/databricks-spark-declarative-pipelines/ground_truth.yaml
    

    Not relative to the skill definition:

    /Users/.../ai-dev-kit/.claude/skills/skill-test/skills/...  # WRONG
    

    Directory Structure

    .test/                          # At REPOSITORY ROOT (not skill directory)
    ├── pyproject.toml              # Package config (pip install -e ".test/")
    ├── README.md                   # Contributor documentation
    ├── SKILL.md                    # Source of truth (synced to .claude/skills/)
    ├── install_skill_test.sh       # Sync script
    ├── scripts/                    # Wrapper scripts
    │   ├── _common.py              # Shared utilities
    │   ├── run_eval.py
    │   ├── regression.py
    │   ├── init_skill.py
    │   ├── add.py
    │   ├── baseline.py
    │   ├── mlflow_eval.py
    │   ├── routing_eval.py
    │   ├── trace_eval.py           # Trace evaluation
    │   ├── list_traces.py          # List available traces
    │   ├── scorers.py
    │   ├── scorers_update.py
    │   └── sync.py
    ├── src/
    │   └── skill_test/             # Python package
    │       ├── cli/                # CLI commands module
    │       ├── fixtures/           # Test fixture setup
    │       ├── scorers/            # Evaluation scorers
    │       ├── grp/                # Generate-Review-Promote pipeline
    │       └── runners/            # Evaluation runners
    ├── skills/                     # Per-skill test definitions
    │   ├── _routing/               # Routing test cases
    │   └── {skill-name}/           # Skill-specific tests
    │       ├── ground_truth.yaml
    │       ├── candidates.yaml
    │       └── manifest.yaml
    ├── tests/                      # Unit tests
    ├── references/                 # Documentation references
    └── baselines/                  # Regression baselines
    

    References

    • Scorers - Available scorers and quality gates
    • YAML Schemas - Manifest and ground truth formats
    • Python API - Programmatic usage examples
    • Workflows - Detailed example workflows
    • Trace Evaluation - Session trace analysis