    Databricks Skills Testing Framework

    Offline YAML-first evaluation with human-in-the-loop review and interactive skill improvement.

    Quick References

    • Scorers - Available scorers and quality gates
    • YAML Schemas - Manifest and ground truth formats
    • Python API - Programmatic usage examples
    • Workflows - Detailed example workflows
    • Trace Evaluation - Session trace analysis

    /skill-test Command

    The /skill-test command provides an interactive CLI for testing Databricks skills with real execution on Databricks.

    Basic Usage

    /skill-test <skill-name> [subcommand]
    

    Subcommands

    Subcommand        Description
    run               Run evaluation against ground truth (default)
    regression        Compare current results against baseline
    init              Initialize test scaffolding for a new skill
    add               Interactive: prompt -> invoke skill -> test -> save
    add --trace       Add test case with trace evaluation
    review            Review pending candidates interactively
    review --batch    Batch approve all pending candidates
    baseline          Save current results as regression baseline
    mlflow            Run full MLflow evaluation with LLM judges
    trace-eval        Evaluate traces against skill expectations
    list-traces       List available traces (MLflow or local)
    scorers           List configured scorers for a skill
    scorers update    Add/remove scorers or update default guidelines
    sync              Sync YAML to Unity Catalog (Phase 2)

    Quick Examples

    /skill-test databricks-spark-declarative-pipelines run
    /skill-test databricks-spark-declarative-pipelines add --trace
    /skill-test databricks-spark-declarative-pipelines review --batch --filter-success
    /skill-test my-new-skill init
    

    See Workflows for detailed examples of each subcommand.

    Execution Instructions

    Environment Setup

    uv pip install -e .test/
    

    Environment variables for Databricks MLflow:

    • DATABRICKS_CONFIG_PROFILE - Databricks CLI profile (default: "DEFAULT")
    • MLFLOW_TRACKING_URI - Set to "databricks" for Databricks MLflow
    • MLFLOW_EXPERIMENT_NAME - Experiment path (e.g., "/Users/{user}/skill-test")
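
    For example, a minimal shell setup before running the mlflow subcommand; the profile name and experiment path below are placeholders to adjust for your workspace:

    export DATABRICKS_CONFIG_PROFILE=DEFAULT
    export MLFLOW_TRACKING_URI=databricks
    export MLFLOW_EXPERIMENT_NAME="/Users/<your-user>/skill-test"
    uv run python .test/scripts/mlflow_eval.py databricks-spark-declarative-pipelines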

    Running Scripts

    All subcommands have corresponding scripts in .test/scripts/:

    uv run python .test/scripts/{subcommand}.py {skill_name} [options]
    
    Subcommand        Script
    run               run_eval.py
    regression        regression.py
    init              init_skill.py
    add               add.py
    review            review.py
    baseline          baseline.py
    mlflow            mlflow_eval.py
    scorers           scorers.py
    scorers update    scorers_update.py
    sync              sync.py
    trace-eval        trace_eval.py
    list-traces       list_traces.py
    _routing mlflow   routing_eval.py

    Use --help on any script for available options.
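
    For instance, following the pattern above (the skill name comes from the Quick Examples; exact options vary per script):

    uv run python .test/scripts/run_eval.py databricks-spark-declarative-pipelines
    uv run python .test/scripts/review.py databricks-spark-declarative-pipelines --help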

    Command Handler

    When /skill-test is invoked, parse arguments and execute the appropriate command.

    Argument Parsing

    • args[0] = skill_name (required)
    • args[1] = subcommand (optional, default: "run")
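
    For example, a few invocations and how they would map under this scheme (the last line assumes remaining tokens are forwarded to the subcommand as options):

    /skill-test my-new-skill                   # skill_name="my-new-skill", subcommand defaults to "run"
    /skill-test my-new-skill init              # skill_name="my-new-skill", subcommand="init"
    /skill-test my-new-skill review --batch    # subcommand="review", with --batch passed through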

    Subcommand Routing

    Subcommand        Action
    run               Execute run(skill_name, ctx) and display results
    regression        Execute regression(skill_name, ctx) and display comparison
    init              Execute init(skill_name, ctx) to create scaffolding
    add               Prompt for test input, invoke skill, run interactive()
    review            Execute review(skill_name, ctx) to review pending candidates
    baseline          Execute baseline(skill_name, ctx) to save as regression baseline
    mlflow            Execute mlflow_eval(skill_name, ctx) with MLflow logging
    scorers           Execute scorers(skill_name, ctx) to list configured scorers
    scorers update    Execute scorers_update(skill_name, ctx, ...) to modify scorers

    init Behavior

    When running /skill-test <skill-name> init:

    1. Read the skill's SKILL.md to understand its purpose
    2. Create manifest.yaml with appropriate scorers and trace_expectations
    3. Create empty ground_truth.yaml and candidates.yaml templates
    4. Recommend test prompts based on documentation examples

    Follow up with /skill-test <skill-name> add, using the recommended prompts.
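
    A typical first pass for a new skill therefore looks like this (skill name taken from the Quick Examples above):

    /skill-test my-new-skill init    # scaffold manifest.yaml, ground_truth.yaml, candidates.yaml
    /skill-test my-new-skill add     # add test cases using the prompts recommended by init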

    Context Setup

    Create a CLIContext with MCP tools before calling any command. See Python API for details.

    File Locations

    Important: All test files are stored at the repository root level, not relative to this skill's directory.

    File Type         Path
    Ground truth      {repo_root}/.test/skills/{skill-name}/ground_truth.yaml
    Candidates        {repo_root}/.test/skills/{skill-name}/candidates.yaml
    Manifest          {repo_root}/.test/skills/{skill-name}/manifest.yaml
    Routing tests     {repo_root}/.test/skills/_routing/ground_truth.yaml
    Baselines         {repo_root}/.test/baselines/{skill-name}/baseline.yaml

    For example, to test databricks-spark-declarative-pipelines in this repository:

    /Users/.../ai-dev-kit/.test/skills/databricks-spark-declarative-pipelines/ground_truth.yaml
    

    Not relative to the skill definition:

    /Users/.../ai-dev-kit/.claude/skills/skill-test/skills/...  # WRONG
    

    Directory Structure

    .test/                          # At REPOSITORY ROOT (not skill directory)
    ├── pyproject.toml              # Package config (pip install -e ".test/")
    ├── README.md                   # Contributor documentation
    ├── SKILL.md                    # Source of truth (synced to .claude/skills/)
    ├── install_skill_test.sh       # Sync script
    ├── scripts/                    # Wrapper scripts
    │   ├── _common.py              # Shared utilities
    │   ├── run_eval.py
    │   ├── regression.py
    │   ├── init_skill.py
    │   ├── add.py
    │   ├── baseline.py
    │   ├── mlflow_eval.py
    │   ├── routing_eval.py
    │   ├── trace_eval.py           # Trace evaluation
    │   ├── list_traces.py          # List available traces
    │   ├── scorers.py
    │   ├── scorers_update.py
    │   └── sync.py
    ├── src/
    │   └── skill_test/             # Python package
    │       ├── cli/                # CLI commands module
    │       ├── fixtures/           # Test fixture setup
    │       ├── scorers/            # Evaluation scorers
    │       ├── grp/                # Generate-Review-Promote pipeline
    │       └── runners/            # Evaluation runners
    ├── skills/                     # Per-skill test definitions
    │   ├── _routing/               # Routing test cases
    │   └── {skill-name}/           # Skill-specific tests
    │       ├── ground_truth.yaml
    │       ├── candidates.yaml
    │       └── manifest.yaml
    ├── tests/                      # Unit tests
    ├── references/                 # Documentation references
    └── baselines/                  # Regression baselines
    

    References

    • Scorers - Available scorers and quality gates
    • YAML Schemas - Manifest and ground truth formats
    • Python API - Programmatic usage examples
    • Workflows - Detailed example workflows
    • Trace Evaluation - Session trace analysis