Evaluate LLM outputs systematically with multi-dimensional rubrics, handle non-determinism, and implement LLM-as-judge patterns. Essential for production LLM systems.
Research shows ~95% of output variance comes from just two sources: the prompt (~80%) and sampling randomness (~15%).
Temperature, model version, and other factors account for only ~5%.
Implication: Focus evaluation effort on prompt quality, not model tweaking.
Don't use single scores. Break evaluation down into dimensions:
| Dimension | Weight | Criteria |
|---|---|---|
| Accuracy | 30% | Factually correct, no hallucinations |
| Completeness | 25% | Addresses all requirements |
| Clarity | 20% | Well-organized, easy to understand |
| Conciseness | 15% | No unnecessary content |
| Format | 10% | Follows specified structure |
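The table collapses to a single comparable number by weighting each dimension's 1-5 score. A minimal sketch of that calculation; the `RubricItem` shape and `weightedScore` helper are illustrative, not part of any library:

```typescript
// Hypothetical rubric-scoring helper: combines per-dimension scores (1-5)
// into one weighted score using the weights from the table above.
interface RubricItem {
  name: string
  weight: number   // fraction of the total; weights should sum to 1.0
  score?: number   // 1-5, filled in by a human reviewer or a judge LLM
}

function weightedScore(items: RubricItem[]): number {
  const totalWeight = items.reduce((sum, item) => sum + item.weight, 0)
  const weighted = items.reduce(
    (sum, item) => sum + (item.score ?? 0) * item.weight,
    0,
  )
  return weighted / totalWeight   // stays on the 1-5 scale
}

// Example: accuracy 5, completeness 4, clarity 4, conciseness 3, format 5
const rubric: RubricItem[] = [
  { name: 'accuracy', weight: 0.3, score: 5 },
  { name: 'completeness', weight: 0.25, score: 4 },
  { name: 'clarity', weight: 0.2, score: 4 },
  { name: 'conciseness', weight: 0.15, score: 3 },
  { name: 'format', weight: 0.1, score: 5 },
]
console.log(weightedScore(rubric).toFixed(2)) // 4.25
```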
LLMs are non-deterministic. Handle this with the strategies below (a code sketch follows the list):
Strategy 1: Multiple Runs
- Run same prompt 3-5 times
- Report mean and variance
- Flag high-variance cases
Strategy 2: Seed Control
- Set temperature=0 for reproducibility
- Document seed for debugging
- Accept that some variation is normal
Strategy 3: Statistical Significance
- Use paired comparisons
- Require 70%+ win rate for "better"
- Report confidence intervals
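A minimal sketch of Strategy 1: run the same case several times, report mean and variance, and flag high-variance cases for review. The `runCase` callback and the 0.5 variance threshold are assumptions for illustration:

```typescript
// Strategy 1 sketch: repeated runs with mean/variance reporting.
// `runCase` stands in for whatever call produces a 1-5 judge score
// for one test case; it is not a real library function.
async function evaluateWithRepeats(
  runCase: () => Promise<number>,   // returns a 1-5 score for one run
  runs = 3,
  varianceThreshold = 0.5,          // assumed cutoff for "high variance"
) {
  const scores: number[] = []
  for (let i = 0; i < runs; i++) {
    scores.push(await runCase())
  }
  const mean = scores.reduce((a, b) => a + b, 0) / scores.length
  const variance =
    scores.reduce((sum, s) => sum + (s - mean) ** 2, 0) / scores.length
  return {
    scores,
    mean,
    variance,
    highVariance: variance > varianceThreshold,  // flag for manual review
  }
}
```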
Use a judge LLM to evaluate outputs:
```
┌─────────────┐     ┌─────────────┐     ┌─────────────┐
│   Prompt    │────▶│  Test LLM   │────▶│   Output    │
└─────────────┘     └─────────────┘     └─────────────┘
                                               │
                                               ▼
                    ┌─────────────┐     ┌─────────────┐
                    │   Rubric    │────▶│  Judge LLM  │
                    └─────────────┘     └─────────────┘
                                               │
                                               ▼
                                        ┌─────────────┐
                                        │    Score    │
                                        └─────────────┘
```
Best Practice: Use a stronger model as the judge (e.g., Opus judges Sonnet).
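As a sketch, the judge receives the rubric and the candidate output and returns a structured verdict. The prompt wording, the `callJudge` callback, and the JSON response shape are illustrative assumptions, not a fixed API:

```typescript
// LLM-as-judge sketch. `callJudge` stands in for a call to whatever
// stronger model you use as the judge; it is not a real SDK function.
interface JudgeVerdict {
  score: number        // 1-5 on the rubric scale
  reasoning: string
}

async function judgeOutput(
  output: string,
  rubricYaml: string,
  callJudge: (prompt: string) => Promise<string>,
): Promise<JudgeVerdict> {
  const prompt = [
    'You are an evaluation judge. Score the output against the rubric.',
    `<rubric>\n${rubricYaml}\n</rubric>`,
    `<output>\n${output}\n</output>`,
    'Respond with JSON only: {"score": <1-5>, "reasoning": "<one sentence>"}',
  ].join('\n\n')

  const raw = await callJudge(prompt)
  return JSON.parse(raw) as JudgeVerdict   // assumes the judge complies
}
```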
Structure test cases with:
```typescript
interface TestCase {
  id: string
  input: string              // User message or context
  expectedBehavior: string   // What the output should do
  rubric: RubricItem[]       // Evaluation criteria
  groundTruth?: string       // Optional gold standard
  metadata: {
    category: string
    difficulty: 'easy' | 'medium' | 'hard'
    createdAt: string
  }
}
```
```yaml
rubric:
  dimensions:
    - name: accuracy
      weight: 0.3
      criteria:
        5: "Completely accurate, no errors"
        4: "Minor errors, doesn't affect correctness"
        3: "Some errors, partially correct"
        2: "Significant errors, mostly incorrect"
        1: "Completely incorrect or hallucinated"
```

```yaml
test_cases:
  - id: "code-gen-001"
    input: "Write a function to reverse a string"
    expected_behavior: "Returns working reverse function"
    ground_truth: |
      function reverse(s: string): string {
        return s.split('').reverse().join('')
      }
```
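A short sketch of wiring the templates into a runner: load the YAML suite, score each case, and aggregate a pass rate. The `runTestLLM` and `scoreOutput` callbacks and the score ≥ 4 pass bar are assumptions for illustration; this assumes the `js-yaml` package is available:

```typescript
import { readFileSync } from 'fs'
import { load } from 'js-yaml'

// Shapes matching the YAML template above (snake_case keys as in the file).
interface YamlTestCase {
  id: string
  input: string
  expected_behavior: string
  ground_truth?: string
}

async function runSuite(
  suitePath: string,
  runTestLLM: (input: string) => Promise<string>,                      // model under test
  scoreOutput: (output: string, tc: YamlTestCase) => Promise<number>,  // judge, 1-5
) {
  const suite = load(readFileSync(suitePath, 'utf8')) as {
    test_cases: YamlTestCase[]
  }
  const results: { id: string; score: number; pass: boolean }[] = []
  for (const tc of suite.test_cases) {
    const output = await runTestLLM(tc.input)
    const score = await scoreOutput(output, tc)
    results.push({ id: tc.id, score, pass: score >= 4 })  // assumed pass bar
  }
  const passRate = results.filter(r => r.pass).length / results.length
  return { results, passRate }
}
```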
```bash
# Run test suite
python evaluate.py --suite code-generation --runs 3

# Output
# ┌─────────────────────────────────────────────┐
# │ Test Suite: code-generation                 │
# │ Total: 50 | Pass: 47 | Fail: 3              │
# │ Accuracy: 94% (±2.1%)                       │
# │ Avg Score: 4.2/5.0                          │
# └─────────────────────────────────────────────┘
```
Use evaluation as a regression test (a comparison sketch follows these steps):
1. Write test cases (expected behavior)
2. Run a baseline evaluation
3. Modify the prompt/implementation
4. Run the evaluation again
5. Compare: are the new scores ≥ the baseline?
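A minimal sketch of the paired comparison from Strategy 3: for each test case, compare baseline and candidate scores and require a clear win rate before declaring the change "better". The 70% threshold comes from the strategy above; the data shape and the decision to exclude ties are assumptions:

```typescript
// Paired regression check: did the new prompt beat the baseline?
interface PairedResult {
  id: string
  baselineScore: number   // 1-5 judge score with the old prompt
  candidateScore: number  // 1-5 judge score with the new prompt
}

function isRegressionSafe(results: PairedResult[], winRateThreshold = 0.7) {
  // Ties carry no signal either way, so they are excluded here (assumption).
  const decisive = results.filter(r => r.baselineScore !== r.candidateScore)
  if (decisive.length === 0) return { winRate: 0, better: false }
  const wins = decisive.filter(r => r.candidateScore > r.baselineScore).length
  const winRate = wins / decisive.length
  // Require a 70%+ win rate among non-tied cases before accepting the change.
  return { winRate, better: winRate >= winRateThreshold }
}
```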
```
acquire → prepare → process → parse → render → EVALUATE
                                                  │
                                          ┌───────┴───────┐
                                          │  Compare to   │
                                          │ ground truth  │
                                          │   or rubric   │
                                          └───────────────┘
```
Current prompt → Evaluate → Score: 3.2
Apply principles → Improve prompt
New prompt → Evaluate → Score: 4.1 ✓
- prompt-engineering - Improve prompts based on evaluation
- testing-strategy - Overall testing approaches
- llm-project-development - Pipeline with evaluation stage

```bash
# Design your rubric
cat templates/rubric-template.yaml

# Create test cases
cat templates/test-case-template.yaml

# Learn LLM-as-judge
cat reference/llm-as-judge-guide.md

# Run evaluation checklist
cat checklists/evaluation-setup-checklist.md
```
Skill Version: 1.0
Key Finding: 95% of output variance comes from prompts (80%) + sampling (15%)
Last Updated: 2025-01-15