    LLM Evaluation & Testing

    Comprehensive guide to evaluating and testing LLM applications including prompt testing, output validation, hallucination detection, benchmark creation, A/B testing, and quality metrics.


    Quick Reference

    When to use this skill:

    • Testing LLM application outputs
    • Validating prompt quality and consistency
    • Detecting hallucinations and factual errors
    • Creating evaluation benchmarks
    • A/B testing prompts or models
    • Implementing continuous evaluation (CI/CD)
    • Measuring retrieval quality (for RAG)
    • Debugging unexpected LLM behavior

    Metrics covered:

    • Traditional: BLEU, ROUGE, BERTScore, Perplexity
    • LLM-as-Judge: GPT-4 evaluation, rubric-based scoring
    • Task-specific: Exact match, F1, accuracy, recall
    • Quality: Toxicity, bias, coherence, relevance

    Part 1: Evaluation Fundamentals

    The LLM Evaluation Challenge

    Why LLM evaluation is hard:

    1. Subjective quality - "Good" output varies by use case
    2. No single ground truth - Multiple valid answers
    3. Context-dependent - Same output good/bad in different scenarios
    4. Expensive to label - Human evaluation doesn't scale
    5. Adversarial brittleness - Small prompt changes = large output changes

    Solution: Multi-layered evaluation

    Layer 1: Automated Metrics (fast, scalable)
      ↓
    Layer 2: LLM-as-Judge (flexible, nuanced)
      ↓
    Layer 3: Human Review (gold standard, expensive)
    

    Evaluation Dataset Structure

    from dataclasses import dataclass
    from typing import List, Optional
    
    @dataclass
    class EvalExample:
        """Single evaluation example."""
        input: str  # User input / prompt
        expected_output: Optional[str]  # Gold standard (if exists)
        context: Optional[str]  # Additional context (for RAG)
        metadata: dict  # Category, difficulty, etc.
    
    @dataclass
    class EvalResult:
        """Evaluation result for one example."""
        example_id: str
        actual_output: str
        scores: dict  # {'metric_name': score}
        passed: bool
        failure_reason: Optional[str]
    
    # Example dataset
    eval_dataset = [
        EvalExample(
            input="What is the capital of France?",
            expected_output="Paris",
            context=None,
            metadata={'category': 'factual', 'difficulty': 'easy'}
        ),
        EvalExample(
            input="Explain quantum entanglement",
            expected_output=None,  # No single answer
            context=None,
            metadata={'category': 'explanation', 'difficulty': 'hard'}
        )
    ]
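
    Tying the two dataclasses together, here is a minimal runner sketch. It assumes a caller-supplied generate_fn (your LLM call) and a score_fn such as exact_match from Part 2; those names and the 0.5 pass threshold are illustrative placeholders rather than a fixed API.

    def run_eval(dataset: List[EvalExample], generate_fn, score_fn, threshold: float = 0.5) -> List[EvalResult]:
        """Generate an output for every example and score it against the gold answer when one exists."""
        results = []
        for i, example in enumerate(dataset):
            output = generate_fn(example.input)
            if example.expected_output is not None:
                score = score_fn(output, example.expected_output)
            else:
                score = None  # No single gold answer: defer to LLM-as-judge or human review (Part 3)
            passed = score is None or score >= threshold
            results.append(EvalResult(
                example_id=str(i),
                actual_output=output,
                scores={'primary': score},
                passed=passed,
                failure_reason=None if passed else f"score {score} below threshold {threshold}"
            ))
        return results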
    

    Part 2: Traditional Metrics

    Metric 1: Exact Match (Simplest)

    def exact_match(predicted: str, expected: str, case_sensitive: bool = False) -> float:
        """
        Binary metric: 1.0 if match, 0.0 otherwise.
    
        Use for: Classification, short answers, structured output
        Limitations: Too strict for generation tasks
        """
        if not case_sensitive:
            predicted = predicted.lower().strip()
            expected = expected.lower().strip()
    
        return 1.0 if predicted == expected else 0.0
    
    # Example
    score = exact_match("Paris", "paris")  # 1.0
    score = exact_match("The capital is Paris", "Paris")  # 0.0
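
    Exact match fails the second example even though the answer contains "Paris". For the task-specific F1 listed in the quick reference, a common softer option is token-level F1 (partial credit for word overlap, as used in extractive QA benchmarks); a minimal sketch:

    from collections import Counter

    def token_f1(predicted: str, expected: str) -> float:
        """Token-overlap F1: partial credit when the prediction shares words with the gold answer."""
        pred_tokens = predicted.lower().split()
        gold_tokens = expected.lower().split()
        if not pred_tokens or not gold_tokens:
            return 0.0
        common = Counter(pred_tokens) & Counter(gold_tokens)  # Multiset intersection
        num_same = sum(common.values())
        if num_same == 0:
            return 0.0
        precision = num_same / len(pred_tokens)
        recall = num_same / len(gold_tokens)
        return 2 * precision * recall / (precision + recall)

    # Example
    token_f1("The capital is Paris", "Paris")  # 0.4 (recall 1.0, precision 0.25)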
    

    Metric 2: ROUGE (Recall-Oriented)

    from rouge_score import rouge_scorer
    
    def compute_rouge(predicted: str, expected: str) -> dict:
        """
        ROUGE metrics for text overlap.
    
        ROUGE-1: Unigram overlap
        ROUGE-2: Bigram overlap
        ROUGE-L: Longest common subsequence
    
        Use for: Summarization, translation
        Limitations: Doesn't capture semantics
        """
        scorer = rouge_scorer.RougeScorer(['rouge1', 'rouge2', 'rougeL'], use_stemmer=True)
        scores = scorer.score(expected, predicted)
    
        return {
            'rouge1': scores['rouge1'].fmeasure,
            'rouge2': scores['rouge2'].fmeasure,
            'rougeL': scores['rougeL'].fmeasure
        }
    
    # Example
    scores = compute_rouge(
        predicted="Paris is the capital of France",
        expected="The capital of France is Paris"
    )
    # {'rouge1': 0.82, 'rouge2': 0.67, 'rougeL': 0.82}
    

    Metric 3: BERTScore (Semantic Similarity)

    from typing import List

    from bert_score import score as bert_score

    def compute_bertscore(predicted: List[str], expected: List[str]) -> dict:
        """
        Semantic similarity using BERT embeddings.
    
        Better than ROUGE for:
        - Paraphrases
        - Semantic equivalence
        - Generation quality
    
        Returns: Precision, Recall, F1
        """
        P, R, F1 = bert_score(predicted, expected, lang="en", verbose=False)
    
        return {
            'precision': P.mean().item(),
            'recall': R.mean().item(),
            'f1': F1.mean().item()
        }
    
    # Example
    scores = compute_bertscore(
        predicted=["The capital of France is Paris"],
        expected=["Paris is France's capital city"]
    )
    # {'precision': 0.94, 'recall': 0.91, 'f1': 0.92}
    

    Metric 4: Perplexity (Model Confidence)

    import torch
    from transformers import GPT2LMHeadModel, GPT2Tokenizer
    
    def compute_perplexity(text: str, model_name: str = "gpt2") -> float:
        """
        Perplexity: How "surprised" is the model by this text?
    
        Lower = More likely/fluent
        Use for: Fluency, naturalness
        Limitations: Doesn't measure correctness
        """
        model = GPT2LMHeadModel.from_pretrained(model_name)
        tokenizer = GPT2Tokenizer.from_pretrained(model_name)
    
        inputs = tokenizer(text, return_tensors="pt")
    
        with torch.no_grad():
            outputs = model(**inputs, labels=inputs["input_ids"])
            loss = outputs.loss
    
        perplexity = torch.exp(loss).item()
        return perplexity
    
    # Example
    ppl = compute_perplexity("Paris is the capital of France")  # Low (fluent)
    ppl2 = compute_perplexity("Capital France the is Paris of")  # High (awkward)
    

    Part 3: LLM-as-Judge Evaluation

    Pattern 1: Rubric-Based Scoring

    from openai import OpenAI
    
    client = OpenAI()
    
    EVALUATION_PROMPT = """
    You are an expert evaluator. Score the assistant's response on a scale of 1-5 for each criterion:
    
    **Criteria:**
    1. **Accuracy**: Is the information factually correct?
    2. **Completeness**: Does it fully answer the question?
    3. **Clarity**: Is it easy to understand?
    4. **Conciseness**: Is it appropriately brief?
    
    **Response to evaluate:**
    {response}
    
    **Expected answer (reference):**
    {expected}
    
    Provide scores in JSON format:
    {{
      "accuracy": <1-5>,
      "completeness": <1-5>,
      "clarity": <1-5>,
      "conciseness": <1-5>,
      "reasoning": "Brief explanation"
    }}
    """
    
    def llm_judge_score(response: str, expected: str) -> dict:
        """
        Use GPT-4 as judge with rubric scoring.
    
        Pros: Flexible, nuanced, scales well
        Cons: Costs $, potential bias, slower
        """
        prompt = EVALUATION_PROMPT.format(response=response, expected=expected)
    
        completion = client.chat.completions.create(
            model="gpt-4",
            messages=[{"role": "user", "content": prompt}],
            response_format={"type": "json_object"}
        )
    
        import json
        scores = json.loads(completion.choices[0].message.content)
        return scores
    
    # Example
    scores = llm_judge_score(
        response="Paris is the capital of France, located in the north-central part of the country.",
        expected="Paris"
    )
    # {'accuracy': 5, 'completeness': 5, 'clarity': 5, 'conciseness': 3, 'reasoning': '...'}
    

    Pattern 2: Binary Pass/Fail Evaluation

    PASS_FAIL_PROMPT = """
    Evaluate if the assistant's response is acceptable.
    
    **Question:** {question}
    **Response:** {response}
    **Criteria:** {criteria}
    
    Return ONLY "PASS" or "FAIL" followed by a one-sentence reason.
    """
    
    def binary_eval(question: str, response: str, criteria: str) -> tuple[bool, str]:
        """
        Simple pass/fail evaluation.
    
        Use for: Unit tests, regression tests, CI/CD
        """
        prompt = PASS_FAIL_PROMPT.format(
            question=question,
            response=response,
            criteria=criteria
        )
    
        completion = client.chat.completions.create(
            model="gpt-4",
            messages=[{"role": "user", "content": prompt}],
            temperature=0.0  # Deterministic
        )
    
        result = completion.choices[0].message.content
        passed = result.startswith("PASS")
        reason = result.split(":", 1)[1].strip() if ":" in result else result
    
        return passed, reason
    
    # Example
    passed, reason = binary_eval(
        question="What is the capital of France?",
        response="The capital is Paris",
        criteria="Response must mention Paris"
    )
    # (True, "Response correctly identifies Paris as the capital")
    

    Pattern 3: Pairwise Comparison (A/B Testing)

    PAIRWISE_PROMPT = """
    Compare two responses to the same question. Which is better?
    
    **Question:** {question}
    
    **Response A:**
    {response_a}
    
    **Response B:**
    {response_b}
    
    **Criteria:** {criteria}
    
    Return ONLY: "A", "B", or "TIE", followed by a one-sentence explanation.
    """
    
    def pairwise_comparison(
        question: str,
        response_a: str,
        response_b: str,
        criteria: str = "Overall quality, accuracy, and helpfulness"
    ) -> tuple[str, str]:
        """
        A/B test two responses.
    
        Use for: Prompt engineering, model comparison
        """
        prompt = PAIRWISE_PROMPT.format(
            question=question,
            response_a=response_a,
            response_b=response_b,
            criteria=criteria
        )
    
        completion = client.chat.completions.create(
            model="gpt-4",
            messages=[{"role": "user", "content": prompt}],
            temperature=0.0
        )
    
        result = completion.choices[0].message.content
        winner = result.split()[0].strip('".:,')  # "A", "B", or "TIE" (strip stray punctuation)
        reason = result.split(":", 1)[1].strip() if ":" in result else result
    
        return winner, reason
    
    # Example
    winner, reason = pairwise_comparison(
        question="Explain quantum computing",
        response_a="Quantum computers use qubits instead of bits...",
        response_b="Quantum computing is complex. It uses quantum mechanics."
    )
    # ("A", "Response A provides more detail and explanation")
    

    Part 4: Hallucination Detection

    Method 1: Grounding Check

    def check_grounding(response: str, context: str) -> dict:
        """
        Verify response is grounded in provided context.
    
        Critical for RAG systems.
        """
        GROUNDING_PROMPT = """
        Context: {context}
    
        Response: {response}
    
        Is the response fully supported by the context? Answer with:
        - "GROUNDED": All claims supported
        - "PARTIALLY_GROUNDED": Some claims unsupported
        - "NOT_GROUNDED": Contains unsupported claims
    
        List any unsupported claims.
        """
    
        prompt = GROUNDING_PROMPT.format(context=context, response=response)
    
        completion = client.chat.completions.create(
            model="gpt-4",
            messages=[{"role": "user", "content": prompt}]
        )
    
        result = completion.choices[0].message.content
        status = result.split("\n")[0]
        unsupported = [line for line in result.split("\n")[1:] if line.strip()]
    
        return {
            'grounding_status': status,
            'unsupported_claims': unsupported,
            'is_hallucination': status != "GROUNDED"
        }
    

    Method 2: Factuality Check (External Verification)

    def check_factuality(claim: str, use_search: bool = True) -> dict:
        """
        Verify factual claims using external sources.
    
        Options:
        1. Web search + verification
        2. Knowledge base lookup
        3. Cross-reference with trusted source
        """
        if use_search:
            # Use web search to verify
            from tavily import TavilyClient
            tavily = TavilyClient(api_key="your-key")
    
            # Search for evidence
            results = tavily.search(claim, max_results=3)
    
            # Ask LLM to verify based on search results
            VERIFY_PROMPT = """
            Claim: {claim}
    
            Search results:
            {results}
    
            Is the claim supported by these sources? Answer: TRUE, FALSE, or UNCERTAIN.
            Explanation:
            """
    
            prompt = VERIFY_PROMPT.format(
                claim=claim,
                results="\n\n".join([r['content'] for r in results])
            )
    
            completion = client.chat.completions.create(
                model="gpt-4",
                messages=[{"role": "user", "content": prompt}]
            )
    
            result = completion.choices[0].message.content
            is_factual = result.startswith("TRUE")
    
            return {
                'claim': claim,
                'factual': is_factual,
                'evidence': results,
                'explanation': result
            }

        # If search is disabled, return an explicitly unverified result rather than None
        return {
            'claim': claim,
            'factual': None,
            'evidence': [],
            'explanation': 'No external verification performed'
        }
    

    Method 3: Self-Consistency Check

    def self_consistency_check(question: str, num_samples: int = 5) -> dict:
        """
        Generate multiple responses, check for consistency.
    
        If model is confident, responses should be consistent.
        Inconsistency suggests hallucination risk.
        """
        responses = []
    
        for _ in range(num_samples):
            completion = client.chat.completions.create(
                model="gpt-4",
                messages=[{"role": "user", "content": question}],
                temperature=0.7  # Some randomness
            )
            responses.append(completion.choices[0].message.content)
    
        # Compute pairwise similarity
        from sklearn.feature_extraction.text import TfidfVectorizer
        from sklearn.metrics.pairwise import cosine_similarity
    
        vectorizer = TfidfVectorizer()
        vectors = vectorizer.fit_transform(responses)
        similarities = cosine_similarity(vectors)
    
        # Average pairwise similarity (exclude the diagonal of self-similarities, which are all 1.0)
        n = len(responses)
        avg_similarity = (similarities.sum() - n) / (n * (n - 1))
    
        return {
            'responses': responses,
            'avg_similarity': avg_similarity,
            'is_consistent': avg_similarity > 0.7,  # Threshold
            'confidence': 'high' if avg_similarity > 0.85 else 'medium' if avg_similarity > 0.7 else 'low'
        }
    

    Part 5: RAG-Specific Evaluation

    Retrieval Quality Metrics

    from typing import List

    def evaluate_retrieval(query: str, retrieved_docs: List[dict], relevant_doc_ids: List[str]) -> dict:
        """
        Evaluate retrieval quality using IR metrics.

        Precision: What % of retrieved docs are relevant?
        Recall: What % of relevant docs were retrieved?
        MRR: Mean Reciprocal Rank
        NDCG: Normalized Discounted Cumulative Gain (see the sketch after this function)
        """
        retrieved_ids = [doc['id'] for doc in retrieved_docs]
    
        # Precision
        true_positives = len(set(retrieved_ids) & set(relevant_doc_ids))
        precision = true_positives / len(retrieved_ids) if retrieved_ids else 0.0
    
        # Recall
        recall = true_positives / len(relevant_doc_ids) if relevant_doc_ids else 0.0
    
        # F1
        f1 = 2 * (precision * recall) / (precision + recall) if (precision + recall) > 0 else 0.0
    
        # MRR (Mean Reciprocal Rank)
        mrr = 0.0
        for i, doc_id in enumerate(retrieved_ids, 1):
            if doc_id in relevant_doc_ids:
                mrr = 1.0 / i
                break
    
        return {
            'precision': precision,
            'recall': recall,
            'f1': f1,
            'mrr': mrr,
            'num_retrieved': len(retrieved_ids),
            'num_relevant_retrieved': true_positives
        }
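
    The docstring above also names NDCG, which rewards ranking relevant documents near the top. Below is a minimal NDCG@k sketch with binary relevance (gain 1 if a retrieved id appears in relevant_doc_ids, else 0); swap in graded gains if you have graded relevance judgments.

    import math

    def ndcg_at_k(retrieved_ids: List[str], relevant_doc_ids: List[str], k: int = 10) -> float:
        """NDCG@k with binary relevance labels."""
        gains = [1.0 if doc_id in relevant_doc_ids else 0.0 for doc_id in retrieved_ids[:k]]
        dcg = sum(g / math.log2(rank + 2) for rank, g in enumerate(gains))
        # Ideal DCG: every relevant document ranked first
        ideal_gains = [1.0] * min(len(relevant_doc_ids), k)
        idcg = sum(g / math.log2(rank + 2) for rank, g in enumerate(ideal_gains))
        return dcg / idcg if idcg > 0 else 0.0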
    

    End-to-End RAG Evaluation

    def evaluate_rag_pipeline(
        question: str,
        generated_answer: str,
        retrieved_docs: List[dict],
        ground_truth: str,
        relevant_doc_ids: List[str]
    ) -> dict:
        """
        Comprehensive RAG evaluation.
    
        1. Retrieval quality (precision, recall)
        2. Answer quality (ROUGE, BERTScore)
        3. Answer grounding (hallucination check)
        4. Citation accuracy
        """
        # 1. Retrieval metrics
        retrieval_scores = evaluate_retrieval(question, retrieved_docs, relevant_doc_ids)
    
        # 2. Answer quality
        context = "\n\n".join([doc['text'] for doc in retrieved_docs])
    
        rouge_scores = compute_rouge(generated_answer, ground_truth)
        bert_scores = compute_bertscore([generated_answer], [ground_truth])
    
        # 3. Grounding check
        grounding = check_grounding(generated_answer, context)
    
        # 4. LLM-as-judge overall quality
        judge_scores = llm_judge_score(generated_answer, ground_truth)
    
        return {
            'retrieval': retrieval_scores,
            'answer_quality': {
                'rouge': rouge_scores,
                'bertscore': bert_scores
            },
            'grounding': grounding,
            'llm_judge': judge_scores,
            'overall_pass': (
                retrieval_scores['f1'] > 0.5 and
                grounding['grounding_status'] == "GROUNDED" and
                judge_scores['accuracy'] >= 4
            )
        }
    

    Part 6: Prompt Testing Frameworks

    Framework 1: Regression Test Suite

    class PromptTestSuite:
        """
        Unit tests for prompts (like pytest for LLMs).
        """
    
        def __init__(self):
            self.tests = []
            self.results = []
    
        def add_test(self, name: str, input: str, criteria: str):
            """Add a test case."""
            self.tests.append({
                'name': name,
                'input': input,
                'criteria': criteria
            })
    
        def run(self, generate_fn):
            """Run all tests with given generation function."""
            for test in self.tests:
                response = generate_fn(test['input'])
                passed, reason = binary_eval(
                    question=test['input'],
                    response=response,
                    criteria=test['criteria']
                )
    
                self.results.append({
                    'test_name': test['name'],
                    'passed': passed,
                    'reason': reason,
                    'response': response
                })
    
            return self.results
    
        def summary(self) -> dict:
            """Get test summary."""
            total = len(self.results)
            passed = sum(1 for r in self.results if r['passed'])
    
            return {
                'total_tests': total,
                'passed': passed,
                'failed': total - passed,
                'pass_rate': passed / total if total > 0 else 0.0
            }
    
    # Usage
    suite = PromptTestSuite()
    suite.add_test("capital_france", "What is the capital of France?", "Must mention Paris")
    suite.add_test("capital_germany", "What is the capital of Germany?", "Must mention Berlin")
    
    def my_generate(prompt):
        # Your LLM call
        return client.chat.completions.create(
            model="gpt-4",
            messages=[{"role": "user", "content": prompt}]
        ).choices[0].message.content
    
    results = suite.run(my_generate)
    print(suite.summary())
    # {'total_tests': 2, 'passed': 2, 'failed': 0, 'pass_rate': 1.0}
    

    Framework 2: A/B Testing Framework

    class ABTest:
        """
        A/B test prompts or models.
        """
    
        def __init__(self, test_cases: List[dict]):
            self.test_cases = test_cases
            self.results = []
    
        def run(self, generate_a, generate_b):
            """Compare two generation functions."""
            for test in self.test_cases:
                response_a = generate_a(test['input'])
                response_b = generate_b(test['input'])
    
                winner, reason = pairwise_comparison(
                    question=test['input'],
                    response_a=response_a,
                    response_b=response_b
                )
    
                self.results.append({
                    'input': test['input'],
                    'response_a': response_a,
                    'response_b': response_b,
                    'winner': winner,
                    'reason': reason
                })
    
            return self.results
    
        def summary(self) -> dict:
            """Aggregate results."""
            total = len(self.results)
            a_wins = sum(1 for r in self.results if r['winner'] == 'A')
            b_wins = sum(1 for r in self.results if r['winner'] == 'B')
            ties = sum(1 for r in self.results if r['winner'] == 'TIE')
    
            return {
                'total_comparisons': total,
                'a_wins': a_wins,
                'b_wins': b_wins,
                'ties': ties,
                'a_win_rate': a_wins / total if total > 0 else 0.0,
                'statistical_significance': self._check_significance(a_wins, b_wins, total)
            }
    
        def _check_significance(self, a_wins, b_wins, total):
            """Two-sided sign test: are the win counts unlikely if both were equally good?"""
            from scipy.stats import binomtest
            # H0: both equally good (p=0.5); ties carry no preference, so test only decisive comparisons
            decisive = a_wins + b_wins
            if decisive == 0:
                return False
            p_value = binomtest(max(a_wins, b_wins), decisive, 0.5).pvalue
            return p_value < 0.05  # Significant at 95% confidence
    

    Part 7: Production Monitoring

    Continuous Evaluation Pipeline

    import logging
    from datetime import datetime
    
    class ProductionMonitor:
        """
        Monitor LLM performance in production.
        """
    
        def __init__(self, sample_rate: float = 0.1):
            self.sample_rate = sample_rate
            self.metrics = []
            self.logger = logging.getLogger(__name__)
    
        def log_interaction(self, user_input: str, model_output: str, metadata: dict):
            """Log interaction for evaluation."""
            import random
    
            # Sample traffic for evaluation
            if random.random() < self.sample_rate:
                # Run automated checks
                toxicity = self._check_toxicity(model_output)
                perplexity = compute_perplexity(model_output)
    
                metric = {
                    'timestamp': datetime.now().isoformat(),
                    'user_input': user_input,
                    'model_output': model_output,
                    'toxicity_score': toxicity,
                    'perplexity': perplexity,
                    'latency_ms': metadata.get('latency_ms'),
                    'model_version': metadata.get('model_version')
                }
    
                self.metrics.append(metric)
    
                # Alert if anomaly detected
                if toxicity > 0.5:
                    self.logger.warning(f"High toxicity detected: {toxicity}")
    
        def _check_toxicity(self, text: str) -> float:
            """Check for toxic content (Detoxify model loaded once and cached)."""
            from detoxify import Detoxify
            if not hasattr(self, '_detoxify_model'):
                self._detoxify_model = Detoxify('original')
            results = self._detoxify_model.predict(text)
            return max(results.values())  # Max score across toxicity categories
    
        def get_metrics(self) -> dict:
            """Aggregate metrics."""
            if not self.metrics:
                return {}
    
            latencies = [m['latency_ms'] for m in self.metrics if m.get('latency_ms')]

            return {
                'total_interactions': len(self.metrics),
                'avg_toxicity': sum(m['toxicity_score'] for m in self.metrics) / len(self.metrics),
                'avg_perplexity': sum(m['perplexity'] for m in self.metrics) / len(self.metrics),
                'avg_latency_ms': sum(latencies) / len(latencies) if latencies else None,
                'high_toxicity_rate': sum(1 for m in self.metrics if m['toxicity_score'] > 0.5) / len(self.metrics)
            }
    

    Part 8: Best Practices

    Practice 1: Layered Evaluation Strategy

    # Layer 1: Fast, cheap automated checks
    from detoxify import Detoxify

    _toxicity_model = Detoxify('original')  # Load once so per-response checks stay fast

    def check_toxicity(text: str) -> float:
        """Max toxicity score across Detoxify categories (same check as ProductionMonitor in Part 7)."""
        return max(_toxicity_model.predict(text).values())

    def quick_checks(response: str) -> bool:
        """Run fast automated checks."""
        # Length check
        if len(response) < 10:
            return False
    
        # Toxicity check
        if check_toxicity(response) > 0.5:
            return False
    
        # Basic coherence (perplexity)
        if compute_perplexity(response) > 100:
            return False
    
        return True
    
    # Layer 2: LLM-as-judge (selective)
    def llm_evaluation(response: str, criteria: str) -> float:
        """Run LLM evaluation on a subset; returns the average 1-5 rubric score."""
        scores = llm_judge_score(response, criteria)  # Criteria text stands in for the reference answer
        numeric = [v for v in scores.values() if isinstance(v, (int, float))]  # Skip the 'reasoning' string
        return sum(numeric) / len(numeric)
    
    # Layer 3: Human review (expensive, critical cases)
    def flag_for_human_review(response: str, confidence: float) -> bool:
        """Determine if human review needed."""
        return (
            confidence < 0.7 or
            len(response) > 1000 or  # Long responses
            "uncertain" in response.lower()  # Model uncertainty
        )
    
    # Combined pipeline
    def evaluate_response(question: str, response: str) -> dict:
        # Layer 1: Quick checks
        if not quick_checks(response):
            return {'status': 'failed_quick_checks', 'human_review': False}
    
        # Layer 2: LLM judge
        score = llm_evaluation(response, "accuracy and helpfulness")
        confidence = score / 5.0
    
        # Layer 3: Human review decision
        needs_human = flag_for_human_review(response, confidence)
    
        return {
            'status': 'passed' if score >= 3.5 else 'failed',
            'score': score,
            'confidence': confidence,
            'human_review': needs_human
        }
    

    Practice 2: Version Your Prompts

    from typing import Dict
    from datetime import datetime
    import hashlib
    
    class PromptVersion:
        """Track prompt versions for A/B testing and rollback."""
    
        def __init__(self):
            self.versions = {}
            self.active_version = None
    
        def register(self, name: str, prompt_template: str, metadata: dict = None):
            """Register a prompt version."""
            version_id = hashlib.md5(prompt_template.encode()).hexdigest()[:8]
    
            self.versions[version_id] = {
                'name': name,
                'template': prompt_template,
                'metadata': metadata or {},
                'created_at': datetime.now(),
                'metrics': {'total_uses': 0, 'avg_score': 0.0}
            }
    
            return version_id
    
        def use(self, version_id: str, **kwargs) -> str:
            """Use a specific prompt version."""
            if version_id not in self.versions:
                raise ValueError(f"Unknown version: {version_id}")
    
            version = self.versions[version_id]
            version['metrics']['total_uses'] += 1
    
            return version['template'].format(**kwargs)
    
        def update_metrics(self, version_id: str, score: float):
            """Update performance metrics for a version."""
            version = self.versions[version_id]
            current_avg = version['metrics']['avg_score']
            total_uses = version['metrics']['total_uses']
    
            # Running average
            new_avg = ((current_avg * (total_uses - 1)) + score) / total_uses
            version['metrics']['avg_score'] = new_avg
    
    # Usage
    pm = PromptVersion()
    
    v1 = pm.register(
        name="question_answering_v1",
        prompt_template="Answer this question: {question}",
        metadata={'author': 'alice', 'date': '2024-01-01'}
    )
    
    v2 = pm.register(
        name="question_answering_v2",
        prompt_template="You are a helpful assistant. Answer: {question}",
        metadata={'author': 'bob', 'date': '2024-01-15'}
    )
    
    # A/B test: route traffic between versions, generate, then score
    prompt = pm.use(v1, question="What is AI?")  # e.g. 50% of traffic uses v1
    response = my_generate(prompt)               # Generation function from Part 6
    score = llm_evaluation(response, "accuracy and helpfulness")
    pm.update_metrics(v1, score)
    

    Quick Decision Trees

    "Which evaluation method should I use?"

    Have ground truth labels?
      YES → ROUGE, BERTScore, Exact Match
      NO  → LLM-as-judge, Human review
    
    Evaluating factual correctness?
      YES → Grounding check, Factuality verification
      NO  → Subjective quality → LLM-as-judge
    
    Need fast feedback (CI/CD)?
      YES → Binary pass/fail tests
      NO  → Comprehensive multi-metric evaluation
    
    Budget constraints?
      Tight → Automated metrics only
      Moderate → LLM-as-judge + sampling
      No limit → Human review gold standard
    

    "How to detect hallucinations?"

    Have source documents (RAG)?
      YES → Grounding check against context
      NO  → Continue
    
    Can verify with search?
      YES → Factuality check with web search
      NO  → Continue
    
    Check model confidence?
      YES → Self-consistency check (multiple samples)
      NO  → Flag for human review
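
    The hallucination tree above can be wired up as a single dispatcher over the checks from Part 4. This is a sketch only: the context and can_search arguments are caller-supplied assumptions, not part of any framework API.

    def detect_hallucination(question: str, response: str, context: str = None, can_search: bool = False) -> dict:
        """Fall through the decision tree: grounding > factuality > self-consistency > human review."""
        if context:
            # Source documents available (RAG): check grounding against them
            result = check_grounding(response, context)
            return {'method': 'grounding', 'suspect': result['is_hallucination'], 'detail': result}
        if can_search:
            # No context, but claims can be verified with web search
            result = check_factuality(response, use_search=True)
            return {'method': 'factuality', 'suspect': not result['factual'], 'detail': result}
        # Otherwise sample the model several times and measure agreement
        result = self_consistency_check(question)
        if result['is_consistent']:
            return {'method': 'self_consistency', 'suspect': False, 'detail': result}
        # Inconsistent and unverifiable: flag for human review
        return {'method': 'human_review', 'suspect': True, 'detail': result}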
    

    Resources

    • ROUGE: https://github.com/google-research/google-research/tree/master/rouge
    • BERTScore: https://github.com/Tiiiger/bert_score
    • OpenAI Evals: https://github.com/openai/evals
    • LangChain Evaluation: https://python.langchain.com/docs/guides/evaluation/
    • Ragas (RAG eval): https://github.com/explodinggradients/ragas

    Skill version: 1.0.0 | Last updated: 2025-10-25 | Maintained by: Applied Artificial Intelligence
