    Promptfoo Evaluation

    Overview

    This skill provides guidance for configuring and running LLM evaluations using Promptfoo, an open-source CLI tool for testing and comparing LLM outputs.

    Quick Start

    # Initialize a new evaluation project
    npx promptfoo@latest init
    
    # Run evaluation
    npx promptfoo@latest eval
    
    # View results in browser
    npx promptfoo@latest view
    

    Configuration Structure

    A typical Promptfoo project structure:

    project/
    ├── promptfooconfig.yaml    # Main configuration
    ├── prompts/
    │   ├── system.md           # System prompt
    │   └── chat.json           # Chat format prompt
    ├── tests/
    │   └── cases.yaml          # Test cases
    └── scripts/
        └── metrics.py          # Custom Python assertions
    

    Core Configuration (promptfooconfig.yaml)

    # yaml-language-server: $schema=https://promptfoo.dev/config-schema.json
    description: "My LLM Evaluation"
    
    # Prompts to test
    prompts:
      - file://prompts/system.md
      - file://prompts/chat.json
    
    # Models to compare
    providers:
      - id: anthropic:messages:claude-sonnet-4-6
        label: Claude-Sonnet-4.6
      - id: openai:gpt-4.1
        label: GPT-4.1
    
    # Test cases
    tests: file://tests/cases.yaml
    
    # Concurrency control (MUST be under commandLineOptions, NOT top-level)
    commandLineOptions:
      maxConcurrency: 2
    
    # Default assertions for all tests
    defaultTest:
      assert:
        - type: python
          value: file://scripts/metrics.py:custom_assert
        - type: llm-rubric
          value: |
            Evaluate the response quality on a 0-1 scale.
          threshold: 0.7
    
    # Output path
    outputPath: results/eval-results.json
    

    Prompt Formats

    Text Prompt (system.md)

    You are a helpful assistant.
    
    Task: {{task}}
    Context: {{context}}
    

    Chat Format (chat.json)

    [
      {"role": "system", "content": "{{system_prompt}}"},
      {"role": "user", "content": "{{user_input}}"}
    ]
    

    Few-Shot Pattern

    Embed examples directly in prompt or use chat format with assistant messages:

    [
      {"role": "system", "content": "{{system_prompt}}"},
      {"role": "user", "content": "Example input: {{example_input}}"},
      {"role": "assistant", "content": "{{example_output}}"},
      {"role": "user", "content": "Now process: {{actual_input}}"}
    ]
    

    Test Cases (tests/cases.yaml)

    - description: "Test case 1"
      vars:
        system_prompt: file://prompts/system.md
        user_input: "Hello world"
        # Load content from files
        context: file://data/context.txt
      assert:
        - type: contains
          value: "expected text"
        - type: python
          value: file://scripts/metrics.py:custom_check
          threshold: 0.8
    

    Python Custom Assertions

    Create a Python file for custom assertions (e.g., scripts/metrics.py):

    def get_assert(output: str, context: dict) -> dict:
        """Default assertion function."""
        vars_dict = context.get('vars', {})
    
        # Access test variables
        expected = vars_dict.get('expected', '')
    
        # Return result
        return {
            "pass": expected in output,
            "score": 0.8,
            "reason": "Contains expected content",
            "named_scores": {"relevance": 0.9}
        }
    
    def custom_check(output: str, context: dict) -> dict:
        """Custom named assertion."""
        word_count = len(output.split())
        passed = 100 <= word_count <= 500
    
        return {
            "pass": passed,
            "score": min(1.0, word_count / 300),
            "reason": f"Word count: {word_count}"
        }
    

    Key points:

    • Default function name is get_assert
    • Specify function with file://path.py:function_name (see the sketch below)
    • Return bool, float (score), or dict with pass/score/reason
    • Access variables via context['vars']
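
    A minimal sketch of both reference styles in a test's assert block, reusing the scripts/metrics.py functions from the example above:

    assert:
      # No function name after the path: promptfoo falls back to get_assert
      - type: python
        value: file://scripts/metrics.py
      # Explicit function name after the colon
      - type: python
        value: file://scripts/metrics.py:custom_check
        threshold: 0.8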

    LLM-as-Judge (llm-rubric)

    assert:
      - type: llm-rubric
        value: |
          Evaluate the response based on:
          1. Accuracy of information
          2. Clarity of explanation
          3. Completeness
    
          Score 0.0-1.0 where 0.7+ is passing.
        threshold: 0.7
        provider: openai:gpt-4.1  # Optional: override grader model
    

    When using a relay/proxy API, each llm-rubric assertion needs its own provider config with apiBaseUrl. Otherwise the grader falls back to the default Anthropic/OpenAI endpoint and gets 401 errors:

    assert:
      - type: llm-rubric
        value: |
          Evaluate quality on a 0-1 scale.
        threshold: 0.7
        provider:
          id: anthropic:messages:claude-sonnet-4-6
          config:
            apiBaseUrl: https://your-relay.example.com/api
    

    Best practices:

    • Provide clear scoring criteria
    • Use threshold to set minimum passing score
    • Default grader uses available API keys (OpenAI → Anthropic → Google)
    • When using relay/proxy: every llm-rubric must have its own provider with apiBaseUrl — the main provider's apiBaseUrl is NOT inherited

    Common Assertion Types

    Type         Usage              Example
    contains     Check substring    value: "hello"
    icontains    Case-insensitive   value: "HELLO"
    equals       Exact match        value: "42"
    regex        Pattern match      value: "\\d{4}"
    python       Custom logic       value: file://script.py
    llm-rubric   LLM grading        value: "Is professional"
    latency      Response time      threshold: 1000
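
    Multiple assertions can be combined on a single test. A small sketch mixing a few of the types above (values are placeholders):

    assert:
      - type: icontains
        value: "hello"
      - type: regex
        value: "\\d{4}"
      - type: latency
        threshold: 1000   # milliseconds
      - type: llm-rubric
        value: "Response is professional in tone"
        threshold: 0.7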

    File References

    All file:// paths are resolved relative to the location of promptfooconfig.yaml (NOT the YAML file containing the reference). This is a common gotcha when tests: points to a separate YAML file, because the file:// paths inside that test file still resolve from the config's directory, not from the test file's own location.

    # Load file content as variable
    vars:
      content: file://data/input.txt
    
    # Load prompt from file
    prompts:
      - file://prompts/main.md
    
    # Load test cases from file
    tests: file://tests/cases.yaml
    
    # Load Python assertion
    assert:
      - type: python
        value: file://scripts/check.py:validate
    

    Running Evaluations

    # Basic run
    npx promptfoo@latest eval
    
    # With specific config
    npx promptfoo@latest eval --config path/to/config.yaml
    
    # Output to file
    npx promptfoo@latest eval --output results.json
    
    # Filter tests
    npx promptfoo@latest eval --filter-metadata category=math
    
    # View results
    npx promptfoo@latest view
    

    Relay / Proxy API Configuration

    When using an API relay or proxy instead of direct Anthropic/OpenAI endpoints:

    providers:
      - id: anthropic:messages:claude-sonnet-4-6
        label: Claude-Sonnet-4.6
        config:
          max_tokens: 4096
          apiBaseUrl: https://your-relay.example.com/api  # Promptfoo appends /v1/messages
    
    # CRITICAL: maxConcurrency MUST be under commandLineOptions (NOT top-level)
    commandLineOptions:
      maxConcurrency: 1  # Respect relay rate limits
    

    Key rules:

    • apiBaseUrl goes in providers[].config — Promptfoo appends /v1/messages automatically
    • maxConcurrency must be under commandLineOptions: — placing it at top level is silently ignored
    • When using relay with LLM-as-judge, set maxConcurrency: 1 to avoid concurrent request limits (generation + grading share the same pool)
    • Pass relay token as ANTHROPIC_API_KEY env var

    Troubleshooting

    Python not found:

    export PROMPTFOO_PYTHON=python3
    

    Large outputs truncated: Outputs over 30000 characters are truncated. Use head_limit in assertions.

    File not found errors: All file:// paths resolve relative to promptfooconfig.yaml location.

    maxConcurrency ignored (shows "up to N at a time"): maxConcurrency must be under commandLineOptions:, not at the YAML top level. This is a common mistake.

    LLM-as-judge returns 401 with relay API: Each llm-rubric assertion must have its own provider with apiBaseUrl. The main provider config is not inherited by grader assertions.

    HTML tags in model output inflating metrics: Models may output <br>, <b>, etc. in structured content. Strip HTML in Python assertions before measuring:

    import re
    clean_text = re.sub(r'<[^>]+>', '', raw_text)
    

    Echo Provider (Preview Mode)

    Use the echo provider to preview rendered prompts without making API calls:

    # promptfooconfig-preview.yaml
    providers:
      - echo  # Returns prompt as output, no API calls
    
    tests:
      - vars:
          input: "test content"
    

    Use cases:

    • Preview prompt rendering before expensive API calls
    • Verify Few-shot examples are loaded correctly
    • Debug variable substitution issues
    • Validate prompt structure

    # Run preview mode
    npx promptfoo@latest eval --config promptfooconfig-preview.yaml
    

    Cost: Free - no API tokens consumed.

    Advanced Few-Shot Implementation

    Multi-turn Conversation Pattern

    For complex few-shot learning, embed full example pairs in the chat prompt. JSON does not allow // comments, so the pairs are grouped with blank lines: the first two user/assistant pairs are few-shot examples (the second is optional), and the final user message is the actual test input.

    [
      {"role": "system", "content": "{{system_prompt}}"},

      {"role": "user", "content": "Task: {{example_input_1}}"},
      {"role": "assistant", "content": "{{example_output_1}}"},

      {"role": "user", "content": "Task: {{example_input_2}}"},
      {"role": "assistant", "content": "{{example_output_2}}"},

      {"role": "user", "content": "Task: {{actual_input}}"}
    ]
    

    Test case configuration:

    tests:
      - vars:
          system_prompt: file://prompts/system.md
          # Few-shot examples
          example_input_1: file://data/examples/input1.txt
          example_output_1: file://data/examples/output1.txt
          example_input_2: file://data/examples/input2.txt
          example_output_2: file://data/examples/output2.txt
          # Actual test
          actual_input: file://data/test1.txt
    

    Best practices:

    • Use 1-3 few-shot examples (more may dilute effectiveness)
    • Ensure examples match the task format exactly
    • Load examples from files for better maintainability
    • Use echo provider first to verify structure (see the preview sketch below)
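
    As a sketch of that last point, a preview config can point the echo provider at the few-shot chat prompt so the rendered turns can be inspected without API calls (the fewshot-chat.json filename is illustrative):

    # promptfooconfig-preview.yaml (sketch)
    prompts:
      - file://prompts/fewshot-chat.json   # hypothetical few-shot chat prompt
    providers:
      - echo   # returns the rendered prompt, no API calls
    tests:
      - vars:
          system_prompt: file://prompts/system.md
          example_input_1: file://data/examples/input1.txt
          example_output_1: file://data/examples/output1.txt
          actual_input: file://data/test1.txt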

    Long Text Handling

    For Chinese/long-form content evaluations (10k+ characters):

    Configuration:

    providers:
      - id: anthropic:messages:claude-sonnet-4-6
        config:
          max_tokens: 8192  # Increase for long outputs
    
    defaultTest:
      assert:
        - type: python
          value: file://scripts/metrics.py:check_length
    

    Python assertion for text metrics:

    import re
    
    def strip_tags(text: str) -> str:
        """Remove HTML tags for pure text."""
        return re.sub(r'<[^>]+>', '', text)
    
    def check_length(output: str, context: dict) -> dict:
        """Check output length constraints."""
        raw_input = context['vars'].get('raw_input', '')
    
        input_len = len(strip_tags(raw_input))
        output_len = len(strip_tags(output))
    
        reduction_ratio = 1 - (output_len / input_len) if input_len > 0 else 0
    
        return {
            "pass": 0.7 <= reduction_ratio <= 0.9,
            "score": reduction_ratio,
            "reason": f"Reduction: {reduction_ratio:.1%} (target: 70-90%)",
            "named_scores": {
                "input_length": input_len,
                "output_length": output_len,
                "reduction_ratio": reduction_ratio
            }
        }
    

    Real-World Example

    Project: Chinese short-video content curation from long transcripts

    Structure:

    tiaogaoren/
    ├── promptfooconfig.yaml          # Production config
    ├── promptfooconfig-preview.yaml  # Preview config (echo provider)
    ├── prompts/
    │   ├── tiaogaoren-prompt.json   # Chat format with few-shot
    │   └── v4/system-v4.md          # System prompt
    ├── tests/cases.yaml              # 3 test samples
    ├── scripts/metrics.py            # Custom metrics (reduction ratio, etc.)
    ├── data/                         # 5 samples (2 few-shot, 3 eval)
    └── results/
    

    See ./tiaogaoren/ (example project root) for the full implementation.

    Resources

    For a detailed API reference and advanced patterns, see references/promptfoo_api.md.
