
    Agentic Evaluation Patterns

    Patterns for self-improvement through iterative evaluation and refinement.

    Overview

    Evaluation patterns enable agents to assess and improve their own outputs, moving beyond single-shot generation to iterative refinement loops.

    Generate → Evaluate → Critique → Refine → Output
                  ↑                    │
                  └────────────────────┘
    

    When to Use

    • Quality-critical generation: Code, reports, analysis requiring high accuracy
    • Tasks with clear evaluation criteria: Defined success metrics exist
    • Content requiring specific standards: Style guides, compliance, formatting

    Pattern 1: Basic Reflection

    Agent evaluates and improves its own output through self-critique.

    import json

    def reflect_and_refine(task: str, criteria: list[str], max_iterations: int = 3) -> str:
        """Generate with a self-critique reflection loop."""
        output = llm(f"Complete this task:\n{task}")

        for _ in range(max_iterations):
            # Self-critique: score the output against each criterion as structured JSON
            critique = llm(f"""
            Evaluate this output against criteria: {criteria}
            Output: {output}
            Return JSON mapping each criterion to {{"status": "PASS" or "FAIL", "feedback": "..."}}.
            """)

            critique_data = json.loads(critique)
            if all(c["status"] == "PASS" for c in critique_data.values()):
                return output

            # Refine, addressing only the criteria that failed
            failed = {k: v["feedback"] for k, v in critique_data.items() if v["status"] == "FAIL"}
            output = llm(f"Improve to address: {failed}\nOriginal: {output}")

        return output
    

    Key insight: Use structured JSON output for reliable parsing of critique results.
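
    Since the critique comes back as model-generated text, json.loads can fail. A minimal sketch of defensive parsing (the parse_critique name is illustrative, not part of the skill): strip Markdown fences the model may add, and return None instead of raising so the loop can retry the critique or fall back to the current output.

    import json

    def parse_critique(raw: str) -> dict | None:
        """Best-effort parse of a JSON critique; returns None if unusable."""
        text = raw.strip()
        # Models often wrap JSON in Markdown code fences; strip them first.
        if text.startswith("```"):
            text = text.strip("`")
            text = text.split("\n", 1)[1] if "\n" in text else text
        try:
            data = json.loads(text)
        except json.JSONDecodeError:
            return None
        return data if isinstance(data, dict) else None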


    Pattern 2: Evaluator-Optimizer

    Separate generation and evaluation into distinct components for clearer responsibilities.

    import json

    class EvaluatorOptimizer:
        def __init__(self, score_threshold: float = 0.8):
            self.score_threshold = score_threshold
        
        def generate(self, task: str) -> str:
            return llm(f"Complete: {task}")
        
        def evaluate(self, output: str, task: str) -> dict:
            return json.loads(llm(f"""
            Evaluate output for task: {task}
            Output: {output}
            Return JSON: {{"overall_score": 0-1, "dimensions": {{"accuracy": ..., "clarity": ...}}}}
            """))
        
        def optimize(self, output: str, feedback: dict) -> str:
            return llm(f"Improve based on feedback: {feedback}\nOutput: {output}")
        
        def run(self, task: str, max_iterations: int = 3) -> str:
            output = self.generate(task)
            for _ in range(max_iterations):
                evaluation = self.evaluate(output, task)
                if evaluation["overall_score"] >= self.score_threshold:
                    break
                output = self.optimize(output, evaluation)
            return output
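
    A usage sketch, assuming the same llm helper as above; the task string and threshold are illustrative:

    task = "Write a 200-word summary of the incident report"
    optimizer = EvaluatorOptimizer(score_threshold=0.85)
    result = optimizer.run(task, max_iterations=3)
    # Re-evaluate once more to check whether the threshold was actually met
    final = optimizer.evaluate(result, task)
    print(final["overall_score"])

    Keeping evaluation separate from generation also makes it easy to point the evaluator at a different prompt, or a different model, than the generator.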
    

    Pattern 3: Code-Specific Reflection

    Test-driven refinement loop for code generation.

    class CodeReflector:
        def reflect_and_fix(self, spec: str, max_iterations: int = 3) -> str:
            """Generate code and tests, then iterate until the tests pass."""
            code = llm(f"Write Python code for: {spec}")
            tests = llm(f"Generate pytest tests for: {spec}\nCode: {code}")

            for _ in range(max_iterations):
                result = run_tests(code, tests)  # test runner; see the sketch below
                if result["success"]:
                    return code
                code = llm(f"Fix error: {result['error']}\nCode: {code}")
            return code
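
    The loop assumes a run_tests helper. A minimal sketch using pytest in a temporary directory; the solution.py / test_solution.py module names and the star import are assumptions about how the generated tests reference the code, and in practice this should run inside a sandbox since it executes model-generated code.

    import subprocess
    import sys
    import tempfile
    from pathlib import Path

    def run_tests(code: str, tests: str, timeout: int = 60) -> dict:
        """Run generated pytest tests against generated code in an isolated directory."""
        with tempfile.TemporaryDirectory() as tmp:
            Path(tmp, "solution.py").write_text(code)
            # Assumes the generated tests call names defined in the solution module
            Path(tmp, "test_solution.py").write_text("from solution import *\n\n" + tests)
            proc = subprocess.run(
                [sys.executable, "-m", "pytest", "-q", tmp],
                capture_output=True, text=True, timeout=timeout,
            )
        return {
            "success": proc.returncode == 0,
            "error": (proc.stdout + proc.stderr).strip(),
        }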
    

    Evaluation Strategies

    Outcome-Based

    Evaluate whether output achieves the expected result.

    def evaluate_outcome(task: str, output: str, expected: str) -> str:
        return llm(f"Does output achieve expected outcome? Task: {task}, Expected: {expected}, Output: {output}")
    

    LLM-as-Judge

    Use LLM to compare and rank outputs.

    def llm_judge(output_a: str, output_b: str, criteria: str) -> str:
        return llm(f"Compare outputs A and B for {criteria}.\nA: {output_a}\nB: {output_b}\nWhich is better and why?")
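
    LLM judges tend to prefer whichever output appears first (position bias). A common mitigation, sketched here as a hypothetical judge_pairwise helper, is to judge both orderings and only declare a winner when the two verdicts agree:

    def judge_pairwise(output_a: str, output_b: str, criteria: str) -> str:
        """Return 'A', 'B', or 'tie', judging both orderings to reduce position bias."""
        first = llm(f"Compare the outputs for {criteria}.\n"
                    f"Output 1: {output_a}\nOutput 2: {output_b}\n"
                    f"Answer with exactly '1' or '2'.").strip()
        second = llm(f"Compare the outputs for {criteria}.\n"
                     f"Output 1: {output_b}\nOutput 2: {output_a}\n"
                     f"Answer with exactly '1' or '2'.").strip()
        if first == "1" and second == "2":
            return "A"  # A preferred in both orderings
        if first == "2" and second == "1":
            return "B"  # B preferred in both orderings
        return "tie"    # verdicts disagree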
    

    Rubric-Based

    Score outputs against weighted dimensions.

    import json

    RUBRIC = {
        "accuracy": {"weight": 0.4},
        "clarity": {"weight": 0.3},
        "completeness": {"weight": 0.3}
    }

    def evaluate_with_rubric(output: str, rubric: dict) -> float:
        """Weighted average of per-dimension scores, normalized to 0-1."""
        scores = json.loads(llm(
            f"Rate 1-5 for each dimension: {list(rubric.keys())}\n"
            f"Output: {output}\n"
            "Return JSON mapping each dimension to its integer score."
        ))
        return sum(scores[d] * rubric[d]["weight"] for d in rubric) / 5
    

    Best Practices

    Practice            Rationale
    Clear criteria      Define specific, measurable evaluation criteria upfront
    Iteration limits    Set max iterations (3-5) to prevent infinite loops
    Convergence check   Stop if the output score isn't improving between iterations (see the sketch below)
    Log history         Keep the full trajectory for debugging and analysis
    Structured output   Use JSON for reliable parsing of evaluation results
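
    A sketch of the convergence check and history logging from the table, wrapped around the Pattern 2 loop; min_improvement is an illustrative parameter:

    def run_with_convergence(optimizer, task: str, max_iterations: int = 3,
                             min_improvement: float = 0.01) -> tuple[str, list[dict]]:
        """Refinement loop that stops early when the score stops improving."""
        history = []  # full trajectory for debugging and analysis
        output = optimizer.generate(task)
        best_score = -1.0
        for i in range(max_iterations):
            evaluation = optimizer.evaluate(output, task)
            score = evaluation["overall_score"]
            history.append({"iteration": i, "score": score, "output": output})
            if score >= optimizer.score_threshold:
                break  # good enough, stop iterating
            if i > 0 and score - best_score < min_improvement:
                break  # not improving, avoid wasted iterations
            best_score = max(best_score, score)
            output = optimizer.optimize(output, evaluation)
        return output, history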

    Quick Start Checklist

    ## Evaluation Implementation Checklist
    
    ### Setup
    - [ ] Define evaluation criteria/rubric
    - [ ] Set score threshold for "good enough"
    - [ ] Configure max iterations (default: 3)
    
    ### Implementation
    - [ ] Implement generate() function
    - [ ] Implement evaluate() function with structured output
    - [ ] Implement optimize() function
    - [ ] Wire up the refinement loop
    
    ### Safety
    - [ ] Add convergence detection
    - [ ] Log all iterations for debugging
    - [ ] Handle evaluation parse failures gracefully
    