davila7/evaluating-llms-harness
    About

    Evaluates LLMs across 60+ academic benchmarks (MMLU, HumanEval, GSM8K, TruthfulQA, HellaSwag)...

    SKILL.md

    lm-evaluation-harness - LLM Benchmarking

    Quick start

    lm-evaluation-harness evaluates LLMs across 60+ academic benchmarks using standardized prompts and metrics.

    Installation:

    pip install lm-eval
    

    Evaluate any HuggingFace model:

    lm_eval --model hf \
      --model_args pretrained=meta-llama/Llama-2-7b-hf \
      --tasks mmlu,gsm8k,hellaswag \
      --device cuda:0 \
      --batch_size 8
    

    View available tasks:

    lm_eval --tasks list
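
    The same evaluation can also be driven from Python. A minimal sketch, assuming the lm_eval.simple_evaluate entry point available in recent releases; the model and task names here are placeholders:

    import lm_eval

    # Programmatic equivalent of the CLI calls above
    results = lm_eval.simple_evaluate(
        model="hf",                                        # HuggingFace backend
        model_args="pretrained=meta-llama/Llama-2-7b-hf",
        tasks=["hellaswag"],
        num_fewshot=0,
        batch_size=8,
        device="cuda:0",
    )

    # Per-task metrics live under results["results"]
    print(results["results"]["hellaswag"])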
    

    Common workflows

    Workflow 1: Standard benchmark evaluation

    Evaluate a model on core benchmarks (MMLU, GSM8K, HumanEval).

    Copy this checklist:

    Benchmark Evaluation:
    - [ ] Step 1: Choose benchmark suite
    - [ ] Step 2: Configure model
    - [ ] Step 3: Run evaluation
    - [ ] Step 4: Analyze results
    

    Step 1: Choose benchmark suite

    Core reasoning benchmarks:

    • MMLU (Massive Multitask Language Understanding) - 57 subjects, multiple choice
    • GSM8K - Grade school math word problems
    • HellaSwag - Common sense reasoning
    • TruthfulQA - Truthfulness and factuality
    • ARC (AI2 Reasoning Challenge) - Science questions

    Code benchmarks:

    • HumanEval - Python code generation (164 problems)
    • MBPP (Mostly Basic Python Problems) - Python coding

    Standard suite (recommended for model releases):

    --tasks mmlu,gsm8k,hellaswag,truthfulqa,arc_challenge
    

    Step 2: Configure model

    HuggingFace model:

    lm_eval --model hf \
      --model_args pretrained=meta-llama/Llama-2-7b-hf,dtype=bfloat16 \
      --tasks mmlu \
      --device cuda:0 \
      --batch_size auto  # Auto-detect optimal batch size
    

    Quantized model (4-bit/8-bit):

    lm_eval --model hf \
      --model_args pretrained=meta-llama/Llama-2-7b-hf,load_in_4bit=True \
      --tasks mmlu \
      --device cuda:0
    

    Custom checkpoint:

    lm_eval --model hf \
      --model_args pretrained=/path/to/my-model,tokenizer=/path/to/tokenizer \
      --tasks mmlu \
      --device cuda:0
    

    Step 3: Run evaluation

    # Full MMLU evaluation (57 subjects), 5-shot (the standard setting);
    # --log_samples saves individual predictions alongside the aggregate scores
    lm_eval --model hf \
      --model_args pretrained=meta-llama/Llama-2-7b-hf \
      --tasks mmlu \
      --num_fewshot 5 \
      --batch_size 8 \
      --output_path results/ \
      --log_samples
    
    # Multiple benchmarks at once
    lm_eval --model hf \
      --model_args pretrained=meta-llama/Llama-2-7b-hf \
      --tasks mmlu,gsm8k,hellaswag,truthfulqa,arc_challenge \
      --num_fewshot 5 \
      --batch_size 8 \
      --output_path results/llama2-7b-eval.json
    

    Step 4: Analyze results

    Results saved to results/llama2-7b-eval.json:

    {
      "results": {
        "mmlu": {
          "acc": 0.459,
          "acc_stderr": 0.004
        },
        "gsm8k": {
          "exact_match": 0.142,
          "exact_match_stderr": 0.006
        },
        "hellaswag": {
          "acc_norm": 0.765,
          "acc_norm_stderr": 0.004
        }
      },
      "config": {
        "model": "hf",
        "model_args": "pretrained=meta-llama/Llama-2-7b-hf",
        "num_fewshot": 5
      }
    }
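
    To pull the headline numbers out of that file programmatically, a short sketch that matches the structure shown above (metric names vary by task):

    import json

    # Load the JSON written by --output_path
    with open("results/llama2-7b-eval.json") as f:
        data = json.load(f)

    # Print each task's metrics alongside their standard errors
    for task, metrics in data["results"].items():
        for name, value in metrics.items():
            if not name.endswith("_stderr"):
                stderr = metrics.get(f"{name}_stderr", 0.0)
                print(f"{task:12s} {name:15s} {value:.3f} ± {stderr:.3f}")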
    

    Workflow 2: Track training progress

    Evaluate checkpoints during training.

    Training Progress Tracking:
    - [ ] Step 1: Set up periodic evaluation
    - [ ] Step 2: Choose quick benchmarks
    - [ ] Step 3: Automate evaluation
    - [ ] Step 4: Plot learning curves
    

    Step 1: Set up periodic evaluation

    Evaluate every N training steps:

    #!/bin/bash
    # eval_checkpoint.sh
    
    CHECKPOINT_DIR=$1
    STEP=$2
    
    # 0-shot for speed
    lm_eval --model hf \
      --model_args pretrained=$CHECKPOINT_DIR/checkpoint-$STEP \
      --tasks gsm8k,hellaswag \
      --num_fewshot 0 \
      --batch_size 16 \
      --output_path results/step-$STEP.json
    

    Step 2: Choose quick benchmarks

    Fast benchmarks for frequent evaluation (a quick smoke-test sketch follows these lists):

    • HellaSwag: ~10 minutes on 1 GPU
    • GSM8K: ~5 minutes
    • PIQA: ~2 minutes

    Avoid for frequent eval (too slow):

    • MMLU: ~2 hours (57 subjects)
    • HumanEval: Requires code execution
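
    For even faster smoke tests between full runs, the harness can evaluate a truncated number of examples per task via --limit (or the limit argument of the Python API). A sketch, assuming lm_eval.simple_evaluate and a hypothetical checkpoint path; scores from truncated runs are not comparable to full-benchmark numbers:

    import lm_eval

    # Evaluate only the first 200 examples of each task for a quick signal
    results = lm_eval.simple_evaluate(
        model="hf",
        model_args="pretrained=checkpoints/checkpoint-1000",  # hypothetical checkpoint
        tasks=["gsm8k", "hellaswag"],
        num_fewshot=0,
        batch_size=16,
        limit=200,      # examples per task; remove for a full run
        device="cuda:0",
    )

    print(results["results"])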

    Step 3: Automate evaluation

    Integrate with training script:

    import os

    # In training loop
    if step % eval_interval == 0:
        # Save in HuggingFace format so lm_eval can load the checkpoint directly
        model.save_pretrained(f"checkpoints/checkpoint-{step}")

        # Run evaluation (eval_checkpoint.sh expects: <checkpoint_dir> <step>)
        os.system(f"./eval_checkpoint.sh checkpoints {step}")
    

    Or use PyTorch Lightning callbacks:

    import os

    from pytorch_lightning import Callback

    class EvalHarnessCallback(Callback):
        def on_validation_epoch_end(self, trainer, pl_module):
            step = trainer.global_step
            checkpoint_path = f"checkpoints/step-{step}"

            # Save in HuggingFace format; lm_eval cannot load Lightning .ckpt files
            # (assumes pl_module.model is a HuggingFace PreTrainedModel)
            pl_module.model.save_pretrained(checkpoint_path)

            # Run lm-eval on the saved checkpoint
            os.system(f"lm_eval --model hf --model_args pretrained={checkpoint_path} ...")
    

    Step 4: Plot learning curves

    import glob
    import json

    import matplotlib.pyplot as plt

    # Load all checkpoint results written by eval_checkpoint.sh
    points = []
    for file in glob.glob("results/step-*.json"):
        with open(file) as f:
            data = json.load(f)
        step = int(file.split("-")[1].split(".")[0])
        points.append((step, data["results"]["gsm8k"]["exact_match"]))

    # Sort numerically by step (a lexicographic sort would put step-1000 before step-200)
    points.sort()
    steps, scores = zip(*points)

    # Plot
    plt.plot(steps, scores)
    plt.xlabel("Training Step")
    plt.ylabel("GSM8K Exact Match")
    plt.title("Training Progress")
    plt.savefig("training_curve.png")
    

    Workflow 3: Compare multiple models

    Benchmark suite for model comparison.

    Model Comparison:
    - [ ] Step 1: Define model list
    - [ ] Step 2: Run evaluations
    - [ ] Step 3: Generate comparison table
    

    Step 1: Define model list

    # models.txt
    meta-llama/Llama-2-7b-hf
    meta-llama/Llama-2-13b-hf
    mistralai/Mistral-7B-v0.1
    microsoft/phi-2
    

    Step 2: Run evaluations

    #!/bin/bash
    # eval_all_models.sh
    
    TASKS="mmlu,gsm8k,hellaswag,truthfulqa"
    
    while read model; do
        echo "Evaluating $model"
    
        # Extract model name for output file
        model_name=$(echo $model | sed 's/\//-/g')
    
        lm_eval --model hf \
          --model_args pretrained=$model,dtype=bfloat16 \
          --tasks $TASKS \
          --num_fewshot 5 \
          --batch_size auto \
          --output_path results/$model_name.json
    
    done < models.txt
    

    Step 3: Generate comparison table

    import json
    import pandas as pd
    
    # Original HuggingFace IDs; eval_all_models.sh replaces "/" with "-" in the output file names
    models = [
        "meta-llama/Llama-2-7b-hf",
        "meta-llama/Llama-2-13b-hf",
        "mistralai/Mistral-7B-v0.1",
        "microsoft/phi-2",
    ]

    tasks = ["mmlu", "gsm8k", "hellaswag", "truthfulqa"]

    results = []
    for model in models:
        filename = model.replace("/", "-")  # matches the sed substitution in the shell script
        with open(f"results/{filename}.json") as f:
            data = json.load(f)
        row = {"Model": model}
        for task in tasks:
            # Get the primary metric for each task
            metrics = data["results"][task]
            if "acc" in metrics:
                row[task.upper()] = f"{metrics['acc']:.3f}"
            elif "exact_match" in metrics:
                row[task.upper()] = f"{metrics['exact_match']:.3f}"
        results.append(row)
    
    df = pd.DataFrame(results)
    print(df.to_markdown(index=False))
    

    Output:

    | Model                  | MMLU  | GSM8K | HELLASWAG | TRUTHFULQA |
    |------------------------|-------|-------|-----------|------------|
    | meta-llama/Llama-2-7b  | 0.459 | 0.142 | 0.765     | 0.391      |
    | meta-llama/Llama-2-13b | 0.549 | 0.287 | 0.801     | 0.430      |
    | mistralai/Mistral-7B   | 0.626 | 0.395 | 0.812     | 0.428      |
    | microsoft/phi-2        | 0.560 | 0.613 | 0.682     | 0.447      |
    

    Workflow 4: Evaluate with vLLM (faster inference)

    Use the vLLM backend for 5-10× faster evaluation.

    vLLM Evaluation:
    - [ ] Step 1: Install vLLM
    - [ ] Step 2: Configure vLLM backend
    - [ ] Step 3: Run evaluation
    

    Step 1: Install vLLM

    pip install vllm
    

    Step 2: Configure vLLM backend

    lm_eval --model vllm \
      --model_args pretrained=meta-llama/Llama-2-7b-hf,tensor_parallel_size=1,dtype=auto,gpu_memory_utilization=0.8 \
      --tasks mmlu \
      --batch_size auto
    

    Step 3: Run evaluation

    vLLM is 5-10× faster than the standard HuggingFace backend:

    # Standard HF: ~2 hours for MMLU on 7B model
    lm_eval --model hf \
      --model_args pretrained=meta-llama/Llama-2-7b-hf \
      --tasks mmlu \
      --batch_size 8
    
    # vLLM: ~15-20 minutes for MMLU on 7B model
    lm_eval --model vllm \
      --model_args pretrained=meta-llama/Llama-2-7b-hf,tensor_parallel_size=2 \
      --tasks mmlu \
      --batch_size auto
    

    When to use vs alternatives

    Use lm-evaluation-harness when:

    • Benchmarking models for academic papers
    • Comparing model quality across standard tasks
    • Tracking training progress
    • Reporting standardized metrics (everyone uses the same prompts)
    • Running reproducible evaluations

    Use alternatives instead:

    • HELM (Stanford): Broader evaluation (fairness, efficiency, calibration)
    • AlpacaEval: Instruction-following evaluation with LLM judges
    • MT-Bench: Conversational multi-turn evaluation
    • Custom scripts: Domain-specific evaluation

    Common issues

    Issue: Evaluation too slow

    Use vLLM backend:

    lm_eval --model vllm \
      --model_args pretrained=model-name,tensor_parallel_size=2
    

    Or reduce fewshot examples:

    --num_fewshot 0  # Instead of 5
    

    Or evaluate subset of MMLU:

    --tasks mmlu_stem  # Only STEM subjects
    

    Issue: Out of memory

    Reduce batch size:

    --batch_size 1  # Or --batch_size auto
    

    Use quantization:

    --model_args pretrained=model-name,load_in_8bit=True
    

    Enable CPU offloading:

    --model_args pretrained=model-name,device_map=auto,offload_folder=offload
    

    Issue: Results differ from reported numbers

    Check fewshot count:

    --num_fewshot 5  # Most papers use 5-shot
    

    Check exact task name:

    --tasks mmlu  # Not mmlu_direct or mmlu_fewshot
    

    Verify model and tokenizer match:

    --model_args pretrained=model-name,tokenizer=same-model-name
    

    Issue: HumanEval not executing code

    Install execution dependencies:

    pip install human-eval
    

    Enable code execution:

    lm_eval --model hf \
      --model_args pretrained=model-name \
      --tasks humaneval \
      --allow_code_execution  # Required for HumanEval
    

    Advanced topics

    Benchmark descriptions: See references/benchmark-guide.md for detailed descriptions of all 60+ tasks, what they measure, and how to interpret them.

    Custom tasks: See references/custom-tasks.md for creating domain-specific evaluation tasks.

    API evaluation: See references/api-evaluation.md for evaluating OpenAI, Anthropic, and other API models; a minimal sketch follows at the end of this section.

    Multi-GPU strategies: See references/distributed-eval.md for data parallel and tensor parallel evaluation.
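
    For the API route, a hedged sketch: this assumes the harness's openai-chat-completions backend and an OPENAI_API_KEY in the environment, and the model id is only a placeholder; see references/api-evaluation.md for the supported backends and arguments.

    import lm_eval

    # Evaluate a hosted API model instead of local weights
    # (assumes OPENAI_API_KEY is set; only generation-style tasks such as gsm8k apply)
    results = lm_eval.simple_evaluate(
        model="openai-chat-completions",   # assumed backend name
        model_args="model=gpt-4o-mini",    # placeholder model id
        tasks=["gsm8k"],
        num_fewshot=5,
    )

    print(results["results"]["gsm8k"])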

    Hardware requirements

    • GPU: NVIDIA (CUDA 11.8+); also runs on CPU, but very slowly
    • VRAM (a rough sizing sketch follows this list):
      • 7B model: 16GB (bf16) or 8GB (8-bit)
      • 13B model: 28GB (bf16) or 14GB (8-bit)
      • 70B model: Requires multi-GPU or quantization
    • Time (7B model, single A100):
      • HellaSwag: 10 minutes
      • GSM8K: 5 minutes
      • MMLU (full): 2 hours
      • HumanEval: 20 minutes
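
    The VRAM figures follow from bytes per parameter: roughly 2 bytes in bf16 and 1 byte in 8-bit, plus headroom for activations and the KV cache. A back-of-the-envelope sketch (the 15% overhead factor is an assumption; it approximately reproduces the numbers above):

    # Rough VRAM estimate: parameters × bytes per parameter, plus ~15% overhead
    def estimate_vram_gb(params_billion: float, bytes_per_param: float, overhead: float = 0.15) -> float:
        return params_billion * bytes_per_param * (1 + overhead)

    for size in (7, 13, 70):
        print(f"{size}B  bf16 ≈ {estimate_vram_gb(size, 2):.0f} GB   8-bit ≈ {estimate_vram_gb(size, 1):.0f} GB")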

    Resources

    • GitHub: https://github.com/EleutherAI/lm-evaluation-harness
    • Docs: https://github.com/EleutherAI/lm-evaluation-harness/tree/main/docs
    • Task library: 60+ tasks including MMLU, GSM8K, HumanEval, TruthfulQA, HellaSwag, ARC, WinoGrande, etc.
    • Leaderboard: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard (uses this harness)