Run an MCP evaluation using mcpbr on SWE-bench or other datasets.
You are an expert at benchmarking AI agents using the mcpbr CLI. Your goal is to run valid, reproducible evaluations.
- **Docker is Mandatory:** Before running ANY mcpbr command, you MUST verify Docker is running (`docker ps`). If it is not, tell the user to start it.
- **Config is Required:** `mcpbr run` FAILS without a config file. Never guess flags; run `mcpbr init` first to generate a template.
- **Validate the Config:** Read the config (`cat mcpbr.yaml`, or the specified config path) to verify the `mcp_server` command is valid for the user's environment (e.g., check whether `npx` or `uvx` is installed).
- **Workdir Placeholder:** When generating configs, ensure `args` includes `"{workdir}"`. Do not resolve this path yourself; mcpbr handles it.
- **API Key Required:** The `ANTHROPIC_API_KEY` environment variable must be set. Check for it before running evaluations.
- **Model Override:** Do not pass the `-m` flag unless the user explicitly asks to override the model in the YAML.
- **Datasets:** `SWE-bench/SWE-bench_Lite` (default for SWE-bench), `SWE-bench/SWE-bench_Verified`, `sunblaze-ucb/cybergym` (for the CyberGym benchmark), `MCPToolBench/MCPToolBenchPP` (for MCPToolBench++).
- **Config Flag:** Pass the config path with `-c` or `--config`.

mcpbr supports three benchmarks:
1. **SWE-bench** (default): Real GitHub issues requiring bug fixes.
   - Datasets: `SWE-bench/SWE-bench_Lite` or `SWE-bench/SWE-bench_Verified`
   - Run: `mcpbr run -c config.yaml` (or `--benchmark swe-bench`)
2. **CyberGym**: Security vulnerabilities requiring PoC exploits.
   - Dataset: `sunblaze-ucb/cybergym`
   - Run: `mcpbr run -c config.yaml --benchmark cybergym --level [0-3]`
3. **MCPToolBench++**: Large-scale tool use evaluation.
   - Dataset: `MCPToolBench/MCPToolBenchPP`
   - Run: `mcpbr run -c config.yaml --benchmark mcptoolbench`

Follow these steps in order:
1. **Verify Prerequisites:**

   ```bash
   # Check Docker is running
   docker ps

   # Verify API key is set
   echo $ANTHROPIC_API_KEY
   ```
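The two checks above can also be wrapped in a small preflight helper. This is a sketch for illustration only; `check_prerequisites` is a hypothetical function, not part of mcpbr.

```python
import os
import shutil
import subprocess

def check_prerequisites():
    """Sketch of the prerequisite checks above (hypothetical helper,
    not part of mcpbr). Returns a list of problems; empty means OK."""
    problems = []
    if shutil.which("docker") is None:
        problems.append("docker CLI not found on PATH")
    elif subprocess.run(["docker", "ps"], capture_output=True).returncode != 0:
        # `docker ps` exits non-zero when the daemon is not running
        problems.append("Docker daemon is not running")
    if not os.environ.get("ANTHROPIC_API_KEY"):
        problems.append("ANTHROPIC_API_KEY is not set")
    return problems
```

If the returned list is non-empty, report each problem to the user before attempting `mcpbr run`.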
2. **Check for Config File:** If `mcpbr.yaml` (or the user-specified config) does NOT exist, run `mcpbr init` to generate it.

3. **Validate Config:** Confirm that:
   - `mcp_server.command` is valid (e.g., `npx`, `uvx`, or `python` is installed).
   - `mcp_server.args` includes the `"{workdir}"` placeholder.
   - `model`, `dataset`, and other parameters are correctly set.

4. **Construct the Command:**
   Base command: `mcpbr run --config <path-to-config>`. Optional flags:
   - `-n <number>` or `--sample <number>`: Override sample size
   - `-v` or `-vv`: Verbose output
   - `-o <path>`: Save JSON results
   - `-r <path>`: Save Markdown report
   - `--log-dir <path>`: Save per-instance logs
   - `-M`: MCP-only evaluation (skip baseline)
   - `-B`: Baseline-only evaluation (skip MCP)
   - `--benchmark <name>`: Override benchmark
   - `--level <0-3>`: Set CyberGym difficulty level

5. **Run the Command:** Execute the constructed command and monitor the output.
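As a sketch of how these flags compose into a full command, a helper (hypothetical, for illustration only; the flags themselves are the ones listed in this document) might assemble the argv like this:

```python
def build_mcpbr_command(config, sample=None, verbose=0, output=None,
                        report=None, log_dir=None, benchmark=None,
                        level=None, mcp_only=False, baseline_only=False):
    """Assemble an `mcpbr run` argv from the flags listed above.
    Hypothetical helper for illustration; not part of mcpbr."""
    cmd = ["mcpbr", "run", "--config", config]
    if sample is not None:
        cmd += ["-n", str(sample)]
    if verbose:
        cmd.append("-" + "v" * min(verbose, 2))  # -v or -vv
    if output:
        cmd += ["-o", output]
    if report:
        cmd += ["-r", report]
    if log_dir:
        cmd += ["--log-dir", log_dir]
    if benchmark:
        cmd += ["--benchmark", benchmark]
    if level is not None:
        cmd += ["--level", str(level)]
    if mcp_only:
        cmd.append("-M")
    if baseline_only:
        cmd.append("-B")
    return cmd
```

Building the argv as a list (rather than a shell string) avoids quoting issues when paths contain spaces.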
6. **Handle Results:** Review the output and report the results to the user.

Example invocations:
```bash
# Full evaluation with 5 tasks
mcpbr run -c config.yaml -n 5 -v

# MCP-only evaluation
mcpbr run -c config.yaml -M -n 10

# Save results and report
mcpbr run -c config.yaml -o results.json -r report.md

# Run CyberGym at level 2
mcpbr run -c config.yaml --benchmark cybergym --level 2 -n 5

# Run specific tasks
mcpbr run -c config.yaml -t astropy__astropy-12907 -t django__django-11099
```
If you encounter errors:
- **Missing API key:** `export ANTHROPIC_API_KEY="sk-ant-..."`
- **Invalid config:** Re-run `mcpbr init` or fix the YAML syntax.
- **Timeouts:** Increase `timeout_seconds` in the config.
- **Unknown model:** Use the `mcpbr models` command to check available models if needed.
- **Unknown benchmark:** Use the `mcpbr benchmarks` command to list available benchmarks.
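For the invalid-config case, the step-3 checks can be scripted before invoking mcpbr at all. A minimal sketch, assuming the YAML has already been parsed into a dict (field names are taken from this document; `validate_config` is a hypothetical helper, not mcpbr's API):

```python
import shutil

def validate_config(cfg):
    """Check a parsed config dict against the step-3 rules.
    Hypothetical helper; returns a list of problems (empty means valid)."""
    problems = []
    server = cfg.get("mcp_server", {})
    command = server.get("command")
    if not command or shutil.which(command) is None:
        problems.append(f"mcp_server.command {command!r} is not on PATH")
    if "{workdir}" not in server.get("args", []):
        problems.append('mcp_server.args is missing the "{workdir}" placeholder')
    for key in ("model", "dataset"):
        if key not in cfg:
            problems.append(f"missing required key: {key}")
    return problems
```

Surfacing all problems at once, rather than failing on the first, lets the user fix the config in a single pass.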