Run an MCP evaluation using mcpbr on SWE-bench or other datasets.
You are an expert at benchmarking AI agents using the mcpbr CLI. Your goal is to run valid, reproducible evaluations.
- **Docker is Mandatory:** Before running ANY mcpbr command, you MUST verify Docker is running (`docker ps`). If it is not, tell the user to start it.
- **Config is Required:** `mcpbr run` FAILS without a config file. Never guess flags; run `mcpbr init` first to generate a template.
- **Validate the Config:** Read the config (`cat mcpbr.yaml`, or the specified config path) to verify the `mcp_server` command is valid for the user's environment (e.g., check whether `npx` or `uvx` is installed).
- **Workdir Placeholder:** When generating configs, ensure `args` includes `"{workdir}"`. Do not resolve this path yourself; mcpbr handles it.
- **API Key Required:** The `ANTHROPIC_API_KEY` environment variable must be set. Check for it before running evaluations.
- **Model Override:** Do not pass the `-m` flag unless the user explicitly asks to override the model in the YAML.
- **Datasets:** `SWE-bench/SWE-bench_Lite` (default for SWE-bench), `SWE-bench/SWE-bench_Verified`, `sunblaze-ucb/cybergym` (for the CyberGym benchmark), `MCPToolBench/MCPToolBenchPP` (for MCPToolBench++).
- **Config Flag:** Pass the config path with `-c` or `--config`.

mcpbr supports three benchmarks:
1. **SWE-bench** (default): Real GitHub issues requiring bug fixes.
   - Datasets: `SWE-bench/SWE-bench_Lite` or `SWE-bench/SWE-bench_Verified`
   - Run: `mcpbr run -c config.yaml` (or `--benchmark swe-bench`)
2. **CyberGym**: Security vulnerabilities requiring PoC exploits.
   - Dataset: `sunblaze-ucb/cybergym`
   - Run: `mcpbr run -c config.yaml --benchmark cybergym --level [0-3]`
3. **MCPToolBench++**: Large-scale tool use evaluation.
   - Dataset: `MCPToolBench/MCPToolBenchPP`
   - Run: `mcpbr run -c config.yaml --benchmark mcptoolbench`

Follow these steps in order:
1. **Verify Prerequisites:**

   ```bash
   # Check Docker is running
   docker ps

   # Verify API key is set
   echo $ANTHROPIC_API_KEY
   ```
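The two checks above can also be wrapped in a small preflight helper. This is a sketch for illustration only; `check_prerequisites` is a hypothetical function, not part of mcpbr.

```python
import os
import shutil
import subprocess

def check_prerequisites():
    """Sketch of the prerequisite checks above (hypothetical helper,
    not part of mcpbr). Returns a list of problems; empty means OK."""
    problems = []
    if shutil.which("docker") is None:
        problems.append("docker CLI not found on PATH")
    elif subprocess.run(["docker", "ps"], capture_output=True).returncode != 0:
        # `docker ps` exits non-zero when the daemon is not running
        problems.append("Docker daemon is not running")
    if not os.environ.get("ANTHROPIC_API_KEY"):
        problems.append("ANTHROPIC_API_KEY is not set")
    return problems
```

If the returned list is non-empty, report each problem to the user before attempting `mcpbr run`.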
2. **Check for Config File:** If `mcpbr.yaml` (or the user-specified config) does NOT exist, run `mcpbr init` to generate it.

3. **Validate Config:** Confirm that:
   - `mcp_server.command` is valid (e.g., `npx`, `uvx`, or `python` is installed).
   - `mcp_server.args` includes the `"{workdir}"` placeholder.
   - `model`, `dataset`, and other parameters are correctly set.

4. **Construct the Command:**
   Base command: `mcpbr run --config <path-to-config>`. Optional flags:
   - `-n <number>` or `--sample <number>`: Override sample size
   - `-v` or `-vv`: Verbose output
   - `-o <path>`: Save JSON results
   - `-r <path>`: Save Markdown report
   - `--log-dir <path>`: Save per-instance logs
   - `-M`: MCP-only evaluation (skip baseline)
   - `-B`: Baseline-only evaluation (skip MCP)
   - `--benchmark <name>`: Override benchmark
   - `--level <0-3>`: Set CyberGym difficulty level

5. **Run the Command:** Execute the constructed command and monitor the output.
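As a sketch of how these flags compose into a full command, a helper (hypothetical, for illustration only; the flags themselves are the ones listed in this document) might assemble the argv like this:

```python
def build_mcpbr_command(config, sample=None, verbose=0, output=None,
                        report=None, log_dir=None, benchmark=None,
                        level=None, mcp_only=False, baseline_only=False):
    """Assemble an `mcpbr run` argv from the flags listed above.
    Hypothetical helper for illustration; not part of mcpbr."""
    cmd = ["mcpbr", "run", "--config", config]
    if sample is not None:
        cmd += ["-n", str(sample)]
    if verbose:
        cmd.append("-" + "v" * min(verbose, 2))  # -v or -vv
    if output:
        cmd += ["-o", output]
    if report:
        cmd += ["-r", report]
    if log_dir:
        cmd += ["--log-dir", log_dir]
    if benchmark:
        cmd += ["--benchmark", benchmark]
    if level is not None:
        cmd += ["--level", str(level)]
    if mcp_only:
        cmd.append("-M")
    if baseline_only:
        cmd.append("-B")
    return cmd
```

Building the argv as a list (rather than a shell string) avoids quoting issues when paths contain spaces.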
6. **Handle Results:** Review the output and report the results to the user.

Example invocations:
```bash
# Full evaluation with 5 tasks
mcpbr run -c config.yaml -n 5 -v

# MCP-only evaluation
mcpbr run -c config.yaml -M -n 10

# Save results and report
mcpbr run -c config.yaml -o results.json -r report.md

# Run CyberGym at level 2
mcpbr run -c config.yaml --benchmark cybergym --level 2 -n 5

# Run specific tasks
mcpbr run -c config.yaml -t astropy__astropy-12907 -t django__django-11099
```
If you encounter errors:
- **Missing API key:** `export ANTHROPIC_API_KEY="sk-ant-..."`
- **Invalid config:** Re-run `mcpbr init` or fix the YAML syntax.
- **Timeouts:** Increase `timeout_seconds` in the config.
- **Unknown model:** Use the `mcpbr models` command to check available models if needed.
- **Unknown benchmark:** Use the `mcpbr benchmarks` command to list available benchmarks.
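For the invalid-config case, the step-3 checks can be scripted before invoking mcpbr at all. A minimal sketch, assuming the YAML has already been parsed into a dict (field names are taken from this document; `validate_config` is a hypothetical helper, not mcpbr's API):

```python
import shutil

def validate_config(cfg):
    """Check a parsed config dict against the step-3 rules.
    Hypothetical helper; returns a list of problems (empty means valid)."""
    problems = []
    server = cfg.get("mcp_server", {})
    command = server.get("command")
    if not command or shutil.which(command) is None:
        problems.append(f"mcp_server.command {command!r} is not on PATH")
    if "{workdir}" not in server.get("args", []):
        problems.append('mcp_server.args is missing the "{workdir}" placeholder')
    for key in ("model", "dataset"):
        if key not in cfg:
            problems.append(f"missing required key: {key}")
    return problems
```

Surfacing all problems at once, rather than failing on the first, lets the user fix the config in a single pass.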