Import model evaluation results into the ReasonScape dataset. Use when importing new models, quantization variants, or adding evaluation data...
You are helping the user import new model evaluation results into the ReasonScape dataset structure.
This project has two active test suites:
- r12 (data/r12/)
- tables300 (data/inf-tables/)
m12x has been archived and its data removed. Do not reference m12x.
When the user invokes /import <model-name>, follow these steps:
Important: Use only the AskUserQuestion tool for all user interactions to keep the flow smooth.
If the user does not provide a model name/pattern, list all available results:
ls -1 results/
Otherwise, search results/ for folders matching the model name pattern across all suites:
ls -1 results/ | grep -i "<model-name>"
Extract all unique model variants found and group by:
Folder naming patterns:
- r12: <date>_r12_d0_<model-name>_<template>_<sampler>_<mode>_<precision>
- tables300: <date>_inf-tables_<tiers>_<model-name>_<template>_<sampler>_<mode>_<precision>
- <tiers> is a hyphen-separated list of difficulty degrees, e.g. d10-14-18-22-26-30
Note: r12 has a single difficulty tier (d0 only). tables300 spans multiple difficulty degrees in a single folder.
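The naming pattern above can be split mechanically. The following is a minimal sketch, not part of the project's tooling: `parse_result_folder` and `SUITE_TOKENS` are hypothetical names, and the code assumes that no field (including the model name) contains an underscore and that the suite token is one of the values seen in this document.

```python
# Suite tokens that appear in this document's folder names (an assumption;
# extend the set if your results use other tokens).
SUITE_TOKENS = {"r12", "tables300", "inf-tables"}


def parse_result_folder(name: str) -> dict:
    """Split a result folder name into its fields (hypothetical helper)."""
    parts = name.split("_")
    # The date may itself contain an underscore (date_time), so locate the
    # suite token instead of assuming a fixed position.
    suite_idx = next((i for i, p in enumerate(parts) if p in SUITE_TOKENS), None)
    if suite_idx is None or suite_idx == 0:
        raise ValueError(f"no known suite token in {name!r}")
    rest = parts[suite_idx + 1:]
    if len(rest) != 6:
        raise ValueError(f"expected 6 fields after the suite, got {len(rest)}")
    tiers, model, template, sampler, mode, precision = rest
    return {
        "date": "_".join(parts[:suite_idx]),
        "suite": parts[suite_idx],
        "tiers": tiers,
        "model": model,
        "template": template,
        "sampler": sampler,
        "mode": mode,
        "precision": precision,
    }


example = parse_result_folder(
    "2026-02-18_23-31-16_r12_d0_gemma-3-12b-it-fp16_zerocot-nosys_greedy-max_normal_flash"
)
print(example["model"], example["template"], example["sampler"])
```

Locating the suite token first keeps the parse working for both `2026-02-19_r12_...` and `2026-02-18_23-31-16_r12_...` style dates.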
Suite selection:
- If results are found for more than one suite, use AskUserQuestion to let the user pick which suite to import into
Handle variant ambiguity:
- If more than one variant matches, use AskUserQuestion to confirm which to import
Before fetching metadata, search for any existing cohort data for this model across all suites.
From the model name and results folders, determine the cohort directory name.
What is a cohort? A cohort groups equivalent models together, including:
A cohort does NOT include:
Determining cohort name:
- Start from the result folder name (e.g., 2026-..._r12_d0_MiroThinker-v1.5-30B-fp16_*)
- Use the model identifier as the cohort name: MiroThinker-v1.5-30B, Qwen3-30B-A3B, GLM-4.5, etc.
- Cohort directory: data/<suite>/<ModelIdentifier>/
The cohort.py list --search command searches inside all available cohorts by default:
python ./cohort.py list --search '<ModelFamilyRegExp>'
If a match is found, extract and reuse:
- label, groups, and hf_id, adjusting quant:* and ctx:* as appropriate
Check if the cohort already exists in the target suite:
ls "data/<suite>/<ModelFamily>/" 2>/dev/null
cat "data/<suite>/<ModelFamily>/evals.json" 2>/dev/null | jq .
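For use inside a script, the same existence check could be done in Python. This is a rough sketch under the assumption that evals.json is a JSON array of entries with the label, groups, and hf_id fields used elsewhere in this document; `load_existing_evals` and `summarize` are hypothetical helper names.

```python
import json
from pathlib import Path


def load_existing_evals(cohort_dir: str) -> list:
    """Return the entries of <cohort_dir>/evals.json, or [] if it is absent."""
    path = Path(cohort_dir) / "evals.json"
    if not path.exists():
        return []
    with open(path) as f:
        return json.load(f)


def summarize(evals: list) -> list:
    """Reduce each entry to the fields that are worth reusing."""
    return [(e.get("label"), e.get("hf_id"), e.get("groups", [])) for e in evals]


# A cohort that does not exist yet simply yields an empty list.
print(summarize(load_existing_evals("data/r12/DoesNotExist")))
```

An empty result means the cohort is new; a non-empty one means the import should append to the existing file.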
Before proceeding, report:
- Whether the cohort already exists at data/<suite>/<ModelFamily>/
Skip this step if a usable hf_id was found in step 2b.
Otherwise, search using the HF CLI:
hf models ls --search "<model-name>" --limit 10
Use AskUserQuestion to confirm the HF ID:
Skip this step if arch and family were found in step 2b.
Otherwise, run modelinfo:
python analyze.py modelinfo --hf-id <hf_id> --output-dir /tmp/import-modelinfo
Read the generated MODELINFO.md and extract:
- Family (e.g., family:qwen3)
- *MoeForCausalLM → arch:moe
- *ForCausalLM (non-MoE) → arch:dense
- Other architectures → arch:ssm or arch:hybrid
Parse the result folder names to extract:
- Model: MiroThinker-v1.5-30B-fp16, Qwen3-30B-A3B-Thinking-2507-fp16-16k
- Template: zeroshot-nosys, zerocot-nosys
- Sampler: qwen3-think-max, greedy-max
- Context: -16k → ctx:16k facet (ONLY if explicitly in folder name, otherwise omit)
Build the groups array. If copying from an existing cohort, add or adjust quant:* and ctx:* facets as needed.
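The facet rules in this section (architecture from the model class, size from the parameter count, quantization and context length from the folder name) can be sketched in a few functions. All function names here are hypothetical. The sketch also makes three assumptions that the document does not state: range boundaries are treated as lower-inclusive (8B counts as mid), exactly 100B counts as large, and any -<N>k suffix is treated like -16k.

```python
import re


def arch_facet(model_class: str):
    """Map a model class name to arch:moe / arch:dense; None needs manual review."""
    if model_class.endswith("MoeForCausalLM"):
        return "arch:moe"
    if model_class.endswith("ForCausalLM"):
        return "arch:dense"
    return None  # arch:ssm or arch:hybrid must be decided by hand


def size_facet(model_name: str):
    """Infer the size facet from the first '<N>B' token (skips 'A3B'-style tokens)."""
    m = re.search(r"(?<![A-Za-z0-9.])(\d+(?:\.\d+)?)[bB](?![A-Za-z0-9])", model_name)
    if m is None:
        return None
    params = float(m.group(1))
    if params < 4:
        return "size:tiny"
    if params < 8:
        return "size:small"
    if params < 20:
        return "size:mid"
    if params <= 100:
        return "size:large"
    return "size:xlarge"


def quant_facet(model_name: str):
    """Find the quantization token among the hyphen-separated parts."""
    tokens = model_name.lower().split("-")
    for quant in ("fp16", "fp8", "awq", "gptq"):
        if quant in tokens:
            return f"quant:{quant}"
    return None


def ctx_facet(model_name: str):
    """Return ctx:<N>k only when the name carries an explicit -<N>k suffix."""
    m = re.search(r"-(\d+k)(?=$|[-_])", model_name, flags=re.IGNORECASE)
    return f"ctx:{m.group(1).lower()}" if m else None


def build_groups(model_name: str, family: str, arch: str) -> list:
    """Assemble the groups array, dropping facets that could not be inferred."""
    facets = [family, arch, size_facet(model_name), quant_facet(model_name), ctx_facet(model_name)]
    return [f for f in facets if f is not None]


print(build_groups("gemma-3-12b-it-fp16", "family:gemma3", arch_facet("Gemma3ForCausalLM")))
```

A missing quantization facet is worth treating as an error in practice, since the rules say to always include it; the sketch leaves that check to the caller.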
Size (infer from parameter count in model name):
- tiny: <4B
- small: 4-8B
- mid: 8-20B
- large: 20-100B
- xlarge: >100B
Architecture (from modelinfo or an existing cohort):
- arch:dense / arch:moe / arch:ssm / arch:hybrid
Quantization (from folder name - ALWAYS include):
- quant:fp16 / quant:fp8 / quant:awq / quant:gptq
Context Length (from folder name - ONLY if explicit):
- ctx:16k if folder name contains -16k
Families:
- From base_model (e.g., family:qwen3, family:llama)
- The model's own family (e.g., family:mirothinker)
Create a Python script that moves result folders into the r12 cohort directory and appends to its evals.json.
For each variant found:
#!/usr/bin/env python3
import json
import shutil
from pathlib import Path
from glob import glob

# Model: <model-name>-<quant>
cohort_dir = Path("data/r12/<ModelFamily>")
eval_json = cohort_dir / "evals.json"
print("Processing <model-name>-<quant>...")

# Create cohort directory if needed
cohort_dir.mkdir(parents=True, exist_ok=True)

# Move result folders
for src in glob("results/*r12*<model-name>-<quant>*"):
    dest = cohort_dir / Path(src).name
    shutil.move(src, dest)
    print(f" Moved {src} -> {dest}")

# Build new eval entry
new_eval = {
    "evaluate": {"glob": "data/r12/<ModelFamily>/*_<model-name>-<quant>_<template>_<sampler>_*/*"},
    "filters": {"model": "<model-name>-<quant>", "template": "<template>", "sampler": "<sampler>"},
    "label": "<Human Readable Label>",
    "groups": [<facet-array>],
    "hf_id": "<hf_id>"
}

# Create or append to evals.json
if eval_json.exists():
    with open(eval_json) as f:
        evals = json.load(f)
    evals.append(new_eval)
else:
    evals = [new_eval]
with open(eval_json, 'w') as f:
    json.dump(evals, f, indent=2)
print("✓ Imported <model-name>-<quant>")
CRITICAL - Glob Pattern Rules:
When a model has multiple template/sampler combinations (e.g., both thinking and instruct modes), you MUST create separate eval entries with SPECIFIC globs:
❌ WRONG (overly broad glob matches multiple templates):
{
  "evaluate": {"glob": "data/r12/Model/*_Model-fp16-16k_*/*"},
  "filters": {"model": "Model-fp16-16k", "template": "zerocot-nosys", "sampler": "instruct"}
}
This glob *_Model-fp16-16k_*/* will match BOTH zeroshot-nosys and zerocot-nosys templates, causing:
✗ Expected exactly 1 scenario (model/template/sampler), got 2
✅ CORRECT (specific globs that include template name):
[
  {
    "evaluate": {"glob": "data/r12/Model/*_Model-fp16-16k_zeroshot-nosys_think-max_*/*"},
    "filters": {"model": "Model-fp16-16k", "template": "zeroshot-nosys", "sampler": "think-max"},
    "label": "Model (FP16, 16k) Thinking"
  },
  {
    "evaluate": {"glob": "data/r12/Model/*_Model-fp16-16k_zerocot-nosys_instruct_*/*"},
    "filters": {"model": "Model-fp16-16k", "template": "zerocot-nosys", "sampler": "instruct"},
    "label": "Model (FP16, 16k) Instruct"
  }
]
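To see in advance whether a glob is too broad, the folder-name part of the pattern can be tested against the folder names with Python's `fnmatch` module. The sketch below is illustrative only: the folder names are made up, `scenarios_for` is a hypothetical helper, and it assumes the model, template, and sampler are the fifth-, fourth-, and third-from-last underscore-separated fields. It is not a replacement for cohort.py verify.

```python
from fnmatch import fnmatchcase

# Made-up folder names following the r12 naming pattern.
FOLDERS = [
    "2026-01-01_r12_d0_Model-fp16-16k_zeroshot-nosys_think-max_normal_flash",
    "2026-01-01_r12_d0_Model-fp16-16k_zerocot-nosys_instruct_normal_flash",
]


def scenarios_for(folder_glob: str, folders: list) -> set:
    """Distinct (model, template, sampler) tuples among folders matching the glob."""
    found = set()
    for name in folders:
        if fnmatchcase(name, folder_glob):
            # Count fields from the right so an underscore in the date is harmless.
            model, template, sampler = name.split("_")[-5:-2]
            found.add((model, template, sampler))
    return found


broad = scenarios_for("*_Model-fp16-16k_*", FOLDERS)
specific = scenarios_for("*_Model-fp16-16k_zerocot-nosys_instruct_*", FOLDERS)
print(len(broad), len(specific))  # prints: 2 1
```

The broad pattern picks up both templates, which is the condition that triggers the "Expected exactly 1 scenario" error; the specific pattern isolates one.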
Key Rules:
- Include the template and sampler in the glob: *_<template>_<sampler>_*/* not just *_*/*
Other Important Notes:
- Add a "hf_quant_id": "<quant_id>" field if this is a quantized version with a separate HF repo
- When a -16k variant is added, refine the existing glob from *-fp16* to *-fp16_* so it won't match -fp16-16k_ variants
- No tags field: r12 does not use leaderboard tags (omit entirely)
Save the generated script to /tmp/ with a UNIQUE filename and execute it immediately.
Important: /tmp/ persists across skill invocations, so always use a unique filename to avoid Write tool conflicts with previous runs.
Use Bash heredoc to create and execute the script:
SCRIPT_PATH="/tmp/import-$(date +%s).py"
cat > "$SCRIPT_PATH" << 'EOF'
#!/usr/bin/env python3
...script content...
EOF
python3 "$SCRIPT_PATH"
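If the script is launched from Python rather than Bash, the standard library can pick the unique name. This is a minimal sketch, assuming that a randomly named file in the system temporary directory (which is not necessarily /tmp/) is acceptable; `run_unique_script` is a hypothetical helper name.

```python
import os
import subprocess
import sys
import tempfile


def run_unique_script(source: str) -> tuple:
    """Write source to a uniquely named temp file, run it, return (path, stdout)."""
    # mkstemp creates the file atomically, so two runs never share a name,
    # even when they start within the same second.
    fd, path = tempfile.mkstemp(prefix="import-", suffix=".py")
    with os.fdopen(fd, "w") as f:
        f.write(source)
    result = subprocess.run(
        [sys.executable, path], check=True, capture_output=True, text=True
    )
    return path, result.stdout


script_path, output = run_unique_script('print("Processing...")')
print(script_path, output.strip())
```

Unlike a timestamp-based name, this also avoids collisions between two invocations in the same second.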
After execution, verify the import is correct:
# Verify glob patterns match expected scenarios
python ./cohort.py verify "data/r12/<ModelFamily>"
This command validates that:
If verification passes, also list the cohort to show the newly imported evals:
python ./cohort.py list "data/r12/<ModelFamily>"
Report success and show the newly imported eval entries from the cohort listing.
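- Only import result folders for an active suite (r12_d0 or tables300_d* folders)
- Add a ctx:* facet only when the context length is explicit in the folder name
- Use one hf_id per eval entry
- When a model has multiple templates, create separate eval entries: one for zeroshot-nosys (thinking), another for zerocot-nosys (instruct). Always verify with cohort.py verify after import to ensure each eval has exactly one scenario matching its filters.
- Size facets: tiny, small, mid, large, xlarge
User: /import gemma-3-12b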
1. Found: results/2026-02-18_23-31-16_r12_d0_gemma-3-12b-it-fp16_zerocot-nosys_greedy-max_normal_flash
Suite: r12 (only one suite found)
2. cohort.py list --search 'gemma-3-12b': NOT FOUND in any existing cohort
3. HF search → google/gemma-3-12b-it confirmed
4. Fetch modelinfo → arch:dense, family:gemma3
5. Faceting: family:gemma3, arch:dense, size:mid, quant:fp16
6. r12 cohort: NEW (data/r12/Gemma-3-12B/ does not exist yet)
7. Generate and execute script → move folder, create evals.json
8. Verify with cohort.py verify data/r12/Gemma-3-12B
9. ✓ Imported to data/r12/Gemma-3-12B/
User: /import NewModel-30B
1. Found: results/2026-02-19_r12_d0_NewModel-30B-fp16_zeroshot-nosys_greedy-max_normal_flash
Suite: r12
2. cohort.py list --search 'NewModel-30B': FOUND in data/r12/NewModel-30B/evals.json
- Reusing: label="NewModel-30B (FP16)", groups=[family:newmodel, arch:dense, size:large], hf_id=org/NewModel-30B
3. r12 cohort: EXISTS (append new eval)
4. Generate and execute script
5. Verify with cohort.py verify data/r12/NewModel-30B
6. ✓ Imported to data/r12/NewModel-30B/
User: /import Qwen3-4B-Thinking-16k
1. Found: results/2026-02-18_r12_d0_Qwen3-4B-Thinking-2507-fp16-16k_zeroshot-nosys_qwen3-think-max_normal_flash
Suite: r12
2. cohort.py list --search 'Qwen3-4B': FOUND — reuse base metadata from existing cohort
3. r12 cohort: EXISTS (data/r12/Qwen3-4B/evals.json already has fp16 entry)
4. Extracted: ctx:16k from "-16k" suffix
5. Refine existing glob: *-fp16_* to prevent matching -16k variant
6. Add new entry with ctx:16k facet
7. Execute, verify
8. ✓ Added -16k variant to data/r12/Qwen3-4B/
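The glob refinement in step 5 of this example can be checked with `fnmatch` before editing evals.json. In this sketch the fp16 folder name is made up for illustration (only the -16k folder appears in the example), and only the folder-name part of the glob is tested.

```python
from fnmatch import fnmatchcase

# The first name is hypothetical; the second is the folder from the example.
fp16_folder = "2026-01-10_r12_d0_Qwen3-4B-Thinking-2507-fp16_zeroshot-nosys_qwen3-think-max_normal_flash"
fp16_16k_folder = "2026-02-18_r12_d0_Qwen3-4B-Thinking-2507-fp16-16k_zeroshot-nosys_qwen3-think-max_normal_flash"

old_glob = "*_Qwen3-4B-Thinking-2507-fp16*"   # trailing * also swallows "-16k"
new_glob = "*_Qwen3-4B-Thinking-2507-fp16_*"  # "_" must follow "fp16" directly

for label, pattern in (("old", old_glob), ("new", new_glob)):
    matches = [f for f in (fp16_folder, fp16_16k_folder) if fnmatchcase(f, pattern)]
    print(label, len(matches))
```

The old pattern matches both folders, while the refined one matches only the plain fp16 folder, leaving the -16k variant to its own eval entry.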
User: /import nemotron-120b
1. Found: results/2026-03-15_09-35-22_tables300_d10-14-18-22-26-30_NVIDIA-Nemotron-3-Super-120B-A12B-FP8-32k_zeroshot-nosys_nemotron3-max_normal_flash
Suite: tables300 (only one suite found)
2. cohort.py list --search 'Nemotron-3-Super-120B': NOT FOUND
3. HF search → nvidia/Nemotron-3-Super-120B-A12B confirmed
4. Fetch modelinfo → arch:moe, family:nemotron3
5. Faceting: family:nemotron3, arch:moe, size:xlarge, quant:fp8, ctx:32k
6. tables300 cohort: NEW (data/inf-tables/NVIDIA-Nemotron-3-Super-120B-A12B/ does not exist)
7. Generate and execute script → move folder, create evals.json
8. Verify with cohort.py verify data/inf-tables/NVIDIA-Nemotron-3-Super-120B-A12B
9. ✓ Imported to data/inf-tables/NVIDIA-Nemotron-3-Super-120B-A12B/
Symptom (from evaluate.py):
✗ Failed eval ccbea6
Expected exactly 1 scenario (model/template/sampler), got 2:
{('Model-fp16-16k', 'zeroshot-nosys', 'sampler1'), ('Model-fp16-16k', 'zerocot-nosys', 'sampler2')}
Symptom (from cohort.py verify):
Eval: Model (FP16, 16k)
eval_id: ccbea6
Expected: model=Model-fp16-16k, template=zerocot-nosys, sampler=instruct
Glob: data/r12/Model/*_Model-fp16-16k_*/*
Found 2 result folders
✗ Expected exactly 1 scenario (model/template/sampler), got 2:
- (Model-fp16-16k, zeroshot-nosys, sampler1)
- (Model-fp16-16k, zerocot-nosys, sampler2)
Cause: Glob pattern is too broad and matches multiple template/sampler combinations.
Diagnosis:
# Use verify command to catch this BEFORE running evaluate.py
python ./cohort.py verify "data/<suite>/<Cohort>"
Fix:
1. List the result folders: ls -1 data/<suite>/<Cohort>/
2. Identify the distinct templates and samplers (e.g., zeroshot-nosys vs zerocot-nosys)
3. Edit data/<suite>/<Cohort>/evals.json
4. Replace *_Model-fp16-16k_*/* with specific patterns:
   - *_Model-fp16-16k_zeroshot-nosys_sampler1_*/*
   - *_Model-fp16-16k_zerocot-nosys_sampler2_*/*
5. Re-verify: python ./cohort.py verify "data/<suite>/<Cohort>"
6. List the cohort: python ./cohort.py list "data/<suite>/<Cohort>"
Symptom (from cohort.py verify):
Eval: Model (FP16, 16k)
eval_id: abc123
Expected: model=Model-v1_5-fp16, template=zeroshot-nosys, sampler=greedy
Glob: data/r12/Model/*_Model-v1_5-fp16_*/*
Found 1 result folders
✗ Scenario mismatch!
Expected: (Model-v1_5-fp16, zeroshot-nosys, greedy)
Actual: (Model-v1-5-fp16, zeroshot-nosys, greedy)
Cause: Filters use different naming than actual folder names (e.g., underscores vs hyphens).
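A naming mismatch of this kind can be spotted by comparing the names after normalizing separators. The sketch below is an illustration rather than project tooling: `suggest_model_filter` is a hypothetical helper, and it only covers underscore-versus-hyphen and letter-case differences.

```python
def normalize(name: str) -> str:
    """Fold underscores into hyphens and ignore case for comparison only."""
    return name.replace("_", "-").lower()


def suggest_model_filter(expected: str, actual_models: list):
    """Return the actual model name the filter most likely meant, else None."""
    if expected in actual_models:
        return expected  # the filter already matches a real folder name
    for actual in actual_models:
        if normalize(actual) == normalize(expected):
            return actual
    return None


print(suggest_model_filter("Model-v1_5-fp16", ["Model-v1-5-fp16"]))
# prints: Model-v1-5-fp16
```

The returned value is the name to put in the model filter, since the folder names on disk are the source of truth.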
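Fix:
1. Check the actual folder names: ls -1 data/<suite>/<Cohort>/
2. Edit data/<suite>/<Cohort>/evals.json
3. Update the model filter to match the actual folder naming
4. Re-verify: python ./cohort.py verify "data/<suite>/<Cohort>"
- Activate the virtual environment first: source venv/bin/activate
- r12 has a single difficulty tier (d0 only); do not expect d1/d2 folders
- tables300 folders span multiple difficulty degrees (e.g., d10-14-18-22-26-30)
- cohort.py list --search searches all cohorts by default; no path argument is needed
- Use /tmp/ for temporary modelinfo cache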