Import model evaluation results into ReasonScape m12x dataset. Use when importing new models, quantization variants, or adding evaluation data...


Import Model Skill

You are helping the user import new model evaluation results into the ReasonScape dataset structure.

Supported Test Suites

This project has two active test suites:

r12 — primary reasoning evaluation suite (data/r12/)
tables300 — inference/tables evaluation suite (data/inf-tables/)

m12x has been archived and its data removed. Do not reference m12x.

Workflow

When the user invokes /import <model-name>, follow these steps:

Important: Use only the AskUserQuestion tool for all user interactions to keep the flow smooth.

1. Search for Results and Determine Suite

If the user does not provide a model name/pattern, list all available results:

ls -1 results/

Otherwise, search results/ for folders matching the model name pattern across all suites:

ls -1 results/ | grep -i "<model-name>"

Extract all unique model variants found and group by:

Test suite: r12 vs tables300 (from the suite token that follows the date in the folder name)
Quantization (e.g., fp16, AWQ, FP8)
Sampler variants (e.g., base vs sglang)
Context variants (e.g., default vs 16k)
Template and sampler combinations

Folder naming patterns:

r12: <date>_r12_d0_<model-name>_<template>_<sampler>_<mode>_<precision>
tables300: <date>_inf-tables_<tiers>_<model-name>_<template>_<sampler>_<mode>_<precision>
- <tiers> is a hyphen-separated list of difficulty degrees, e.g. d10-14-18-22-26-30

Note: r12 has a single difficulty tier (d0 only). tables300 spans multiple difficulty degrees in a single folder.
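The date field can itself contain an underscore (a date plus a time), so splitting a folder name from the left is fragile. The sketch below parses from the right instead, treating the last four fields as template, sampler, mode, and precision. It is only an illustration and not part of ReasonScape: `ResultFolder`, `parse_result_folder`, and `SUITE_TOKENS` are hypothetical names, and the set of accepted suite tokens is an assumption.

```python
from typing import NamedTuple

# Assumption: suite tokens that may appear right after the date.
SUITE_TOKENS = {"r12", "inf-tables", "tables300"}


class ResultFolder(NamedTuple):
    date: str
    suite: str
    tiers: str
    model: str
    template: str
    sampler: str
    mode: str
    precision: str


def parse_result_folder(name: str) -> ResultFolder:
    """Split a result folder name into its underscore-separated fields."""
    parts = name.split("_")
    # Locate the suite token; everything before it is the date (and time).
    suite_idx = next((i for i, p in enumerate(parts) if p in SUITE_TOKENS), None)
    # After the suite token we need tiers, at least one model field,
    # and the four trailing fields.
    if suite_idx is None or len(parts) - suite_idx < 7:
        raise ValueError(f"unrecognized result folder name: {name!r}")
    template, sampler, mode, precision = parts[-4:]
    return ResultFolder(
        date="_".join(parts[:suite_idx]),
        suite=parts[suite_idx],
        tiers=parts[suite_idx + 1],
        model="_".join(parts[suite_idx + 2 : -4]),
        template=template,
        sampler=sampler,
        mode=mode,
        precision=precision,
    )


info = parse_result_folder(
    "2026-02-18_23-31-16_r12_d0_gemma-3-12b-it-fp16_zerocot-nosys_greedy-max_normal_flash"
)
print(info.suite, info.tiers, info.model, info.template, info.sampler)
```

Grouping the parsed tuples by `(suite, model, template, sampler)` gives the variant list needed for the suite and variant questions that follow.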

Suite selection:

If the user specified a suite → use that suite
If results exist for only ONE suite → use that suite and inform the user
If results exist for MULTIPLE suites → use AskUserQuestion to let the user pick which suite to import into
If NO results found → error and ask user to check the model name
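The suite selection rules above can be read as a small decision function. This is a sketch under the assumption that the rules are evaluated top to bottom; `select_suite` and the action strings it returns are hypothetical names chosen for illustration.

```python
from typing import Optional, Set, Tuple


def select_suite(found: Set[str], requested: Optional[str] = None) -> Tuple[str, Optional[str]]:
    """Return (action, suite) following the suite selection rules in order."""
    if requested:
        # The user named a suite explicitly.
        return ("use", requested)
    if not found:
        # Nothing matched: report an error and ask for a corrected model name.
        return ("error", None)
    if len(found) == 1:
        # Exactly one suite has results: use it and tell the user.
        return ("use_and_inform", next(iter(found)))
    # Several suites have results: hand the choice to AskUserQuestion.
    return ("ask_user", None)


print(select_suite({"r12"}))
print(select_suite({"r12", "tables300"}))
```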

Handle variant ambiguity:

If only ONE distinct variant found → Proceed with that variant
If MULTIPLE distinct variants found → Use AskUserQuestion to confirm which to import

2. Check for Existing Cohort (CRITICAL STEP)

Before fetching metadata, search for any existing cohort data for this model across all suites.

2a. Determine Cohort Directory Name

From the model name and results folders, determine the cohort directory name.

What is a cohort? A cohort groups equivalent models together, including:

The base model and its variants (e.g., Qwen3-30B-A3B-Thinking + Qwen3-30B-A3B-Instruct)
All quantizations of the same model (fp16, AWQ, FP8, GPTQ)
Context extensions/REAPs (e.g., GLM-4.7 is a REAP of GLM-4.5)

A cohort does NOT include:

Significantly different model sizes (GLM-4.5-Air is 1/3 the size → separate cohort)
Different model architectures or generations

Determining cohort name:

Look at folder name patterns (e.g., 2026-..._r12_d0_MiroThinker-v1.5-30B-fp16_*)
Extract base model identifier: MiroThinker-v1.5-30B, Qwen3-30B-A3B, GLM-4.5, etc.
Target cohort: data/<suite>/<ModelIdentifier>/

2b. Search All Existing Cohorts

The cohort.py list --search command searches inside all available cohorts by default:

python ./cohort.py list --search '<ModelFamilyRegExp>'

If a match is found, extract and reuse:

Label: Copy or adapt the human-readable label
groups: Copy the facet array (family, arch, size) — add/adjust quant:* and ctx:* as appropriate
hf_id: Copy directly
hf_quant_id: Copy if applicable

2c. Check Target Suite Cohort

Check if the cohort already exists in the target suite:

ls "data/<suite>/<ModelFamily>/" 2>/dev/null
cat "data/<suite>/<ModelFamily>/evals.json" 2>/dev/null | jq .

2d. Report Context

Before proceeding, report:

Cohort directory: data/<suite>/<ModelFamily>/
Existing cohort search: FOUND (with reusable metadata) or NOT FOUND
Target suite cohort: EXISTS (existing evals) or NEW
If FOUND: List reusable metadata (label, groups, hf_id)

3. Find HuggingFace Model ID

Skip this step if a usable hf_id was found in step 2b.

Otherwise, search using the HF CLI:

hf models ls --search "<model-name>" --limit 10

Use AskUserQuestion to confirm the HF ID:

If 1 clear match: Ask "Confirm HF model?" with the top result as default option
If multiple matches: Present top 3-5 as options and let user pick

4. Fetch Model Metadata

Skip this step if arch and family were found in step 2b.

Otherwise, run modelinfo:

python analyze.py modelinfo --hf-id <hf_id> --output-dir /tmp/import-modelinfo

Read the generated MODELINFO.md and extract:

base_model: Extract base architecture family (e.g., "Qwen/Qwen3-30B" → family:qwen3)
Architecture: Extract arch type:
- *MoeForCausalLM → arch:moe
- *ForCausalLM (non-MoE) → arch:dense
- Look for SSM/Mamba/hybrid indicators → arch:ssm or arch:hybrid

5. Extract Metadata from Folder Names

Parse the result folder names to extract:

Model name with quant: e.g., MiroThinker-v1.5-30B-fp16, Qwen3-30B-A3B-Thinking-2507-fp16-16k
Template: e.g., zeroshot-nosys, zerocot-nosys
Sampler: e.g., qwen3-think-max, greedy-max
Quantization: from model name suffix (fp16, AWQ, FP8, GPTQ)
Context length: from suffix like -16k → ctx:16k facet (ONLY if explicitly in folder name, otherwise omit)
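The quantization and context suffixes can be pulled off the model name with one regular expression. This is a minimal sketch, assuming the suffix order is always `-<quant>` followed by an optional `-<N>k`; `split_model_suffix` is a hypothetical helper, and only the four quantization names listed above are recognized.

```python
import re
from typing import Optional, Tuple

# Assumption: the model name ends with -<quant> and an optional -<N>k context suffix.
_SUFFIX = re.compile(
    r"^(?P<base>.+?)-(?P<quant>fp16|fp8|awq|gptq)(?:-(?P<ctx>\d+k))?$",
    re.IGNORECASE,
)


def split_model_suffix(model: str) -> Tuple[str, Optional[str], Optional[str]]:
    """Return (base name, quant, ctx); quant and ctx are None when absent."""
    m = _SUFFIX.match(model)
    if m is None:
        return model, None, None
    ctx = m.group("ctx")
    return m.group("base"), m.group("quant").lower(), ctx.lower() if ctx else None


print(split_model_suffix("Qwen3-30B-A3B-Thinking-2507-fp16-16k"))
print(split_model_suffix("MiroThinker-v1.5-30B-fp16"))
```

A `None` context value corresponds to omitting the ctx:* facet entirely.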

6. Determine Faceting

Build the groups array. If copying from an existing cohort, add or adjust quant:* and ctx:* facets as needed.

Size (infer from parameter count in model name):

tiny: <4B
small: 4B to 8B
mid: above 8B, below 20B
large: 20B to below 100B
xlarge: 100B and above

Architecture (from modelinfo or an existing cohort):

arch:dense / arch:moe / arch:ssm / arch:hybrid

Quantization (from folder name - ALWAYS include):

quant:fp16 / quant:fp8 / quant:awq / quant:gptq

Context Length (from folder name - ONLY if explicit):

ctx:16k if folder name contains -16k
OMIT this facet if no context suffix in folder name

Families:

Base family from base_model (e.g., family:qwen3, family:llama)
Finetune family from model name if applicable (e.g., family:mirothinker)
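Putting the facet rules together, a groups array could be assembled roughly as follows. This is a sketch, not ReasonScape code: `size_band` and `build_groups` are hypothetical helpers, the facet order simply mirrors the examples in this document, and the size is taken from the first standalone parameter count in the name (the total, not the active count of a MoE model), which is an assumption.

```python
import re
from typing import List, Optional

# First standalone "<number>B" token, e.g. 30B in Qwen3-30B-A3B (A3B is skipped).
_PARAMS = re.compile(r"(?<![A-Za-z0-9.])(\d+(?:\.\d+)?)[bB](?![A-Za-z0-9])")


def size_band(model_name: str) -> str:
    """Infer the size band from the parameter count in the model name."""
    m = _PARAMS.search(model_name)
    if m is None:
        raise ValueError(f"no parameter count found in {model_name!r}")
    billions = float(m.group(1))
    if billions < 4:
        return "tiny"
    if billions <= 8:
        return "small"
    if billions < 20:
        return "mid"
    if billions < 100:
        return "large"
    return "xlarge"


def build_groups(model_name: str, families: List[str], arch: str,
                 quant: str, ctx: Optional[str] = None) -> List[str]:
    """Assemble the facet list; ctx is added only when given explicitly."""
    groups = [f"family:{f}" for f in families]
    groups += [f"arch:{arch}", f"size:{size_band(model_name)}", f"quant:{quant}"]
    if ctx:
        groups.append(f"ctx:{ctx}")
    return groups


print(build_groups("gemma-3-12b-it-fp16", ["gemma3"], "dense", "fp16"))
```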

7. Generate Migration Script

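Create a Python script that moves the result folders into the target suite's cohort directory and appends to its evals.json. The template below shows r12 paths; for another suite, substitute that suite's data directory and folder-name token.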

For each variant found:

#!/usr/bin/env python3
import json
import shutil
from pathlib import Path
from glob import glob

# Model: <model-name>-<quant>
cohort_dir = Path("data/r12/<ModelFamily>")
eval_json = cohort_dir / "evals.json"

print("Processing <model-name>-<quant>...")

# Create cohort directory if needed
cohort_dir.mkdir(parents=True, exist_ok=True)

# Move result folders
for src in glob("results/*r12*<model-name>-<quant>*"):
    dest = cohort_dir / Path(src).name
    shutil.move(src, dest)
    print(f"  Moved {src} -> {dest}")

# Build new eval entry
new_eval = {
    "evaluate": {"glob": "data/r12/<ModelFamily>/*_<model-name>-<quant>_<template>_<sampler>_*/*"},
    "filters": {"model": "<model-name>-<quant>", "template": "<template>", "sampler": "<sampler>"},
    "label": "<Human Readable Label>",
    "groups": [<facet-array>],
    "hf_id": "<hf_id>"
}

# Create or append to evals.json
if eval_json.exists():
    with open(eval_json) as f:
        evals = json.load(f)
    evals.append(new_eval)
else:
    evals = [new_eval]

with open(eval_json, 'w') as f:
    json.dump(evals, f, indent=2)

print("✓ Imported <model-name>-<quant>")
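The template above appends unconditionally, so running it twice for the same variant would leave two identical entries in evals.json. One optional guard, sketched here with a hypothetical `append_eval` helper, is to skip the append when an entry with the same filters is already present. Treating equal filters as "the same eval" is an assumption about how entries are identified.

```python
from typing import Dict, List


def append_eval(evals: List[Dict], new_eval: Dict) -> bool:
    """Append new_eval unless an entry with identical filters already exists."""
    if any(e.get("filters") == new_eval.get("filters") for e in evals):
        return False
    evals.append(new_eval)
    return True


evals: List[Dict] = []
entry = {
    "filters": {"model": "Model-fp16", "template": "zeroshot-nosys", "sampler": "greedy-max"},
    "label": "Model (FP16)",
}
print(append_eval(evals, entry))        # first call adds the entry
print(append_eval(evals, dict(entry)))  # second call is a no-op
```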

CRITICAL - Glob Pattern Rules:

When a model has multiple template/sampler combinations (e.g., both thinking and instruct modes), you MUST create separate eval entries with SPECIFIC globs:

❌ WRONG (overly broad glob matches multiple templates):

{
  "evaluate": {"glob": "data/r12/Model/*_Model-fp16-16k_*/*"},
  "filters": {"model": "Model-fp16-16k", "template": "zerocot-nosys", "sampler": "instruct"}
}

This glob *_Model-fp16-16k_*/* will match BOTH zeroshot-nosys and zerocot-nosys templates, causing:

✗ Expected exactly 1 scenario (model/template/sampler), got 2

✅ CORRECT (specific globs that include template name):

[
  {
    "evaluate": {"glob": "data/r12/Model/*_Model-fp16-16k_zeroshot-nosys_think-max_*/*"},
    "filters": {"model": "Model-fp16-16k", "template": "zeroshot-nosys", "sampler": "think-max"},
    "label": "Model (FP16, 16k) Thinking"
  },
  {
    "evaluate": {"glob": "data/r12/Model/*_Model-fp16-16k_zerocot-nosys_instruct_*/*"},
    "filters": {"model": "Model-fp16-16k", "template": "zerocot-nosys", "sampler": "instruct"},
    "label": "Model (FP16, 16k) Instruct"
  }
]

Key Rules:

Each eval entry must match EXACTLY ONE (model, template, sampler) combination
Include template name in glob: *_<template>_<sampler>_*/* not just *_*/*
Add descriptive labels to distinguish variants: "Thinking" vs "Instruct"
Never use wildcards that could match multiple templates
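A glob can be checked against folder names before it goes into evals.json. The sketch below uses `fnmatch` from the standard library on the folder-level part of the pattern (without the directory prefix and the trailing `/*`). It is not how `cohort.py verify` is implemented; `scenarios_for` and the folder names are hypothetical, and the parsing assumes the r12 layout where the model follows the `d0` token.

```python
from fnmatch import fnmatchcase
from typing import Iterable, Set, Tuple


def scenarios_for(folder_glob: str, folder_names: Iterable[str]) -> Set[Tuple[str, str, str]]:
    """Return the (model, template, sampler) tuples a folder-level glob matches."""
    found = set()
    for name in folder_names:
        if fnmatchcase(name, folder_glob):
            parts = name.split("_")
            # Trailing fields: ..._<model>_<template>_<sampler>_<mode>_<precision>
            template, sampler = parts[-4], parts[-3]
            # Assumption: r12 layout, so the model starts right after "d0".
            model = "_".join(parts[parts.index("d0") + 1 : -4])
            found.add((model, template, sampler))
    return found


# Hypothetical folder names for illustration only.
folders = [
    "2026-01-01_r12_d0_Model-fp16-16k_zeroshot-nosys_think-max_normal_flash",
    "2026-01-01_r12_d0_Model-fp16-16k_zerocot-nosys_instruct_normal_flash",
]
broad = scenarios_for("*_Model-fp16-16k_*", folders)
specific = scenarios_for("*_Model-fp16-16k_zerocot-nosys_instruct_*", folders)
print(len(broad), len(specific))  # 2 1
```

A pattern is safe to use only when the returned set has exactly one element and that element equals the entry's filters.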

Other Important Notes:

Add "hf_quant_id": "<quant_id>" field if this is a quantized version with a separate HF repo
When adding variants to existing r12 cohort: Refine old globs to prevent matching new variants
- Example: Change *-fp16* to *-fp16_* so it won't match -fp16-16k_ variants
No tags field: r12 does not use leaderboard tags (omit entirely)
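The glob refinement mentioned in the notes above can be checked the same way. In this sketch the two folder names are hypothetical and `count_matches` is an illustrative helper; it shows that `*-fp16*` also catches the new `-fp16-16k` folder, while `*-fp16_*` matches only the original one.

```python
from fnmatch import fnmatchcase

# Hypothetical folder names: an existing fp16 result and a new 16k variant.
FOLDERS = [
    "2026-01-01_r12_d0_Model-fp16_zeroshot-nosys_greedy-max_normal_flash",
    "2026-02-01_r12_d0_Model-fp16-16k_zeroshot-nosys_greedy-max_normal_flash",
]


def count_matches(pattern: str) -> int:
    """Count how many of the sample folders a folder-level pattern matches."""
    return sum(fnmatchcase(name, pattern) for name in FOLDERS)


print(count_matches("*-fp16*"))   # 2: old glob also matches the 16k variant
print(count_matches("*-fp16_*"))  # 1: refined glob matches only the fp16 folder
```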

8. Execute

Save the generated script to /tmp/ with a UNIQUE filename and execute it immediately.

Important: /tmp/ persists across skill invocations, so always use a unique filename to avoid Write tool conflicts with previous runs.

Use Bash heredoc to create and execute the script:

SCRIPT_PATH="/tmp/import-$(date +%s).py"
cat > "$SCRIPT_PATH" << 'EOF'
#!/usr/bin/env python3
...script content...
EOF
python3 "$SCRIPT_PATH"

9. Verify

After execution, verify the import is correct:

# Verify glob patterns match expected scenarios
python ./cohort.py verify "data/r12/<ModelFamily>"

This command validates that:

Each eval's glob matches exactly one (model, template, sampler) scenario
The matched scenario matches the filters in the eval definition
No parsing errors or missing folders

If verification passes, also list the cohort to show the newly imported evals:

python ./cohort.py list "data/r12/<ModelFamily>"

Report success and show the newly imported eval entries from the cohort listing.

Edge Cases

Existing cohort found: Reuse label, groups, hf_id from the existing entry - this is the common case
Brand new model (not in any cohort): Run full modelinfo fetch and construct groups from scratch
r12 cohort already exists: Append to evals.json; refine existing globs if needed to avoid cross-matching
Multiple HF matches: Show the top 3-5, ask user to pick
No results found: Error and ask user to check the model name (ensure results are r12_d0 or tables300_d* folders)
Multiple quantizations: Import all variants found
Multiple context lengths: Each context length gets its own eval entry with appropriate ctx:* facet
Different model variants with separate HF IDs: Fetch metadata and store correct hf_id per eval entry
Multiple template/sampler combinations: Create SEPARATE eval entries with SPECIFIC globs that include the template name (see CRITICAL section in step 7). Common pattern: one entry for zeroshot-nosys (thinking), another for zerocot-nosys (instruct). Always verify with cohort.py verify after import to ensure each eval has exactly one scenario matching its filters.

Size Band Reference

1B, 2B, 3B → tiny
4B, 7B, 8B → small
9B-19B → mid
20B-99B → large
100B+ → xlarge

Example Invocations

Example 1: New r12 Import (Model Not Yet in Any Suite)

User: /import gemma-3-12b

1. Found: results/2026-02-18_23-31-16_r12_d0_gemma-3-12b-it-fp16_zerocot-nosys_greedy-max_normal_flash
   Suite: r12 (only one suite found)
2. cohort.py list --search 'gemma-3-12b': NOT FOUND in any existing cohort
3. HF search → google/gemma-3-12b-it confirmed
4. Fetch modelinfo → arch:dense, family:gemma3
5. Faceting: family:gemma3, arch:dense, size:mid, quant:fp16
6. r12 cohort: NEW (data/r12/Gemma-3-12B/ does not exist yet)
7. Generate and execute script → move folder, create evals.json
8. Verify with cohort.py verify data/r12/Gemma-3-12B
9. ✓ Imported to data/r12/Gemma-3-12B/

Example 2: Import with Existing Cohort Metadata (Common Case)

User: /import NewModel-30B

1. Found: results/2026-02-19_r12_d0_NewModel-30B-fp16_zeroshot-nosys_greedy-max_normal_flash
   Suite: r12
2. cohort.py list --search 'NewModel-30B': FOUND in data/r12/NewModel-30B/evals.json
   - Reusing: label="NewModel-30B (FP16)", groups=[family:newmodel, arch:dense, size:large], hf_id=org/NewModel-30B
3. r12 cohort: EXISTS (append new eval)
4. Generate and execute script
5. Verify with cohort.py verify data/r12/NewModel-30B
6. ✓ Imported to data/r12/NewModel-30B/

Example 3: Adding Context Variant to Existing Cohort

User: /import Qwen3-4B-Thinking-16k

1. Found: results/2026-02-18_r12_d0_Qwen3-4B-Thinking-2507-fp16-16k_zeroshot-nosys_qwen3-think-max_normal_flash
   Suite: r12
2. cohort.py list --search 'Qwen3-4B': FOUND — reuse base metadata from existing cohort
3. r12 cohort: EXISTS (data/r12/Qwen3-4B/evals.json already has fp16 entry)
4. Extracted: ctx:16k from "-16k" suffix
5. Refine existing glob: change *-fp16* to *-fp16_* to prevent matching the -16k variant
6. Add new entry with ctx:16k facet
7. Execute, verify
8. ✓ Added -16k variant to data/r12/Qwen3-4B/

Example 4: tables300 Import

User: /import nemotron-120b

1. Found: results/2026-03-15_09-35-22_tables300_d10-14-18-22-26-30_NVIDIA-Nemotron-3-Super-120B-A12B-FP8-32k_zeroshot-nosys_nemotron3-max_normal_flash
   Suite: tables300 (only one suite found)
2. cohort.py list --search 'Nemotron-3-Super-120B': NOT FOUND
3. HF search → nvidia/Nemotron-3-Super-120B-A12B confirmed
4. Fetch modelinfo → arch:moe, family:nemotron3
5. Faceting: family:nemotron3, arch:moe, size:xlarge, quant:fp8, ctx:32k
6. tables300 cohort: NEW (data/inf-tables/NVIDIA-Nemotron-3-Super-120B-A12B/ does not exist)
7. Generate and execute script → move folder, create evals.json
8. Verify with cohort.py verify data/inf-tables/NVIDIA-Nemotron-3-Super-120B-A12B
9. ✓ Imported to data/inf-tables/NVIDIA-Nemotron-3-Super-120B-A12B/

Troubleshooting

Error: "Expected exactly 1 scenario, got 2"

Symptom (from evaluate.py):

✗ Failed eval ccbea6
Expected exactly 1 scenario (model/template/sampler), got 2:
{('Model-fp16-16k', 'zeroshot-nosys', 'sampler1'), ('Model-fp16-16k', 'zerocot-nosys', 'sampler2')}

Symptom (from cohort.py verify):

Eval: Model (FP16, 16k)
  eval_id: ccbea6
  Expected: model=Model-fp16-16k, template=zerocot-nosys, sampler=instruct
  Glob: data/r12/Model/*_Model-fp16-16k_*/*
  Found 2 result folders
  ✗ Expected exactly 1 scenario (model/template/sampler), got 2:
    - (Model-fp16-16k, zeroshot-nosys, sampler1)
    - (Model-fp16-16k, zerocot-nosys, sampler2)

Cause: Glob pattern is too broad and matches multiple template/sampler combinations.

Diagnosis:

# Use verify command to catch this BEFORE running evaluate.py
python ./cohort.py verify "data/<suite>/<Cohort>"

Fix:

Check the result folders: ls -1 data/<suite>/<Cohort>/
Identify which templates exist (e.g., zeroshot-nosys vs zerocot-nosys)
Edit data/<suite>/<Cohort>/evals.json
Replace broad glob *_Model-fp16-16k_*/* with specific patterns:
- *_Model-fp16-16k_zeroshot-nosys_sampler1_*/*
- *_Model-fp16-16k_zerocot-nosys_sampler2_*/*
Add descriptive labels to distinguish them ("Thinking" vs "Instruct")
Verify fix: python ./cohort.py verify "data/<suite>/<Cohort>"
Confirm: python ./cohort.py list "data/<suite>/<Cohort>"

Error: "Scenario mismatch"

Symptom (from cohort.py verify):

Eval: Model (FP16, 16k)
  eval_id: abc123
  Expected: model=Model-v1_5-fp16, template=zeroshot-nosys, sampler=greedy
  Glob: data/r12/Model/*_Model-v1_5-fp16_*/*
  Found 1 result folders
  ✗ Scenario mismatch!
    Expected: (Model-v1_5-fp16, zeroshot-nosys, greedy)
    Actual:   (Model-v1-5-fp16, zeroshot-nosys, greedy)

Cause: Filters use different naming than actual folder names (e.g., underscores vs hyphens).

Fix:

Check the actual folder name: ls -1 data/<suite>/<Cohort>/
Edit data/<suite>/<Cohort>/evals.json
Update the model filter to match the actual folder naming
Update the glob pattern to match the corrected model name
Verify fix: python ./cohort.py verify "data/<suite>/<Cohort>"

Notes

Always work from the ReasonScape root directory
Activate venv before running analyze.py: source venv/bin/activate
r12 results have a single tier (d0 only) — do not expect d1/d2 folders
tables300 results span multiple difficulty degrees encoded in the folder name (e.g., d10-14-18-22-26-30)
cohort.py list --search searches all cohorts by default — no path argument needed
Use /tmp/ for temporary modelinfo cache
Be explicit about what you're doing at each step
