    the-crypt-keeper/import
    AI & ML
    About

    Import model evaluation results into ReasonScape m12x dataset. Use when importing new models, quantization variants, or adding evaluation data...

    SKILL.md

    Import Model Skill

    You are helping the user import new model evaluation results into the ReasonScape dataset structure.

    Supported Test Suites

    This project has two active test suites:

    • r12 — primary reasoning evaluation suite (data/r12/)
    • tables300 — inference/tables evaluation suite (data/inf-tables/)

    m12x has been archived and its data removed. Do not reference m12x.

    Workflow

    When the user invokes /import <model-name>, follow these steps:

    Important: Use only the AskUserQuestion tool for all user interactions to keep the flow smooth.

    1. Search for Results and Determine Suite

    If the user does not provide a model name/pattern, list all available results:

    ls -1 results/
    

    Otherwise, search results/ for folders matching the model name pattern across all suites:

    ls -1 results/ | grep -i "<model-name>"
    

    Extract all unique model variants found and group by:

    • Test suite: r12 vs tables300 (from folder name prefix)
    • Quantization (e.g., fp16, AWQ, FP8)
    • Sampler variants (e.g., base vs sglang)
    • Context variants (e.g., default vs 16k)
    • Template and sampler combinations

    Folder naming patterns:

    • r12: <date>_r12_d0_<model-name>_<template>_<sampler>_<mode>_<precision>
    • tables300: <date>_inf-tables_<tiers>_<model-name>_<template>_<sampler>_<mode>_<precision>
      • <tiers> is a hyphen-separated list of difficulty degrees, e.g. d10-14-18-22-26-30

    Note: r12 has a single difficulty tier (d0 only). tables300 spans multiple difficulty degrees in a single folder.
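    The folder naming patterns above can be split mechanically. The following is a minimal sketch, not part of the ReasonScape tooling: parse_result_folder and ResultFolder are hypothetical names, and the sketch assumes that none of the model, template, sampler, mode, or precision fields contains an underscore (the date may, as in 2026-02-18_23-31-16).

```python
from typing import NamedTuple


class ResultFolder(NamedTuple):
    """Fields of a result folder name (hypothetical helper type)."""
    date: str       # may itself contain "_", e.g. "2026-02-18_23-31-16"
    suite: str      # suite token exactly as it appears in the folder name
    tiers: str      # "d0" for r12, or a list such as "d10-14-18-22-26-30"
    model: str      # model name including quant/context suffix
    template: str
    sampler: str
    mode: str
    precision: str


def parse_result_folder(name: str) -> ResultFolder:
    """Split a result folder name into its fields.

    Assumption: the last five "_"-separated fields never contain "_".
    """
    # Peel the five trailing fields off from the right.
    parts = name.rsplit("_", 5)
    if len(parts) != 6:
        raise ValueError(f"unexpected folder name: {name!r}")
    head, model, template, sampler, mode, precision = parts

    # What remains is <date>_<suite>_<tiers>; only the date may contain "_".
    head_parts = head.rsplit("_", 2)
    if len(head_parts) != 3:
        raise ValueError(f"unexpected folder prefix: {head!r}")
    date, suite, tiers = head_parts
    return ResultFolder(date, suite, tiers, model, template, sampler, mode, precision)
```

    A model name that contains an underscore would be split incorrectly by this sketch, so treat its output as a starting point and confirm against the actual folder listing.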

    Suite selection:

    • If the user specified a suite → use that suite
    • If results exist for only ONE suite → use that suite and inform the user
    • If results exist for MULTIPLE suites → use AskUserQuestion to let the user pick which suite to import into
    • If NO results found → error and ask user to check the model name

    Handle variant ambiguity:

    • If only ONE distinct variant found → Proceed with that variant
    • If MULTIPLE distinct variants found → Use AskUserQuestion to confirm which to import
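    The suite and variant rules above share one decision shape: use the single candidate, ask when there are several, and fail when there are none. A small sketch of that shape follows; resolve_choice is a hypothetical helper, and the "ask" outcome stands for a call to AskUserQuestion.

```python
from typing import List, Optional, Sequence, Tuple


def resolve_choice(found: Sequence[str],
                   requested: Optional[str] = None) -> Tuple[str, List[str]]:
    """Decide how to proceed given the candidates found in results/.

    Returns one of:
      ("use", [choice])    proceed with this suite or variant
      ("ask", [options])   let the user pick (via AskUserQuestion)
      ("error", [])        nothing found; ask the user to check the model name
    """
    if requested is not None:
        # The user specified a suite explicitly: use it.
        return ("use", [requested])
    unique = sorted(set(found))
    if not unique:
        return ("error", [])
    if len(unique) == 1:
        return ("use", unique)
    return ("ask", unique)
```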

    2. Check for Existing Cohort (CRITICAL STEP)

    Before fetching metadata, search for any existing cohort data for this model across all suites.

    2a. Determine Cohort Directory Name

    From the model name and results folders, determine the cohort directory name.

    What is a cohort? A cohort groups equivalent models together, including:

    • The base model and its variants (e.g., Qwen3-30B-A3B-Thinking + Qwen3-30B-A3B-Instruct)
    • All quantizations of the same model (fp16, AWQ, FP8, GPTQ)
    • Context extensions/REAPs (e.g., GLM-4.7 is a REAP of GLM-4.5)

    A cohort does NOT include:

    • Significantly different model sizes (GLM-4.5-Air is 1/3 the size → separate cohort)
    • Different model architectures or generations

    Determining cohort name:

    • Look at folder name patterns (e.g., 2026-..._r12_d0_MiroThinker-v1.5-30B-fp16_*)
    • Extract base model identifier: MiroThinker-v1.5-30B, Qwen3-30B-A3B, GLM-4.5, etc.
    • Target cohort: data/<suite>/<ModelIdentifier>/

    2b. Search All Existing Cohorts

    The cohort.py list --search command searches inside all available cohorts by default:

    python ./cohort.py list --search '<ModelFamilyRegExp>'
    

    If a match is found, extract and reuse:

    • Label: Copy or adapt the human-readable label
    • groups: Copy the facet array (family, arch, size) — add/adjust quant:* and ctx:* as appropriate
    • hf_id: Copy directly
    • hf_quant_id: Copy if applicable
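    Reusing a groups array usually means keeping the family, arch, and size facets and replacing only the quant:* and ctx:* facets. A minimal sketch of that adjustment, assuming groups is a flat list of "key:value" strings as in the examples in this document (adapt_groups is a hypothetical helper):

```python
from typing import List, Optional


def adapt_groups(existing: List[str], quant: str,
                 ctx: Optional[str] = None) -> List[str]:
    """Copy a groups array, replacing its quant:* and ctx:* facets."""
    # Keep every facet except the ones that differ between variants.
    kept = [g for g in existing if not g.startswith(("quant:", "ctx:"))]
    kept.append(f"quant:{quant.lower()}")
    # ctx:* is added only when the folder name states a context length.
    if ctx:
        kept.append(f"ctx:{ctx.lower()}")
    return kept
```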

    2c. Check Target Suite Cohort

    Check if the cohort already exists in the target suite:

    ls "data/<suite>/<ModelFamily>/" 2>/dev/null
    cat "data/<suite>/<ModelFamily>/evals.json" 2>/dev/null | jq .
    

    2d. Report Context

    Before proceeding, report:

    • Cohort directory: data/<suite>/<ModelFamily>/
    • Existing cohort search: FOUND (with reusable metadata) or NOT FOUND
    • Target suite cohort: EXISTS (existing evals) or NEW
    • If FOUND: List reusable metadata (label, groups, hf_id)

    3. Find HuggingFace Model ID

    Skip this step if a usable hf_id was found in step 2b.

    Otherwise, search using the HF CLI:

    hf models ls --search "<model-name>" --limit 10
    

    Use AskUserQuestion to confirm the HF ID:

    • If 1 clear match: Ask "Confirm HF model?" with the top result as default option
    • If multiple matches: Present top 3-5 as options and let user pick

    4. Fetch Model Metadata

    Skip this step if arch and family were found in step 2b.

    Otherwise, run modelinfo:

    python analyze.py modelinfo --hf-id <hf_id> --output-dir /tmp/import-modelinfo
    

    Read the generated MODELINFO.md and extract:

    • base_model: Extract base architecture family (e.g., "Qwen/Qwen3-30B" → family:qwen3)
    • Architecture: Extract arch type:
      • *MoeForCausalLM → arch:moe
      • *ForCausalLM (non-MoE) → arch:dense
      • Look for SSM/Mamba/hybrid indicators → arch:ssm or arch:hybrid
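    The architecture mapping above can be approximated by matching on the architecture class name. This is a heuristic sketch only: arch_facet is a hypothetical helper, the substring checks for SSM, Mamba, and hybrid models are assumptions about how such names look, and MODELINFO.md should still be read for anything unusual.

```python
def arch_facet(architecture: str) -> str:
    """Map an architecture class name to an arch:* facet (heuristic)."""
    name = architecture.lower()
    # Check the SSM/hybrid indicators first, because such class names
    # can also end in "ForCausalLM".
    if "hybrid" in name:
        return "arch:hybrid"
    if "mamba" in name or "ssm" in name:
        return "arch:ssm"
    if name.endswith("moeforcausallm"):
        return "arch:moe"
    if name.endswith("forcausallm"):
        return "arch:dense"
    raise ValueError(f"cannot classify architecture: {architecture!r}")
```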

    5. Extract Metadata from Folder Names

    Parse the result folder names to extract:

    • Model name with quant: e.g., MiroThinker-v1.5-30B-fp16, Qwen3-30B-A3B-Thinking-2507-fp16-16k
    • Template: e.g., zeroshot-nosys, zerocot-nosys
    • Sampler: e.g., qwen3-think-max, greedy-max
    • Quantization: from model name suffix (fp16, AWQ, FP8, GPTQ)
    • Context length: from suffix like -16k → ctx:16k facet (ONLY if explicitly in folder name, otherwise omit)
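    The quantization and context suffixes can be peeled off the model name with a regular expression. A minimal sketch, assuming the suffix is always -<quant> optionally followed by -<N>k and that the only quantization labels are the four listed above (split_model_suffix is a hypothetical helper):

```python
import re
from typing import Optional, Tuple

# -<quant>, optionally followed by -<N>k, anchored at the end of the name.
_SUFFIX = re.compile(r"-(fp16|fp8|awq|gptq)(?:-(\d+k))?$", re.IGNORECASE)


def split_model_suffix(model: str) -> Tuple[str, str, Optional[str]]:
    """Return (base_name, quant, ctx); ctx is None when no -<N>k suffix exists."""
    match = _SUFFIX.search(model)
    if match is None:
        raise ValueError(f"no quantization suffix in {model!r}")
    base = model[:match.start()]
    quant = match.group(1).lower()
    ctx = match.group(2).lower() if match.group(2) else None
    return base, quant, ctx
```

    A ctx of None means the ctx:* facet should be omitted, not defaulted.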

    6. Determine Faceting

    Build the groups array. If copying from an existing cohort, add or adjust quant:* and ctx:* facets as needed.

    Size (infer from parameter count in model name):

    • tiny: <4B
    • small: 4-8B
    • mid: >8B and <20B
    • large: 20B to <100B
    • xlarge: ≥100B

    Architecture (from modelinfo or an existing cohort):

    • arch:dense / arch:moe / arch:ssm / arch:hybrid

    Quantization (from folder name - ALWAYS include):

    • quant:fp16 / quant:fp8 / quant:awq / quant:gptq

    Context Length (from folder name - ONLY if explicit):

    • ctx:16k if folder name contains -16k
    • OMIT this facet if no context suffix in folder name

    Families:

    • Base family from base_model (e.g., family:qwen3, family:llama)
    • Finetune family from model name if applicable (e.g., family:mirothinker)
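    The faceting rules above can be assembled into a groups array as sketched below. The helper names are hypothetical, the size facet is written as size:<band> following the examples in this document, and taking the largest "<N>B" token in the model name as the parameter count is an assumption (it picks 30 for a name like Qwen3-30B-A3B).

```python
import re
from typing import List, Optional


def param_count_b(model_name: str) -> float:
    """Largest '<N>B' token in the model name, in billions of parameters."""
    # The lookahead rejects tokens such as "8bit", where "b" starts a word.
    hits = re.findall(r"(\d+(?:\.\d+)?)b(?![a-z0-9])", model_name, re.IGNORECASE)
    if not hits:
        raise ValueError(f"no parameter count found in {model_name!r}")
    return max(float(h) for h in hits)


def size_band(billions: float) -> str:
    """Map a parameter count to a size band, per the Size Band Reference."""
    if billions < 4:
        return "tiny"
    if billions <= 8:
        return "small"
    if billions < 20:
        return "mid"
    if billions < 100:
        return "large"
    return "xlarge"


def build_groups(families: List[str], arch: str, model_name: str,
                 quant: str, ctx: Optional[str] = None) -> List[str]:
    """Assemble the facet list: families, arch, size, quant, optional ctx."""
    groups = [f"family:{f}" for f in families]
    groups.append(f"arch:{arch}")
    groups.append(f"size:{size_band(param_count_b(model_name))}")
    groups.append(f"quant:{quant.lower()}")  # always included
    if ctx:                                  # only if explicit in the folder name
        groups.append(f"ctx:{ctx.lower()}")
    return groups
```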

    7. Generate Migration Script

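    Create a Python script that moves the result folders into the target cohort directory and appends an entry to its evals.json. The template below targets r12; for tables300, substitute the suite directory (data/inf-tables/) and the folder-name prefix.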

    For each variant found:

    #!/usr/bin/env python3
    import json
    import shutil
    from pathlib import Path
    from glob import glob
    
    # Model: <model-name>-<quant>
    cohort_dir = Path("data/r12/<ModelFamily>")
    eval_json = cohort_dir / "evals.json"
    
    print("Processing <model-name>-<quant>...")
    
    # Create cohort directory if needed
    cohort_dir.mkdir(parents=True, exist_ok=True)
    
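    # Move result folders. The underscores around the model name keep
    # "-fp16" from also matching "-fp16-16k" variants.
    sources = sorted(glob("results/*_r12_*_<model-name>-<quant>_*"))
    if not sources:
        raise SystemExit("No result folders matched; nothing to import")
    for src in sources:
        dest = cohort_dir / Path(src).name
        shutil.move(src, dest)
        print(f"  Moved {src} -> {dest}")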
    
    # Build new eval entry
    new_eval = {
        "evaluate": {"glob": "data/r12/<ModelFamily>/*_<model-name>-<quant>_<template>_<sampler>_*/*"},
        "filters": {"model": "<model-name>-<quant>", "template": "<template>", "sampler": "<sampler>"},
        "label": "<Human Readable Label>",
        "groups": [<facet-array>],
        "hf_id": "<hf_id>"
    }
    
    # Create or append to evals.json
    if eval_json.exists():
        with open(eval_json) as f:
            evals = json.load(f)
        evals.append(new_eval)
    else:
        evals = [new_eval]
    
    with open(eval_json, 'w') as f:
        json.dump(evals, f, indent=2)
    
    print("✓ Imported <model-name>-<quant>")
    

    CRITICAL - Glob Pattern Rules:

    When a model has multiple template/sampler combinations (e.g., both thinking and instruct modes), you MUST create separate eval entries with SPECIFIC globs:

    ❌ WRONG (overly broad glob matches multiple templates):

    {
      "evaluate": {"glob": "data/r12/Model/*_Model-fp16-16k_*/*"},
      "filters": {"model": "Model-fp16-16k", "template": "zerocot-nosys", "sampler": "instruct"}
    }
    

    This glob *_Model-fp16-16k_*/* will match BOTH zeroshot-nosys and zerocot-nosys templates, causing:

    ✗ Expected exactly 1 scenario (model/template/sampler), got 2
    

    ✅ CORRECT (specific globs that include template name):

    [
      {
        "evaluate": {"glob": "data/r12/Model/*_Model-fp16-16k_zeroshot-nosys_think-max_*/*"},
        "filters": {"model": "Model-fp16-16k", "template": "zeroshot-nosys", "sampler": "think-max"},
        "label": "Model (FP16, 16k) Thinking"
      },
      {
        "evaluate": {"glob": "data/r12/Model/*_Model-fp16-16k_zerocot-nosys_instruct_*/*"},
        "filters": {"model": "Model-fp16-16k", "template": "zerocot-nosys", "sampler": "instruct"},
        "label": "Model (FP16, 16k) Instruct"
      }
    ]
    

    Key Rules:

    1. Each eval entry must match EXACTLY ONE (model, template, sampler) combination
    2. Include template name in glob: *_<template>_<sampler>_*/* not just *_*/*
    3. Add descriptive labels to distinguish variants: "Thinking" vs "Instruct"
    4. Never use wildcards that could match multiple templates
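    The key rules can be checked before anything is written to evals.json. The sketch below builds one specific glob per (model, template, sampler) combination and uses fnmatch to show that it selects a single folder where the broad glob selects both. specific_glob and folder_pattern are hypothetical helpers, and the folder names (including their dates) are made up for illustration.

```python
from fnmatch import fnmatchcase


def specific_glob(cohort_dir: str, model: str, template: str, sampler: str) -> str:
    """Glob that pins model, template, and sampler (rule 2 above)."""
    return f"{cohort_dir}/*_{model}_{template}_{sampler}_*/*"


def folder_pattern(glob_pattern: str) -> str:
    """The path component of the glob that matches the result folder name."""
    return glob_pattern.split("/")[-2]


# Illustrative folder names only; the dates are invented.
folders = [
    "2026-01-01_r12_d0_Model-fp16-16k_zeroshot-nosys_think-max_normal_flash",
    "2026-01-01_r12_d0_Model-fp16-16k_zerocot-nosys_instruct_normal_flash",
]

broad = "data/r12/Model/*_Model-fp16-16k_*/*"
narrow = specific_glob("data/r12/Model", "Model-fp16-16k", "zeroshot-nosys", "think-max")

broad_hits = [f for f in folders if fnmatchcase(f, folder_pattern(broad))]
narrow_hits = [f for f in folders if fnmatchcase(f, folder_pattern(narrow))]

print(len(broad_hits))   # both folders match the broad glob
print(len(narrow_hits))  # only the zeroshot-nosys/think-max folder matches
```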

    Other Important Notes:

    • Add "hf_quant_id": "<quant_id>" field if this is a quantized version with a separate HF repo
    • When adding variants to existing r12 cohort: Refine old globs to prevent matching new variants
      • Example: Change *-fp16* to *-fp16_* so it won't match -fp16-16k_ variants
    • No tags field: r12 does not use leaderboard tags (omit entirely)
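    The glob refinement in the notes above is easy to confirm with fnmatch before editing an existing evals.json. A short sketch with invented folder names; matches is a hypothetical helper.

```python
from fnmatch import fnmatchcase
from typing import List


def matches(pattern: str, names: List[str]) -> List[str]:
    """Names selected by a shell-style pattern (case-sensitive)."""
    return [n for n in names if fnmatchcase(n, pattern)]


# Illustrative folder names only; the dates are invented.
names = [
    "2026-01-01_r12_d0_Qwen3-4B-Thinking-2507-fp16_zeroshot-nosys_qwen3-think-max_normal_flash",
    "2026-01-02_r12_d0_Qwen3-4B-Thinking-2507-fp16-16k_zeroshot-nosys_qwen3-think-max_normal_flash",
]

old_hits = matches("*-fp16*", names)       # also picks up the -fp16-16k folder
refined_hits = matches("*-fp16_*", names)  # the trailing "_" excludes it
```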

    8. Execute

    Save the generated script to /tmp/ with a UNIQUE filename and execute it immediately.

    Important: /tmp/ persists across skill invocations, so always use a unique filename to avoid Write tool conflicts with previous runs.

    Use Bash heredoc to create and execute the script:

    SCRIPT_PATH="/tmp/import-$(date +%s).py"
    cat > "$SCRIPT_PATH" << 'EOF'
    #!/usr/bin/env python3
    ...script content...
    EOF
    python3 "$SCRIPT_PATH"
    

    9. Verify

    After execution, verify the import is correct:

    # Verify glob patterns match expected scenarios
    python ./cohort.py verify "data/r12/<ModelFamily>"
    

    This command validates that:

    • Each eval's glob matches exactly one (model, template, sampler) scenario
    • The matched scenario matches the filters in the eval definition
    • No parsing errors or missing folders
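    To make the three checks concrete, here is a rough re-creation of them over a list of folder names. This is not the code in cohort.py; check_eval and scenario_of are hypothetical, and the sketch assumes the folder naming pattern from step 1 with no underscores inside the model, template, or sampler fields.

```python
from fnmatch import fnmatchcase
from typing import Dict, List, Optional, Tuple


def scenario_of(folder: str) -> Tuple[str, str, str]:
    """(model, template, sampler) parsed from a result folder name."""
    parts = folder.rsplit("_", 5)
    if len(parts) != 6:
        raise ValueError(f"cannot parse folder name: {folder!r}")
    return (parts[1], parts[2], parts[3])


def check_eval(entry: Dict, folders: List[str]) -> Optional[str]:
    """Return a problem description, or None if the entry looks consistent."""
    # The second-to-last path component of the glob matches the folder name.
    pattern = entry["evaluate"]["glob"].split("/")[-2]
    matched = [f for f in folders if fnmatchcase(f, pattern)]
    if not matched:
        return "no result folders matched the glob"
    scenarios = {scenario_of(f) for f in matched}
    if len(scenarios) != 1:
        return f"expected exactly 1 scenario, got {len(scenarios)}"
    filters = entry["filters"]
    expected = (filters["model"], filters["template"], filters["sampler"])
    if scenarios != {expected}:
        return "scenario mismatch between glob and filters"
    return None
```

    Running cohort.py verify remains the authoritative check; the sketch only illustrates what a passing entry has to satisfy.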

    If verification passes, also list the cohort to show the newly imported evals:

    python ./cohort.py list "data/r12/<ModelFamily>"
    

    Report success and show the newly imported eval entries from the cohort listing.

    Edge Cases

    • Existing cohort found: Reuse label, groups, hf_id from the existing entry - this is the common case
    • Brand new model (not in any existing cohort): Run full modelinfo fetch and construct groups from scratch
    • r12 cohort already exists: Append to evals.json; refine existing globs if needed to avoid cross-matching
    • Multiple HF matches: Show top 3, ask user to pick
    • No results found: Error and ask user to check the model name (ensure results are r12_d0 or tables300_d* folders)
    • Multiple quantizations: Import all variants found
    • Multiple context lengths: Each context length gets its own eval entry with appropriate ctx:* facet
    • Different model variants with separate HF IDs: Fetch metadata and store correct hf_id per eval entry
    • Multiple template/sampler combinations: Create SEPARATE eval entries with SPECIFIC globs that include the template name (see CRITICAL section in step 7). Common pattern: one entry for zeroshot-nosys (thinking), another for zerocot-nosys (instruct). Always verify with cohort.py verify after import to ensure each eval has exactly one scenario matching its filters.

    Size Band Reference

    • 1B, 2B, 3B → tiny
    • 4B, 7B, 8B → small
    • 9B-19B → mid
    • 20B-99B → large
    • 100B+ → xlarge

    Example Invocations

    Example 1: New r12 Import (Model Not Yet in Any Suite)

    User: /import gemma-3-12b
    
    1. Found: results/2026-02-18_23-31-16_r12_d0_gemma-3-12b-it-fp16_zerocot-nosys_greedy-max_normal_flash
       Suite: r12 (only one suite found)
    2. cohort.py list --search 'gemma-3-12b': NOT FOUND in any existing cohort
    3. HF search → google/gemma-3-12b-it confirmed
    4. Fetch modelinfo → arch:dense, family:gemma3
    5. Faceting: family:gemma3, arch:dense, size:mid, quant:fp16
    6. r12 cohort: NEW (data/r12/Gemma-3-12B/ does not exist yet)
    7. Generate and execute script → move folder, create evals.json
    8. Verify with cohort.py verify data/r12/Gemma-3-12B
    9. ✓ Imported to data/r12/Gemma-3-12B/
    

    Example 2: Import with Existing Cohort Metadata (Common Case)

    User: /import NewModel-30B
    
    1. Found: results/2026-02-19_r12_d0_NewModel-30B-fp16_zeroshot-nosys_greedy-max_normal_flash
       Suite: r12
    2. cohort.py list --search 'NewModel-30B': FOUND in data/r12/NewModel-30B/evals.json
       - Reusing: label="NewModel-30B (FP16)", groups=[family:newmodel, arch:dense, size:large], hf_id=org/NewModel-30B
    3. r12 cohort: EXISTS (append new eval)
    4. Generate and execute script
    5. Verify with cohort.py verify data/r12/NewModel-30B
    6. ✓ Imported to data/r12/NewModel-30B/
    

    Example 3: Adding Context Variant to Existing Cohort

    User: /import Qwen3-4B-Thinking-16k
    
    1. Found: results/2026-02-18_r12_d0_Qwen3-4B-Thinking-2507-fp16-16k_zeroshot-nosys_qwen3-think-max_normal_flash
       Suite: r12
    2. cohort.py list --search 'Qwen3-4B': FOUND — reuse base metadata from existing cohort
    3. r12 cohort: EXISTS (data/r12/Qwen3-4B/evals.json already has fp16 entry)
    4. Extracted: ctx:16k from "-16k" suffix
    5. Refine existing glob from *-fp16* to *-fp16_* to prevent matching the -16k variant
    6. Add new entry with ctx:16k facet
    7. Execute, verify
    8. ✓ Added -16k variant to data/r12/Qwen3-4B/
    

    Example 4: tables300 Import

    User: /import nemotron-120b
    
    1. Found: results/2026-03-15_09-35-22_tables300_d10-14-18-22-26-30_NVIDIA-Nemotron-3-Super-120B-A12B-FP8-32k_zeroshot-nosys_nemotron3-max_normal_flash
       Suite: tables300 (only one suite found)
    2. cohort.py list --search 'Nemotron-3-Super-120B': NOT FOUND
    3. HF search → nvidia/Nemotron-3-Super-120B-A12B confirmed
    4. Fetch modelinfo → arch:moe, family:nemotron3
    5. Faceting: family:nemotron3, arch:moe, size:xlarge, quant:fp8, ctx:32k
    6. tables300 cohort: NEW (data/inf-tables/NVIDIA-Nemotron-3-Super-120B-A12B/ does not exist)
    7. Generate and execute script → move folder, create evals.json
    8. Verify with cohort.py verify data/inf-tables/NVIDIA-Nemotron-3-Super-120B-A12B
    9. ✓ Imported to data/inf-tables/NVIDIA-Nemotron-3-Super-120B-A12B/
    

    Troubleshooting

    Error: "Expected exactly 1 scenario, got 2"

    Symptom (from evaluate.py):

    ✗ Failed eval ccbea6
    Expected exactly 1 scenario (model/template/sampler), got 2:
    {('Model-fp16-16k', 'zeroshot-nosys', 'sampler1'), ('Model-fp16-16k', 'zerocot-nosys', 'sampler2')}
    

    Symptom (from cohort.py verify):

    Eval: Model (FP16, 16k)
      eval_id: ccbea6
      Expected: model=Model-fp16-16k, template=zerocot-nosys, sampler=instruct
      Glob: data/r12/Model/*_Model-fp16-16k_*/*
      Found 2 result folders
      ✗ Expected exactly 1 scenario (model/template/sampler), got 2:
        - (Model-fp16-16k, zeroshot-nosys, sampler1)
        - (Model-fp16-16k, zerocot-nosys, sampler2)
    

    Cause: Glob pattern is too broad and matches multiple template/sampler combinations.

    Diagnosis:

    # Use verify command to catch this BEFORE running evaluate.py
    python ./cohort.py verify "data/<suite>/<Cohort>"
    

    Fix:

    1. Check the result folders: ls -1 data/<suite>/<Cohort>/
    2. Identify which templates exist (e.g., zeroshot-nosys vs zerocot-nosys)
    3. Edit data/<suite>/<Cohort>/evals.json
    4. Replace broad glob *_Model-fp16-16k_*/* with specific patterns:
      • *_Model-fp16-16k_zeroshot-nosys_sampler1_*/*
      • *_Model-fp16-16k_zerocot-nosys_sampler2_*/*
    5. Add descriptive labels to distinguish them ("Thinking" vs "Instruct")
    6. Verify fix: python ./cohort.py verify "data/<suite>/<Cohort>"
    7. Confirm: python ./cohort.py list "data/<suite>/<Cohort>"

    Error: "Scenario mismatch"

    Symptom (from cohort.py verify):

    Eval: Model (FP16, 16k)
      eval_id: abc123
      Expected: model=Model-v1_5-fp16, template=zeroshot-nosys, sampler=greedy
      Glob: data/r12/Model/*_Model-v1_5-fp16_*/*
      Found 1 result folders
      ✗ Scenario mismatch!
        Expected: (Model-v1_5-fp16, zeroshot-nosys, greedy)
        Actual:   (Model-v1-5-fp16, zeroshot-nosys, greedy)
    

    Cause: Filters use different naming than actual folder names (e.g., underscores vs hyphens).

    Fix:

    1. Check the actual folder name: ls -1 data/<suite>/<Cohort>/
    2. Edit data/<suite>/<Cohort>/evals.json
    3. Update the model filter to match the actual folder naming
    4. Update the glob pattern to match the corrected model name
    5. Verify fix: python ./cohort.py verify "data/<suite>/<Cohort>"

    Notes

    • Always work from the ReasonScape root directory
    • Activate venv before running analyze.py: source venv/bin/activate
    • r12 results have a single tier (d0 only) — do not expect d1/d2 folders
    • tables300 results span multiple difficulty degrees encoded in the folder name (e.g., d10-14-18-22-26-30)
    • cohort.py list --search searches all cohorts by default — no path argument needed
    • Use /tmp/ for temporary modelinfo cache
    • Be explicit about what you're doing at each step
    Repository
    the-crypt-keeper/reasonscape