Run automated post-release workflow (eval baselines, dashboard, docs) for AILANG releases...
Run post-release tasks for an AILANG release: evaluation baselines, dashboard updates, and documentation.
Use the data above first. Only re-run these commands manually if the injected context is empty or you need to refresh after making changes.
Most common usage:
# User says: "Run post-release tasks for v0.3.14"
# This skill will:
# 1. Run eval baseline (extended_suite: 6 production models + lang-harness sweep) - ALWAYS USE --full FOR RELEASES
# 2. Update website dashboard (JSON with history preservation)
# 3. Update axiom scorecard KPI (if features affect axiom compliance)
# 4. Extract metrics and UPDATE CHANGELOG.md automatically
# 5. Move design docs from planned/ to implemented/
# 6. Run docs-sync to verify website accuracy (version constants, PLANNED banners, examples)
# 7. Commit all changes to git
🚨 CRITICAL: For releases, ALWAYS use --full flag by default
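Chained together, a typical release run looks like the sketch below. It only uses the scripts documented in this skill; in practice run them one at a time and check each step's output before continuing (steps 3 and 7, the axiom scorecard and the final commit, are manual):

```bash
# Post-release sequence for a hypothetical v0.3.14 release.
VERSION=0.3.14

# Steps 1-2: eval baseline (recommended flags) and dashboard update
.claude/skills/post-release/scripts/run_eval_baseline.sh "$VERSION" --full --lang-harness
.claude/skills/post-release/scripts/update_dashboard.sh "$VERSION"

# Step 4: CHANGELOG metrics template (reads docs/static/benchmarks/latest.json)
.claude/skills/post-release/scripts/extract_changelog_metrics.sh

# Step 5: design doc cleanup (preview, then execute)
.claude/skills/post-release/scripts/cleanup_design_docs.sh "$VERSION" --dry-run
.claude/skills/post-release/scripts/cleanup_design_docs.sh "$VERSION"

# Step 6: docs-sync report
.claude/skills/docs-sync/scripts/generate_report.sh
```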
Invoke this skill when a release has just been tagged and published (e.g. the user says "Run post-release tasks for vX.X.X").
scripts/run_eval_baseline.sh <version> [--full] [--lang-harness] [--cross-harness]
Run evaluation baseline for a release version.
🚨 CRITICAL: ALWAYS use --full for releases!
Usage:
# ✅ RECOMMENDED release baseline (standard + agent + 4-language Explorer sweep) — ~$23
.claude/skills/post-release/scripts/run_eval_baseline.sh 0.15.0 --full --lang-harness
# Standard + agent only (no 4-lang Explorer sweep) — ~$16
.claude/skills/post-release/scripts/run_eval_baseline.sh 0.15.0 --full
# Major release: includes cross-harness comparison (gpt5-5 + opencode-gpt5-5 etc.) — ~$47
.claude/skills/post-release/scripts/run_eval_baseline.sh 0.15.0 --full --cross-harness
# ❌ Dev only — 3 cheap models, AILANG lang only (quick testing/validation) — ~$3.50
.claude/skills/post-release/scripts/run_eval_baseline.sh 0.15.0
Output:
Running eval baseline for 0.3.14...
Mode: FULL (extended_suite: 6 production models)
Expected cost: ~$16 (FULL) or ~$23 (FULL + lang-harness) or ~$47 (FULL + cross-harness)
Expected time: ~30-60 minutes
[Running benchmarks...]
✓ Baseline complete
Results: eval_results/baselines/0.3.14
Files: 726 result files
What it does:
- **extended_suite** (`--full`, 6 models): gpt5-5 (Apr 2026 flagship), gpt5-4-mini, claude-opus-4-7, claude-sonnet-4-6, gemini-3-1-pro, gemini-3-flash
- **dev_models** (default): gpt5-4-mini, claude-haiku-4-5, gemini-3-flash
- **agent_suite** (4 cost-tuned models): claude-sonnet-4-6 (longitudinal anchor), gemini-3-flash, gpt5-4-mini, opencode-sonnet-4-6 (cross-harness pair)
- Benchmark tiers: smoke (15), core (22), stretch (11), vision (6)
- Default tier is core,stretch — Core is the headline metric, Stretch is harder mixed results
- Targets: core 70%+ for AILANG; vision intentionally low
- **lang_harness_suite** (`--lang-harness`): claude-haiku-4-5, gemini-3-flash, gpt5-4-mini, opencode-haiku
  - Runs core only (22 benchmarks) — stretch is skipped here even if `--tier core,stretch` was set globally
  - AILANG-only benchmarks (contract_bst_validate, contract_roman_numeral, effect_composition, effect_tracking_io_fs) auto-skip on JS/Go runs
- **harness_suite** (`--cross-harness`, 6 models, paired across harnesses)
- Results land in eval_results/baselines/vX.X.X/

⚠️ Note: gpt5-5-pro is in models.yml but not in any default suite — agent mode is blocked (codex rejects with ChatGPT account, opencode returns 0 tool calls). Don't add it to suites.
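After a baseline finishes, a quick file count catches silent failures early. A minimal sketch, assuming one result file per benchmark run as in the example output above:

```bash
# Expect hundreds of result files for a --full run (726 in the example above)
ls eval_results/baselines/0.3.14 | wc -l
```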
scripts/update_dashboard.sh <version>
Update website benchmark dashboard with new release data.
Usage:
.claude/skills/post-release/scripts/update_dashboard.sh 0.3.14
Output:
Updating dashboard for 0.3.14...
1/5 Generating Docusaurus markdown...
✓ Written to docs/docs/benchmarks/performance.md
2/5 Generating dashboard JSON with history...
✓ Written to docs/static/benchmarks/latest.json (history preserved)
3/5 Validating JSON...
✓ Version: 0.3.14
✓ Success rate: 0.627
4/5 Clearing Docusaurus cache...
✓ Cache cleared
5/5 Summary
✓ Dashboard updated for 0.3.14
✓ Markdown: docs/docs/benchmarks/performance.md
✓ JSON: docs/static/benchmarks/latest.json
Next steps:
1. Test locally: cd docs && npm start
2. Visit: http://localhost:3000/ailang/docs/benchmarks/performance
3. Verify timeline shows 0.3.14
4. Commit: git add docs/docs/benchmarks/performance.md docs/static/benchmarks/latest.json
5. Commit: git commit -m 'Update benchmark dashboard for 0.3.14'
6. Push: git push
What it does:
- Generates the Docusaurus markdown page (docs/docs/benchmarks/performance.md)
- Generates the dashboard JSON, preserving existing history (docs/static/benchmarks/latest.json)
- Validates the JSON (version and success rate)
- Clears the Docusaurus cache
scripts/extract_changelog_metrics.sh [json_file]
Extract benchmark metrics from dashboard JSON for CHANGELOG.
Usage:
.claude/skills/post-release/scripts/extract_changelog_metrics.sh
# Or specify JSON file:
.claude/skills/post-release/scripts/extract_changelog_metrics.sh docs/static/benchmarks/latest.json
Output:
Extracting metrics from docs/static/benchmarks/latest.json...
=== CHANGELOG.md Template ===
### Benchmark Results (M-EVAL)
**Overall Performance**: 59.1% success rate (399 total runs)
**By Language:**
- **AILANG**: 33.0% - New language, learning curve
- **Python**: 87.0% - Baseline for comparison
- **Gap**: 54.0 percentage points (expected for new language)
**Comparison**: -15.2% AILANG regression from 0.3.14 (48.2% → 33.0%)
=== End Template ===
Use this template in CHANGELOG.md for 0.3.15
What it does:
scripts/cleanup_design_docs.sh <version> [--dry-run] [--force] [--check-only]
Move design docs from planned/ to implemented/ after a release.
Features:
- Detects duplicates (already in implemented/) and misplaced docs (wrong version folder)
- --check-only reports issues without making changes
- --dry-run previews what would be moved/deleted/relocated
- --force moves all docs regardless of status
Usage:
# Check-only: Report issues without making changes
.claude/skills/post-release/scripts/cleanup_design_docs.sh 0.5.9 --check-only
# Preview what would be moved/deleted/relocated
.claude/skills/post-release/scripts/cleanup_design_docs.sh 0.5.9 --dry-run
# Execute: Move implemented docs, delete duplicates, relocate misplaced
.claude/skills/post-release/scripts/cleanup_design_docs.sh 0.5.9
# Force move all docs regardless of status
.claude/skills/post-release/scripts/cleanup_design_docs.sh 0.5.9 --force
Output:
Design Doc Cleanup for v0.5.9
==================================
Checking 5 design doc(s) in design_docs/planned/v0_5_9/:
Phase 1: Detecting issues...
[DUPLICATE] m-fix-if-else-let-block.md
Already exists in design_docs/implemented/v0_5_9/
[MISPLACED] m-codegen-value-types.md
Target: v0.5.10 (folder: v0_5_9)
Should be in: design_docs/planned/v0_5_10/
Issues found:
- 1 duplicate(s) (can be deleted)
- 1 misplaced doc(s) (wrong version folder)
Phase 2: Processing docs...
[DELETED] m-fix-if-else-let-block.md (duplicate - already in implemented/)
[RELOCATED] m-codegen-value-types.md → design_docs/planned/v0_5_10/
[MOVED] m-dx11-cycles.md → design_docs/implemented/v0_5_9/
[NEEDS REVIEW] m-unfinished-feature.md
Found: **Status**: Planned
Summary:
✓ Deleted 1 duplicate(s)
✓ Relocated 1 doc(s) to correct version folder
✓ Moved 1 doc(s) to design_docs/implemented/v0_5_9/
⚠ 1 doc(s) need review (not marked as Implemented)
What it does:
- Moves implemented docs from design_docs/planned/ to design_docs/implemented/
- Deletes duplicates and relocates misplaced docs to the correct version folder
- Flags docs not marked as Implemented for manual review
- --check-only to only report issues, --dry-run to preview actions, --force to move all

Before running any post-release tasks, verify the release exists:
git tag -l vX.X.X
gh release view vX.X.X
If the release doesn't exist, run the release-manager skill first.
🚨 CRITICAL: ALWAYS use --full for releases!
Correct workflow:
# ✅ RECOMMENDED for releases — adds 4-language Explorer data for ~$7 more
.claude/skills/post-release/scripts/run_eval_baseline.sh X.X.X --full --lang-harness
# Minimum acceptable for releases
.claude/skills/post-release/scripts/run_eval_baseline.sh X.X.X --full
This runs all 6 production models (extended_suite) with both AILANG and Python.
Tier scope for releases:
- `--tier core,stretch` (default) — 33 benchmarks (22 core + 11 stretch)
- `--tier core` — 22 benchmarks (Core is the headline metric)
- `--tier smoke,core,stretch` — 48 benchmarks (include smoke for sanity)
- vision benchmarks are research-grade and excluded by default — opt in explicitly with `--tier vision` if you want to publish those numbers.

Override tier via the script's --tier flag (see run_eval_baseline.sh --help). If unsure, the default is tuned to produce a release-ready baseline in ~30-60 minutes.
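For example, a Core-only full baseline using the documented --tier override:

```bash
# 22 core benchmarks only (the headline metric), all production models
.claude/skills/post-release/scripts/run_eval_baseline.sh X.X.X --full --tier core
```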
Cost & time (default tier core,stretch = 33 benchmarks):

| Mode | Cost | Time | Use for |
|---|---|---|---|
| `--full` | ~$16 | ~30-40 min | Standard release |
| `--full --lang-harness` | ~$23 | ~45-60 min | Recommended — adds 4-lang Explorer data |
| `--full --cross-harness` | ~$47 | ~45-60 min | Major releases (vX.0, quarterly) |
| dev (no flags) | ~$3.50 | ~10-15 min | Quick validation only — never for releases |
❌ WRONG workflow (what happened with v0.3.22):
# DON'T do this for releases!
.claude/skills/post-release/scripts/run_eval_baseline.sh X.X.X # Missing --full!
# Then try to add production models later with --skip-existing
# Result: Confusion, multiple processes, incomplete baseline
If baseline times out or is interrupted:
# Resume with ALL 6 models (maintains --full semantics)
ailang eval-suite --full --langs python,ailang --parallel 5 \
--output eval_results/baselines/X.X.X --skip-existing
The --skip-existing flag skips benchmarks that already have result files, allowing resumption of interrupted runs. But ONLY use this for recovery, not as a strategy to "add more models later".
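Before resuming, count what already exists so you can confirm afterwards that only the gaps were filled. A sketch, assuming one JSON result file per run:

```bash
# Files already present; --skip-existing leaves these untouched
ls eval_results/baselines/X.X.X/*.json 2>/dev/null | wc -l
```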
Use the automation script:
.claude/skills/post-release/scripts/update_dashboard.sh X.X.X
IMPORTANT: This script automatically regenerates the dashboard markdown and JSON (preserving history), validates the JSON, and clears the Docusaurus cache. No manual file copying is needed.
Test locally (optional but recommended):
cd docs && npm start
# Visit: http://localhost:3000/ailang/docs/benchmarks/performance
Verify:
- The timeline shows the new version
- The reported success rate matches the baseline results
Commit dashboard updates:
git add docs/docs/benchmarks/performance.md docs/static/benchmarks/latest.json
git commit -m "Update benchmark dashboard for vX.X.X"
git push
Review and update the axiom scorecard:
# View current scorecard
ailang axioms
# The scorecard is at docs/static/benchmarks/axiom_scorecard.json
# Update scores if features were added/improved that affect axiom compliance
When to update scores: only when the release adds or improves features that affect axiom compliance (e.g. capability budgets improving A9, as in the history example below).
Update history entry:
{
"version": "vX.X.X",
"date": "YYYY-MM-DD",
"score": 18,
"maxScore": 24,
"percentage": 75.0,
"notes": "Added capability budgets (A9 +1)"
}
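To append that entry from the shell, a hypothetical jq one-liner; it assumes the scorecard JSON has a top-level "history" array, so inspect the file to confirm the schema before using it:

```bash
# Append a history entry to the axiom scorecard (schema assumed; verify first)
jq '.history += [{
  "version": "vX.X.X", "date": "YYYY-MM-DD",
  "score": 18, "maxScore": 24, "percentage": 75.0,
  "notes": "Added capability budgets (A9 +1)"
}]' docs/static/benchmarks/axiom_scorecard.json > /tmp/scorecard.json \
  && mv /tmp/scorecard.json docs/static/benchmarks/axiom_scorecard.json
```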
Generate metrics template:
.claude/skills/post-release/scripts/extract_changelog_metrics.sh
This outputs a formatted template with:
- Overall success rate and total runs
- Per-language breakdown (AILANG vs Python) and the gap in percentage points
- Comparison against the previous release's baseline
Update changelog automatically:
- Write the entry in the current file under changelogs/ (find it with `ls changelogs/ | grep current`)
- CHANGELOG.md is an index file — do NOT write entries there
- For CHANGELOG template format, see resources/version_notes.md.
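A small helper to resolve the right file before editing, using only the grep convention above:

```bash
# Resolve the current changelog file; never write entries to CHANGELOG.md itself
CURRENT="changelogs/$(ls changelogs/ | grep current)"
echo "Write the release entry in: $CURRENT"
```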
v0.8.0+ (chain-based - recommended):
# Find the chain ID from the latest eval run
ailang eval-chains list
# View per-benchmark pass/fail with cost and turns
ailang eval-chains view <chain-id>
# Pass rate breakdown
ailang eval-chains stats <chain-id>
# Show failures with error details
ailang eval-chains failures <chain-id>
# Generate chain-based report
ailang eval-report --from-chain <chain-id> X.X.X --format=json
Legacy (file-based):
# Get KPIs (turns, tokens, cost by language)
.claude/skills/eval-analyzer/scripts/agent_kpis.sh eval_results/baselines/X.X.X
Target metrics: agent runs should stay within a 1.5x gap on Avg Turns and a 2.0x gap on Avg Tokens vs Python (e.g. if Python averages 2.0 turns and 10k tokens, AILANG should stay within 3.0 turns and 20k tokens).
For detailed agent analysis guide, see resources/version_notes.md.
After dashboard + changelog metrics, run the curation analysis primitives. These inform
what to keep, demote, or promote for the next release — the suite is curated, not
accumulated. See benchmarks/CURATION.md for the full philosophy.
# Tag coverage: which of the 12 canonical tags are thin or over-represented?
ailang eval-matrix eval_results/baselines/vX.X.X vX.X.X --by-tags
# AILANG-only wins: AILANG beats Python by ≥ 10pp — protect these from regressions
ailang eval-matrix eval_results/baselines/vX.X.X vX.X.X --ailang-wins
# Dual-mode saturation audit (RECOMMENDED for demotion decisions):
# Lists every benchmark's standard AND agent pass rate per language.
# Demote candidates require ≥95% on ALL 4 dimensions (std-AI, std-Py, agent-AI, agent-Py).
# Standard-only saturation is misleading: many benchmarks are 100/100 in standard
# but drop to 12-50% in agent mode — those are still valuable signal, KEEP IN CORE.
.claude/skills/benchmark-manager/scripts/audit_saturation.sh vX.X.X
# Built-in saturated check (uses topline successRate only — less reliable, prefer the script above):
ailang eval-matrix eval_results/baselines/vX.X.X vX.X.X --show-saturated
# Optional: compare this release's tag deltas against the previous baseline
ailang eval-report eval_results/baselines/vX.X.X vX.X.X --format=json
Record three things in the release notes (or the design doc retro):
- Tag coverage: which canonical tags are thin or over-represented (from --by-tags)
- AILANG-only wins to protect from regressions (from --ailang-wins)
- Saturated benchmarks that should move to stretch or be retired in the next sprint (from the saturation audit)

This step is cheap (seconds), has no external dependencies, and produces the input for the next release's benchmark-manager / eval-gap-finder work.
Step 1: Check for issues (duplicates, misplaced docs):
.claude/skills/post-release/scripts/cleanup_design_docs.sh X.X.X --check-only
Step 2: Preview all changes:
.claude/skills/post-release/scripts/cleanup_design_docs.sh X.X.X --dry-run
Step 3: Check any flagged docs:
- [DUPLICATE] - These will be deleted (already in implemented/)
- [MISPLACED] - These will be relocated to the correct version folder
- [NEEDS REVIEW] - Update **Status**: to Implemented if done, or leave for the next version

Step 4: Execute the cleanup:
.claude/skills/post-release/scripts/cleanup_design_docs.sh X.X.X
Step 5: Commit the changes:
git add design_docs/
git commit -m "docs: cleanup design docs for vX_Y_Z"
The script automatically detects duplicates, relocates misplaced docs, and flags docs that still need review (see the output above).

Update prompts and documentation:
- Update prompts/ with the latest AILANG syntax (ailang prompt)
- Update prompts/devtools/ with the latest toolchain docs (ailang devtools-prompt)
- Verify new commands appear: ailang devtools-prompt | grep "new-command"
- Update the website (docs/) with the latest features
- Update docs/guides/evaluation/ if there are significant benchmark improvements
- Test the examples in docs/LIMITATIONS.md:
# Test examples from LIMITATIONS.md
# Example: Test Y-combinator still fails (should fail)
echo 'let Y = \f. (\x. f(x(x)))(\x. f(x(x))) in Y' | ailang repl
# Example: Test named recursion works (should succeed)
ailang run examples/factorial.ail
# Example: Test polymorphic operator limitation (should panic with floats)
# Create test file and verify behavior matches documentation
git add docs/LIMITATIONS.md && git commit -m "Update LIMITATIONS.md for vX.X.X"

Run docs-sync to verify website accuracy:
# Check version constants are correct
.claude/skills/docs-sync/scripts/check_versions.sh
# Audit design docs vs website claims
.claude/skills/docs-sync/scripts/audit_design_docs.sh
# Generate full sync report
.claude/skills/docs-sync/scripts/generate_report.sh
What docs-sync checks:
- Version constants in docs/src/constants/version.js match the git tag
- PLANNED banners are removed for shipped features
- Examples match the released implementation

If issues are found, fix them, then commit the docs-sync fixes:
git add docs/
git commit -m "docs: sync website with vX.X.X implementation"
See docs-sync skill for full documentation.
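check_versions.sh is the authoritative check; for a quick manual spot-check of the version constant, something like this works (a sketch: the exact contents of version.js are an assumption):

```bash
# Does the docs version constant mention the latest tag?
TAG="$(git describe --tags --abbrev=0)"
grep -n "${TAG#v}" docs/src/constants/version.js \
  || echo "version.js does not mention ${TAG}; fix and re-run docs-sync"
```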
release-manager runs make brain-index-syntax-reset immediately after the
tag pushes. This step verifies that the reset actually populated the
brain with chunks tagged with the new release version — protects against
silent indexer failures that would leave Claude pulling stale snippets.
Spot-check ≥5 chunks reference the active version:
EXPECTED_VERSION="$(ailang prompt --version-active)"
COUNT="$(ailang cache search --namespace ailang-syntax --limit 5 "string" \
  | grep -c "version:${EXPECTED_VERSION}")"
[ "$COUNT" -ge 5 ] \
  || { echo "FAIL: only ${COUNT}/5 sampled chunks reference ${EXPECTED_VERSION}"; exit 1; }
Quick stat check:
ailang cache stats | grep -E "ailang-(syntax|builtins|examples)"
# Expect: ailang-syntax >= 50, ailang-builtins >= 250, ailang-examples >= 100
Append a one-line audit to release notes (or eval_results/baselines/<version>/notes.md):
μRAG corpus reindex verified: <ailang-syntax count>, <ailang-builtins count>, <ailang-examples count> chunks @ <version>.
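A sketch that builds the audit line from the stats output; it pastes the matching stats lines verbatim, so it works regardless of the exact column format (adjust the baselines path prefix if your directories use a leading v):

```bash
# Append the μRAG audit line to the baseline notes
VERSION="$(ailang prompt --version-active)"
COUNTS="$(ailang cache stats | grep -E 'ailang-(syntax|builtins|examples)' | tr '\n' ';')"
echo "μRAG corpus reindex verified: ${COUNTS} @ ${VERSION}" \
  >> "eval_results/baselines/${VERSION}/notes.md"
```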
If verification fails:
- Re-run make brain-index-syntax-reset from the project root.
- Confirm ailang prompt --version-active returns the new release tag.

See resources/post_release_checklist.md for the complete step-by-step checklist.
Prerequisites: ailang binary installed (for eval baseline)

Cause: Baseline run timed out or was interrupted
Solution: Use the --skip-existing flag to resume:
bin/ailang eval-suite --full --skip-existing --output eval_results/baselines/vX.X.X
Cause: Wrong JSON file (performance matrix vs dashboard JSON)
Solution: Use update_dashboard.sh script, not manual file copying
Cause: Stale build cache
Solution: Run cd docs && npm run clear && rm -rf .docusaurus build
Cause: Didn't run update_dashboard.sh with correct version
Solution: Re-run update_dashboard.sh X.X.X with correct version
This skill loads information progressively:
- scripts/ directory (automation)
- resources/post_release_checklist.md (detailed checklist)

Scripts execute without loading into the context window, saving tokens while providing powerful automation.
For historical improvements and lessons learned, see resources/version_notes.md.
Key points:
- --validate flag to check configuration before running
- Tier-based benchmark selection (--tier core,stretch default) — benchmarks are resolved from benchmarks/*.yml by tier, not a hardcoded list
- ailang eval-matrix --by-tags / --show-saturated / --ailang-wins in Step 5b
- --full flag for release baselines (all production models)
- --check-only to report issues without changes
- --dry-run to preview all actions
- --force to move all regardless of status