Run automated post-release workflow (eval baselines, dashboard, docs) for AILANG releases...
Run post-release tasks for an AILANG release: evaluation baselines, dashboard updates, and documentation.
Use the data above first. Only re-run these commands manually if the injected context is empty or you need to refresh after making changes.
Most common usage:
# User says: "Run post-release tasks for v0.3.14"
# This skill will:
# 1. Run eval baseline (extended_suite: 6 production models + lang-harness sweep) - ALWAYS USE --full FOR RELEASES
# 2. Update website dashboard (JSON with history preservation)
# 3. Update axiom scorecard KPI (if features affect axiom compliance)
# 4. Extract metrics and UPDATE CHANGELOG.md automatically
# 5. Move design docs from planned/ to implemented/
# 6. Run docs-sync to verify website accuracy (version constants, PLANNED banners, examples)
# 7. Commit all changes to git
🚨 CRITICAL: For releases, ALWAYS use --full flag by default
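Chained together, a typical release run looks like the sketch below. It only uses the scripts documented in this skill; in practice run them one at a time and check each step's output before continuing (steps 3 and 7, the axiom scorecard and the final commit, are manual):

```bash
# Post-release sequence for a hypothetical v0.3.14 release.
VERSION=0.3.14

# Steps 1-2: eval baseline (recommended flags) and dashboard update
.claude/skills/post-release/scripts/run_eval_baseline.sh "$VERSION" --full --lang-harness
.claude/skills/post-release/scripts/update_dashboard.sh "$VERSION"

# Step 4: CHANGELOG metrics template (reads docs/static/benchmarks/latest.json)
.claude/skills/post-release/scripts/extract_changelog_metrics.sh

# Step 5: design doc cleanup (preview, then execute)
.claude/skills/post-release/scripts/cleanup_design_docs.sh "$VERSION" --dry-run
.claude/skills/post-release/scripts/cleanup_design_docs.sh "$VERSION"

# Step 6: docs-sync report
.claude/skills/docs-sync/scripts/generate_report.sh
```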
Invoke this skill when a release has just been tagged and published (e.g. the user says "Run post-release tasks for vX.X.X").
scripts/run_eval_baseline.sh <version> [--full] [--lang-harness] [--cross-harness]
Run evaluation baseline for a release version.
🚨 CRITICAL: ALWAYS use --full for releases!
Usage:
# ✅ RECOMMENDED release baseline (standard + agent + 4-language Explorer sweep) — ~$23
.claude/skills/post-release/scripts/run_eval_baseline.sh 0.15.0 --full --lang-harness
# Standard + agent only (no 4-lang Explorer sweep) — ~$16
.claude/skills/post-release/scripts/run_eval_baseline.sh 0.15.0 --full
# Major release: includes cross-harness comparison (gpt5-5 + opencode-gpt5-5 etc.) — ~$47
.claude/skills/post-release/scripts/run_eval_baseline.sh 0.15.0 --full --cross-harness
# ❌ Dev only — 3 cheap models, AILANG lang only (quick testing/validation) — ~$3.50
.claude/skills/post-release/scripts/run_eval_baseline.sh 0.15.0
Output:
Running eval baseline for 0.3.14...
Mode: FULL (extended_suite: 6 production models)
Expected cost: ~$16 (FULL) or ~$23 (FULL + lang-harness) or ~$47 (FULL + cross-harness)
Expected time: ~30-60 minutes
[Running benchmarks...]
✓ Baseline complete
Results: eval_results/baselines/0.3.14
Files: 726 result files
What it does:
- **extended_suite** (`--full`, 6 models): gpt5-5 (Apr 2026 flagship), gpt5-4-mini, claude-opus-4-7, claude-sonnet-4-6, gemini-3-1-pro, gemini-3-flash
- **dev_models** (default): gpt5-4-mini, claude-haiku-4-5, gemini-3-flash
- **agent_suite** (4 cost-tuned models): claude-sonnet-4-6 (longitudinal anchor), gemini-3-flash, gpt5-4-mini, opencode-sonnet-4-6 (cross-harness pair)
- Benchmark tiers: smoke (15), core (22), stretch (11), vision (6)
- Default tier is core,stretch — Core is the headline metric, Stretch is harder mixed results
- Targets: core 70%+ for AILANG; vision intentionally low
- **lang_harness_suite** (`--lang-harness`): claude-haiku-4-5, gemini-3-flash, gpt5-4-mini, opencode-haiku
  - Runs core only (22 benchmarks) — stretch is skipped here even if `--tier core,stretch` was set globally
  - AILANG-only benchmarks (contract_bst_validate, contract_roman_numeral, effect_composition, effect_tracking_io_fs) auto-skip on JS/Go runs
- **harness_suite** (`--cross-harness`, 6 models, paired across harnesses)
- Results land in eval_results/baselines/vX.X.X/

⚠️ Note: gpt5-5-pro is in models.yml but not in any default suite — agent mode is blocked (codex rejects with ChatGPT account, opencode returns 0 tool calls). Don't add it to suites.
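After a baseline finishes, a quick file count catches silent failures early. A minimal sketch, assuming one result file per benchmark run as in the example output above:

```bash
# Expect hundreds of result files for a --full run (726 in the example above)
ls eval_results/baselines/0.3.14 | wc -l
```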
scripts/update_dashboard.sh <version>
Update website benchmark dashboard with new release data.
Usage:
.claude/skills/post-release/scripts/update_dashboard.sh 0.3.14
Output:
Updating dashboard for 0.3.14...
1/5 Generating Docusaurus markdown...
✓ Written to docs/docs/benchmarks/performance.md
2/5 Generating dashboard JSON with history...
✓ Written to docs/static/benchmarks/latest.json (history preserved)
3/5 Validating JSON...
✓ Version: 0.3.14
✓ Success rate: 0.627
4/5 Clearing Docusaurus cache...
✓ Cache cleared
5/5 Summary
✓ Dashboard updated for 0.3.14
✓ Markdown: docs/docs/benchmarks/performance.md
✓ JSON: docs/static/benchmarks/latest.json
Next steps:
1. Test locally: cd docs && npm start
2. Visit: http://localhost:3000/ailang/docs/benchmarks/performance
3. Verify timeline shows 0.3.14
4. Commit: git add docs/docs/benchmarks/performance.md docs/static/benchmarks/latest.json
5. Commit: git commit -m 'Update benchmark dashboard for 0.3.14'
6. Push: git push
What it does:
- Generates the Docusaurus markdown page (docs/docs/benchmarks/performance.md)
- Generates the dashboard JSON, preserving existing history (docs/static/benchmarks/latest.json)
- Validates the JSON (version and success rate)
- Clears the Docusaurus cache
scripts/extract_changelog_metrics.sh [json_file]
Extract benchmark metrics from dashboard JSON for CHANGELOG.
Usage:
.claude/skills/post-release/scripts/extract_changelog_metrics.sh
# Or specify JSON file:
.claude/skills/post-release/scripts/extract_changelog_metrics.sh docs/static/benchmarks/latest.json
Output:
Extracting metrics from docs/static/benchmarks/latest.json...
=== CHANGELOG.md Template ===
### Benchmark Results (M-EVAL)
**Overall Performance**: 59.1% success rate (399 total runs)
**By Language:**
- **AILANG**: 33.0% - New language, learning curve
- **Python**: 87.0% - Baseline for comparison
- **Gap**: 54.0 percentage points (expected for new language)
**Comparison**: -15.2% AILANG regression from 0.3.14 (48.2% → 33.0%)
=== End Template ===
Use this template in CHANGELOG.md for 0.3.15
What it does:
scripts/cleanup_design_docs.sh <version> [--dry-run] [--force] [--check-only]
Move design docs from planned/ to implemented/ after a release.
Features:
- Detects duplicates (already in implemented/) and misplaced docs (wrong version folder)
- --check-only reports issues without making changes
- --dry-run previews what would be moved/deleted/relocated
- --force moves all docs regardless of status
Usage:
# Check-only: Report issues without making changes
.claude/skills/post-release/scripts/cleanup_design_docs.sh 0.5.9 --check-only
# Preview what would be moved/deleted/relocated
.claude/skills/post-release/scripts/cleanup_design_docs.sh 0.5.9 --dry-run
# Execute: Move implemented docs, delete duplicates, relocate misplaced
.claude/skills/post-release/scripts/cleanup_design_docs.sh 0.5.9
# Force move all docs regardless of status
.claude/skills/post-release/scripts/cleanup_design_docs.sh 0.5.9 --force
Output:
Design Doc Cleanup for v0.5.9
==================================
Checking 5 design doc(s) in design_docs/planned/v0_5_9/:
Phase 1: Detecting issues...
[DUPLICATE] m-fix-if-else-let-block.md
Already exists in design_docs/implemented/v0_5_9/
[MISPLACED] m-codegen-value-types.md
Target: v0.5.10 (folder: v0_5_9)
Should be in: design_docs/planned/v0_5_10/
Issues found:
- 1 duplicate(s) (can be deleted)
- 1 misplaced doc(s) (wrong version folder)
Phase 2: Processing docs...
[DELETED] m-fix-if-else-let-block.md (duplicate - already in implemented/)
[RELOCATED] m-codegen-value-types.md → design_docs/planned/v0_5_10/
[MOVED] m-dx11-cycles.md → design_docs/implemented/v0_5_9/
[NEEDS REVIEW] m-unfinished-feature.md
Found: **Status**: Planned
Summary:
✓ Deleted 1 duplicate(s)
✓ Relocated 1 doc(s) to correct version folder
✓ Moved 1 doc(s) to design_docs/implemented/v0_5_9/
⚠ 1 doc(s) need review (not marked as Implemented)
What it does:
- Moves implemented docs from design_docs/planned/ to design_docs/implemented/
- Deletes duplicates and relocates misplaced docs to the correct version folder
- Flags docs not marked as Implemented for manual review
- --check-only to only report issues, --dry-run to preview actions, --force to move all

Before running any post-release tasks, verify the release exists:
git tag -l vX.X.X
gh release view vX.X.X
If the release doesn't exist, run the release-manager skill first.
🚨 CRITICAL: ALWAYS use --full for releases!
Correct workflow:
# ✅ RECOMMENDED for releases — adds 4-language Explorer data for ~$7 more
.claude/skills/post-release/scripts/run_eval_baseline.sh X.X.X --full --lang-harness
# Minimum acceptable for releases
.claude/skills/post-release/scripts/run_eval_baseline.sh X.X.X --full
This runs all 6 production models (extended_suite) with both AILANG and Python.
Tier scope for releases:
- `--tier core,stretch` (default) — 33 benchmarks (22 core + 11 stretch)
- `--tier core` — 22 benchmarks (Core is the headline metric)
- `--tier smoke,core,stretch` — 48 benchmarks (include smoke for sanity)
- vision benchmarks are research-grade and excluded by default — opt in explicitly with `--tier vision` if you want to publish those numbers.

Override tier via the script's --tier flag (see run_eval_baseline.sh --help). If unsure, the default is tuned to produce a release-ready baseline in ~30-60 minutes.
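For example, a Core-only full baseline using the documented --tier override:

```bash
# 22 core benchmarks only (the headline metric), all production models
.claude/skills/post-release/scripts/run_eval_baseline.sh X.X.X --full --tier core
```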
Cost & time (default tier core,stretch = 33 benchmarks):

| Mode | Cost | Time | Use for |
|---|---|---|---|
| `--full` | ~$16 | ~30-40 min | Standard release |
| `--full --lang-harness` | ~$23 | ~45-60 min | Recommended — adds 4-lang Explorer data |
| `--full --cross-harness` | ~$47 | ~45-60 min | Major releases (vX.0, quarterly) |
| dev (no flags) | ~$3.50 | ~10-15 min | Quick validation only — never for releases |
❌ WRONG workflow (what happened with v0.3.22):
# DON'T do this for releases!
.claude/skills/post-release/scripts/run_eval_baseline.sh X.X.X # Missing --full!
# Then try to add production models later with --skip-existing
# Result: Confusion, multiple processes, incomplete baseline
If baseline times out or is interrupted:
# Resume with ALL 6 models (maintains --full semantics)
ailang eval-suite --full --langs python,ailang --parallel 5 \
--output eval_results/baselines/X.X.X --skip-existing
The --skip-existing flag skips benchmarks that already have result files, allowing resumption of interrupted runs. But ONLY use this for recovery, not as a strategy to "add more models later".
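Before resuming, count what already exists so you can confirm afterwards that only the gaps were filled. A sketch, assuming one JSON result file per run:

```bash
# Files already present; --skip-existing leaves these untouched
ls eval_results/baselines/X.X.X/*.json 2>/dev/null | wc -l
```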
Use the automation script:
.claude/skills/post-release/scripts/update_dashboard.sh X.X.X
IMPORTANT: This script automatically regenerates the dashboard markdown and JSON (preserving history), validates the JSON, and clears the Docusaurus cache. No manual file copying is needed.
Test locally (optional but recommended):
cd docs && npm start
# Visit: http://localhost:3000/ailang/docs/benchmarks/performance
Verify:
- The timeline shows the new version
- The reported success rate matches the baseline results
Commit dashboard updates:
git add docs/docs/benchmarks/performance.md docs/static/benchmarks/latest.json
git commit -m "Update benchmark dashboard for vX.X.X"
git push
Review and update the axiom scorecard:
# View current scorecard
ailang axioms
# The scorecard is at docs/static/benchmarks/axiom_scorecard.json
# Update scores if features were added/improved that affect axiom compliance
When to update scores: only when the release adds or improves features that affect axiom compliance (e.g. capability budgets improving A9, as in the history example below).
Update history entry:
{
"version": "vX.X.X",
"date": "YYYY-MM-DD",
"score": 18,
"maxScore": 24,
"percentage": 75.0,
"notes": "Added capability budgets (A9 +1)"
}
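To append that entry from the shell, a hypothetical jq one-liner; it assumes the scorecard JSON has a top-level "history" array, so inspect the file to confirm the schema before using it:

```bash
# Append a history entry to the axiom scorecard (schema assumed; verify first)
jq '.history += [{
  "version": "vX.X.X", "date": "YYYY-MM-DD",
  "score": 18, "maxScore": 24, "percentage": 75.0,
  "notes": "Added capability budgets (A9 +1)"
}]' docs/static/benchmarks/axiom_scorecard.json > /tmp/scorecard.json \
  && mv /tmp/scorecard.json docs/static/benchmarks/axiom_scorecard.json
```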
Generate metrics template:
.claude/skills/post-release/scripts/extract_changelog_metrics.sh
This outputs a formatted template with:
- Overall success rate and total runs
- Per-language breakdown (AILANG vs Python) and the gap in percentage points
- Comparison against the previous release's baseline
Update changelog automatically:
- Write the entry in the current file under changelogs/ (find it with `ls changelogs/ | grep current`)
- CHANGELOG.md is an index file — do NOT write entries there
- For CHANGELOG template format, see resources/version_notes.md.
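A small helper to resolve the right file before editing, using only the grep convention above:

```bash
# Resolve the current changelog file; never write entries to CHANGELOG.md itself
CURRENT="changelogs/$(ls changelogs/ | grep current)"
echo "Write the release entry in: $CURRENT"
```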
v0.8.0+ (chain-based - recommended):
# Find the chain ID from the latest eval run
ailang eval-chains list
# View per-benchmark pass/fail with cost and turns
ailang eval-chains view <chain-id>
# Pass rate breakdown
ailang eval-chains stats <chain-id>
# Show failures with error details
ailang eval-chains failures <chain-id>
# Generate chain-based report
ailang eval-report --from-chain <chain-id> X.X.X --format=json
Legacy (file-based):
# Get KPIs (turns, tokens, cost by language)
.claude/skills/eval-analyzer/scripts/agent_kpis.sh eval_results/baselines/X.X.X
Target metrics: agent runs should stay within a 1.5x gap on Avg Turns and a 2.0x gap on Avg Tokens vs Python (e.g. if Python averages 2.0 turns and 10k tokens, AILANG should stay within 3.0 turns and 20k tokens).
For detailed agent analysis guide, see resources/version_notes.md.
After dashboard + changelog metrics, run the curation analysis primitives. These inform
what to keep, demote, or promote for the next release — the suite is curated, not
accumulated. See benchmarks/CURATION.md for the full philosophy.
# Tag coverage: which of the 12 canonical tags are thin or over-represented?
ailang eval-matrix eval_results/baselines/vX.X.X vX.X.X --by-tags
# AILANG-only wins: AILANG beats Python by ≥ 10pp — protect these from regressions
ailang eval-matrix eval_results/baselines/vX.X.X vX.X.X --ailang-wins
# Dual-mode saturation audit (RECOMMENDED for demotion decisions):
# Lists every benchmark's standard AND agent pass rate per language.
# Demote candidates require ≥95% on ALL 4 dimensions (std-AI, std-Py, agent-AI, agent-Py).
# Standard-only saturation is misleading: many benchmarks are 100/100 in standard
# but drop to 12-50% in agent mode — those are still valuable signal, KEEP IN CORE.
.claude/skills/benchmark-manager/scripts/audit_saturation.sh vX.X.X
# Built-in saturated check (uses topline successRate only — less reliable, prefer the script above):
ailang eval-matrix eval_results/baselines/vX.X.X vX.X.X --show-saturated
# Optional: compare this release's tag deltas against the previous baseline
ailang eval-report eval_results/baselines/vX.X.X vX.X.X --format=json
Record three things in the release notes (or the design doc retro):
- Tag coverage: which canonical tags are thin or over-represented (from --by-tags)
- AILANG-only wins to protect from regressions (from --ailang-wins)
- Saturated benchmarks that should move to stretch or be retired in the next sprint (from the saturation audit)

This step is cheap (seconds), has no external dependencies, and produces the input for the next release's benchmark-manager / eval-gap-finder work.
Step 1: Check for issues (duplicates, misplaced docs):
.claude/skills/post-release/scripts/cleanup_design_docs.sh X.X.X --check-only
Step 2: Preview all changes:
.claude/skills/post-release/scripts/cleanup_design_docs.sh X.X.X --dry-run
Step 3: Check any flagged docs:
- [DUPLICATE] - These will be deleted (already in implemented/)
- [MISPLACED] - These will be relocated to the correct version folder
- [NEEDS REVIEW] - Update **Status**: to Implemented if done, or leave for the next version

Step 4: Execute the cleanup:
.claude/skills/post-release/scripts/cleanup_design_docs.sh X.X.X
Step 5: Commit the changes:
git add design_docs/
git commit -m "docs: cleanup design docs for vX_Y_Z"
The script automatically detects duplicates, relocates misplaced docs, and flags docs that still need review (see the output above).

Update prompts and documentation:
- Update prompts/ with the latest AILANG syntax (ailang prompt)
- Update prompts/devtools/ with the latest toolchain docs (ailang devtools-prompt)
- Verify new commands appear: ailang devtools-prompt | grep "new-command"
- Update the website (docs/) with the latest features
- Update docs/guides/evaluation/ if there are significant benchmark improvements
- Test the examples in docs/LIMITATIONS.md:
# Test examples from LIMITATIONS.md
# Example: Test Y-combinator still fails (should fail)
echo 'let Y = \f. (\x. f(x(x)))(\x. f(x(x))) in Y' | ailang repl
# Example: Test named recursion works (should succeed)
ailang run examples/factorial.ail
# Example: Test polymorphic operator limitation (should panic with floats)
# Create test file and verify behavior matches documentation
git add docs/LIMITATIONS.md && git commit -m "Update LIMITATIONS.md for vX.X.X"

Run docs-sync to verify website accuracy:
# Check version constants are correct
.claude/skills/docs-sync/scripts/check_versions.sh
# Audit design docs vs website claims
.claude/skills/docs-sync/scripts/audit_design_docs.sh
# Generate full sync report
.claude/skills/docs-sync/scripts/generate_report.sh
What docs-sync checks:
- Version constants in docs/src/constants/version.js match the git tag
- PLANNED banners are removed for shipped features
- Examples match the released implementation

If issues are found, fix them, then commit the docs-sync fixes:
git add docs/
git commit -m "docs: sync website with vX.X.X implementation"
See docs-sync skill for full documentation.
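check_versions.sh is the authoritative check; for a quick manual spot-check of the version constant, something like this works (a sketch: the exact contents of version.js are an assumption):

```bash
# Does the docs version constant mention the latest tag?
TAG="$(git describe --tags --abbrev=0)"
grep -n "${TAG#v}" docs/src/constants/version.js \
  || echo "version.js does not mention ${TAG}; fix and re-run docs-sync"
```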
release-manager runs make brain-index-syntax-reset immediately after the
tag pushes. This step verifies that the reset actually populated the
brain with chunks tagged with the new release version — protects against
silent indexer failures that would leave Claude pulling stale snippets.
Spot-check ≥5 chunks reference the active version:
EXPECTED_VERSION="$(ailang prompt --version-active)"
COUNT="$(ailang cache search --namespace ailang-syntax --limit 5 "string" \
  | grep -c "version:${EXPECTED_VERSION}")"
[ "$COUNT" -ge 5 ] \
  || { echo "FAIL: only ${COUNT}/5 sampled chunks reference ${EXPECTED_VERSION}"; exit 1; }
Quick stat check:
ailang cache stats | grep -E "ailang-(syntax|builtins|examples)"
# Expect: ailang-syntax >= 50, ailang-builtins >= 250, ailang-examples >= 100
Append a one-line audit to release notes (or eval_results/baselines/<version>/notes.md):
μRAG corpus reindex verified: <ailang-syntax count>, <ailang-builtins count>, <ailang-examples count> chunks @ <version>.
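A sketch that builds the audit line from the stats output; it pastes the matching stats lines verbatim, so it works regardless of the exact column format (adjust the baselines path prefix if your directories use a leading v):

```bash
# Append the μRAG audit line to the baseline notes
VERSION="$(ailang prompt --version-active)"
COUNTS="$(ailang cache stats | grep -E 'ailang-(syntax|builtins|examples)' | tr '\n' ';')"
echo "μRAG corpus reindex verified: ${COUNTS} @ ${VERSION}" \
  >> "eval_results/baselines/${VERSION}/notes.md"
```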
If verification fails:
- Re-run make brain-index-syntax-reset from the project root.
- Confirm ailang prompt --version-active returns the new release tag.

See resources/post_release_checklist.md for the complete step-by-step checklist.
Prerequisites: ailang binary installed (for eval baseline)

Cause: Baseline run timed out or was interrupted
Solution: Use the --skip-existing flag to resume:
bin/ailang eval-suite --full --skip-existing --output eval_results/baselines/vX.X.X
Cause: Wrong JSON file (performance matrix vs dashboard JSON)
Solution: Use update_dashboard.sh script, not manual file copying
Cause: Stale build cache
Solution: Run cd docs && npm run clear && rm -rf .docusaurus build
Cause: Didn't run update_dashboard.sh with correct version
Solution: Re-run update_dashboard.sh X.X.X with correct version
This skill loads information progressively:
- scripts/ directory (automation)
- resources/post_release_checklist.md (detailed checklist)

Scripts execute without loading into the context window, saving tokens while providing powerful automation.
For historical improvements and lessons learned, see resources/version_notes.md.
Key points:
- --validate flag to check configuration before running
- Tier-based benchmark selection (--tier core,stretch default) — benchmarks are resolved from benchmarks/*.yml by tier, not a hardcoded list
- ailang eval-matrix --by-tags / --show-saturated / --ailang-wins in Step 5b
- --full flag for release baselines (all production models)
- --check-only to report issues without changes
- --dry-run to preview all actions
- --force to move all regardless of status