Pre-ingestion verification for epistemic quality in RAG systems. Ensures documents are properly qualified before entering knowledge bases.
Purpose: Pre-ingestion verification system that enforces epistemic quality before documents enter RAG knowledge bases. Produces Clarity-Gated Documents (CGD) compliant with the Clarity Gate Format Specification v2.1.
Core Question: "If another LLM reads this document, will it mistake assumptions for facts?"
Core Principle: "Detection finds what is; enforcement ensures what should be. In practice: find the missing uncertainty markers before they become confident hallucinations."
| Feature | Description |
|---|---|
| Claim Completion Status | PENDING/VERIFIED determined by field presence (no explicit status field) |
| Source Field Semantics | Actionable source (PENDING) vs. what-was-found (VERIFIED) |
| Claim ID Format Guidance | Hash-based IDs preferred, collision analysis for scale |
| Body Structure Requirements | HITL Verification Record section mandatory when claims exist |
| New Validation Codes | E-ST10, W-ST11, W-HC01, W-HC02, E-SC06 (FORMAT_SPEC); E-TB01-07 (SOT validation) |
| Bundled Scripts | claim_id.py and document_hash.py for deterministic computations |
This skill implements and references:
| Specification | Version | Location |
|---|---|---|
| Clarity Gate Format (Unified) | v2.1 | docs/CLARITY_GATE_FORMAT_SPEC.md |
Note: v2.0 unifies CGD and SOT into a single .cgd.md format. SOT is now a CGD with an optional tier: block.
Clarity Gate defines validation codes for structural and semantic checks per FORMAT_SPEC v2.1:
| Code | Check | Severity |
|---|---|---|
| W-HC01 | Partial confirmed-by/confirmed-date fields | WARNING |
| W-HC02 | Vague source (e.g., "industry reports", "TBD") | WARNING |
| E-SC06 | Schema error in hitl-claims structure | ERROR |
| Code | Check | Severity |
|---|---|---|
| E-ST10 | Missing ## HITL Verification Record when claims exist | ERROR |
| W-ST11 | Table rows don't match hitl-claims count | WARNING |
| Code | Check | Severity |
|---|---|---|
| E-TB01 | No ## Verified Claims section | ERROR |
| E-TB02 | Table has no data rows | ERROR |
| E-TB03 | Required columns missing | ERROR |
| E-TB04 | Column order wrong | ERROR |
| E-TB05 | Empty cell in required column | ERROR |
| E-TB06 | Invalid date format in Verified column | ERROR |
| E-TB07 | Verified date in future (beyond 24h grace) | ERROR |
Note: Additional validation codes may be defined in RFC-001 (clarification document) but are not part of the normative FORMAT_SPEC.
This skill includes Python scripts for deterministic computations per FORMAT_SPEC.
Computes stable, hash-based claim IDs for HITL tracking (per §1.3.4).
# Generate claim ID (single quotes keep POSIX shells from expanding $99)
python scripts/claim_id.py 'Base price is $99/mo' 'api-pricing/1'
# Output: claim-75fb137a
# Run test vectors
python scripts/claim_id.py --test
Algorithm:
Test vectors:
- claim_id("Base price is $99/mo", "api-pricing/1") → claim-75fb137a
- claim_id("The API supports GraphQL", "features/1") → claim-eb357742

Computes document SHA-256 hash per FORMAT_SPEC §2.2-2.4 with full canonicalization.
# Compute hash
python scripts/document_hash.py my-doc.cgd.md
# Output: 7d865e959b2466918c9863afca942d0fb89d7c9ac0c99bafc3749504ded97730
# Verify existing hash
python scripts/document_hash.py --verify my-doc.cgd.md
# Output: PASS: Hash verified: 7d865e...
# Run normalization tests
python scripts/document_hash.py --test
Algorithm (per §2.2-2.4):
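- Extract the content between the opening `---\n` and `<!-- CLARITY_GATE_END -->`
- Remove the `document-sha256` line from YAML frontmatter ONLY (with multiline continuation support)

Cross-platform normalization: the `canonicalize()` function applies trailing whitespace stripping, newline collapsing, and NFC normalization.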
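The hashing steps can be sketched roughly as follows. This is a minimal illustration, not the bundled `document_hash.py`: the exact span that is hashed, the CRLF handling, and the newline-collapsing rule (three or more newlines become one blank line) are assumptions, so its output should not be expected to match the script's.

```python
import hashlib
import re
import unicodedata

END_MARKER = "<!-- CLARITY_GATE_END -->"


def canonicalize(text: str) -> str:
    """Assumed normalization: NFC, LF newlines, no trailing whitespace."""
    text = unicodedata.normalize("NFC", text)
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    text = "\n".join(line.rstrip() for line in text.split("\n"))
    # Assumed collapsing rule: runs of 3+ newlines become a single blank line.
    return re.sub(r"\n{3,}", "\n\n", text)


def strip_hash_line(frontmatter: str) -> str:
    """Drop the document-sha256 key plus any indented continuation lines."""
    kept, skipping = [], False
    for line in frontmatter.split("\n"):
        if line.startswith("document-sha256:"):
            skipping = True
            continue
        if skipping and line[:1] in (" ", "\t"):
            continue  # continuation of a multiline hash value
        skipping = False
        kept.append(line)
    return "\n".join(kept)


def document_hash(doc: str) -> str:
    """Hash everything between the opening '---' line and the end marker."""
    doc = doc.replace("\r\n", "\n").replace("\r", "\n")
    start = doc.index("---\n") + len("---\n")
    end = doc.index(END_MARKER)
    fm_end = doc.index("\n---", start)  # closing frontmatter delimiter
    frontmatter, body = doc[start:fm_end], doc[fm_end:end]
    payload = canonicalize(strip_hash_line(frontmatter) + body)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()
```

Because the hash line is removed only from the frontmatter slice, a body line that happens to mention `document-sha256:` still contributes to the digest.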
Existing tools like UnScientify and HedgeHunter (CoNLL-2010) detect uncertainty markers already present in text ("Is uncertainty expressed?").
Clarity Gate enforces their presence where epistemically required ("Should uncertainty be expressed but isn't?").
| Tool Type | Question | Example |
|---|---|---|
| Detection | "Does this text contain hedges?" | UnScientify/HedgeHunter find "may", "possibly" |
| Enforcement | "Should this claim be hedged but isn't?" | Clarity Gate flags "Revenue will be $50M" |
Clarity Gate verifies FORM, not TRUTH.
This skill checks whether claims are properly marked as uncertain—it cannot verify if claims are actually true.
Risk: An LLM can hallucinate facts INTO a document, then "pass" Clarity Gate by adding source markers to false claims.
Solution: HITL (Human-In-The-Loop) verification is MANDATORY before declaring PASS.
The 9 Verification Points guide semantic review — content quality checks that require judgment (human or AI). They answer questions like "Should this claim be hedged?" and "Are these numbers consistent?"
When review completes, output a CGD file conforming to CLARITY_GATE_FORMAT_SPEC.md. The C/S rules in CLARITY_GATE_FORMAT_SPEC.md validate file structure, not semantic content.
The connection:
- Review findings determine the frontmatter state fields (clarity-status, hitl-status, hitl-pending-count)

Example: If Point 5 (Data Consistency) finds conflicting numbers, you'd mark clarity-status: UNCLEAR until resolved. Rule C7 then ensures you can't claim REVIEWED while still UNCLEAR.
1. HYPOTHESIS vs FACT LABELING: Every claim must be clearly marked as validated or hypothetical.
| Fails | Passes |
|---|---|
| "Our architecture outperforms competitors" | "Our architecture outperforms competitors [benchmark data in Table 3]" |
| "The model achieves 40% improvement" | "The model achieves 40% improvement [measured on dataset X]" |
Fix: Add markers: "PROJECTED:", "HYPOTHESIS:", "UNTESTED:", "(estimated)", "~", "?"
2. UNCERTAINTY MARKER ENFORCEMENT: Forward-looking statements require qualifiers.
| Fails | Passes |
|---|---|
| "Revenue will be $50M by Q4" | "Revenue is projected to be $50M by Q4" |
| "The feature will reduce churn" | "The feature is expected to reduce churn" |
Fix: Add "projected", "estimated", "expected", "designed to", "intended to"
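A first-pass automated screen for this point might look like the sketch below. The qualifier list is taken from the Fix line above; treating the word "will" as the forward-looking signal and the sentence-splitting regex are simplifying assumptions, and the function name is hypothetical. It is a heuristic to surface candidates for review, not a replacement for judgment.

```python
import re

# Qualifiers from the Fix line above; the detection heuristic is illustrative.
QUALIFIERS = ("projected", "estimated", "expected", "designed to", "intended to")
FORWARD_LOOKING = re.compile(r"\bwill\b", re.IGNORECASE)


def unhedged_forward_statements(text: str) -> list[str]:
    """Return sentences that use 'will' without any accepted qualifier."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    flagged = []
    for sentence in sentences:
        lowered = sentence.lower()
        if FORWARD_LOOKING.search(sentence) and not any(q in lowered for q in QUALIFIERS):
            flagged.append(sentence)
    return flagged


sample = (
    "Revenue will be $50M by Q4. "
    "Revenue is projected to be $50M by Q4. "
    "The feature will reduce churn."
)
for sentence in unhedged_forward_statements(sample):
    print("UNHEDGED:", sentence)
```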
3. ASSUMPTION VISIBILITY: Implicit assumptions that affect interpretation must be explicit.
| Fails | Passes |
|---|---|
| "The system scales linearly" | "The system scales linearly [assuming <1000 concurrent users]" |
| "Response time is 50ms" | "Response time is 50ms [under standard load conditions]" |
Fix: Add bracketed conditions: "[assuming X]", "[under conditions Y]", "[when Z]"
4. AUTHORITATIVE-LOOKING UNVALIDATED DATA: Tables with specific percentages and checkmarks look like measured data.
Red flag: Tables with specific numbers (89%, 95%, 100%) without sources
Fix: Add "(guess)", "(est.)", "?" to numbers. Add explicit warning: "PROJECTED VALUES - NOT MEASURED"
5. DATA CONSISTENCY: Scan for conflicting numbers, dates, or facts within the document.
Red flag: "500 users" in one section, "750 users" in another
Fix: Reconcile conflicts or explicitly note the discrepancy with explanation.
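One way to surface candidates like the "500 users" versus "750 users" red flag is to group every number by the noun that follows it. The sketch below assumes that simple "number + word" pairs are enough to catch the common case; the function name and regex are illustrative, and hits still need human review because the same noun can legitimately carry different counts in different contexts.

```python
import re
from collections import defaultdict

# "<number> <word>" pairs such as "500 users" or "1,200 requests".
COUNT = re.compile(r"\b(\d[\d,]*)\s+([a-z]+)\b", re.IGNORECASE)


def conflicting_counts(text: str) -> dict[str, set[int]]:
    """Map each noun to its distinct counts, keeping only nouns with 2+ values."""
    seen = defaultdict(set)
    for number, noun in COUNT.findall(text):
        seen[noun.lower()].add(int(number.replace(",", "")))
    return {noun: values for noun, values in seen.items() if len(values) > 1}


doc = "We onboarded 500 users in the pilot. Later sections cite 750 users."
print(conflicting_counts(doc))
```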
6. IMPLICIT CAUSATION: Claims that imply causation without evidence.
Red flag: "Shorter prompts improve response quality" (plausible but unproven)
Fix: Reframe as hypothesis: "Shorter prompts MAY improve response quality (hypothesis, not validated)"
7. FUTURE STATE AS PRESENT: Describing planned/hoped outcomes as if already achieved.
Red flag: "The system processes 10,000 requests per second" (when it hasn't been built)
Fix: Use future/conditional: "The system is DESIGNED TO process..." or "TARGET: 10,000 rps"
8. TEMPORAL COHERENCE: Document dates and timestamps must be internally consistent and plausible.
| Fails | Passes |
|---|---|
| "Last Updated: December 2024" (when current is 2026) | "Last Updated: January 2026" |
| v1.0.0 dated 2024-12-23, v1.1.0 dated 2024-12-20 | Versions in chronological order |
Sub-checks:
Fix: Update dates, add "as of [date]" qualifiers, flag stale claims
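The version-ordering case from the table above can be checked mechanically. The sketch below assumes the version history has already been extracted as (version, ISO date) pairs in release order; the function name is hypothetical.

```python
from datetime import date


def out_of_order_versions(history: list[tuple[str, str]]) -> list[tuple[str, str]]:
    """Return (earlier, later) version pairs where the later release is dated first.

    `history` holds (version, "YYYY-MM-DD") pairs in release order.
    """
    problems = []
    for (prev_version, prev_date), (next_version, next_date) in zip(history, history[1:]):
        if date.fromisoformat(next_date) < date.fromisoformat(prev_date):
            problems.append((prev_version, next_version))
    return problems


history = [("v1.0.0", "2024-12-23"), ("v1.1.0", "2024-12-20")]
print(out_of_order_versions(history))
```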
9. EXTERNALLY VERIFIABLE CLAIMS: Specific numbers that could be fact-checked should be flagged for verification.
| Type | Example | Risk |
|---|---|---|
| Pricing | "Costs ~$0.005 per call" | API pricing changes |
| Statistics | "Papers average 15-30 equations" | May be wildly off |
| Rates/ratios | "40% of researchers use X" | Needs citation |
| Competitor claims | "No competitor offers Y" | May be outdated |
Fix options:
Claim Extracted --> Does Source of Truth Exist?
                              |
              +---------------+---------------+
             YES                              NO
              |                               |
      Tier 1: Automated                 Tier 2: HITL
 Consistency & Verification        Two-Round Verification
              |                               |
        PASS / BLOCK        Round A → Round B → APPROVE / REJECT
A. Internal Consistency
B. External Verification (Extension Interface)
Round A: Derived Data Confirmation
Round B: True HITL Verification
When producing a Clarity-Gated Document, use this format per CLARITY_GATE_FORMAT_SPEC.md v2.1:
---
clarity-gate-version: 2.1
processed-date: 2026-01-12
processed-by: Claude + Human Review
clarity-status: CLEAR
hitl-status: REVIEWED
hitl-pending-count: 0
points-passed: 1-9
rag-ingestable: true # computed by validator - do not set manually
document-sha256: 7d865e959b2466918c9863afca942d0fb89d7c9ac0c99bafc3749504ded97730
hitl-claims:
  - id: claim-75fb137a
    text: "Revenue projection is $50M"
    value: "$50M"
    source: "Q3 planning doc"
    location: "revenue-projections/1"
    round: B
    confirmed-by: Francesco
    confirmed-date: 2026-01-12
---
# Document Title
[Document body with epistemic markers applied]
Claims like "Revenue will be $50M" become "Revenue is **projected** to be $50M *(unverified projection)*"
---
## HITL Verification Record
### Round A: Derived Data Confirmation
- Claim 1 (source) ✓
- Claim 2 (source) ✓
### Round B: True HITL Verification
| # | Claim | Status | Verified By | Date |
|---|-------|--------|-------------|------|
| 1 | [claim] | ✓ Confirmed | [name] | [date] |
<!-- CLARITY_GATE_END -->
Clarity Gate: CLEAR | REVIEWED
Required CGD Elements (per spec):
- `clarity-gate-version`: Tool version (no "v" prefix)
- `processed-date`: YYYY-MM-DD format
- `processed-by`: Processor name
- `clarity-status`: CLEAR or UNCLEAR
- `hitl-status`: PENDING, REVIEWED, or REVIEWED_WITH_EXCEPTIONS
- `hitl-pending-count`: Integer ≥ 0
- `points-passed`: e.g., 1-9 or 1-4,7,9
- `hitl-claims`: List of verified claims (may be empty [])
- End marker: `<!-- CLARITY_GATE_END -->` followed by the status line `Clarity Gate: <clarity-status> | <hitl-status>`
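The required-field rules above lend themselves to a small structural check. The sketch below assumes the frontmatter has already been parsed into a plain dict (for example by a YAML library); the function name and message wording are hypothetical and are not the spec's validation codes.

```python
import re

REQUIRED_FIELDS = (
    "clarity-gate-version", "processed-date", "processed-by", "clarity-status",
    "hitl-status", "hitl-pending-count", "points-passed", "hitl-claims",
)
HITL_STATUSES = {"PENDING", "REVIEWED", "REVIEWED_WITH_EXCEPTIONS"}


def frontmatter_problems(fm: dict) -> list[str]:
    """Return human-readable problems for a parsed frontmatter dict."""
    problems = [f"missing field: {name}" for name in REQUIRED_FIELDS if name not in fm]
    if str(fm.get("clarity-gate-version", "")).lower().startswith("v"):
        problems.append('clarity-gate-version must not have a "v" prefix')
    if "processed-date" in fm and not re.fullmatch(r"\d{4}-\d{2}-\d{2}", str(fm["processed-date"])):
        problems.append("processed-date must be YYYY-MM-DD")
    if "clarity-status" in fm and fm["clarity-status"] not in {"CLEAR", "UNCLEAR"}:
        problems.append("clarity-status must be CLEAR or UNCLEAR")
    if "hitl-status" in fm and fm["hitl-status"] not in HITL_STATUSES:
        problems.append("hitl-status has an unknown value")
    count = fm.get("hitl-pending-count", 0)
    if isinstance(count, bool) or not isinstance(count, int) or count < 0:
        problems.append("hitl-pending-count must be an integer >= 0")
    if "hitl-claims" in fm and not isinstance(fm["hitl-claims"], list):
        problems.append("hitl-claims must be a list (may be empty)")
    return problems


example = {
    "clarity-gate-version": 2.1,
    "processed-date": "2026-01-12",
    "processed-by": "Claude + Human Review",
    "clarity-status": "CLEAR",
    "hitl-status": "REVIEWED",
    "hitl-pending-count": 0,
    "points-passed": "1-9",
    "hitl-claims": [],
}
print(frontmatter_problems(example))
```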
Optional/Computed Fields:
- `rag-ingestable`: Computed by validators, not manually set. Shows true only when CLEAR | REVIEWED with no exclusion blocks.
- `document-sha256`: Required. 64-char lowercase hex hash for integrity verification. See spec §2 for computation rules.
- `exclusions-coverage`: Optional. Fraction of body inside exclusion blocks (0.0–1.0).

Escape Mechanism: To write about markers like *(estimated)* without triggering parsing, wrap in backticks: `*(estimated)*`
Claim verification status is determined by field presence, not an explicit status field:
| State | confirmed-by | confirmed-date | Meaning |
|---|---|---|---|
| PENDING | absent | absent | Awaiting human verification |
| VERIFIED | present | present | Human has confirmed |
| (invalid) | present | absent | W-HC01: partial fields |
| (invalid) | absent | present | W-HC01: partial fields |
Why no explicit status field? Field presence is self-enforcing—you can't accidentally set status without providing who/when.
The source field meaning changes based on claim state:
| State | source Contains | Example |
|---|---|---|
| PENDING | Where to verify (actionable) | "Check Q3 planning doc" |
| VERIFIED | What was found (evidence) | "Q3 planning doc, page 12" |
Vague source detection (W-HC02): Sources like "industry reports", "research", "TBD" trigger warnings.
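The field-presence rule and the vague-source warning can be combined into one small lint pass over a `hitl-claims` entry. In the sketch below the function name is hypothetical, empty values are treated as absent (an assumption), and the vague-source list contains only the examples named above rather than whatever list a real validator uses.

```python
# Only the examples named in the text; a real validator may use a longer list.
VAGUE_SOURCES = {"industry reports", "research", "tbd"}


def lint_claim(claim: dict) -> tuple[str, list[str]]:
    """Return (state, warning codes) for one hitl-claims entry."""
    has_by = bool(claim.get("confirmed-by"))
    has_date = bool(claim.get("confirmed-date"))
    warnings = []
    if has_by and has_date:
        state = "VERIFIED"
    elif not has_by and not has_date:
        state = "PENDING"
    else:
        state = "INVALID"
        warnings.append("W-HC01")  # only one of the two fields is present
    if str(claim.get("source", "")).strip().lower() in VAGUE_SOURCES:
        warnings.append("W-HC02")
    return state, warnings


print(lint_claim({"id": "claim-75fb137a", "source": "Check Q3 planning doc"}))
print(lint_claim({"id": "claim-1", "source": "TBD", "confirmed-by": "Francesco"}))
```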
General pattern: claim-[a-z0-9._-]{1,64} (alphanumeric, dots, underscores, hyphens)
| Approach | Pattern | Example | Use Case |
|---|---|---|---|
| Hash-based (preferred) | claim-[a-f0-9]{8,} | claim-75fb137a | Deterministic, collision-resistant |
| Sequential | claim-[0-9]+ | claim-1, claim-2 | Simple documents |
| Semantic | claim-[a-z0-9-]+ | claim-revenue-q3 | Human-friendly |
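The patterns in the table overlap (an all-digit ID of eight or more characters matches both the hash and sequential patterns), so any classifier has to pick a precedence. The sketch below checks sequential first, then hash, then semantic; that ordering and the function name are assumptions, not something the spec text above prescribes.

```python
import re

GENERAL = re.compile(r"claim-[a-z0-9._-]{1,64}")
# Checked in this order; the precedence is an assumption for illustration.
STYLES = [
    ("sequential", re.compile(r"claim-[0-9]+")),
    ("hash", re.compile(r"claim-[a-f0-9]{8,}")),
    ("semantic", re.compile(r"claim-[a-z0-9-]+")),
]


def classify_claim_id(claim_id: str) -> str:
    """Return 'invalid', one of the three styles, or 'other'."""
    if not GENERAL.fullmatch(claim_id):
        return "invalid"
    for name, pattern in STYLES:
        if pattern.fullmatch(claim_id):
            return name
    return "other"  # valid under the general pattern only (dots or underscores)


for cid in ("claim-75fb137a", "claim-1", "claim-revenue-q3", "Claim-1"):
    print(cid, "->", classify_claim_id(cid))
```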
Collision probability: At 1,000 claims with 8-char hex IDs: ~0.012%. For >1,000 claims, use 12+ hex characters.
Recommendation: Use hash-based IDs generated by scripts/claim_id.py for consistency and collision resistance.
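The collision figure quoted above is consistent with the standard birthday approximation, assuming IDs are uniformly distributed over the hex space. A short sketch for checking other document sizes (the function name is hypothetical):

```python
import math


def collision_probability(n_claims: int, hex_chars: int) -> float:
    """Birthday approximation: P ≈ 1 - exp(-n(n-1) / (2 * 16**hex_chars))."""
    space = 16 ** hex_chars
    # expm1 keeps precision when the probability is very small.
    return -math.expm1(-n_claims * (n_claims - 1) / (2 * space))


print(f"{collision_probability(1_000, 8):.4%}")   # about 0.0116%
print(f"{collision_probability(1_000, 12):.2e}")
```

Moving from 8 to 12 hex characters multiplies the ID space by 16^4 = 65,536, which is why the longer form is suggested for larger claim sets.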
When content cannot be resolved (no SME available, legacy prose, etc.), mark it as excluded rather than leaving it ambiguous:
<!-- CG-EXCLUSION:BEGIN id=auth-legacy-1 -->
Legacy authentication details that require SME review...
<!-- CG-EXCLUSION:END id=auth-legacy-1 -->
Rules:
- IDs must match `[A-Za-z0-9][A-Za-z0-9._-]{0,63}`
- Set `hitl-status: REVIEWED_WITH_EXCEPTIONS`
- Provide `exceptions-reason` and `exceptions-ids` in frontmatter

Important: Documents with exclusion blocks are not RAG-ingestable. They're rejected entirely (no partial ingestion).
See CLARITY_GATE_FORMAT_SPEC.md §4 for complete rules.
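A minimal sketch of how exclusion blocks might be located and how they feed the `rag-ingestable` decision. Measuring coverage in characters and counting the marker comments themselves are assumptions made here for illustration; spec §4 is the authority on the real rules, and the function names are hypothetical.

```python
import re

BLOCK = re.compile(
    r"<!-- CG-EXCLUSION:BEGIN id=(?P<id>[A-Za-z0-9][A-Za-z0-9._-]{0,63}) -->"
    r"(?P<body>.*?)"
    r"<!-- CG-EXCLUSION:END id=(?P=id) -->",  # END must repeat the BEGIN id
    re.DOTALL,
)


def exclusion_summary(body: str) -> dict:
    """Return the exclusion IDs found and the fraction of the body they cover."""
    matches = list(BLOCK.finditer(body))
    excluded_chars = sum(len(m.group(0)) for m in matches)
    coverage = excluded_chars / len(body) if body else 0.0
    return {"ids": [m.group("id") for m in matches], "coverage": coverage}


def rag_ingestable(clarity_status: str, hitl_status: str, body: str) -> bool:
    """True only for CLEAR | REVIEWED documents with no exclusion blocks."""
    return (
        clarity_status == "CLEAR"
        and hitl_status == "REVIEWED"
        and not exclusion_summary(body)["ids"]
    )


body = (
    "Intro paragraph.\n"
    "<!-- CG-EXCLUSION:BEGIN id=auth-legacy-1 -->\n"
    "Legacy authentication details that require SME review...\n"
    "<!-- CG-EXCLUSION:END id=auth-legacy-1 -->\n"
    "Outro.\n"
)
print(exclusion_summary(body)["ids"], rag_ingestable("CLEAR", "REVIEWED", body))
```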
When validating a Source of Truth file, the skill checks both format compliance (per CLARITY_GATE_FORMAT_SPEC.md) and content quality (the 9 points).
SOT documents are CGDs with a tier: block. They require a ## Verified Claims section with a valid table.
| Code | Check | Severity |
|---|---|---|
| E-TB01 | No ## Verified Claims section | ERROR |
| E-TB02 | Table has no data rows | ERROR |
| E-TB03 | Required columns missing (Claim, Value, Source, Verified) | ERROR |
| E-TB04 | Column order wrong (Claim not first or Verified not last) | ERROR |
| E-TB05 | Empty cell in required column | ERROR |
| E-TB06 | Invalid date format in Verified column | ERROR |
| E-TB07 | Verified date in future (beyond 24h grace) | ERROR |
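The table checks above could be approximated as follows. This is a sketch under stated assumptions: the Verified column is taken to hold a bare `YYYY-MM-DD` date, the 24h grace is measured from a caller-supplied `now`, and the Markdown table parsing is deliberately simple. The function names are hypothetical and this is not the skill's actual validator.

```python
import re
from datetime import datetime, timedelta

REQUIRED_COLUMNS = ["Claim", "Value", "Source", "Verified"]


def split_row(line: str) -> list[str]:
    """Split one Markdown table row into stripped cell values."""
    return [cell.strip() for cell in line.strip().strip("|").split("|")]


def validate_verified_claims(doc: str, now: datetime) -> list[str]:
    """Return the sorted E-TB codes triggered by the Verified Claims table."""
    section = re.search(
        r"^## Verified Claims[ \t]*$(.*?)(?=^## |\Z)", doc, re.MULTILINE | re.DOTALL
    )
    if section is None:
        return ["E-TB01"]
    rows = [split_row(l) for l in section.group(1).splitlines() if l.strip().startswith("|")]
    # Drop the |---|---| separator row.
    rows = [r for r in rows if not all(re.fullmatch(r":?-+:?", c) for c in r)]
    if len(rows) < 2:
        return ["E-TB02"]  # header only, or no table at all
    header, data = rows[0], rows[1:]
    if any(col not in header for col in REQUIRED_COLUMNS):
        return ["E-TB03"]  # cannot check cells without the required columns
    errors = set()
    if header[0] != "Claim" or header[-1] != "Verified":
        errors.add("E-TB04")
    index = {col: header.index(col) for col in REQUIRED_COLUMNS}
    for row in data:
        cells = {col: (row[i] if i < len(row) else "") for col, i in index.items()}
        if any(not value for value in cells.values()):
            errors.add("E-TB05")
        verified = cells["Verified"]
        if verified:
            try:
                verified_at = datetime.strptime(verified, "%Y-%m-%d")
            except ValueError:
                errors.add("E-TB06")
            else:
                if verified_at > now + timedelta(hours=24):
                    errors.add("E-TB07")
    return sorted(errors)
```

Passing `now` explicitly keeps the future-date check deterministic, which makes the function easy to exercise with fixed dates.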
The 9 Verification Points apply to SOT content:
| Point | SOT Application |
|---|---|
| 1-4 | Check claims in ## Verified Claims are actually verified |
| 5 | Check for conflicting values across tables |
| 6 | Check claims don't imply unsupported causation |
| 7 | Check table doesn't state futures as present |
| 8 | Check dates are chronologically consistent |
| 9 | Flag specific numbers for external check |
- `tier:` block containing level, owner, version, promoted-date, promoted-by
- `## Verified Claims` section with columns: Claim, Value, Source, Verified
- `[STABLE]`, `[CHECK]`, `[VOLATILE]`, `[SNAPSHOT]` markers in content:
  - `[STABLE]`: Safe to cite without rechecking
  - `[CHECK]`: Verify before citing
  - `[VOLATILE]`: Changes frequently; always verify
  - `[SNAPSHOT]`: Point-in-time data; include date when citing

After running Clarity Gate, report:
## Clarity Gate Results
**Document:** [filename]
**Issues Found:** [number]
### Critical (will cause hallucination)
- [issue + location + fix]
### Warning (could cause equivocation)
- [issue + location + fix]
### Temporal (date/time issues)
- [issue + location + fix]
### Externally Verifiable Claims
| # | Claim | Type | Suggested Verification |
|---|-------|------|------------------------|
| 1 | [claim] | Pricing | [where to verify] |
---
## Round A: Derived Data Confirmation
- [claim] ([source])
Reply "confirmed" or flag any I misread.
---
## Round B: HITL Verification Required
| # | Claim | Why HITL Needed | Human Confirms |
|---|-------|-----------------|----------------|
| 1 | [claim] | [reason] | [ ] True / [ ] False |
---
**Would you like me to produce an annotated CGD version?**
---
**Verdict:** PENDING CONFIRMATION
| Level | Definition | Action |
|---|---|---|
| CRITICAL | LLM will likely treat hypothesis as fact | Must fix before use |
| WARNING | LLM might misinterpret | Should fix |
| TEMPORAL | Date/time inconsistency detected | Verify and update |
| VERIFIABLE | Specific claim that could be fact-checked | Route to HITL or external search |
| ROUND A | Derived from witnessed source | Quick confirmation |
| ROUND B | Requires true verification | Cannot pass without confirmation |
| PASS | Clearly marked, no ambiguity, verified | No action needed |
| Pattern | Action |
|---|---|
| Specific percentages (89%, 73%) | Add source or mark as estimate |
| Comparison tables | Add "PROJECTED" header |
| "Achieves", "delivers", "provides" | Use "designed to", "intended to" if not validated |
| Checkmarks | Verify these are confirmed |
| "100%" anything | Almost always needs qualification |
| "Last Updated: [date]" | Check against current date |
| Version numbers with dates | Verify chronological order |
| "$X.XX" or "~$X" (pricing) | Flag for external verification |
| "averages", "typically" | Flag for source/citation |
| Competitor capability claims | Flag for external verification |
| Project | Purpose | URL |
|---|---|---|
| Source of Truth Creator | Create epistemically calibrated docs | github.com/frmoretto/source-of-truth-creator |
| Stream Coding | Documentation-first methodology | github.com/frmoretto/stream-coding |
| ArXiParse | Scientific paper verification | arxiparse.org |
- document_hash.py now implements full FORMAT_SPEC §2.1-2.4 compliance
- canonicalize() function: trailing whitespace stripping, newline collapsing, NFC normalization
- document-sha256 removal with multiline continuation support (§2.2)
- claim_id.py, document_hash.py
- `<!-- CLARITY_GATE_END -->` markers
- hitl-claims format to v2.0 schema (id, text, value, source, location, round)
- (.cgd.md extension)

Version: 2.1.3 | Spec Version: 2.1 | Author: Francesco Marinoni Moretto | License: CC-BY-4.0