NewFlame, an assistant that learns and improves. Available on

review-paper

pedrohcgs/review-paper

Writing

117

1 installs

About

SKILL.md

review-paper

pedrohcgs/review-paper

Writing

117

1 installs

About

Comprehensive manuscript review covering argument structure, econometric specification, citation completeness, and potential referee objections

SKILL.md

name: review-paper description: Comprehensive manuscript review with three modes: single-pass (default), --adversarial critic-fixer loop, and --peer [journal] simulated peer-review pipeline (editor + 2 dispositioned referees + editorial decision, calibrated to a target journal). R&R continuation via --peer --r2/--r3; hostile-editor stress test via --peer --stress; reviewer-disposition variance reporting via --peer --variance N. Auto-invokes /review-r + /audit-reproducibility on referenced scripts unless --no-cross-artifact. argument-hint: "[paper path] [--adversarial | --peer [--r2 | --r3 | --stress | --variance N] [--no-novelty-check]] [--no-cross-artifact]" allowed-tools: ["Read", "Grep", "Glob", "Write", "Edit", "Bash", "Task"]

Manuscript Review

Produce a thorough, constructive review of an academic manuscript — the kind of report a top-journal referee would write.

Which review skill do I want?

/review-paper (this skill) — single comprehensive report, optional --adversarial critic-fixer loop, or --peer <journal> simulated peer-review pipeline. Best for most drafts.

/seven-pass-review — seven independent lenses in parallel (abstract, intro, methods, results, robustness, prose, citations) then synthesized. Heavier (7× token cost). Best for submission-ready drafts or R&R stage where you need maximum coverage.

/respond-to-referees — if you already have referee comments and need a response document, not another review.

/slide-excellence — for lecture slides, not papers.

Input: $ARGUMENTS — path to a paper (.tex, .pdf, or .qmd), or a filename in master_supporting_docs/. Optional flags:

--adversarial — critic-fixer loop (max 5 rounds).
--peer <JOURNAL> — simulated peer review pipeline calibrated to <JOURNAL> (see .claude/references/journal-profiles.md for available short names).
--r2 / --r3 — R&R continuation mode (requires --peer). Reloads prior round, classifies concerns Resolved / Partial / Not addressed.
--stress — hostile-editor stress test (requires --peer). Forces SKEPTIC dispositions, doubles critical peeves.
--variance (followed by integer N, default 3) — reviewer-disposition variance mode (requires --peer). Runs N referees with independently sampled dispositions from the 6-way taxonomy. Editor aggregates into a decision distribution, not a point estimate. Mutually exclusive with --stress and --r2/--r3.
--no-novelty-check — skip editor's WebSearch novelty probe (default is ON).
--no-cross-artifact — skip auto-invocation of /review-r + /audit-reproducibility on referenced scripts.

Already received referee comments? Use /respond-to-referees instead. That skill cross-references each referee concern against the revised manuscript and drafts a complete response document.

Modes

Default mode (single-pass)

One comprehensive review report. Fast, low token cost, suitable for early drafts where the author wants feedback and will iterate manually.

Adversarial mode (`--adversarial`)

Iterative critic-fixer loop modeled on /qa-quarto. The critic identifies issues, the fixer proposes and applies edits (with user approval), and the critic re-audits. Loops until APPROVED or max 5 rounds.

Use when: preparing a pre-submission draft, responding to a journal-desk rejection with substantive revisions, or after your own major rewrite. Costs more tokens but produces a manuscript the critic has signed off on.

Peer-review mode (`--peer <JOURNAL>`)

Simulated editorial pipeline: editor desk review → referee selection → 2 blind referees with different dispositions → editorial synthesis. Calibrated to a target journal from .claude/references/journal-profiles.md. Use when: pre-submission dress rehearsal, choosing between target journals, R&R planning.

This mode is materially different from --adversarial: adversarial runs the same critic 5× with fresh context; --peer runs different personas (editor + 2 dispositioned referees drawn from 6-way taxonomy: STRUCTURAL / CREDIBILITY / MEASUREMENT / POLICY / THEORY / SKEPTIC) whose priors are deliberately different and who are blind to each other.

Agents used (all reimplemented in this template; adapted from Hugo Sant'Anna's clo-author with permission):

.claude/agents/editor.md — editor (desk review, referee selection, synthesis).
.claude/agents/domain-referee.md — substance referee.
.claude/agents/methods-referee.md — methodology referee (paper-type-aware).

Sub-flags:

--r2 / --r3 — R&R mode. Skips fresh desk review; reloads prior round's reports; same referees + dispositions + peeves; classifies each prior concern as Resolved / Partial / Not addressed. Hard cap at --r3 (no round 4+).
--stress — Hostile editor. Forces both referees to SKEPTIC disposition, doubles critical peeves, framing: "you are looking for reasons to reject this paper." Output is a concern-list gauntlet, not a decision letter.
--variance (with integer N, default 3) — Reviewer-disposition variance mode. Runs N referees with independently sampled dispositions from the 6-way taxonomy (STRUCTURAL / CREDIBILITY / MEASUREMENT / POLICY / THEORY / SKEPTIC). Editor synthesizes into a distribution of decisions, not a single verdict. See "Variance mode" below.
--no-novelty-check — Disables the editor's WebSearch novelty probes (default is ON). Use in offline or hallucination-sensitive contexts. Novelty-check caveat (document this to users): WebSearch can return hallucinated citations or miss paywalled recent work. Always surface novelty-probe results as flags for manual verification, not verdicts.

Variance mode (`--peer --variance N`)

Why this mode exists. Default --peer runs an editor + 2 referees with dispositions sampled once. A single peer-review pass is a point estimate of how the paper would fare — but the AgentReview ACL 2024 study (arXiv:2406.12708) found that ~37% of paper decisions vary purely from reviewer-disposition sampling and another 27.7% from partial author-identity disclosure. A point estimate hides this variance.

Variance mode runs N independent referees (default N=3, max N=5 for token-cost discipline) with disposition sampling, then reports a decision distribution that surfaces this variance to the author.

How it works:

Editor performs desk review once (shared across the N referees).
The editor samples N dispositions from the 6-way taxonomy with replacement. Stratification rule: if N ≥ 3, at least one SKEPTIC is always sampled (avoids drawing N friendly referees by chance).
Each of the N referees runs in an isolated context (Task with context: fork) — same manuscript, same paper-type rubric, different disposition. Referees are blind to each other.
Editor receives N independent reports and produces:
- A decision-distribution table (e.g., 2/3 R&R, 1/3 Reject with the modal verdict highlighted).
- A concern-frequency table showing which concerns appeared across multiple referees (high frequency = robust criticism; low frequency = disposition-dependent).
- An editorial recommendation that explicitly references the variance ("modal verdict R&R, with one SKEPTIC dissent on identification — author should address the identification concern even though it's not the majority position").

Output files:

quality_reports/peer_review_<paper>/referee_1.md … referee_N.md (per-referee reports)
quality_reports/peer_review_<paper>/decision_distribution.md (aggregate table + concern-frequency analysis)
quality_reports/peer_review_<paper>/editor_synthesis.md (final editorial letter)

Cost discipline. Variance mode multiplies referee-tier cost by N relative to default --peer (which runs 2 referees). The Cost-Conscious Composition section of the workflow guide recommends keeping referees on Sonnet (mid-tier) for variance runs and reserving Opus for the editor synthesis. Hard cap at N=5; for higher variance estimates, run --variance 5 twice and combine offline.

Mutual exclusivity. Variance mode cannot combine with --stress (which forces SKEPTIC×2 and would defeat the sampling purpose) or --r2/--r3 (which reuses prior-round dispositions for continuity). The skill halts with an error if mutually-exclusive flags are combined.

When to reach for it:

Pre-submission dress rehearsal where you want to know not just "will this paper survive review" but "how confidently will it survive."
Deciding between target journals — run --variance 3 against two journal profiles, compare distributions.
Responding to a rejection where the referee panel felt unrepresentative — --variance 5 against the same journal profile gives an empirical sense of whether the original referees were typical.

Steps (both modes)

Locate and read the manuscript. First strip flags (--adversarial, --no-cross-artifact) from $ARGUMENTS to get the bare manuscript path. Check:
- Direct path (bare path from step 1)
- master_supporting_docs/supporting_papers/$ARGUMENTS
- Glob for partial matches
Read the full paper end-to-end with the Read tool — a 1M-token window holds a full paper. For long PDFs, page through with the pages parameter (up to 20 pages per request).
Evaluate across 6 dimensions (see below).
Generate 3–5 "referee objections" — the tough questions a top referee would ask.
Produce the review report.
Save to quality_reports/paper_review_[sanitized_name]_round[N].md (N=1 in default mode; N increments in adversarial mode).

6b. Cross-artifact integration. Unless $ARGUMENTS contains --no-cross-artifact, and if the manuscript references analysis scripts (detected via \input{scripts/...}, %% source: comments, or matching scripts/R/_outputs/ filenames), auto-invoke:

/review-r on each referenced script (forked subagent, results to quality_reports/cross_artifact_[paper]/review_r_*.md)
/audit-reproducibility on the manuscript + outputs dir (results to quality_reports/cross_artifact_[paper]/reproducibility.md)

Merge critical cross-artifact findings (code bug invalidates paper claim, reproducibility FAIL) into a new "Cross-Artifact Findings" section at the top of the paper review report. See .claude/rules/cross-artifact-review.md for the full protocol.

If --adversarial is in $ARGUMENTS: invoke the critic-fixer loop defined in the next section. Otherwise stop here.

Review Dimensions

1. Argument Structure

Is the research question clearly stated?
Does the introduction motivate the question effectively?
Is the logical flow sound (question → method → results → conclusion)?
Are the conclusions supported by the evidence?
Are limitations acknowledged?

2. Identification Strategy

Is the causal claim credible?
What are the key identifying assumptions? Are they stated explicitly?
Are there threats to identification (omitted variables, reverse causality, measurement error)?
Are robustness checks adequate?
Is the estimator appropriate for the research design?

3. Econometric Specification

Correct standard errors (clustered? robust? bootstrap?)?
Appropriate functional form?
Sample selection issues?
Multiple testing concerns?
Are point estimates economically meaningful (not just statistically significant)?

4. Literature Positioning

Are the key papers cited?
Is prior work characterized accurately?
Is the contribution clearly differentiated from existing work?
Any missing citations that a referee would flag?

5. Writing Quality

Clarity and concision
Academic tone
Consistent notation throughout
Abstract effectively summarizes the paper
Tables and figures are self-contained (clear labels, notes, sources)

6. Presentation

Are tables and figures well-designed?
Is notation consistent throughout?
Are there any typos, grammatical errors, or formatting issues?
Is the paper the right length for the contribution?

Output Format

# Manuscript Review: [Paper Title]

**Date:** [YYYY-MM-DD]
**Reviewer:** review-paper skill
**File:** [path to manuscript]

## Summary Assessment

**Overall recommendation:** [Strong Accept / Accept / Revise & Resubmit / Reject]

[2-3 paragraph summary: main contribution, strengths, and key concerns]

## Strengths

1. [Strength 1]
2. [Strength 2]
3. [Strength 3]

## Major Concerns

### MC1: [Title]
- **Dimension:** [Identification / Econometrics / Argument / Literature / Writing / Presentation]
- **Issue:** [Specific description]
- **Suggestion:** [How to address it]
- **Location:** [Section/page/table if applicable]

[Repeat for each major concern]

## Minor Concerns

### mc1: [Title]
- **Issue:** [Description]
- **Suggestion:** [Fix]

[Repeat]

## Referee Objections

These are the tough questions a top referee would likely raise:

### RO1: [Question]
**Why it matters:** [Why this could be fatal]
**How to address it:** [Suggested response or additional analysis]

[Repeat for 3-5 objections]

## Specific Comments

[Line-by-line or section-by-section comments, if any]

## Summary Statistics

| Dimension | Rating (1-5) |
|-----------|-------------|
| Argument Structure | [N] |
| Identification | [N] |
| Econometrics | [N] |
| Literature | [N] |
| Writing | [N] |
| Presentation | [N] |
| **Overall** | **[N]** |

Principles

Be constructive. Every criticism should come with a suggestion.
Be specific. Reference exact sections, equations, tables.
Think like a referee at a top-5 journal. What would make them reject?
Distinguish fatal flaws from minor issues. Not everything is equally important.
Acknowledge what's done well. Good research deserves recognition.
Do NOT fabricate details. If you can't read a section clearly, say so.

Adversarial Mode — Critic-Fixer Loop

Only runs if --adversarial is in $ARGUMENTS.

Pattern adapted from /qa-quarto, which uses the same loop to iterate on slide quality. Papers get it now because the single-pass review leaves authors doing manual fix-and-resubmit cycles.

Flow

Phase 0: Pre-flight
  │
  ├─ Verify the manuscript compiles (xelatex / quarto render) if applicable
  ├─ Snapshot the pre-review version: git stash OR copy to a .review-backup/
  │
Phase 1: Critic audit (round N=1,2,3,...)
  │
  ├─ Run the default review above, producing a round-N report
  ├─ If the report has ZERO Major Concerns and ZERO Referee Objections
  │  rated "fatal":
  │     → VERDICT = APPROVED. Stop the loop. Write final summary.
  │  Else: continue.
  │
Phase 2: Fixer
  │
  ├─ For each Major Concern in the round-N report, produce a concrete
  │  proposed edit (diff or new text block).
  ├─ Present proposed edits to the user grouped by severity (Critical →
  │  Major → Minor). Ask for approval: "apply all", "apply critical+major
  │  only", "review each", or "abort".
  ├─ Apply approved edits with Edit / Edit tools.
  ├─ If the manuscript is a compile target (`.tex` / `.qmd`), re-compile
  │  and verify it still builds.
  │
Phase 3: Re-audit
  │
  └─ Spawn a FRESH-CONTEXT subagent (via Task, `subagent_type` set to
     general-purpose) to re-read the paper and produce a round-(N+1)
     report. Fresh context prevents anchoring bias — the new reviewer
     sees the edited paper, not the diff.
     → Jump back to Phase 1.

Iteration limits — loop-until-dry

Same loop-until-dry primitive as /qa-quarto (orchestrator-protocol.md): the critic returns FINDINGs in the shared schema (orchestration-schemas.md) and the loop converges when a round adds 0 new CRITICAL/MAJOR concerns (deduped on location+finding), not at a fixed count.

Convergence: APPROVED when a round produces zero Major Concerns and zero fatal Referee Objections.
Fallback cap: 5 rounds bounds a non-converging loop; after round 5, halt and list remaining concerns.
Two-strikes: if the same Concern label appears in rounds N and N+2, flag as "author disagreement" and let the user decide (keep-as-is with rationale vs. another fix attempt) — see summary-parity.md.
Budget escape: if cumulative token cost exceeds the spend cap (default ~500k — a spend ceiling, not a context-window limit, since each re-audit runs in fresh context), warn and let the user cap further rounds.

Stopping criteria

Condition	Action
Zero Major Concerns, zero fatal Referee Objections	APPROVED — final summary
Max 5 rounds reached	HALTED — list remaining concerns, user decides
User approves zero fixes in a round	HALTED — user signals "I disagree with this review"
Compile fails after applied fixes	ROLLED BACK to pre-round-N snapshot, report compile error, user decides

Final report

After the loop ends, write quality_reports/paper_review_[sanitized_name]_FINAL.md:

# Final Review: [Paper Title]

**Rounds:** N
**Verdict:** APPROVED | HALTED (max rounds) | HALTED (user override) | ROLLED BACK
**Token cost estimate:** ~XXk

## Round Summary
| Round | Major Concerns | Fatal Objections | Status |
|---|---|---|---|
| 1 | 7 | 2 | Fixed 5, deferred 2 |
| 2 | 3 | 1 | ...              |
| ... | ... | ... | ...         |
| N | 0 | 0 | APPROVED        |

## Changes Applied
[link to git diff between the pre-round-1 snapshot and HEAD]

## Remaining Concerns (if HALTED)
[list with severity + rationale]

## Next Steps
[recommended action: submit / one more pass / substantial revision]

When NOT to use adversarial mode

Early exploratory drafts (the loop forces premature polish on ideas still being shaped)
Papers you don't yet have compilable source for (can't verify edits)
When you'd rather get ONE opinion and decide for yourself (adversarial-mode enforces "critic signed off" semantics — that's sometimes the wrong frame)

`--peer [journal]` workflow detail

Phase 0: Cross-artifact pre-flight (runs BEFORE desk review in --peer mode)

Unless --no-cross-artifact is set, auto-invoke /audit-reproducibility on the manuscript + its outputs directory first. Any reproducibility FAIL becomes desk-reject-worthy evidence the editor can cite. See .claude/rules/cross-artifact-review.md.

Reports: quality_reports/cross_artifact_[paper]/reproducibility.md.

Novelty-probe Post-Flight (new in v1.7.0). The editor's novelty probe uses WebSearch to check whether the paper's contribution has been made before. WebSearch results can be hallucinated — fabricated prior work, misattributed findings, wrong years. Before the editor's desk review incorporates novelty-probe claims into its decision, those claims must pass Post-Flight Verification per .claude/rules/post-flight-verification.md:

The editor collects novelty-probe claims (e.g., "Smith 2022 already showed this exact result").
Spawn claim-verifier via Task with subagent_type=claim-verifier and context=fork, passing the claims + verification questions + candidate source URLs. Forked fresh context is the CoVe independence trick.
Only verified claims are allowed into the desk-review narrative. Unverified claims are surfaced separately as "editor could not verify — manual check recommended" rather than presented as established prior work.

Opt-out: --no-novelty-check already skips the probe entirely. If the probe runs, Post-Flight is mandatory.

Pre-Flight Report (required before Phase 1). This is the RUN_CONFIG echo from orchestrator-protocol.md — every interactive choice (journal, dispositions, peeve budget, N referees, cross-artifact/novelty toggles, round) is resolved before the forked editor/referees spawn, because a forked subagent cannot stop to ask. Output it so the user can verify inputs, and halt here on any unresolved required field (unknown journal, missing script) rather than mid-run:

## Pre-Flight Report — /review-paper --peer

**Manuscript:** [path] — [page count, last modified]
**Target journal:** [JOURNAL_SHORT] → [full name from `.claude/references/journal-profiles.md`]
**Journal profile loaded:** [yes/no; resolved from `.claude/references/journal-profiles.md`; key adjustments: e.g., "Identification 35 → 40"]
**Cross-artifact scripts found:** [list referenced .R / .py / .do files]
**Reproducibility status:** [PASS / FAIL from Phase 0] — [N of M claims within tolerance]
**Round:** [fresh / r2 / r3 / stress]

If the manuscript path doesn't exist, the target journal isn't in .claude/references/journal-profiles.md, or a cross-artifact script is missing, stop and surface the issue before proceeding.

Phase 1: Editor desk review

Spawn forked subagent editor with the manuscript path and --peer <JOURNAL> context. Editor:

Reads journal profile from .claude/references/journal-profiles.md → states "Calibrated to: [journal]".
Reads abstract + intro + methods overview + headline results.
Runs novelty probes (unless --no-novelty-check).
Either DESK REJECT (pipeline terminates with rejection letter) or SEND OUT.

Report: quality_reports/peer_review_[paper]/desk_review.md.

Phase 1b: Referee selection (inside editor)

Editor draws 2 DIFFERENT dispositions from journal's Referee-pool weights and assigns each referee 1 critical + 1 constructive peeve (stress mode: 2 critical + 1 constructive). Appended to desk_review.md.

Phase 2: Two parallel referees, blind to each other

Spawn in parallel:

Forked subagent domain-referee with disposition D1, peeves P1 → referee_domain.md.
Forked subagent methods-referee with disposition D2, peeves P2 → referee_methods.md.

Each referee must include "What would change my mind: [specific ask]" on every MAJOR concern.

Phase 3: Editor synthesis (reduce → judge, with the hallucination gate)

Read both referee reports. Reduce their FINDINGs, classify each MAJOR concern as FATAL / ADDRESSABLE / TASTE, and produce the editorial decision using the decision rule table in editor.md.

Post-judge hallucination gate (orchestration-schemas.md §4): the editor reduces the referees — it must not desk-reject or escalate on a CRITICAL reason neither referee raised. Any editor-introduced blocker that is not traceable to a referee finding is re-verified in a fresh claim-verifier fork or dropped to [JUDGE-HALLUCINATED] and the decision recomputed. (The editor may always downgrade or de-duplicate referee concerns.)

Report: quality_reports/peer_review_[paper]/editorial_decision.md.

Phase 4: Summary

Tell the user:

Final decision (Accept / Minor / Major / Reject / Desk Reject)
Token usage + wall-clock time
Paths to all 4 reports (desk_review, referee_domain, referee_methods, editorial_decision)

Output layout for `--peer` mode

quality_reports/
  peer_review_[sanitized_paper_name]/
    desk_review.md                       # Phase 1 + Phase 1b
    referee_domain.md                    # Phase 2 (parallel)
    referee_methods.md                   # Phase 2 (parallel)
    editorial_decision.md                # Phase 3
    (R&R rounds: desk_review_r2.md, referee_domain_r2.md, ...)
  cross_artifact_[sanitized_paper_name]/
    reproducibility.md                   # Phase 0
    review_r_*.md                        # Phase 0 (one per referenced script)

Field adaptation

The shipped journal-profiles.md covers 5 econ journals (AER, QJE, JPE, ECMA, ReStud). For other fields (finance, political science, biology, CS, etc.), copy templates/journal-profile-template.md into a new section of journal-profiles.md and fill in the schema. See the "Field adaptation" section at the end of journal-profiles.md for detailed guidance. The pipeline itself is field-agnostic; only the calibration data changes.

For non-econ paper types in methods-referee.md, extend the paper-type list (e.g., biology: observational / experimental / computational / review).

Cross-references

.claude/skills/audit-reproducibility/SKILL.md — numeric-claim verification (auto-invoked on referenced scripts).
.claude/skills/replication-package/SKILL.md — assemble the AEA DCAS deposit once the paper passes review.
.claude/skills/capture-environment/SKILL.md · .claude/skills/disclosure-check/SKILL.md — environment capture + restricted-data screening for the deposit.
.claude/skills/seven-pass-review/SKILL.md — heavier 7-lens pass for submission-ready drafts.

About

SKILL.md

About

Comprehensive manuscript review covering argument structure, econometric specification, citation completeness, and potential referee objections

SKILL.md

name: review-paper description: Comprehensive manuscript review with three modes: single-pass (default), --adversarial critic-fixer loop, and --peer [journal] simulated peer-review pipeline (editor + 2 dispositioned referees + editorial decision, calibrated to a target journal). R&R continuation via --peer --r2/--r3; hostile-editor stress test via --peer --stress; reviewer-disposition variance reporting via --peer --variance N. Auto-invokes /review-r + /audit-reproducibility on referenced scripts unless --no-cross-artifact. argument-hint: "[paper path] [--adversarial | --peer [--r2 | --r3 | --stress | --variance N] [--no-novelty-check]] [--no-cross-artifact]" allowed-tools: ["Read", "Grep", "Glob", "Write", "Edit", "Bash", "Task"]

Manuscript Review

Produce a thorough, constructive review of an academic manuscript — the kind of report a top-journal referee would write.

Which review skill do I want?

/review-paper (this skill) — single comprehensive report, optional --adversarial critic-fixer loop, or --peer <journal> simulated peer-review pipeline. Best for most drafts.

/seven-pass-review — seven independent lenses in parallel (abstract, intro, methods, results, robustness, prose, citations) then synthesized. Heavier (7× token cost). Best for submission-ready drafts or R&R stage where you need maximum coverage.

/respond-to-referees — if you already have referee comments and need a response document, not another review.

/slide-excellence — for lecture slides, not papers.

Input: $ARGUMENTS — path to a paper (.tex, .pdf, or .qmd), or a filename in master_supporting_docs/. Optional flags:

--adversarial — critic-fixer loop (max 5 rounds).
--peer <JOURNAL> — simulated peer review pipeline calibrated to <JOURNAL> (see .claude/references/journal-profiles.md for available short names).
--r2 / --r3 — R&R continuation mode (requires --peer). Reloads prior round, classifies concerns Resolved / Partial / Not addressed.
--stress — hostile-editor stress test (requires --peer). Forces SKEPTIC dispositions, doubles critical peeves.
--variance (followed by integer N, default 3) — reviewer-disposition variance mode (requires --peer). Runs N referees with independently sampled dispositions from the 6-way taxonomy. Editor aggregates into a decision distribution, not a point estimate. Mutually exclusive with --stress and --r2/--r3.
--no-novelty-check — skip editor's WebSearch novelty probe (default is ON).
--no-cross-artifact — skip auto-invocation of /review-r + /audit-reproducibility on referenced scripts.

Already received referee comments? Use /respond-to-referees instead. That skill cross-references each referee concern against the revised manuscript and drafts a complete response document.

Modes

Default mode (single-pass)

One comprehensive review report. Fast, low token cost, suitable for early drafts where the author wants feedback and will iterate manually.

Adversarial mode (`--adversarial`)

Peer-review mode (`--peer <JOURNAL>`)

Agents used (all reimplemented in this template; adapted from Hugo Sant'Anna's clo-author with permission):

.claude/agents/editor.md — editor (desk review, referee selection, synthesis).
.claude/agents/domain-referee.md — substance referee.
.claude/agents/methods-referee.md — methodology referee (paper-type-aware).

Sub-flags:

--r2 / --r3 — R&R mode. Skips fresh desk review; reloads prior round's reports; same referees + dispositions + peeves; classifies each prior concern as Resolved / Partial / Not addressed. Hard cap at --r3 (no round 4+).
--stress — Hostile editor. Forces both referees to SKEPTIC disposition, doubles critical peeves, framing: "you are looking for reasons to reject this paper." Output is a concern-list gauntlet, not a decision letter.
--variance (with integer N, default 3) — Reviewer-disposition variance mode. Runs N referees with independently sampled dispositions from the 6-way taxonomy (STRUCTURAL / CREDIBILITY / MEASUREMENT / POLICY / THEORY / SKEPTIC). Editor synthesizes into a distribution of decisions, not a single verdict. See "Variance mode" below.
--no-novelty-check — Disables the editor's WebSearch novelty probes (default is ON). Use in offline or hallucination-sensitive contexts. Novelty-check caveat (document this to users): WebSearch can return hallucinated citations or miss paywalled recent work. Always surface novelty-probe results as flags for manual verification, not verdicts.

Variance mode (`--peer --variance N`)

How it works:

Editor performs desk review once (shared across the N referees).
The editor samples N dispositions from the 6-way taxonomy with replacement. Stratification rule: if N ≥ 3, at least one SKEPTIC is always sampled (avoids drawing N friendly referees by chance).
Each of the N referees runs in an isolated context (Task with context: fork) — same manuscript, same paper-type rubric, different disposition. Referees are blind to each other.
Editor receives N independent reports and produces:
- A decision-distribution table (e.g., 2/3 R&R, 1/3 Reject with the modal verdict highlighted).
- A concern-frequency table showing which concerns appeared across multiple referees (high frequency = robust criticism; low frequency = disposition-dependent).
- An editorial recommendation that explicitly references the variance ("modal verdict R&R, with one SKEPTIC dissent on identification — author should address the identification concern even though it's not the majority position").

Output files:

quality_reports/peer_review_<paper>/referee_1.md … referee_N.md (per-referee reports)
quality_reports/peer_review_<paper>/decision_distribution.md (aggregate table + concern-frequency analysis)
quality_reports/peer_review_<paper>/editor_synthesis.md (final editorial letter)

When to reach for it:

Pre-submission dress rehearsal where you want to know not just "will this paper survive review" but "how confidently will it survive."
Deciding between target journals — run --variance 3 against two journal profiles, compare distributions.
Responding to a rejection where the referee panel felt unrepresentative — --variance 5 against the same journal profile gives an empirical sense of whether the original referees were typical.

Steps (both modes)

Locate and read the manuscript. First strip flags (--adversarial, --no-cross-artifact) from $ARGUMENTS to get the bare manuscript path. Check:
- Direct path (bare path from step 1)
- master_supporting_docs/supporting_papers/$ARGUMENTS
- Glob for partial matches
Read the full paper end-to-end with the Read tool — a 1M-token window holds a full paper. For long PDFs, page through with the pages parameter (up to 20 pages per request).
Evaluate across 6 dimensions (see below).
Generate 3–5 "referee objections" — the tough questions a top referee would ask.
Produce the review report.
Save to quality_reports/paper_review_[sanitized_name]_round[N].md (N=1 in default mode; N increments in adversarial mode).

/review-r on each referenced script (forked subagent, results to quality_reports/cross_artifact_[paper]/review_r_*.md)
/audit-reproducibility on the manuscript + outputs dir (results to quality_reports/cross_artifact_[paper]/reproducibility.md)

If --adversarial is in $ARGUMENTS: invoke the critic-fixer loop defined in the next section. Otherwise stop here.

Review Dimensions

1. Argument Structure

Is the research question clearly stated?
Does the introduction motivate the question effectively?
Is the logical flow sound (question → method → results → conclusion)?
Are the conclusions supported by the evidence?
Are limitations acknowledged?

2. Identification Strategy

Is the causal claim credible?
What are the key identifying assumptions? Are they stated explicitly?
Are there threats to identification (omitted variables, reverse causality, measurement error)?
Are robustness checks adequate?
Is the estimator appropriate for the research design?

3. Econometric Specification

Correct standard errors (clustered? robust? bootstrap?)?
Appropriate functional form?
Sample selection issues?
Multiple testing concerns?
Are point estimates economically meaningful (not just statistically significant)?

4. Literature Positioning

Are the key papers cited?
Is prior work characterized accurately?
Is the contribution clearly differentiated from existing work?
Any missing citations that a referee would flag?

5. Writing Quality

Clarity and concision
Academic tone
Consistent notation throughout
Abstract effectively summarizes the paper
Tables and figures are self-contained (clear labels, notes, sources)

6. Presentation

Are tables and figures well-designed?
Is notation consistent throughout?
Are there any typos, grammatical errors, or formatting issues?
Is the paper the right length for the contribution?

Output Format

# Manuscript Review: [Paper Title]

**Date:** [YYYY-MM-DD]
**Reviewer:** review-paper skill
**File:** [path to manuscript]

## Summary Assessment

**Overall recommendation:** [Strong Accept / Accept / Revise & Resubmit / Reject]

[2-3 paragraph summary: main contribution, strengths, and key concerns]

## Strengths

1. [Strength 1]
2. [Strength 2]
3. [Strength 3]

## Major Concerns

### MC1: [Title]
- **Dimension:** [Identification / Econometrics / Argument / Literature / Writing / Presentation]
- **Issue:** [Specific description]
- **Suggestion:** [How to address it]
- **Location:** [Section/page/table if applicable]

[Repeat for each major concern]

## Minor Concerns

### mc1: [Title]
- **Issue:** [Description]
- **Suggestion:** [Fix]

[Repeat]

## Referee Objections

These are the tough questions a top referee would likely raise:

### RO1: [Question]
**Why it matters:** [Why this could be fatal]
**How to address it:** [Suggested response or additional analysis]

[Repeat for 3-5 objections]

## Specific Comments

[Line-by-line or section-by-section comments, if any]

## Summary Statistics

| Dimension | Rating (1-5) |
|-----------|-------------|
| Argument Structure | [N] |
| Identification | [N] |
| Econometrics | [N] |
| Literature | [N] |
| Writing | [N] |
| Presentation | [N] |
| **Overall** | **[N]** |

Principles

Be constructive. Every criticism should come with a suggestion.
Be specific. Reference exact sections, equations, tables.
Think like a referee at a top-5 journal. What would make them reject?
Distinguish fatal flaws from minor issues. Not everything is equally important.
Acknowledge what's done well. Good research deserves recognition.
Do NOT fabricate details. If you can't read a section clearly, say so.

Adversarial Mode — Critic-Fixer Loop

Only runs if --adversarial is in $ARGUMENTS.

Pattern adapted from /qa-quarto, which uses the same loop to iterate on slide quality. Papers get it now because the single-pass review leaves authors doing manual fix-and-resubmit cycles.

Flow

Phase 0: Pre-flight
  │
  ├─ Verify the manuscript compiles (xelatex / quarto render) if applicable
  ├─ Snapshot the pre-review version: git stash OR copy to a .review-backup/
  │
Phase 1: Critic audit (round N=1,2,3,...)
  │
  ├─ Run the default review above, producing a round-N report
  ├─ If the report has ZERO Major Concerns and ZERO Referee Objections
  │  rated "fatal":
  │     → VERDICT = APPROVED. Stop the loop. Write final summary.
  │  Else: continue.
  │
Phase 2: Fixer
  │
  ├─ For each Major Concern in the round-N report, produce a concrete
  │  proposed edit (diff or new text block).
  ├─ Present proposed edits to the user grouped by severity (Critical →
  │  Major → Minor). Ask for approval: "apply all", "apply critical+major
  │  only", "review each", or "abort".
  ├─ Apply approved edits with Edit / Edit tools.
  ├─ If the manuscript is a compile target (`.tex` / `.qmd`), re-compile
  │  and verify it still builds.
  │
Phase 3: Re-audit
  │
  └─ Spawn a FRESH-CONTEXT subagent (via Task, `subagent_type` set to
     general-purpose) to re-read the paper and produce a round-(N+1)
     report. Fresh context prevents anchoring bias — the new reviewer
     sees the edited paper, not the diff.
     → Jump back to Phase 1.

Iteration limits — loop-until-dry

Convergence: APPROVED when a round produces zero Major Concerns and zero fatal Referee Objections.
Fallback cap: 5 rounds bounds a non-converging loop; after round 5, halt and list remaining concerns.
Two-strikes: if the same Concern label appears in rounds N and N+2, flag as "author disagreement" and let the user decide (keep-as-is with rationale vs. another fix attempt) — see summary-parity.md.
Budget escape: if cumulative token cost exceeds the spend cap (default ~500k — a spend ceiling, not a context-window limit, since each re-audit runs in fresh context), warn and let the user cap further rounds.

Stopping criteria

Condition	Action
Zero Major Concerns, zero fatal Referee Objections	APPROVED — final summary
Max 5 rounds reached	HALTED — list remaining concerns, user decides
User approves zero fixes in a round	HALTED — user signals "I disagree with this review"
Compile fails after applied fixes	ROLLED BACK to pre-round-N snapshot, report compile error, user decides

Final report

After the loop ends, write quality_reports/paper_review_[sanitized_name]_FINAL.md:

# Final Review: [Paper Title]

**Rounds:** N
**Verdict:** APPROVED | HALTED (max rounds) | HALTED (user override) | ROLLED BACK
**Token cost estimate:** ~XXk

## Round Summary
| Round | Major Concerns | Fatal Objections | Status |
|---|---|---|---|
| 1 | 7 | 2 | Fixed 5, deferred 2 |
| 2 | 3 | 1 | ...              |
| ... | ... | ... | ...         |
| N | 0 | 0 | APPROVED        |

## Changes Applied
[link to git diff between the pre-round-1 snapshot and HEAD]

## Remaining Concerns (if HALTED)
[list with severity + rationale]

## Next Steps
[recommended action: submit / one more pass / substantial revision]

When NOT to use adversarial mode

Early exploratory drafts (the loop forces premature polish on ideas still being shaped)
Papers you don't yet have compilable source for (can't verify edits)
When you'd rather get ONE opinion and decide for yourself (adversarial-mode enforces "critic signed off" semantics — that's sometimes the wrong frame)

`--peer [journal]` workflow detail

Phase 0: Cross-artifact pre-flight (runs BEFORE desk review in --peer mode)

Reports: quality_reports/cross_artifact_[paper]/reproducibility.md.

The editor collects novelty-probe claims (e.g., "Smith 2022 already showed this exact result").
Spawn claim-verifier via Task with subagent_type=claim-verifier and context=fork, passing the claims + verification questions + candidate source URLs. Forked fresh context is the CoVe independence trick.
Only verified claims are allowed into the desk-review narrative. Unverified claims are surfaced separately as "editor could not verify — manual check recommended" rather than presented as established prior work.

Opt-out: --no-novelty-check already skips the probe entirely. If the probe runs, Post-Flight is mandatory.

## Pre-Flight Report — /review-paper --peer

**Manuscript:** [path] — [page count, last modified]
**Target journal:** [JOURNAL_SHORT] → [full name from `.claude/references/journal-profiles.md`]
**Journal profile loaded:** [yes/no; resolved from `.claude/references/journal-profiles.md`; key adjustments: e.g., "Identification 35 → 40"]
**Cross-artifact scripts found:** [list referenced .R / .py / .do files]
**Reproducibility status:** [PASS / FAIL from Phase 0] — [N of M claims within tolerance]
**Round:** [fresh / r2 / r3 / stress]

If the manuscript path doesn't exist, the target journal isn't in .claude/references/journal-profiles.md, or a cross-artifact script is missing, stop and surface the issue before proceeding.

Phase 1: Editor desk review

Spawn forked subagent editor with the manuscript path and --peer <JOURNAL> context. Editor:

Reads journal profile from .claude/references/journal-profiles.md → states "Calibrated to: [journal]".
Reads abstract + intro + methods overview + headline results.
Runs novelty probes (unless --no-novelty-check).
Either DESK REJECT (pipeline terminates with rejection letter) or SEND OUT.

Report: quality_reports/peer_review_[paper]/desk_review.md.

Phase 1b: Referee selection (inside editor)

Phase 2: Two parallel referees, blind to each other

Spawn in parallel:

Forked subagent domain-referee with disposition D1, peeves P1 → referee_domain.md.
Forked subagent methods-referee with disposition D2, peeves P2 → referee_methods.md.

Each referee must include "What would change my mind: [specific ask]" on every MAJOR concern.

Phase 3: Editor synthesis (reduce → judge, with the hallucination gate)

Read both referee reports. Reduce their FINDINGs, classify each MAJOR concern as FATAL / ADDRESSABLE / TASTE, and produce the editorial decision using the decision rule table in editor.md.

Report: quality_reports/peer_review_[paper]/editorial_decision.md.

Phase 4: Summary

Tell the user:

Final decision (Accept / Minor / Major / Reject / Desk Reject)
Token usage + wall-clock time
Paths to all 4 reports (desk_review, referee_domain, referee_methods, editorial_decision)

Output layout for `--peer` mode

quality_reports/
  peer_review_[sanitized_paper_name]/
    desk_review.md                       # Phase 1 + Phase 1b
    referee_domain.md                    # Phase 2 (parallel)
    referee_methods.md                   # Phase 2 (parallel)
    editorial_decision.md                # Phase 3
    (R&R rounds: desk_review_r2.md, referee_domain_r2.md, ...)
  cross_artifact_[sanitized_paper_name]/
    reproducibility.md                   # Phase 0
    review_r_*.md                        # Phase 0 (one per referenced script)

Field adaptation

For non-econ paper types in methods-referee.md, extend the paper-type list (e.g., biology: observational / experimental / computational / review).

Cross-references

.claude/skills/audit-reproducibility/SKILL.md — numeric-claim verification (auto-invoked on referenced scripts).
.claude/skills/replication-package/SKILL.md — assemble the AEA DCAS deposit once the paper passes review.
.claude/skills/capture-environment/SKILL.md · .claude/skills/disclosure-check/SKILL.md — environment capture + restricted-data screening for the deposit.
.claude/skills/seven-pass-review/SKILL.md — heavier 7-lens pass for submission-ready drafts.

review-paper

About

SKILL.md

review-paper

About

SKILL.md

Manuscript Review

Modes

Default mode (single-pass)

Adversarial mode (--adversarial)

Peer-review mode (--peer <JOURNAL>)

Variance mode (--peer --variance N)

Steps (both modes)

Review Dimensions

1. Argument Structure

2. Identification Strategy

3. Econometric Specification

4. Literature Positioning

5. Writing Quality

6. Presentation

Output Format

Principles

Adversarial Mode — Critic-Fixer Loop

Flow

Iteration limits — loop-until-dry

Stopping criteria

Final report

When NOT to use adversarial mode

--peer [journal] workflow detail

Phase 0: Cross-artifact pre-flight (runs BEFORE desk review in --peer mode)

Phase 1: Editor desk review

Phase 1b: Referee selection (inside editor)

Phase 2: Two parallel referees, blind to each other

Phase 3: Editor synthesis (reduce → judge, with the hallucination gate)

Phase 4: Summary

Output layout for --peer mode

Field adaptation

Cross-references

About

SKILL.md

About

SKILL.md

Manuscript Review

Modes

Default mode (single-pass)

Adversarial mode (--adversarial)

Peer-review mode (--peer <JOURNAL>)

Variance mode (--peer --variance N)

Steps (both modes)

Review Dimensions

1. Argument Structure

2. Identification Strategy

3. Econometric Specification

4. Literature Positioning

5. Writing Quality

6. Presentation

Output Format

Principles

Adversarial Mode — Critic-Fixer Loop

Flow

Iteration limits — loop-until-dry

Stopping criteria

Final report

When NOT to use adversarial mode

--peer [journal] workflow detail

Phase 0: Cross-artifact pre-flight (runs BEFORE desk review in --peer mode)

Phase 1: Editor desk review

Phase 1b: Referee selection (inside editor)

Phase 2: Two parallel referees, blind to each other

Phase 3: Editor synthesis (reduce → judge, with the hallucination gate)

Phase 4: Summary

Output layout for --peer mode

Field adaptation

Cross-references

Adversarial mode (`--adversarial`)

Peer-review mode (`--peer <JOURNAL>`)

Variance mode (`--peer --variance N`)

`--peer [journal]` workflow detail

Output layout for `--peer` mode

Adversarial mode (`--adversarial`)

Peer-review mode (`--peer <JOURNAL>`)

Variance mode (`--peer --variance N`)

`--peer [journal]` workflow detail

Output layout for `--peer` mode