pdf-translator

nebu1eto/pdf-translator

Writing

1 installs

About

Translates PDF documents to any target language with layout preservation. Supports academic papers, books, manuals, and general documents...

SKILL.md

PDF Translation Skill

Translate PDF documents between any language pair with optimized support for academic papers and documents with complex layouts.

When to Use This Skill

Use this skill when:

User wants to translate a PDF document to another language
User has academic papers, research documents, or technical manuals to translate
User needs to preserve document structure (tables, headings, lists) during translation
User wants both Markdown and PDF output formats
User mentions translating documents with tables, figures, or complex layouts

Usage

/pdf-translator <pdf_path> [options]

Arguments

<pdf_path>: PDF file or directory containing PDFs

Options

Option	Description	Default
`--source-lang`	Source language code (auto for detection)	`auto`
`--target-lang`	Target language code	`ko`
`--output-format`	Output format (markdown/pdf/both)	`both`
`--output-dir`	Output directory	`./translated`
`--parallel`	Concurrent agents	`5`
`--dict`	Custom dictionary (JSON)	none
`--high-quality`	Prefer the runtime's stronger model for translation	`false`
`--academic`	Academic document mode	`false`
`--term-style`	Term annotation style (parenthesis/footnote/inline)	`parenthesis`
`--first-occurrence`	Annotate terms only on first occurrence	`true`
`--describe-images`	Add AI-generated image descriptions	`false`

Language Codes

ja (Japanese), en (English), ko (Korean), zh (Chinese), es (Spanish), fr (French), de (German), ru (Russian), ar (Arabic), he (Hebrew), or any ISO 639-1 code.

Examples

# Basic translation (English PDF to Korean)
/pdf-translator "/docs/manual.pdf"

# Japanese academic paper to Korean with terminology annotations
/pdf-translator "/papers/research.pdf" --source-lang ja --academic

# High-quality translation using the runtime's stronger model
/pdf-translator "/papers/important.pdf" --high-quality

# Academic mode with footnote-style term annotations
/pdf-translator "/papers/thesis.pdf" --academic --term-style footnote

# Batch translation of a directory
/pdf-translator "/docs/" --target-lang ko --parallel 10

# Markdown output only
/pdf-translator "/books/novel.pdf" --output-format markdown

# Arabic RTL document to English
/pdf-translator "/docs/arabic.pdf" --source-lang ar --target-lang en

Architecture

flowchart LR
    PDF[PDF] --> Extract[extract_to_markdown.py]
    Extract --> Source[source.md + images/]
    Source --> Split[split_markdown.py]
    Split --> Sections[sections/]
    Sections --> Translate[Translator Agents]
    Translate --> Merged[translated.md]
    Merged --> GenPDF[generate_pdf.py]
    GenPDF --> Output[translated.pdf]

Key Features:

Full document context preserved during translation
Clean Markdown intermediate format (human-readable, editable)
Section-based parallel translation for large documents
pandoc + weasyprint for high-quality PDF output

Execution Workflow

Phase 0: Environment Setup

bash scripts/setup_env.sh

This installs pandoc and creates .venv/ with Python dependencies (pymupdf, pdfplumber, weasyprint).

PYTHON=".venv/bin/python"

Phase 1: Extract PDF to Markdown

WORK_DIR="/tmp/pdf_translate_$(date +%s)"
$PYTHON scripts/extract_to_markdown.py \
  --pdf "{PDF_PATH}" \
  --output-dir "$WORK_DIR" \
  --source-lang en \
  --target-lang ko

Output:

$WORK_DIR/source.md - Original Markdown (preserves structure)
$WORK_DIR/images/ - Extracted images
$WORK_DIR/metadata.json - Document metadata

Phase 2: Split Markdown (if needed)

For large documents (>6000 tokens), split into sections:

$PYTHON scripts/split_markdown.py \
  --input "$WORK_DIR/source.md" \
  --output-dir "$WORK_DIR/sections" \
  --max-tokens 6000

Output:

$WORK_DIR/sections/section_001.md
$WORK_DIR/sections/section_002.md
$WORK_DIR/sections/sections_manifest.json
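The splitting step can be pictured with a short Python sketch. This is not the code in scripts/split_markdown.py; it only illustrates packing heading-delimited blocks under a token budget. The four-characters-per-token estimate and the function names are assumptions made for the example.

```python
import re


def estimate_tokens(text: str) -> int:
    # Assumption: roughly one token per four characters. A real tokenizer
    # would give different counts, especially for CJK text.
    return max(1, len(text) // 4)


def split_markdown(text: str, max_tokens: int = 6000) -> list[str]:
    # Cut the document in front of every Markdown heading so that each block
    # starts with its own heading (zero-width split, Python 3.7+).
    blocks = [b for b in re.split(r"(?m)^(?=#{1,6} )", text) if b.strip()]

    sections: list[str] = []
    current: list[str] = []
    used = 0
    for block in blocks:
        cost = estimate_tokens(block)
        # Start a new section when adding this block would exceed the budget.
        if current and used + cost > max_tokens:
            sections.append("".join(current))
            current, used = [], 0
        current.append(block)
        used += cost
    if current:
        sections.append("".join(current))
    return sections
```

In this sketch a single block that is larger than the budget is kept whole rather than cut mid-paragraph, so a section can exceed `max_tokens` when one heading owns a very long body.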

Phase 3: Translation

Translate each section using the Markdown translator guide (references/translator_markdown.md).

Option A: Direct Translation (Small documents)

The orchestrator translates the Markdown directly:

Read source.md (or each section)
Translate following translator_markdown.md guidelines
Write to translated.md

Option B: Parallel Translation (Large documents)

Dispatch translation work for each section. Use parallel agents if the active runtime supports them; otherwise process sections sequentially:

Translation job:
  prompt: "Read references/translator_markdown.md for guidelines.
           Translate $WORK_DIR/sections/section_001.md from English to Korean.
           Write output to $WORK_DIR/translated/section_001.md"
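The parallel-or-sequential choice can be sketched in Python as follows. How a runtime actually launches translator agents is not specified here, so `translate` is a hypothetical callable standing in for one translation job, and `translate_sections` is an illustrative name rather than part of this skill.

```python
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path
from typing import Callable


def translate_sections(
    section_dir: Path,
    out_dir: Path,
    translate: Callable[[str], str],  # hypothetical stand-in for one agent job
    parallel: int = 5,
) -> list[Path]:
    out_dir.mkdir(parents=True, exist_ok=True)
    # Zero-padded names (section_001.md, ...) sort in document order.
    sources = sorted(section_dir.glob("section_*.md"))

    def job(src: Path) -> Path:
        dst = out_dir / src.name
        dst.write_text(translate(src.read_text(encoding="utf-8")), encoding="utf-8")
        return dst

    if parallel <= 1:
        # Sequential fallback for runtimes without parallel workers.
        return [job(src) for src in sources]
    with ThreadPoolExecutor(max_workers=parallel) as pool:
        # pool.map yields results in input order, whatever the finish order.
        return list(pool.map(job, sources))
```

Each output file keeps the name of its source section, which is what lets the merge step rely on filename order.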

Phase 4: Merge Translated Sections

If split, merge translated sections:

cat "$WORK_DIR"/translated/section_*.md > "$WORK_DIR/translated.md"
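A plain concatenation silently skips any section whose translation is missing. As a hedged alternative, the sketch below merges in the same filename order but refuses to write the output if a section has no translated counterpart. The function name `merge_sections` is hypothetical and not part of the skill's scripts.

```python
from pathlib import Path


def merge_sections(section_dir: Path, translated_dir: Path, output: Path) -> list[str]:
    """Merge translated sections in order; return the names of missing ones."""
    missing: list[str] = []
    parts: list[str] = []
    for src in sorted(section_dir.glob("section_*.md")):
        dst = translated_dir / src.name
        if dst.is_file():
            parts.append(dst.read_text(encoding="utf-8").rstrip("\n"))
        else:
            missing.append(src.name)
    if not missing:
        # A blank line between sections keeps Markdown blocks from fusing.
        output.write_text("\n\n".join(parts) + "\n", encoding="utf-8")
    return missing
```

An empty return value means `output` was written; otherwise the list names the sections that still need translating.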

Phase 5: Generate PDF

$PYTHON scripts/generate_pdf.py \
  --markdown "$WORK_DIR/translated.md" \
  --output "$OUTPUT_DIR/{filename}_translated.pdf"

Phase 6: Validation (Optional)

Review output for:

Markdown formatting preserved
Tables rendered correctly
Images referenced properly
No untranslated text

Model Selection

Default (no flags)

Task	Model
Markdown translation	Standard translation-capable model
Validation	Fast model

With `--high-quality`

Task	Model
Markdown translation	Strongest practical model
Validation	Stronger model

Academic Mode (`--academic`)

When enabled:

Technical terms include original language in parentheses
Abbreviations expanded on first occurrence
Citations and references preserved
Formal academic writing style maintained

Term Annotation Styles (`--term-style`)

Style	Example
`parenthesis`	기계 학습(Machine Learning)
`footnote`	기계 학습¹
`inline`	기계 학습/Machine Learning

First Occurrence (`--first-occurrence`)

When true (default):

First mention: 기계 학습(Machine Learning)
Subsequent: 기계 학습

When false:

All mentions include original term
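The interaction between `--term-style` and `--first-occurrence` can be illustrated with a small Python sketch. In the skill itself the annotation is carried out by the translator following its guidelines, not by a script; the `annotate_terms` function and its `glossary` argument (translated term mapped to original term) are assumptions made for illustration.

```python
import re

# Maps ASCII digits to Unicode superscripts for footnote markers.
SUPERSCRIPTS = str.maketrans("0123456789", "⁰¹²³⁴⁵⁶⁷⁸⁹")


def annotate_terms(
    text: str,
    glossary: dict[str, str],  # hypothetical: translated term -> original term
    style: str = "parenthesis",
    first_occurrence: bool = True,
) -> tuple[str, list[str]]:
    footnotes: list[str] = []
    for number, (term, original) in enumerate(glossary.items(), start=1):
        if style == "parenthesis":
            annotated = f"{term}({original})"
        elif style == "inline":
            annotated = f"{term}/{original}"
        elif style == "footnote":
            annotated = term + str(number).translate(SUPERSCRIPTS)
        else:
            raise ValueError(f"unknown term style: {style}")
        # count=1 annotates only the first mention; count=0 means "all".
        count = 1 if first_occurrence else 0
        text, replaced = re.subn(
            re.escape(term), lambda _m, a=annotated: a, text, count=count
        )
        if replaced and style == "footnote":
            footnotes.append(f"{number}. {original}")
    return text, footnotes
```

The sketch treats terms as plain substrings, so it does not handle one glossary term nested inside another.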

Language-Specific Processing

Source	Special Handling
Japanese	Vertical→horizontal writing, ruby tag removal
Chinese	Traditional/simplified handling, vertical→horizontal
Arabic/Hebrew	RTL→LTR conversion, text direction adjustment
English	Standard processing

Target	Special Handling
Korean	Translationese removal, natural expression check

Custom Dictionary (Optional)

The translator works without external dictionary files; by default it translates terms based on context.

Use custom dictionaries ONLY for:

Proper nouns: names, places, organizations, brands
Document-specific terms: proprietary terms unique to this document

Do NOT add common words - let the translator handle them naturally.

Creating a Custom Dictionary

Use the --dict option with a JSON file:

{
  "metadata": {
    "source_language": "en",
    "target_language": "ko",
    "document_title": "Annual Report 2024"
  },
  "proper_nouns": {
    "names": { "John Smith": "존 스미스" },
    "places": { "Silicon Valley": "실리콘밸리" },
    "organizations": { "OpenAI": "OpenAI" }
  },
  "domain_terms": {
    "ProprietaryTech": "고유 기술명"
  },
  "preserve_original": {
    "terms": ["API", "GPU", "URL"]
  },
  "abbreviations": {
    "LLM": "Large Language Model"
  },
  "style_notes": {
    "notes": ""
  }
}
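How the translator consumes this file is not described here. As a minimal sketch, assuming the keys shown above, a dictionary could be flattened into a single source-to-target lookup like this (`flatten_dictionary` is a hypothetical helper, not part of the skill):

```python
import json


def flatten_dictionary(data: dict) -> dict[str, str]:
    """Collapse the dictionary structure into one source -> target mapping."""
    lookup: dict[str, str] = {}
    # proper_nouns is grouped (names, places, organizations); merge all groups.
    for group in data.get("proper_nouns", {}).values():
        lookup.update(group)
    lookup.update(data.get("domain_terms", {}))
    # Terms to preserve map to themselves so they pass through untranslated.
    for term in data.get("preserve_original", {}).get("terms", []):
        lookup[term] = term
    return lookup


def load_dictionary(path: str) -> dict[str, str]:
    with open(path, encoding="utf-8") as handle:
        return flatten_dictionary(json.load(handle))
```

The `abbreviations` and `style_notes` entries are left out of the lookup because they describe expansions and guidance rather than fixed translations.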

Templates

Template	Use Case
assets/template.json	General documents
assets/template_academic.json	Academic papers, technical documents

Work Directory Structure

$WORK_DIR/
├── source.md               # Original Markdown (extracted from PDF)
├── metadata.json           # Document metadata (title, pages, languages)
├── images/                 # Extracted images
│   ├── page001_img000.png
│   ├── page002_img000.png
│   └── ...
├── sections/               # Split sections (for large documents)
│   ├── section_001.md
│   ├── section_002.md
│   └── sections_manifest.json
├── translated/             # Translated sections
│   ├── section_001.md
│   ├── section_002.md
│   └── ...
└── translated.md           # Final merged translation

Error Handling

Error	Action
PDF extraction failure	Skip corrupted file, report
Translation timeout	Retry with smaller chunks
Table extraction failure	Treat as text block
Layout preservation failure	Fall back to Markdown only
Low quality score	Re-translate with a stronger model

Text Processing

The following automatic text cleanup is applied during extraction and output generation:

Issue	Fix Applied
Corrupted characters (●)	Restored to parentheses
Broken URLs (spaces)	Spaces removed, domains fixed
Missing @ in emails	Restored based on pattern
Artifact text (a1111111111)	Filtered out
Small images (logos, icons)	Filtered out (below the 200x100 minimum)
Page headers/footers	Auto-detected and removed
Superscript numbers	Converted to `^[n]` format
Reference section	Auto-formatted with merged entries
Concatenated words	Split using wordninja (`thepatient` → `the patient`)
Reversed text	Detected and corrected (`rewol` → `lower`)
Missing punctuation spaces	Added (`text.Next` → `text. Next`)
Table text spacing	Improved with x_tolerance parameter
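As one concrete example from the table, the missing-punctuation-space fix can be approximated with a single regular expression. This is a sketch of the idea, not the rule used by the extraction script; the pattern is an assumption chosen to leave decimals, URLs, and abbreviations such as "e.g." untouched.

```python
import re

# A sentence mark squeezed between a lowercase and an uppercase letter,
# as in "text.Next". Digits and lowercase followers are deliberately ignored.
_MISSING_SPACE = re.compile(r"(?<=[a-z])([.!?;:,])(?=[A-Z])")


def add_punctuation_spaces(text: str) -> str:
    return _MISSING_SPACE.sub(r"\1 ", text)
```

Calling `add_punctuation_spaces("text.Next")` returns `"text. Next"`, matching the example in the table.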

PDF Extraction Error Correction

Some complex PDF artifacts cannot be fully corrected during extraction. The translator prompts include instructions to recognize and correct remaining errors:

Split medical/scientific terms: broncho alveolar → bronchoalveolar
Single-letter fragments: Diaphragm a tic → Diaphragmatic

See references/translator_markdown.md and references/translator_academic.md for details.

Header/Footer Detection

Automatically detects and removes common header/footer patterns:

DOI links (https://doi.org/...)
Journal volume/issue patterns (Journal| (2024) 16:642)
Page numbers (standalone numbers at page boundaries)
Date stamps (Received: 23 July 2024)
Copyright notices
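The patterns listed above map naturally onto regular expressions. The sketch below is an approximation under stated assumptions: the exact expressions used by the extractor are not given here, and the "Accepted" and "Published" variants of the date stamp are additions for illustration.

```python
import re

HEADER_FOOTER_PATTERNS = [
    re.compile(r"^https?://doi\.org/\S+$"),           # DOI links
    re.compile(r"\(\d{4}\)\s*\d+:\d+"),               # e.g. "(2024) 16:642"
    re.compile(r"^\d{1,4}$"),                         # standalone page number
    # "Accepted" and "Published" are assumptions; the source lists "Received".
    re.compile(r"^(Received|Accepted|Published):\s*\d{1,2} \w+ \d{4}", re.I),
    re.compile(r"^(©|\(c\)|Copyright\b)", re.I),      # copyright notices
]


def is_header_or_footer(line: str) -> bool:
    stripped = line.strip()
    return any(pattern.search(stripped) for pattern in HEADER_FOOTER_PATTERNS)
```

Because a bare number or a volume pattern can also occur in body text, a check like this is best applied only to the first and last few lines of each page.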

Superscript Handling

Reference numbers and author affiliations are detected by font size and converted to standard format:

word¹ → word^[1]
Author¹,² → Author^[1,2]
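The target notation can be reproduced on plain text with a short Python sketch. It covers only superscripts that survive extraction as Unicode characters; detection by font size, as described above, needs layout information from the PDF and is not shown. The function name is hypothetical.

```python
import re

_SUPERSCRIPT_DIGITS = "⁰¹²³⁴⁵⁶⁷⁸⁹"
_TO_ASCII = str.maketrans(_SUPERSCRIPT_DIGITS, "0123456789")
# One or more superscript numbers, optionally separated by commas ("¹,²").
_SUPERSCRIPT_RUN = re.compile(
    rf"[{_SUPERSCRIPT_DIGITS}]+(?:,[{_SUPERSCRIPT_DIGITS}]+)*"
)


def convert_superscripts(text: str) -> str:
    return _SUPERSCRIPT_RUN.sub(
        lambda m: "^[" + m.group(0).translate(_TO_ASCII) + "]", text
    )
```

A text-only rule like this cannot tell a reference marker from an exponent such as x², which is one reason font and position information is the more reliable signal.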

Reference Section Processing

Multi-language support for reference section headers:

English: References, Bibliography, Works Cited
Korean: 참고문헌
German: Literatur, Literaturverzeichnis
French: Références, Bibliographie
Chinese/Japanese: 参考文献
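A header check over the titles listed above might look like the following sketch. The normalization steps (stripping Markdown markers and section numbers, case folding, Unicode NFC) are assumptions about what a robust match needs, not a description of the extractor.

```python
import re
import unicodedata

REFERENCE_HEADERS = {
    "references", "bibliography", "works cited",   # English
    "참고문헌",                                      # Korean
    "literatur", "literaturverzeichnis",           # German
    "références", "bibliographie",                 # French
    "参考文献",                                      # Chinese/Japanese
}


def is_reference_header(line: str) -> bool:
    text = unicodedata.normalize("NFC", line)
    # Drop leading Markdown markers and section numbers such as "## " or "7. ".
    text = re.sub(r"^[#\d.\s]+", "", text)
    return text.strip().rstrip(":").casefold() in REFERENCE_HEADERS
```

Only whole-line matches count, so a sentence that merely mentions references is not treated as a section header.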

List Detection

Automatically detects and formats various list styles:

Bullet: •, ·, -, *, ▪, ▸, ►
Numbered: 1., 1), (1), ①
Roman: i., ii., iii.
Letter: a., a), (a)
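The marker styles above can be classified with a few regular expressions, sketched here in Python. The patterns and the `classify_list_item` name are illustrative assumptions. Roman numerals are tested before letters, so an ambiguous marker such as "i." is reported as roman.

```python
import re
from typing import Optional

LIST_PATTERNS = [
    ("bullet", re.compile(r"^[•·\-*▪▸►]\s+")),
    ("roman", re.compile(r"^[ivx]+\.\s+")),
    ("numbered", re.compile(r"^(\d+[.)]|\(\d+\)|[①-⑳])\s+")),
    ("letter", re.compile(r"^([a-z][.)]|\([a-z]\))\s+")),
]


def classify_list_item(line: str) -> Optional[str]:
    stripped = line.lstrip()
    for kind, pattern in LIST_PATTERNS:
        if pattern.match(stripped):
            return kind
    return None
```

Every pattern requires whitespace after the marker, which keeps inline emphasis such as `*word*` from being read as a bullet.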

Output Formats

Markdown Output

Preserves document structure (headings, paragraphs, lists)
Tables converted to Markdown tables
URLs converted to clickable links
Metadata in YAML frontmatter

PDF Output

Generated via pandoc + weasyprint from Markdown
Clean text rendering with system fonts (Pretendard preferred)
Proper table rendering with borders and headers
Clickable links with styling
Page numbers at bottom

File Reference

Path	Description
`SKILL.md`	This file
`references/orchestrator.md`	Orchestrator workflow guide
`references/translator_markdown.md`	Markdown translation guidelines
`references/translator_academic.md`	Academic document translation guidelines
`references/validator_generic.md`	Generic validation instructions
`references/validator_ko.md`	Korean-specific validation
`scripts/setup_env.sh`	Environment setup (installs pandoc, Python dependencies)
`scripts/extract_to_markdown.py`	PDF extraction to Markdown with images
`scripts/split_markdown.py`	Split large Markdown into sections by token count
`scripts/generate_pdf.py`	PDF output generation (Markdown → PDF via pandoc + weasyprint)
`assets/template.json`	Dictionary template for general documents
`assets/template_academic.json`	Dictionary template for academic documents

Known Limitations

PDF extraction has inherent limitations due to the format's nature:

Limitation	Description	Workaround
Figure text extraction	Text inside charts/graphs/diagrams may be extracted as body text	Manual review of figure areas
Complex table structures	Tables with merged cells or nested structures may not parse correctly	Tables extracted as best-effort Markdown
Multi-column layouts	Two-column academic papers may have text order issues	Usually handled correctly, but verify flow
Scanned PDFs	Image-based PDFs require OCR (not included)	Use OCR tools first, then translate
Mathematical formulas	LaTeX/MathML may not render perfectly	Formulas preserved as-is when possible

Quality Expectations

Academic papers: 85-95% accuracy on text extraction
Technical manuals: 80-90% accuracy
Complex layouts: 70-85% accuracy (flowcharts, multi-column)
Tables: Variable (depends on structure complexity)

For best results with complex documents, review the extracted source.md before translation and manually correct any extraction errors.

