Translates PDF documents to any target language with layout preservation. Supports academic papers, books, manuals, and general documents...
Translate PDF documents between any language pair with optimized support for academic papers and documents with complex layouts.
Use this skill when:
/pdf-translator <pdf_path> [options]

`<pdf_path>`: PDF file or directory containing PDFs

| Option | Description | Default |
|---|---|---|
| `--source-lang` | Source language code (auto for detection) | auto |
| `--target-lang` | Target language code | ko |
| `--output-format` | Output format (markdown/pdf/both) | both |
| `--output-dir` | Output directory | ./translated |
| `--parallel` | Concurrent agents | 5 |
| `--dict` | Custom dictionary (JSON) | none |
| `--high-quality` | Prefer the runtime's stronger model for translation | false |
| `--academic` | Academic document mode | false |
| `--term-style` | Term annotation style (parenthesis/footnote/inline) | parenthesis |
| `--first-occurrence` | Annotate terms only on first occurrence | true |
| `--describe-images` | Add AI-generated image descriptions | false |

ja (Japanese), en (English), ko (Korean), zh (Chinese), es (Spanish), fr (French), de (German), ru (Russian), ar (Arabic), he (Hebrew), or any ISO 639-1 code.
# Basic translation (English PDF to Korean)
/pdf-translator "/docs/manual.pdf"
# Japanese academic paper to Korean with terminology annotations
/pdf-translator "/papers/research.pdf" --source-lang ja --academic
# High-quality translation using the runtime's stronger model
/pdf-translator "/papers/important.pdf" --high-quality
# Academic mode with footnote-style term annotations
/pdf-translator "/papers/thesis.pdf" --academic --term-style footnote
# Batch translation of a directory
/pdf-translator "/docs/" --target-lang ko --parallel 10
# Markdown output only
/pdf-translator "/books/novel.pdf" --output-format markdown
# Arabic RTL document to English
/pdf-translator "/docs/arabic.pdf" --source-lang ar --target-lang en
flowchart LR
PDF[PDF] --> Extract[extract_to_markdown.py]
Extract --> Source[source.md + images/]
Source --> Split[split_markdown.py]
Split --> Sections[sections/]
Sections --> Translate[Translator Agents]
Translate --> Merged[translated.md]
Merged --> GenPDF[generate_pdf.py]
GenPDF --> Output[translated.pdf]
Key Features:
bash scripts/setup_env.sh
This installs pandoc and creates .venv/ with Python dependencies (pymupdf, pdfplumber, weasyprint).
PYTHON=".venv/bin/python"
WORK_DIR="/tmp/pdf_translate_$(date +%s)"
$PYTHON scripts/extract_to_markdown.py \
--pdf "{PDF_PATH}" \
--output-dir "$WORK_DIR" \
--source-lang en \
--target-lang ko
Output:
- `$WORK_DIR/source.md` - Original Markdown (preserves structure)
- `$WORK_DIR/images/` - Extracted images
- `$WORK_DIR/metadata.json` - Document metadata

For large documents (>6000 tokens), split into sections:
$PYTHON scripts/split_markdown.py \
--input "$WORK_DIR/source.md" \
--output-dir "$WORK_DIR/sections" \
--max-tokens 6000
Output:
- `$WORK_DIR/sections/section_001.md`
- `$WORK_DIR/sections/section_002.md`
- `$WORK_DIR/sections/sections_manifest.json`

Translate each section using the Markdown translator guide (references/translator_markdown.md).
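
The token-budgeted split performed before translation can be sketched as follows. This is a minimal illustration, not the actual logic of `scripts/split_markdown.py`: it assumes sections are cut only at Markdown heading lines, and `estimate_tokens` is a hypothetical stand-in that approximates one token per four characters.

```python
import re


def estimate_tokens(text: str) -> int:
    # Assumption: roughly one token per 4 characters (a crude heuristic).
    return max(1, len(text) // 4)


def split_markdown(text: str, max_tokens: int = 6000) -> list[str]:
    """Greedily pack heading-delimited blocks into sections within a token budget."""
    # Zero-width split before every heading line, so each heading stays with its body.
    blocks = [b for b in re.split(r"(?m)^(?=#{1,6} )", text) if b.strip()]
    sections: list[str] = []
    current: list[str] = []
    used = 0
    for block in blocks:
        cost = estimate_tokens(block)
        # Start a new section when adding this block would exceed the budget.
        if current and used + cost > max_tokens:
            sections.append("".join(current))
            current, used = [], 0
        current.append(block)
        used += cost
    if current:
        sections.append("".join(current))
    return sections
```

A single block larger than `max_tokens` is kept whole in this sketch; a real splitter would need a further fallback, such as cutting at paragraph boundaries.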
The orchestrator translates the Markdown directly:
- `source.md` (or each section)
- `translator_markdown.md` guidelines
- `translated.md`

Dispatch translation work for each section. Use parallel agents if the active runtime supports them; otherwise process sections sequentially:
Translation job:
prompt: "Read references/translator_markdown.md for guidelines.
Translate $WORK_DIR/sections/section_001.md from English to Korean.
Write output to $WORK_DIR/translated/section_001.md"
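
The parallel-or-sequential dispatch described above could be organized along these lines. This is a sketch under stated assumptions: `run_job` is a hypothetical callable that hands one prompt to whatever agent mechanism the runtime provides, and `build_prompt` and `dispatch` are illustrative names, not part of the skill.

```python
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path
from typing import Callable, Iterable


def build_prompt(section: Path, out_dir: Path, source: str, target: str) -> str:
    """Fill in the translation job prompt for one section file."""
    return (
        "Read references/translator_markdown.md for guidelines.\n"
        f"Translate {section} from {source} to {target}.\n"
        f"Write output to {out_dir / section.name}"
    )


def dispatch(
    sections: Iterable[Path],
    out_dir: Path,
    run_job: Callable[[str], object],  # hypothetical: submits one prompt to an agent
    parallel: int = 5,
    source: str = "English",
    target: str = "Korean",
) -> list[str]:
    """Run one translation job per section, concurrently when parallel > 1."""
    prompts = [build_prompt(s, out_dir, source, target) for s in sorted(sections)]
    if parallel <= 1:
        # Sequential fallback for runtimes without parallel agents.
        for prompt in prompts:
            run_job(prompt)
    else:
        with ThreadPoolExecutor(max_workers=parallel) as pool:
            # list() drains the iterator so any job exception is raised here.
            list(pool.map(run_job, prompts))
    return prompts
```

Sorting the section paths keeps prompt order stable, which makes reruns and logs easier to compare.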
If split, merge translated sections:
cat "$WORK_DIR"/translated/section_*.md > "$WORK_DIR/translated.md"
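
A plain `cat` silently skips any section that was never translated. As a hedged alternative, the merge can first check that every split section has a translated counterpart. The function name `merge_sections` is hypothetical, and the sketch assumes the directory layout used in the commands above.

```python
from pathlib import Path


def merge_sections(sections_dir: Path, translated_dir: Path, output: Path) -> list[str]:
    """Concatenate translated sections in order, failing if any section is missing."""
    # Zero-padded names (section_001.md, ...) sort correctly as plain strings.
    names = sorted(p.name for p in sections_dir.glob("section_*.md"))
    missing = [n for n in names if not (translated_dir / n).is_file()]
    if missing:
        raise FileNotFoundError(f"untranslated sections: {missing}")
    parts = [
        (translated_dir / n).read_text(encoding="utf-8").rstrip("\n") for n in names
    ]
    # Separate sections with one blank line so adjacent blocks do not fuse.
    output.write_text("\n\n".join(parts) + "\n", encoding="utf-8")
    return names
```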
$PYTHON scripts/generate_pdf.py \
--markdown "$WORK_DIR/translated.md" \
--output "$OUTPUT_DIR/{filename}_translated.pdf"
Review output for:
| Task | Model |
|---|---|
| Markdown translation | Standard translation-capable model |
| Validation | Fast model |

With `--high-quality`:

| Task | Model |
|---|---|
| Markdown translation | Strongest practical model |
| Validation | Stronger model |
Academic mode (`--academic`)

When enabled:
Term annotation style (`--term-style`)

| Style | Example |
|---|---|
| `parenthesis` | 기계 학습(Machine Learning) |
| `footnote` | 기계 학습¹ |
| `inline` | 기계 학습/Machine Learning |
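
To make the three styles concrete, here is a small sketch of how a translated term might be annotated with its original. It is illustrative only: the `terms` mapping (translated term to original term), the function name `annotate`, and the trailing footnote list are assumptions, since the table shows only the inline footnote marker. The `first_occurrence` flag mirrors the `--first-occurrence` option.

```python
SUPERSCRIPTS = str.maketrans("0123456789", "⁰¹²³⁴⁵⁶⁷⁸⁹")


def annotate(
    text: str,
    terms: dict[str, str],  # assumption: maps translated term -> original term
    style: str = "parenthesis",
    first_occurrence: bool = True,
) -> str:
    """Attach the original term to its translation using the chosen style."""
    notes: list[str] = []
    for translated, original in terms.items():
        if translated not in text:
            continue
        if style == "parenthesis":
            tagged = f"{translated}({original})"
        elif style == "inline":
            tagged = f"{translated}/{original}"
        elif style == "footnote":
            notes.append(original)
            tagged = translated + str(len(notes)).translate(SUPERSCRIPTS)
        else:
            raise ValueError(f"unknown term style: {style}")
        # count=1 annotates only the first occurrence; -1 annotates all of them.
        text = text.replace(translated, tagged, 1 if first_occurrence else -1)
    if notes:
        # Assumed footnote layout: one numbered line per term at the end.
        lines = [f"{str(i).translate(SUPERSCRIPTS)} {n}" for i, n in enumerate(notes, 1)]
        text += "\n\n" + "\n".join(lines)
    return text
```

The sketch uses plain substring replacement, so overlapping terms (one term contained in another) would need extra handling in practice.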
First-occurrence annotation (`--first-occurrence`)

When true (default):
When false:
| Source | Special Handling |
|---|---|
| Japanese | Vertical→horizontal writing, ruby tag removal |
| Chinese | Traditional/simplified handling, vertical→horizontal |
| Arabic/Hebrew | RTL→LTR conversion, text direction adjustment |
| English | Standard processing |
| Target | Special Handling |
|---|---|
| Korean | Translationese removal, natural expression check |
The translator works without external dictionary files. It naturally translates based on context.
Use custom dictionaries ONLY for:
Do NOT add common words - let the translator handle them naturally.
Use the --dict option with a JSON file:
{
"metadata": {
"source_language": "en",
"target_language": "ko",
"document_title": "Annual Report 2024"
},
"proper_nouns": {
"names": { "John Smith": "존 스미스" },
"places": { "Silicon Valley": "실리콘밸리" },
"organizations": { "OpenAI": "OpenAI" }
},
"domain_terms": {
"ProprietaryTech": "고유 기술명"
},
"preserve_original": {
"terms": ["API", "GPU", "URL"]
},
"abbreviations": {
"LLM": "Large Language Model"
},
"style_notes": {
"notes": ""
}
}
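
One way a consumer might read this file is to flatten it into a single source-to-target glossary. The sketch below is an assumption about usage, not the skill's actual loader: `flatten_dictionary` and `load_dictionary` are hypothetical names, and it treats `preserve_original` terms as mapping to themselves while leaving `abbreviations` out, since those are expansions rather than translations.

```python
import json


def flatten_dictionary(data: dict) -> dict[str, str]:
    """Collapse the dictionary sections into one source -> target mapping."""
    glossary: dict[str, str] = {}
    # proper_nouns is grouped by kind (names, places, organizations).
    for group in data.get("proper_nouns", {}).values():
        glossary.update(group)
    glossary.update(data.get("domain_terms", {}))
    # Terms to preserve are left untranslated, so they map to themselves.
    for term in data.get("preserve_original", {}).get("terms", []):
        glossary[term] = term
    return glossary


def load_dictionary(path: str) -> dict[str, str]:
    """Read a --dict JSON file and return the flattened glossary."""
    with open(path, encoding="utf-8") as f:
        return flatten_dictionary(json.load(f))
```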
| Template | Use Case |
|---|---|
| assets/template.json | General documents |
| assets/template_academic.json | Academic papers, technical documents |
$WORK_DIR/
├── source.md # Original Markdown (extracted from PDF)
├── metadata.json # Document metadata (title, pages, languages)
├── images/ # Extracted images
│ ├── page001_img000.png
│ ├── page002_img000.png
│ └── ...
├── sections/ # Split sections (for large documents)
│ ├── section_001.md
│ ├── section_002.md
│ └── sections_manifest.json
├── translated/ # Translated sections
│ ├── section_001.md
│ ├── section_002.md
│ └── ...
└── translated.md # Final merged translation
| Error | Action |
|---|---|
| PDF extraction failure | Skip corrupted file, report |
| Translation timeout | Retry with smaller chunks |
| Table extraction failure | Treat as text block |
| Layout preservation failure | Fall back to Markdown only |
| Low quality score | Re-translate with a stronger model |
The following automatic text cleanup is applied during extraction and output generation:
| Issue | Fix Applied |
|---|---|
| Corrupted characters (●) | Restored to parentheses |
| Broken URLs (spaces) | Spaces removed, domains fixed |
| Missing @ in emails | Restored based on pattern |
| Artifact text (a1111111111) | Filtered out |
| Small images (logos, icons) | Filtered (min 200x100) |
| Page headers/footers | Auto-detected and removed |
| Superscript numbers | Converted to ^[n] format |
| Reference section | Auto-formatted with merged entries |
| Concatenated words | Split using wordninja (thepatient → the patient) |
| Reversed text | Detected and corrected (rewol → lower) |
| Missing punctuation spaces | Added (text.Next → text. Next) |
| Table text spacing | Improved with x_tolerance parameter |
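
Two of the fixes in the table lend themselves to a short regex sketch. These are illustrative approximations, not the extraction script's code: the function names are hypothetical, and the superscript conversion assumes references arrive as Unicode superscript digits rather than being identified by other means.

```python
import re

SUPERSCRIPT_TO_DIGIT = str.maketrans("⁰¹²³⁴⁵⁶⁷⁸⁹", "0123456789")
SUPERSCRIPT_RUN = re.compile(r"[⁰¹²³⁴⁵⁶⁷⁸⁹]+(?:,[⁰¹²³⁴⁵⁶⁷⁸⁹]+)*")


def convert_superscripts(text: str) -> str:
    """Rewrite runs of superscript digits as ^[n], keeping comma-separated groups."""
    return SUPERSCRIPT_RUN.sub(
        lambda m: "^[" + m.group(0).translate(SUPERSCRIPT_TO_DIGIT) + "]", text
    )


def add_punctuation_spaces(text: str) -> str:
    """Insert a space where a sentence ends and the next begins with no gap."""
    # Only fires on lowercase + punctuation + uppercase, so "v1.2" and "U.S.A" are untouched.
    return re.sub(r"([a-z][.!?])([A-Z])", r"\1 \2", text)
```

The lowercase-then-uppercase condition is a deliberate trade-off: it avoids touching version numbers and initialisms, at the cost of missing sentences that start with a digit or lowercase letter.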
Some complex PDF artifacts cannot be fully corrected during extraction. The translator prompts include instructions to recognize and correct remaining errors:
- broncho alveolar → bronchoalveolar
- Diaphragm a tic → Diaphragmatic

See references/translator_markdown.md and references/translator_academic.md for details.
Automatically detects and removes common header/footer patterns:
- `https://doi.org/...`
- `Journal| (2024) 16:642`
- `Received: 23 July 2024`

Reference numbers and author affiliations are detected by font size and converted to standard format:
- word¹ → word^[1]
- Author¹,² → Author^[1,2]

Multi-language support for reference section headers:
Automatically detects and formats various list styles:
- `•`, `·`, `-`, `*`, `▪`, `▸`, `►`
- `1.`, `1)`, `(1)`, `①`
- `i.`, `ii.`, `iii.`
- `a.`, `a)`, `(a)`

| Path | Description |
|---|---|
| `SKILL.md` | This file |
| `references/orchestrator.md` | Orchestrator workflow guide |
| `references/translator_markdown.md` | Markdown translation guidelines |
| `references/translator_academic.md` | Academic document translation |
| `references/validator_generic.md` | Generic validation instruction |
| `references/validator_ko.md` | Korean-specific validation |
| `scripts/setup_env.sh` | Environment setup (installs pandoc, Python dependencies) |
| `scripts/extract_to_markdown.py` | PDF extraction to Markdown with images |
| `scripts/split_markdown.py` | Split large Markdown into sections by token count |
| `scripts/generate_pdf.py` | PDF output generation (Markdown → PDF via pandoc + weasyprint) |
| `assets/template.json` | Dictionary template for general documents |
| `assets/template_academic.json` | Dictionary template for academic documents |

PDF extraction has inherent limitations due to the format's nature:
| Limitation | Description | Workaround |
|---|---|---|
| Figure text extraction | Text inside charts/graphs/diagrams may be extracted as body text | Manual review of figure areas |
| Complex table structures | Tables with merged cells or nested structures may not parse correctly | Tables extracted as best-effort Markdown |
| Multi-column layouts | Two-column academic papers may have text order issues | Usually handled correctly, but verify flow |
| Scanned PDFs | Image-based PDFs require OCR (not included) | Use OCR tools first, then translate |
| Mathematical formulas | LaTeX/MathML may not render perfectly | Formulas preserved as-is when possible |
For best results with complex documents, review the extracted source.md before translation and manually correct any extraction errors.