pdf-to-markdown

lttr/pdf-to-markdown

Writing

5 installs

About

Extract text from scanned PDF documents and convert to clean markdown...

SKILL.md

PDF to Markdown Extraction

Extract text content from PDF documents and save as clean markdown.

How It Works

Claude is multimodal. Use the Read tool to visually read PDF pages and transcribe the text directly.

Workflow

Check PDF size - Count pages using pdfinfo <file>.pdf | grep Pages
Split if large (10+ pages) - Use qpdf to split into 4-page chunks
Extract text - For small PDFs: read directly. For large PDFs: process chunks in parallel using Task agents
Merge results - Combine extracted text in page order
Format as markdown - Apply appropriate heading levels, lists, and formatting
Review for errors - Check grammar/spelling, fix obvious OCR-style typos
Save output - Write markdown file next to original PDF with same base name
Cleanup - Remove temporary chunk files
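The first two steps can be sketched as a pair of small shell helpers. This is a minimal sketch, assuming pdfinfo (part of the Poppler utilities) is available; the function names parse_page_count and needs_split are illustrative and not part of the skill.

```shell
# Hypothetical helpers for steps 1-2; the names are illustrative only.

# Read `pdfinfo` output on stdin and print the number after "Pages:".
parse_page_count() {
  awk '/^Pages:/ { print $2 }'
}

# Succeed (exit status 0) when the page count reaches the 10-page threshold.
needs_split() {
  [ "$1" -ge 10 ]
}

# Example use (requires pdfinfo and a real input.pdf):
#   pages=$(pdfinfo input.pdf | parse_page_count)
#   if needs_split "$pages"; then echo "split first"; else echo "read directly"; fi
```

Keeping the parsing separate from the pdfinfo call makes the threshold logic easy to check without a PDF at hand.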

Handling Large PDFs (10+ pages)

For PDFs with 10 or more pages, split into chunks for parallel processing.

Prerequisites

Ensure qpdf is installed:

# Check if installed
command -v qpdf

# Install if missing (Debian/Ubuntu)
sudo apt install qpdf
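
The workflow also calls pdfinfo, so it may be worth checking both tools in one pass. A minimal sketch, assuming a POSIX shell; missing_tools is a hypothetical helper name. On Debian/Ubuntu, pdfinfo ships in the poppler-utils package.

```shell
# Print the name of each listed command that is not found on PATH.
# Hypothetical helper; the name is illustrative only.
missing_tools() {
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
  done
}

# Example use:
#   missing_tools qpdf pdfinfo
# Prints nothing when both tools are installed.
```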

Splitting Process

# Create unique temp directory
TMPDIR=$(mktemp -d)

# Split into 4-page chunks
qpdf --split-pages=4 input.pdf "$TMPDIR/chunk-%d.pdf"

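This creates one file per 4-page group. qpdf replaces %d with a zero-padded page range, so a 12-page input yields chunk-01-04.pdf, chunk-05-08.pdf, and chunk-09-12.pdf.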
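
As a quick sanity check after splitting, the number of chunk files can be compared with the number expected from the page count. A minimal sketch; chunk_count and chunks_present are hypothetical helper names, and the chunk-*.pdf pattern assumes the output name used above.

```shell
# Print how many chunks to expect: ceil(pages / chunk_size).
chunk_count() {
  echo $(( ($1 + $2 - 1) / $2 ))
}

# Print how many chunk-*.pdf files exist in the given directory.
chunks_present() {
  set -- "$1"/chunk-*.pdf
  # If the glob matched nothing, $1 is still the literal pattern.
  [ -e "$1" ] || { echo 0; return; }
  echo "$#"
}

# Example use, for a 10-page PDF split into 4-page chunks:
#   [ "$(chunks_present "$TMPDIR")" = "$(chunk_count 10 4)" ] || echo "chunk count mismatch"
```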

Parallel Extraction

Launch multiple Task agents concurrently (one per chunk) to extract text. Each agent reads its assigned chunk and returns the extracted text. Collect and merge results in page order.
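
The merge step can be sketched in shell under one assumption that the skill does not state: each agent's transcription has been saved as a .md file with the same base name as its chunk. merge_chunks is a hypothetical helper. It relies on the chunk names sorting in page order, which holds when the numbers in the names are zero-padded.

```shell
# Concatenate per-chunk markdown files into one output file.
# Assumes (hypothetically) that each chunk-*.pdf has a matching chunk-*.md.
merge_chunks() {
  merge_dir=$1
  merge_out=$2
  : > "$merge_out"                 # start with an empty output file
  for part in "$merge_dir"/chunk-*.md; do
    [ -e "$part" ] || continue     # the glob matched nothing
    cat "$part" >> "$merge_out"
    printf '\n' >> "$merge_out"    # blank line between chunks
  done
}

# Example use:
#   merge_chunks "$TMPDIR" /path/to/document.md
```

Shell globs expand in sorted order, so no explicit sort is needed as long as the file names themselves sort by page.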

Cleanup

After merging, remove the temporary directory:

rm -rf "$TMPDIR"
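
If the steps run as a single script, the cleanup can be registered up front so it also happens when a later step fails. A minimal sketch, assuming a POSIX shell; WORK_DIR is an illustrative variable name.

```shell
# Create a private working directory and remove it when the shell exits,
# even if a later command fails. WORK_DIR is an illustrative name.
WORK_DIR=$(mktemp -d) || exit 1
trap 'rm -rf "$WORK_DIR"' EXIT

# ... split into "$WORK_DIR", extract, merge ...
```

A name other than TMPDIR is used in this sketch because mktemp and other tools read the TMPDIR environment variable to decide where temporary files go.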

Output Location

Save the .md file in the same directory as the source PDF:

Input: /path/to/document.pdf
Output: /path/to/document.md
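
The output path can be derived from the input path with shell parameter expansion. A minimal sketch; md_path_for is a hypothetical helper name.

```shell
# Print the markdown path for a PDF path by replacing the final extension.
md_path_for() {
  printf '%s\n' "${1%.*}.md"
}

# Example use:
#   out=$(md_path_for /path/to/document.pdf)
```

Stripping the last extension with `%.*` rather than a literal `.pdf` also covers files named with an uppercase .PDF suffix.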

Language-Specific Notes

Preserve diacritics accurately (Czech háčky and čárky, i.e. carons and acute accents)
Keep abbreviations as in original (e.g., čs., použ.)

Handling Annotations

If the document contains handwritten annotations:

Use ~~strikethrough~~ for crossed-out text
Use *italics* for handwritten additions
Note unclear annotations in the output, enclosing them in parentheses ( ) so they stand out

Quality Checklist

Before saving, verify:

All pages transcribed
Heading hierarchy makes sense
Lists formatted consistently
No obvious typos or garbled text
Special characters rendered correctly
