Convert entire PDF documents to clean, structured Markdown for full context loading...
Extract complete PDF content as structured Markdown using IBM Docling AI, preserving:
USE THIS when:
This skill uses a dedicated virtual environment at ~/.claude/skills/pdf-to-markdown/.venv/ to avoid polluting the user's working directory.
cd ~/.claude/skills/pdf-to-markdown && uv venv .venv && uv pip install --python .venv/bin/python pymupdf docling docling-core
~/.claude/skills/pdf-to-markdown/.venv/bin/python -c "import pymupdf; import docling; import docling_core; print('OK')"
# Convert PDF to markdown (always extracts images)
~/.claude/skills/pdf-to-markdown/.venv/bin/python ~/.claude/skills/pdf-to-markdown/scripts/pdf_to_md.py document.pdf
# Output: document.md + images/ folder (next to the .md file)
When the user provides a PDF and wants its full content in context:
test -d ~/.claude/skills/pdf-to-markdown/.venv || (cd ~/.claude/skills/pdf-to-markdown && uv venv .venv && uv pip install --python .venv/bin/python pymupdf docling docling-core)
~/.claude/skills/pdf-to-markdown/.venv/bin/python ~/.claude/skills/pdf-to-markdown/scripts/pdf_to_md.py "/path/to/document.pdf"
# Output is written to document.md in the same directory as the PDF
cat /path/to/document.md
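The same steps can be driven from Python. The following is a minimal sketch, assuming the venv and script paths shown above and assuming the output lands next to the PDF with a `.md` suffix. `build_command` and `convert` are hypothetical helper names, not part of the skill.

```python
import subprocess
from pathlib import Path

# Paths taken from the commands above.
SKILL_DIR = Path.home() / ".claude" / "skills" / "pdf-to-markdown"
VENV_PYTHON = SKILL_DIR / ".venv" / "bin" / "python"
SCRIPT = SKILL_DIR / "scripts" / "pdf_to_md.py"


def build_command(pdf_path):
    """Return the argv list that runs pdf_to_md.py on pdf_path (hypothetical helper)."""
    return [str(VENV_PYTHON), str(SCRIPT), str(pdf_path)]


def convert(pdf_path):
    """Run the converter and return the Markdown text it wrote.

    Assumes the output is <name>.md in the same directory as the PDF.
    """
    pdf_path = Path(pdf_path)
    # check=True raises CalledProcessError if the script exits non-zero.
    subprocess.run(build_command(pdf_path), check=True)
    return pdf_path.with_suffix(".md").read_text(encoding="utf-8")
```

Calling `convert("/path/to/document.pdf")` would then return the same text that `cat /path/to/document.md` prints.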
PDFs are aggressively cached to avoid re-processing. The first extraction is slow (~1 sec/page); every subsequent request for the same PDF is served from the cache and returns almost instantly.
Cached results are stored under ~/.cache/pdf-to-markdown/<cache_key>/ and can be cleared with --clear-cache or --clear-all-cache.
# Clear cache for a specific PDF
~/.claude/skills/pdf-to-markdown/.venv/bin/python ~/.claude/skills/pdf-to-markdown/scripts/pdf_to_md.py document.pdf --clear-cache
# Clear entire cache
~/.claude/skills/pdf-to-markdown/.venv/bin/python ~/.claude/skills/pdf-to-markdown/scripts/pdf_to_md.py --clear-all-cache
# Show cache statistics
~/.claude/skills/pdf-to-markdown/.venv/bin/python ~/.claude/skills/pdf-to-markdown/scripts/pdf_to_md.py --cache-stats
~/.cache/pdf-to-markdown/<cache_key>/
├── metadata.json # source path, mtime, size, total_pages
├── full_output.md # cached full markdown
└── images/ # extracted images
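Because metadata.json records the source file's mtime and size, a cache entry can be checked for staleness by comparing those values with the file on disk. The sketch below illustrates that idea only. The JSON key names `mtime` and `size` are assumptions, and the script's actual invalidation logic may differ.

```python
import json
import os
from pathlib import Path


def load_metadata(cache_dir):
    """Read metadata.json from a cache entry directory."""
    path = Path(cache_dir) / "metadata.json"
    return json.loads(path.read_text(encoding="utf-8"))


def is_cache_fresh(metadata, pdf_path):
    """Return True if the recorded mtime and size still match the PDF on disk.

    The key names "mtime" and "size" are assumptions; check metadata.json
    for the names the script actually writes.
    """
    stat = os.stat(pdf_path)
    return (
        metadata.get("mtime") == stat.st_mtime
        and metadata.get("size") == stat.st_size
    )
```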
Images are always extracted. They are:
- Saved to ~/.cache/pdf-to-markdown/<cache_key>/images/
- Copied to an images/ folder next to the output .md file
- Referenced in the markdown by relative path (images/filename.png)

IMPORTANT: When the extracted markdown contains image references like:
**[Image: figure_1.png (1200x800, 125.3KB)]**
And the user asks about something that might be visual (charts, graphs, diagrams, figures, screenshots, layouts, designs, plots, illustrations), automatically use the Read tool to view the relevant image file(s) before answering. Don't ask the user - just look at it.
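To decide which files to open, the image placeholders can be located with a regular expression. This is a minimal sketch that assumes every placeholder follows the exact format shown above. `find_image_refs` is a hypothetical helper, and `images_dir` defaults to the `images` folder next to the output file.

```python
import re
from pathlib import Path

# Matches placeholders such as **[Image: figure_1.png (1200x800, 125.3KB)]**
IMAGE_REF = re.compile(
    r"\*\*\[Image: (?P<file>.+?) "
    r"\((?P<width>\d+)x(?P<height>\d+), (?P<kb>[\d.]+)KB\)\]\*\*"
)


def find_image_refs(markdown, images_dir="images"):
    """Return one dict per image placeholder found in the Markdown text."""
    refs = []
    for match in IMAGE_REF.finditer(markdown):
        refs.append({
            "path": Path(images_dir) / match["file"],
            "width": int(match["width"]),
            "height": int(match["height"]),
            "size_kb": float(match["kb"]),
        })
    return refs
```

Each returned `path` is relative to the directory containing the .md file, so it can be resolved against that directory before viewing the image.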
Examples of when to auto-view images:
The markdown output includes:
---
source: document.pdf
total_pages: 42
extracted_at: 2025-01-15T10:30:00
from_cache: true
images_dir: images
---
# Main Title
## Section Header
Regular paragraph text with **bold**, *italic*, and `code` formatting.

**[Image: figure_1.png (800x600, 45.2KB)]**
| Column A | Column B |
|----------|----------|
| Data 1 | Data 2 |
---
## Extracted Images
| # | File | Dimensions | Size |
|---|------|------------|------|
| 1 | figure_1.png | 800x600 | 45.2KB |
| 2 | chart_2.png | 1200x800 | 89.1KB |
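The header block between the leading `---` lines can be separated from the body with a few lines of Python. This sketch assumes the header holds simple `key: value` pairs as in the sample above. It is not a full YAML parser, and `parse_front_matter` is a hypothetical helper name.

```python
def parse_front_matter(markdown):
    """Split Markdown into (metadata dict, body) using the leading --- block.

    Values are kept as strings; nested YAML is not supported.
    """
    lines = markdown.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}, markdown
    meta = {}
    for index, line in enumerate(lines[1:], start=1):
        if line.strip() == "---":
            # Closing delimiter found: everything after it is the body.
            return meta, "\n".join(lines[index + 1:])
        # Split on the first colon only, so timestamps keep their colons.
        key, sep, value = line.partition(":")
        if sep:
            meta[key.strip()] = value.strip()
    # No closing delimiter: treat the whole text as plain Markdown.
    return {}, markdown
```

With the sample above, `meta["from_cache"]` is the string `"true"`, which indicates whether the result came from the cache.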
Location: ~/.claude/skills/pdf-to-markdown/scripts/pdf_to_md.py
Usage: pdf_to_md.py <input.pdf> [output.md] [options]
Options:
--no-progress Disable progress indicator
Cache Options:
--clear-cache Clear cache for this PDF and re-extract
--clear-all-cache Clear entire cache directory and exit
--cache-stats Show cache statistics and exit
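For readers wrapping or re-implementing this interface, the usage line above maps onto `argparse` roughly as follows. This is an illustrative sketch, not the script's actual source. The positional input is declared optional only because `--clear-all-cache` and `--cache-stats` are shown running without a PDF.

```python
import argparse


def build_parser():
    """Build a parser that accepts the options listed above (illustrative only)."""
    parser = argparse.ArgumentParser(prog="pdf_to_md.py")
    # Both positionals are optional so cache-only invocations parse cleanly.
    parser.add_argument("input", nargs="?", help="path to the input PDF")
    parser.add_argument("output", nargs="?", help="optional output .md path")
    parser.add_argument("--no-progress", action="store_true",
                        help="disable progress indicator")
    cache = parser.add_argument_group("Cache Options")
    cache.add_argument("--clear-cache", action="store_true",
                       help="clear cache for this PDF and re-extract")
    cache.add_argument("--clear-all-cache", action="store_true",
                       help="clear entire cache directory and exit")
    cache.add_argument("--cache-stats", action="store_true",
                       help="show cache statistics and exit")
    return parser
```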
Recreate the skill's virtual environment:
cd ~/.claude/skills/pdf-to-markdown && rm -rf .venv && uv venv .venv && uv pip install --python .venv/bin/python pymupdf docling docling-core
For scanned PDFs, ensure Tesseract OCR is installed (on macOS with Homebrew: brew install tesseract)
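A quick way to confirm that the `tesseract` executable is visible is to look it up on PATH. This sketch checks only that the binary is present. It assumes the skill calls the Tesseract command-line tool, which this document does not confirm, and `tesseract_available` is a hypothetical helper name.

```python
import shutil


def tesseract_available():
    """Return True if a `tesseract` executable is found on PATH."""
    return shutil.which("tesseract") is not None
```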
This skill uses IBM's TableFormer AI model, which has ~93.6% accuracy on complex tables. If tables are still garbled, the PDF may have unusual formatting.