Extract text from scanned PDF documents and convert to clean markdown...
Extract text content from PDF documents and save as clean markdown.
Claude is multimodal. Use the Read tool to visually read PDF pages and transcribe the text directly.
pdfinfo <file>.pdf | grep Pages
For PDFs with 10 or more pages, split into chunks for parallel processing.
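The page-count check can be scripted. A minimal sketch, assuming `pdfinfo` (from poppler-utils) is on the PATH; the helper names `page_count` and `needs_split` are hypothetical, and the threshold of 10 is the one stated above:

```shell
# Hypothetical helper: print the number of pages reported by pdfinfo.
page_count() {
  pdfinfo "$1" | awk '/^Pages:/ { print $2 }'
}

# Hypothetical helper: succeed (exit 0) when a page count is 10 or more.
needs_split() {
  [ "$1" -ge 10 ]
}
```

For example, `if needs_split "$(page_count input.pdf)"; then ...; fi` would gate the chunking step on the document's size.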
Ensure qpdf is installed:
# Check if installed
command -v qpdf
# Install if missing (Debian/Ubuntu)
sudo apt install qpdf
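The check and the failure case can be folded into one reusable step. A small sketch; `require_cmd` is a hypothetical helper name, not part of qpdf or any package:

```shell
# Hypothetical helper: fail with a message when a required command is missing.
require_cmd() {
  if ! command -v "$1" >/dev/null 2>&1; then
    printf 'missing required command: %s\n' "$1" >&2
    return 1
  fi
}
```

Calling `require_cmd qpdf || exit 1` before splitting stops the workflow early instead of failing partway through.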
# Create unique temp directory
TMPDIR=$(mktemp -d)
# Split into 4-page chunks
qpdf --split-pages=4 input.pdf "$TMPDIR/chunk-%d.pdf"
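This creates one file per 4-page group. qpdf replaces %d with a zero-padded page range, so for a document with 10 to 99 pages the files are named chunk-01-04.pdf, chunk-05-08.pdf, etc.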
Launch multiple Task agents concurrently (one per chunk) to extract text. Each agent reads its assigned chunk and returns the extracted text. Collect and merge results in page order.
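The merge step can be sketched as follows. This assumes, purely for illustration, that each agent's text was saved next to its chunk as a `.md` file and that the file names sort lexically in page order; `merge_chunks` is a hypothetical helper name:

```shell
# Hypothetical helper: concatenate per-chunk markdown files into one output file.
# Assumes names such as chunk-01-04.md, chunk-05-08.md that sort in page order.
merge_chunks() {
  dir=$1
  out=$2
  : > "$out"                  # start with an empty output file
  for f in "$dir"/chunk-*.md; do
    [ -e "$f" ] || continue   # skip the unexpanded pattern when no chunk files exist
    cat "$f" >> "$out"
    printf '\n' >> "$out"     # separate chunks with a blank line
  done
}
```

The shell expands the glob in sorted order, so no explicit sort is needed as long as the page numbers in the names are zero-padded.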
After merging, remove the temporary directory:
rm -rf "$TMPDIR"
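Cleanup can also be made automatic, so the directory is removed even if a step fails partway. A minimal sketch using a shell `trap`; the wrapper name `with_workdir` and the variable name `workdir` are hypothetical:

```shell
# Hypothetical wrapper: create a temp directory, pass its path as the last
# argument to the given command, and remove the directory when the subshell exits.
with_workdir() (
  workdir=$(mktemp -d) || exit 1
  trap 'rm -rf "$workdir"' EXIT
  "$@" "$workdir"
)
```

The function body runs in a subshell, so the `EXIT` trap fires as soon as the wrapped command finishes and does not affect the calling shell. A function that splits and processes the PDF inside the directory it receives could then be run as `with_workdir my_function`.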
Save the .md file in the same directory as the source PDF:
/path/to/document.pdf → /path/to/document.md
čs., použ.)
If the document contains handwritten annotations:
~~strikethrough~~ for crossed-out text
*italics* for handwritten additions
( and ) to highlight
Before saving, verify: