Production-ready PDF processing with forms, tables, OCR, validation, and batch operations...
Production-ready PDF processing toolkit with pre-built scripts, comprehensive error handling, and support for complex workflows.
import pdfplumber
with pdfplumber.open("document.pdf") as pdf:
text = pdf.pages[0].extract_text()
print(text)
python scripts/analyze_form.py input.pdf --output fields.json
# Returns: JSON with all form fields, types, and positions
python scripts/fill_form.py input.pdf data.json output.pdf
# Validates all fields before filling, includes error reporting
python scripts/extract_tables.py report.pdf --output tables.csv
# Extracts all tables with automatic column detection
All scripts include:
--help flag for all scriptsFor complete form workflows including:
See FORMS.md
For complex table extraction:
See TABLES.md
For scanned PDFs and image-based documents:
See OCR.md
analyze_form.py - Extract form field information
python scripts/analyze_form.py input.pdf [--output fields.json] [--verbose]
fill_form.py - Fill PDF forms with data
python scripts/fill_form.py input.pdf data.json output.pdf [--validate]
validate_form.py - Validate form data before filling
python scripts/validate_form.py data.json schema.json
extract_tables.py - Extract tables to CSV/Excel
python scripts/extract_tables.py input.pdf [--output tables.csv] [--format csv|excel]
extract_text.py - Extract text with formatting preservation
python scripts/extract_text.py input.pdf [--output text.txt] [--preserve-formatting]
merge_pdfs.py - Merge multiple PDFs
python scripts/merge_pdfs.py file1.pdf file2.pdf file3.pdf --output merged.pdf
split_pdf.py - Split PDF into individual pages
python scripts/split_pdf.py input.pdf --output-dir pages/
validate_pdf.py - Validate PDF integrity
python scripts/validate_pdf.py input.pdf
# 1. Analyze form structure
python scripts/analyze_form.py template.pdf --output schema.json
# 2. Validate submission data
python scripts/validate_form.py submission.json schema.json
# 3. Fill form
python scripts/fill_form.py template.pdf submission.json completed.pdf
# 4. Validate output
python scripts/validate_pdf.py completed.pdf
# 1. Extract tables
python scripts/extract_tables.py monthly_report.pdf --output data.csv
# 2. Extract text for analysis
python scripts/extract_text.py monthly_report.pdf --output report.txt
import glob
from pathlib import Path
import subprocess
# Process all PDFs in directory
for pdf_file in glob.glob("invoices/*.pdf"):
output_file = Path("processed") / Path(pdf_file).name
result = subprocess.run([
"python", "scripts/extract_text.py",
pdf_file,
"--output", str(output_file)
], capture_output=True)
if result.returncode == 0:
print(f"✓ Processed: {pdf_file}")
else:
print(f"✗ Failed: {pdf_file} - {result.stderr}")
All scripts follow consistent error patterns:
# Exit codes
# 0 - Success
# 1 - File not found
# 2 - Invalid input
# 3 - Processing error
# 4 - Validation error
# Example usage in automation
result = subprocess.run(["python", "scripts/fill_form.py", ...])
if result.returncode == 0:
print("Success")
elif result.returncode == 4:
print("Validation failed - check input data")
else:
print(f"Error occurred: {result.returncode}")
All scripts require:
pip install pdfplumber pypdf pillow pytesseract pandas
Optional for OCR:
# Install tesseract-ocr system package
# macOS: brew install tesseract
# Ubuntu: apt-get install tesseract-ocr
# Windows: Download from GitHub releases
--parallel flag (where supported)"Module not found" errors:
pip install -r requirements.txt
Tesseract not found:
# Install tesseract system package (see Dependencies)
Memory errors with large PDFs:
# Process page by page instead of loading entire PDF
with pdfplumber.open("large.pdf") as pdf:
for page in pdf.pages:
text = page.extract_text()
# Process page immediately
Permission errors:
chmod +x scripts/*.py
All scripts support --help:
python scripts/analyze_form.py --help
python scripts/extract_tables.py --help
For detailed documentation on specific topics, see: