    autojunjie/pdf-to-docx
    Productivity

    PDF to Word Converter

    Convert PDF pages to editable Word documents while preserving layout structure.

    Workflow

    1. Extract the PDF page as an image - Use pdfplumber to render the page at high resolution (200 DPI or more)
    2. Run OCR - Use tesseract to extract text from the image
    3. Create the Word document - Use python-docx to build a document with a matching layout
    4. Verify the result - Compare the generated document with the original PDF
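The verification step can be partly automated for the text content (layout still needs a visual check). The following is a minimal sketch using only the standard library; the helper names `normalize` and `text_similarity` are hypothetical and not part of the skill's scripts. It assumes you already have the text of the original page and of the generated document as strings.

```python
import difflib
import re


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so line wrapping does not affect the score."""
    return re.sub(r"\s+", " ", text).strip().lower()


def text_similarity(original: str, converted: str) -> float:
    """Return a ratio in [0, 1]; 1.0 means the normalized texts are identical."""
    matcher = difflib.SequenceMatcher(None, normalize(original), normalize(converted))
    return matcher.ratio()


if __name__ == "__main__":
    pdf_text = "Deep learning has\nchanged OCR."
    docx_text = "Deep learning has changed OCR."
    print(f"similarity: {text_similarity(pdf_text, docx_text):.2f}")
```

A low ratio suggests dropped or misrecognized text; a high ratio says nothing about column structure or formatting, which still have to be compared by eye.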

    Quick Start

    Extract a single page:

    python scripts/extract_pdf_page.py /path/to/document.pdf 1 -o /output/dir
    

    Create two-column Word document:

    python scripts/create_two_column_docx.py /output/dir/page1_text.txt output.docx \
      --title "Document Title" \
      --author "Author Name" \
      --page-number 1 \
      --total-pages 8
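For a multi-page document, the extraction command above can be repeated per page. This is a rough sketch that only builds and optionally runs the same command line shown in Quick Start; the function names are hypothetical, and it assumes the script takes 1-based page numbers, as the `1` argument and the `page1_text.txt` output name suggest.

```python
import subprocess
import sys


def build_extract_commands(pdf_path: str, out_dir: str, total_pages: int) -> list[list[str]]:
    """Build one extract_pdf_page.py command per page (assumes pages start at 1)."""
    return [
        [sys.executable, "scripts/extract_pdf_page.py", pdf_path, str(page), "-o", out_dir]
        for page in range(1, total_pages + 1)
    ]


def run_all(commands: list[list[str]]) -> None:
    """Run each command in order, stopping at the first failure."""
    for cmd in commands:
        subprocess.run(cmd, check=True)


if __name__ == "__main__":
    cmds = build_extract_commands("/path/to/document.pdf", "/output/dir", total_pages=8)
    for cmd in cmds:
        print(" ".join(cmd))
    # Uncomment once the paths above point at real files:
    # run_all(cmds)
```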
    

    Manual Workflow (for custom layouts)

    When the scripts don't produce the exact layout you need, follow this manual process:

    Step 1: Extract page as image

    import pdfplumber
    
    with pdfplumber.open("document.pdf") as pdf:
        page = pdf.pages[0]  # 0-indexed
        pil_image = page.to_image(resolution=200).original
        pil_image.save("page1.png", "PNG")
    

    Step 2: Run OCR

    tesseract page1.png page1_text -l eng
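The command above writes its result to page1_text.txt (tesseract appends the .txt extension to the output base name). If you prefer to drive this step from Python, a minimal sketch could look like the following; `build_tesseract_command` and `run_ocr` are hypothetical helper names, and the sketch simply shells out to the same CLI rather than using any Python OCR binding.

```python
import shutil
import subprocess


def build_tesseract_command(image_path: str, output_base: str, lang: str = "eng") -> list[str]:
    """Mirror `tesseract <image> <output_base> -l <lang>`."""
    return ["tesseract", image_path, output_base, "-l", lang]


def run_ocr(image_path: str, output_base: str, lang: str = "eng") -> str:
    """Run tesseract and return the text read back from <output_base>.txt."""
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract is not on PATH")
    subprocess.run(build_tesseract_command(image_path, output_base, lang), check=True)
    with open(f"{output_base}.txt", encoding="utf-8") as fh:
        return fh.read()
```

Calling `run_ocr("page1.png", "page1_text")` would then return the recognized text as a string, provided tesseract and the requested language data are installed.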
    

    Step 3: View original to understand layout

    Inspect the extracted image to identify:

    • Column structure (single, two-column, etc.)
    • Header/footer content
    • Section headers and formatting
    • Image/diagram placements
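Visual inspection remains the reliable way to judge the layout, but the column structure can sometimes be guessed from word positions. The sketch below is a crude heuristic, not part of the skill: it assumes you have horizontal word extents as `(x0, x1)` pairs (for a text-based PDF, pdfplumber's `page.extract_words()` returns dicts with `x0` and `x1` keys, and `page.width` gives the page width). The function name and both thresholds are illustrative assumptions.

```python
def looks_two_column(word_boxes, page_width, gutter_fraction=0.04, max_crossing=0.02):
    """Guess whether a page has two columns.

    word_boxes: iterable of (x0, x1) horizontal extents, one per word.
    A page is treated as two-column when words exist on both sides of a
    narrow band around the page's vertical center and almost none overlap it.
    """
    boxes = list(word_boxes)
    if not boxes:
        return False
    mid = page_width / 2
    half_gutter = page_width * gutter_fraction / 2
    lo, hi = mid - half_gutter, mid + half_gutter
    crossing = sum(1 for x0, x1 in boxes if x0 < hi and x1 > lo)
    left = sum(1 for x0, x1 in boxes if x1 <= lo)
    right = sum(1 for x0, x1 in boxes if x0 >= hi)
    return left > 0 and right > 0 and crossing / len(boxes) <= max_crossing
```

Full-width titles, centered headers, and figures that span both columns will confuse this check, so treat its answer as a hint only.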

    Step 4: Create Word document with python-docx

    from docx import Document
    from docx.shared import Pt, Cm
    from docx.enum.text import WD_ALIGN_PARAGRAPH
    from docx.oxml.ns import qn
    from docx.oxml import OxmlElement
    
    
    def remove_borders(cell):
        """Set all four borders of a table cell to 'nil' so the table is invisible."""
        tcPr = cell._tc.get_or_add_tcPr()
        tcBorders = OxmlElement('w:tcBorders')
        # Edge order follows the OOXML schema: top, left, bottom, right
        for edge in ('top', 'left', 'bottom', 'right'):
            el = OxmlElement(f'w:{edge}')
            el.set(qn('w:val'), 'nil')
            tcBorders.append(el)
        tcPr.append(tcBorders)
    
    
    doc = Document()
    
    # Set margins
    for section in doc.sections:
        section.top_margin = Cm(1.5)
        section.bottom_margin = Cm(1.5)
        section.left_margin = Cm(1.5)
        section.right_margin = Cm(1.5)
    
    # Two-column layout using a borderless table
    table = doc.add_table(rows=1, cols=2)
    for cell in table.rows[0].cells:
        remove_borders(cell)
        cell.width = Cm(8.5)
    
    # Add content to the left column. Each new cell already contains one
    # empty paragraph, so reuse it instead of adding a blank line above the text.
    left_cell = table.rows[0].cells[0]
    p = left_cell.paragraphs[0]
    p.alignment = WD_ALIGN_PARAGRAPH.JUSTIFY
    run = p.add_run("Content here...")
    run.font.size = Pt(9)
    
    doc.save("output.docx")
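OCR output usually keeps the hard line breaks of the scanned page, which look wrong once pasted into a Word paragraph. The sketch below shows one way to reflow such text before adding it to a cell; `ocr_text_to_paragraphs` is a hypothetical helper, and it assumes that blank lines separate paragraphs and that a trailing hyphen marks a word split across lines (which will also join genuinely hyphenated compounds).

```python
import re


def ocr_text_to_paragraphs(text: str) -> list[str]:
    """Split OCR output on blank lines and rejoin hard-wrapped lines in each paragraph."""
    paragraphs = []
    for block in re.split(r"\n\s*\n", text.strip()):
        lines = [line.strip() for line in block.splitlines() if line.strip()]
        if not lines:
            continue
        joined = lines[0]
        for line in lines[1:]:
            if joined.endswith("-"):
                # Assumption: a trailing hyphen is a line-break hyphenation.
                joined = joined[:-1] + line
            else:
                joined += " " + line
        paragraphs.append(joined)
    return paragraphs
```

Each returned string can then be written with `add_run` into its own paragraph of the appropriate column cell.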
    

    Layout Patterns

    Academic Paper (Two-Column)

    • Use borderless table with 2 columns
    • Column width: ~8.5 cm each
    • Font size: 9-10pt for body, 10-12pt for headers
    • Justified text alignment
    • Section headers in bold

    Single Column Document

    • Standard paragraph formatting
    • No table needed
    • Wider margins acceptable

    With Images/Diagrams

    • Mark image positions with placeholder text: [Figure X - See original PDF]
    • Images must be manually extracted and inserted

    Dependencies

    Required:

    • pdfplumber: PDF parsing and image extraction
    • pillow: Image processing
    • python-docx: Word document creation
    • tesseract: OCR engine; a system binary, not a pip package (on macOS, install via brew install tesseract)

    Install Python packages:

    pip install pdfplumber pillow python-docx
    # Or use uvx:
    uvx --with pdfplumber --with pillow --with python-docx python script.py
    

    Tips

    • Use a resolution of 200 DPI or higher for better OCR accuracy
    • For scanned PDFs, OCR is required
    • For text-based PDFs, pdfplumber can extract text directly
    • Compare final document with original to verify layout accuracy
    • Bold/italic formatting must be applied manually based on visual inspection
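To decide between direct extraction and OCR, one option is to try pdfplumber's `extract_text()` first and fall back to OCR when little or no text comes back. The following is a minimal sketch; `needs_ocr` and `extract_page_text` are hypothetical helpers, and the 20-character threshold is an arbitrary assumption rather than a recommended value.

```python
def needs_ocr(extracted_text, min_chars: int = 20) -> bool:
    """Treat a page as scanned when direct extraction yields almost no text."""
    return extracted_text is None or len(extracted_text.strip()) < min_chars


def extract_page_text(pdf_path: str, page_index: int = 0):
    """Return the text layer of one page (0-indexed) using pdfplumber."""
    import pdfplumber  # third-party; imported here so needs_ocr works without it

    with pdfplumber.open(pdf_path) as pdf:
        return pdf.pages[page_index].extract_text()
```

A typical use would be `text = extract_page_text("document.pdf")` followed by the image and tesseract steps only when `needs_ocr(text)` is true.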
    Repository
    autojunjie/pdf-to-word-skill