pdf-to-docx

autojunjie/pdf-to-docx

Productivity

1 installs


PDF to Word Converter

Convert PDF pages to editable Word documents while preserving layout structure.

Workflow

1. Extract PDF page as image - Use pdfplumber to render the page at high resolution
2. Run OCR - Use tesseract to extract text from the image
3. Create Word document - Use python-docx to build a document with a matching layout
4. Verify result - Compare the generated document with the original PDF

Quick Start

Extract a single page:

python scripts/extract_pdf_page.py /path/to/document.pdf 1 -o /output/dir

Create two-column Word document:

python scripts/create_two_column_docx.py /output/dir/page1_text.txt output.docx \
  --title "Document Title" \
  --author "Author Name" \
  --page-number 1 \
  --total-pages 8

Manual Workflow (for custom layouts)

When scripts don't match the exact layout needed, follow this manual process:

Step 1: Extract page as image

import pdfplumber

with pdfplumber.open("document.pdf") as pdf:
    page = pdf.pages[0]  # 0-indexed
    pil_image = page.to_image(resolution=200).original
    pil_image.save("page1.png", "PNG")
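Before rendering a page for OCR, it can be worth checking whether the page already carries a text layer, since pdfplumber can read that directly. The following is a minimal sketch of such a check; the helper names `page_needs_ocr` and `extract_text_or_flag` and the 20-character threshold are illustrative assumptions, not part of the skill's scripts.

```python
def page_needs_ocr(extracted_text, min_chars=20):
    """Guess whether a page lacks a usable text layer.

    extracted_text: the result of pdfplumber's page.extract_text(),
    which may be an empty string (or None in some releases).
    min_chars: hypothetical threshold; tune it for your documents.
    """
    text = (extracted_text or "").strip()
    return len(text) < min_chars


def extract_text_or_flag(pdf_path, page_index=0):
    """Return (text, needs_ocr) for one page of a PDF."""
    import pdfplumber  # imported lazily so the helper above has no dependencies

    with pdfplumber.open(pdf_path) as pdf:
        text = pdf.pages[page_index].extract_text() or ""
    return text, page_needs_ocr(text)
```

If `needs_ocr` comes back false, the extracted text can probably be used in place of the OCR output from Step 2, though the layout still has to be rebuilt by hand.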

Step 2: Run OCR

tesseract page1.png page1_text -l eng
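When the OCR step needs to run from Python rather than the shell, the same command can be wrapped with `subprocess`. This is a sketch under assumptions: the function names are hypothetical, and the optional `--psm` (page segmentation mode) flag is the spelling used by Tesseract 4 and later.

```python
import shutil
import subprocess


def build_tesseract_command(image_path, output_base, lang="eng", psm=None):
    """Build the argument list for the tesseract CLI.

    output_base is given without an extension; tesseract appends ".txt".
    psm is an optional page segmentation mode (an integer).
    """
    cmd = ["tesseract", str(image_path), str(output_base), "-l", lang]
    if psm is not None:
        cmd += ["--psm", str(psm)]
    return cmd


def run_ocr(image_path, output_base, lang="eng", psm=None):
    """Run tesseract and return the recognized text."""
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract not found on PATH")
    cmd = build_tesseract_command(image_path, output_base, lang, psm)
    subprocess.run(cmd, check=True)
    with open(f"{output_base}.txt", encoding="utf-8") as fh:
        return fh.read()
```

Calling `run_ocr("page1.png", "page1_text")` would mirror the shell command above and return the contents of `page1_text.txt`.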

Step 3: View original to understand layout

Inspect the rendered page image to identify:

Column structure (single, two-column, etc.)
Header/footer content
Section headers and formatting
Image/diagram placements
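For text-based PDFs, the column structure can be roughly estimated from word positions before looking at the image. The sketch below is a heuristic, not a layout analyzer: `guess_column_count` is a hypothetical helper that reports two columns when it finds a nearly empty vertical gutter in the middle 20% of the page, and both the bin count and the 10% gutter threshold are assumptions.

```python
def guess_column_count(word_boxes, page_width, bins=100, gutter_ratio=0.1):
    """Return 1 or 2 depending on whether a central gutter is nearly empty.

    word_boxes: iterable of (x0, x1) horizontal extents, one per word.
    page_width: page width in the same units as the boxes.
    """
    word_boxes = list(word_boxes)
    if not word_boxes:
        return 1

    # Count how many words cover each vertical slice of the page
    counts = [0] * bins
    for x0, x1 in word_boxes:
        first = max(0, min(bins - 1, int(x0 * bins / page_width)))
        last = max(0, min(bins - 1, int(x1 * bins / page_width)))
        for i in range(first, last + 1):
            counts[i] += 1

    lo, hi = int(bins * 0.4), int(bins * 0.6)
    centre = counts[lo:hi]
    has_left_text = max(counts[:lo]) > 0
    has_right_text = max(counts[hi:]) > 0
    gutter_found = min(centre) <= gutter_ratio * max(counts)
    return 2 if (gutter_found and has_left_text and has_right_text) else 1
```

With pdfplumber, the inputs could come from `[(w["x0"], w["x1"]) for w in page.extract_words()]` and `page.width`. Full-width titles or figures crossing the gutter can defeat the heuristic, so the result should still be confirmed against the page image.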

Step 4: Create Word document with python-docx

from docx import Document
from docx.shared import Pt, Cm
from docx.enum.text import WD_ALIGN_PARAGRAPH
from docx.oxml.ns import qn
from docx.oxml import OxmlElement


def remove_borders(cell):
    """Set the four outer borders of a table cell to 'nil' (no border)."""
    tcPr = cell._tc.get_or_add_tcPr()
    tcBorders = OxmlElement('w:tcBorders')
    # Child order follows the OOXML schema: top, left, bottom, right
    for edge in ('top', 'left', 'bottom', 'right'):
        el = OxmlElement(f'w:{edge}')
        el.set(qn('w:val'), 'nil')
        tcBorders.append(el)
    tcPr.append(tcBorders)


doc = Document()

# Set margins
for section in doc.sections:
    section.top_margin = Cm(1.5)
    section.bottom_margin = Cm(1.5)
    section.left_margin = Cm(1.5)
    section.right_margin = Cm(1.5)

# Two-column layout using a borderless one-row table
table = doc.add_table(rows=1, cols=2)
for cell in table.rows[0].cells:
    remove_borders(cell)
    cell.width = Cm(8.5)

# Add content to the left column. A new cell already holds one empty
# paragraph, so reuse it instead of adding a second one.
left_cell = table.rows[0].cells[0]
p = left_cell.paragraphs[0]
p.alignment = WD_ALIGN_PARAGRAPH.JUSTIFY
run = p.add_run("Content here...")
run.font.size = Pt(9)

doc.save("output.docx")
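The example above fills only the left cell. One way to populate both cells is to split the page's paragraphs at the point where roughly half of the characters have been placed. This is a sketch with hypothetical helper names; balancing by character count is an assumption and will not match the original column break exactly.

```python
def split_into_columns(paragraphs):
    """Split paragraphs into (left, right) with roughly equal character counts."""
    total = sum(len(p) for p in paragraphs)
    running = 0
    for index, paragraph in enumerate(paragraphs):
        running += len(paragraph)
        if running * 2 >= total:
            return paragraphs[: index + 1], paragraphs[index + 1 :]
    return list(paragraphs), []


def fill_two_column_table(table, paragraphs, font_size_pt=9):
    """Write paragraphs into the two cells of a one-row, two-column table."""
    from docx.enum.text import WD_ALIGN_PARAGRAPH  # lazy import: needs python-docx
    from docx.shared import Pt

    left, right = split_into_columns(paragraphs)
    for cell, column_paragraphs in zip(table.rows[0].cells, (left, right)):
        for position, text in enumerate(column_paragraphs):
            # Reuse the empty paragraph that every new cell starts with
            p = cell.paragraphs[0] if position == 0 else cell.add_paragraph()
            p.alignment = WD_ALIGN_PARAGRAPH.JUSTIFY
            p.add_run(text).font.size = Pt(font_size_pt)
```

Because a table does not flow text between its cells, the split point decides where the column break falls. If the original page breaks columns elsewhere, pass the two lists explicitly instead of relying on the balance heuristic.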

Layout Patterns

Academic Paper (Two-Column)

Use borderless table with 2 columns
Column width: ~8.5 cm each
Font size: 9-10pt for body, 10-12pt for headers
Justified text alignment
Section headers in bold

Single Column Document

Standard paragraph formatting
No table needed
Wider margins acceptable
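For a single-column document, most of the work is turning hard-wrapped OCR lines back into paragraphs. The sketch below assumes blank lines mark paragraph breaks and that a hyphen at a line end is a split word; `reflow_ocr_text` is a hypothetical helper name.

```python
import re


def reflow_ocr_text(raw_text):
    """Turn hard-wrapped OCR output into a list of paragraph strings.

    Blank lines are treated as paragraph breaks. A trailing hyphen at a
    line end is treated as a split word and rejoined, which can wrongly
    merge genuinely hyphenated compounds.
    """
    paragraphs = []
    for block in re.split(r"\n\s*\n", raw_text):
        lines = [line.strip() for line in block.splitlines() if line.strip()]
        if not lines:
            continue
        merged = lines[0]
        for line in lines[1:]:
            if merged.endswith("-"):
                merged = merged[:-1] + line
            else:
                merged += " " + line
        paragraphs.append(merged)
    return paragraphs
```

Each returned string can then be passed to `doc.add_paragraph(text)` on a python-docx `Document`.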

With Images/Diagrams

Mark image positions with placeholder text: [Figure X - See original PDF]
Images must be manually extracted and inserted
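The manual extraction step can be partly scripted when the PDF contains embedded raster images. The sketch below crops each image region that pdfplumber reports in `page.images` and saves it as a PNG. The helper names are hypothetical, and the clamping assumes the page origin is at (0, 0). Vector diagrams are not listed in `page.images`, and a scanned page is typically one image covering the whole page, so this helps only in some cases.

```python
def clamp_bbox(bbox, page_width, page_height):
    """Clip an (x0, top, x1, bottom) box to the page area."""
    x0, top, x1, bottom = bbox
    return (max(0, x0), max(0, top), min(page_width, x1), min(page_height, bottom))


def export_page_images(pdf_path, page_index, output_prefix, resolution=200):
    """Save each embedded image region of one page as a PNG; return the paths."""
    import pdfplumber  # lazy import: needs pdfplumber

    paths = []
    with pdfplumber.open(pdf_path) as pdf:
        page = pdf.pages[page_index]
        for number, image in enumerate(page.images, start=1):
            bbox = clamp_bbox(
                (image["x0"], image["top"], image["x1"], image["bottom"]),
                page.width,
                page.height,
            )
            if bbox[2] <= bbox[0] or bbox[3] <= bbox[1]:
                continue  # skip boxes that are empty after clamping
            path = f"{output_prefix}_figure{number}.png"
            page.crop(bbox).to_image(resolution=resolution).save(path)
            paths.append(path)
    return paths
```

A saved file can then replace its placeholder with python-docx, for example `doc.add_picture(path, width=Cm(8))` in the body or `cell.paragraphs[0].add_run().add_picture(path, width=Cm(8))` inside a table cell.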

Dependencies

Required:

pdfplumber: PDF parsing and image extraction
pillow: Image processing
python-docx: Word document creation
tesseract: OCR engine; a system binary, not a pip package (macOS: brew install tesseract; Debian/Ubuntu: apt install tesseract-ocr)

Install Python packages:

pip install pdfplumber pillow python-docx
# Or use uvx:
uvx --with pdfplumber --with pillow --with python-docx python script.py
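Since two of the pip package names differ from their import names (pillow imports as `PIL`, python-docx as `docx`), a quick check that everything is available can save a confusing failure later. This is a small sketch; `check_dependencies` is a hypothetical helper, not one of the skill's scripts.

```python
import importlib.util
import shutil

# pip package name -> import name
PYTHON_MODULES = {"pdfplumber": "pdfplumber", "pillow": "PIL", "python-docx": "docx"}


def check_dependencies():
    """Return {name: bool} for each required package and the tesseract binary."""
    status = {
        package: importlib.util.find_spec(module) is not None
        for package, module in PYTHON_MODULES.items()
    }
    status["tesseract"] = shutil.which("tesseract") is not None
    return status


if __name__ == "__main__":
    for name, available in check_dependencies().items():
        print(f"{name}: {'ok' if available else 'MISSING'}")
```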

Tips

Use a resolution of 200 DPI or higher for better OCR accuracy
For scanned PDFs, OCR is required
For text-based PDFs, pdfplumber can extract text directly with page.extract_text(), so OCR can be skipped
Compare final document with original to verify layout accuracy
Tesseract's plain-text output carries no font styling, so bold/italic formatting must be applied manually based on visual inspection
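The text side of the final comparison can be partly automated by checking how closely the text in the generated .docx matches the OCR output. The sketch below uses `difflib` from the standard library; the helper names are hypothetical, and the score says nothing about layout, which still needs a visual check.

```python
import difflib
import re


def normalize(text):
    """Lowercase and collapse whitespace so line wrapping is ignored."""
    return re.sub(r"\s+", " ", text).strip().lower()


def text_similarity(source_text, document_text):
    """Return a similarity ratio between 0 and 1 for two texts."""
    matcher = difflib.SequenceMatcher(
        None, normalize(source_text), normalize(document_text), autojunk=False
    )
    return matcher.ratio()


def docx_text(docx_path):
    """Collect paragraph text from the body and from table cells."""
    from docx import Document  # lazy import: needs python-docx

    doc = Document(docx_path)
    parts = [p.text for p in doc.paragraphs]
    for table in doc.tables:
        for row in table.rows:
            for cell in row.cells:
                parts.extend(p.text for p in cell.paragraphs)
    return "\n".join(parts)
```

A call such as `text_similarity(open("page1_text.txt", encoding="utf-8").read(), docx_text("output.docx"))` returning a value well below 1 suggests that text was dropped or duplicated while building the document.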
