Smithery Logo
MCPsSkillsDocsPricing
Login
Smithery Logo

Accelerating the Agent Economy

Resources

DocumentationPrivacy PolicySystem Status

Company

PricingAboutBlog

Connect

© 2026 Smithery. All rights reserved.

    trilogy-group

    pdf-processor

    trilogy-group/pdf-processor
    Data & Analytics
    3 installs

    About

    SKILL.md

    Install

    Install via Skills CLI

    or add to your agent
    • Claude Code
      Claude Code
    • Codex
      Codex
    • OpenClaw
      OpenClaw
    • Cursor
      Cursor
    • Amp
      Amp
    • GitHub Copilot
      GitHub Copilot
    • Gemini CLI
      Gemini CLI
    • Kilo Code
      Kilo Code
    • Junie
      Junie
    • Replit
      Replit
    • Windsurf
      Windsurf
    • Cline
      Cline
    • Continue
      Continue
    • OpenCode
      OpenCode
    • OpenHands
      OpenHands
    • Roo Code
      Roo Code
    • Augment
      Augment
    • Goose
      Goose
    • Trae
      Trae
    • Zencoder
      Zencoder
    • Antigravity
      Antigravity
    ├─
    ├─
    └─

    About

    PDF processing skill for extraction, analysis, and manipulation of PDF documents

    SKILL.md

    PDF Processor Skill

    Overview

    The PDF Processor skill provides comprehensive PDF document handling capabilities including extraction, analysis, manipulation, and generation of PDF files for documentation, reporting, and compliance purposes.

    Features

    1. Text Extraction

    • Extract plain text from PDFs
    • Preserve formatting and structure
    • Extract tables and structured data
    • Multi-language support
    • Handle encrypted PDFs

    2. OCR Processing

    • Convert scanned documents to text
    • Support for 100+ languages
    • Image preprocessing for better accuracy
    • Handwriting recognition
    • Layout analysis

    3. Metadata Operations

    • Extract document properties
    • Read/write custom metadata
    • Extract embedded files
    • Digital signature verification
    • Creation/modification date tracking

    4. PDF Manipulation

    • Merge multiple PDFs
    • Split PDFs by pages or bookmarks
    • Rotate pages
    • Crop and resize
    • Add watermarks and stamps

    5. Form Processing

    • Extract form fields
    • Fill PDF forms programmatically
    • Validate form data
    • Create fillable forms
    • Export form data to JSON/CSV

    6. Report Generation

    • Generate PDFs from templates
    • Create reports from data
    • Add charts and graphs
    • Include images and logos
    • Apply corporate branding

    Configuration

    {
      "pdf_processor": {
        "enabled": true,
        "ocr": {
          "enabled": true,
          "languages": ["eng", "fra", "deu", "spa"],
          "dpi": 300,
          "preprocessing": true
        },
        "extraction": {
          "preserve_formatting": true,
          "extract_images": true,
          "extract_tables": true,
          "extract_metadata": true
        },
        "security": {
          "allow_encrypted": true,
          "max_file_size_mb": 100,
          "sandbox_mode": true
        },
        "output": {
          "formats": ["text", "json", "html", "markdown"],
          "compression": true,
          "optimization": true
        },
        "performance": {
          "parallel_processing": true,
          "max_workers": 4,
          "cache_enabled": true
        }
      }
    }
    

    Usage Examples

    Text Extraction

    # Extract text from a PDF
    from pdf_processor import PDFExtractor
    
    extractor = PDFExtractor()
    text = extractor.extract_text('document.pdf')
    
    # Extract with formatting preserved
    formatted_text = extractor.extract_text(
        'document.pdf',
        preserve_formatting=True
    )
    
    # Extract specific pages
    page_text = extractor.extract_pages(
        'document.pdf',
        pages=[1, 3, 5]
    )
    

    OCR Processing

    # Process scanned PDF with OCR
    from pdf_processor import OCRProcessor
    
    ocr = OCRProcessor(languages=['eng', 'spa'])
    text = ocr.process_scanned_pdf('scanned.pdf')
    
    # With preprocessing for better accuracy
    text = ocr.process_scanned_pdf(
        'scanned.pdf',
        preprocess=True,
        deskew=True,
        denoise=True
    )
    

    Table Extraction

    # Extract tables from PDF
    from pdf_processor import TableExtractor
    
    extractor = TableExtractor()
    tables = extractor.extract_tables('report.pdf')
    
    for idx, table in enumerate(tables):
        # Convert to pandas DataFrame
        df = table.to_dataframe()
        # Export to CSV
        df.to_csv(f'table_{idx}.csv')
    

    PDF Generation

    # Generate PDF report from template
    from pdf_processor import ReportGenerator
    
    generator = ReportGenerator()
    
    data = {
        'title': 'Monthly DevOps Report',
        'date': '2024-01-15',
        'metrics': {
            'uptime': '99.9%',
            'deployments': 47,
            'incidents': 2
        },
        'charts': ['uptime_chart.png', 'deployment_trend.png']
    }
    
    generator.create_report(
        template='monthly_report_template.html',
        data=data,
        output='monthly_report.pdf'
    )
    

    Form Processing

    # Extract and fill PDF forms
    from pdf_processor import FormProcessor
    
    processor = FormProcessor()
    
    # Extract form fields
    fields = processor.extract_fields('form.pdf')
    print(f"Found {len(fields)} form fields")
    
    # Fill form with data
    form_data = {
        'name': 'John Doe',
        'email': 'john@example.com',
        'department': 'Engineering'
    }
    
    processor.fill_form(
        'form.pdf',
        form_data,
        output='filled_form.pdf'
    )
    

    PDF Manipulation

    # Merge multiple PDFs
    from pdf_processor import PDFManipulator
    
    manipulator = PDFManipulator()
    
    # Merge PDFs
    manipulator.merge_pdfs(
        ['doc1.pdf', 'doc2.pdf', 'doc3.pdf'],
        output='merged.pdf'
    )
    
    # Split PDF by pages
    manipulator.split_pdf(
        'large_document.pdf',
        pages_per_file=10,
        output_dir='split_docs/'
    )
    
    # Add watermark
    manipulator.add_watermark(
        'document.pdf',
        watermark='CONFIDENTIAL',
        output='watermarked.pdf',
        opacity=0.3
    )
    

    Integration Examples

    Compliance Report Generation

    # Generate compliance reports from audit data
    def generate_compliance_report(audit_data):
        generator = ReportGenerator()
        
        # Create PDF with audit findings
        report = generator.create_report(
            template='compliance_template.pdf',
            data={
                'audit_date': audit_data['date'],
                'findings': audit_data['findings'],
                'recommendations': audit_data['recommendations'],
                'compliance_score': audit_data['score']
            }
        )
        
        # Add digital signature
        report.sign(
            certificate='company_cert.p12',
            password='cert_password'
        )
        
        return report
    

    Documentation Processing Pipeline

    # Process technical documentation
    class DocProcessor:
        def process_documentation(self, pdf_path):
            # Extract text and metadata
            text = self.extract_text(pdf_path)
            metadata = self.extract_metadata(pdf_path)
            
            # Extract code snippets
            code_blocks = self.extract_code_blocks(text)
            
            # Extract diagrams and charts
            images = self.extract_images(pdf_path)
            
            # Generate searchable index
            index = self.create_search_index(text)
            
            # Convert to multiple formats
            self.export_to_markdown(text, 'docs.md')
            self.export_to_html(text, 'docs.html')
            
            return {
                'text': text,
                'metadata': metadata,
                'code_blocks': code_blocks,
                'images': images,
                'index': index
            }
    

    Advanced Features

    Batch Processing

    # Process multiple PDFs in parallel
    from pdf_processor import BatchProcessor
    
    processor = BatchProcessor(max_workers=4)
    
    # Define processing pipeline
    pipeline = [
        ('extract_text', {}),
        ('extract_tables', {}),
        ('extract_metadata', {})
    ]
    
    # Process all PDFs in directory
    results = processor.process_directory(
        'documents/',
        pipeline=pipeline,
        output_format='json'
    )
    

    Intelligent Data Extraction

    # Extract specific data using patterns
    from pdf_processor import IntelligentExtractor
    
    extractor = IntelligentExtractor()
    
    # Define extraction patterns
    patterns = {
        'invoice_number': r'Invoice #: (\d+)',
        'total_amount': r'Total: \$([\d,]+\.\d{2})',
        'date': r'Date: (\d{2}/\d{2}/\d{4})'
    }
    
    # Extract structured data
    data = extractor.extract_by_patterns(
        'invoice.pdf',
        patterns=patterns
    )
    

    Performance Optimization

    Caching Strategy

    # Enable caching for repeated operations
    from pdf_processor import CachedProcessor
    
    processor = CachedProcessor(
        cache_dir='/tmp/pdf_cache',
        ttl=3600  # Cache for 1 hour
    )
    
    # Subsequent calls use cache
    text1 = processor.extract_text('large_doc.pdf')  # Slow
    text2 = processor.extract_text('large_doc.pdf')  # Fast (cached)
    

    Memory Management

    # Stream processing for large PDFs
    from pdf_processor import StreamProcessor
    
    processor = StreamProcessor()
    
    # Process large PDF in chunks
    for chunk in processor.stream_pages('huge_document.pdf', chunk_size=10):
        # Process 10 pages at a time
        process_chunk(chunk)
    

    Error Handling

    # Robust error handling
    from pdf_processor import PDFProcessor, PDFError
    
    try:
        processor = PDFProcessor()
        result = processor.process('document.pdf')
    except PDFError.CorruptedFile as e:
        print(f"PDF is corrupted: {e}")
        # Attempt repair
        repaired = processor.repair_pdf('document.pdf')
    except PDFError.PasswordProtected as e:
        print(f"PDF is password protected")
        # Request password
        password = input("Enter PDF password: ")
        result = processor.process('document.pdf', password=password)
    except PDFError.UnsupportedFormat as e:
        print(f"Unsupported PDF format: {e}")
    

    Scripts Directory

    The PDF processor includes utility scripts in the scripts/ directory:

    pdf-extract.py

    #!/usr/bin/env python3
    # Extract text from PDFs via command line
    
    import argparse
    from pdf_processor import PDFExtractor
    
    def main():
        parser = argparse.ArgumentParser()
        parser.add_argument('input', help='Input PDF file')
        parser.add_argument('-o', '--output', help='Output file')
        parser.add_argument('--format', choices=['text', 'json', 'html'],
                           default='text')
        args = parser.parse_args()
        
        extractor = PDFExtractor()
        result = extractor.extract(args.input, format=args.format)
        
        if args.output:
            with open(args.output, 'w') as f:
                f.write(result)
        else:
            print(result)
    
    if __name__ == '__main__':
        main()
    

    pdf-merge.sh

    #!/bin/bash
    # Merge multiple PDFs
    
    if [ $# -lt 2 ]; then
        echo "Usage: $0 output.pdf input1.pdf input2.pdf ..."
        exit 1
    fi
    
    OUTPUT=$1
    shift
    
    python3 -c "
    from pdf_processor import PDFManipulator
    m = PDFManipulator()
    m.merge_pdfs(['$@'], '$OUTPUT')
    print(f'Merged {len(['$@'])} PDFs into $OUTPUT')
    "
    

    Troubleshooting

    Common Issues

    1. OCR Accuracy Issues

      • Solution: Increase DPI, enable preprocessing
      • Check language settings
    2. Memory Issues with Large PDFs

      • Solution: Use streaming mode
      • Process in chunks
    3. Corrupted PDF Files

      • Solution: Use repair function
      • Try alternative extraction methods
    4. Missing Dependencies

      • Install: pip install pypdf2 pdfplumber pytesseract
      • Install system deps: apt-get install tesseract-ocr poppler-utils

    Best Practices

    1. Always validate input PDFs
    2. Use appropriate error handling
    3. Enable caching for repeated operations
    4. Stream large files instead of loading into memory
    5. Sanitize user-uploaded PDFs
    6. Respect PDF permissions and DRM
    7. Optimize PDFs after manipulation
    8. Use async processing for web applications
    Repository
    trilogy-group/swarm-claude-plugin
    Files