Smithery Logo
MCPsSkillsDocsPricing
Login
Smithery Logo

Give agents more agency

Resources

DocumentationPrivacy PolicySystem Status

Company

PricingAboutBlog

Connect

© 2026 Smithery. All rights reserved.

    Harryoung

    excel-parser

    Harryoung/excel-parser
    Data & Analytics
    83
    4 installs

    About

    SKILL.md

    Install

    Install via Skills CLI

    or add to your agent
    • Claude Code
      Claude Code
    • Codex
      Codex
    • OpenClaw
      OpenClaw
    • Cursor
      Cursor
    • Amp
      Amp
    • GitHub Copilot
      GitHub Copilot
    • Gemini CLI
      Gemini CLI
    • Kilo Code
      Kilo Code
    • Junie
      Junie
    • Replit
      Replit
    • Windsurf
      Windsurf
    • Cline
      Cline
    • Continue
      Continue
    • OpenCode
      OpenCode
    • OpenHands
      OpenHands
    • Roo Code
      Roo Code
    • Augment
      Augment
    • Goose
      Goose
    • Trae
      Trae
    • Zencoder
      Zencoder
    • Antigravity
      Antigravity
    ├─
    ├─
    └─

    About

    Smart Excel/CSV file parsing with intelligent routing based on file complexity analysis...

    SKILL.md

    Excel Parser

    Table of Contents

    • Overview
    • Core Philosophy: Scout Pattern
    • When to Use This Skill
    • Processing Workflow
    • Complexity Scoring Rules
    • Path A: Pandas Standard Mode
    • Path B: HTML Semantic Mode
    • Best Practices
    • Troubleshooting
    • Dependencies
    • Resources

    Overview

    Provide intelligent routing strategies for parsing Excel/CSV files by analyzing complexity and choosing the optimal processing path. The skill implements a "Scout Pattern" that scans file metadata before processing to balance speed (Pandas) with accuracy (semantic extraction).

    Core Philosophy: Scout Pattern

    Before processing data, deploy a lightweight "scout" to analyze file metadata and make intelligent routing decisions:

    1. Metadata Scanning - Use openpyxl to scan file structure without loading data
    2. Complexity Scoring - Calculate score based on merged cells, row count, and layout
    3. Path Selection - Choose between Pandas (fast) or HTML (accurate) processing
    4. Optimized Execution - Execute with the most appropriate tool for the file type

    Key Principle: "LLM handles metadata decisions, Pandas/HTML processes bulk data"

    When to Use This Skill

    Use excel-parser when:

    • Processing Excel/CSV files with unknown structure or varying complexity
    • Handling files ranging from simple data tables to complex financial reports
    • Need to optimize between processing speed and extraction accuracy
    • Working with files that may contain merged cells, multi-level headers, or irregular layouts

    Skip this skill when:

    • File structure is already known and documented
    • Processing simple, well-structured tables with confirmed format
    • Using predefined scripts for specific file formats

    Processing Workflow

    Step 1: Analyze File Complexity

    Use the scripts/complexity_analyzer.py to scan file metadata:

    python scripts/complexity_analyzer.py <file_path> [sheet_name]
    

    What it analyzes (without loading data):

    • Merged cell distribution (shallow vs deep in the table)
    • Row count and data continuity
    • Empty row interruptions (indicates multi-table layouts)

    Output (JSON format):

    {
      "is_complex": false,
      "recommended_strategy": "pandas",
      "reasons": ["No deep merges detected", "Row count exceeds 1000, forcing Pandas mode"],
      "stats": {
        "total_rows": 5000,
        "deep_merges": 0,
        "empty_interruptions": 0
      }
    }
    

    Step 2: Route to Optimal Strategy

    Based on complexity analysis:

    • is_complex = false → Use Path A (Pandas Standard Mode)
    • is_complex = true → Use Path B (HTML Semantic Mode)

    Step 3: Execute Processing

    Follow the selected path's workflow to extract data.

    Complexity Scoring Rules

    Rule 1: Deep Merged Cells

    • Condition: Merged cells appearing beyond row 5
    • Interpretation: Complex table structure (not just header formatting)
    • Decision: Mark as complex if >2 deep merges detected
    • Example: Financial reports with merged category labels in data region

    Rule 2: Empty Row Interruptions

    • Condition: Multiple empty rows within the table
    • Interpretation: Multiple sub-tables in single sheet
    • Decision: Mark as complex if >2 empty row interruptions found
    • Example: Summary table + detail table in one sheet

    Rule 3: Row Count Override

    • Condition: Total rows >1000
    • Interpretation: Too large for HTML processing (token explosion)
    • Decision: Force Pandas mode regardless of complexity
    • Rationale: HTML conversion would exceed token limits

    Rule 4: Default (Standard Table)

    • Condition: No deep merges, continuous data, moderate size
    • Interpretation: Standard data table
    • Decision: Use Pandas for optimal speed

    Path A: Pandas Standard Mode

    When: Simple/large tables (most common case)

    Strategy: Agent analyzes ONLY the first 20 rows to determine header position, then use Pandas to read full data at native speed.

    Workflow:

    1. Sample First 20 Rows

      • Read only the first 20 rows using pd.read_excel(..., nrows=20)
      • Convert to CSV format for analysis
    2. Determine Header Position

      • Examine the sampled rows to identify which row contains the actual column headers
      • Common patterns: Row 0 (standard), Row 1-2 (if title rows exist), Row with distinct column names
    3. Read Full Data

      • Use pd.read_excel(..., header=<detected_row>) to load complete data
      • The header parameter ensures proper column naming

    Token Cost: ~500 tokens (only 20 rows analyzed) Processing Speed: Very fast (Pandas native speed)

    For implementation details, see references/smart_excel_router.py

    Path B: HTML Semantic Mode

    When: Complex/irregular tables (merged cells, multi-level headers)

    Strategy: Convert to semantic HTML preserving structure (rowspan/colspan), then extract data understanding the visual layout.

    Workflow:

    1. Convert to Semantic HTML

      • Load workbook with openpyxl
      • Build HTML table preserving merged cell spans
      • Use rowspan and colspan attributes to maintain structure
    2. Extract Structured Data

      • Analyze HTML table structure
      • Identify hierarchical headers from merged cells
      • Extract data preserving semantic relationships

    Token Cost: Higher (full HTML structure analyzed) Processing Speed: Slower (semantic extraction) Use Case: Only for small (<1000 rows), complex files where Pandas would fail

    For implementation details, see references/smart_excel_router.py

    Best Practices

    1. Trust the Scout

    Always run complexity analysis before processing. The metadata scan is fast (<1 second) and prevents wasted effort on wrong approach.

    2. Respect the Row Count Rule

    Never attempt HTML mode on files >1000 rows. Token limits will cause failures.

    3. Pandas First for Unknown Files

    When in doubt, try Pandas mode first. It fails fast and clearly when structure is incompatible.

    4. Cache Analysis Results

    If processing multiple sheets from same file, run analysis once and cache results.

    5. Preserve Original Files

    Never modify the original Excel file during analysis or processing.

    Troubleshooting

    File Cannot Be Opened

    • Symptom: FileNotFoundError or permission errors
    • Causes: Invalid path, file locked by another process, insufficient permissions
    • Solutions:
      • Verify file path is correct and file exists
      • Close the file if open in Excel or another application
      • Check read permissions on the file

    Corrupted File Errors

    • Symptom: BadZipFile or InvalidFileException
    • Causes: Incomplete download, file corruption, wrong file extension
    • Solutions:
      • Re-download or obtain fresh copy of the file
      • Verify file is actual Excel format (not CSV with .xlsx extension)
      • Try opening in Excel to confirm file integrity

    Memory Issues with Large Files

    • Symptom: MemoryError or system slowdown
    • Causes: File too large for available RAM
    • Solutions:
      • Use read_only=True mode in openpyxl
      • Process file in chunks using Pandas chunksize parameter
      • Increase system memory or use machine with more RAM

    Encoding Problems

    • Symptom: Garbled text or UnicodeDecodeError
    • Causes: Non-UTF8 encoding in source data
    • Solutions:
      • Specify encoding when reading CSV: pd.read_csv(..., encoding='gbk')
      • For Excel, data is usually UTF-8; check source data generation

    HTML Mode Token Overflow

    • Symptom: Truncated output or API errors
    • Causes: Complex file exceeds token limits despite row count check
    • Solutions:
      • Force Pandas mode even for complex files
      • Split sheet into smaller ranges and process separately
      • Extract only essential columns before HTML conversion

    Incorrect Header Detection

    • Symptom: Wrong columns or data shifted
    • Causes: Unusual header patterns not caught by sampling
    • Solutions:
      • Manually specify header row if known
      • Increase sample size beyond 20 rows
      • Use HTML mode for better structure understanding

    Dependencies

    Required Python packages:

    • openpyxl - Metadata scanning and Excel file manipulation
    • pandas - High-speed data reading and manipulation

    Resources

    This skill includes:

    • scripts/complexity_analyzer.py - Standalone executable for complexity analysis
    • references/smart_excel_router.py - Complete implementation reference with both processing paths
    Recommended Servers
    Local Model Suitability MCP
    Local Model Suitability MCP
    InfraNodus Knowledge Graphs & Text Analysis
    InfraNodus Knowledge Graphs & Text Analysis
    Bizfile MCP — Global Company Intelligence
    Bizfile MCP — Global Company Intelligence
    Repository
    harryoung/efka
    Files