excel-parser

Harryoung/excel-parser

Data & Analytics

4 installs

About

SKILL.md

excel-parser

Harryoung/excel-parser

Data & Analytics

4 installs

About

Smart Excel/CSV file parsing with intelligent routing based on file complexity analysis...

SKILL.md

Excel Parser

Overview
Core Philosophy: Scout Pattern
When to Use This Skill
Processing Workflow
Complexity Scoring Rules
Path A: Pandas Standard Mode
Path B: HTML Semantic Mode
Best Practices
Troubleshooting
Dependencies
Resources

Overview

Provide intelligent routing strategies for parsing Excel/CSV files by analyzing complexity and choosing the optimal processing path. The skill implements a "Scout Pattern" that scans file metadata before processing to balance speed (Pandas) with accuracy (semantic extraction).

Core Philosophy: Scout Pattern

Before processing data, deploy a lightweight "scout" to analyze file metadata and make intelligent routing decisions:

Metadata Scanning - Use openpyxl to scan file structure without loading data
Complexity Scoring - Calculate score based on merged cells, row count, and layout
Path Selection - Choose between Pandas (fast) or HTML (accurate) processing
Optimized Execution - Execute with the most appropriate tool for the file type

Key Principle: "LLM handles metadata decisions, Pandas/HTML processes bulk data"

When to Use This Skill

Use excel-parser when:

Processing Excel/CSV files with unknown structure or varying complexity
Handling files ranging from simple data tables to complex financial reports
Need to optimize between processing speed and extraction accuracy
Working with files that may contain merged cells, multi-level headers, or irregular layouts

Skip this skill when:

File structure is already known and documented
Processing simple, well-structured tables with confirmed format
Using predefined scripts for specific file formats

Processing Workflow

Step 1: Analyze File Complexity

Use the scripts/complexity_analyzer.py to scan file metadata:

python scripts/complexity_analyzer.py <file_path> [sheet_name]

What it analyzes (without loading data):

Merged cell distribution (shallow vs deep in the table)
Row count and data continuity
Empty row interruptions (indicates multi-table layouts)

Output (JSON format):

{
  "is_complex": false,
  "recommended_strategy": "pandas",
  "reasons": ["No deep merges detected", "Row count exceeds 1000, forcing Pandas mode"],
  "stats": {
    "total_rows": 5000,
    "deep_merges": 0,
    "empty_interruptions": 0
  }
}

Step 2: Route to Optimal Strategy

Based on complexity analysis:

is_complex = false → Use Path A (Pandas Standard Mode)
is_complex = true → Use Path B (HTML Semantic Mode)

Step 3: Execute Processing

Follow the selected path's workflow to extract data.

Complexity Scoring Rules

Rule 1: Deep Merged Cells

Condition: Merged cells appearing beyond row 5
Interpretation: Complex table structure (not just header formatting)
Decision: Mark as complex if >2 deep merges detected
Example: Financial reports with merged category labels in data region

Rule 2: Empty Row Interruptions

Condition: Multiple empty rows within the table
Interpretation: Multiple sub-tables in single sheet
Decision: Mark as complex if >2 empty row interruptions found
Example: Summary table + detail table in one sheet

Rule 3: Row Count Override

Condition: Total rows >1000
Interpretation: Too large for HTML processing (token explosion)
Decision: Force Pandas mode regardless of complexity
Rationale: HTML conversion would exceed token limits

Rule 4: Default (Standard Table)

Condition: No deep merges, continuous data, moderate size
Interpretation: Standard data table
Decision: Use Pandas for optimal speed

Path A: Pandas Standard Mode

When: Simple/large tables (most common case)

Strategy: Agent analyzes ONLY the first 20 rows to determine header position, then use Pandas to read full data at native speed.

Workflow:

Sample First 20 Rows
- Read only the first 20 rows using pd.read_excel(..., nrows=20)
- Convert to CSV format for analysis
Determine Header Position
- Examine the sampled rows to identify which row contains the actual column headers
- Common patterns: Row 0 (standard), Row 1-2 (if title rows exist), Row with distinct column names
Read Full Data
- Use pd.read_excel(..., header=<detected_row>) to load complete data
- The header parameter ensures proper column naming

Token Cost: ~500 tokens (only 20 rows analyzed) Processing Speed: Very fast (Pandas native speed)

For implementation details, see references/smart_excel_router.py

Path B: HTML Semantic Mode

When: Complex/irregular tables (merged cells, multi-level headers)

Strategy: Convert to semantic HTML preserving structure (rowspan/colspan), then extract data understanding the visual layout.

Workflow:

Convert to Semantic HTML
- Load workbook with openpyxl
- Build HTML table preserving merged cell spans
- Use rowspan and colspan attributes to maintain structure
Extract Structured Data
- Analyze HTML table structure
- Identify hierarchical headers from merged cells
- Extract data preserving semantic relationships

Token Cost: Higher (full HTML structure analyzed) Processing Speed: Slower (semantic extraction) Use Case: Only for small (<1000 rows), complex files where Pandas would fail

For implementation details, see references/smart_excel_router.py

Best Practices

1. Trust the Scout

Always run complexity analysis before processing. The metadata scan is fast (<1 second) and prevents wasted effort on wrong approach.

2. Respect the Row Count Rule

Never attempt HTML mode on files >1000 rows. Token limits will cause failures.

3. Pandas First for Unknown Files

When in doubt, try Pandas mode first. It fails fast and clearly when structure is incompatible.

4. Cache Analysis Results

If processing multiple sheets from same file, run analysis once and cache results.

5. Preserve Original Files

Never modify the original Excel file during analysis or processing.

Troubleshooting

File Cannot Be Opened

Symptom: FileNotFoundError or permission errors
Causes: Invalid path, file locked by another process, insufficient permissions
Solutions:
- Verify file path is correct and file exists
- Close the file if open in Excel or another application
- Check read permissions on the file

Corrupted File Errors

Symptom: BadZipFile or InvalidFileException
Causes: Incomplete download, file corruption, wrong file extension
Solutions:
- Re-download or obtain fresh copy of the file
- Verify file is actual Excel format (not CSV with .xlsx extension)
- Try opening in Excel to confirm file integrity

Memory Issues with Large Files

Symptom: MemoryError or system slowdown
Causes: File too large for available RAM
Solutions:
- Use read_only=True mode in openpyxl
- Process file in chunks using Pandas chunksize parameter
- Increase system memory or use machine with more RAM

Encoding Problems

Symptom: Garbled text or UnicodeDecodeError
Causes: Non-UTF8 encoding in source data
Solutions:
- Specify encoding when reading CSV: pd.read_csv(..., encoding='gbk')
- For Excel, data is usually UTF-8; check source data generation

HTML Mode Token Overflow

Symptom: Truncated output or API errors
Causes: Complex file exceeds token limits despite row count check
Solutions:
- Force Pandas mode even for complex files
- Split sheet into smaller ranges and process separately
- Extract only essential columns before HTML conversion

Incorrect Header Detection

Symptom: Wrong columns or data shifted
Causes: Unusual header patterns not caught by sampling
Solutions:
- Manually specify header row if known
- Increase sample size beyond 20 rows
- Use HTML mode for better structure understanding

Dependencies

Required Python packages:

openpyxl - Metadata scanning and Excel file manipulation
pandas - High-speed data reading and manipulation

Resources

This skill includes:

scripts/complexity_analyzer.py - Standalone executable for complexity analysis
references/smart_excel_router.py - Complete implementation reference with both processing paths

About

SKILL.md

About

Smart Excel/CSV file parsing with intelligent routing based on file complexity analysis...

SKILL.md

Excel Parser

Overview
Core Philosophy: Scout Pattern
When to Use This Skill
Processing Workflow
Complexity Scoring Rules
Path A: Pandas Standard Mode
Path B: HTML Semantic Mode
Best Practices
Troubleshooting
Dependencies
Resources

Overview

Core Philosophy: Scout Pattern

Before processing data, deploy a lightweight "scout" to analyze file metadata and make intelligent routing decisions:

Metadata Scanning - Use openpyxl to scan file structure without loading data
Complexity Scoring - Calculate score based on merged cells, row count, and layout
Path Selection - Choose between Pandas (fast) or HTML (accurate) processing
Optimized Execution - Execute with the most appropriate tool for the file type

Key Principle: "LLM handles metadata decisions, Pandas/HTML processes bulk data"

When to Use This Skill

Use excel-parser when:

Processing Excel/CSV files with unknown structure or varying complexity
Handling files ranging from simple data tables to complex financial reports
Need to optimize between processing speed and extraction accuracy
Working with files that may contain merged cells, multi-level headers, or irregular layouts

Skip this skill when:

File structure is already known and documented
Processing simple, well-structured tables with confirmed format
Using predefined scripts for specific file formats

Processing Workflow

Step 1: Analyze File Complexity

Use the scripts/complexity_analyzer.py to scan file metadata:

python scripts/complexity_analyzer.py <file_path> [sheet_name]

What it analyzes (without loading data):

Merged cell distribution (shallow vs deep in the table)
Row count and data continuity
Empty row interruptions (indicates multi-table layouts)

Output (JSON format):

{
  "is_complex": false,
  "recommended_strategy": "pandas",
  "reasons": ["No deep merges detected", "Row count exceeds 1000, forcing Pandas mode"],
  "stats": {
    "total_rows": 5000,
    "deep_merges": 0,
    "empty_interruptions": 0
  }
}

Step 2: Route to Optimal Strategy

Based on complexity analysis:

is_complex = false → Use Path A (Pandas Standard Mode)
is_complex = true → Use Path B (HTML Semantic Mode)

Step 3: Execute Processing

Follow the selected path's workflow to extract data.

Complexity Scoring Rules

Rule 1: Deep Merged Cells

Condition: Merged cells appearing beyond row 5
Interpretation: Complex table structure (not just header formatting)
Decision: Mark as complex if >2 deep merges detected
Example: Financial reports with merged category labels in data region

Rule 2: Empty Row Interruptions

Condition: Multiple empty rows within the table
Interpretation: Multiple sub-tables in single sheet
Decision: Mark as complex if >2 empty row interruptions found
Example: Summary table + detail table in one sheet

Rule 3: Row Count Override

Condition: Total rows >1000
Interpretation: Too large for HTML processing (token explosion)
Decision: Force Pandas mode regardless of complexity
Rationale: HTML conversion would exceed token limits

Rule 4: Default (Standard Table)

Condition: No deep merges, continuous data, moderate size
Interpretation: Standard data table
Decision: Use Pandas for optimal speed

Path A: Pandas Standard Mode

When: Simple/large tables (most common case)

Strategy: Agent analyzes ONLY the first 20 rows to determine header position, then use Pandas to read full data at native speed.

Workflow:

Sample First 20 Rows
- Read only the first 20 rows using pd.read_excel(..., nrows=20)
- Convert to CSV format for analysis
Determine Header Position
- Examine the sampled rows to identify which row contains the actual column headers
- Common patterns: Row 0 (standard), Row 1-2 (if title rows exist), Row with distinct column names
Read Full Data
- Use pd.read_excel(..., header=<detected_row>) to load complete data
- The header parameter ensures proper column naming

Token Cost: ~500 tokens (only 20 rows analyzed) Processing Speed: Very fast (Pandas native speed)

For implementation details, see references/smart_excel_router.py

Path B: HTML Semantic Mode

When: Complex/irregular tables (merged cells, multi-level headers)

Strategy: Convert to semantic HTML preserving structure (rowspan/colspan), then extract data understanding the visual layout.

Workflow:

Convert to Semantic HTML
- Load workbook with openpyxl
- Build HTML table preserving merged cell spans
- Use rowspan and colspan attributes to maintain structure
Extract Structured Data
- Analyze HTML table structure
- Identify hierarchical headers from merged cells
- Extract data preserving semantic relationships

Token Cost: Higher (full HTML structure analyzed) Processing Speed: Slower (semantic extraction) Use Case: Only for small (<1000 rows), complex files where Pandas would fail

For implementation details, see references/smart_excel_router.py

Best Practices

1. Trust the Scout

Always run complexity analysis before processing. The metadata scan is fast (<1 second) and prevents wasted effort on wrong approach.

2. Respect the Row Count Rule

Never attempt HTML mode on files >1000 rows. Token limits will cause failures.

3. Pandas First for Unknown Files

When in doubt, try Pandas mode first. It fails fast and clearly when structure is incompatible.

4. Cache Analysis Results

If processing multiple sheets from same file, run analysis once and cache results.

5. Preserve Original Files

Never modify the original Excel file during analysis or processing.

Troubleshooting

File Cannot Be Opened

Symptom: FileNotFoundError or permission errors
Causes: Invalid path, file locked by another process, insufficient permissions
Solutions:
- Verify file path is correct and file exists
- Close the file if open in Excel or another application
- Check read permissions on the file

Corrupted File Errors

Symptom: BadZipFile or InvalidFileException
Causes: Incomplete download, file corruption, wrong file extension
Solutions:
- Re-download or obtain fresh copy of the file
- Verify file is actual Excel format (not CSV with .xlsx extension)
- Try opening in Excel to confirm file integrity

Memory Issues with Large Files

Symptom: MemoryError or system slowdown
Causes: File too large for available RAM
Solutions:
- Use read_only=True mode in openpyxl
- Process file in chunks using Pandas chunksize parameter
- Increase system memory or use machine with more RAM

Encoding Problems

Symptom: Garbled text or UnicodeDecodeError
Causes: Non-UTF8 encoding in source data
Solutions:
- Specify encoding when reading CSV: pd.read_csv(..., encoding='gbk')
- For Excel, data is usually UTF-8; check source data generation

HTML Mode Token Overflow

Symptom: Truncated output or API errors
Causes: Complex file exceeds token limits despite row count check
Solutions:
- Force Pandas mode even for complex files
- Split sheet into smaller ranges and process separately
- Extract only essential columns before HTML conversion

Incorrect Header Detection

Symptom: Wrong columns or data shifted
Causes: Unusual header patterns not caught by sampling
Solutions:
- Manually specify header row if known
- Increase sample size beyond 20 rows
- Use HTML mode for better structure understanding

Dependencies

Required Python packages:

openpyxl - Metadata scanning and Excel file manipulation
pandas - High-speed data reading and manipulation

Resources

This skill includes:

scripts/complexity_analyzer.py - Standalone executable for complexity analysis
references/smart_excel_router.py - Complete implementation reference with both processing paths

excel-parser

About

SKILL.md

excel-parser

About

SKILL.md

Excel Parser

Table of Contents

Overview

Core Philosophy: Scout Pattern

When to Use This Skill

Processing Workflow

Step 1: Analyze File Complexity

Step 2: Route to Optimal Strategy

Step 3: Execute Processing

Complexity Scoring Rules

Rule 1: Deep Merged Cells

Rule 2: Empty Row Interruptions

Rule 3: Row Count Override

Rule 4: Default (Standard Table)

Path A: Pandas Standard Mode

Path B: HTML Semantic Mode

Best Practices

1. Trust the Scout

2. Respect the Row Count Rule

3. Pandas First for Unknown Files

4. Cache Analysis Results

5. Preserve Original Files

Troubleshooting

File Cannot Be Opened

Corrupted File Errors

Memory Issues with Large Files

Encoding Problems

HTML Mode Token Overflow

Incorrect Header Detection

Dependencies

Resources

About

SKILL.md

About

SKILL.md

Excel Parser

Table of Contents

Overview

Core Philosophy: Scout Pattern

When to Use This Skill

Processing Workflow

Step 1: Analyze File Complexity

Step 2: Route to Optimal Strategy

Step 3: Execute Processing

Complexity Scoring Rules

Rule 1: Deep Merged Cells

Rule 2: Empty Row Interruptions

Rule 3: Row Count Override

Rule 4: Default (Standard Table)

Path A: Pandas Standard Mode

Path B: HTML Semantic Mode

Best Practices

1. Trust the Scout

2. Respect the Row Count Rule

3. Pandas First for Unknown Files

4. Cache Analysis Results

5. Preserve Original Files

Troubleshooting

File Cannot Be Opened

Corrupted File Errors

Memory Issues with Large Files

Encoding Problems

HTML Mode Token Overflow

Incorrect Header Detection

Dependencies

Resources