Bug Investigation Skill
When to Use
Use this skill when:
- Debugging extraction failures
- Investigating classification errors
- Analyzing search performance issues
- Troubleshooting database problems
Debugging Workflow
1. Reproduce the Issue
- Identify failing PDF or operation
- Reproduce in isolation
- Collect error messages
2. Check Extraction Metrics
Use metrics.py to check:
- Extraction success rate
- Methods used (pdfplumber vs OCR)
- Error patterns
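The internals of metrics.py are not shown here, so the following is only a sketch of the kind of summary this step looks for. It assumes each processed file yields a record with the hypothetical keys "method" and "error"; adapt the names to whatever metrics.py actually stores.

```python
from collections import Counter


def summarize_extraction(records):
    """Summarize extraction outcomes from a list of per-file records.

    Each record is assumed (hypothetically) to be a dict with:
      "method": "pdfplumber", "ocr", or None when extraction failed
      "error":  an error message string, or None on success
    """
    total = len(records)
    ok = [r for r in records if r["error"] is None]
    methods = Counter(r["method"] for r in ok)
    errors = Counter(r["error"] for r in records if r["error"] is not None)
    return {
        "success_rate": len(ok) / total if total else 0.0,
        "methods": dict(methods),
        "top_errors": errors.most_common(3),  # most frequent messages first
    }
```

A low success rate combined with one dominant error message usually points to a single root cause rather than many unrelated ones.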
3. Review Logs
- Error messages in console
- Database error logs
- Processing statistics
Common Bug Patterns
PDF Extraction Failures
Symptoms:
- "Sem texto extraível" ("No extractable text") message
- Empty content in database
- OCR not triggered when needed
Investigation:
- Check whether the PDF is scanned (pages are images with no text layer)
- Verify OCR is installed and working
- Test extraction manually
- Check file permissions
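One way to test extraction manually is to look at each page's text and image count with pdfplumber. This is a sketch, not the project's own check: the function names and the min_chars threshold are assumptions chosen for illustration.

```python
def looks_scanned(page_texts, image_counts, min_chars=20):
    """Heuristic: a PDF is probably scanned if its pages carry images
    but almost no extractable text. min_chars is an arbitrary threshold."""
    total_chars = sum(len((t or "").strip()) for t in page_texts)
    return total_chars < min_chars and sum(image_counts) > 0


def inspect_pdf(path):
    """Collect per-page text and image counts with pdfplumber."""
    import pdfplumber  # imported here so looks_scanned works without it

    with pdfplumber.open(path) as pdf:
        texts = [page.extract_text() for page in pdf.pages]
        images = [len(page.images) for page in pdf.pages]
    return texts, images
```

Typical use would be `texts, images = inspect_pdf("sample.pdf")` followed by `looks_scanned(texts, images)`. If it returns True and OCR was not used, the OCR fallback is the next place to look.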
Classification Errors
Symptoms:
- Documents classified as "outros"
- Incorrect contract number extraction
- Missing document numbers
Investigation:
- Check filename pattern
- Test regex patterns
- Verify classification logic
- Review expected vs actual output
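Comparing expected and actual output is easier with a small harness. The sketch below takes the classification function as a parameter, since its real name and signature are not given here; it is assumed to accept a filename and return a label.

```python
def check_classification(classify, cases):
    """Run classify(filename) on each case and return the mismatches.

    cases is a list of (filename, expected_label) pairs.
    Each mismatch is reported as (filename, expected, actual).
    """
    mismatches = []
    for filename, expected in cases:
        actual = classify(filename)
        if actual != expected:
            mismatches.append((filename, expected, actual))
    return mismatches
```

An empty result means every case matched. Filenames that come back as "outros" when another label was expected are good candidates for a missing pattern.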
Database Issues
Symptoms:
- Duplicate key errors
- FTS5 index not updating
- Missing data in results
Investigation:
- Check filepath uniqueness
- Verify triggers are working
- Test queries directly
- Check database schema
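Two of these checks can be run directly against the SQLite file. The sketch assumes a table named documents with a filepath column; both names are hypothetical and are interpolated into the SQL, so pass trusted identifiers only.

```python
import sqlite3


def duplicate_filepaths(conn, table="documents", column="filepath"):
    """Return (value, count) rows for values that appear more than once."""
    sql = (
        f"SELECT {column}, COUNT(*) FROM {table} "
        f"GROUP BY {column} HAVING COUNT(*) > 1"
    )
    return conn.execute(sql).fetchall()


def list_triggers(conn):
    """Return (trigger_name, table_name) pairs defined in the database."""
    return conn.execute(
        "SELECT name, tbl_name FROM sqlite_master "
        "WHERE type = 'trigger' ORDER BY name"
    ).fetchall()
```

If list_triggers returns nothing for the table that feeds the FTS5 index, the index cannot stay in sync with inserts and updates on its own.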
Search Problems
Symptoms:
- No results found
- Incorrect results
- Performance issues
Investigation:
- Verify FTS5 index exists
- Test query syntax
- Check content was indexed
- Review filter logic
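The first two checks can be sketched as below. fts5_tables reads sqlite_master to find virtual tables declared with the fts5 module, and quote_fts5_term wraps user input in double quotes so that characters such as "/" or "-" are not parsed as FTS5 query syntax. Both function names are illustrative, not part of the project.

```python
import re
import sqlite3


def fts5_tables(conn):
    """Return names of virtual tables declared with the fts5 module."""
    rows = conn.execute(
        "SELECT name, sql FROM sqlite_master "
        "WHERE type = 'table' AND sql IS NOT NULL"
    ).fetchall()
    return [name for name, sql in rows if re.search(r"using\s+fts5", sql, re.I)]


def quote_fts5_term(term):
    """Wrap a term in double quotes so FTS5 treats it as a string literal;
    embedded double quotes are escaped by doubling them."""
    return '"' + term.replace('"', '""') + '"'
```

With a hypothetical index table named documents_fts, a query such as `SELECT count(*) FROM documents_fts WHERE documents_fts MATCH ?` with `quote_fts5_term(user_input)` as the parameter separates "no matching content" from "query syntax error".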
Logging and Error Handling
Error Logging Pattern
try:
    text = extract_text_from_pdf(full_path)
except Exception as e:
    print(f"  ❌ Erro ao processar {file}: {e}")
    # Log error with context
    errors += 1
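The pattern above prints the error to the console. A variant using the standard logging module also keeps the traceback, which helps when the same message has several possible causes. This is a sketch: process_files and the logger name are illustrative, and extract stands in for a function such as extract_text_from_pdf.

```python
import logging

logger = logging.getLogger("pdf_processing")  # logger name is arbitrary


def process_files(paths, extract):
    """Call extract(path) for each path; log failures and keep going.

    Returns (texts_by_path, error_count).
    """
    texts, errors = {}, 0
    for path in paths:
        try:
            texts[path] = extract(path)
        except Exception:
            # logger.exception records the message plus the full traceback
            logger.exception("Failed to process %s", path)
            errors += 1
    return texts, errors
```

Catching Exception per file keeps one bad PDF from stopping the whole batch, while the error count still makes failures visible at the end of the run.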
Debugging Checklist
Test Verification Steps
For Extraction Bugs
- Test with sample PDF
- Verify extraction method used
- Check text length
- Validate OCR if used
For Classification Bugs
- Test classification function directly
- Verify regex matches
- Check fallback logic
- Compare with expected result
Bug Fix Examples
Fix: OCR Not Triggering
Root Cause: The OCR check runs after text validation, so a scanned PDF fails validation before OCR is ever attempted
Fix: Move the OCR check ahead of the point where validation fails
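The corrected ordering can be sketched as follows. extract_text and run_ocr are caller-supplied callables standing in for the project's own functions, and min_chars is an arbitrary threshold; none of these names come from the codebase.

```python
def extract_with_fallback(path, extract_text, run_ocr, min_chars=20):
    """Try direct extraction first, fall back to OCR, then validate.

    Returns (text, method), or raises ValueError if both methods
    yield fewer than min_chars characters.
    """
    text = (extract_text(path) or "").strip()
    method = "pdfplumber"
    if len(text) < min_chars:
        # OCR runs before validation, so scanned PDFs get a second chance
        text = (run_ocr(path) or "").strip()
        method = "ocr"
    if len(text) < min_chars:
        raise ValueError(f"No extractable text in {path}")
    return text, method
```

Validation happens exactly once, at the end, so it can no longer reject a document that OCR would have recovered.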
Fix: Classification Fails
Root Cause: The regex doesn't match every pattern that occurs in the input
Fix: Broaden the regex or add alternative patterns for the unmatched cases
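Adding alternative patterns can be as simple as trying a list of compiled regexes in order. The two patterns below are hypothetical examples of contract-number formats, not the project's real ones; replace them with patterns taken from the failing filenames.

```python
import re

# Hypothetical patterns, tried in order; replace with the project's real ones.
CONTRACT_PATTERNS = [
    re.compile(r"contrato[ _-]*(\d+)[ _/-]+(\d{4})", re.IGNORECASE),
    re.compile(r"\bCT[ _-]*(\d+)[ _/-]+(\d{4})", re.IGNORECASE),
]


def extract_contract_number(filename):
    """Return 'number/year' from the first matching pattern, else None."""
    for pattern in CONTRACT_PATTERNS:
        match = pattern.search(filename)
        if match:
            return f"{match.group(1)}/{match.group(2)}"
    return None
```

Keeping the patterns in a list makes each new filename format a one-line addition, and the order of the list decides which pattern wins when more than one could match.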