Clean and normalize CSV data by analyzing structure, detecting issues (missing values, duplicates, type inconsistencies), and applying transformations.
You are a data cleaning specialist. Use this skill to clean and normalize CSV data.
Before running scripts, install dependencies:
pip install -r requirements.txt
1. Read `knowledge/index.md` for overview
2. Run `python scripts/analyze.py <input.csv>` to get data profile
3. Apply cleaning operations with `scripts/clean.py`

python scripts/analyze.py input.csv [--output analysis.json]
Returns the data profile as JSON.
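The exact fields of the profile depend on `scripts/analyze.py`. As a rough illustration of the checks this skill describes (missing values, duplicates), the sketch below builds a small profile using only the standard library. The function name `profile_csv` and the output keys are assumptions made for illustration, not the script's actual schema.

```python
import csv
import io
import json
from collections import Counter


def profile_csv(text: str) -> dict:
    """Build a small profile: row count, duplicate rows, missing cells per column."""
    reader = csv.DictReader(io.StringIO(text))
    rows = list(reader)
    columns = reader.fieldnames or []

    # Rows that are identical in every column count as duplicates.
    seen = Counter(tuple(row[c] for c in columns) for row in rows)
    duplicate_rows = sum(count - 1 for count in seen.values())

    # A cell is treated as missing when it is absent, empty or whitespace only.
    missing = {
        c: sum(1 for row in rows if (row[c] or "").strip() == "")
        for c in columns
    }
    return {"rows": len(rows), "duplicate_rows": duplicate_rows, "missing": missing}


sample = "name,age\nAda,36\nBob,\nAda,36\n"
profile = profile_csv(sample)
print(json.dumps(profile, indent=2))
```

A profile like this is enough to decide which operations to put in the operations file, for example `fill_missing` for columns with a non-zero missing count.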
python scripts/clean.py input.csv output.csv --operations ops.json
Operations file format:
{
  "operations": [
    {"type": "fill_missing", "column": "age", "strategy": "median"},
    {"type": "normalize_strings", "column": "name", "ops": ["trim", "lowercase"]},
    {"type": "standardize_dates", "column": "created_at", "format": "%Y-%m-%d"}
  ]
}
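To make the format concrete, here is a minimal sketch of how an operations list like the one above could be applied to rows loaded from a CSV. It is not the code in `scripts/clean.py`. The handler functions, the `HANDLERS` table and the list of input date formats are assumptions, and only the `median` strategy is covered.

```python
import statistics
from datetime import datetime

# Assumed input formats to try when parsing dates; the real script may differ.
INPUT_DATE_FORMATS = ("%Y-%m-%d", "%m/%d/%Y", "%d.%m.%Y")


def fill_missing(rows, op):
    """Replace empty cells in op["column"] with the column median."""
    if op["strategy"] != "median":
        raise ValueError(f"strategy not covered by this sketch: {op['strategy']}")
    column = op["column"]
    present = [float(row[column]) for row in rows if row[column] != ""]
    fill = format(statistics.median(present), "g")  # e.g. 38.0 becomes "38"
    for row in rows:
        if row[column] == "":
            row[column] = fill


def normalize_strings(rows, op):
    """Apply the listed string steps, in order, to op["column"]."""
    steps = {"trim": str.strip, "lowercase": str.lower, "uppercase": str.upper}
    for row in rows:
        for name in op["ops"]:
            row[op["column"]] = steps[name](row[op["column"]])


def standardize_dates(rows, op):
    """Rewrite dates in op["column"] using the strftime pattern in op["format"]."""
    for row in rows:
        for candidate in INPUT_DATE_FORMATS:
            try:
                parsed = datetime.strptime(row[op["column"]], candidate)
            except ValueError:
                continue  # try the next candidate format
            row[op["column"]] = parsed.strftime(op["format"])
            break  # values matching no candidate are left unchanged


HANDLERS = {
    "fill_missing": fill_missing,
    "normalize_strings": normalize_strings,
    "standardize_dates": standardize_dates,
}


def apply_operations(rows, spec):
    """Run each operation in file order; rows are modified in place."""
    for op in spec["operations"]:
        HANDLERS[op["type"]](rows, op)
    return rows


spec = {
    "operations": [
        {"type": "fill_missing", "column": "age", "strategy": "median"},
        {"type": "normalize_strings", "column": "name", "ops": ["trim", "lowercase"]},
        {"type": "standardize_dates", "column": "created_at", "format": "%Y-%m-%d"},
    ]
}
rows = [
    {"name": "  Ada ", "age": "36", "created_at": "03/15/2024"},
    {"name": "BOB", "age": "", "created_at": "2024-01-02"},
    {"name": "Cy", "age": "40", "created_at": "05.02.2024"},
]
cleaned = apply_operations(rows, spec)
```

Operations run in the order they appear, so the order in the file matters. Slash-separated dates are ambiguous between month-first and day-first; this sketch assumes month-first.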
python scripts/validate.py input.csv --schema schema.json
Validates data against a JSON Schema and reports violations.
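As an illustration of what a violation report might contain, the sketch below checks one row against a very small subset of JSON Schema (`required`, `type: integer`, `minimum`). It is an assumption about the general approach, not the logic of `scripts/validate.py`, which may well rely on a full validator library. Because CSV cells arrive as strings, the sketch parses them before checking the type. The function name `check_row` is hypothetical.

```python
def check_row(row, schema):
    """Return a list of violation messages for one CSV row (a dict of strings)."""
    violations = []

    # Required columns must be present and non-empty.
    for name in schema.get("required", []):
        if row.get(name, "") == "":
            violations.append(f"{name}: required value is missing")

    for name, rules in schema.get("properties", {}).items():
        value = row.get(name, "")
        if value == "":
            continue  # emptiness is covered by the "required" check above
        if rules.get("type") == "integer":
            try:
                number = int(value)
            except ValueError:
                violations.append(f"{name}: expected integer, got {value!r}")
                continue
            if "minimum" in rules and number < rules["minimum"]:
                violations.append(
                    f"{name}: {number} is below minimum {rules['minimum']}"
                )
    return violations


schema = {
    "required": ["name", "age"],
    "properties": {"age": {"type": "integer", "minimum": 0}},
}
report = [
    check_row({"name": "Ada", "age": "36"}, schema),
    check_row({"name": "", "age": "abc"}, schema),
    check_row({"name": "Cy", "age": "-1"}, schema),
]
```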
1. Run `analyze.py` on input CSV
2. Read the knowledge files for the detected issues:
   - `knowledge/operations/missing-values.md`
   - `knowledge/operations/duplicates.md`
   - `knowledge/types/strings.md`
   - `knowledge/types/dates.md`
3. Run `clean.py` with operations

When unsure which strategy to use, consult the knowledge files. They contain decision trees and best practices for each scenario.
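Choosing which knowledge files to read can be thought of as a lookup from detected issue to file. The sketch below shows that idea. The file paths come from this document, while the issue keys (`missing_values`, `duplicates`, `string_issues`, `date_issues`) and the shape of the profile are hypothetical.

```python
# Paths are taken from this document; the issue keys are assumed names.
KNOWLEDGE_FILES = {
    "missing_values": "knowledge/operations/missing-values.md",
    "duplicates": "knowledge/operations/duplicates.md",
    "string_issues": "knowledge/types/strings.md",
    "date_issues": "knowledge/types/dates.md",
}


def files_to_read(profile):
    """Return the knowledge files for issues with a non-zero count in the profile."""
    return [path for issue, path in KNOWLEDGE_FILES.items() if profile.get(issue)]


# Hypothetical profile: counts of each issue found by the analysis step.
profile = {"missing_values": 4, "duplicates": 0, "date_issues": 2}
to_read = files_to_read(profile)
```

Only the files for issues that actually occur are returned, so nothing is read for the duplicate-free column set in this example.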
| Operation | Description | Required Params |
|---|---|---|
| `fill_missing` | Fill null values | column, strategy (mean/median/mode/constant/forward/backward) |
| `drop_missing` | Drop rows with nulls | columns (list), how (any/all) |
| `remove_duplicates` | Remove duplicate rows | columns (optional), keep (first/last/none) |
| `normalize_strings` | Clean string columns | column, ops (trim/lowercase/uppercase/remove_special) |
| `standardize_dates` | Parse and format dates | column, format (strftime format) |
| `normalize_phones` | Convert to E.164 format | column, country (default: US) |
| `cap_outliers` | Cap extreme values | column, method (iqr/zscore), multiplier |
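For `cap_outliers`, the `iqr` method is commonly understood as clamping values to the range from Q1 minus multiplier times IQR to Q3 plus multiplier times IQR. The sketch below shows that interpretation with the standard library. The function name, the default multiplier of 1.5 and the choice of the inclusive quartile method are assumptions; the script may compute quartiles differently, which shifts the bounds slightly.

```python
import statistics


def cap_outliers_iqr(values, multiplier=1.5):
    """Clamp values to [Q1 - multiplier * IQR, Q3 + multiplier * IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    spread = (q3 - q1) * multiplier
    lower, upper = q1 - spread, q3 + spread
    # Values inside the bounds are kept as they are; the rest move to the bound.
    return [min(max(v, lower), upper) for v in values]


ages = [10, 12, 11, 13, 12, 95]
capped = cap_outliers_iqr(ages)
print(capped)  # [10, 12, 11, 13, 12, 15.0]
```

Here Q1 is 11.25 and Q3 is 12.75, so the bounds are 9.0 and 15.0 and only the value 95 is changed.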
knowledge/
├── index.md                  # Start here
├── operations/
│   ├── missing-values.md     # Handling nulls
│   ├── duplicates.md         # Deduplication
│   ├── outliers.md           # Outlier detection
│   └── normalization.md      # General patterns
├── types/
│   ├── strings.md            # Text cleaning
│   ├── numbers.md            # Numeric formatting
│   ├── dates.md              # Date parsing
│   ├── emails.md             # Email validation
│   └── phones.md             # Phone normalization
├── validation/
│   └── index.md              # JSON Schema rules
└── csv/
    └── edge-cases.md         # Encoding, quoting
Read only what you need based on detected issues.