csv-data-auditor

aig787/csv-data-auditor

Data & Analytics

About

SKILL.md

csv-data-auditor

aig787/csv-data-auditor

Data & Analytics

About

Validate and audit CSV data for quality, consistency, and completeness. Use when you need to check CSV files for data issues, missing values, or format inconsistencies.

SKILL.md

CSV Data Auditor

Instructions

When auditing CSV data, perform these validation checks systematically:

1. File Structure Validation

Verify the file exists and is readable
Check if the file is a valid CSV format
Ensure consistent delimiter usage
Verify proper quoting and escaping
Check for balanced quotes

2. Header Analysis

Confirm headers exist (unless headerless CSV)
Check for duplicate header names
Validate header naming conventions
Ensure headers are descriptive and consistent
Check for special characters in headers

3. Row Consistency

Count total rows and columns
Verify all rows have the same number of columns
Check for empty rows or rows with only whitespace
Identify truncated rows
Detect corrupted or malformed rows

4. Data Type Validation

For each column, validate the expected data type:

Numeric Fields

Check for non-numeric values
Validate decimal point usage
Look for negative values where inappropriate
Check for extremely large/small values

Date/Time Fields

Verify date format consistency (ISO 8601, US, EU, etc.)
Check for invalid dates (e.g., February 30th)
Validate time zone handling
Look for future dates where inappropriate

Text Fields

Check for encoding issues
Look for unexpected special characters
Verify string length constraints
Check for leading/trailing whitespace

Boolean Fields

Validate true/false representations
Check for 1/0, Y/N, Yes/No consistency
Look for ambiguous values

5. Data Quality Checks

Missing Values

Count NULL/empty values per column
Calculate missing value percentages
Identify patterns in missing data
Check for placeholders like "N/A", "null", "-"

Duplicates

Find exact duplicate rows
Check for duplicate IDs or keys
Identify near-duplicates (typos, variations)
Validate uniqueness constraints

Outliers

Identify statistical outliers in numeric columns
Check for values outside expected ranges
Look for anomalies in categorical data
Validate business logic constraints

Consistency Checks

Cross-validate related columns
Check referential integrity
Validate business rules
Look for logical contradictions

6. Validation Script (Python)

Create a Python script to perform automated checks:

import pandas as pd
import numpy as np
from datetime import datetime

def audit_csv(file_path, delimiter=','):
    """Perform comprehensive CSV audit"""

    # Load CSV
    try:
        df = pd.read_csv(file_path, delimiter=delimiter)
    except Exception as e:
        return {"error": f"Failed to load CSV: {str(e)}"}

    audit_report = {
        "file_info": {
            "rows": len(df),
            "columns": len(df.columns),
            "headers": list(df.columns)
        },
        "issues": []
    }

    # Check for missing values
    missing_counts = df.isnull().sum()
    for col, count in missing_counts.items():
        if count > 0:
            percentage = (count / len(df)) * 100
            audit_report["issues"].append({
                "type": "missing_values",
                "column": col,
                "count": count,
                "percentage": round(percentage, 2)
            })

    # Check for duplicates
    duplicate_rows = df.duplicated().sum()
    if duplicate_rows > 0:
        audit_report["issues"].append({
            "type": "duplicates",
            "count": duplicate_rows
        })

    # Data type validation
    for col in df.columns:
        # Check for mixed types in object columns
        if df[col].dtype == 'object':
            sample_values = df[col].dropna().head(10)
            # Add specific type checks based on column name patterns

    return audit_report

7. Report Format

Generate a structured audit report:

# CSV Audit Report

## File Information
- File: data.csv
- Size: 15.2 MB
- Rows: 50,000
- Columns: 12
- Headers: id, name, email, age, join_date, salary, department, ...

## Issues Found

### Critical Issues
1. **Missing Values**:
   - `email`: 2,500 missing (5.0%)
   - `salary`: 150 missing (0.3%)

### Warnings
1. **Inconsistent Date Format**:
   - `join_date`: Mix of ISO and US formats detected
   - Examples: 2023-01-15, 01/15/2023, 15-Jan-2023

2. **Potential Outliers**:
   - `age`: Values 0 and 150 detected
   - `salary`: Extremely high values > $1M

### Recommendations
1. Clean up email field - contact data source
2. Standardize date format to ISO 8601
3. Validate age and salary ranges
4. Remove or investigate duplicate rows

## Summary
- Overall Quality: Good
- Issues to Fix: 3 critical, 5 warnings
- Estimated Fix Time: 2-3 hours

8. Common Issues and Solutions

Encoding Problems

Try different encodings (UTF-8, Latin-1, Windows-1252)
Use chardet library to detect encoding
Handle BOM (Byte Order Mark) if present

Large Files

Process in chunks for memory efficiency
Use Dask for out-of-core processing
Sample data for quick validation

Complex CSVs

Handle quoted fields with embedded delimiters
Process multi-line records carefully
Validate escape character usage

Usage

Provide the CSV file path
Specify delimiter if not comma
Configure validation rules based on your data
Review the generated audit report
Address issues systematically

Tips

Always keep a backup of original data
Document any data transformations
Create validation rules based on business requirements
Consider using schema validation tools like Great Expectations
Automate regular audits for production data

About

SKILL.md

About

Validate and audit CSV data for quality, consistency, and completeness. Use when you need to check CSV files for data issues, missing values, or format inconsistencies.

SKILL.md

CSV Data Auditor

Instructions

When auditing CSV data, perform these validation checks systematically:

1. File Structure Validation

Verify the file exists and is readable
Check if the file is a valid CSV format
Ensure consistent delimiter usage
Verify proper quoting and escaping
Check for balanced quotes

2. Header Analysis

Confirm headers exist (unless headerless CSV)
Check for duplicate header names
Validate header naming conventions
Ensure headers are descriptive and consistent
Check for special characters in headers

3. Row Consistency

Count total rows and columns
Verify all rows have the same number of columns
Check for empty rows or rows with only whitespace
Identify truncated rows
Detect corrupted or malformed rows

4. Data Type Validation

For each column, validate the expected data type:

Numeric Fields

Check for non-numeric values
Validate decimal point usage
Look for negative values where inappropriate
Check for extremely large/small values

Date/Time Fields

Verify date format consistency (ISO 8601, US, EU, etc.)
Check for invalid dates (e.g., February 30th)
Validate time zone handling
Look for future dates where inappropriate

Text Fields

Check for encoding issues
Look for unexpected special characters
Verify string length constraints
Check for leading/trailing whitespace

Boolean Fields

Validate true/false representations
Check for 1/0, Y/N, Yes/No consistency
Look for ambiguous values

5. Data Quality Checks

Missing Values

Count NULL/empty values per column
Calculate missing value percentages
Identify patterns in missing data
Check for placeholders like "N/A", "null", "-"

Duplicates

Find exact duplicate rows
Check for duplicate IDs or keys
Identify near-duplicates (typos, variations)
Validate uniqueness constraints

Outliers

Identify statistical outliers in numeric columns
Check for values outside expected ranges
Look for anomalies in categorical data
Validate business logic constraints

Consistency Checks

Cross-validate related columns
Check referential integrity
Validate business rules
Look for logical contradictions

6. Validation Script (Python)

Create a Python script to perform automated checks:

import pandas as pd
import numpy as np
from datetime import datetime

def audit_csv(file_path, delimiter=','):
    """Perform comprehensive CSV audit"""

    # Load CSV
    try:
        df = pd.read_csv(file_path, delimiter=delimiter)
    except Exception as e:
        return {"error": f"Failed to load CSV: {str(e)}"}

    audit_report = {
        "file_info": {
            "rows": len(df),
            "columns": len(df.columns),
            "headers": list(df.columns)
        },
        "issues": []
    }

    # Check for missing values
    missing_counts = df.isnull().sum()
    for col, count in missing_counts.items():
        if count > 0:
            percentage = (count / len(df)) * 100
            audit_report["issues"].append({
                "type": "missing_values",
                "column": col,
                "count": count,
                "percentage": round(percentage, 2)
            })

    # Check for duplicates
    duplicate_rows = df.duplicated().sum()
    if duplicate_rows > 0:
        audit_report["issues"].append({
            "type": "duplicates",
            "count": duplicate_rows
        })

    # Data type validation
    for col in df.columns:
        # Check for mixed types in object columns
        if df[col].dtype == 'object':
            sample_values = df[col].dropna().head(10)
            # Add specific type checks based on column name patterns

    return audit_report

7. Report Format

Generate a structured audit report:

# CSV Audit Report

## File Information
- File: data.csv
- Size: 15.2 MB
- Rows: 50,000
- Columns: 12
- Headers: id, name, email, age, join_date, salary, department, ...

## Issues Found

### Critical Issues
1. **Missing Values**:
   - `email`: 2,500 missing (5.0%)
   - `salary`: 150 missing (0.3%)

### Warnings
1. **Inconsistent Date Format**:
   - `join_date`: Mix of ISO and US formats detected
   - Examples: 2023-01-15, 01/15/2023, 15-Jan-2023

2. **Potential Outliers**:
   - `age`: Values 0 and 150 detected
   - `salary`: Extremely high values > $1M

### Recommendations
1. Clean up email field - contact data source
2. Standardize date format to ISO 8601
3. Validate age and salary ranges
4. Remove or investigate duplicate rows

## Summary
- Overall Quality: Good
- Issues to Fix: 3 critical, 5 warnings
- Estimated Fix Time: 2-3 hours

8. Common Issues and Solutions

Encoding Problems

Try different encodings (UTF-8, Latin-1, Windows-1252)
Use chardet library to detect encoding
Handle BOM (Byte Order Mark) if present

Large Files

Process in chunks for memory efficiency
Use Dask for out-of-core processing
Sample data for quick validation

Complex CSVs

Handle quoted fields with embedded delimiters
Process multi-line records carefully
Validate escape character usage

Usage

Provide the CSV file path
Specify delimiter if not comma
Configure validation rules based on your data
Review the generated audit report
Address issues systematically

Tips

Always keep a backup of original data
Document any data transformations
Create validation rules based on business requirements
Consider using schema validation tools like Great Expectations
Automate regular audits for production data