Validate and audit CSV data for quality, consistency, and completeness. Use when you need to check CSV files for data issues, missing values, or format inconsistencies.
When auditing CSV data, perform these validation checks systematically:
For each column, validate the expected data type:
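A minimal sketch of such a check, assuming pandas and a hypothetical `expected_types` mapping from column name to `"numeric"` or `"datetime"`. The function name `check_column_types` is illustrative as well. It counts the non-missing values that fail conversion to the expected type:

```python
import pandas as pd


def check_column_types(df, expected_types):
    """Count non-missing values that cannot be converted to the expected type.

    expected_types is a hypothetical mapping such as {"age": "numeric"}.
    """
    problems = {}
    for col, expected in expected_types.items():
        if col not in df.columns:
            problems[col] = "column not found"
            continue
        values = df[col].dropna()
        if expected == "numeric":
            converted = pd.to_numeric(values, errors="coerce")
        elif expected == "datetime":
            converted = pd.to_datetime(values, errors="coerce")
        else:
            continue  # unknown expectation: skip rather than guess
        # Values that failed conversion were coerced to NaN/NaT
        bad = int(converted.isna().sum())
        if bad:
            problems[col] = bad
    return problems


sample = pd.DataFrame({
    "age": ["34", "n/a", "51"],
    "salary": ["1000", "2500", "3100"],
})
result = check_column_types(sample, {"age": "numeric", "salary": "numeric"})
print(result)
```

Only columns with at least one unconvertible value appear in the result, so an empty dictionary means every listed column passed.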
Create a Python script to perform automated checks:
import pandas as pd


def audit_csv(file_path, delimiter=','):
    """Audit a CSV file for missing values, duplicate rows, and mixed types."""
    # Load the CSV; report a load failure instead of raising
    try:
        df = pd.read_csv(file_path, delimiter=delimiter)
    except Exception as e:
        return {"error": f"Failed to load CSV: {e}"}

    audit_report = {
        "file_info": {
            "rows": len(df),
            "columns": len(df.columns),
            "headers": list(df.columns)
        },
        "issues": []
    }

    # Check for missing values
    missing_counts = df.isnull().sum()
    for col, count in missing_counts.items():
        if count > 0:
            percentage = (count / len(df)) * 100
            audit_report["issues"].append({
                "type": "missing_values",
                "column": col,
                "count": int(count),  # plain int keeps the report JSON-serializable
                "percentage": round(float(percentage), 2)
            })

    # Check for fully duplicated rows
    duplicate_rows = int(df.duplicated().sum())
    if duplicate_rows > 0:
        audit_report["issues"].append({
            "type": "duplicates",
            "count": duplicate_rows
        })

    # Data type validation
    for col in df.columns:
        # Check for mixed Python types in object columns
        if df[col].dtype == 'object':
            value_types = df[col].dropna().map(type).unique()
            if len(value_types) > 1:
                audit_report["issues"].append({
                    "type": "mixed_types",
                    "column": col,
                    "types": sorted(t.__name__ for t in value_types)
                })
        # Add specific type checks based on column name patterns

    return audit_report
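The script above does not look at formats within a column. As a rough sketch of one such check, the following classifies date-like strings against a small set of regular expressions. `DATE_PATTERNS` and `date_formats_in` are hypothetical names, and the three patterns are assumptions that would need extending for real data:

```python
import re
from collections import Counter

# Hypothetical patterns; extend to match the formats your data may contain
DATE_PATTERNS = {
    "iso": re.compile(r"^\d{4}-\d{2}-\d{2}$"),                   # 2023-01-15
    "slash": re.compile(r"^\d{1,2}/\d{1,2}/\d{4}$"),             # 01/15/2023
    "day_month_name": re.compile(r"^\d{1,2}-[A-Za-z]{3}-\d{4}$"),  # 15-Jan-2023
}


def date_formats_in(values):
    """Count how many values match each known pattern; the rest are 'unrecognized'."""
    counts = Counter()
    for value in values:
        text = str(value).strip()
        label = next(
            (name for name, pattern in DATE_PATTERNS.items() if pattern.match(text)),
            "unrecognized",
        )
        counts[label] += 1
    return counts


formats = date_formats_in(["2023-01-15", "01/15/2023", "15-Jan-2023", "2023-02-01"])
if len(formats) > 1:
    print("Inconsistent date formats:", dict(formats))
```

With pandas, this could be called as `date_formats_in(df["join_date"].dropna())`. More than one key in the result suggests the column mixes formats. The patterns only check shape, so they cannot tell a US-style `01/02/2023` from a day-first one.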
Generate a structured audit report:
# CSV Audit Report
## File Information
- File: data.csv
- Size: 15.2 MB
- Rows: 50,000
- Columns: 12
- Headers: id, name, email, age, join_date, salary, department, ...
## Issues Found
### Critical Issues
1. **Missing Values**:
   - `email`: 2,500 missing (5.0%)
   - `salary`: 150 missing (0.3%)
### Warnings
1. **Inconsistent Date Format**:
   - `join_date`: Mix of ISO and US formats detected
   - Examples: 2023-01-15, 01/15/2023, 15-Jan-2023
2. **Potential Outliers**:
   - `age`: Values 0 and 150 detected
   - `salary`: Extremely high values > $1M
### Recommendations
1. Clean up email field - contact data source
2. Standardize date format to ISO 8601
3. Validate age and salary ranges
4. Remove or investigate duplicate rows
## Summary
- Overall Quality: Good
- Issues to Fix: 3 critical, 5 warnings
- Estimated Fix Time: 2-3 hours
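Part of a report like this can be generated from the dictionary that `audit_csv` returns. The sketch below is one possible approach, not a fixed format. `render_report` and its `file_name` parameter are hypothetical, and it covers only the fields present in that dictionary, so file size, severity grouping, recommendations, and the summary are left out:

```python
def render_report(audit_report, file_name="data.csv"):
    """Render the audit_csv() result as a Markdown report (a partial sketch)."""
    if "error" in audit_report:
        return f"# CSV Audit Report\n\n{audit_report['error']}\n"

    info = audit_report["file_info"]
    lines = [
        "# CSV Audit Report",
        "## File Information",
        f"- File: {file_name}",
        f"- Rows: {info['rows']:,}",
        f"- Columns: {info['columns']}",
        f"- Headers: {', '.join(map(str, info['headers']))}",
        "## Issues Found",
    ]
    if not audit_report["issues"]:
        lines.append("No issues found.")
    for issue in audit_report["issues"]:
        if issue["type"] == "missing_values":
            lines.append(
                f"- `{issue['column']}`: {issue['count']:,} missing "
                f"({issue['percentage']}%)"
            )
        elif issue["type"] == "duplicates":
            lines.append(f"- {issue['count']:,} duplicate rows")
        else:
            # Other issue types are listed by name only
            lines.append(f"- {issue['type']}")
    return "\n".join(lines) + "\n"


example = {
    "file_info": {"rows": 50000, "columns": 12, "headers": ["id", "name", "email"]},
    "issues": [
        {"type": "missing_values", "column": "email", "count": 2500, "percentage": 5.0},
    ],
}
report_text = render_report(example)
print(report_text)
```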
Use the `chardet` library to detect the file encoding.
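Encoding detection might look like the following sketch. It calls `chardet.detect` when the third-party `chardet` package is installed. The fallback list of candidate encodings and the name `guess_encoding` are assumptions made for illustration:

```python
import codecs


def guess_encoding(raw):
    """Guess the encoding of a bytes sample taken from the start of a file."""
    try:
        import chardet  # third-party: pip install chardet
    except ImportError:
        chardet = None

    if chardet is not None:
        guess = chardet.detect(raw)  # dict with 'encoding' and 'confidence' keys
        if guess["encoding"]:
            return guess["encoding"]

    # Fallback (an assumption, not part of chardet): try a few common encodings.
    # An incremental decoder tolerates a multi-byte character cut off at the
    # end of the sample.
    for candidate in ("utf-8", "cp1252"):
        try:
            codecs.getincrementaldecoder(candidate)().decode(raw)
            return candidate
        except UnicodeDecodeError:
            continue
    return "latin-1"  # decodes any byte sequence, though possibly incorrectly


sample_bytes = "name,city\nJosé,Zürich\nRenée,Málaga\nŁukasz,Kraków\n".encode("utf-8")
detected = guess_encoding(sample_bytes)
print(detected)
```

In practice the sample would come from the file itself, for example the first 100,000 bytes read in binary mode, and the result can be passed to `pd.read_csv(file_path, encoding=detected)`. Detection is a statistical guess, so it is worth checking a few rows by eye when the data contains non-ASCII text.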