
    neversight/data-quality

    Data Quality Skill

    Systematic approach to diagnosing and fixing data quality problems.

    Data Quality Process

    Define & Identify → Detect & Quantify → Clean & Rectify → Measure & Verify
    
    1. Define: Understand data context, business rules, quality requirements
    2. Detect: Profile data, find glitches (missing, duplicates, outliers, violations)
    3. Clean: Apply appropriate repair strategies
    4. Measure: Validate repairs, quantify improvement
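    As a concrete illustration of the four phases, here is a minimal sketch using plain pandas on toy data invented for the example (not this skill's scripts or test data):

```python
import pandas as pd

# Toy frame with two missing ages and a duplicate id (illustrative only).
df = pd.DataFrame({
    "id": [1, 2, 2, 3],
    "age": [34, None, None, 29],
})

# 1. Define: ids must be unique, age must be non-null.
# 2. Detect: quantify the glitches.
n_dupes = int(df.duplicated(subset=["id"]).sum())
n_missing = int(df["age"].isna().sum())

# 3. Clean: drop duplicate ids, impute age with the median.
clean = df.drop_duplicates(subset=["id"]).copy()
clean["age"] = clean["age"].fillna(clean["age"].median())

# 4. Measure: verify the repairs held.
assert not clean.duplicated(subset=["id"]).any()
assert clean["age"].notna().all()
```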

    Quick Reference

    | Problem | Script | Key Function |
    |---|---|---|
    | Data overview | data_profiling.py | profile_dataframe(df) |
    | Find quality issues | data_profiling.py | detect_glitches(df) |
    | Missing values | missing_data.py | analyze_missing(df) |
    | Imputation | missing_data.py | impute_mean/median/regression() |
    | Duplicates | duplicate_detection.py | find_duplicates(df, cols) |
    | Deduplication | duplicate_detection.py | deduplicate(df, cols) |
    | Outliers | anomaly_detection.py | detect_anomalies(df) |
    | Constraint check | constraint_checking.py | validate_constraints(df, rules) |
    | String matching | similarity_metrics.py | jaro_winkler_similarity() |

    Workflow

    Step 1: Profile the Data

    from scripts.data_profiling import profile_dataframe, detect_glitches, generate_quality_report
    
    # Quick overview
    print(generate_quality_report(df))
    
    # Detailed profile
    profile = profile_dataframe(df)
    
    # Find issues
    glitches = detect_glitches(df)
    

    Step 2: Analyze Specific Issues

    Missing Data:

    from scripts.missing_data import analyze_missing, test_mcar
    
    analysis = analyze_missing(df)
    # Check if safe to delete rows
    mcar_test = test_mcar(df, 'column_with_missing', ['other_cols'])
    

    Duplicates:

    from scripts.duplicate_detection import find_duplicates, cluster_duplicates
    
    matches = find_duplicates(df, ['name', 'email'], threshold=0.85)
    clusters = cluster_duplicates(matches)
    

    Outliers:

    from scripts.anomaly_detection import detect_anomalies, iqr_outliers
    
    # Multi-column summary
    anomalies = detect_anomalies(df, method='iqr')
    
    # Single column detail
    result = iqr_outliers(df, 'price', multiplier=1.5)
    

    Constraints:

    from scripts.constraint_checking import validate_constraints
    
    constraints = [
        {'type': 'unique', 'columns': ['id']},
        {'type': 'not_null', 'columns': ['name', 'email']},
        {'type': 'fd', 'determinant': ['id'], 'dependent': ['name']},
        {'type': 'domain', 'column': 'age', 'min_value': 0, 'max_value': 150},
    ]
    results = validate_constraints(df, constraints)
    

    Step 3: Clean the Data

    Handle Missing:

    from scripts.missing_data import impute_median, impute_regression, listwise_deletion
    
    # Simple: median for numeric
    df_clean = impute_median(df, 'age')
    
    # Better: regression-based
    df_clean = impute_regression(df, 'income', ['age', 'education'])
    
    # If MCAR confirmed
    df_clean = listwise_deletion(df)
    

    Remove Duplicates:

    from scripts.duplicate_detection import deduplicate
    
    df_clean, summary = deduplicate(
        df, 
        columns=['name', 'email', 'address'],
        threshold=0.8,
        merge_strategy='most_complete'
    )
    print(f"Reduced from {summary['original_rows']} to {summary['final_rows']} rows")
    

    Handle Outliers:

    # Cap extreme values (winsorize at the 1st/99th percentiles)
    q01, q99 = df['col'].quantile([0.01, 0.99])
    df['col'] = df['col'].clip(q01, q99)
    
    # Or drop the flagged rows (assuming 'outlier_indices' holds row labels)
    outlier_idx = detect_anomalies(df)['col']['outlier_indices']
    df_clean = df.drop(index=outlier_idx)
    

    Step 4: Validate

    Re-run profiling and constraint checks on cleaned data to verify improvements.
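    One simple way to quantify the improvement this step asks for is a non-null-cell completeness score. The sketch below (invented data, not part of this skill's scripts) compares the score before and after a repair:

```python
import pandas as pd

def completeness(df: pd.DataFrame) -> float:
    """Fraction of non-null cells; one coarse before/after quality score."""
    return float(df.notna().sum().sum()) / df.size

before = pd.DataFrame({"a": [1, None, 3], "b": [None, None, 6]})
after = before.fillna(0)  # stand-in for whatever repair Step 3 applied

improvement = completeness(after) - completeness(before)
```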

    References

    For deeper understanding:

    • references/dimensions.md: Data quality dimensions (accuracy, completeness, etc.)
    • references/glitch_taxonomy.md: Types of data glitches and detection approaches
    • references/repair_strategies.md: Detailed repair and cleaning strategies

    Key Concepts

    Data Quality = Fit for Use

    • Free of defects
    • Has features needed for the task
    • Right information, right place, right time

    Missing Data Mechanisms:

    • MCAR: Missing Completely At Random (safe to delete)
    • MAR: Missing At Random (imputation may work)
    • MNAR: Missing Not At Random (most problematic)
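    A quick informal screen for the first two mechanisms (the skill's test_mcar is the proper tool; this pandas sketch on invented data only shows the intuition that missingness rates differing across groups rule out MCAR):

```python
import pandas as pd

df = pd.DataFrame({
    "education": ["hs", "hs", "college", "college"],
    "income":    [30e3, None, 55e3, 60e3],
})

# If income's missingness rate varies with education, the data are not
# MCAR (at best MAR); a formal check is Little's MCAR test.
rates = df["income"].isna().groupby(df["education"]).mean()
looks_mcar = rates.nunique() == 1
```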

    Constraints:

    • Functional Dependency: X → Y means X uniquely determines Y
    • Referential Integrity: foreign keys reference valid primary keys
    • Domain Constraints: values within allowed set/range
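    A functional dependency X → Y can also be checked directly in pandas, as a sketch on invented data (validate_constraints with a 'fd' rule is this skill's supported route):

```python
import pandas as pd

df = pd.DataFrame({
    "id":   [1, 2, 2, 3],
    "name": ["Ann", "Bob", "Rob", "Cy"],
})

# id -> name holds iff every id value maps to exactly one name.
names_per_id = df.groupby("id")["name"].nunique()
fd_holds = bool((names_per_id <= 1).all())
violating_ids = names_per_id[names_per_id > 1].index.tolist()
```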

    Entity Resolution:

    • Blocking reduces O(n²) to O(n·window)
    • Similarity metrics: Jaro-Winkler (names), Levenshtein (typos), Jaccard (sets)
    • Cluster by transitive closure, merge by strategy
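    The blocking idea above can be sketched as a sorted-neighborhood pass (a minimal illustration, not this skill's implementation):

```python
def sorted_neighborhood_pairs(records, key, window=3):
    """Sort by a blocking key, then compare each record only to the next
    window - 1 records: O(n * window) candidate pairs instead of the
    O(n^2) full cross-product."""
    ordered = sorted(records, key=key)
    pairs = []
    for i, left in enumerate(ordered):
        for right in ordered[i + 1 : i + window]:
            pairs.append((left, right))
    return pairs

names = ["Robert", "Rupert", "Alice", "Alicia", "Zoe"]
# window=2: each name is compared only with its immediate neighbor.
pairs = sorted_neighborhood_pairs(names, key=str.lower, window=2)
```

    With a window of 2 over 5 records this yields 4 candidate pairs instead of the 10 a full cross-product would produce, while still placing likely matches ("Robert"/"Rupert", "Alice"/"Alicia") next to each other.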

    Similarity Metrics Comparison

    | Metric | Best For | Example |
    |---|---|---|
    | Jaro-Winkler | Names, short strings | "Robert" vs "Rupert" |
    | Levenshtein | Typos, edit distance | "recieve" vs "receive" |
    | Jaccard | Token/word comparison | "John Doe" vs "Doe, John" |
    | Q-gram | Fuzzy substring matching | Partial matches |
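    For illustration, here is the Levenshtein row of the table as a self-contained function (a textbook dynamic-programming version, not the implementation in this skill's similarity_metrics.py):

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character edits (insert, delete,
    substitute) turning a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]
```

    The "recieve" vs "receive" example from the table scores 2 under this metric (the transposed i/e costs one substitution each way), which is why Levenshtein suits typo detection.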
    Repository

    neversight/skills_feed