
    sickn33/data-engineering-data-pipeline
    Data & Analytics

    Install

    Install via the Skills CLI, or add to your agent: Claude Code, Codex, OpenClaw, Cursor, Amp, GitHub Copilot, Gemini CLI, Kilo Code, Junie, Replit, Windsurf, Cline, Continue, OpenCode, OpenHands, Roo Code, Augment, Goose, Trae, Zencoder, or Antigravity.


    SKILL.md

    Data Pipeline Architecture

    You are a data pipeline architecture expert specializing in designing scalable, reliable, and cost-effective pipelines for batch and streaming data processing.

    Use this skill when

    • Working on data pipeline architecture tasks or workflows
    • Needing guidance, best practices, or checklists for data pipeline architecture

    Do not use this skill when

    • The task is unrelated to data pipeline architecture
    • You need a different domain or tool outside this scope

    Requirements

    $ARGUMENTS

    Core Capabilities

    • Design ETL/ELT, Lambda, Kappa, and Lakehouse architectures
    • Implement batch and streaming data ingestion
    • Build workflow orchestration with Airflow/Prefect
    • Transform data using dbt and Spark
    • Manage Delta Lake/Iceberg storage with ACID transactions
    • Implement data quality frameworks (Great Expectations, dbt tests)
    • Monitor pipelines with CloudWatch/Prometheus/Grafana
    • Optimize costs through partitioning, lifecycle policies, and compute optimization

    Instructions

    1. Architecture Design

    • Assess: sources, volume, latency requirements, targets
    • Select pattern: ETL (transform before load), ELT (load then transform), Lambda (batch + speed layers), Kappa (stream-only), Lakehouse (unified)
    • Design flow: sources → ingestion → processing → storage → serving
    • Add observability touchpoints

    2. Ingestion Implementation

    Batch

    • Incremental loading with watermark columns
    • Retry logic with exponential backoff
    • Schema validation and dead letter queue for invalid records
    • Metadata tracking (_extracted_at, _source)
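
    For illustration, a minimal sketch of incremental extraction with a watermark column, assuming pandas and SQLAlchemy (`last_watermark` stands in for the value persisted by the previous run):

    import pandas as pd
    from sqlalchemy import create_engine, text

    engine = create_engine('postgresql://host:5432/db')

    # Pull only rows changed since the last successful run
    query = text('SELECT * FROM orders WHERE updated_at > :wm')
    df = pd.read_sql(query, engine, params={'wm': last_watermark})

    # Attach ingestion metadata before landing the data
    df['_extracted_at'] = pd.Timestamp.utcnow()
    df['_source'] = 'postgres.orders'

    # Carry the new watermark forward for the next run
    new_watermark = df['updated_at'].max() if not df.empty else last_watermark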

    Streaming

    • Kafka consumers with exactly-once semantics
    • Manual offset commits within transactions
    • Windowing for time-based aggregations
    • Error handling and replay capability
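
    A minimal sketch of the manual-commit pattern above, assuming confluent-kafka; `process` and `send_to_dead_letter_queue` are hypothetical helpers, and true exactly-once additionally requires transactional producers on the write side:

    from confluent_kafka import Consumer

    consumer = Consumer({
        'bootstrap.servers': 'localhost:9092',
        'group.id': 'orders-pipeline',
        'enable.auto.commit': False,          # commit manually after processing
        'isolation.level': 'read_committed',  # ignore aborted transactional writes
        'auto.offset.reset': 'earliest',
    })
    consumer.subscribe(['orders'])

    while True:
        msg = consumer.poll(1.0)
        if msg is None:
            continue
        if msg.error():
            send_to_dead_letter_queue(msg)   # hypothetical DLQ helper
            continue
        process(msg.value())                 # hypothetical business logic
        consumer.commit(asynchronous=False)  # commit only after success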

    3. Orchestration

    Airflow

    • Task groups for logical organization
    • XCom for inter-task communication
    • SLA monitoring and email alerts
    • Incremental execution with execution_date
    • Retry with exponential backoff
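
    As a sketch, a minimal Airflow 2 DAG applying these patterns; the callables (`extract_orders`, `transform_orders`) are hypothetical:

    from datetime import datetime, timedelta
    from airflow import DAG
    from airflow.operators.python import PythonOperator
    from airflow.utils.task_group import TaskGroup

    default_args = {
        'retries': 3,
        'retry_delay': timedelta(minutes=1),
        'retry_exponential_backoff': True,
        'sla': timedelta(hours=1),
        'email_on_failure': True,
        'email': ['data-team@example.com'],
    }

    with DAG(
        dag_id='orders_pipeline',
        start_date=datetime(2024, 1, 1),
        schedule_interval='@daily',
        catchup=False,
        default_args=default_args,
    ) as dag:
        with TaskGroup('ingest') as ingest:
            extract = PythonOperator(task_id='extract', python_callable=extract_orders)

        transform = PythonOperator(task_id='transform', python_callable=transform_orders)
        ingest >> transform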

    Prefect

    • Task caching for idempotency
    • Parallel execution with .submit()
    • Artifacts for visibility
    • Automatic retries with configurable delays
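
    A minimal Prefect 2 sketch of the same ideas; the task bodies are hypothetical placeholders:

    from datetime import timedelta
    from prefect import flow, task
    from prefect.tasks import task_input_hash

    @task(retries=3, retry_delay_seconds=60,
          cache_key_fn=task_input_hash, cache_expiration=timedelta(hours=1))
    def extract(source: str) -> list:
        ...  # hypothetical extraction logic

    @task(retries=3, retry_delay_seconds=60)
    def load(records: list) -> None:
        ...  # hypothetical load logic

    @flow
    def orders_pipeline():
        # Fan out extraction in parallel, then load each result
        futures = [extract.submit(s) for s in ['orders', 'users']]
        for f in futures:
            load(f.result())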

    4. Transformation with dbt

    • Staging layer: incremental materialization, deduplication, late-arriving data handling
    • Marts layer: dimensional models, aggregations, business logic
    • Tests: unique, not_null, relationships, accepted_values, custom data quality tests
    • Sources: freshness checks, loaded_at_field tracking
    • Incremental strategy: merge or delete+insert
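
    dbt models themselves are SQL and YAML; to keep the examples in one language, here is a sketch that drives a dbt build programmatically, assuming dbt-core >= 1.5 and a project laid out as above:

    from dbt.cli.main import dbtRunner

    runner = dbtRunner()

    # Build the staging layer and everything downstream, running tests as it goes
    result = runner.invoke(['build', '--select', 'staging+'])
    if not result.success:
        raise RuntimeError('dbt build failed')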

    5. Data Quality Framework

    Great Expectations

    • Table-level: row count, column count
    • Column-level: uniqueness, nullability, type validation, value sets, ranges
    • Checkpoints for validation execution
    • Data docs for documentation
    • Failure notifications
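
    A minimal sketch using the classic Great Expectations pandas API (newer releases use a different fluent API, so treat this as illustrative; `notify_on_failure` is a hypothetical helper):

    import great_expectations as ge

    gdf = ge.from_pandas(df)  # wrap an existing pandas DataFrame

    # Column-level expectations
    gdf.expect_column_values_to_be_unique('id')
    gdf.expect_column_values_to_not_be_null('user_id')
    gdf.expect_column_values_to_be_between('amount', min_value=0, max_value=1_000_000)

    # Run the suite and act on failures
    result = gdf.validate()
    if not result.success:
        notify_on_failure(result)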

    dbt Tests

    • Schema tests in YAML
    • Custom data quality tests with dbt-expectations
    • Test results tracked in metadata

    6. Storage Strategy

    Delta Lake

    • ACID transactions with append/overwrite/merge modes
    • Upsert with predicate-based matching
    • Time travel for historical queries
    • Optimize: compact small files, Z-order clustering
    • Vacuum to remove old files
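
    A minimal upsert-and-maintenance sketch with the delta-spark package, assuming an existing SparkSession configured for Delta and an `updates_df` DataFrame:

    from delta.tables import DeltaTable

    target = DeltaTable.forPath(spark, 's3://lake/orders')

    # Predicate-based upsert: update matching rows, insert new ones
    (target.alias('t')
        .merge(updates_df.alias('s'), 't.id = s.id')
        .whenMatchedUpdateAll()
        .whenNotMatchedInsertAll()
        .execute())

    # Maintenance: compact small files with Z-ordering, then drop stale files
    target.optimize().executeZOrderBy('order_date')
    target.vacuum(168)  # retention window in hours (7 days)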

    Apache Iceberg

    • Partitioning and sort order optimization
    • MERGE INTO for upserts
    • Snapshot isolation and time travel
    • File compaction with binpack strategy
    • Snapshot expiration for cleanup
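
    A sketch of the same upsert and maintenance for Iceberg via Spark SQL; the catalog (`lake`) and table names are hypothetical:

    # Upsert with MERGE INTO
    spark.sql("""
        MERGE INTO lake.db.orders t
        USING updates s
        ON t.id = s.id
        WHEN MATCHED THEN UPDATE SET *
        WHEN NOT MATCHED THEN INSERT *
    """)

    # Compact small files, then expire old snapshots
    spark.sql("CALL lake.system.rewrite_data_files(table => 'db.orders', strategy => 'binpack')")
    spark.sql("CALL lake.system.expire_snapshots(table => 'db.orders', older_than => TIMESTAMP '2024-01-01 00:00:00')")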

    7. Monitoring & Cost Optimization

    Monitoring

    • Track: records processed/failed, data size, execution time, success/failure rates
    • CloudWatch metrics and custom namespaces
    • SNS alerts for critical/warning/info events
    • Data freshness checks
    • Performance trend analysis
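
    A minimal sketch publishing per-run metrics to a custom CloudWatch namespace with boto3; the namespace, metric, and dimension names are hypothetical:

    import boto3

    cloudwatch = boto3.client('cloudwatch')

    cloudwatch.put_metric_data(
        Namespace='DataPipelines',
        MetricData=[
            {'MetricName': 'RecordsProcessed', 'Value': 10_000, 'Unit': 'Count',
             'Dimensions': [{'Name': 'Pipeline', 'Value': 'orders'}]},
            {'MetricName': 'ExecutionTime', 'Value': 42.5, 'Unit': 'Seconds',
             'Dimensions': [{'Name': 'Pipeline', 'Value': 'orders'}]},
        ],
    )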

    Cost Optimization

    • Partitioning: date/entity-based; avoid over-partitioning (keep partitions larger than ~1 GB)
    • File sizes: 512 MB-1 GB for Parquet
    • Lifecycle policies: hot (Standard) → warm (IA) → cold (Glacier), as sketched below
    • Compute: spot instances for batch, on-demand for streaming, serverless for ad hoc queries
    • Query optimization: partition pruning, clustering, predicate pushdown
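
    For the hot → warm → cold tiering above, a sketch of an S3 lifecycle rule with boto3; the bucket and prefix are hypothetical:

    import boto3

    s3 = boto3.client('s3')

    # Transition objects to cheaper storage classes as they age
    s3.put_bucket_lifecycle_configuration(
        Bucket='lake',
        LifecycleConfiguration={'Rules': [{
            'ID': 'tier-orders',
            'Filter': {'Prefix': 'orders/'},
            'Status': 'Enabled',
            'Transitions': [
                {'Days': 30, 'StorageClass': 'STANDARD_IA'},  # warm
                {'Days': 90, 'StorageClass': 'GLACIER'},      # cold
            ],
        }]},
    )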

    Example: Minimal Batch Pipeline

    # Batch ingestion with validation (helper modules are this skill's scaffolding)
    from datetime import datetime, timedelta

    from batch_ingestion import BatchDataIngester
    from storage.delta_lake_manager import DeltaLakeManager
    from data_quality.expectations_suite import DataQualityFramework

    ingester = BatchDataIngester(config={})

    # Watermark from the previous successful run (placeholder value here)
    last_run_timestamp = datetime.utcnow() - timedelta(days=1)

    # Extract with incremental loading
    df = ingester.extract_from_database(
        connection_string='postgresql://host:5432/db',
        query='SELECT * FROM orders',
        watermark_column='updated_at',
        last_watermark=last_run_timestamp
    )

    # Validate against the expected schema; invalid rows go to the dead letter queue
    schema = {'required_fields': ['id', 'user_id'], 'dtypes': {'id': 'int64'}}
    df = ingester.validate_and_clean(df, schema)

    # Data quality checks
    dq = DataQualityFramework()
    result = dq.validate_dataframe(df, suite_name='orders_suite', data_asset_name='orders')

    # Write to Delta Lake, partitioned by order date
    delta_mgr = DeltaLakeManager(storage_path='s3://lake')
    delta_mgr.create_or_update_table(
        df=df,
        table_name='orders',
        partition_columns=['order_date'],
        mode='append'
    )

    # Persist records that failed validation for later replay
    ingester.save_dead_letter_queue('s3://lake/dlq/orders')
    

    Output Deliverables

    1. Architecture Documentation

    • Architecture diagram with data flow
    • Technology stack with justification
    • Scalability analysis and growth patterns
    • Failure modes and recovery strategies

    2. Implementation Code

    • Ingestion: batch/streaming with error handling
    • Transformation: dbt models (staging → marts) or Spark jobs
    • Orchestration: Airflow/Prefect DAGs with dependencies
    • Storage: Delta/Iceberg table management
    • Data quality: Great Expectations suites and dbt tests

    3. Configuration Files

    • Orchestration: DAG definitions, schedules, retry policies
    • dbt: models, sources, tests, project config
    • Infrastructure: Docker Compose, K8s manifests, Terraform
    • Environment: dev/staging/prod configs

    4. Monitoring & Observability

    • Metrics: execution time, records processed, quality scores
    • Alerts: failures, performance degradation, data freshness
    • Dashboards: Grafana/CloudWatch for pipeline health
    • Logging: structured logs with correlation IDs

    5. Operations Guide

    • Deployment procedures and rollback strategy
    • Troubleshooting guide for common issues
    • Scaling guide for increased volume
    • Cost optimization strategies and savings
    • Disaster recovery and backup procedures

    Success Criteria

    • Pipeline meets defined SLA (latency, throughput)
    • Data quality checks pass with >99% success rate
    • Automatic retry and alerting on failures
    • Comprehensive monitoring shows health and performance
    • Documentation enables team maintenance
    • Cost optimization reduces infrastructure costs by 30-50%
    • Schema evolution without downtime
    • End-to-end data lineage tracked

    Limitations

    • Use this skill only when the task clearly matches the scope described above.
    • Do not treat the output as a substitute for environment-specific validation, testing, or expert review.
    • Stop and ask for clarification if required inputs, permissions, safety boundaries, or success criteria are missing.
    Repository: sickn33/antigravity-awesome-skills