Smithery Logo
MCPsSkillsDocsPricing
Login
Smithery Logo

Give agents more agency

Resources

DocumentationPrivacy PolicySystem Status

Company

PricingAboutBlog

Connect

© 2026 Smithery. All rights reserved.

    jeremylongshore

    databricks-observability

    jeremylongshore/databricks-observability
    Data & Analytics
    1,221

    About

    SKILL.md

    Install

    • Telegram
      Telegram
    • Slack
      Slack
    • Claude Code
      Claude Code
    • Codex
      Codex
    • OpenClaw
      OpenClaw
    • Cursor
      Cursor
    • Amp
      Amp
    • GitHub Copilot
      GitHub Copilot
    • Gemini CLI
      Gemini CLI
    • Kilo Code
      Kilo Code
    • Junie
      Junie
    • Replit
      Replit
    • Windsurf
      Windsurf
    • Cline
      Cline
    • Continue
      Continue
    • OpenCode
      OpenCode
    • OpenHands
      OpenHands
    • Roo Code
      Roo Code
    • Augment
      Augment
    • Goose
      Goose
    • Trae
      Trae
    • Zencoder
      Zencoder
    • Antigravity
      Antigravity
    • Download skill
    ├─
    ├─
    └─

    About

    Set up comprehensive observability for Databricks with metrics, traces, and alerts. Use when implementing monitoring for Databricks jobs, setting up dashboards, or configuring alerting for pipeline...

    SKILL.md

    Databricks Observability

    Overview

    Monitor Databricks jobs, clusters, SQL warehouses, and costs using system tables in the system catalog. System tables provide queryable observability data: system.lakeflow (job runs), system.billing (costs), system.query (SQL history), system.access (audit logs), and system.compute (cluster metrics). Data updates throughout the day, not real-time.

    Prerequisites

    • Databricks Premium or Enterprise with Unity Catalog enabled
    • Access to system.billing, system.lakeflow, system.query, and system.access schemas
    • SQL warehouse for running monitoring queries

    Instructions

    Step 1: Job Health Monitoring

    -- Job success/failure over last 24 hours
    SELECT
        COUNT(CASE WHEN result_state = 'SUCCESS' THEN 1 END) AS succeeded,
        COUNT(CASE WHEN result_state = 'FAILED' THEN 1 END) AS failed,
        COUNT(CASE WHEN result_state = 'TIMED_OUT' THEN 1 END) AS timed_out,
        ROUND(100.0 * COUNT(CASE WHEN result_state = 'SUCCESS' THEN 1 END) / COUNT(*), 1) AS success_rate_pct,
        ROUND(AVG(TIMESTAMPDIFF(MINUTE, start_time, end_time)), 1) AS avg_duration_min
    FROM system.lakeflow.job_run_timeline
    WHERE start_time > current_timestamp() - INTERVAL 24 HOURS;
    
    -- Failed jobs with error details
    SELECT job_id, run_name, result_state, start_time, end_time,
           TIMESTAMPDIFF(MINUTE, start_time, end_time) AS duration_min,
           error_message
    FROM system.lakeflow.job_run_timeline
    WHERE result_state = 'FAILED'
      AND start_time > current_timestamp() - INTERVAL 24 HOURS
    ORDER BY start_time DESC;
    

    Step 2: Cluster Utilization and Costs

    -- DBU consumption by cluster (last 7 days)
    SELECT usage_metadata.cluster_id,
           COALESCE(usage_metadata.cluster_name, 'unnamed') AS cluster_name,
           sku_name,
           SUM(usage_quantity) AS total_dbus,
           ROUND(SUM(usage_quantity * p.pricing.default), 2) AS cost_usd
    FROM system.billing.usage u
    LEFT JOIN system.billing.list_prices p ON u.sku_name = p.sku_name
    WHERE u.usage_date >= current_date() - INTERVAL 7 DAYS
    GROUP BY usage_metadata.cluster_id, cluster_name, u.sku_name
    ORDER BY cost_usd DESC
    LIMIT 20;
    

    Step 3: SQL Warehouse Performance

    -- Slow queries (>30s) on SQL warehouses
    SELECT warehouse_id, statement_id, executed_by,
           ROUND(total_duration_ms / 1000, 1) AS duration_sec,
           rows_produced,
           ROUND(bytes_scanned / 1048576, 1) AS scanned_mb,
           LEFT(statement_text, 200) AS query_preview
    FROM system.query.history
    WHERE total_duration_ms > 30000
      AND start_time > current_timestamp() - INTERVAL 24 HOURS
    ORDER BY total_duration_ms DESC
    LIMIT 50;
    
    -- Warehouse queue times (right-sizing indicator)
    SELECT warehouse_id, warehouse_name,
           COUNT(*) AS query_count,
           ROUND(AVG(total_duration_ms) / 1000, 1) AS avg_sec,
           ROUND(MAX(queue_duration_ms) / 1000, 1) AS max_queue_sec
    FROM system.query.history
    WHERE start_time > current_timestamp() - INTERVAL 7 DAYS
    GROUP BY warehouse_id, warehouse_name;
    

    Step 4: Cost-per-Job Analysis

    SELECT j.name AS job_name,
           COUNT(DISTINCT r.run_id) AS run_count,
           ROUND(AVG(TIMESTAMPDIFF(MINUTE, r.start_time, r.end_time)), 1) AS avg_min,
           ROUND(SUM(b.usage_quantity), 1) AS total_dbus,
           ROUND(SUM(b.usage_quantity * p.pricing.default), 2) AS total_cost_usd
    FROM system.lakeflow.job_run_timeline r
    JOIN system.lakeflow.jobs j ON r.job_id = j.job_id
    LEFT JOIN system.billing.usage b
        ON r.run_id = b.usage_metadata.job_run_id
    LEFT JOIN system.billing.list_prices p ON b.sku_name = p.sku_name
    WHERE r.start_time > current_timestamp() - INTERVAL 7 DAYS
    GROUP BY j.name
    ORDER BY total_cost_usd DESC
    LIMIT 15;
    

    Step 5: SQL Alerts for Automated Notifications

    -- Create as SQL Alert: trigger when failure_count > 3
    -- Schedule: every 15 minutes
    -- Notification destination: Slack/email
    
    SELECT COUNT(*) AS failure_count
    FROM system.lakeflow.job_run_timeline
    WHERE result_state = 'FAILED'
      AND start_time > current_timestamp() - INTERVAL 1 HOUR;
    
    from databricks.sdk import WorkspaceClient
    
    w = WorkspaceClient()
    
    # Create SQL alert programmatically
    alert = w.alerts.create(
        name="Hourly Job Failure Alert",
        query_id="<saved-query-id>",
        options={"column": "failure_count", "op": ">", "value": "3"},
        rearm=900,  # re-alert after 15 min if still triggered
    )
    

    Step 6: Export Metrics to External Systems

    from databricks.sdk import WorkspaceClient
    
    w = WorkspaceClient()
    
    # Export cluster state metrics for Prometheus/Datadog
    for cluster in w.clusters.list():
        if cluster.state.value == "RUNNING":
            print(f"databricks_cluster_workers{{name=\"{cluster.cluster_name}\"}} "
                  f"{cluster.num_workers}")
            print(f"databricks_cluster_running{{name=\"{cluster.cluster_name}\"}} 1")
    
    # Export job success rate for Grafana
    runs = list(w.jobs.list_runs(limit=100, completed_only=True))
    success = sum(1 for r in runs if r.state.result_state and r.state.result_state.value == "SUCCESS")
    print(f"databricks_job_success_rate {success / len(runs):.2f}")
    

    Step 7: Audit Log Monitoring

    -- Security: who accessed what in the last 7 days
    SELECT event_time, user_identity.email, action_name,
           request_params, response.status_code
    FROM system.access.audit
    WHERE service_name IN ('unityCatalog', 'jobs', 'clusters')
      AND event_date >= current_date() - 7
      AND action_name NOT IN ('getStatus', 'list')  -- exclude noisy reads
    ORDER BY event_time DESC
    LIMIT 100;
    

    Output

    • Job health dashboard (success rate, duration, failures)
    • Cluster cost breakdown by team and SKU
    • SQL warehouse performance report (slow queries, queue times)
    • Per-job cost analysis
    • Automated SQL alerts with Slack/email notifications
    • External metric export for Prometheus/Grafana

    Error Handling

    Issue Cause Solution
    System tables empty Unity Catalog not enabled Enable in Account Console > Settings
    TABLE_OR_VIEW_NOT_FOUND Schema not accessible Request admin to grant SELECT ON system.billing
    Billing data delayed System table refresh lag (up to 24h) Use for trends and alerts, not real-time
    Query history missing Serverless queries not tracked Use classic SQL warehouse or check retention

    Examples

    Daily Standup Dashboard

    -- Single query for daily pipeline health
    SELECT
        'Last 24h' AS period,
        COUNT(*) AS total_runs,
        COUNT(CASE WHEN result_state = 'SUCCESS' THEN 1 END) AS ok,
        COUNT(CASE WHEN result_state = 'FAILED' THEN 1 END) AS failed,
        ROUND(AVG(TIMESTAMPDIFF(MINUTE, start_time, end_time)), 1) AS avg_min
    FROM system.lakeflow.job_run_timeline
    WHERE start_time > current_timestamp() - INTERVAL 24 HOURS;
    

    Resources

    • System Tables
    • Audit Logs
    • Observability Best Practices
    • SQL Alerts
    Recommended Servers
    Thoughtbox
    Thoughtbox
    Browserbase
    Browserbase
    ThinAir Geo
    ThinAir Geo
    Repository
    jeremylongshore/claude-code-plugins-plus-skills
    Files