    ankurkumarz/check-cluster-health
    DevOps

    About

    Performs a comprehensive health check of a Kubernetes cluster.

    SKILL.md

    Check Cluster Health

    Perform a comprehensive health check of the Kubernetes cluster infrastructure.

    When to Use

    • Initial investigation of any production issue
    • Before deep-diving into specific pods or services
    • User reports "something is wrong" without specifics
    • Periodic health checks
    • Post-deployment validation
    • After scaling events or cluster changes

    Skill Objective

    Quickly assess the overall state of the Kubernetes cluster to identify:

    • Node health and resource pressure
    • Pod health across all namespaces
    • System component status
    • Recent critical events
    • Resource constraints or bottlenecks

    Investigation Steps

    Step 1: Check Node Health

    Get overview of all nodes in the cluster:

    kubectl get nodes -o wide
    

    Look for:

    • Nodes in NotReady state
    • Node ages (very old or very new nodes)
    • Kubernetes versions (version skew)
    • Internal/External IPs

    Expected Output:

    NAME      STATUS   ROLES           AGE   VERSION
    node-1    Ready    control-plane   45d   v1.28.0
    node-2    Ready    <none>          45d   v1.28.0
    node-3    Ready    <none>          45d   v1.28.0
    node-4    NotReady <none>          45d   v1.28.0  ⚠️
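
    To surface only unhealthy nodes quickly, a one-liner such as the following can help (a minimal sketch assuming a POSIX shell with awk; it also matches cordoned nodes showing Ready,SchedulingDisabled):

    kubectl get nodes --no-headers | awk '$2 != "Ready"'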
    

    Step 2: Check Node Resource Usage

    Get current CPU and memory utilization:

    kubectl top nodes
    

    Look for:

    • CPU usage > 80%
    • Memory usage > 85%
    • Significant imbalance between nodes

    Expected Output:

    NAME     CPU(cores)   CPU%   MEMORY(bytes)   MEMORY%
    node-1   450m         22%    4Gi             50%
    node-2   890m         44%    6Gi             75%
    node-3   1200m        60%    7Gi             87%  ⚠️
    node-4   100m         5%     2Gi             25%
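
    On recent kubectl versions, the same view can be sorted so the busiest nodes stand out:

    kubectl top nodes --sort-by=memory
    kubectl top nodes --sort-by=cpu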
    

    Step 3: Check Node Conditions

    Inspect for resource pressure conditions:

    kubectl describe nodes | grep -A 8 "Conditions:"
    

    Look for:

    • MemoryPressure: True
    • DiskPressure: True
    • PIDPressure: True
    • NetworkUnavailable: True

    Critical Conditions:

    Conditions:
      Type             Status  Reason
      ----             ------  ------
      MemoryPressure   True    KubeletHasInsufficientMemory  ⚠️
      DiskPressure     False   KubeletHasNoDiskPressure
      PIDPressure      False   KubeletHasSufficientPID
      Ready            True    KubeletReady
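
    If jq is available, a sketch like this lists only nodes reporting a condition other than Ready as True, which is usually a sign of pressure:

    kubectl get nodes -o json | jq -r '.items[] | .metadata.name as $n | .status.conditions[] | select(.status == "True" and .type != "Ready") | "\($n)\t\(.type)\t\(.reason)"'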
    

    Step 4: Find Problematic Pods

    Get all pods that are not in Running or Succeeded state:

    kubectl get pods --all-namespaces --field-selector=status.phase!=Running,status.phase!=Succeeded
    

    Alternative - get all pods with issues:

    kubectl get pods --all-namespaces | grep -vE 'Running|Completed|Succeeded'
    

    Look for:

    • CrashLoopBackOff
    • ImagePullBackOff
    • Pending
    • Error
    • Evicted
    • OOMKilled

    Expected Output:

    NAMESPACE   NAME                    READY   STATUS             RESTARTS   AGE
    api         api-service-abc         0/1     CrashLoopBackOff   5          10m  ⚠️
    api         api-service-xyz         0/1     OOMKilled          3          15m  ⚠️
    default     worker-123              0/1     Pending            0          5m   ⚠️
    monitoring  prometheus-456          0/2     ImagePullBackOff   0          20m  ⚠️
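
    Restart counts are another quick signal; sorting by them (first container only, so treat it as a rough heuristic) pushes the most frequently crashing pods to the end of the list:

    kubectl get pods --all-namespaces --sort-by='.status.containerStatuses[0].restartCount'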
    

    Step 5: Check System Components

    Verify kube-system namespace health:

    kubectl get pods -n kube-system
    

    Critical components to check:

    • kube-apiserver
    • kube-controller-manager
    • kube-scheduler
    • etcd
    • coredns (or kube-dns)
    • kube-proxy

    Expected Output:

    NAME                              READY   STATUS    RESTARTS   AGE
    coredns-565d847f94-abcde          1/1     Running   0          45d
    coredns-565d847f94-fghij          1/1     Running   0          45d
    etcd-node-1                       1/1     Running   0          45d
    kube-apiserver-node-1             1/1     Running   0          45d
    kube-controller-manager-node-1    1/1     Running   0          45d
    kube-proxy-klmno                  1/1     Running   0          45d
    kube-scheduler-node-1             1/1     Running   0          45d
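
    It can also be worth checking the controllers behind these pods, since a kube-system Deployment or DaemonSet with fewer ready replicas than desired points at a degraded component:

    kubectl get deployments,daemonsets -n kube-system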
    

    Step 6: Review Recent Critical Events

    Get the most recent events, sorted by timestamp and filtered for warnings and errors:

    kubectl get events --all-namespaces --sort-by='.lastTimestamp' | tail -50 | grep -E 'Warning|Error'
    

    Alternative - more structured:

    kubectl get events --all-namespaces --sort-by='.lastTimestamp' --field-selector type!=Normal
    

    Look for patterns:

    • Repeated OOMKilled events
    • FailedScheduling (resource constraints)
    • FailedMount (volume issues)
    • ImagePullBackOff (registry issues)
    • Evictions (resource pressure)
    • BackOff (crashing containers)

    Expected Output:

    10m  Warning  FailedScheduling  pod/worker-123    0/4 nodes available: insufficient memory
    8m   Warning  BackOff           pod/api-service   Back-off restarting failed container
    5m   Warning  OOMKilled         pod/api-service   Container exceeded memory limit
    3m   Warning  Evicted           pod/cache-789     The node was low on resource: memory
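
    To see which failure modes dominate, warning events can be grouped by reason; a rough one-liner (output details may vary by kubectl version):

    kubectl get events --all-namespaces --field-selector type=Warning -o jsonpath='{range .items[*]}{.reason}{"\n"}{end}' | sort | uniq -c | sort -rn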
    

    Step 7: Check for Evicted Pods

    Find pods that were evicted due to resource pressure:

    kubectl get pods --all-namespaces --field-selector=status.phase=Failed | grep Evicted
    

    Evictions indicate:

    • Node resource pressure (memory/disk)
    • Need for resource limits/requests tuning
    • Possible need for cluster scaling
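
    Grouping evictions by node helps confirm whether the pressure is concentrated on a single node; one possible sketch, assuming jq is available:

    kubectl get pods --all-namespaces --field-selector=status.phase=Failed -o json | jq -r '.items[] | select(.status.reason == "Evicted") | .spec.nodeName' | sort | uniq -c | sort -rn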

    Step 8: Review Resource Allocation

    Check cluster-wide resource allocation:

    kubectl describe nodes | grep -A 7 "Allocated resources:"
    

    Look for:

    • CPU allocation > 80%
    • Memory allocation > 80%
    • Pods per node approaching limits

    Expected Output:

    Allocated resources:
      (Total limits may be over 100 percent, i.e., overcommitted.)
      Resource           Requests      Limits
      --------           --------      ------
      cpu                3800m (95%)   7200m (180%)  ⚠️
      memory             24Gi (75%)    32Gi (100%)   ⚠️
      ephemeral-storage  0 (0%)        0 (0%)
      hugepages-2Mi      0 (0%)        0 (0%)
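
    A pod count per node is a useful companion check, since nodes also have a pod-capacity limit (110 by default); one rough sketch, which counts completed pods as well:

    kubectl get pods --all-namespaces -o jsonpath='{range .items[*]}{.spec.nodeName}{"\n"}{end}' | sort | uniq -c | sort -rn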
    

    MCP Tools to Use

    kubernetes.get_nodes()
    kubernetes.get_node_metrics()
    kubernetes.describe_node(node_name)
    kubernetes.get_pods(namespace="all", field_selector="status.phase!=Running")
    kubernetes.get_pods(namespace="kube-system")
    kubernetes.get_events(namespace="all", since="1h", field_selector="type!=Normal")
    

    Output Format

    Provide a structured summary in this format:

    # CLUSTER HEALTH SUMMARY
    ========================
    
    ## Cluster Overview
    - **Total Nodes:** 5
    - **Healthy Nodes:** 4
    - **Unhealthy Nodes:** 1
    - **Kubernetes Version:** v1.28.0
    
    ## Node Health
    
    ### Healthy Nodes ✓
    - node-1: Ready (CPU: 22%, Memory: 50%)
    - node-2: Ready (CPU: 44%, Memory: 75%)
    - node-3: Ready (CPU: 60%, Memory: 87%) ⚠️ High memory
    
    ### Unhealthy Nodes ⚠️
    - **node-4:** NotReady
      - Condition: KubeletNotReady
      - Reason: Node had insufficient memory
      - Duration: 15 minutes
    
    ## Pod Health Summary
    
    **Total Pods:** 127
    - Running: 120
    - Pending: 4 ⚠️
    - CrashLoopBackOff: 2 ⚠️
    - ImagePullBackOff: 1 ⚠️
    
    ### Critical Pod Issues
    
    1. **api-service-abc** (namespace: api)
       - Status: CrashLoopBackOff
       - Restarts: 5 times in 10 minutes
       - Action needed: Investigate with debug-pod-issues skill
    
    2. **api-service-xyz** (namespace: api)
       - Status: OOMKilled
       - Restarts: 3 times in 15 minutes
       - Action needed: Memory limit investigation required
    
    3. **worker-123** (namespace: default)
       - Status: Pending
       - Reason: Insufficient memory to schedule
       - Action needed: Resource analysis needed
    
    4. **prometheus-456** (namespace: monitoring)
       - Status: ImagePullBackOff
       - Reason: Failed to pull image
       - Action needed: Check registry connectivity
    
    ## System Components ✓
    
    All critical system components healthy:
    - coredns: 2/2 pods running
    - kube-apiserver: Running
    - kube-controller-manager: Running
    - kube-scheduler: Running
    - etcd: Running
    - kube-proxy: DaemonSet 5/5 ready
    
    ## Recent Critical Events (Last 60 minutes)
    
    **OOM Kills:** 3 occurrences
    - 14:23: api-service-xyz OOMKilled (namespace: api)
    - 14:25: api-service-xyz OOMKilled (namespace: api)
    - 14:27: api-service-xyz OOMKilled (namespace: api)
    
    **Scheduling Failures:** 4 occurrences
    - 14:20: worker-123 FailedScheduling: insufficient memory
    - 14:22: worker-456 FailedScheduling: insufficient memory
    - 14:25: worker-789 FailedScheduling: insufficient memory
    - 14:28: cache-abc FailedScheduling: insufficient cpu
    
    **Node Issues:**
    - 14:15: node-4 NodeNotReady: KubeletNotReady
    
    **Evictions:** 2 occurrences
    - 14:18: cache-xyz Evicted: node low on memory
    - 14:22: cache-abc Evicted: node low on memory
    
    ## Resource Pressure Analysis
    
    ### Node-4: MemoryPressure Detected ⚠️
    - Current usage: 28Gi / 32Gi (87%)
    - Condition: MemoryPressure True
    - Impact: Pods may be evicted
    - Action: Investigate high memory consumers
    
    ### Cluster-Wide Resource Allocation
    - **CPU:** 75% allocated (approaching capacity)
    - **Memory:** 82% allocated ⚠️ (critical threshold)
    - **Risk:** New pods may not schedule
    
    ## Issues Detected
    
    ### 🚨 CRITICAL Issues (Require Immediate Action)
    
    1. **Multiple OOM Kills in api namespace**
       - Impact: Service degradation/outages
       - Pods affected: api-service-xyz
       - Recommendation: Increase memory limits or investigate memory leak
       - Next step: Use `debug-pod-issues` skill
    
    2. **Node-4 Unhealthy (NotReady)**
       - Impact: Reduced cluster capacity
       - Duration: 15 minutes
       - Recommendation: Investigate node logs, consider cordoning/draining
       - Next step: SSH to node or check kubelet logs
    
    3. **Cluster Memory Capacity Critical (82% allocated)**
       - Impact: Risk of scheduling failures
       - Pods pending: 4
       - Recommendation: Scale cluster or optimize workloads
       - Next step: Use `analyze-resource-usage` skill
    
    ### ⚠️ WARNING Issues (Should Be Addressed)
    
    4. **Node-3 High Memory Usage (87%)**
       - Impact: Risk of pressure condition
       - Current state: Still Ready
       - Recommendation: Monitor closely, consider rebalancing pods
    
    5. **ImagePullBackOff in monitoring namespace**
       - Impact: Prometheus not available
       - Likely cause: Registry connectivity or credentials
       - Recommendation: Check image repository access
    
    ## Recommended Actions (Priority Order)
    
    ### Immediate (Next 15 minutes)
    1. **Investigate api-service OOM kills** → Use `debug-pod-issues` skill on api-service-xyz
    2. **Check node-4 status** → SSH to node or review kubelet logs
    3. **Review pending pods** → Use `analyze-resource-usage` to understand capacity
    
    ### Short Term (Next hour)
    4. Increase memory limits for api-service pods
    5. Consider scaling cluster (add nodes or upsize)
    6. Fix ImagePullBackOff for prometheus
    7. Investigate memory usage on node-3
    
    ### Long Term (This week)
    8. Implement pod resource requests/limits across all workloads
    9. Set up cluster autoscaling
    10. Review and optimize memory-intensive workloads
    11. Implement monitoring alerts for:
        - Node NotReady conditions
        - OOM kill events
        - Resource allocation thresholds (>80%)
        - Pod evictions
    
    ## Next Steps
    
    Based on the findings, I recommend:
    
    1. **Deep dive into OOM issues** → Skill: `debug-pod-issues`
       - Target: api-service-xyz in api namespace
       
    2. **Analyze resource usage patterns** → Skill: `analyze-resource-usage`
       - Focus on memory consumption and allocation
       
    3. **Check logs for crash patterns** → Skill: `inspect-logs`
       - Target: api-service pods for error patterns
    
    Would you like me to proceed with investigating the OOM kills in the api-service pods?
    

    Red Flags to Watch For

    • 🚨 Node NotReady status - Immediate impact on capacity
    • 🚨 Multiple pods in CrashLoopBackOff - Application issues
    • 🚨 Repeated OOMKilled events - Memory configuration problems
    • 🚨 System component failures - Cluster instability
    • 🚨 High resource allocation (>85%) - Scheduling issues imminent
    • ⚠️ High restart counts (>5 in last hour) - Application instability
    • ⚠️ Pending pods - Resource constraints
    • ⚠️ ImagePullBackOff - Registry or networking issues
    • ⚠️ Volume mount failures - Storage problems
    • ⚠️ Evicted pods - Node resource pressure

    Decision Tree - Next Skill to Use

    Based on findings, recommend next skill:
    
    If OOMKilled or CrashLoopBackOff detected:
      → Use `debug-pod-issues` skill
    
    If high CPU/Memory usage detected:
      → Use `analyze-resource-usage` skill
    
    If connection errors in events:
      → Use `check-network-connectivity` skill
    
    If errors in events but pods running:
      → Use `inspect-logs` skill
    
    If multiple issues:
      → Prioritize by severity, start with pod crashes
    

    Common Patterns & Root Causes

    Pattern: Multiple OOM Kills

    Indicates: Memory limits too low or memory leak
    Next Action: debug-pod-issues + inspect-logs

    Pattern: Many Pending Pods

    Indicates: Insufficient cluster capacity
    Next Action: analyze-resource-usage

    Pattern: Node NotReady + Evictions

    Indicates: Node resource exhaustion
    Next Action: Investigate node directly, consider draining

    Pattern: System Component Failure

    Indicates: Critical cluster issue
    Next Action: Immediate investigation, possibly escalate

    Pattern: ImagePullBackOff

    Indicates: Registry access issues
    Next Action: Check network connectivity, registry credentials

    Skill Completion Criteria

    This skill is complete when:

    • ✓ Node health assessed
    • ✓ Pod health across all namespaces evaluated
    • ✓ System components verified
    • ✓ Recent critical events reviewed
    • ✓ Issues categorized by severity
    • ✓ Recommended next steps provided
    • ✓ Clear indication of which skill to use next

    Notes for Agent

    • Always start with this skill for vague issues
    • Provide executive summary before detailed findings
    • Categorize issues by severity (Critical/Warning/Info)
    • Explicitly recommend next skill based on findings
    • Include specific pod names, namespaces, timestamps
    • Highlight patterns, not just individual issues
    • Keep summary concise but comprehensive