    akin-ozer/k8s-debug


    Kubernetes Debugging Skill

    Overview

    Systematic toolkit for debugging Kubernetes clusters, workloads, networking, and storage with a deterministic, safety-first workflow.

    Trigger Phrases

    Use this skill when requests resemble:

    • "My pod is in CrashLoopBackOff; help me find the root cause."
    • "Service DNS works in one pod but not another."
    • "Deployment rollout is stuck."
    • "Pods are Pending and not scheduling."
    • "Cluster health looks degraded after a change."
    • "PVC is pending and pods cannot mount storage."

    Prerequisites

    Run from the skill directory (devops-skills-plugin/skills/k8s-debug) so relative script paths work as written.

    Required

    • kubectl installed and configured.
    • An active cluster context.
    • Read access to namespaces, pods, events, services, and nodes.

    Quick preflight:

    kubectl config current-context
    kubectl auth can-i get pods -A
    kubectl auth can-i get events -A
    kubectl get ns
    

    Optional but Recommended

    • jq for more precise filtering in ./scripts/cluster_health.sh.
    • Metrics API (metrics-server) for kubectl top.
    • In-container debug tools (nslookup, getent, curl, wget, ip) for deep network tests.

    Fallback behavior:

    • If optional tools are missing, scripts continue and print warnings with reduced output.
    • If kubectl top is unavailable, continue with kubectl describe and events.
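
    The fallback rules above can also be checked by hand before running the scripts. The following is a minimal sketch under assumptions, not the scripts' actual logic: have_tool, report_optional_tools, and metrics_available are hypothetical helper names, and the warning text is illustrative.

```shell
# Hypothetical helpers; not part of the skill's scripts.

# have_tool NAME: succeed if NAME resolves to a command on PATH.
have_tool() {
    command -v "$1" >/dev/null 2>&1
}

# report_optional_tools: print one line per optional local tool.
report_optional_tools() {
    for tool in jq; do
        if have_tool "$tool"; then
            echo "ok: $tool found"
        else
            echo "warn: $tool missing; expect reduced output"
        fi
    done
}

# metrics_available: succeed if the Metrics API answers `kubectl top`.
metrics_available() {
    kubectl top nodes >/dev/null 2>&1
}
```

    A possible use is `metrics_available || echo "skipping kubectl top; use kubectl describe and events"`, which mirrors the second fallback rule.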

    When to Use This Skill

    Use this skill for:

    • Pod failures (CrashLoopBackOff, ImagePullBackOff, Pending, OOMKilled)
    • Service connectivity or DNS resolution issues
    • Network policy or ingress problems
    • Volume and storage mount failures
    • Deployment rollout issues
    • Cluster health or performance degradation
    • Resource exhaustion (CPU/memory)
    • Configuration problems (ConfigMaps, Secrets, RBAC)

    Safety Rules for Disruptive Commands

    Default to read-only diagnosis. Run a disruptive command only after confirming its blast radius and a rollback path.

    Commands requiring explicit confirmation:

    • kubectl delete pod ... --force --grace-period=0
    • kubectl drain ...
    • kubectl rollout restart ...
    • kubectl rollout undo ...
    • kubectl debug ... --copy-to=...

    Before disruptive actions:

    # Snapshot current state for rollback and incident notes
    kubectl get deploy,rs,pod,svc -n <namespace> -o wide
    kubectl get pod <pod-name> -n <namespace> -o yaml > before-<pod-name>.yaml
    kubectl get events -n <namespace> --sort-by='.lastTimestamp' > before-events.txt
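
    One way to enforce the explicit-confirmation rule in an interactive shell is a small gate in front of each disruptive command. This is a sketch only: confirm_disruptive is a hypothetical helper, and requiring the literal answer "yes" is an assumption made for illustration.

```shell
# Hypothetical helper; not part of the skill's scripts.
# confirm_disruptive DESCRIPTION: ask on stderr, read one line from stdin,
# and succeed only if the answer is exactly "yes".
confirm_disruptive() {
    printf 'About to run: %s\nType "yes" to continue: ' "$1" >&2
    IFS= read -r answer || return 1
    [ "$answer" = "yes" ]
}

# Example (commented out so that nothing disruptive runs on paste):
# confirm_disruptive "kubectl rollout restart deployment/<name> -n <namespace>" \
#     && kubectl rollout restart deployment/<name> -n <namespace>
```

    Any answer other than "yes", including an empty line or end of input, leaves the cluster untouched.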
    

    Reference Navigation Map

    Load only the section needed for the observed symptom.

    | Symptom / Need | Open | Start section |
    | --- | --- | --- |
    | You need an end-to-end diagnosis path | ./references/troubleshooting_workflow.md | General Debugging Workflow |
    | Pod state is Pending, CrashLoopBackOff, or ImagePullBackOff | ./references/troubleshooting_workflow.md | Pod Lifecycle Troubleshooting |
    | Service reachability or DNS failure | ./references/troubleshooting_workflow.md | Network Troubleshooting Workflow |
    | Node pressure or performance regression | ./references/troubleshooting_workflow.md | Resource and Performance Workflow |
    | PVC / PV / storage class issues | ./references/troubleshooting_workflow.md | Storage Troubleshooting Workflow |
    | Quick symptom-to-fix lookup | ./references/common_issues.md | matching issue heading |
    | Post-mortem fix options for known issues | ./references/common_issues.md | Solutions sections |

    Scripts Overview

    | Script | Purpose | Required args | Optional args | Output | Fallback behavior |
    | --- | --- | --- | --- | --- | --- |
    | ./scripts/cluster_health.sh | Cluster-wide health snapshot (nodes, workloads, events, common failure states) | None | --strict, K8S_REQUEST_TIMEOUT env var | Sectioned report to stdout | Continues on check failures, tracks them in summary and exit code |
    | ./scripts/network_debug.sh | Pod-centric network and DNS diagnostics | <pod-name> (<namespace> defaults to default) | --strict, --insecure, K8S_REQUEST_TIMEOUT env var | Sectioned report to stdout | Uses secure API probe by default; insecure TLS requires explicit --insecure |
    | ./scripts/pod_diagnostics.py | Deep pod diagnostics (status, describe, YAML, events, per-container logs, node context) | <pod-name> | -n/--namespace, -o/--output | Sectioned report to stdout or file | Fails fast on missing access; skips optional metrics/log blocks with clear messages |

    Script Exit Codes

    ./scripts/cluster_health.sh and ./scripts/network_debug.sh share the same contract:

    • 0: checks completed with no check failures (warnings allowed unless --strict is set).
    • 1: one or more checks failed, or warnings occurred in --strict mode.
    • 2: blocked preconditions (for example: missing kubectl, no active context, inaccessible namespace/pod).
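
    When the scripts run from automation, the exit-code contract above can be turned into a readable status line. The wrapper below is a sketch: explain_exit and run_and_explain are hypothetical names, and only the meanings of 0, 1, and 2 are taken from the contract.

```shell
# Hypothetical wrapper; only the 0/1/2 meanings come from the contract above.
# explain_exit CODE: print a one-line interpretation of a script exit code.
explain_exit() {
    case "$1" in
        0) echo "ok: no check failures" ;;
        1) echo "failed: check failures, or warnings in --strict mode" ;;
        2) echo "blocked: precondition not met" ;;
        *) echo "unknown: exit code $1 is outside the documented contract" ;;
    esac
}

# run_and_explain COMMAND [ARGS...]: run a command, report its status on
# stderr, and pass the original exit code through to the caller.
run_and_explain() {
    rc=0
    "$@" || rc=$?
    explain_exit "$rc" >&2
    return "$rc"
}
```

    For example, `run_and_explain ./scripts/cluster_health.sh --strict` keeps the script's exit code for the calling pipeline while adding the explanation to stderr.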

    Deterministic Debugging Workflow

    Follow this systematic approach for any Kubernetes issue:

    1. Preflight and Scope

    kubectl config current-context
    kubectl get ns
    kubectl auth can-i get pods -n <namespace>
    

    If preflight fails, stop and fix access/context first.
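
    The stop-on-failure rule can be made explicit by wrapping the three preflight commands in one function. This is a sketch under assumptions: preflight is a hypothetical helper, and returning 2 merely mirrors the "blocked preconditions" code of the bundled scripts.

```shell
# Hypothetical helper; not part of the skill's scripts.
# preflight NAMESPACE: run the three checks above and stop at the first failure.
# Returning 2 mirrors the scripts' "blocked preconditions" code; that choice
# is an assumption of this sketch.
preflight() {
    ns=$1
    kubectl config current-context >/dev/null 2>&1 || {
        echo "preflight: no active context" >&2; return 2; }
    kubectl get ns >/dev/null 2>&1 || {
        echo "preflight: cannot list namespaces" >&2; return 2; }
    # Compare the printed answer so the check does not depend on exit codes.
    [ "$(kubectl auth can-i get pods -n "$ns" 2>/dev/null)" = "yes" ] || {
        echo "preflight: no permission to get pods in $ns" >&2; return 2; }
    echo "preflight: ok for namespace $ns"
}
```

    A call such as `preflight payments || exit 2` at the top of a debugging session prevents later commands from running against the wrong context.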

    2. Identify the Problem Layer

    Categorize the issue:

    • Application Layer: Application crashes, errors, bugs
    • Pod Layer: Pod not starting, restarting, or pending
    • Service Layer: Network connectivity, DNS issues
    • Node Layer: Node not ready, resource exhaustion
    • Cluster Layer: Control plane issues, API problems
    • Storage Layer: Volume mount failures, PVC issues
    • Configuration Layer: ConfigMap, Secret, RBAC issues

    3. Gather Diagnostics with the Right Script

    Use the appropriate diagnostic script based on scope:

    Pod-Level Diagnostics

    Use ./scripts/pod_diagnostics.py for comprehensive pod analysis:

    python3 ./scripts/pod_diagnostics.py <pod-name> -n <namespace>
    

    This script gathers:

    • Pod status and description
    • Pod events
    • Container logs (current and previous)
    • Resource usage
    • Node information
    • YAML configuration

    Output can be saved for analysis:

    python3 ./scripts/pod_diagnostics.py <pod-name> -n <namespace> -o diagnostics.txt
    

    Cluster-Level Health Check

    Use ./scripts/cluster_health.sh for overall cluster diagnostics:

    ./scripts/cluster_health.sh > cluster-health-$(date +%Y%m%d-%H%M%S).txt
    

    This script checks:

    • Cluster info and version
    • Node status and resources
    • Pods across all namespaces
    • Failed/pending pods
    • Recent events
    • Deployments, services, statefulsets, daemonsets
    • PVCs and PVs
    • Component health
    • Common error states (CrashLoopBackOff, ImagePullBackOff)
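
    If only the failed/pending and back-off checks are needed, they can be approximated with plain kubectl. The sketch below is a manual spot check, not how cluster_health.sh is implemented; list_unhealthy_pods and list_backoff_pods are hypothetical names.

```shell
# Hypothetical helpers; a manual spot check, not the script's implementation.

# list_unhealthy_pods: list pods in any namespace whose phase is neither
# Running nor Succeeded (for example Pending or Failed).
list_unhealthy_pods() {
    kubectl get pods -A \
        --field-selector=status.phase!=Running,status.phase!=Succeeded
}

# list_backoff_pods: filter the STATUS column text for back-off states.
# A pod in CrashLoopBackOff typically still has phase Running, so the
# phase selector above does not catch it.
list_backoff_pods() {
    kubectl get pods -A --no-headers | grep -E 'CrashLoopBackOff|ImagePullBackOff'
}
```

    Running both gives a quick list of candidates to pass to ./scripts/pod_diagnostics.py.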

    Network Diagnostics

    Use ./scripts/network_debug.sh for connectivity issues:

    ./scripts/network_debug.sh <pod-name> <namespace>
    # Add --strict to treat warnings as failures, or --insecure to allow insecure TLS, only when explicitly needed:
    ./scripts/network_debug.sh --strict <pod-name> <namespace>
    ./scripts/network_debug.sh --insecure <pod-name> <namespace>
    

    This script analyzes:

    • Pod network configuration
    • DNS setup and resolution
    • Service endpoints
    • Network policies
    • Connectivity tests
    • CoreDNS logs
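
    For a single manual DNS test, the in-container tools listed under the optional prerequisites can be tried in turn. This sketch does not reproduce network_debug.sh: dns_probe is a hypothetical helper, and trying nslookup before getent is an arbitrary choice.

```shell
# Hypothetical helper; not how network_debug.sh is implemented.
# dns_probe POD NAMESPACE [NAME]: resolve NAME from inside the pod, trying
# nslookup first and getent second, since minimal images may ship neither.
dns_probe() {
    pod=$1 ns=$2 name=${3:-kubernetes.default}
    for tool in "nslookup $name" "getent hosts $name"; do
        # $tool is left unquoted on purpose so it splits into command and arguments.
        if kubectl exec "$pod" -n "$ns" -- $tool 2>/dev/null; then
            return 0
        fi
    done
    echo "dns_probe: could not resolve $name from $pod (tool missing or lookup failed)" >&2
    return 1
}
```

    A failure here does not distinguish between a missing tool and a failed lookup, so follow up with the CoreDNS logs before drawing conclusions.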

    4. Follow Issue-Specific Reference Workflow

    Based on the identified issue, consult ./references/troubleshooting_workflow.md:

    • Pod Pending: Resource/scheduling workflow
    • CrashLoopBackOff: Application crash workflow
    • ImagePullBackOff: Image pull workflow
    • Service issues: Network connectivity workflow
    • DNS failures: DNS troubleshooting workflow
    • Resource exhaustion: Performance investigation workflow
    • Storage issues: PVC binding workflow
    • Deployment stuck: Rollout workflow

    5. Apply Targeted Fixes

    Refer to ./references/common_issues.md for symptom-specific fixes.

    6. Verify and Close

    Run final verification:

    kubectl get pods -n <namespace> -o wide
    kubectl get events -n <namespace> --sort-by='.lastTimestamp' | tail -20
    kubectl rollout status deployment/<name> -n <namespace>
    

    The issue is resolved when user-visible behavior is healthy and no new critical warning events appear.
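
    To check for new warning events without scanning the full event list, the events can be filtered by type. A minimal sketch, assuming a hypothetical helper name recent_warnings:

```shell
# Hypothetical helper; not part of the skill's scripts.
# recent_warnings NAMESPACE: print only Warning events, oldest first.
recent_warnings() {
    kubectl get events -n "$1" \
        --field-selector type=Warning \
        --sort-by='.lastTimestamp'
}
```

    Comparing this output with before-events.txt from the pre-change snapshot shows whether any warnings appeared after the fix.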

    Example Flows

    Example 1: CrashLoopBackOff in payments Namespace

    python3 ./scripts/pod_diagnostics.py payments-api-7c97f95dfb-q9l7k -n payments -o payments-diagnostics.txt
    kubectl logs payments-api-7c97f95dfb-q9l7k -n payments --previous --tail=100
    kubectl get deploy payments-api -n payments -o yaml | grep -A 8 livenessProbe
    

    Then open ./references/common_issues.md and apply the CrashLoopBackOff solutions.
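
    Before choosing among the CrashLoopBackOff solutions, it can help to read why the previous container instance ended. The sketch below uses kubectl's JSONPath output; last_exit_info is a hypothetical helper name.

```shell
# Hypothetical helper; not part of the skill's scripts.
# last_exit_info POD NAMESPACE: print, per container, the name followed by the
# reason and exit code of its previous termination (blank if it never restarted).
last_exit_info() {
    kubectl get pod "$1" -n "$2" -o \
        jsonpath='{range .status.containerStatuses[*]}{.name}{"\t"}{.lastState.terminated.reason}{"\t"}{.lastState.terminated.exitCode}{"\n"}{end}'
}
```

    For example, `last_exit_info payments-api-7c97f95dfb-q9l7k payments` reporting the reason OOMKilled points at memory limits, while Error with a non-zero exit code points at the application or its configuration.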

    Example 2: Service DNS/Connectivity Failure

    ./scripts/network_debug.sh checkout-api-75f49c9d8f-z6qtm checkout
    kubectl get svc checkout-api -n checkout
    kubectl get endpoints checkout-api -n checkout
    kubectl get networkpolicies -n checkout
    

    Then follow the Network Troubleshooting Workflow in ./references/troubleshooting_workflow.md.
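
    When the endpoints list comes back empty, a common cause is a Service selector that matches no ready pods. The sketch below prints both sides for comparison; svc_selector_pods is a hypothetical helper name.

```shell
# Hypothetical helper; not part of the skill's scripts.
# svc_selector_pods SERVICE NAMESPACE: print the Service's selector, then the
# pods in the namespace with their labels, for a side-by-side comparison.
svc_selector_pods() {
    echo "selector:"
    kubectl get svc "$1" -n "$2" -o jsonpath='{.spec.selector}'
    echo
    echo "pods:"
    kubectl get pods -n "$2" --show-labels
}
```

    With the names from this example, the call would be `svc_selector_pods checkout-api checkout`.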

    Essential Manual Commands

    Pod Debugging

    # View pod status
    kubectl get pods -n <namespace> -o wide
    
    # Detailed pod information
    kubectl describe pod <pod-name> -n <namespace>
    
    # View logs
    kubectl logs <pod-name> -n <namespace>
    kubectl logs <pod-name> -n <namespace> --previous  # Previous container
    kubectl logs <pod-name> -n <namespace> -c <container>  # Specific container
    
    # Execute commands in pod
    kubectl exec <pod-name> -n <namespace> -it -- /bin/sh
    
    # Get pod YAML
    kubectl get pod <pod-name> -n <namespace> -o yaml
    

    Service and Network Debugging

    # Check services
    kubectl get svc -n <namespace>
    kubectl describe svc <service-name> -n <namespace>
    
    # Check endpoints
    kubectl get endpoints -n <namespace>
    
    # Test DNS
    kubectl exec <pod-name> -n <namespace> -- nslookup kubernetes.default
    
    # View events
    kubectl get events -n <namespace> --sort-by='.lastTimestamp'
    

    Resource Monitoring

    # Node resources
    kubectl top nodes
    kubectl describe nodes
    
    # Pod resources
    kubectl top pods -n <namespace>
    kubectl top pod <pod-name> -n <namespace> --containers
    

    Emergency Operations

    # Restart deployment
    kubectl rollout restart deployment/<name> -n <namespace>
    
    # Rollback deployment
    kubectl rollout undo deployment/<name> -n <namespace>
    
    # Force delete stuck pod
    kubectl delete pod <pod-name> -n <namespace> --force --grace-period=0
    
    # Drain node (maintenance)
    kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data
    
    # Cordon node (prevent scheduling)
    kubectl cordon <node-name>
    

    Completion Criteria

    A troubleshooting session is complete when all of the following are true:

    • Cluster context and namespace are confirmed.
    • Relevant diagnostic script output is captured.
    • Root cause is identified and tied to evidence (events/logs/config/state).
    • Any disruptive action was preceded by a snapshot and a rollback plan.
    • Fix verification commands show healthy state.
    • Reference path used (./references/troubleshooting_workflow.md or ./references/common_issues.md) is documented in notes.

    Related Tools

    Useful additional tools for Kubernetes debugging:

    • kubectl-debug: Advanced debugging plugin
    • stern: Multi-pod log tailing
    • kubectx/kubens: Context and namespace switching
    • k9s: Terminal UI for Kubernetes
    • lens: Desktop IDE for Kubernetes
    • Prometheus/Grafana: Monitoring and alerting
    • Jaeger/Zipkin: Distributed tracing
    Repository
    akin-ozer/cc-devops-skills