k8s-debug

akin-ozer/k8s-debug

DevOps

1 installs

About

SKILL.md

k8s-debug

akin-ozer/k8s-debug

DevOps

1 installs

About

Comprehensive Kubernetes debugging and troubleshooting toolkit...

SKILL.md

Kubernetes Debugging Skill

Overview

Systematic toolkit for debugging Kubernetes clusters, workloads, networking, and storage with a deterministic, safety-first workflow.

Trigger Phrases

Use this skill when requests resemble:

"My pod is in CrashLoopBackOff; help me find the root cause."
"Service DNS works in one pod but not another."
"Deployment rollout is stuck."
"Pods are Pending and not scheduling."
"Cluster health looks degraded after a change."
"PVC is pending and pods cannot mount storage."

Prerequisites

Run from the skill directory (devops-skills-plugin/skills/k8s-debug) so relative script paths work as written.

Required

kubectl installed and configured.
An active cluster context.
Read access to namespaces, pods, events, services, and nodes.

Quick preflight:

kubectl config current-context
kubectl auth can-i get pods -A
kubectl auth can-i get events -A
kubectl get ns

Optional but Recommended

jq for more precise filtering in ./scripts/cluster_health.sh.
Metrics API (metrics-server) for kubectl top.
In-container debug tools (nslookup, getent, curl, wget, ip) for deep network tests.

Fallback behavior:

If optional tools are missing, scripts continue and print warnings with reduced output.
If kubectl top is unavailable, continue with kubectl describe and events.

When to Use This Skill

Use this skill for:

Pod failures (CrashLoopBackOff, ImagePullBackOff, Pending, OOMKilled)
Service connectivity or DNS resolution issues
Network policy or ingress problems
Volume and storage mount failures
Deployment rollout issues
Cluster health or performance degradation
Resource exhaustion (CPU/memory)
Configuration problems (ConfigMaps, Secrets, RBAC)

Safety Rules for Disruptive Commands

Default mode is read-only diagnosis first. Only execute disruptive commands after confirming blast radius and rollback.

Commands requiring explicit confirmation:

kubectl delete pod ... --force --grace-period=0
kubectl drain ...
kubectl rollout restart ...
kubectl rollout undo ...
kubectl debug ... --copy-to=...

Before disruptive actions:

# Snapshot current state for rollback and incident notes
kubectl get deploy,rs,pod,svc -n <namespace> -o wide
kubectl get pod <pod-name> -n <namespace> -o yaml > before-<pod-name>.yaml
kubectl get events -n <namespace> --sort-by='.lastTimestamp' > before-events.txt

Reference Navigation Map

Load only the section needed for the observed symptom.

Symptom / Need	Open	Start section
You need an end-to-end diagnosis path	`./references/troubleshooting_workflow.md`	`General Debugging Workflow`
Pod state is `Pending`, `CrashLoopBackOff`, or `ImagePullBackOff`	`./references/troubleshooting_workflow.md`	`Pod Lifecycle Troubleshooting`
Service reachability or DNS failure	`./references/troubleshooting_workflow.md`	`Network Troubleshooting Workflow`
Node pressure or performance regression	`./references/troubleshooting_workflow.md`	`Resource and Performance Workflow`
PVC / PV / storage class issues	`./references/troubleshooting_workflow.md`	`Storage Troubleshooting Workflow`
Quick symptom-to-fix lookup	`./references/common_issues.md`	matching issue heading
Post-mortem fix options for known issues	`./references/common_issues.md`	`Solutions` sections

Scripts Overview

Script	Purpose	Required args	Optional args	Output	Fallback behavior
`./scripts/cluster_health.sh`	Cluster-wide health snapshot (nodes, workloads, events, common failure states)	None	`--strict`, `K8S_REQUEST_TIMEOUT` env var	Sectioned report to stdout	Continues on check failures, tracks them in summary and exit code
`./scripts/network_debug.sh`	Pod-centric network and DNS diagnostics	`<pod-name>` (`<namespace>` defaults to `default`)	`--strict`, `--insecure`, `K8S_REQUEST_TIMEOUT` env var	Sectioned report to stdout	Uses secure API probe by default; insecure TLS requires explicit `--insecure`
`./scripts/pod_diagnostics.py`	Deep pod diagnostics (status, describe, YAML, events, per-container logs, node context)	`<pod-name>`	`-n/--namespace`, `-o/--output`	Sectioned report to stdout or file	Fails fast on missing access; skips optional metrics/log blocks with clear messages

Script Exit Codes

./scripts/cluster_health.sh and ./scripts/network_debug.sh share the same contract:

0: checks completed with no check failures (warnings allowed unless --strict is set).
1: one or more checks failed, or warnings occurred in --strict mode.
2: blocked preconditions (for example: missing kubectl, no active context, inaccessible namespace/pod).

Deterministic Debugging Workflow

Follow this systematic approach for any Kubernetes issue:

1. Preflight and Scope

kubectl config current-context
kubectl get ns
kubectl auth can-i get pods -n <namespace>

If preflight fails, stop and fix access/context first.

2. Identify the Problem Layer

Categorize the issue:

Application Layer: Application crashes, errors, bugs
Pod Layer: Pod not starting, restarting, or pending
Service Layer: Network connectivity, DNS issues
Node Layer: Node not ready, resource exhaustion
Cluster Layer: Control plane issues, API problems
Storage Layer: Volume mount failures, PVC issues
Configuration Layer: ConfigMap, Secret, RBAC issues

3. Gather Diagnostics with the Right Script

Use the appropriate diagnostic script based on scope:

Pod-Level Diagnostics

Use ./scripts/pod_diagnostics.py for comprehensive pod analysis:

python3 ./scripts/pod_diagnostics.py <pod-name> -n <namespace>

This script gathers:

Pod status and description
Pod events
Container logs (current and previous)
Resource usage
Node information
YAML configuration

Output can be saved for analysis:

python3 ./scripts/pod_diagnostics.py <pod-name> -n <namespace> -o diagnostics.txt

Cluster-Level Health Check

Use ./scripts/cluster_health.sh for overall cluster diagnostics:

./scripts/cluster_health.sh > cluster-health-$(date +%Y%m%d-%H%M%S).txt

This script checks:

Cluster info and version
Node status and resources
Pods across all namespaces
Failed/pending pods
Recent events
Deployments, services, statefulsets, daemonsets
PVCs and PVs
Component health
Common error states (CrashLoopBackOff, ImagePullBackOff)

Network Diagnostics

Use ./scripts/network_debug.sh for connectivity issues:

./scripts/network_debug.sh <namespace> <pod-name>
# or force warning sensitivity / insecure TLS only when explicitly needed:
./scripts/network_debug.sh --strict <namespace> <pod-name>
./scripts/network_debug.sh --insecure <namespace> <pod-name>

This script analyzes:

Pod network configuration
DNS setup and resolution
Service endpoints
Network policies
Connectivity tests
CoreDNS logs

4. Follow Issue-Specific Reference Workflow

Based on the identified issue, consult ./references/troubleshooting_workflow.md:

Pod Pending: Resource/scheduling workflow
CrashLoopBackOff: Application crash workflow
ImagePullBackOff: Image pull workflow
Service issues: Network connectivity workflow
DNS failures: DNS troubleshooting workflow
Resource exhaustion: Performance investigation workflow
Storage issues: PVC binding workflow
Deployment stuck: Rollout workflow

5. Apply Targeted Fixes

Refer to ./references/common_issues.md for symptom-specific fixes.

6. Verify and Close

Run final verification:

kubectl get pods -n <namespace> -o wide
kubectl get events -n <namespace> --sort-by='.lastTimestamp' | tail -20
kubectl rollout status deployment/<name> -n <namespace>

Issue is done when user-visible behavior is healthy and no new critical warning events appear.

Example Flows

Example 1: CrashLoopBackOff in `payments` Namespace

python3 ./scripts/pod_diagnostics.py payments-api-7c97f95dfb-q9l7k -n payments -o payments-diagnostics.txt
kubectl logs payments-api-7c97f95dfb-q9l7k -n payments --previous --tail=100
kubectl get deploy payments-api -n payments -o yaml | grep -A 8 livenessProbe

Then open ./references/common_issues.md and apply the CrashLoopBackOff solutions.

Example 2: Service DNS/Connectivity Failure

./scripts/network_debug.sh checkout checkout-api-75f49c9d8f-z6qtm
kubectl get svc checkout-api -n checkout
kubectl get endpoints checkout-api -n checkout
kubectl get networkpolicies -n checkout

Then follow Service Connectivity Workflow in ./references/troubleshooting_workflow.md.

Essential Manual Commands

Pod Debugging

# View pod status
kubectl get pods -n <namespace> -o wide

# Detailed pod information
kubectl describe pod <pod-name> -n <namespace>

# View logs
kubectl logs <pod-name> -n <namespace>
kubectl logs <pod-name> -n <namespace> --previous  # Previous container
kubectl logs <pod-name> -n <namespace> -c <container>  # Specific container

# Execute commands in pod
kubectl exec <pod-name> -n <namespace> -it -- /bin/sh

# Get pod YAML
kubectl get pod <pod-name> -n <namespace> -o yaml

Service and Network Debugging

# Check services
kubectl get svc -n <namespace>
kubectl describe svc <service-name> -n <namespace>

# Check endpoints
kubectl get endpoints -n <namespace>

# Test DNS
kubectl exec <pod-name> -n <namespace> -- nslookup kubernetes.default

# View events
kubectl get events -n <namespace> --sort-by='.lastTimestamp'

Resource Monitoring

# Node resources
kubectl top nodes
kubectl describe nodes

# Pod resources
kubectl top pods -n <namespace>
kubectl top pod <pod-name> -n <namespace> --containers

Emergency Operations

# Restart deployment
kubectl rollout restart deployment/<name> -n <namespace>

# Rollback deployment
kubectl rollout undo deployment/<name> -n <namespace>

# Force delete stuck pod
kubectl delete pod <pod-name> -n <namespace> --force --grace-period=0

# Drain node (maintenance)
kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data

# Cordon node (prevent scheduling)
kubectl cordon <node-name>

Completion Criteria

Troubleshooting session is complete when all are true:

Cluster context and namespace are confirmed.
Relevant diagnostic script output is captured.
Root cause is identified and tied to evidence (events/logs/config/state).
Any disruptive action was preceded by snapshot and rollback plan.
Fix verification commands show healthy state.
Reference path used (./references/troubleshooting_workflow.md or ./references/common_issues.md) is documented in notes.

Related Tools

Useful additional tools for Kubernetes debugging:

kubectl-debug: Advanced debugging plugin
stern: Multi-pod log tailing
kubectx/kubens: Context and namespace switching
k9s: Terminal UI for Kubernetes
lens: Desktop IDE for Kubernetes
Prometheus/Grafana: Monitoring and alerting
Jaeger/Zipkin: Distributed tracing

About

SKILL.md

About

Comprehensive Kubernetes debugging and troubleshooting toolkit...

SKILL.md

Kubernetes Debugging Skill

Overview

Systematic toolkit for debugging Kubernetes clusters, workloads, networking, and storage with a deterministic, safety-first workflow.

Trigger Phrases

Use this skill when requests resemble:

"My pod is in CrashLoopBackOff; help me find the root cause."
"Service DNS works in one pod but not another."
"Deployment rollout is stuck."
"Pods are Pending and not scheduling."
"Cluster health looks degraded after a change."
"PVC is pending and pods cannot mount storage."

Prerequisites

Run from the skill directory (devops-skills-plugin/skills/k8s-debug) so relative script paths work as written.

Required

kubectl installed and configured.
An active cluster context.
Read access to namespaces, pods, events, services, and nodes.

Quick preflight:

kubectl config current-context
kubectl auth can-i get pods -A
kubectl auth can-i get events -A
kubectl get ns

Optional but Recommended

jq for more precise filtering in ./scripts/cluster_health.sh.
Metrics API (metrics-server) for kubectl top.
In-container debug tools (nslookup, getent, curl, wget, ip) for deep network tests.

Fallback behavior:

If optional tools are missing, scripts continue and print warnings with reduced output.
If kubectl top is unavailable, continue with kubectl describe and events.

When to Use This Skill

Use this skill for:

Pod failures (CrashLoopBackOff, ImagePullBackOff, Pending, OOMKilled)
Service connectivity or DNS resolution issues
Network policy or ingress problems
Volume and storage mount failures
Deployment rollout issues
Cluster health or performance degradation
Resource exhaustion (CPU/memory)
Configuration problems (ConfigMaps, Secrets, RBAC)

Safety Rules for Disruptive Commands

Default mode is read-only diagnosis first. Only execute disruptive commands after confirming blast radius and rollback.

Commands requiring explicit confirmation:

kubectl delete pod ... --force --grace-period=0
kubectl drain ...
kubectl rollout restart ...
kubectl rollout undo ...
kubectl debug ... --copy-to=...

Before disruptive actions:

# Snapshot current state for rollback and incident notes
kubectl get deploy,rs,pod,svc -n <namespace> -o wide
kubectl get pod <pod-name> -n <namespace> -o yaml > before-<pod-name>.yaml
kubectl get events -n <namespace> --sort-by='.lastTimestamp' > before-events.txt

Reference Navigation Map

Load only the section needed for the observed symptom.

Symptom / Need	Open	Start section
You need an end-to-end diagnosis path	`./references/troubleshooting_workflow.md`	`General Debugging Workflow`
Pod state is `Pending`, `CrashLoopBackOff`, or `ImagePullBackOff`	`./references/troubleshooting_workflow.md`	`Pod Lifecycle Troubleshooting`
Service reachability or DNS failure	`./references/troubleshooting_workflow.md`	`Network Troubleshooting Workflow`
Node pressure or performance regression	`./references/troubleshooting_workflow.md`	`Resource and Performance Workflow`
PVC / PV / storage class issues	`./references/troubleshooting_workflow.md`	`Storage Troubleshooting Workflow`
Quick symptom-to-fix lookup	`./references/common_issues.md`	matching issue heading
Post-mortem fix options for known issues	`./references/common_issues.md`	`Solutions` sections

Scripts Overview

Script	Purpose	Required args	Optional args	Output	Fallback behavior
`./scripts/cluster_health.sh`	Cluster-wide health snapshot (nodes, workloads, events, common failure states)	None	`--strict`, `K8S_REQUEST_TIMEOUT` env var	Sectioned report to stdout	Continues on check failures, tracks them in summary and exit code
`./scripts/network_debug.sh`	Pod-centric network and DNS diagnostics	`<pod-name>` (`<namespace>` defaults to `default`)	`--strict`, `--insecure`, `K8S_REQUEST_TIMEOUT` env var	Sectioned report to stdout	Uses secure API probe by default; insecure TLS requires explicit `--insecure`
`./scripts/pod_diagnostics.py`	Deep pod diagnostics (status, describe, YAML, events, per-container logs, node context)	`<pod-name>`	`-n/--namespace`, `-o/--output`	Sectioned report to stdout or file	Fails fast on missing access; skips optional metrics/log blocks with clear messages

Script Exit Codes

./scripts/cluster_health.sh and ./scripts/network_debug.sh share the same contract:

0: checks completed with no check failures (warnings allowed unless --strict is set).
1: one or more checks failed, or warnings occurred in --strict mode.
2: blocked preconditions (for example: missing kubectl, no active context, inaccessible namespace/pod).

Deterministic Debugging Workflow

Follow this systematic approach for any Kubernetes issue:

1. Preflight and Scope

kubectl config current-context
kubectl get ns
kubectl auth can-i get pods -n <namespace>

If preflight fails, stop and fix access/context first.

2. Identify the Problem Layer

Categorize the issue:

Application Layer: Application crashes, errors, bugs
Pod Layer: Pod not starting, restarting, or pending
Service Layer: Network connectivity, DNS issues
Node Layer: Node not ready, resource exhaustion
Cluster Layer: Control plane issues, API problems
Storage Layer: Volume mount failures, PVC issues
Configuration Layer: ConfigMap, Secret, RBAC issues

3. Gather Diagnostics with the Right Script

Use the appropriate diagnostic script based on scope:

Pod-Level Diagnostics

Use ./scripts/pod_diagnostics.py for comprehensive pod analysis:

python3 ./scripts/pod_diagnostics.py <pod-name> -n <namespace>

This script gathers:

Pod status and description
Pod events
Container logs (current and previous)
Resource usage
Node information
YAML configuration

Output can be saved for analysis:

python3 ./scripts/pod_diagnostics.py <pod-name> -n <namespace> -o diagnostics.txt

Cluster-Level Health Check

Use ./scripts/cluster_health.sh for overall cluster diagnostics:

./scripts/cluster_health.sh > cluster-health-$(date +%Y%m%d-%H%M%S).txt

This script checks:

Cluster info and version
Node status and resources
Pods across all namespaces
Failed/pending pods
Recent events
Deployments, services, statefulsets, daemonsets
PVCs and PVs
Component health
Common error states (CrashLoopBackOff, ImagePullBackOff)

Network Diagnostics

Use ./scripts/network_debug.sh for connectivity issues:

./scripts/network_debug.sh <namespace> <pod-name>
# or force warning sensitivity / insecure TLS only when explicitly needed:
./scripts/network_debug.sh --strict <namespace> <pod-name>
./scripts/network_debug.sh --insecure <namespace> <pod-name>

This script analyzes:

Pod network configuration
DNS setup and resolution
Service endpoints
Network policies
Connectivity tests
CoreDNS logs

4. Follow Issue-Specific Reference Workflow

Based on the identified issue, consult ./references/troubleshooting_workflow.md:

Pod Pending: Resource/scheduling workflow
CrashLoopBackOff: Application crash workflow
ImagePullBackOff: Image pull workflow
Service issues: Network connectivity workflow
DNS failures: DNS troubleshooting workflow
Resource exhaustion: Performance investigation workflow
Storage issues: PVC binding workflow
Deployment stuck: Rollout workflow

5. Apply Targeted Fixes

Refer to ./references/common_issues.md for symptom-specific fixes.

6. Verify and Close

Run final verification:

kubectl get pods -n <namespace> -o wide
kubectl get events -n <namespace> --sort-by='.lastTimestamp' | tail -20
kubectl rollout status deployment/<name> -n <namespace>

Issue is done when user-visible behavior is healthy and no new critical warning events appear.

Example Flows

Example 1: CrashLoopBackOff in `payments` Namespace

python3 ./scripts/pod_diagnostics.py payments-api-7c97f95dfb-q9l7k -n payments -o payments-diagnostics.txt
kubectl logs payments-api-7c97f95dfb-q9l7k -n payments --previous --tail=100
kubectl get deploy payments-api -n payments -o yaml | grep -A 8 livenessProbe

Then open ./references/common_issues.md and apply the CrashLoopBackOff solutions.

Example 2: Service DNS/Connectivity Failure

./scripts/network_debug.sh checkout checkout-api-75f49c9d8f-z6qtm
kubectl get svc checkout-api -n checkout
kubectl get endpoints checkout-api -n checkout
kubectl get networkpolicies -n checkout

Then follow Service Connectivity Workflow in ./references/troubleshooting_workflow.md.

Essential Manual Commands

Pod Debugging

# View pod status
kubectl get pods -n <namespace> -o wide

# Detailed pod information
kubectl describe pod <pod-name> -n <namespace>

# View logs
kubectl logs <pod-name> -n <namespace>
kubectl logs <pod-name> -n <namespace> --previous  # Previous container
kubectl logs <pod-name> -n <namespace> -c <container>  # Specific container

# Execute commands in pod
kubectl exec <pod-name> -n <namespace> -it -- /bin/sh

# Get pod YAML
kubectl get pod <pod-name> -n <namespace> -o yaml

Service and Network Debugging

# Check services
kubectl get svc -n <namespace>
kubectl describe svc <service-name> -n <namespace>

# Check endpoints
kubectl get endpoints -n <namespace>

# Test DNS
kubectl exec <pod-name> -n <namespace> -- nslookup kubernetes.default

# View events
kubectl get events -n <namespace> --sort-by='.lastTimestamp'

Resource Monitoring

# Node resources
kubectl top nodes
kubectl describe nodes

# Pod resources
kubectl top pods -n <namespace>
kubectl top pod <pod-name> -n <namespace> --containers

Emergency Operations

# Restart deployment
kubectl rollout restart deployment/<name> -n <namespace>

# Rollback deployment
kubectl rollout undo deployment/<name> -n <namespace>

# Force delete stuck pod
kubectl delete pod <pod-name> -n <namespace> --force --grace-period=0

# Drain node (maintenance)
kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data

# Cordon node (prevent scheduling)
kubectl cordon <node-name>

Completion Criteria

Troubleshooting session is complete when all are true:

Cluster context and namespace are confirmed.
Relevant diagnostic script output is captured.
Root cause is identified and tied to evidence (events/logs/config/state).
Any disruptive action was preceded by snapshot and rollback plan.
Fix verification commands show healthy state.
Reference path used (./references/troubleshooting_workflow.md or ./references/common_issues.md) is documented in notes.

Related Tools

Useful additional tools for Kubernetes debugging:

kubectl-debug: Advanced debugging plugin
stern: Multi-pod log tailing
kubectx/kubens: Context and namespace switching
k9s: Terminal UI for Kubernetes
lens: Desktop IDE for Kubernetes
Prometheus/Grafana: Monitoring and alerting
Jaeger/Zipkin: Distributed tracing

k8s-debug

About

SKILL.md

k8s-debug

About

SKILL.md

Kubernetes Debugging Skill

Overview

Trigger Phrases

Prerequisites

Required

Optional but Recommended

When to Use This Skill

Safety Rules for Disruptive Commands

Reference Navigation Map

Scripts Overview

Script Exit Codes

Deterministic Debugging Workflow

1. Preflight and Scope

2. Identify the Problem Layer

3. Gather Diagnostics with the Right Script

Pod-Level Diagnostics

Cluster-Level Health Check

Network Diagnostics

4. Follow Issue-Specific Reference Workflow

5. Apply Targeted Fixes

6. Verify and Close

Example Flows

Example 1: CrashLoopBackOff in payments Namespace

Example 2: Service DNS/Connectivity Failure

Essential Manual Commands

Pod Debugging

Service and Network Debugging

Resource Monitoring

Emergency Operations

Completion Criteria

Related Tools

About

SKILL.md

About

SKILL.md

Kubernetes Debugging Skill

Overview

Trigger Phrases

Prerequisites

Required

Optional but Recommended

When to Use This Skill

Safety Rules for Disruptive Commands

Reference Navigation Map

Scripts Overview

Script Exit Codes

Deterministic Debugging Workflow

1. Preflight and Scope

2. Identify the Problem Layer

3. Gather Diagnostics with the Right Script

Pod-Level Diagnostics

Cluster-Level Health Check

Network Diagnostics

4. Follow Issue-Specific Reference Workflow

5. Apply Targeted Fixes

6. Verify and Close

Example Flows

Example 1: CrashLoopBackOff in payments Namespace

Example 2: Service DNS/Connectivity Failure

Essential Manual Commands

Pod Debugging

Service and Network Debugging

Resource Monitoring

Emergency Operations

Completion Criteria

Related Tools

Example 1: CrashLoopBackOff in `payments` Namespace

Example 1: CrashLoopBackOff in `payments` Namespace