Perform a comprehensive health check of a Kubernetes cluster's infrastructure.
Quickly assess the overall state of the cluster to identify unhealthy nodes, failing pods, resource pressure, and recent warning events.
Get an overview of all nodes in the cluster:

```bash
kubectl get nodes -o wide
```
Look for nodes that are not in the `Ready` state.

Expected output:
```
NAME     STATUS     ROLES           AGE   VERSION
node-1   Ready      control-plane   45d   v1.28.0
node-2   Ready      <none>          45d   v1.28.0
node-3   Ready      <none>          45d   v1.28.0
node-4   NotReady   <none>          45d   v1.28.0   ⚠️
```
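To surface only the problem nodes, a quick filter over the default `kubectl get nodes` output can help (a minimal sketch; it assumes STATUS is the second column, as in the default output format):

```bash
# Print the name and status of every node whose STATUS is not exactly "Ready"
# (catches NotReady, Unknown, and cordoned nodes such as "Ready,SchedulingDisabled")
kubectl get nodes --no-headers | awk '$2 != "Ready" {print $1, $2}'
```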
Get current CPU and memory utilization:

```bash
kubectl top nodes
```
Look for nodes with unusually high CPU or memory utilization.

Expected output:
```
NAME     CPU(cores)   CPU%   MEMORY(bytes)   MEMORY%
node-1   450m         22%    4Gi             50%
node-2   890m         44%    6Gi             75%
node-3   1200m        60%    7Gi             87%   ⚠️
node-4   100m         5%     2Gi             25%
```
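Note that `kubectl top` requires the metrics-server add-on. On reasonably recent kubectl versions the output can also be sorted so the busiest nodes appear first (a sketch; `--sort-by` support for `top` depends on your kubectl version):

```bash
# Nodes ordered by memory consumption, highest first
kubectl top nodes --sort-by=memory

# Nodes ordered by CPU consumption, highest first
kubectl top nodes --sort-by=cpu
```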
Inspect for resource pressure conditions:

```bash
kubectl describe nodes | grep -A 10 "Conditions:"
```
Look for pressure conditions (`MemoryPressure`, `DiskPressure`, `PIDPressure`) with a status of `True`.

Critical conditions example:
```
Conditions:
  Type             Status   Reason
  ----             ------   ------
  MemoryPressure   True     NodeHasInsufficientMemory   ⚠️
  DiskPressure     False    NodeHasSufficientDisk
  PIDPressure      False    NodeHasSufficientPID
  Ready            True     KubeletReady
```
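For a one-line-per-node view of these conditions instead of paging through `describe` output, custom columns with JSONPath filters work well (a sketch; quoting may need adjusting for your shell):

```bash
# Ready and pressure condition status for every node
kubectl get nodes -o custom-columns='NAME:.metadata.name,READY:.status.conditions[?(@.type=="Ready")].status,MEMORY:.status.conditions[?(@.type=="MemoryPressure")].status,DISK:.status.conditions[?(@.type=="DiskPressure")].status,PID:.status.conditions[?(@.type=="PIDPressure")].status'
```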
Get all pods that are not in Running or Succeeded state:

```bash
kubectl get pods --all-namespaces --field-selector=status.phase!=Running,status.phase!=Succeeded
```
Alternative - get all pods with issues:

```bash
kubectl get pods --all-namespaces | grep -vE 'Running|Completed|Succeeded'
```
Look for pods in `CrashLoopBackOff`, `ImagePullBackOff`, `OOMKilled`, or stuck in `Pending`.

Expected output:
```
NAMESPACE    NAME              READY   STATUS             RESTARTS   AGE
api          api-service-abc   0/1     CrashLoopBackOff   5          10m   ⚠️
api          api-service-xyz   0/1     OOMKilled          3          15m   ⚠️
default      worker-123        0/1     Pending            0          5m    ⚠️
monitoring   prometheus-456    0/2     ImagePullBackOff   0          20m   ⚠️
```
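For a quick tally of how many pods are in each state across the cluster, count the STATUS column (a sketch; with `-A` the status is the fourth field):

```bash
# Count pods by status across all namespaces, most common first
kubectl get pods -A --no-headers | awk '{print $4}' | sort | uniq -c | sort -rn
```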
Verify kube-system namespace health:

```bash
kubectl get pods -n kube-system
```
Critical components to check: CoreDNS, etcd, kube-apiserver, kube-controller-manager, kube-scheduler, and kube-proxy.

Expected output:
```
NAME                             READY   STATUS    RESTARTS   AGE
coredns-565d847f94-abcde         1/1     Running   0          45d
coredns-565d847f94-fghij         1/1     Running   0          45d
etcd-node-1                      1/1     Running   0          45d
kube-apiserver-node-1            1/1     Running   0          45d
kube-controller-manager-node-1   1/1     Running   0          45d
kube-proxy-klmno                 1/1     Running   0          45d
kube-scheduler-node-1            1/1     Running   0          45d
```
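Beyond individual pod status, it can help to confirm that the kube-system Deployments and DaemonSets are fully rolled out and to spot components that restart frequently (a sketch):

```bash
# Ready counts for kube-system Deployments and DaemonSets
kubectl get deployments,daemonsets -n kube-system

# kube-system pods ordered by the restart count of their first container
kubectl get pods -n kube-system --sort-by='.status.containerStatuses[0].restartCount'
```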
Get recent events (Kubernetes retains events for about an hour by default), filtered for warnings and errors:

```bash
kubectl get events --all-namespaces --sort-by='.lastTimestamp' | tail -50 | grep -E 'Warning|Error'
```
Alternative - more structured:

```bash
kubectl get events --all-namespaces --sort-by='.lastTimestamp' --field-selector type!=Normal
```
Look for recurring patterns such as `FailedScheduling`, `BackOff`, `OOMKilled`, and `Evicted`.

Expected output:
```
10m   Warning   FailedScheduling   pod/worker-123    0/4 nodes available: insufficient memory
8m    Warning   BackOff            pod/api-service   Back-off restarting failed container
5m    Warning   OOMKilled          pod/api-service   Container exceeded memory limit
3m    Warning   Evicted            pod/cache-789     The node was low on resource: memory
```
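To see which problems recur most often, group the non-Normal events by reason (a sketch using custom columns):

```bash
# Count warning events by reason, most frequent first
kubectl get events --all-namespaces --field-selector type!=Normal \
  -o custom-columns=REASON:.reason --no-headers | sort | uniq -c | sort -rn
```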
Find pods that were evicted due to resource pressure:

```bash
kubectl get pods --all-namespaces --field-selector=status.phase=Failed | grep Evicted
```
Evictions indicate that a node came under resource pressure (most often memory or disk) and the kubelet terminated pods to reclaim resources.
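To also capture the eviction message, which names the exhausted resource, `jq` over the failed pods is handy (a sketch, assuming `jq` is installed):

```bash
# Namespace, name, and eviction message for every evicted pod
kubectl get pods --all-namespaces --field-selector=status.phase=Failed -o json \
  | jq -r '.items[]
           | select(.status.reason == "Evicted")
           | "\(.metadata.namespace)/\(.metadata.name): \(.status.message)"'
```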
Check the resource requests and limits allocated on each node:

```bash
kubectl describe nodes | grep -A 7 "Allocated resources:"
```
Look for requests or limits approaching or exceeding 100% of node capacity.

Expected output:
```
Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource            Requests      Limits
  --------            --------      ------
  cpu                 3800m (95%)   7200m (180%)   ⚠️
  memory              24Gi (75%)    32Gi (100%)    ⚠️
  ephemeral-storage   0 (0%)        0 (0%)
  hugepages-2Mi       0 (0%)        0 (0%)
```
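For additional context, each node's total capacity versus allocatable resources (what the scheduler can actually hand out after system reservations) can be listed one line per node; note this shows capacity and allocatable, not current requests (a sketch):

```bash
# Capacity vs. allocatable CPU and memory per node
kubectl get nodes -o custom-columns='NAME:.metadata.name,CPU_CAP:.status.capacity.cpu,CPU_ALLOC:.status.allocatable.cpu,MEM_CAP:.status.capacity.memory,MEM_ALLOC:.status.allocatable.memory'
```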
The equivalent checks expressed as Kubernetes tool calls, where such an interface is available:

```
kubernetes.get_nodes()
kubernetes.get_node_metrics()
kubernetes.describe_node(node_name)
kubernetes.get_pods(namespace="all", field_selector="status.phase!=Running")
kubernetes.get_pods(namespace="kube-system")
kubernetes.get_events(namespace="all", since="1h", field_selector="type!=Normal")
```
Provide a structured summary in this format:
# CLUSTER HEALTH SUMMARY
## Cluster Overview
- **Total Nodes:** 4
- **Healthy Nodes:** 3
- **Unhealthy Nodes:** 1
- **Kubernetes Version:** v1.28.0
## Node Health
### Healthy Nodes ✓
- node-1: Ready (CPU: 22%, Memory: 50%)
- node-2: Ready (CPU: 44%, Memory: 75%)
- node-3: Ready (CPU: 60%, Memory: 87%) ⚠️ High memory
### Unhealthy Nodes ⚠️
- **node-4:** NotReady
- Condition: KubeletNotReady
- Reason: Node had insufficient memory
- Duration: 15 minutes
## Pod Health Summary
**Total Pods:** 127
- Running: 120
- Pending: 4 ⚠️
- CrashLoopBackOff: 2 ⚠️
- ImagePullBackOff: 1 ⚠️
### Critical Pod Issues
1. **api-service-abc** (namespace: api)
- Status: CrashLoopBackOff
- Restarts: 5 times in 10 minutes
- Action needed: Investigate with debug-pod-issues skill
2. **api-service-xyz** (namespace: api)
- Status: OOMKilled
- Restarts: 3 times in 15 minutes
- Action needed: Memory limit investigation required
3. **worker-123** (namespace: default)
- Status: Pending
- Reason: Insufficient memory to schedule
- Action needed: Resource analysis needed
4. **prometheus-456** (namespace: monitoring)
- Status: ImagePullBackOff
- Reason: Failed to pull image
- Action needed: Check registry connectivity
## System Components ✓
All critical system components healthy:
- coredns: 2/2 pods running
- kube-apiserver: Running
- kube-controller-manager: Running
- kube-scheduler: Running
- etcd: Running
- kube-proxy: DaemonSet 4/4 ready
## Recent Critical Events (Last 60 minutes)
**OOM Kills:** 3 occurrences
- 14:23: api-service-xyz OOMKilled (namespace: api)
- 14:25: api-service-xyz OOMKilled (namespace: api)
- 14:27: api-service-xyz OOMKilled (namespace: api)
**Scheduling Failures:** 4 occurrences
- 14:20: worker-123 FailedScheduling: insufficient memory
- 14:22: worker-456 FailedScheduling: insufficient memory
- 14:25: worker-789 FailedScheduling: insufficient memory
- 14:28: cache-abc FailedScheduling: insufficient cpu
**Node Issues:**
- 14:15: node-4 NodeNotReady: KubeletNotReady
**Evictions:** 2 occurrences
- 14:18: cache-xyz Evicted: node low on memory
- 14:22: cache-abc Evicted: node low on memory
## Resource Pressure Analysis
### Node-4: MemoryPressure Detected ⚠️
- Current usage: 28Gi / 32Gi (87%)
- Condition: MemoryPressure True
- Impact: Pods may be evicted
- Action: Investigate high memory consumers
### Cluster-Wide Resource Allocation
- **CPU:** 75% allocated (approaching capacity)
- **Memory:** 82% allocated ⚠️ (critical threshold)
- **Risk:** New pods may not schedule
## Issues Detected
### 🚨 CRITICAL Issues (Require Immediate Action)
1. **Multiple OOM Kills in api namespace**
- Impact: Service degradation/outages
- Pods affected: api-service-xyz
- Recommendation: Increase memory limits or investigate memory leak
- Next step: Use `debug-pod-issues` skill
2. **Node-4 Unhealthy (NotReady)**
- Impact: Reduced cluster capacity
- Duration: 15 minutes
- Recommendation: Investigate node logs, consider cordoning/draining
- Next step: SSH to node or check kubelet logs
3. **Cluster Memory Capacity Critical (82% allocated)**
- Impact: Risk of scheduling failures
- Pods pending: 4
- Recommendation: Scale cluster or optimize workloads
- Next step: Use `analyze-resource-usage` skill
### ⚠️ WARNING Issues (Should Be Addressed)
4. **Node-3 High Memory Usage (87%)**
- Impact: Risk of pressure condition
- Current state: Still Ready
- Recommendation: Monitor closely, consider rebalancing pods
5. **ImagePullBackOff in monitoring namespace**
- Impact: Prometheus not available
- Likely cause: Registry connectivity or credentials
- Recommendation: Check image repository access
## Recommended Actions (Priority Order)
### Immediate (Next 15 minutes)
1. **Investigate api-service OOM kills** → Use `debug-pod-issues` skill on api-service-xyz
2. **Check node-4 status** → SSH to node or review kubelet logs
3. **Review pending pods** → Use `analyze-resource-usage` to understand capacity
### Short Term (Next hour)
4. Increase memory limits for api-service pods
5. Consider scaling cluster (add nodes or upsize)
6. Fix ImagePullBackOff for prometheus
7. Investigate memory usage on node-3
### Long Term (This week)
8. Implement pod resource requests/limits across all workloads
9. Set up cluster autoscaling
10. Review and optimize memory-intensive workloads
11. Implement monitoring alerts for:
- Node NotReady conditions
- OOM kill events
- Resource allocation thresholds (>80%)
- Pod evictions
## Next Steps
Based on the findings, I recommend:
1. **Deep dive into OOM issues** → Skill: `debug-pod-issues`
- Target: api-service-xyz in api namespace
2. **Analyze resource usage patterns** → Skill: `analyze-resource-usage`
- Focus on memory consumption and allocation
3. **Check logs for crash patterns** → Skill: `inspect-logs`
- Target: api-service pods for error patterns
Would you like me to proceed with investigating the OOM kills in the api-service pods?
Based on the findings, recommend the next skill:

- If OOMKilled or CrashLoopBackOff is detected → use the `debug-pod-issues` skill
- If high CPU/memory usage is detected → use the `analyze-resource-usage` skill
- If connection errors appear in events → use the `check-network-connectivity` skill
- If events contain errors but pods are running → use the `inspect-logs` skill
- If there are multiple issues → prioritize by severity, starting with pod crashes
Common findings and recommended follow-ups:

- **OOMKilled pods** → Indicates: memory limits too low or a memory leak. Next action: `debug-pod-issues` + `inspect-logs`
- **Pending pods / FailedScheduling events** → Indicates: insufficient cluster capacity. Next action: `analyze-resource-usage`
- **Node pressure conditions or evictions** → Indicates: node resource exhaustion. Next action: investigate the node directly, consider draining
- **kube-system component failures** → Indicates: critical cluster issue. Next action: immediate investigation, possibly escalate
- **ImagePullBackOff** → Indicates: registry access issues. Next action: check network connectivity and registry credentials
This skill is complete when node status, pod health, system components, resource pressure, and recent events have all been reviewed, a structured health summary has been provided, and a next skill or action has been recommended based on the findings.