Pod troubleshooting
Your pod is not running. This page tells you why and what to do about it. Start with the status you see in kubectl get pods, follow the decision tree, fix the issue.
Start here
kubectl get pods -n <namespace> -o wide
kubectl describe pod <pod-name> -n <namespace>
The describe output has the answer 95% of the time. Look at the Events section at the bottom.
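In a busy namespace, you can also pull just the events for one pod with a field selector; this works as long as the pod name is exact:
kubectl get events -n <namespace> --field-selector involvedObject.name=<pod-name> --sort-by='.lastTimestamp'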
Pending
The pod exists but no node will accept it.
Decision tree
1. Check events for scheduling failures:
kubectl describe pod <pod> -n <ns> | grep -A 5 "Events:"
2. What does the event say?
| Event message | Cause | Fix |
|---|---|---|
| Insufficient cpu or Insufficient memory | Node has no room for the pod's resource requests | Reduce requests, add nodes, or increase max-count on the autoscaler |
| 0/N nodes are available: N node(s) had taint | Pod does not tolerate node taints | Add the matching toleration to the pod spec (see the sketch after this table) |
| 0/N nodes are available: N node(s) didn't match Pod's node affinity/selector | Node affinity or nodeSelector does not match any node | Fix the label selector or add a node pool with matching labels |
| persistentvolumeclaim "X" not found | PVC does not exist or is in a different namespace | Create the PVC in the correct namespace |
| 0/N nodes are available: N pod has unbound immediate PersistentVolumeClaims | PV cannot be provisioned | Check that the StorageClass exists and look for disk quota or zone mismatch |
| Too many pods | Node hit its max pod limit (default 110 on kubenet, 250 on Azure CNI Overlay) | Run fewer pods per node or add nodes |
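For the taint case, read the taint off the node first, then mirror it in the pod spec. The sku=gpu:NoSchedule taint below is a hypothetical example; substitute whatever the node actually reports:
kubectl describe node <node-name> | grep Taints
Then add the matching toleration under the pod's spec:
tolerations:
- key: "sku"
  operator: "Equal"
  value: "gpu"
  effect: "NoSchedule"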
3. Cluster Autoscaler not scaling up?
kubectl -n kube-system get configmap cluster-autoscaler-status -o yaml
kubectl -n kube-system logs -l app=cluster-autoscaler --tail=50
Common reasons: max-count reached, pod has a nodeSelector that no pool satisfies, pod requests more resources than any VM SKU can provide.
If a pod is Pending and the autoscaler is not responding, check if the pod has nodeSelector or affinity rules that match a node pool with min-count: 0 and max-count: 0. The autoscaler cannot create nodes in a pool with max-count 0.
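To see each pool's autoscaling bounds on AKS, a query along these lines works (myRG and myAKS are placeholder names, matching the examples further down this page):
az aks nodepool list --resource-group myRG --cluster-name myAKS \
  --query "[].{name:name, autoscaling:enableAutoScaling, min:minCount, max:maxCount}" -o table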
CrashLoopBackOff
The container starts, runs for a few seconds, then exits. Kubernetes restarts it with exponential backoff (10s, 20s, 40s, up to 5 min).
Decision tree
1. Get the logs from the last crash:
kubectl logs <pod> -n <ns> --previous
2. Check the exit code:
kubectl describe pod <pod> -n <ns> | grep -A 3 "Last State"
| Exit code | Meaning | Most common cause |
|---|---|---|
| 0 | Graceful exit | Application completed and exited. If this is a web server, it should not exit. Check the entrypoint. |
| 1 | Application error | Unhandled exception, missing config file, wrong database URL |
| 127 | Command not found | Wrong command or args in the container spec. The binary does not exist in the image. |
| 137 | OOMKilled (SIGKILL) | Container exceeded its memory limit. See OOMKilled section below. |
| 139 | Segfault (SIGSEGV) | Application bug. Check core dumps. |
| 143 | SIGTERM | Graceful shutdown signal. If the container restarts after this, check liveness probe. |
3. Common fixes:
- Missing environment variable or secret: Check kubectl describe pod for CreateContainerConfigError events. Verify all referenced secrets and configmaps exist.
- Liveness probe failing: The probe kills a healthy-but-slow container. Increase initialDelaySeconds and periodSeconds (see the sketch after the startupProbe example).
- Application needs time to start: Add a startupProbe with a generous timeout before the liveness probe kicks in:
startupProbe:
httpGet:
path: /healthz
port: 8080
failureThreshold: 30
periodSeconds: 10
# Gives the app 300 seconds (5 min) to start
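For the liveness-probe case, relaxing the timings looks like this; the /healthz path, port, and numbers are illustrative starting points, not prescriptions:
livenessProbe:
  httpGet:
    path: /healthz
    port: 8080
  initialDelaySeconds: 30   # wait 30s after container start before the first check
  periodSeconds: 15         # probe every 15s
  failureThreshold: 3       # restart only after 3 consecutive failures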
Do not "fix" CrashLoopBackOff by removing the liveness probe. The probe is telling you the app is unhealthy. Fix the app.
ImagePullBackOff
Kubernetes cannot download the container image.
Decision tree
1. Check the exact error:
kubectl describe pod <pod> -n <ns> | grep -A 3 "Failed"
| Error message | Cause | Fix |
|---|---|---|
| repository does not exist or may require authorization | Wrong image name, or private registry without auth | Verify the image name; attach ACR with az aks update --attach-acr |
| unauthorized: authentication required | Registry credentials missing or expired | For ACR, run az aks check-acr. For Docker Hub, create an imagePullSecret |
| manifest unknown | Tag does not exist in the registry | List available tags with az acr repository show-tags --name myACR --repository myapp |
| ErrImagePull followed by Back-off | Transient network issue or rate limit | Wait and retry. Docker Hub rate-limits anonymous pulls to 100 per 6 hours. Use ACR. |
2. ACR-specific checks:
# Verify ACR is attached
az aks check-acr --resource-group myRG --name myAKS --acr myACR.azurecr.io
# If not attached
az aks update --resource-group myRG --name myAKS --attach-acr myACR
# Verify the image exists
az acr repository show-tags --name myACR --repository myapp -o tsv
3. Using imagePullSecrets (non-ACR registries):
kubectl create secret docker-registry my-reg-cred \
--docker-server=registry.example.com \
--docker-username=user \
--docker-password=pass \
-n <namespace>
Then reference it in the pod spec:
spec:
imagePullSecrets:
- name: my-reg-cred
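Rather than adding imagePullSecrets to every pod spec, you can attach the secret to the namespace's default service account so pods pick up the credential automatically (assuming your pods do not set a custom serviceAccountName):
kubectl patch serviceaccount default -n <namespace> \
  -p '{"imagePullSecrets": [{"name": "my-reg-cred"}]}'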
OOMKilled
The container used more memory than its limit allows. The kernel's OOM killer terminates it with SIGKILL, and the container exits with code 137 (128 + 9).
Decision tree
1. Confirm it is OOMKilled:
kubectl describe pod <pod> -n <ns> | grep -E "OOMKilled|Exit Code: 137"
2. Check current memory usage vs limit:
kubectl top pod <pod> -n <ns> --containers
3. Decide the fix:
| Situation | Fix |
|---|---|
| App genuinely needs more memory | Increase resources.limits.memory. But also investigate if there is a leak. |
| Memory limit is way too low | Set limit to 1.5-2x the typical usage. Use VPA in recommendation mode to find the right value. |
| Memory leak (usage grows over time) | Fix the application. Common causes: unbounded caches, connection pool leaks, large file processing without streaming. |
| JVM app killed despite -Xmx being set | JVM uses more memory than heap alone (thread stacks, metaspace, native buffers). Set the limit to -Xmx + 500Mi at minimum. |
4. Use VPA for right-sizing recommendations:
# Install VPA and set it to recommendation-only mode
# Then check recommendations
kubectl get vpa -n <ns> -o yaml
Never set memory limits equal to requests for JVM or .NET applications. These runtimes need headroom above heap for GC, JIT, and thread stacks. A good starting point: requests = typical usage, limits = requests * 2.
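Putting that rule of thumb into a spec, assuming a service that typically sits around 512Mi (all numbers here are placeholders to adapt, not recommendations):
resources:
  requests:
    memory: "512Mi"   # typical steady-state usage
    cpu: "250m"
  limits:
    memory: "1Gi"     # ~2x requests: headroom for GC, JIT, and spikes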
CreateContainerConfigError
The container cannot start because a referenced resource is missing.
Common causes
kubectl describe pod <pod> -n <ns> | grep -A 5 "Warning"
| Error | Fix |
|---|---|
secret "X" not found | Create the secret in the correct namespace |
configmap "X" not found | Create the configmap in the correct namespace |
key "Y" not found in secret "X" | The secret exists but is missing the expected key. Check kubectl get secret X -n <ns> -o jsonpath='{.data}' |
projected volume mount failed | Workload Identity token volume issue. Check OIDC issuer and service account annotations. |
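To confirm which keys a secret actually holds, and to decode one for inspection (X and Y stand in for the names from the error message):
# List the keys the secret contains (names and sizes, no values)
kubectl describe secret X -n <ns>
# Decode a single key to verify its contents
kubectl get secret X -n <ns> -o jsonpath='{.data.Y}' | base64 -d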
Stuck terminating
A pod is stuck in Terminating state and will not go away.
Decision tree
1. Check for finalizers:
kubectl get pod <pod> -n <ns> -o jsonpath='{.metadata.finalizers}'
If finalizers are present, something is waiting to clean up. Removing them forcefully can cause resource leaks.
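If you have confirmed the controller that set the finalizer is gone for good, a merge patch clears them; note this skips whatever cleanup the finalizer was guarding:
kubectl patch pod <pod> -n <ns> --type=merge \
  -p '{"metadata":{"finalizers":null}}'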
2. Check for PDB blocking eviction:
kubectl get pdb -n <ns>
If the PDB prevents disruption of the last remaining pod, the drain operation blocks.
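The PDB status shows whether an eviction can proceed right now; a disruptionsAllowed of 0 means the drain blocks:
kubectl get pdb <pdb-name> -n <ns> -o jsonpath='{.status.disruptionsAllowed}'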
3. Force delete (last resort):
kubectl delete pod <pod> -n <ns> --grace-period=0 --force
Force-deleting a pod with a PersistentVolume attached can cause the volume to stay attached to the old node. The replacement pod cannot mount it. Use force delete only when you are sure there are no volume dependencies.
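Before force-deleting a pod that mounts a PersistentVolume, you can check whether the volume is still attached to the old node; VolumeAttachment objects are cluster-scoped:
kubectl get volumeattachment | grep <node-name>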
Quick diagnosis script
Run this when you need a fast overview of all issues in a namespace:
NS=my-namespace
echo "=== Non-running pods ==="
kubectl get pods -n $NS --field-selector=status.phase!=Running,status.phase!=Succeeded
echo "=== Warning events (last 1h) ==="
kubectl get events -n $NS --field-selector type=Warning --sort-by='.lastTimestamp' | tail -20
echo "=== Resource pressure ==="
kubectl top pods -n $NS --sort-by=memory | head -10
echo "=== PVC status ==="
kubectl get pvc -n $NS | grep -v Bound