
Pod troubleshooting

Your pod is not running. This page tells you why and what to do about it. Start with the status you see in kubectl get pods, follow the matching decision tree, and apply the fix.

Start here

kubectl get pods -n <namespace> -o wide
kubectl describe pod <pod-name> -n <namespace>

The describe output has the answer 95% of the time. Look at the Events section at the bottom.


Pending

The pod exists but no node will accept it.

Decision tree

1. Check events for scheduling failures:

kubectl describe pod <pod> -n <ns> | grep -A 5 "Events:"

2. What does the event say?

| Event message | Cause | Fix |
|---|---|---|
| Insufficient cpu or Insufficient memory | Node has no room for the pod's resource requests | Reduce requests, add nodes, or increase max-count on the autoscaler |
| 0/N nodes are available: N node(s) had taint | Pod does not tolerate node taints | Add the correct toleration to the pod spec |
| 0/N nodes are available: N node(s) didn't match Pod's node affinity/selector | Node affinity or nodeSelector does not match any node | Fix the label selector or add a node pool with matching labels |
| persistentvolumeclaim "X" not found | PVC does not exist or is in a different namespace | Create the PVC in the correct namespace |
| 0/N nodes are available: N pod has unbound immediate PersistentVolumeClaims | PV cannot be provisioned | Check that the StorageClass exists; check disk quota and zone mismatch |
| Too many pods | Node hit its max pod limit (default 110 on kubenet, 250 on Azure CNI Overlay) | Use fewer pods per node or add nodes |
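
For the taint case, the fix is a toleration in the pod spec that matches the node pool's taint. A minimal sketch, assuming a hypothetical taint of sku=gpu:NoSchedule on the pool (check the real taint with kubectl describe node | grep Taints):

```yaml
spec:
  tolerations:
    - key: "sku"            # must match the node taint key (hypothetical)
      operator: "Equal"
      value: "gpu"          # must match the taint value
      effect: "NoSchedule"  # must match the taint effect
```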

3. Cluster Autoscaler not scaling up?

kubectl -n kube-system get configmap cluster-autoscaler-status -o yaml
kubectl -n kube-system logs -l app=cluster-autoscaler --tail=50

Common reasons: max-count reached, pod has a nodeSelector that no pool satisfies, pod requests more resources than any VM SKU can provide.

tip

If a pod is Pending and the autoscaler is not responding, check if the pod has nodeSelector or affinity rules that match a node pool with min-count: 0 and max-count: 0. The autoscaler cannot create nodes in a pool with max-count 0.
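
To check whether a Pending pod is pinned to a specific pool, compare its nodeSelector against the pool labels. A sketch, assuming a hypothetical pool label agentpool=gpupool:

```yaml
spec:
  nodeSelector:
    agentpool: gpupool   # hypothetical pool label; the pool must exist and have max-count > 0
```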


CrashLoopBackOff

The container starts, runs for a few seconds, then exits. Kubernetes restarts it with exponential backoff (10s, 20s, 40s, up to 5 min).

Decision tree

1. Get the logs from the last crash:

kubectl logs <pod> -n <ns> --previous

2. Check the exit code:

kubectl describe pod <pod> -n <ns> | grep -A 3 "Last State"

| Exit code | Meaning | Most common cause |
|---|---|---|
| 0 | Graceful exit | Application completed and exited. If this is a web server, it should not exit; check the entrypoint. |
| 1 | Application error | Unhandled exception, missing config file, wrong database URL |
| 127 | Command not found | Wrong command or args in the container spec; the binary does not exist in the image. |
| 137 | OOMKilled (SIGKILL) | Container exceeded its memory limit. See the OOMKilled section below. |
| 139 | Segfault (SIGSEGV) | Application bug. Check core dumps. |
| 143 | SIGTERM | Graceful shutdown signal. If the container restarts after this, check the liveness probe. |

3. Common fixes:

  • Missing environment variable or secret: Check kubectl describe pod for CreateContainerConfigError events. Verify all referenced secrets and configmaps exist.
  • Liveness probe failing: The probe kills a healthy-but-slow container. Increase initialDelaySeconds and periodSeconds.
  • Application needs time to start: Add a startupProbe with generous timeout before the liveness probe kicks in.
startupProbe:
  httpGet:
    path: /healthz
    port: 8080
  failureThreshold: 30
  periodSeconds: 10
  # Gives the app 300 seconds (5 min) to start
warning

Do not "fix" CrashLoopBackOff by removing the liveness probe. The probe is telling you the app is unhealthy. Fix the app.


ImagePullBackOff

Kubernetes cannot download the container image.

Decision tree

1. Check the exact error:

kubectl describe pod <pod> -n <ns> | grep -A 3 "Failed"

| Error message | Cause | Fix |
|---|---|---|
| repository does not exist or may require authorization | Wrong image name, or private registry without auth | Verify the image name; attach ACR with az aks update --attach-acr |
| unauthorized: authentication required | Registry credentials missing or expired | For ACR, run az aks check-acr. For Docker Hub, create an imagePullSecret. |
| manifest unknown | Tag does not exist in the registry | Check available tags with az acr repository show-tags --name myACR --repository myapp |
| ErrImagePull followed by Back-off | Transient network issue or rate limit | Wait and retry. Docker Hub rate limits anonymous pulls to 100 per 6 hours. Use ACR. |

2. ACR-specific checks:

# Verify ACR is attached
az aks check-acr --resource-group myRG --name myAKS --acr myACR.azurecr.io

# If not attached
az aks update --resource-group myRG --name myAKS --attach-acr myACR

# Verify the image exists
az acr repository show-tags --name myACR --repository myapp -o tsv

3. Using imagePullSecrets (non-ACR registries):

kubectl create secret docker-registry my-reg-cred \
  --docker-server=registry.example.com \
  --docker-username=user \
  --docker-password=pass \
  -n <namespace>

Then reference it in the pod spec:

spec:
  imagePullSecrets:
    - name: my-reg-cred

OOMKilled

The container used more memory than its limit allows, so the kernel killed it with SIGKILL (signal 9). The container exits with code 137 (128 + 9).

Decision tree

1. Confirm it is OOMKilled:

kubectl describe pod <pod> -n <ns> | grep -E "OOMKilled|Exit Code: 137"

2. Check current memory usage vs limit:

kubectl top pod <pod> -n <ns> --containers

3. Decide the fix:

| Situation | Fix |
|---|---|
| App genuinely needs more memory | Increase resources.limits.memory, but also investigate whether there is a leak. |
| Memory limit is far too low | Set the limit to 1.5-2x typical usage. Use VPA in recommendation mode to find the right value. |
| Memory leak (usage grows over time) | Fix the application. Common causes: unbounded caches, connection pool leaks, large file processing without streaming. |
| JVM app killed despite -Xmx being set | The JVM uses more memory than the heap alone (thread stacks, metaspace, native buffers). Set the limit to Xmx + 500Mi at minimum. |
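
The requests/limits pattern for the JVM row, sketched with illustrative values and assuming a heap of -Xmx1g:

```yaml
resources:
  requests:
    memory: "1Gi"      # roughly the app's typical usage
  limits:
    memory: "1536Mi"   # Xmx (1Gi) plus headroom for metaspace, thread stacks, native buffers
```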

4. Use VPA for right-sizing recommendations:

# Install VPA and set it to recommendation-only mode
# Then check recommendations
kubectl get vpa -n <ns> -o yaml
tip

Never set memory limits equal to requests for JVM or .NET applications. These runtimes need headroom above heap for GC, JIT, and thread stacks. A good starting point: requests = typical usage, limits = requests * 2.


CreateContainerConfigError

The container cannot start because a referenced resource is missing.

Common causes

kubectl describe pod <pod> -n <ns> | grep -A 5 "Warning"

| Error | Fix |
|---|---|
| secret "X" not found | Create the secret in the correct namespace |
| configmap "X" not found | Create the configmap in the correct namespace |
| key "Y" not found in secret "X" | The secret exists but is missing the expected key. Check kubectl get secret X -n <ns> -o jsonpath='{.data}' |
| projected volume mount failed | Workload Identity token volume issue. Check the OIDC issuer and service account annotations. |
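
If a secret or configmap is genuinely optional for the workload, the reference can be marked optional so the container starts even when it is missing. A sketch with a hypothetical secret name and key:

```yaml
env:
  - name: API_KEY
    valueFrom:
      secretKeyRef:
        name: my-secret   # hypothetical
        key: api-key      # hypothetical
        optional: true    # container starts even if the secret or key is missing
```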

Stuck terminating

A pod is stuck in Terminating state and will not go away.

Decision tree

1. Check for finalizers:

kubectl get pod <pod> -n <ns> -o jsonpath='{.metadata.finalizers}'

If finalizers are present, something is waiting to clean up. Removing them forcefully can cause resource leaks.

2. Check for PDB blocking eviction:

kubectl get pdb -n <ns>

If the PDB prevents disruption of the last remaining pod, the drain operation blocks.
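
A PDB that blocks eviction of a single-replica workload looks like this (names are hypothetical):

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: myapp-pdb
spec:
  minAvailable: 1    # with only 1 replica, no pod can ever be evicted
  selector:
    matchLabels:
      app: myapp
```

Scale the workload above minAvailable, or lower minAvailable, before draining the node.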

3. Force delete (last resort):

kubectl delete pod <pod> -n <ns> --grace-period=0 --force
warning

Force-deleting a pod with a PersistentVolume attached can cause the volume to stay attached to the old node. The replacement pod cannot mount it. Use force delete only when you are sure there are no volume dependencies.


Quick diagnosis script

Run this when you need a fast overview of all issues in a namespace:

NS=my-namespace

echo "=== Non-running pods ==="
kubectl get pods -n $NS --field-selector=status.phase!=Running,status.phase!=Succeeded

echo "=== Warning events (last 1h) ==="
kubectl get events -n $NS --field-selector type=Warning --sort-by='.lastTimestamp' | tail -20

echo "=== Resource pressure ==="
kubectl top pods -n $NS --sort-by=memory | head -10

echo "=== PVC status ==="
kubectl get pvc -n $NS | grep -v Bound
