SRE on-call guide

You got paged. Your AKS cluster has an issue. This is your step-by-step playbook for the first 30 minutes.

Do not start debugging randomly. Follow the sequence below. Most AKS incidents fall into a small number of patterns, and structured triage resolves them faster than intuition.

First 5 minutes: assess scope

Before you touch anything, determine the blast radius. Run these commands in order:

# 1. Can you reach the API server?
kubectl cluster-info

# 2. Are nodes healthy?
kubectl get nodes -o wide

# 3. How many pods are unhealthy?
kubectl get pods --all-namespaces --field-selector status.phase!=Running,status.phase!=Succeeded

# 4. Are there recent events signaling trouble?
kubectl get events --all-namespaces --sort-by='.lastTimestamp' | tail -30

# 5. Check system pods — if these are broken, the cluster is broken
kubectl get pods -n kube-system

Use the output to classify the incident:

| Blast radius | Symptoms | Likely cause |
| --- | --- | --- |
| Single pod | One pod in CrashLoopBackOff or Error | Application bug, bad config, missing secret |
| Single service | All pods of one deployment unhealthy | Bad rollout, resource exhaustion, image pull failure |
| Single node | Multiple pods on one node failing | Node not ready, disk pressure, OOM |
| Whole cluster | API server unreachable, all nodes affected | Control plane issue, networking failure, certificate expiry |

Triage by symptom

Users reporting errors

Work from the edge inward:

# 1. Check the failing pods
kubectl get pods -n <namespace> -l app=<service>
kubectl describe pod <pod-name> -n <namespace>
kubectl logs <pod-name> -n <namespace> --tail=100

# 2. Check the ingress controller
kubectl get pods -n ingress-nginx
kubectl logs -n ingress-nginx -l app.kubernetes.io/name=ingress-nginx --tail=50

# 3. Check backend service endpoints
kubectl get endpoints <service-name> -n <namespace>
Tip: If kubectl get endpoints returns an empty subset, the service selector does not match any running pods. Check for label mismatches first; this is the most common cause of "service returns 503."
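To confirm a selector mismatch, compare the service's selector against the labels on the running pods (names in angle brackets are placeholders):

```shell
# Show the selector the service uses to pick backend pods
kubectl get svc <service-name> -n <namespace> -o jsonpath='{.spec.selector}'

# Show the labels actually present on pods in the namespace;
# the selector must match a Running pod or the endpoints stay empty
kubectl get pods -n <namespace> --show-labels
```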

High latency

# 1. Check resource pressure on nodes
kubectl top nodes
kubectl top pods -n <namespace> --sort-by=cpu

# 2. Check pod QoS classes (BestEffort and Burstable pods are throttled and evicted first)
kubectl get pods -n <namespace> -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.qosClass}{"\n"}{end}'

# 3. Check node conditions
kubectl describe nodes | grep -A5 "Conditions:"

# 4. Check external dependencies (DNS resolution time, upstream APIs)
kubectl exec -it <pod-name> -n <namespace> -- nslookup <external-service>
Warning: If kubectl top nodes shows CPU or memory above 80 percent on multiple nodes, the cluster autoscaler may be struggling to keep up. Check pending pods with kubectl get pods --field-selector status.phase=Pending --all-namespaces and review autoscaler status with kubectl describe configmap cluster-autoscaler-status -n kube-system.
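If the autoscaler looks saturated, one way to check whether a node pool has hit its ceiling from the CLI (resource names are placeholders):

```shell
# Show per-pool autoscaler settings; if count equals maxCount, that pool cannot grow
az aks nodepool list --resource-group <rg> --cluster-name <cluster> \
  --query "[].{name:name, count:count, min:minCount, max:maxCount}" -o table
```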

Complete outage

# 1. Check API server — if this fails, use Azure portal
kubectl cluster-info

# 2. Check all nodes
kubectl get nodes

# 3. Check DNS (CoreDNS)
kubectl get pods -n kube-system -l k8s-app=kube-dns
kubectl logs -n kube-system -l k8s-app=kube-dns --tail=50

# 4. Check networking (kube-proxy, CNI)
kubectl get pods -n kube-system -l component=kube-proxy
kubectl get pods -n kube-system | grep -i azure   # CNI-related pods are named azure-cns, azure-ip-masq-agent, etc.

If kubectl itself is unreachable, go directly to the Azure portal. Check Resource Health under the AKS resource and look at Activity Log for recent control plane operations.
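When kubectl is down, the same checks can be run from the az CLI (assumes an authenticated az session; names are placeholders):

```shell
# Control plane power and provisioning state
az aks show --resource-group <rg> --name <cluster> \
  --query "{power:powerState.code, provisioning:provisioningState}" -o table

# Recent control plane operations (last 2 hours)
az monitor activity-log list --resource-group <rg> --offset 2h -o table
```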

Pods not scheduling

# 1. Check pending pods and their events
kubectl get pods --field-selector status.phase=Pending --all-namespaces
kubectl describe pod <pending-pod> -n <namespace>

# 2. Check cluster autoscaler
kubectl get events -n kube-system | grep -i "cluster-autoscaler"

# 3. Check resource quotas
kubectl get resourcequotas --all-namespaces

# 4. Check node capacity vs requests
kubectl describe nodes | grep -A10 "Allocated resources:"

Common reasons for scheduling failures:

| Event message | Cause | Fix |
| --- | --- | --- |
| Insufficient cpu | Nodes are full, autoscaler at max | Increase --max-count on the node pool or reduce resource requests |
| Insufficient memory | Nodes are full, autoscaler at max | Same as above, for memory requests |
| node(s) had taint | Pod missing toleration | Add a toleration to the pod spec or use the correct node pool |
| no persistent volumes available | PVC cannot bind | Check that the storage class exists and quota is not exhausted |
| Too many pods | Node at its max pod limit | The Azure CNI default is 30 per node; create a node pool with a higher --max-pods or add nodes |
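To raise the autoscaler ceiling on a node pool, a common fix for Insufficient cpu or Insufficient memory (names and counts are placeholders):

```shell
# Raise the node pool's autoscaler maximum so pending pods can schedule
az aks nodepool update --resource-group <rg> --cluster-name <cluster> \
  --name <nodepool> --update-cluster-autoscaler --min-count 1 --max-count 10
```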

Common AKS incident patterns

Node not ready

# Check node status
kubectl describe node <node-name> | grep -A20 "Conditions:"

# Check kubelet logs via Azure portal or node SSH
# Common causes: disk pressure, OOM killer, kubelet crash

Disk pressure: The node's OS disk is full. Usual culprits are container logs and unused images. Mitigation: enable ephemeral OS disks on the node pool to get a fresh disk on every reboot.

OOM: The kernel OOM killer terminated the kubelet or a system process. Check dmesg output on the node. Mitigation: set resource limits on all pods and use Burstable or Guaranteed QoS classes.
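Both causes can be inspected without SSH by starting a node debug pod (kubectl 1.25 or later; the node name is a placeholder, and reading kernel logs may require elevated privileges depending on node configuration):

```shell
# Start an interactive debug pod on the node; the node filesystem is mounted at /host
kubectl debug node/<node-name> -it --image=busybox:1.36 -- sh

# Inside the pod:
#   df -h /host                        # check for disk pressure
#   chroot /host dmesg | grep -i oom   # check for OOM-killer activity
```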

Certificate expiry

AKS auto-rotates cluster certificates, but custom webhook certificates and ingress TLS certs are your responsibility.

# Find the API server endpoint, then check its certificate expiry with openssl
kubectl config view --minify -o jsonpath='{.clusters[0].cluster.server}'
# echo | openssl s_client -connect <api-server-fqdn>:443 2>/dev/null | openssl x509 -noout -enddate

# Check webhook certificates
kubectl get validatingwebhookconfigurations -o yaml | grep -i "caBundle"
kubectl get mutatingwebhookconfigurations -o yaml | grep -i "caBundle"

# Rotate AKS cluster certificates if expired
az aks rotate-certs --resource-group <rg> --name <cluster>
Warning: az aks rotate-certs causes a rolling restart of all nodes. Schedule it during a maintenance window; it can take 30 minutes or more on large clusters.

Quota exhaustion

# Check Azure subscription quotas
az vm list-usage --location <region> -o table

# Check Kubernetes resource quotas
kubectl get resourcequotas --all-namespaces

# Check PVC usage
kubectl get pvc --all-namespaces

Failed upgrade

# Check upgrade status
az aks show --resource-group <rg> --name <cluster> --query "provisioningState"

# Check for stuck nodes
kubectl get nodes -o wide
kubectl get pods --all-namespaces | grep -v Running | grep -v Completed

# Check PodDisruptionBudgets blocking drain
kubectl get pdb --all-namespaces

A common cause of stuck upgrades is a PDB that does not allow any disruptions. If minAvailable equals the replica count, the node can never drain. Fix the PDB, then the upgrade proceeds.
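One way to find the PDBs that currently allow zero disruptions, i.e. the ones blocking drain:

```shell
# List PDBs whose status reports no disruptions allowed
kubectl get pdb --all-namespaces \
  -o jsonpath='{range .items[?(@.status.disruptionsAllowed==0)]}{.metadata.namespace}{"\t"}{.metadata.name}{"\n"}{end}'
```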

DNS resolution failures

# Test DNS from inside a pod
kubectl run dns-test --image=busybox:1.36 --rm -it --restart=Never -- nslookup kubernetes.default

# Check CoreDNS pods
kubectl get pods -n kube-system -l k8s-app=kube-dns
kubectl logs -n kube-system -l k8s-app=kube-dns --tail=100

# Check CoreDNS ConfigMap for custom entries
kubectl get configmap coredns-custom -n kube-system -o yaml
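If the logs show configuration errors after a recent change, a rolling restart of CoreDNS is a common mitigation (pods restart one at a time, so resolution keeps working):

```shell
# Trigger a rolling restart of CoreDNS and wait for it to settle
kubectl -n kube-system rollout restart deployment coredns
kubectl -n kube-system rollout status deployment coredns
```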

Escalation matrix

| Severity | Criteria | Action | Azure support level |
| --- | --- | --- | --- |
| Sev A (Critical) | Production workload completely down, no workaround | Open Azure support ticket immediately, request call-back | Business Critical or Unified |
| Sev B (High) | Production degraded, workaround exists | Open ticket within 1 hour, include diagnostics | Standard or above |
| Sev C (Medium) | Non-production issue or intermittent problem | Open ticket during business hours | Standard |
| Sev D (Low) | Question or guidance request | Use Microsoft Q&A or open advisory ticket | Any |

What to include in every support ticket:

- Subscription ID
- Resource group and cluster name
- Region
- Kubernetes version (kubectl version; the --short flag was removed in kubectl 1.28)
- Timestamp of when the issue started (UTC)
- Output of: kubectl get nodes -o wide
- Output of: kubectl get pods --all-namespaces --field-selector status.phase!=Running
- Output of: kubectl get events --sort-by='.lastTimestamp' --all-namespaces | tail -50
- Azure Activity Log entries for the cluster (last 2 hours)
- Correlation ID from any failed Azure operations
Info: Run az aks get-credentials before generating diagnostics. For managed AAD clusters, use az aks get-credentials --admin if your regular token is expired (this requires local accounts to be enabled on the cluster); you need working kubectl access to collect the data above.
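The kubectl outputs in the ticket checklist above can be gathered in one pass with a short script (output paths are illustrative):

```shell
# Collect ticket attachments into ./diag/
mkdir -p diag
kubectl version > diag/version.txt
kubectl get nodes -o wide > diag/nodes.txt
kubectl get pods --all-namespaces --field-selector status.phase!=Running > diag/unhealthy-pods.txt
kubectl get events --sort-by='.lastTimestamp' --all-namespaces | tail -50 > diag/events.txt
```

Attach the whole diag directory to the support ticket along with the Activity Log export.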

Post-incident checklist

Complete this within 48 hours of every Sev A or Sev B incident:

  1. Timeline — Write a minute-by-minute account of what happened, when it was detected, and when it was resolved.
  2. Root cause — Identify the actual root cause, not the proximate trigger.
  3. Detection gap — How long between the issue starting and the first alert? If more than 5 minutes, fix monitoring.
  4. Resolution steps — Document exactly what commands were run to resolve the issue.
  5. Prevention — Identify at least one action item that prevents recurrence. Assign an owner and a due date.
  6. Runbook update — If this incident type is not covered in the runbook, add it.

Essential tools for on-call

kubectl aliases

Add these to your shell profile:

alias k='kubectl'
alias kg='kubectl get'
alias kd='kubectl describe'
alias kl='kubectl logs'
alias kgp='kubectl get pods'
alias kgn='kubectl get nodes'
alias kge='kubectl get events --sort-by=".lastTimestamp"'
alias kns='kubectl config set-context --current --namespace'

# Quick health check
alias khealth='kubectl get nodes && echo "---" && kubectl get pods -A --field-selector status.phase!=Running,status.phase!=Succeeded'

Azure portal resources

  • Resource Health — Shows if the AKS control plane is healthy. Go to the AKS resource, select Diagnose and solve problems.
  • Container Insights live logs — Real-time log streaming without kubectl. Go to the AKS resource, select Monitoring > Logs > Live data.
  • Activity Log — Shows recent control plane operations (scale, upgrade, restart). Filter to the last 2 hours.

CLI diagnostics

# Collect AKS diagnostics bundle
az aks kollect --resource-group <rg> --name <cluster> --storage-account <sa>

# Check AKS managed control plane health
az aks show --resource-group <rg> --name <cluster> --query "powerState"

Runbook template

Copy this template and customize it for your team. Store it in your team's wiki or incident management system.

# Runbook: [Incident type]

## Symptoms
- [What does the alert look like?]
- [What do users report?]

## Triage steps
1. [First command to run]
2. [Second command to run]
3. [How to confirm this is the right runbook]

## Resolution
1. [Step-by-step fix]
2. [Verification that the fix worked]

## Escalation
- If step N fails, escalate to [team/person]
- Azure support severity: [A/B/C]

## Prevention
- [What monitoring should catch this earlier]
- [What config change prevents recurrence]

Resources