# Cluster Autoscaler

HPA scales pods. Cluster Autoscaler scales the nodes those pods run on. Without it, HPA creates pods that sit in Pending state forever because there is nowhere to schedule them.

## How it works

The logic is straightforward:

  1. **Scale up:** A pod is unschedulable (no node has enough free resources to place it), so CA provisions a new node.
  2. **Scale down:** A node stays underutilized (below a configurable threshold) for a sustained period, so CA drains and removes it.
:::info
Cluster Autoscaler does NOT look at CPU/memory utilization on nodes. It looks at scheduling failures (scale-up) and pod packing density (scale-down). This is a critical distinction most teams get wrong.
:::
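
You can watch the scale-up trigger directly: deploy pods whose requests cannot fit on any existing node and observe the events. A minimal sketch; the deployment name, image, and request sizes are illustrative, not from this guide:

```bash
# Hypothetical demo: request more CPU per pod than any current node has free
kubectl apply -f - <<'EOF'
apiVersion: apps/v1
kind: Deployment
metadata:
  name: scale-up-demo            # illustrative name
spec:
  replicas: 5
  selector:
    matchLabels:
      app: scale-up-demo
  template:
    metadata:
      labels:
        app: scale-up-demo
    spec:
      containers:
      - name: idle
        image: registry.k8s.io/pause:3.9
        resources:
          requests:
            cpu: "3500m"         # nearly a full 4-vCPU node after system reservations
EOF

# Pods go Pending as Unschedulable; CA reacts to that, not to node CPU usage
kubectl get events --field-selector reason=TriggeredScaleUp
```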

## Enabling on AKS

```bash
# Enable on existing node pool
az aks nodepool update \
  --resource-group myRG \
  --cluster-name myAKS \
  --name nodepool1 \
  --enable-cluster-autoscaler \
  --min-count 2 \
  --max-count 10

# Or during cluster creation
az aks create \
  --resource-group myRG \
  --name myAKS \
  --enable-cluster-autoscaler \
  --min-count 2 \
  --max-count 10 \
  --node-vm-size Standard_D4s_v5
```
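
To confirm the autoscaler is active on a pool, query the node pool properties. A quick check, assuming the resource names from the example above:

```bash
# Should print true plus the configured bounds
az aks nodepool show \
  --resource-group myRG \
  --cluster-name myAKS \
  --name nodepool1 \
  --query "{autoscaler:enableAutoScaling, min:minCount, max:maxCount}"
```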

## Key parameters

| Parameter | Recommended value | Rationale |
|---|---|---|
| `min-count` | 2 | Survive a single node failure |
| `max-count` | Budget-dependent | Set this based on your cost ceiling, not wishful thinking |
| `scan-interval` | 10s (default) | The default is fine; faster scanning wastes API calls |
| `scale-down-delay-after-add` | 10m | New nodes need time to receive pods; removing them immediately is wasteful |
| `scale-down-utilization-threshold` | 0.5 | A node must be below 50% utilization to become a removal candidate |
| `scale-down-unneeded-time` | 10m | A node must stay underutilized for 10 minutes before removal |
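
On AKS these knobs are set cluster-wide through the autoscaler profile rather than per pool. A sketch applying the recommendations from the table, using the same illustrative resource names as above:

```bash
# Apply the recommended timings cluster-wide
az aks update \
  --resource-group myRG \
  --name myAKS \
  --cluster-autoscaler-profile \
    scan-interval=10s \
    scale-down-delay-after-add=10m \
    scale-down-utilization-threshold=0.5 \
    scale-down-unneeded-time=10m
```
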
:::warning
Set `scale-down-delay-after-add` to at least 10 minutes. Without it, CA adds a node, pods schedule, some finish quickly, the node looks underutilized, CA removes it, pods go Pending, and CA adds a node again. This thrashing wastes money and creates instability.
:::

## Node autoprovisioning (NAP) vs Cluster Autoscaler

| Feature | Cluster Autoscaler | NAP (AKS Automatic) |
|---|---|---|
| Node pool creation | Manual (you define pools) | Automatic (picks the VM SKU) |
| SKU selection | You choose upfront | Matched to workload requirements |
| Multiple workload types | Requires multiple pools | Handled automatically |
| GPU/Spot support | Manual pool config | Automatic, based on tolerations |
| Complexity | Medium | Low |

If you are on AKS Automatic, NAP handles node scaling for you. It reads pod resource requests and tolerations, then picks the best VM SKU and creates node pools on demand. Stop managing node pools manually.

If you are on AKS Standard, use Cluster Autoscaler with purpose-built node pools.
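
If you want to try NAP on an existing Standard cluster, it is toggled through the node provisioning mode. At the time of writing this requires the aks-preview CLI extension, so treat the exact flag as something to verify against current Azure docs:

```bash
# Sketch only: assumes the aks-preview extension; verify the flag name
az aks update \
  --resource-group myRG \
  --name myAKS \
  --node-provisioning-mode Auto
```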

## Best practice: multiple node pools

Do not run everything on a single Standard_D4s_v5 node pool. Segment by workload class:

```bash
# General workloads
az aks nodepool add \
  --resource-group myRG \
  --cluster-name myAKS \
  --name general \
  --node-vm-size Standard_D4s_v5 \
  --enable-cluster-autoscaler --min-count 2 --max-count 10

# Memory-intensive (caches, in-memory DBs)
az aks nodepool add \
  --resource-group myRG \
  --cluster-name myAKS \
  --name highmem \
  --node-vm-size Standard_E4s_v5 \
  --enable-cluster-autoscaler --min-count 0 --max-count 5 \
  --labels workload=memory-intensive

# GPU workloads (ML inference)
az aks nodepool add \
  --resource-group myRG \
  --cluster-name myAKS \
  --name gpu \
  --node-vm-size Standard_NC6s_v3 \
  --enable-cluster-autoscaler --min-count 0 --max-count 3 \
  --labels workload=gpu --node-taints gpu=true:NoSchedule
```
:::tip
Set `--min-count 0` on specialized pools (GPU, high-memory) and let them scale to zero when no workloads need them. Only your general pool needs a non-zero minimum.
:::
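
Pods target these pools through the labels and taints defined above. A minimal sketch of a GPU workload that tolerates the `gpu=true:NoSchedule` taint and selects the gpu pool; the pod name and image are illustrative:

```bash
kubectl apply -f - <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: inference-demo          # illustrative name
spec:
  nodeSelector:
    workload: gpu               # matches the --labels on the gpu pool
  tolerations:
  - key: gpu
    operator: Equal
    value: "true"
    effect: NoSchedule          # matches the --node-taints on the gpu pool
  containers:
  - name: inference
    image: nvcr.io/nvidia/tritonserver:24.05-py3   # illustrative image
    resources:
      limits:
        nvidia.com/gpu: 1       # requires the NVIDIA device plugin on the pool
EOF
```

With `--min-count 0`, scheduling this pod is what brings the first GPU node online, so expect the node startup latency described below.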

## Common mistakes

**Setting max-count too low.** During a traffic spike, CA hits the ceiling and your pods stay Pending. Monitor unschedulable pod events and increase `max-count` before you need it.

**Not using Pod Disruption Budgets.** When CA removes a node, it evicts the pods running there. Without a PDB, all replicas on that node can terminate simultaneously. Always set PDBs for production workloads.
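
A minimal PDB sketch (the names and labels are illustrative) that keeps at least two replicas up while CA drains a node:

```bash
kubectl apply -f - <<'EOF'
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: web-pdb                 # illustrative name
spec:
  minAvailable: 2               # CA's drain respects this during evictions
  selector:
    matchLabels:
      app: web                  # illustrative label
EOF
```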

**Ignoring node startup time.** A new AKS node takes 2-4 minutes to become Ready. CA cannot provide instant capacity. Plan for this latency with appropriate HPA headroom.

**Single node pool for everything.** A memory-heavy pod on a CPU-optimized node wastes resources. Use node affinity and taints to match workloads to appropriate VM SKUs.

**Forgetting availability zones.** Configure `--zones 1 2 3` on node pools. CA respects zone topology and distributes nodes across zones for high availability.
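
Zone spreading is just a flag at pool creation time. A sketch using the same illustrative resource names as above (the pool name is hypothetical):

```bash
# Spread autoscaled nodes across three availability zones
az aks nodepool add \
  --resource-group myRG \
  --cluster-name myAKS \
  --name zonal \
  --node-vm-size Standard_D4s_v5 \
  --zones 1 2 3 \
  --enable-cluster-autoscaler --min-count 3 --max-count 9
```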

## Monitoring Cluster Autoscaler

```bash
# Check CA status (AKS writes this configmap when the autoscaler is enabled)
kubectl -n kube-system get configmap cluster-autoscaler-status -o yaml

# View CA logs. On AKS the autoscaler runs on the managed control plane, so
# there is no cluster-autoscaler pod to read logs from; enable the
# cluster-autoscaler diagnostic log category on the cluster and query the
# logs through Azure Monitor instead.

# Check for unschedulable pods
kubectl get pods --all-namespaces --field-selector=status.phase=Pending
```
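
CA also surfaces its decisions as Kubernetes events (reasons such as TriggeredScaleUp and NotTriggerScaleUp on pods, ScaleDown on nodes), which are often quicker to scan than logs. A sketch, assuming your kubectl supports the standard event source field selector:

```bash
# List autoscaler decisions as events across all namespaces
kubectl get events -A --field-selector source=cluster-autoscaler
```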

## Resources