GPU node pools

GPUs are expensive. Configure them correctly, scale to zero when idle, and use spot instances for training jobs that can checkpoint.

GPU VM families

| Series | GPU | VRAM | Use case | Opinion |
|---|---|---|---|---|
| NC A100 v4 | A100 | 80 GB | Most ML/AI workloads | Default choice for inference and fine-tuning |
| ND H100 v5 | H100 | 80 GB | Large model training | When A100 isn't enough (100B+ param models) |
| NC T4 v3 | T4 | 16 GB | Light inference, dev/test | Budget option, good for testing |
| NV v3 | M60 | 8 GB | Visualization only | Not for ML/AI |
Opinion

Use Standard_NC24ads_A100_v4 for most ML/AI workloads. It handles inference, fine-tuning, and moderate training. Only move to ND H100 for large-scale distributed training. Use NC T4 for dev/test and light inference where cost matters more than throughput.

Creating a GPU node pool

# Add GPU node pool with autoscaling (scales to 0 when idle)
az aks nodepool add \
--resource-group myrg \
--cluster-name myaks \
--name gpua100 \
--node-vm-size Standard_NC24ads_A100_v4 \
--node-count 0 \
--min-count 0 \
--max-count 4 \
--enable-cluster-autoscaler \
--node-taints "sku=gpu:NoSchedule" \
--labels workload=gpu \
--zones 1 2 3
warning

Always taint GPU nodes with NoSchedule. Without taints, the scheduler will place regular workloads on your expensive GPU nodes. The taint ensures only pods with matching tolerations land there.
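To double-check that the taint landed, you can list it per node. A quick sketch, assuming the `workload=gpu` label from the pool above:

```shell
# Print each GPU node with its taint keys; every node should show "sku"
kubectl get nodes -l workload=gpu \
  -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.taints[*].key}{"\n"}{end}'
```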

NVIDIA device plugin

AKS automatically installs the NVIDIA device plugin on GPU nodes. You don't need to install it manually. It exposes nvidia.com/gpu as a schedulable resource.
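To confirm the plugin has registered the GPUs, check the node's allocatable resources. A sketch, again assuming the `workload=gpu` label from the pool above:

```shell
# Each GPU node should report nvidia.com/gpu under allocatable
# (e.g. "1" for a Standard_NC24ads_A100_v4 node)
kubectl get nodes -l workload=gpu \
  -o custom-columns='NODE:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu'
```

If the `GPU` column shows `<none>`, the node is still initializing or the device plugin daemonset hasn't run on it yet.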

Requesting GPU in pod spec

apiVersion: v1
kind: Pod
metadata:
  name: gpu-inference
spec:
  tolerations:
  - key: "sku"
    operator: "Equal"
    value: "gpu"
    effect: "NoSchedule"
  containers:
  - name: model-server
    image: myacr.azurecr.io/inference-server:latest
    resources:
      limits:
        nvidia.com/gpu: 1
    env:
    - name: NVIDIA_VISIBLE_DEVICES
      value: "all"
  nodeSelector:
    workload: gpu
info

GPUs cannot be shared between containers natively. If you request nvidia.com/gpu: 1, you get a whole GPU. For sharing, look at NVIDIA MIG (Multi-Instance GPU) or time-slicing -- covered in Inference Serving.

Spot instances for GPU

Use spot for training jobs that can checkpoint. Never use spot for inference serving.

# Spot GPU pool -- saves 60-90% but can be evicted
az aks nodepool add \
--resource-group myrg \
--cluster-name myaks \
--name gpuspot \
--node-vm-size Standard_NC24ads_A100_v4 \
--priority Spot \
--eviction-policy Delete \
--spot-max-price -1 \
--min-count 0 \
--max-count 8 \
--enable-cluster-autoscaler \
--node-taints "kubernetes.azure.com/scalesetpriority=spot:NoSchedule" \
--labels workload=gpu-spot
| Workload | Use spot? | Why |
|---|---|---|
| Model training (with checkpointing) | Yes | Save 60-90%, restart from checkpoint on eviction |
| Batch inference (non-realtime) | Yes | Re-queue failed batches |
| Real-time inference serving | No | Eviction causes user-facing downtime |
| Fine-tuning (hours-long) | Yes, with checkpoints | Save significantly on long jobs |

Cost management

GPUs are 5-10x more expensive than general compute. Manage costs aggressively:

  1. Scale to zero: Set --min-count 0 on GPU pools. The autoscaler removes nodes when no GPU pods are pending.
  2. Use spot for training: 60-90% cheaper for interruptible work.
  3. Right-size GPU requests: Don't request 4 GPUs when 1 suffices. Each unused GPU wastes $2-10/hour.
  4. Schedule training off-peak: Spot availability is higher during off-peak hours.
# Check pod-level resource usage before adding capacity.
# Note: kubectl top reports CPU/memory only -- per-GPU utilization requires
# the NVIDIA DCGM exporter or nvidia-smi on the node. This assumes your GPU
# pods carry a workload=gpu label.
kubectl top pods -l workload=gpu --containers
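The scale-to-zero math in rule 1 is worth spelling out. A sketch with illustrative rates (not official Azure pricing -- check the pricing page for your region):

```shell
# Illustrative rates only -- substitute real numbers from the Azure pricing page.
hourly=3.67          # assumed on-demand price for one Standard_NC24ads_A100_v4 node
hours_per_month=730

# Cost of leaving min-count at 1: one idle node billed all month
idle_monthly=$(awk -v h="$hourly" -v m="$hours_per_month" 'BEGIN{printf "%.0f", h*m}')
echo "Idle GPU node per month: \$${idle_monthly}"

# Same node on spot at an assumed 80% discount
spot_hourly=$(awk -v h="$hourly" 'BEGIN{printf "%.2f", h*0.2}')
echo "Spot hourly at 80% off: \$${spot_hourly}"
```

At these assumed rates, a single always-on A100 node costs thousands per month even when idle, which is why `--min-count 0` matters for intermittent workloads.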

Common mistakes

  1. Not tainting GPU nodes -- Regular pods fill expensive GPU nodes.
  2. Setting min-count > 0 for intermittent workloads -- Paying for idle GPUs 24/7.
  3. Using spot for production inference -- Users get errors when nodes are evicted.
  4. Forgetting availability zones -- GPU SKUs have limited zone availability. Check first.
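Mistake 4 is easy to catch ahead of time with the CLI. A sketch (the region is an example; substitute your own):

```shell
# List zone availability and restrictions for the A100 SKU in a target region
az vm list-skus --location eastus --size Standard_NC24ads_A100_v4 \
  --all --output table
```

The `Restrictions` column shows zones (or whole regions) where the SKU is blocked for your subscription; request quota or pick another region before creating the pool.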

Resources