GPU node pools
GPUs are expensive. Configure them correctly, scale to zero when idle, and use spot instances for training jobs that can checkpoint.
GPU VM families
| Series | GPU | VRAM | Use Case | Opinion |
|---|---|---|---|---|
| NC A100 v4 | A100 | 80 GB | Most ML/AI workloads | Default choice for inference and fine-tuning |
| ND H100 v5 | H100 | 80 GB | Large model training | When A100 isn't enough (100B+ param models) |
| NC T4 v3 | T4 | 16 GB | Light inference, dev/test | Budget option, good for testing |
| NV v3 | M60 | 8 GB | Visualization only | Not for ML/AI |
Use Standard_NC24ads_A100_v4 for most ML/AI workloads. It handles inference, fine-tuning, and moderate training. Only move to ND H100 for large-scale distributed training. Use NC T4 for dev/test and light inference where cost matters more than throughput.
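GPU SKUs aren't offered in every region or zone, and most subscriptions need a quota increase before the first GPU node will provision. A quick availability check (region and SKU here are examples; substitute your own):
# List zone and subscription restrictions for a GPU SKU in a region
az vm list-skus --location eastus --size Standard_NC24ads_A100_v4 --all --output table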
Creating a GPU node pool
# Add GPU node pool with autoscaling (scales to 0 when idle)
az aks nodepool add \
  --resource-group myrg \
  --cluster-name myaks \
  --name gpua100 \
  --node-vm-size Standard_NC24ads_A100_v4 \
  --node-count 0 \
  --min-count 0 \
  --max-count 4 \
  --enable-cluster-autoscaler \
  --node-taints "sku=gpu:NoSchedule" \
  --labels workload=gpu \
  --zones 1 2 3
Always taint GPU nodes with NoSchedule. Without taints, the scheduler will place regular workloads on your expensive GPU nodes. The taint ensures only pods with matching tolerations land there.
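Once a node scales up, spot-check that the taint is actually in place (values match the command above):
# GPU nodes should show the sku=gpu:NoSchedule taint
kubectl describe nodes -l workload=gpu | grep -i taints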
NVIDIA device plugin
AKS automatically installs the NVIDIA device plugin on GPU nodes. You don't need to install it manually. It exposes nvidia.com/gpu as a schedulable resource.
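To confirm the plugin registered the GPU, check the node's allocatable resources (a quick sketch; assumes the workload=gpu label from the pool above):
# Each GPU node should report nvidia.com/gpu under allocatable
kubectl get nodes -l workload=gpu \
  -o custom-columns='NAME:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu'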
Requesting GPU in pod spec
apiVersion: v1
kind: Pod
metadata:
  name: gpu-inference
spec:
  tolerations:
  - key: "sku"
    operator: "Equal"
    value: "gpu"
    effect: "NoSchedule"
  containers:
  - name: model-server
    image: myacr.azurecr.io/inference-server:latest
    resources:
      limits:
        nvidia.com/gpu: 1
    env:
    - name: NVIDIA_VISIBLE_DEVICES
      value: "all"
  nodeSelector:
    workload: gpu
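Assuming the manifest above is saved as gpu-inference.yaml and the image includes nvidia-smi (the NVIDIA runtime typically mounts it into GPU containers), a quick smoke test:
# Apply the pod and confirm the container actually sees the GPU
kubectl apply -f gpu-inference.yaml
kubectl exec gpu-inference -- nvidia-smi --query-gpu=name,memory.total --format=csv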
GPUs cannot be shared between containers natively. If you request nvidia.com/gpu: 1, you get a whole GPU. For sharing, look at NVIDIA MIG (Multi-Instance GPU) or time-slicing -- covered in Inference Serving.
Spot instances for GPU
Use spot for training jobs that can checkpoint. Never use spot for inference serving.
# Spot GPU pool -- saves 60-90% but can be evicted
az aks nodepool add \
  --resource-group myrg \
  --cluster-name myaks \
  --name gpuspot \
  --node-vm-size Standard_NC24ads_A100_v4 \
  --priority Spot \
  --eviction-policy Delete \
  --spot-max-price -1 \
  --node-count 0 \
  --min-count 0 \
  --max-count 8 \
  --enable-cluster-autoscaler \
  --node-taints "kubernetes.azure.com/scalesetpriority=spot:NoSchedule" \
  --labels workload=gpu-spot
| Workload | Use Spot? | Why |
|---|---|---|
| Model training (with checkpointing) | Yes | Save 60-90%, restart from checkpoint on eviction |
| Batch inference (non-realtime) | Yes | Re-queue failed batches |
| Real-time inference serving | No | Eviction causes user-facing downtime |
| Fine-tuning (hours-long) | Yes, with checkpoints | Save significantly on long jobs |
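A minimal sketch of an interruptible training Job targeting the spot pool above. The image, arguments, and PVC name are hypothetical; the parts that matter are the spot toleration, a restart policy so the pod is recreated after an eviction, and a checkpoint volume the training script is assumed to resume from:
apiVersion: batch/v1
kind: Job
metadata:
  name: finetune-llm
spec:
  backoffLimit: 10          # tolerate several spot evictions before giving up
  template:
    spec:
      restartPolicy: OnFailure
      tolerations:
      - key: "kubernetes.azure.com/scalesetpriority"
        operator: "Equal"
        value: "spot"
        effect: "NoSchedule"
      nodeSelector:
        workload: gpu-spot
      containers:
      - name: trainer
        image: myacr.azurecr.io/trainer:latest   # hypothetical training image
        args: ["--resume-from", "/ckpt/latest"]  # script resumes from newest checkpoint
        resources:
          limits:
            nvidia.com/gpu: 1
        volumeMounts:
        - name: ckpt
          mountPath: /ckpt
      volumes:
      - name: ckpt
        persistentVolumeClaim:
          claimName: training-checkpoints        # hypothetical PVC for checkpoints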
Cost management
GPUs are 5-10x more expensive than general compute. Manage costs aggressively:
- Scale to zero: Set --min-count 0 on GPU pools. The autoscaler removes nodes when no GPU pods are pending.
- Use spot for training: 60-90% cheaper for interruptible work.
- Right-size GPU requests: Don't request 4 GPUs when 1 suffices. Each unused GPU wastes $2-10/hour.
- Schedule training off-peak: Spot availability is higher during off-peak hours.
# Check pod CPU/memory before adding capacity (assumes GPU pods carry a workload=gpu
# label; kubectl top doesn't report GPU utilization -- run nvidia-smi in the pod
# or scrape the NVIDIA DCGM exporter for that)
kubectl top pods -l workload=gpu --containers
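To verify scale-to-zero is actually happening, check the current node count on the GPU pools (names match the examples above):
# Both pools should report 0 when no GPU work is pending
az aks nodepool show --resource-group myrg --cluster-name myaks --name gpua100 --query count
az aks nodepool show --resource-group myrg --cluster-name myaks --name gpuspot --query count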
Common mistakes
- Not tainting GPU nodes -- Regular pods fill expensive GPU nodes.
- Setting min-count > 0 for intermittent workloads -- Paying for idle GPUs 24/7.
- Using spot for production inference -- Users get errors when nodes are evicted.
- Forgetting availability zones -- GPU SKUs have limited zone availability. Check first.