Backup and disaster recovery
Treat Kubernetes as cattle, not pets. Your Git repo IS your primary backup for manifests. AKS Backup exists for the things Git cannot capture: persistent volume data and runtime cluster state.
The fundamental question
Before designing your DR strategy, answer this: is your workload stateless or stateful?
| Workload Type | Recovery Strategy | Backup Needed? |
|---|---|---|
| Stateless apps (APIs, frontends) | Redeploy from Git + CI/CD | No. Just redeploy. |
| Stateful apps (databases, queues) | PV snapshots + restore | Yes. Non-negotiable. |
| Mixed (app + attached storage) | Git for manifests, AKS Backup for PVs | Yes, for the PV layer. |
For stateless applications, your Git repository plus your CI/CD pipeline IS your disaster recovery plan. You do not need AKS Backup to recover a deployment manifest. You need it to recover the data on a PersistentVolume.
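For a cluster that already exists, one quick way to answer the stateless-or-stateful question is to check which namespaces own PersistentVolumeClaims. The sketch below is illustrative only: the function name `stateful_namespaces` is hypothetical, and it assumes the default column layout of `kubectl get pvc --all-namespaces --no-headers`, where the namespace is the first column.

```shell
#!/usr/bin/env bash
# stateful_namespaces: read `kubectl get pvc --all-namespaces --no-headers`
# output on stdin and print each namespace that owns at least one PVC.
# Assumption: the namespace is the first whitespace-separated column.
stateful_namespaces() {
  awk 'NF { print $1 }' | sort -u
}

# Usage against a live cluster (not run here):
#   kubectl get pvc --all-namespaces --no-headers | stateful_namespaces
```

Any namespace this prints holds data that Git cannot restore. Namespaces it does not print are candidates for redeploy-only recovery.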
AKS backup
AKS Backup is the Azure-native backup solution, built on Backup Vault and Trusted Access. It is managed and integrated: the only in-cluster component is the backup extension you install in the first step, so there are no separate backup agents for you to run or maintain.
What gets backed up
- Kubernetes resources: Deployments, Services, ConfigMaps, Secrets, CRDs
- Persistent Volumes: CSI snapshots of Azure Disk volumes. Verify Azure Files support in the current AKS Backup support matrix before relying on it.
- Cluster-scoped resources: Namespaces, ClusterRoles, StorageClasses
Enable AKS backup
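# Install the backup extension. It needs a storage account and blob container
# for backup data; the <...> values are placeholders to fill in.
az k8s-extension create \
  --name azure-aks-backup \
  --cluster-name myCluster \
  --resource-group myRG \
  --cluster-type managedClusters \
  --extension-type Microsoft.DataProtection.Kubernetes \
  --scope cluster \
  --configuration-settings \
    blobContainer=<container> \
    storageAccount=<storage-account> \
    storageAccountResourceGroup=<storage-rg> \
    storageAccountSubscriptionId=<subscription-id>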
# Create a backup vault
az dataprotection backup-vault create \
--vault-name myVault \
--resource-group myRG \
--location eastus2 \
--storage-setting "[{type:GeoRedundant,datastore-type:VaultStore}]"
# Configure backup policy (daily, 30-day retention)
az dataprotection backup-policy create \
--vault-name myVault \
--resource-group myRG \
--name daily-30d \
--policy @backup-policy.json
The cost is negligible compared to losing your workload state. A single unrecoverable PV loss will cost you more in incident response than a year of backup storage.
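The backup-policy.json file passed to the policy command is not something you have to write from scratch. A reasonable starting point, assuming the `dataprotection` CLI extension is installed and you are logged in, is to export the default AKS policy template and edit its schedule and retention rules to match the daily, 30-day target. The wrapper function name `write_default_policy` is hypothetical.

```shell
#!/usr/bin/env bash
# write_default_policy: save the default AKS backup policy template to a file
# so it can be edited before `az dataprotection backup-policy create`.
# Assumption: the dataprotection CLI extension is installed and logged in.
write_default_policy() {
  local out_file="${1:-backup-policy.json}"
  az dataprotection backup-policy get-default-policy-template \
    --datasource-type AzureKubernetesService > "$out_file"
}

# Usage (not run here):
#   write_default_policy backup-policy.json
#   # edit the schedule and retention rules, then create the policy
```

The template's default schedule and retention may not match your target, so review the file before creating the policy from it.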
Cross-region restore
Use a geo-redundant backup vault with cross-region restore enabled. When your primary region goes down, you can then restore workloads into a cluster in the paired region.
Requirements:
- Backup vault must be GeoRedundant (not LocallyRedundant)
- Target cluster must exist in the secondary region
- Network policies and ingress must be pre-configured in the DR cluster
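The first requirement is easy to check from a script before you depend on it. This is a sketch, not a definitive check: the function names are hypothetical, and the query path `properties.storageSettings[0].type` is an assumption about the shape of the `az dataprotection backup-vault show` output that you should confirm against your own vault.

```shell
#!/usr/bin/env bash
# vault_redundancy: print the storage redundancy type of a backup vault.
# Assumption: the first storageSettings entry describes the vault store.
vault_redundancy() {
  local vault="$1" rg="$2"
  az dataprotection backup-vault show \
    --vault-name "$vault" --resource-group "$rg" \
    --query "properties.storageSettings[0].type" --output tsv
}

# require_geo_redundant: succeed only when the vault is GeoRedundant.
require_geo_redundant() {
  local redundancy
  redundancy=$(vault_redundancy "$1" "$2") || return 1
  if [ "$redundancy" != "GeoRedundant" ]; then
    echo "vault $1 is $redundancy; cross-region restore needs GeoRedundant" >&2
    return 1
  fi
}

# Usage (not run here):
#   require_geo_redundant myVault myRG && echo "vault is ready for cross-region restore"
```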
AKS backup vs Velero
| Criteria | AKS Backup | Velero |
|---|---|---|
| Managed by | Microsoft | You |
| Storage backend | Azure Backup Vault | Object storage you manage (Azure Blob or S3-compatible) |
| PV snapshots | Native CSI integration | Requires plugins |
| Cross-region | Built-in with geo-redundant vault | Manual replication setup |
| Support | Microsoft support ticket | Community / vendor |
| Cost model | Per-protected-instance | Storage + compute you manage |
AKS Backup is integrated and managed: you do not run your own Velero installation, wire up its object storage backend, or track Velero version compatibility. If you already have Velero running and it works, keep it. For new clusters, choose AKS Backup.
Disaster recovery patterns
Active-passive (recommended for most teams)
Two clusters in different regions. Primary handles all traffic. Secondary is warm (running, but no traffic). Failover via Azure Traffic Manager or Front Door DNS switch.
- RTO: 5-15 minutes (DNS propagation)
- RPO: Depends on backup frequency (hourly = up to 1 hour of data loss)
- Cost: ~1.5x a single cluster (secondary runs smaller node pools)
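The RPO figure is simple arithmetic: worst-case data loss equals the time since the last successful backup. The sketch below makes that checkable in a monitoring script. The function names are hypothetical, and the inputs are assumed to be Unix epoch seconds.

```shell
#!/usr/bin/env bash
# backup_age_seconds: seconds elapsed between the last backup and "now".
# Both arguments are Unix epoch seconds.
backup_age_seconds() {
  local last_backup="$1" now="$2"
  echo $(( now - last_backup ))
}

# rpo_met: succeed when the backup age is within the RPO target (seconds).
rpo_met() {
  local last_backup="$1" now="$2" rpo_target="$3"
  [ "$(backup_age_seconds "$last_backup" "$now")" -le "$rpo_target" ]
}

# Usage: an hourly schedule checked against a 1-hour (3600 s) target.
#   rpo_met "$last_backup_epoch" "$(date +%s)" 3600 || echo "RPO breached"
```

If the check fails during normal operation, the backup schedule is not delivering the RPO you have promised, and a failover at that moment would lose more data than planned.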
Active-active (mission-critical only)
Two clusters both serving traffic via Azure Front Door or Traffic Manager. No failover needed because both are always active.
- RTO: Near-zero (traffic shifts automatically)
- RPO: Near-zero (both clusters have current state)
- Cost: 2x a single cluster
- Complexity: High. Requires stateless apps or distributed data layer.
GitOps-based recovery
For fully stateless workloads: delete the broken cluster, create a new one, point Flux/ArgoCD at your Git repo, and let it reconcile. No backup needed.
# Disaster strikes. Recovery:
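az aks create --resource-group dr-rg --name recovery-cluster ...
# Fetch the kubeconfig so flux targets the new cluster
az aks get-credentials --resource-group dr-rg --name recovery-cluster
flux bootstrap github --owner=myorg --repository=k8s-manifests --path=clusters/prod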
Backup scope decisions
Not everything needs to be backed up. Be deliberate about what you protect.
| Resource | Back Up? | Rationale |
|---|---|---|
| Deployments, Services | No | Already in Git. Redeploy from source of truth. |
| ConfigMaps, Secrets | Maybe | Only if not managed by GitOps or external secret store |
| PersistentVolumes | Yes | Data not stored anywhere else |
| CRDs and CRs | Yes | Operator state may not be in Git |
| Namespaces | No | Recreated during deployment |
| RBAC (Roles, Bindings) | Maybe | Only if manually managed, not GitOps |
The rule is simple: if it exists only inside the cluster and nowhere else, back it up. If it can be reconstructed from Git, CI/CD, or an external system, do not waste backup storage on it.
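The rule can be applied mechanically to an inventory of resource kinds. The sketch below assumes a hypothetical two-column inventory format (`kind source`), where the source is `cluster` when the resource exists only inside the cluster and anything else (`git`, `external`) when it can be rebuilt. Both the format and the function name `select_for_backup` are illustrative.

```shell
#!/usr/bin/env bash
# select_for_backup: read "kind source" lines on stdin and print the kinds
# whose only copy lives inside the cluster (source == "cluster").
# Assumption: the inventory format is a made-up convention for this sketch.
select_for_backup() {
  awk '$2 == "cluster" { print $1 }'
}

# Usage:
#   printf 'Deployment git\nPersistentVolume cluster\n' | select_for_backup
```

Keeping the inventory in Git next to your manifests makes the backup scope reviewable like any other change.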
Testing your DR plan
A backup you have never restored is not a backup. Schedule quarterly DR drills.
# Restore to a test cluster (not production!)
az dataprotection backup-instance restore trigger \
  --vault-name myVault \
  --resource-group myRG \
  --backup-instance-name myCluster-backup \
  --restore-request-object @restore-config.json
Validate after restore:
- All expected namespaces exist
- PVs are attached and contain expected data
- Services are reachable and responding
- CRDs and custom resources are intact
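The first item on the checklist is easy to automate as part of the drill. This sketch compares the namespaces you expect against what the restored cluster reports. The function name `missing_namespaces` is hypothetical, and it assumes input in the form printed by `kubectl get namespaces -o name`.

```shell
#!/usr/bin/env bash
# missing_namespaces: print every expected namespace (given as arguments)
# that is absent from the actual namespace list supplied on stdin.
# Accepts both "name" and "namespace/name" lines, the latter being what
# `kubectl get namespaces -o name` prints.
missing_namespaces() {
  local actual ns
  actual=$(sed 's|^namespace/||')
  for ns in "$@"; do
    printf '%s\n' "$actual" | grep -qxF -- "$ns" || printf '%s\n' "$ns"
  done
}

# Usage against the restored test cluster (not run here):
#   kubectl get namespaces -o name | missing_namespaces payments orders
```

Empty output means every expected namespace came back. Anything printed is a gap to investigate before you call the drill a success.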
Common mistakes
- Backing up only manifests -- Your manifests are already in Git. Back up what Git cannot store: PV data.
- Never testing restore -- An untested backup tends to fail exactly when you need it. Run a restore drill every quarter.
- LocallyRedundant vault for production -- If the region fails, your backups fail too.
- No DR runbook -- When the incident happens at 3 AM, you need step-by-step instructions, not a wiki page.
- Assuming PVs survive cluster deletion -- They do not. Dynamically provisioned disks live in the cluster's node resource group and are deleted along with it; without snapshots or backups, that data is gone.
- Backing up everything -- Wastes storage and makes restore slower. Be selective.
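To see which volumes the cluster-deletion mistake would affect, list the PVs whose backing disk sits in the cluster's node resource group. On AKS, dynamically provisioned disks are normally created there, and that resource group is removed with the cluster. The sketch below is illustrative: the function name `pvs_in_resource_group` is hypothetical, and it assumes Azure Disk CSI volumes whose `volumeHandle` is the full disk resource ID.

```shell
#!/usr/bin/env bash
# pvs_in_resource_group: read "pv-name volume-handle" lines on stdin and
# print the PVs whose disk resource ID falls inside the given resource group.
# The comparison is case-insensitive because Azure resource IDs are.
pvs_in_resource_group() {
  awk -v rg="$1" '
    BEGIN { needle = "/resourcegroups/" tolower(rg) "/" }
    index(tolower($2), needle) { print $1 }
  '
}

# Usage (not run here):
#   node_rg=$(az aks show --resource-group myRG --name myCluster \
#     --query nodeResourceGroup --output tsv)
#   kubectl get pv --no-headers \
#     -o custom-columns=NAME:.metadata.name,HANDLE:.spec.csi.volumeHandle \
#     | pvs_in_resource_group "$node_rg"
```

Every PV this prints needs a snapshot or backup before anyone deletes the cluster.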