
# Platform extensions

A production AKS cluster needs more than just Kubernetes. These are the ecosystem tools that fill the gaps between what Kubernetes provides and what production workloads actually need.

## cert-manager

Automated TLS certificate lifecycle management. It watches Ingress and Gateway resources, requests certificates from issuers like Let's Encrypt or Azure Key Vault, and renews them before expiry. Manual certificate management does not scale — one forgotten renewal takes down production at 2 AM.

### Install

```bash
helm repo add jetstack https://charts.jetstack.io --force-update
helm install cert-manager jetstack/cert-manager \
  --namespace cert-manager --create-namespace \
  --set crds.enabled=true
```

Use a ClusterIssuer instead of namespace-scoped Issuer resources. One issuer serves every namespace.

```yaml
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: letsencrypt-prod
spec:
  acme:
    server: https://acme-v02.api.letsencrypt.org/directory
    email: platform-team@example.com
    privateKeySecretRef:
      name: letsencrypt-prod-key
    solvers:
    - http01:
        ingress:
          class: nginx
```
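
To consume the issuer, reference it from an Ingress annotation; cert-manager then creates and renews the certificate automatically. A minimal sketch, assuming the NGINX ingress controller and a placeholder `app-service` backend:

```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: app-ingress
  annotations:
    cert-manager.io/cluster-issuer: letsencrypt-prod
spec:
  ingressClassName: nginx
  tls:
  - hosts: [app.example.com]
    secretName: app-example-com-tls # cert-manager creates and renews this Secret
  rules:
  - host: app.example.com
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: app-service
            port:
              number: 80
```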
:::warning
Do not use the Let's Encrypt staging issuer in production "just to test." Staging certificates are not trusted by browsers and will cause silent failures in health checks and monitoring tools.
:::

### Common mistakes

| Mistake | Impact | Fix |
| --- | --- | --- |
| Namespace-scoped Issuer per team | Duplicated config, inconsistent renewal | Use ClusterIssuer |
| Missing RBAC for DNS-01 challenges | Certificate issuance fails silently | Grant the cert-manager identity DNS Zone Contributor on your Azure DNS zone |
| Not monitoring certificate expiry | Outages from expired certs | Add a Prometheus alert on `certmanager_certificate_expiration_timestamp_seconds` (see the alert sketch below) |
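
A minimal alert sketch, assuming the Prometheus Operator CRDs are installed and Prometheus already scrapes cert-manager's metrics endpoint; the rule name and 14-day threshold are illustrative:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: cert-manager-expiry
  namespace: cert-manager
spec:
  groups:
  - name: cert-manager
    rules:
    - alert: CertificateExpiringSoon
      # fires when any certificate expires within 14 days
      expr: certmanager_certificate_expiration_timestamp_seconds - time() < 14 * 24 * 3600
      for: 1h
      labels:
        severity: warning
      annotations:
        summary: "Certificate {{ $labels.name }} expires in under 14 days"
```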

## external-dns

Automatically creates and updates DNS records in Azure DNS when you create Ingress or Service resources. If your process involves opening the Azure portal to add an A record, you have a gap in your GitOps pipeline.

### Install

```bash
helm repo add external-dns https://kubernetes-sigs.github.io/external-dns
helm install external-dns external-dns/external-dns \
  --namespace external-dns --create-namespace \
  --set provider.name=azure \
  --set azure.resourceGroup=<YOUR_DNS_RG> \
  --set azure.subscriptionId=<YOUR_SUB_ID> \
  --set azure.tenantId=<YOUR_TENANT_ID> \
  --set policy=upsert-only \
  --set registry=txt --set txtOwnerId=aks-cluster-01
```
:::tip
Set policy=upsert-only in production. The default sync policy deletes DNS records that are no longer backed by a Kubernetes resource, which can wipe records managed outside the cluster.
:::

Use workload identity for authentication:

```bash
az role assignment create \
  --assignee-object-id <EXTERNAL_DNS_MI_OBJECT_ID> \
  --role "DNS Zone Contributor" \
  --scope /subscriptions/<SUB_ID>/resourceGroups/<RG>/providers/Microsoft.Network/dnsZones/<ZONE>
```
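
With the role in place, a hostname annotation on a Service or Ingress is all external-dns needs. A minimal sketch, assuming app.example.com belongs to the zone external-dns manages; the service name and ports are placeholders:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: app-service
  annotations:
    # external-dns creates an A record pointing at the LoadBalancer IP
    external-dns.alpha.kubernetes.io/hostname: app.example.com
spec:
  type: LoadBalancer
  selector:
    app: app
  ports:
  - port: 80
    targetPort: 8080
```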

### Common mistakes

| Mistake | Impact | Fix |
| --- | --- | --- |
| Using sync policy with shared DNS zones | Deletes records owned by other systems | Use upsert-only and set a unique txtOwnerId |
| Running multiple instances without owner IDs | Conflicting updates, record flapping | Every cluster gets its own txtOwnerId |
| Forgetting --txt-prefix | TXT ownership records collide with real TXT records | Set --txt-prefix=extdns- |

## External Secrets Operator

Syncs secrets from Azure Key Vault into Kubernetes Secret objects. Use ESO instead of the Azure Key Vault CSI driver for most workloads — ESO supports templating, automatic rotation, and works with any pod without requiring CSI volume mounts.

### Install

```bash
helm repo add external-secrets https://charts.external-secrets.io
helm install external-secrets external-secrets/external-secrets \
  --namespace external-secrets --create-namespace \
  --set installCRDs=true
```

Create a ClusterSecretStore with workload identity:

```yaml
apiVersion: external-secrets.io/v1beta1
kind: ClusterSecretStore
metadata:
  name: azure-keyvault
spec:
  provider:
    azurekv:
      authType: WorkloadIdentity
      vaultUrl: https://my-vault.vault.azure.net
      serviceAccountRef:
        name: external-secrets-sa
        namespace: external-secrets
```

Then declare secrets per namespace:

```yaml
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: app-secrets
spec:
  refreshInterval: 1h
  secretStoreRef:
    name: azure-keyvault
    kind: ClusterSecretStore
  target:
    name: app-secrets
    creationPolicy: Owner # garbage-collect the Secret when this ExternalSecret is deleted
  data:
  - secretKey: db-password
    remoteRef:
      key: my-app-db-password
```
:::info
The CSI Secrets Store driver mounts secrets as files and requires every pod to declare a volume. External Secrets Operator creates standard Kubernetes Secrets that work with envFrom, env, and volume mounts without any changes to your pod spec. Prefer ESO unless you specifically need file-based secret injection.
:::
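
Because ESO produces an ordinary Kubernetes Secret, consuming it needs no special tooling. A minimal sketch using envFrom; the deployment name and image are placeholders:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app
spec:
  replicas: 1
  selector:
    matchLabels: { app: my-app }
  template:
    metadata:
      labels: { app: my-app }
    spec:
      containers:
      - name: app
        image: myregistry.azurecr.io/my-app:1.0.0
        envFrom:
        # every key in the ESO-managed Secret becomes an environment variable
        - secretRef:
            name: app-secrets
```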

### Common mistakes

| Mistake | Impact | Fix |
| --- | --- | --- |
| Setting refreshInterval to 0 | Secrets never rotate after initial sync | Use 1h or shorter for sensitive credentials |
| One SecretStore per namespace | Duplicated Key Vault config across namespaces | Use ClusterSecretStore |
| Not setting target.creationPolicy: Owner | Orphaned Kubernetes Secrets after ExternalSecret deletion | Set creationPolicy: Owner to garbage-collect secrets |

## Dapr

Dapr provides building blocks for microservices: service invocation, pub/sub, state management, and bindings. If you are building services that publish events or manage state, Dapr abstracts the infrastructure so your code does not couple to a specific broker or store.

### Install

Use the AKS extension, not Helm. The AKS extension is managed by Microsoft, handles upgrades, and integrates with Azure support.

```bash
az k8s-extension create \
  --cluster-type managedClusters \
  --cluster-name <CLUSTER_NAME> \
  --resource-group <RG> \
  --name dapr \
  --extension-type Microsoft.Dapr \
  --auto-upgrade-minor-version true
```
:::warning
Do not install Dapr via Helm on AKS. The AKS extension provides lifecycle management, monitoring integration, and support coverage that a Helm install does not.
:::

Enable Dapr by annotating the pod spec:

```yaml
annotations:
  dapr.io/enabled: "true"
  dapr.io/app-id: "order-service"
  dapr.io/app-port: "8080"
  dapr.io/log-level: "info"
```
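
Once the sidecar is injected, building blocks are plain HTTP calls against localhost. A sketch of publishing an event, assuming a pub/sub component named `pubsub` and a topic named `orders` have been configured:

```bash
# publish an event through the Dapr sidecar (default HTTP port 3500)
curl -X POST http://localhost:3500/v1.0/publish/pubsub/orders \
  -H "Content-Type: application/json" \
  -d '{"orderId": "12345", "status": "created"}'
```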

### Common mistakes

| Mistake | Impact | Fix |
| --- | --- | --- |
| Installing Dapr via Helm on AKS | No support coverage, manual upgrades | Use the AKS extension |
| Enabling Dapr on every pod | Unnecessary sidecar overhead for simple services | Only annotate pods that use Dapr building blocks |
| Skipping mTLS configuration | Service-to-service traffic is unencrypted | Dapr enables mTLS by default; do not disable it (see the Configuration sketch below) |
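
For reference, mTLS is controlled by Dapr's control-plane Configuration resource. The AKS extension manages this for you, so the sketch below shows what a healthy default looks like rather than something to apply by hand; the names follow Dapr's defaults:

```yaml
apiVersion: dapr.io/v1alpha1
kind: Configuration
metadata:
  name: daprsystem
  namespace: dapr-system
spec:
  mtls:
    enabled: true          # leave this on; sidecar-to-sidecar traffic stays encrypted
    workloadCertTTL: "24h"
```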

## Gateway API

Gateway API is the successor to the Ingress resource. It provides a standard, role-oriented API for L4/L7 traffic routing with support for traffic splitting, header-based routing, and cross-namespace references. Use it instead of Ingress for new workloads.

### Install

On AKS, use Application Gateway for Containers (AGC) as the Gateway API implementation. The command below installs the Gateway API CRDs; the AGC ALB Controller itself is installed separately, following Microsoft's documentation:

```bash
kubectl apply -f https://github.com/kubernetes-sigs/gateway-api/releases/download/v1.2.0/standard-install.yaml
```

Then define the Gateway and routes:
```yaml
apiVersion: gateway.networking.k8s.io/v1
kind: Gateway
metadata:
  name: main-gateway
  namespace: gateway-infra
spec:
  gatewayClassName: azure-alb-external
  listeners:
  - name: https
    protocol: HTTPS
    port: 443
    tls:
      certificateRefs:
      - name: wildcard-cert
---
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: app-route
spec:
  parentRefs:
  - name: main-gateway
    namespace: gateway-infra
  hostnames: ["app.example.com"]
  rules:
  - matches:
    - path: { type: PathPrefix, value: / }
    backendRefs:
    - name: app-service
      port: 80
```
:::tip
Define the Gateway resource in an infrastructure namespace owned by the platform team. Application teams create HTTPRoute resources in their own namespaces with parentRefs pointing to the shared gateway. This enforces separation of concerns.
:::

### Common mistakes

| Mistake | Impact | Fix |
| --- | --- | --- |
| Using Ingress when Gateway API is available | Locked into annotation-based config, limited routing | Migrate to Gateway API for new workloads |
| One Gateway per application | Wasted load balancer resources, higher cost | Share a Gateway across applications using HTTPRoute |
| Missing ReferenceGrant for cross-namespace refs | Routes silently fail to attach | Create ReferenceGrant in the target namespace (see the sketch below) |
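
A minimal ReferenceGrant sketch: it allows HTTPRoutes in one namespace to reference Services in another. The `team-a` and `shared-backends` namespaces are placeholders:

```yaml
apiVersion: gateway.networking.k8s.io/v1beta1
kind: ReferenceGrant
metadata:
  name: allow-team-a-routes
  namespace: shared-backends # must live in the namespace being referenced
spec:
  from:
  - group: gateway.networking.k8s.io
    kind: HTTPRoute
    namespace: team-a
  to:
  - group: ""
    kind: Service
```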

## OpenTelemetry Collector

A vendor-neutral telemetry pipeline that receives, processes, and exports traces, metrics, and logs. Instrument once with OpenTelemetry SDKs, then route to Azure Monitor, Prometheus, or any OTLP-compatible backend without changing application code.

### Install

```bash
helm repo add open-telemetry https://open-telemetry.github.io/opentelemetry-helm-charts
helm install otel-collector open-telemetry/opentelemetry-collector \
  --namespace otel --create-namespace \
  --set mode=deployment
```

Use DaemonSet mode for log and metric collection and Deployment mode for trace aggregation. If you run the OpenTelemetry Operator, which owns the OpenTelemetryCollector CRD, the DaemonSet half looks like this:

```yaml
apiVersion: opentelemetry.io/v1beta1
kind: OpenTelemetryCollector
metadata:
  name: otel
  namespace: otel
spec:
  mode: daemonset
  config:
    receivers:
      otlp:
        protocols:
          grpc: { endpoint: 0.0.0.0:4317 }
          http: { endpoint: 0.0.0.0:4318 }
    processors:
      batch:
        timeout: 5s
        send_batch_size: 1024
      memory_limiter:
        check_interval: 1s
        limit_mib: 512
    exporters:
      otlp:
        endpoint: "azure-monitor-endpoint:443"
    service:
      pipelines:
        traces:
          receivers: [otlp]
          processors: [memory_limiter, batch]
          exporters: [otlp]
        metrics:
          receivers: [otlp]
          processors: [memory_limiter, batch]
          exporters: [otlp]
```
:::warning
Always configure the memory_limiter processor. Without it, a burst of telemetry data can OOM-kill the collector pod and create a gap in your observability pipeline.
:::

### Common mistakes

| Mistake | Impact | Fix |
| --- | --- | --- |
| Skipping the memory_limiter processor | Collector OOM under load | Add memory_limiter as the first processor in every pipeline |
| Running only Deployment mode | Misses node-level metrics and logs | Use DaemonSet for collection, Deployment for aggregation |
| Exporting everything without sampling | High cost, storage bloat | Configure tail sampling for traces at 10-20% in non-production (see the sketch below) |
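
A tail-sampling sketch, assuming the contrib distribution of the collector (the tail_sampling processor ships in opentelemetry-collector-contrib, not the core image); the policy names and percentage are illustrative:

```yaml
processors:
  tail_sampling:
    decision_wait: 10s # wait for late spans before deciding on a trace
    policies:
    # always keep traces that contain an error
    - name: keep-errors
      type: status_code
      status_code: { status_codes: [ERROR] }
    # keep roughly 10% of all traces (policies are ORed together)
    - name: probabilistic-10pct
      type: probabilistic
      probabilistic: { sampling_percentage: 10 }
```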

## Kyverno

A policy engine for Kubernetes that expresses policies in plain YAML rather than Rego, the language OPA Gatekeeper uses. Kyverno validates, mutates, generates, and cleans up resources.

### Install

```bash
helm repo add kyverno https://kyverno.github.io/kyverno
helm install kyverno kyverno/kyverno \
  --namespace kyverno --create-namespace \
  --set replicaCount=3
```

Start with Audit mode, then switch to Enforce once you confirm policies do not break existing workloads:

```yaml
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: require-resource-limits
spec:
  validationFailureAction: Audit
  rules:
  - name: check-limits
    match:
      any:
      - resources:
          kinds: [Pod]
    validate:
      message: "CPU and memory limits are required."
      pattern:
        spec:
          containers:
          - resources:
              limits:
                memory: "?*"
                cpu: "?*"
```
:::info
Use Kyverno instead of OPA Gatekeeper unless your organization already has a Rego investment. Kyverno policies are easier to write, review in PRs, and debug. The mutation and generation capabilities also reduce boilerplate across namespaces.
:::

### Common mistakes

| Mistake | Impact | Fix |
| --- | --- | --- |
| Starting with Enforce mode | Blocks existing workloads that violate policies | Start with Audit, review violations, then switch to Enforce |
| Not excluding system namespaces | Policies block kube-system components | Add exclude rules for kube-system, cert-manager, and other platform namespaces (see the sketch below) |
| Too many mutation policies | Hard to debug why a resource looks different from the manifest | Document mutations and keep them minimal |
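
A sketch of the exclusion, extending the check-limits rule above; the namespace list is illustrative and should match your platform namespaces:

```yaml
rules:
- name: check-limits
  match:
    any:
    - resources:
        kinds: [Pod]
  exclude:
    any:
    - resources:
        # never evaluate pods in platform namespaces
        namespaces: [kube-system, kyverno, cert-manager, external-secrets]
```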

## Which extensions do you need?

Not every cluster needs every extension. Use this table to pick the right set for your workload type.

| Extension | Web app | Microservices | Event-driven | ML/batch |
| --- | --- | --- | --- | --- |
| cert-manager | Required | Required | Recommended | Optional |
| external-dns | Required | Required | Recommended | Optional |
| External Secrets Operator | Required | Required | Required | Required |
| Dapr | Optional | Required | Required | Not needed |
| Gateway API | Required | Required | Optional | Not needed |
| OpenTelemetry Collector | Required | Required | Required | Recommended |
| Kyverno | Required | Required | Required | Required |

## Anti-patterns

**Installing everything on day one.** Start with what you need. Each extension adds CRDs, pods, and upgrade burden. Add extensions when you have a concrete use case.

**Using Helm when an AKS add-on exists.** AKS provides managed versions of Dapr, KEDA, Flux, and others. These integrate with Azure support and upgrade automatically. Check before reaching for Helm:

```bash
az k8s-extension list --cluster-type managedClusters \
  --cluster-name <CLUSTER_NAME> --resource-group <RG> -o table
```

**Running extensions without resource limits.** Every extension runs pods in your cluster. Set requests and limits on all extension workloads to prevent resource starvation, as in the sketch below.
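
Most extension charts expose a resources value, though the exact path varies per chart. A sketch for the opentelemetry-collector chart installed earlier, with illustrative sizes:

```bash
# requests/limits keep the collector from starving (or being starved by) workloads
helm upgrade otel-collector open-telemetry/opentelemetry-collector \
  --namespace otel --reuse-values \
  --set resources.requests.cpu=100m \
  --set resources.requests.memory=256Mi \
  --set resources.limits.memory=512Mi
```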

**Over-scoping RBAC.** Use workload identity with the narrowest role possible. Do not assign Contributor at the subscription level when DNS Zone Contributor on a single zone is sufficient.
