Multi-tenancy and governance

Most organizations should run fewer, larger clusters instead of one cluster per team. Multi-tenancy makes that work without teams stepping on each other. A well-governed shared cluster is cheaper, easier to patch, and simpler to monitor — but without guardrails it becomes the wild west within weeks.

Tenancy models

Pick one model and stick with it. Mixing models in the same cluster creates confusion.

| Model | When to use | Isolation level |
| --- | --- | --- |
| Namespace-per-team | Default choice. Teams share a cluster; each gets a namespace with quotas and RBAC. | Logical |
| Namespace-per-environment | Small orgs where one team owns dev/staging/prod namespaces in the same cluster. | Logical |
| Cluster-per-tenant | Regulatory requirements (PCI, HIPAA), zero-trust between tenants, or GPU workloads that need full node control. | Physical |

Use namespace-per-team unless you have a documented reason not to. Cluster-per-tenant doubles your operational burden.

tip

Start with namespace-per-team. You can promote a tenant to their own cluster later. Going the other direction is painful.

Namespace isolation checklist

Every new tenant namespace needs all five of these. Skip one and you have a gap.

| # | Resource | Purpose |
| --- | --- | --- |
| 1 | ResourceQuota | Caps total CPU, memory, storage, and object counts |
| 2 | LimitRange | Sets default requests/limits so no pod runs unbounded |
| 3 | NetworkPolicy | Denies all traffic by default, then allows specific flows |
| 4 | RoleBinding | Scoped RBAC; team members get only namespace-level access |
| 5 | Entra ID group mapping | Bind roles to Entra ID groups, not individual users |

Resource quotas and LimitRange

Without quotas, one team can consume the entire cluster. Set quotas on day one, not after an incident.

apiVersion: v1
kind: ResourceQuota
metadata:
  name: team-alpha-quota
  namespace: team-alpha
spec:
  hard:
    requests.cpu: "8"
    requests.memory: 16Gi
    limits.cpu: "16"
    limits.memory: 32Gi
    persistentvolumeclaims: "10"
    pods: "50"
    services.loadbalancers: "2"
---
apiVersion: v1
kind: LimitRange
metadata:
  name: default-limits
  namespace: team-alpha
spec:
  limits:
    - default: { cpu: 500m, memory: 512Mi }
      defaultRequest: { cpu: 100m, memory: 128Mi }
      max: { cpu: "2", memory: 4Gi }
      min: { cpu: 50m, memory: 64Mi }
      type: Container
warning

If you set a ResourceQuota on CPU or memory, every pod in that namespace must specify requests and limits. Pods without them will be rejected. Always pair quotas with a LimitRange to set defaults.
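To see how much headroom a team has left before hitting the quota, describe the ResourceQuota object:

kubectl describe resourcequota team-alpha-quota -n team-alpha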

Network policies

Use Azure CNI with Cilium or Calico as the network policy engine; kubenet on its own does not enforce network policies. Apply a deny-all policy first, then punch holes for DNS and your ingress controller:

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata: { name: deny-all, namespace: team-alpha }
spec:
  podSelector: {}
  policyTypes: [Ingress, Egress]
---
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata: { name: allow-dns, namespace: team-alpha }
spec:
  podSelector: {}
  policyTypes: [Egress]
  egress:
    - to: [{ namespaceSelector: {} }]
      ports: [{ protocol: UDP, port: 53 }, { protocol: TCP, port: 53 }]
---
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata: { name: allow-ingress-controller, namespace: team-alpha }
spec:
  podSelector: {}
  policyTypes: [Ingress]
  ingress:
    - from:
        - namespaceSelector:
            matchLabels: { kubernetes.io/metadata.name: ingress-system }
info

Test network policies in staging first. A misconfigured egress policy that blocks DNS will take down every workload in the namespace instantly.
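A quick smoke test after applying the policies, as a sketch: the pod name and the cross-namespace URL below are placeholders. The DNS lookup should succeed, and the connection to another tenant's service should time out:

# DNS resolves (allow-dns), but cross-namespace traffic is blocked (deny-all)
kubectl run netpol-test -n team-alpha --rm -it --restart=Never --image=busybox:1.36 -- \
  sh -c 'nslookup kubernetes.default.svc.cluster.local && wget -qO- -T 3 http://some-service.other-team.svc.cluster.local || echo "blocked as expected"'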

RBAC patterns

Use built-in ClusterRoles for permissions, namespace-scoped RoleBindings for access. Never grant cluster-admin to application teams.

| Role | Built-in ClusterRole | Can do | Cannot do |
| --- | --- | --- | --- |
| Team admin | admin | Everything edit can (deployments, services, configmaps, secrets, HPA), plus manage Roles and RoleBindings in the namespace | Modify ResourceQuota objects or the namespace itself |
| Developer | edit | Deploy workloads, view logs, exec into pods | View or modify Roles and RoleBindings; delete PVs (cluster-scoped) |
| Viewer | view | Read most resources, view logs | Read Secrets; create, update, or delete anything |
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata: { name: team-alpha-admin, namespace: team-alpha }
subjects:
  - kind: Group
    name: "<entra-group-object-id>"
    apiGroup: rbac.authorization.k8s.io
roleRef: { kind: ClusterRole, name: admin, apiGroup: rbac.authorization.k8s.io }
# Swap 'admin' to 'edit' (developer) or 'view' (read-only)
warning

Bind roles to Entra ID groups, not individual users. When someone leaves or changes teams, you update group membership in one place instead of hunting through RoleBindings.
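To confirm a binding behaves as intended, impersonate the group with kubectl auth can-i (impersonation requires the impersonate permission on your own account; the user name and group ID below are placeholders):

# Expect "yes" for workload resources and "no" for quota objects
kubectl auth can-i create deployments -n team-alpha --as=probe-user --as-group="<entra-group-object-id>"
kubectl auth can-i create resourcequotas -n team-alpha --as=probe-user --as-group="<entra-group-object-id>"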

Azure Policy for AKS

Use built-in policy initiatives — do not write custom policies until you have exhausted the built-in library.

| Policy | Effect | Why |
| --- | --- | --- |
| Pod security baseline initiative | Deny | Blocks privileged containers, host networking, host PID/IPC |
| Container images from allowed registries only | Deny | Prevents pulling from Docker Hub or unknown registries |
| Containers must have resource limits | Deny | Belt and suspenders on top of LimitRange |
| Containers must not run as root | Audit, then Deny | Start with Audit to find violations, flip to Deny once clean |
| Pods must use approved labels | Audit | Required for cost allocation |
az policy assignment create \
--name "aks-pod-security-baseline" \
--policy-set-definition "a8640138-9b0a-4a28-b8cb-1666c838647d" \
--scope "/subscriptions/<sub-id>/resourceGroups/<rg>/providers/Microsoft.ContainerService/managedClusters/<cluster>" \
--params '{"effect": {"value": "Deny"}}'
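Assignments are not instant; the in-cluster policy add-on syncs on an interval (roughly every 15 minutes). Once it has propagated, one way to check compliance is to summarize policy state for the cluster scope:

az policy state summarize \
  --resource "/subscriptions/<sub-id>/resourceGroups/<rg>/providers/Microsoft.ContainerService/managedClusters/<cluster>"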

Cost allocation

You cannot split costs if you cannot attribute resources to teams. Enable the cost analysis add-on before onboarding the second tenant:

az aks update --resource-group <rg> --name <cluster> --enable-cost-analysis
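To confirm the add-on is active, query the cluster properties; the query path below reflects the current managed cluster schema and may shift between API versions:

az aks show --resource-group <rg> --name <cluster> \
  --query "metricsProfile.costAnalysis.enabled"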

Enforce these labels on every namespace via Azure Policy:

| Label | Example | Purpose |
| --- | --- | --- |
| cost-center | cc-12345 | Maps to finance cost center |
| team | platform-engineering | Ownership |
| environment | production | Distinguishes prod from dev spend |
tip

Use Azure Policy to deny namespaces missing the cost-center and team labels. Without enforcement, label discipline decays within weeks.
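For namespaces that already exist, backfill the labels by hand (the values here are examples):

kubectl label namespace team-alpha \
  cost-center=cc-12345 team=platform-engineering environment=production --overwrite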

Node pool isolation

Use dedicated node pools when logical isolation is not enough.

| Scenario | Recommendation |
| --- | --- |
| GPU workloads | Dedicated pool with GPU VMs and NoSchedule taints |
| Compliance-sensitive tenants | Dedicated pool, no shared scheduling |
| Noisy neighbors | Taint a pool; give only the noisy workload a toleration |
| Burstable dev/test | B-series pool, autoscaler minimum zero |
az aks nodepool add \
--resource-group <rg> --cluster-name <cluster> --name gpupool \
--node-count 2 --node-vm-size Standard_NC6s_v3 \
--node-taints "sku=gpu:NoSchedule" --labels team=ml-team

Pods targeting this pool need a toleration and node selector:

tolerations:
  - { key: "sku", operator: "Equal", value: "gpu", effect: "NoSchedule" }
nodeSelector: { team: ml-team }
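Put together, a minimal pod spec pinned to the GPU pool might look like the sketch below; the pod name, namespace, image tag, and single-GPU request are illustrative:

apiVersion: v1
kind: Pod
metadata: { name: gpu-smoke-test, namespace: ml-team }
spec:
  tolerations:
    - { key: "sku", operator: "Equal", value: "gpu", effect: "NoSchedule" }
  nodeSelector: { team: ml-team }
  containers:
    - name: cuda
      image: nvidia/cuda:12.4.1-base-ubuntu22.04  # illustrative tag
      command: ["nvidia-smi"]
      resources:
        limits: { nvidia.com/gpu: 1 }  # requests the device plugin resource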

Common mistakes

| Mistake | Consequence | Fix |
| --- | --- | --- |
| No resource quotas | One team's memory leak starves the whole cluster | ResourceQuota on every namespace |
| Overly permissive RBAC | Devs delete other teams' resources | Namespace-scoped RoleBindings only |
| No network policies | Any pod can reach any pod | Default deny-all per namespace |
| Shared service accounts | Cannot audit who did what | One ServiceAccount per workload |
| RBAC bound to users | Stale accounts, sprawl | Entra ID groups exclusively |
| No LimitRange alongside quota | Pods without explicit requests/limits rejected on deploy | Always pair both |

New team onboarding template

Run this once per team. It creates the namespace with all five isolation primitives.

#!/bin/bash
set -euo pipefail
NS="${1:?Usage: $0 <namespace> <admin-group-id> <viewer-group-id>}"
ADMIN="${2:?Provide Entra admin group object ID}"
VIEWER="${3:?Provide Entra viewer group object ID}"

kubectl create namespace "$NS" --dry-run=client -o yaml | \
kubectl label --local -f - cost-center="CHANGE-ME" team="$NS" environment="production" \
--dry-run=client -o yaml | kubectl apply -f -

kubectl apply -n "$NS" -f - <<EOF
apiVersion: v1
kind: ResourceQuota
metadata: { name: quota }
spec:
  hard: { requests.cpu: "8", requests.memory: 16Gi, limits.cpu: "16", limits.memory: 32Gi, pods: "50", services.loadbalancers: "2" }
---
apiVersion: v1
kind: LimitRange
metadata: { name: default-limits }
spec:
  limits:
    - default: { cpu: 500m, memory: 512Mi }
      defaultRequest: { cpu: 100m, memory: 128Mi }
      max: { cpu: "2", memory: 4Gi }
      min: { cpu: 50m, memory: 64Mi }
      type: Container
---
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata: { name: deny-all }
spec: { podSelector: {}, policyTypes: [Ingress, Egress] }
---
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata: { name: allow-dns }
spec:
  podSelector: {}
  policyTypes: [Egress]
  egress:
    - to: [{ namespaceSelector: {} }]
      ports: [{ protocol: UDP, port: 53 }, { protocol: TCP, port: 53 }]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata: { name: admin }
subjects: [{ kind: Group, name: "$ADMIN", apiGroup: rbac.authorization.k8s.io }]
roleRef: { kind: ClusterRole, name: admin, apiGroup: rbac.authorization.k8s.io }
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata: { name: viewers }
subjects: [{ kind: Group, name: "$VIEWER", apiGroup: rbac.authorization.k8s.io }]
roleRef: { kind: ClusterRole, name: view, apiGroup: rbac.authorization.k8s.io }
EOF

echo "Done. Update cost-center: kubectl label ns $NS cost-center=<value> --overwrite"
