Network troubleshooting

Networking issues in AKS are among the hardest to debug because the failures are often silent: a pod receives no traffic, and there is no error log telling you why. This page gives you a systematic approach.

Start here

Before diving into specific failures, collect baseline information:

# Cluster networking model
az aks show -g myRG -n myCluster --query "networkProfile" -o table

# Node status and IPs
kubectl get nodes -o wide

# All services and their endpoints
kubectl get svc -A

# All network policies
kubectl get networkpolicy -A

Service not reachable

A ClusterIP or LoadBalancer service exists but clients get no response.

Decision tree

1. Does the service have endpoints?

kubectl get endpoints <service-name> -n <namespace>
| Result | Cause | Fix |
| --- | --- | --- |
| No endpoints listed | No pods match the service selector | Fix pod labels to match the service spec.selector |
| Endpoints exist but IPs are wrong | Pods exist but are not Ready | Check readiness probes, fix the health check |
| Endpoints exist and look correct | Problem is elsewhere | Continue to step 2 |

2. Do pod labels match the service selector?

# Show service selector
kubectl get svc <service-name> -n <ns> -o jsonpath='{.spec.selector}'

# Show pod labels
kubectl get pods -n <ns> --show-labels

Every key-value pair in the selector must appear in the pod's labels (pods may carry extra labels beyond those). A single typo breaks matching entirely.

3. Are the pods actually Ready?

# Assumes single-container pods; adjust the pattern (e.g. 2/2) for multi-container pods
kubectl get pods -n <ns> -o wide | grep -v "1/1"

If pods show 0/1 or Running but not Ready, the readiness probe is failing. The service will not send traffic to pods that are not Ready.
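To see why a probe is failing, describe the pod and check its recent events. A sketch, assuming a single stuck pod; the namespace and pod name are placeholders:

```shell
NS=default        # replace with your namespace
POD=myapp-abc123  # hypothetical name of a pod stuck at 0/1

# Probe definition plus the last recorded failure
kubectl describe pod "$POD" -n "$NS" | grep -iA4 "readiness" || true

# Recent events, newest last; probe failures appear as Unhealthy
kubectl get events -n "$NS" --field-selector involvedObject.name="$POD" \
  --sort-by=.lastTimestamp | tail -5 || true
```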

4. Does the port match?

kubectl get svc <service-name> -n <ns> -o yaml | grep -A 5 "ports:"
Warning: The service port is what clients connect to. The targetPort must match the port your container actually listens on. These are often different, and a mismatch here fails silently.
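One way to compare the two sides at a glance; the service name and the app=<service> label convention are assumptions to adapt:

```shell
SVC=myapp; NS=default   # hypothetical names

# What the service exposes -> what it forwards to
kubectl get svc "$SVC" -n "$NS" \
  -o jsonpath='{range .spec.ports[*]}{.port} -> {.targetPort}{"\n"}{end}' || true

# What the matching containers declare they listen on
# (assumes pods are labeled app=<service name>)
kubectl get pods -n "$NS" -l app="$SVC" \
  -o jsonpath='{range .items[*].spec.containers[*].ports[*]}{.containerPort}{"\n"}{end}' || true
```

Note that containerPort is declarative only, so an empty result does not prove the process is not listening; check the image or the process itself if in doubt.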

5. Test connectivity from inside the cluster:

# Run a debug pod
kubectl run nettest --image=nicolaka/netshoot --rm -it -- bash

# From inside the debug pod
curl -v http://<service-name>.<namespace>.svc.cluster.local:<port>

Ingress not working

External traffic is not reaching your application through an ingress resource.

Decision tree

1. Is the ingress controller running?

# For NGINX ingress
kubectl get pods -n ingress-nginx

# For Application Gateway Ingress Controller (AGIC)
kubectl get pods -n kube-system -l app=ingress-appgw

If the controller pod is not Running, fix that first. Nothing else matters.

2. Does the ingress resource exist and have an address?

kubectl get ingress -A
kubectl describe ingress <name> -n <ns>
| Symptom | Cause | Fix |
| --- | --- | --- |
| ADDRESS column is empty | Controller has not reconciled the resource | Check controller logs for errors |
| ADDRESS shows an IP but requests time out | Load balancer is healthy but the backend is not | Check the backend service and pods |
| 404 from the ingress controller | No matching rule for the host/path | Fix host and path in the ingress spec |
| 502 Bad Gateway | Backend service exists but pods are not responding | Check pod health, readiness probes, and targetPort |

3. Is TLS configured correctly?

# Check the secret exists
kubectl get secret <tls-secret-name> -n <ns>

# Verify the certificate
kubectl get secret <tls-secret-name> -n <ns> -o jsonpath='{.data.tls\.crt}' | base64 -d | openssl x509 -noout -dates -subject
Tip: Expired certificates are the number one cause of TLS ingress failures. Set up cert-manager with Let's Encrypt to automate renewal. Never manage TLS certificates manually.

4. Is DNS pointing to the ingress?

nslookup myapp.example.com
# The IP should match the ingress ADDRESS
kubectl get ingress <name> -n <ns> -o jsonpath='{.status.loadBalancer.ingress[0].ip}'

DNS resolution failures

Pods cannot resolve service names, external hostnames, or both.

Decision tree

1. Is CoreDNS running?

kubectl get pods -n kube-system -l k8s-app=kube-dns
kubectl logs -n kube-system -l k8s-app=kube-dns --tail=50

If CoreDNS pods are in CrashLoopBackOff, the entire cluster DNS is broken. Fix this immediately.
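A crashing CoreDNS usually traces back to a bad ConfigMap edit. A recovery sketch, assuming the stock AKS deployment name coredns and the coredns-custom override ConfigMap:

```shell
DEPLOY=coredns   # stock CoreDNS deployment name in AKS

# Why did the last container exit?
kubectl describe pods -n kube-system -l k8s-app=kube-dns | grep -iA5 "last state" || true

# A syntax error in a custom Corefile snippet is a common culprit
kubectl get configmap coredns-custom -n kube-system -o yaml 2>/dev/null || true

# After fixing the ConfigMap, restart CoreDNS
kubectl rollout restart deployment "$DEPLOY" -n kube-system || true
```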

2. Can pods resolve internal names?

kubectl run dnstest --image=nicolaka/netshoot --rm -it -- \
nslookup kubernetes.default.svc.cluster.local
| Result | Cause | Fix |
| --- | --- | --- |
| Resolution succeeds | Internal DNS works; the problem is external | Continue to step 3 |
| connection timed out; no servers could be reached | CoreDNS is unreachable | Check CoreDNS pods and the kube-dns service in kube-system |
| server can't find | Service name is wrong or does not exist | Verify the service exists in the expected namespace |

3. Can pods resolve external names?

kubectl run dnstest --image=nicolaka/netshoot --rm -it -- \
nslookup microsoft.com

If internal resolution works but external fails, check the CoreDNS configuration:

kubectl get configmap coredns -n kube-system -o yaml

4. Is custom DNS overriding Azure DNS?

az network vnet show -g myRG -n myVNet --query "dhcpOptions.dnsServers"
Warning: If you set custom DNS servers on the VNet, every query that CoreDNS cannot answer from cluster data is forwarded to those servers. If they cannot resolve the names your workloads need, resolution breaks. On any resolver that must also answer Kubernetes internal names, use the conditional forwarding approach: forward cluster.local to CoreDNS, everything else to your custom DNS.


Egress blocked

Pods cannot reach external services, registries, or Azure APIs.

Decision tree

1. Check NSG rules on the subnet:

az network nsg list -g MC_myRG_myCluster_eastus2 -o table
az network nsg rule list -g MC_myRG_myCluster_eastus2 --nsg-name <nsg-name> -o table

2. Check if Azure Firewall or an NVA is blocking traffic:

# Show the route table on the AKS subnet
az network route-table list -g MC_myRG_myCluster_eastus2 -o table
az network route-table route list -g MC_myRG_myCluster_eastus2 --route-table-name <table> -o table

If a UDR sends 0.0.0.0/0 to a firewall, that firewall must allow AKS required outbound traffic. See the required rules in the Resources section.

3. Check network policies blocking egress:

kubectl get networkpolicy -n <ns> -o yaml

Look for policyTypes that include Egress. If an egress policy exists, it must explicitly allow the destination.
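When a path has to be opened, the egress policy must allow both DNS and the destination. A minimal sketch; the namespace, pod label, and destination CIDR are assumptions to replace:

```shell
NS=default   # replace with your namespace
kubectl apply -n "$NS" -f - <<'EOF' || true
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-egress-example
spec:
  podSelector:
    matchLabels:
      app: myapp              # hypothetical pod label
  policyTypes:
    - Egress
  egress:
    # DNS to CoreDNS in kube-system; without this, name resolution fails
    - to:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: kube-system
      ports:
        - protocol: UDP
          port: 53
        - protocol: TCP
          port: 53
    # The external destination you actually need
    - to:
        - ipBlock:
            cidr: 203.0.113.0/24   # example range; replace with yours
      ports:
        - protocol: TCP
          port: 443
EOF
```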

4. Test outbound connectivity from a pod:

kubectl run egresstest --image=nicolaka/netshoot --rm -it -- bash

# Test HTTPS
curl -v https://mcr.microsoft.com
# Test DNS
nslookup mcr.microsoft.com
# Test specific port
nc -zv <destination-ip> <port>
Info: AKS clusters with outboundType: userDefinedRouting require you to explicitly allow all egress. The minimum required destinations include mcr.microsoft.com, management.azure.com, login.microsoftonline.com, and your Azure region's service tags. Missing any of these causes node provisioning failures.
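The core endpoints can be probed in one pass from a debug pod. A sketch, assuming a long-running pod named egresstest (start it without --rm -it if it should persist):

```shell
# Subset of the required outbound FQDNs; see the Resources section for the full list
hosts="mcr.microsoft.com management.azure.com login.microsoftonline.com"

for host in $hosts; do
  # -z: connect scan only, -w5: 5-second timeout; 443 because these are HTTPS endpoints
  kubectl exec egresstest -- nc -zvw5 "$host" 443 \
    && echo "OK      $host" \
    || echo "BLOCKED $host"
done
```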


Private cluster cannot connect

You cannot run kubectl commands against a private AKS cluster.

Decision tree

1. Can your machine resolve the API server DNS name?

nslookup <cluster-name>.<private-dns-zone>.privatelink.<region>.azmk8s.io

If this fails, your machine cannot see the private DNS zone. You need DNS forwarding or a direct link to the private DNS zone.
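Instead of assembling the name by hand, you can read the exact FQDN from the cluster resource:

```shell
# privateFqdn is only populated on private clusters
FQDN=$(az aks show -g myRG -n myCluster --query privateFqdn -o tsv 2>/dev/null) || FQDN=""
echo "API server FQDN: ${FQDN:-(az lookup failed)}"

# From a connected network this should resolve to a private IP
[ -n "$FQDN" ] && nslookup "$FQDN" || true
```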

2. Are you on a network that can reach the API server?

Private clusters have no public IP on the API server. You must be on:

  • The same VNet or a peered VNet
  • A VPN connected to the VNet
  • An ExpressRoute circuit connected to the VNet
  • A jumpbox VM inside the VNet

3. Is the private DNS zone linked to your VNet?

az network private-dns zone list -g MC_myRG_myCluster_eastus2 -o table
az network private-dns link vnet list -g MC_myRG_myCluster_eastus2 -z <zone-name> -o table

4. Are authorized IP ranges blocking you?

az aks show -g myRG -n myCluster --query "apiServerAccessProfile" -o yaml

If authorizedIpRanges is set, your client IP must be in the list. To clear the list temporarily for debugging, run az aks update -g myRG -n myCluster --api-server-authorized-ip-ranges "" (and restore it afterwards).
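A narrower alternative to clearing the list is to add your current public IP; the ifconfig.me lookup is just one of several ways to discover it:

```shell
# Discover your public IP; any echo-IP service works
MYIP=$(curl -s --max-time 10 https://ifconfig.me) || MYIP=""

if [ -n "$MYIP" ]; then
  # Note: this replaces the whole list; include any existing ranges you still need
  az aks update -g myRG -n myCluster \
    --api-server-authorized-ip-ranges "$MYIP/32" || true
fi
```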

Tip: For day-to-day private cluster access, use az aks command invoke. It runs kubectl commands through Azure's control plane without needing VPN or jumpbox access.

az aks command invoke -g myRG -n myCluster --command "kubectl get pods -A"

Network policy blocking traffic

Pods are running and services have endpoints, but traffic is still blocked.

Decision tree

1. Which policies affect the target pod?

# List all network policies in the namespace
kubectl get networkpolicy -n <ns>

# Check which ones select your pod: print each policy's name and pod selector
kubectl get networkpolicy -n <ns> -o json | \
  jq -r '.items[] | "\(.metadata.name): \(.spec.podSelector)"'

A policy applies to your pod only if every entry in its podSelector.matchLabels appears in the pod's labels; an empty podSelector ({}) matches every pod in the namespace.

2. Understand the default deny behavior:

| Scenario | Result |
| --- | --- |
| No network policies in namespace | All traffic allowed (default) |
| Policy with podSelector: {} and Ingress in policyTypes | All ingress blocked for all pods unless explicitly allowed |
| Policy selecting specific pods with Ingress type | Only those pods have ingress restricted; other pods are unaffected |
| Policy with both Ingress and Egress in policyTypes | Both directions blocked for selected pods unless allowed |

3. Common mistakes:

| Mistake | What happens | Fix |
| --- | --- | --- |
| Allowing ingress by port but the wrong protocol | TCP is the default, so UDP traffic stays blocked | Add an explicit protocol: UDP to the port rule |
| Missing namespaceSelector on ingress from another namespace | Traffic from other namespaces is blocked even if the pod selector matches | Add a namespaceSelector with the source namespace labels |
| Egress policy missing a DNS egress rule | Pods cannot resolve any DNS names, so all external connectivity fails | Allow egress to kube-system on port 53 (TCP and UDP) |
Warning: If you add a network policy with policyTypes: ["Ingress"] and an empty ingress: [] list, you have created a default deny for all matched pods. This is the most common accidental outage caused by network policies.
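For reference, this is what the accidental default deny looks like; the app label is a hypothetical example, and omitting the ingress key entirely has the same effect as ingress: []:

```shell
# Printed rather than applied; do not apply this to a live namespace
cat <<'EOF'
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: accidental-default-deny
spec:
  podSelector:
    matchLabels:
      app: myapp        # hypothetical label; every matched pod loses all ingress
  policyTypes:
    - Ingress
  ingress: []           # no rules listed -> nothing is allowed in
EOF
```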


Quick diagnosis script

Run this to collect networking state in one shot:

#!/bin/bash
NS=${1:-default}
echo "=== Nodes ==="
kubectl get nodes -o wide
echo ""
echo "=== Services in $NS ==="
kubectl get svc -n "$NS" -o wide
echo ""
echo "=== Endpoints in $NS ==="
kubectl get endpoints -n "$NS"
echo ""
echo "=== Ingress in $NS ==="
kubectl get ingress -n "$NS"
echo ""
echo "=== Network Policies in $NS ==="
kubectl get networkpolicy -n "$NS"
echo ""
echo "=== CoreDNS pods ==="
kubectl get pods -n kube-system -l k8s-app=kube-dns -o wide
echo ""
echo "=== Recent CoreDNS logs ==="
kubectl logs -n kube-system -l k8s-app=kube-dns --tail=20
echo ""
echo "=== DNS test (internal) ==="
kubectl run dnscheck --image=busybox:1.36 --rm -it --restart=Never -- \
nslookup kubernetes.default.svc.cluster.local 2>&1 || true
echo ""
echo "=== DNS test (external) ==="
kubectl run dnscheck2 --image=busybox:1.36 --rm -it --restart=Never -- \
nslookup microsoft.com 2>&1 || true

Resources