Azure Monitor and Container Insights
Enable Container Insights on every cluster. There is no excuse for flying blind. The cost is minimal compared to debugging a production outage with zero telemetry. This is the foundation of AKS observability.
What Container Insights gives you
Container Insights deploys the containerized Azure Monitor Agent as a DaemonSet on every node in your cluster (a quick way to verify this follows the table). It collects:
| Data Type | What You Get | Where It Lands |
|---|---|---|
| Node metrics | CPU, memory, disk, network per node | Azure Monitor Metrics |
| Pod metrics | CPU/memory requests vs actual usage | Azure Monitor Metrics |
| Container logs | stdout/stderr from every container | Log Analytics workspace |
| Live data | Real-time log streaming in the portal | Direct stream |
| Kubernetes events | Pod scheduling, restarts, failures | Log Analytics workspace |
| Recommended alerts | Pre-built alerts for common failures | Azure Monitor Alerts |
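Once the addon is on, the agent should show up in kube-system. On recent AKS versions the DaemonSet is named ama-logs (older clusters used omsagent); assuming the current naming, confirm it is scheduled on every node:
# Verify the Container Insights agent DaemonSet is running
kubectl get daemonset ama-logs -n kube-system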
Enable Container Insights
Use the CLI. Do this at cluster creation or immediately after.
# Create a Log Analytics workspace (same region as your cluster)
az monitor log-analytics workspace create \
--resource-group myRG \
--workspace-name myAKS-logs \
--location eastus2
# Enable monitoring addon
az aks enable-addons \
--resource-group myRG \
--name myCluster \
--addons monitoring \
--workspace-resource-id "/subscriptions/<sub>/resourceGroups/myRG/providers/Microsoft.OperationalInsights/workspaces/myAKS-logs"
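Confirm the addon took effect; note that in az aks show output the monitoring addon still registers under the legacy omsagent profile key:
# Check that monitoring is enabled and which workspace it targets
az aks show \
--resource-group myRG \
--name myCluster \
--query "addonProfiles.omsagent"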
Pick a Log Analytics workspace in the same region as your cluster. Cross-region ingestion adds latency and egress cost.
Log tiers: this is where people waste money
Log Analytics has two table plans. Use them deliberately.
| Tier | Cost | Query | Retention | Use For |
|---|---|---|---|---|
| Basic | ~$0.65/GB ingested | Limited KQL (8-day query window) | 8 days (fixed) | High-volume container logs, debug output |
| Analytics (Standard) | ~$2.76/GB ingested | Full KQL, alerts, dashboards | 30-730 days | Critical alerts, SLO queries, audit logs |
Use Basic logs tier for high-volume container logs. Use Analytics tier for tables you actively query and alert on. Do not pay Analytics prices for debug logs you query once a quarter.
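Before moving tables, check what actually drives ingestion. The workspace's own Usage table reports billable volume per table (Quantity is in MB, hence the division):
// Billable ingestion per table, last 30 days
Usage
| where TimeGenerated > ago(30d)
| where IsBillable == true
| summarize IngestedGB = sum(Quantity) / 1000.0 by DataType
| order by IngestedGB desc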
Configure table-level plans in the portal under Log Analytics workspace > Tables, or via CLI:
az monitor log-analytics workspace table update \
--resource-group myRG \
--workspace-name myAKS-logs \
--name ContainerLogV2 \
--plan Basic
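Verify the change took effect:
# Confirm the table is now on the Basic plan
az monitor log-analytics workspace table show \
--resource-group myRG \
--workspace-name myAKS-logs \
--name ContainerLogV2 \
--query plan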
Syslog collection
Enable syslog collection via Data Collection Rules (DCR). This captures Linux system logs from your nodes -- essential for diagnosing kubelet, containerd, and kernel-level issues.
# Syslog rides on the monitoring addon; if the addon is already enabled,
# disable it first (az aks disable-addons -a monitoring ...) and re-enable with:
az aks enable-addons \
--resource-group myRG \
--name myCluster \
--addons monitoring \
--enable-syslog \
--data-collection-settings dcr-settings.json
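The command references a settings file it never shows. A minimal sketch of dcr-settings.json, assuming the documented Container Insights data collection settings schema; the stream list here is illustrative, and the --enable-syslog flag is what adds the syslog stream to the generated DCR:
{
  "interval": "1m",
  "namespaceFilteringMode": "Off",
  "enableContainerLogV2": true,
  "streams": ["Microsoft-ContainerLogV2", "Microsoft-KubeEvents", "Microsoft-KubePodInventory"]
}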
KQL queries you will actually use
These queries cover 90% of real-world troubleshooting:
// Pods in CrashLoopBackOff (last 1 hour)
KubePodInventory
| where TimeGenerated > ago(1h)
| where PodStatus == "Failed" or ContainerStatusReason == "CrashLoopBackOff"
| project TimeGenerated, Namespace, PodName=Name, ContainerStatusReason // pod name lives in the Name column
| order by TimeGenerated desc
// OOMKilled containers (last 24 hours)
KubePodInventory
| where TimeGenerated > ago(24h)
| where ContainerLastStatus contains "OOMKilled"
// KubePodInventory snapshots roughly once a minute, so this counts snapshots
// whose last state was OOMKilled, not distinct kill events
| summarize OOMCount=count() by Namespace, PodName=Name, ContainerName
| order by OOMCount desc
// Node memory pressure (Container Insights writes node counters to the Perf table)
Perf
| where TimeGenerated > ago(1h)
| where ObjectName == "K8SNode" and CounterName == "memoryRssBytes"
| extend UsedGB = CounterValue / (1024 * 1024 * 1024)
| summarize AvgMemGB=avg(UsedGB) by Computer, bin(TimeGenerated, 5m)
| where AvgMemGB > 12 // adjust threshold to your node size
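These also run outside the portal. Note that az monitor log-analytics query takes the workspace customer ID (a GUID), not the ARM resource ID:
# Fetch the workspace customer ID, then run a query from the terminal
WORKSPACE_ID=$(az monitor log-analytics workspace show \
--resource-group myRG \
--workspace-name myAKS-logs \
--query customerId -o tsv)
az monitor log-analytics query \
--workspace "$WORKSPACE_ID" \
--analytics-query "KubePodInventory | where TimeGenerated > ago(1h) | summarize count() by Namespace" \
--output table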
Common mistakes
- Not enabling Container Insights at all -- you cannot retroactively get logs from before you enabled it.
- Using Analytics tier for everything -- a busy cluster can generate 50+ GB/day of container logs. At Analytics pricing, that is $130+/day.
- Ignoring recommended alerts -- Container Insights comes with pre-built alert rules. Enable them. They catch OOM, node pressure, and pod failures.
- Wrong workspace region -- cross-region ingestion adds latency and cost. Always co-locate.
Container Insights v2 writes to the ContainerLogV2 table, which stores logs with a structured schema (PodName, PodNamespace, ContainerName, LogLevel, LogMessage) instead of one raw text blob. If you are still on the legacy ContainerLog table, migrate: the v2 schema is easier to filter, and only ContainerLogV2 qualifies for the Basic table plan.
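For example, pulling errors for a given namespace needs no string parsing with the v2 columns (LogLevel is only populated when the agent can infer a level from the log line):
// Errors from a single namespace, using structured v2 columns
ContainerLogV2
| where TimeGenerated > ago(1h)
| where PodNamespace == "production" and LogLevel == "error"
| project TimeGenerated, PodName, ContainerName, LogMessage
| order by TimeGenerated desc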
Decision: when to use Container Insights vs Prometheus
| Scenario | Use Container Insights | Use Prometheus |
|---|---|---|
| Cluster-level health | Yes | Also fine |
| Log aggregation and search | Yes | No (Prometheus is metrics-only) |
| Custom app metrics | No | Yes |
| Long-term metrics retention | Limited | Better with managed Prometheus |
| Alerting on log patterns | Yes | No |
Use both. They are complementary, not competing.
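They even pair from the same CLI. Assuming a recent azure-cli, managed Prometheus (the Azure Monitor metrics addon) can be switched on next to Container Insights; omit --azure-monitor-workspace-resource-id and Azure provisions a default Azure Monitor workspace:
# Enable managed Prometheus alongside Container Insights
az aks update \
--resource-group myRG \
--name myCluster \
--enable-azure-monitor-metrics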