
Observability comparison

AKS has three observability paths that overlap in confusing ways. This page clarifies what each one does, where they overlap, and what combination to actually use.

The three tools

| Tool | What it is | How it runs |
| --- | --- | --- |
| Container Insights | Azure Monitor agent that collects logs, node/pod metrics, and Kubernetes events | DaemonSet deployed via the monitoring addon |
| Managed Prometheus | Azure-hosted Prometheus-compatible metrics store with pre-built scraping | Metrics addon deployed as a DaemonSet; data stored in an Azure Monitor workspace |
| OpenTelemetry | Vendor-neutral SDK and collector for traces, metrics, and logs | You deploy the OTel Collector and instrument your application code |

Full comparison

| Aspect | Container Insights | Managed Prometheus | OpenTelemetry |
| --- | --- | --- | --- |
| Primary data | Logs + Kubernetes events | Metrics (time series) | Traces + custom metrics + logs |
| Storage backend | Log Analytics workspace | Azure Monitor workspace (Prometheus-compatible) | Depends on exporter (Azure Monitor, Jaeger, Zipkin) |
| Dashboards | Azure portal workbooks | Azure Managed Grafana | Grafana, Jaeger UI, or any compatible backend |
| Alerting | Azure Monitor alerts on log queries | Prometheus alert rules in Grafana | Depends on backend |
| Custom metrics | Limited (custom log queries via KQL) | Full PromQL support, scrape any /metrics endpoint | Full SDK instrumentation in your application code |
| Infrastructure metrics | Node CPU, memory, disk, network | Same, plus any Prometheus exporter metrics | Not designed for infrastructure metrics |
| Application-level data | stdout/stderr logs only | Application metrics if exposed via /metrics | Distributed traces, custom spans, application metrics |
| Cost model | Pay per GB of logs ingested into Log Analytics | Pay per metric sample ingested | Free (SDK is open source); pay for the backend you export to |
| Setup effort | One CLI command | One CLI command + Grafana workspace | Deploy collector, instrument code, configure exporters |
| When to use | Always (baseline observability) | When you need metrics beyond the portal | When you need distributed tracing or custom instrumentation |

Use all three. Each covers a layer the others cannot.

| Layer | Tool | Why |
| --- | --- | --- |
| Logs and Kubernetes events | Container Insights | Captures stdout/stderr from every container plus scheduling events. This is your audit trail. |
| Metrics and dashboards | Managed Prometheus + Grafana | PromQL is the standard for metrics. Grafana dashboards are better than Azure portal workbooks. Community dashboards work out of the box. |
| Distributed tracing | OpenTelemetry | The only option for request-level tracing across microservices. Container Insights and Prometheus cannot do this. |
The commands below enable each layer:

```bash
# 1. Container Insights (logs + events)
az aks enable-addons \
  --resource-group myRG \
  --name myCluster \
  --addons monitoring \
  --workspace-resource-id "<log-analytics-workspace-id>"

# 2. Managed Prometheus (metrics)
az aks update \
  --resource-group myRG \
  --name myCluster \
  --enable-azure-monitor-metrics \
  --azure-monitor-workspace-resource-id "<azure-monitor-workspace-id>" \
  --grafana-resource-id "<grafana-workspace-id>"

# 3. OpenTelemetry Collector (tracing)
# Deploy via Helm or the OTel Operator in your cluster
helm repo add open-telemetry https://open-telemetry.github.io/opentelemetry-helm-charts
helm install otel-collector open-telemetry/opentelemetry-collector \
  --namespace otel-system --create-namespace \
  --set mode=deployment \
  --set config.exporters.azuremonitor.connection_string="<app-insights-connection-string>"
```
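
To confirm everything came up, a quick check like this helps. The `ama-logs` and `ama-metrics` pod name prefixes reflect recent agent versions; if your cluster shows different names, adjust the pattern.

```bash
# Container Insights and Managed Prometheus agents run in kube-system.
# Pod name prefixes (ama-logs, ama-metrics) are typical of recent agent versions; adjust if needed.
kubectl get pods -n kube-system | grep -E 'ama-logs|ama-metrics'

# The OTel Collector from the Helm install above runs in its own namespace.
kubectl get pods -n otel-system
```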

What overlaps and what does not

Container Insights and Managed Prometheus both collect node and pod metrics. This causes confusion.

| Metric type | Container Insights | Managed Prometheus | Overlap? |
| --- | --- | --- | --- |
| Node CPU/memory | Yes (sent to Log Analytics) | Yes (sent to Azure Monitor workspace) | Yes, duplicate |
| Pod CPU/memory | Yes | Yes | Yes, duplicate |
| Container restart count | Yes | Yes | Yes, duplicate |
| Kubernetes events | Yes | No | No overlap |
| Container logs (stdout/stderr) | Yes | No | No overlap |
| Custom application metrics | No | Yes (if app exposes /metrics) | No overlap |
| PromQL queries | No | Yes | No overlap |
| Distributed traces | No | No | Neither covers this |

info

The metric duplication between Container Insights and Managed Prometheus is intentional. Container Insights feeds the Azure portal experience. Managed Prometheus feeds Grafana. You pay for both, but the Prometheus path is significantly cheaper for pure metrics workloads. If cost is a concern, reduce Container Insights to logs-only and use Prometheus for all metrics.
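
One way to implement that logs-only setup is through the monitoring addon's data collection settings. The sketch below is based on the documented cost-optimization settings; the flag name and JSON keys (`streams`, `namespaceFilteringMode`, and the `Microsoft-ContainerLogV2`/`Microsoft-KubeEvents` stream names) should be verified against the current Container Insights documentation before use.

```bash
# Sketch: keep Container Insights for logs and events only, and let Managed Prometheus own metrics.
# JSON schema and stream names are assumptions; verify against current Container Insights docs.
cat > dataCollectionSettings.json <<'EOF'
{
  "interval": "1m",
  "namespaceFilteringMode": "Exclude",
  "namespaces": ["kube-system", "gatekeeper-system"],
  "enableContainerLogV2": true,
  "streams": ["Microsoft-ContainerLogV2", "Microsoft-KubeEvents"]
}
EOF

az aks enable-addons \
  --resource-group myRG \
  --name myCluster \
  --addons monitoring \
  --workspace-resource-id "<log-analytics-workspace-id>" \
  --data-collection-settings dataCollectionSettings.json
```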


Cost considerations

Observability costs are driven by data volume. Unfiltered, a 20-node cluster can generate 50+ GB of logs per day.

Where the money goes

| Source | Cost driver | Typical impact |
| --- | --- | --- |
| Container Insights logs | GB ingested into Log Analytics | High. This is the most expensive component. |
| Container Insights metrics | Included with the logs addon | Moderate. Bundled, but still metered. |
| Managed Prometheus | Metric samples ingested | Low. Prometheus is very efficient for time-series data. |
| Managed Grafana | Per-instance pricing | Fixed. One Grafana instance serves the whole team. |
| OpenTelemetry | Depends on the backend | Variable. Application Insights charges per GB of telemetry ingested. |
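
Before filtering anything, it is worth checking where your ingestion actually goes. A query along these lines, run against the Log Analytics workspace (the standard `Usage` table reports billable volume in MB), shows which data types drive the bill:

```bash
# Billable ingestion (MB) by data type over the last 24 hours.
# --workspace takes the workspace's customer ID (GUID), not its ARM resource ID.
az monitor log-analytics query \
  --workspace "<log-analytics-workspace-guid>" \
  --analytics-query "
    Usage
    | where TimeGenerated > ago(1d)
    | where IsBillable == true
    | summarize IngestedMB = sum(Quantity) by DataType
    | order by IngestedMB desc"
```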

Reduce log volume immediately

Do not collect everything. Filter aggressively from day one.

```yaml
# ConfigMap to exclude noisy namespaces from Container Insights
# Apply as: kubectl apply -f container-insights-config.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: container-azm-ms-agentconfig
  namespace: kube-system
data:
  schema-version: v1
  config-version: v1
  log-data-collection-settings: |-
    [log_collection_settings]
      [log_collection_settings.stdout]
        enabled = true
        exclude_namespaces = ["kube-system","gatekeeper-system","azure-arc"]
      [log_collection_settings.stderr]
        enabled = true
        exclude_namespaces = ["kube-system","gatekeeper-system"]
      [log_collection_settings.env_var]
        enabled = false
```

tip

Start by excluding kube-system logs. These are high-volume and rarely useful for application debugging. If you need kube-system data, query it through Kubernetes events in Container Insights instead of raw container logs.
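
For example, a query like the following reads kube-system warnings from the `KubeEvents` table that Container Insights populates. The column names (`Namespace`, `KubeEventType`, `Reason`, `Message`) are what the table typically exposes; confirm them in your workspace.

```bash
# Recent non-Normal Kubernetes events in kube-system, without ingesting kube-system container logs.
az monitor log-analytics query \
  --workspace "<log-analytics-workspace-guid>" \
  --analytics-query "
    KubeEvents
    | where TimeGenerated > ago(1h)
    | where Namespace == 'kube-system'
    | where KubeEventType != 'Normal'
    | project TimeGenerated, Name, Reason, Message"
```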

Cost comparison at scale

| Cluster size | Container Insights (logs + metrics) | Managed Prometheus only | Savings with Prometheus for metrics |
| --- | --- | --- | --- |
| 5 nodes, light workloads | ~$150/month | ~$30/month | 80% on metrics cost |
| 20 nodes, moderate workloads | ~$800/month | ~$80/month | 90% on metrics cost |
| 50+ nodes, heavy workloads | ~$3,000+/month | ~$150/month | 95% on metrics cost |

These are estimates. Actual costs depend on log volume, retention, and query frequency.


Migration path

Do not try to set up everything at once. Follow this order:

Phase 1: Container Insights only

Enable Container Insights. Get logs, basic metrics, and portal dashboards working. This takes 10 minutes and covers 80% of what you need on day one.

Phase 2: add Managed Prometheus and Grafana

When you need better dashboards, custom metrics, or PromQL alerting, add Managed Prometheus. Import community Grafana dashboards for Kubernetes. Consider reducing Container Insights to logs-only to avoid paying for duplicate metrics.
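
If you manage Grafana from the CLI, importing a community dashboard can look roughly like this. The `amg` extension and the `az grafana dashboard import` command surface are assumptions about the current CLI, and `<dashboard-id>` is a placeholder for a grafana.com dashboard ID; verify both against the extension's documentation.

```bash
# Sketch: import a community Kubernetes dashboard into Azure Managed Grafana.
# Command surface of the 'amg' extension is an assumption; <dashboard-id> is a placeholder.
az extension add --name amg
az grafana dashboard import \
  --name myGrafana \
  --resource-group myRG \
  --definition "<dashboard-id>"
```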

Phase 3: add OpenTelemetry

When you have multiple microservices and need to trace requests across them, instrument your application with the OpenTelemetry SDK. Deploy the OTel Collector to export traces to Application Insights or Jaeger.
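
Manual SDK instrumentation is language-specific, so here is a rough sketch of the Operator route mentioned in the setup section instead: if the OTel Operator is installed, its auto-instrumentation injects the SDK for supported runtimes without code changes. The API version, service DNS name, and annotation key below are assumptions based on recent operator releases; check the operator docs for your version.

```bash
# Sketch: zero-code auto-instrumentation via the OpenTelemetry Operator (installed separately).
# The collector service name assumes the Helm release from the setup section; adjust to yours.
kubectl apply -f - <<'EOF'
apiVersion: opentelemetry.io/v1alpha1
kind: Instrumentation
metadata:
  name: default-instrumentation
  namespace: default
spec:
  exporter:
    endpoint: http://otel-collector-opentelemetry-collector.otel-system:4317
EOF

# Then opt a workload in with a pod annotation, e.g. for a Python app:
#   instrumentation.opentelemetry.io/inject-python: "true"
```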

info

Most teams never need phase 3. If you run a monolith or a small number of services, distributed tracing adds complexity without proportional value. Add it when you cannot debug cross-service latency issues with logs and metrics alone.


Anti-patterns

Avoid these common mistakes.

| Anti-pattern | Why it is wrong | What to do instead |
| --- | --- | --- |
| Running self-hosted Prometheus alongside Managed Prometheus | Double the operational burden, double the cost, data in two places | Use Managed Prometheus. It is fully compatible and Azure handles availability. |
| Collecting all logs from all namespaces without filtering | Log Analytics costs scale linearly with volume. You will get a surprise bill. | Exclude kube-system, gatekeeper-system, and any other infrastructure namespace you do not need. |
| Using Container Insights for metrics alerting | KQL-based metric alerts have higher latency and cost more than Prometheus alert rules | Use Prometheus recording rules and Grafana alerts for metrics. Use Container Insights alerts only for log-based conditions. |
| Deploying Jaeger, Zipkin, and Application Insights simultaneously | Three tracing backends means three places to look, and none of them has complete data | Pick one tracing backend. Application Insights integrates natively with Azure. Jaeger is better if you want to stay vendor-neutral. |
| Skipping resource limits on the OTel Collector | The collector can consume unbounded memory during traffic spikes | Always set memory limits on the collector pod and configure the memory limiter processor in the OTel pipeline (see the sketch below) |
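
For that last row, a rough sketch of what this looks like with the Helm chart from the setup section. The values keys follow the `opentelemetry-collector` chart's `resources`/`config` structure and the collector's `memory_limiter` processor; confirm against the chart's values reference for your version.

```bash
# Sketch: cap collector memory at the pod level and in the pipeline itself.
cat > otel-values.yaml <<'EOF'
mode: deployment
resources:
  limits:
    memory: 512Mi
config:
  processors:
    memory_limiter:
      check_interval: 1s
      limit_percentage: 80
      spike_limit_percentage: 25
  service:
    pipelines:
      traces:
        processors: [memory_limiter, batch]
EOF

# --reuse-values keeps the exporter settings from the original install.
helm upgrade otel-collector open-telemetry/opentelemetry-collector \
  --namespace otel-system \
  --reuse-values \
  -f otel-values.yaml
```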

Decision tree: which tool do I need right now?

Question 1: Do you have any observability on this cluster?

  • No: Enable Container Insights. Stop here until you have a real need for more.

Question 2: Do you need better dashboards or custom metrics alerting?

  • Yes: Add Managed Prometheus + Grafana.

Question 3: Are you debugging latency across multiple services?

  • Yes: Add OpenTelemetry with distributed tracing.

Question 4: Is your Container Insights bill too high?

  • Yes: Filter log collection, switch metrics to Prometheus, reduce Log Analytics retention to 30 days.
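
The retention change is a single CLI call against the workspace (30 days is the default for new workspaces, so this matters mainly if retention was raised earlier):

```bash
# Set interactive retention on the Log Analytics workspace to 30 days.
az monitor log-analytics workspace update \
  --resource-group myRG \
  --workspace-name myWorkspace \
  --retention-time 30
```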

Resources