# Observability comparison
AKS has three observability paths that overlap in confusing ways. This page clarifies what each one does, where they overlap, and what combination to actually use.
## The three tools
| Tool | What it is | How it runs |
|---|---|---|
| Container Insights | Azure Monitor agent that collects logs, node/pod metrics, and Kubernetes events | DaemonSet deployed via the monitoring addon |
| Managed Prometheus | Azure-hosted Prometheus-compatible metrics store with pre-built scraping | Metrics addon (ama-metrics pods); data stored in an Azure Monitor workspace |
| OpenTelemetry | Vendor-neutral SDK and collector for traces, metrics, and logs | You deploy the OTel Collector and instrument your application code |
## Full comparison
| Aspect | Container Insights | Managed Prometheus | OpenTelemetry |
|---|---|---|---|
| Primary data | Logs + Kubernetes events | Metrics (time series) | Traces + custom metrics + logs |
| Storage backend | Log Analytics workspace | Azure Monitor workspace (Prometheus-compatible) | Depends on exporter (Azure Monitor, Jaeger, Zipkin) |
| Dashboards | Azure portal workbooks | Azure Managed Grafana | Grafana, Jaeger UI, or any compatible backend |
| Alerting | Azure Monitor alerts on log queries | Prometheus rule groups or Grafana-managed alerts | Depends on backend |
| Custom metrics | Limited (custom log queries via KQL) | Full PromQL support, scrape any /metrics endpoint | Full SDK instrumentation in your application code |
| Infrastructure metrics | Node CPU, memory, disk, network | Same plus any Prometheus exporter metrics | Not designed for infrastructure metrics |
| Application-level data | stdout/stderr logs only | Application metrics if exposed via /metrics | Distributed traces, custom spans, application metrics |
| Cost model | Pay per GB of logs ingested into Log Analytics | Pay per metric sample ingested | Free (SDK is open source); pay for the backend you export to |
| Setup effort | One CLI command | One CLI command + Grafana workspace | Deploy collector, instrument code, configure exporters |
| When to use | Always (baseline observability) | When you need metrics beyond the portal | When you need distributed tracing or custom instrumentation |
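To make the custom-metrics row concrete, here is the same question (pod CPU usage) asked both ways. The KQL query assumes the default Container Insights `Perf` schema; the PromQL query assumes the standard cAdvisor metric names the addon scrapes.

```kusto
// Container Insights (Log Analytics): average container CPU in nanocores, 5-minute bins
Perf
| where ObjectName == "K8SContainer" and CounterName == "cpuUsageNanoCores"
| summarize avg(CounterValue) by InstanceName, bin(TimeGenerated, 5m)
```

```promql
# Managed Prometheus: per-pod CPU usage in cores, averaged over 5 minutes
sum by (pod) (rate(container_cpu_usage_seconds_total[5m]))
```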
## The recommended stack
Use all three. Each covers a layer the others cannot.
| Layer | Tool | Why |
|---|---|---|
| Logs and Kubernetes events | Container Insights | Captures stdout/stderr from every container plus scheduling events. This is your audit trail. |
| Metrics and dashboards | Managed Prometheus + Grafana | PromQL is the standard for metrics. Grafana dashboards are more flexible than Azure portal workbooks, and community dashboards work out of the box. |
| Distributed tracing | OpenTelemetry | The only option for request-level tracing across microservices. Container Insights and Prometheus cannot do this. |
## Enable the recommended stack
```bash
# 1. Container Insights (logs + events)
az aks enable-addons \
  --resource-group myRG \
  --name myCluster \
  --addons monitoring \
  --workspace-resource-id "<log-analytics-workspace-id>"

# 2. Managed Prometheus (metrics)
az aks update \
  --resource-group myRG \
  --name myCluster \
  --enable-azure-monitor-metrics \
  --azure-monitor-workspace-resource-id "<azure-monitor-workspace-id>" \
  --grafana-resource-id "<grafana-workspace-id>"

# 3. OpenTelemetry Collector (tracing)
# Deploy via Helm or the OTel Operator in your cluster.
helm repo add open-telemetry https://open-telemetry.github.io/opentelemetry-helm-charts
helm install otel-collector open-telemetry/opentelemetry-collector \
  --namespace otel-system --create-namespace \
  --set mode=deployment \
  --set config.exporters.azuremonitor.connection_string="<app-insights-connection-string>"
# Note: configuring the exporter alone does not route any data to it; the exporter
# must also appear in a pipeline. See the values-file sketch below.
```
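A minimal values-file sketch for step 3, assuming the contrib image (which bundles the `azuremonitor` exporter). Pass it with `helm install otel-collector open-telemetry/opentelemetry-collector -f otel-values.yaml` instead of the `--set` flags:

```yaml
# otel-values.yaml — minimal sketch; OTLP receivers come from the chart defaults
mode: deployment
image:
  repository: otel/opentelemetry-collector-contrib  # contrib build includes azuremonitor
config:
  exporters:
    azuremonitor:
      connection_string: "<app-insights-connection-string>"
  service:
    pipelines:
      traces:
        exporters: [azuremonitor]  # route the default traces pipeline to App Insights
```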
## What overlaps and what does not
Container Insights and Managed Prometheus both collect node and pod metrics. This causes confusion.
| Metric type | Container Insights | Managed Prometheus | Overlap? |
|---|---|---|---|
| Node CPU/memory | Yes (sent to Log Analytics) | Yes (sent to Azure Monitor workspace) | Yes, duplicate |
| Pod CPU/memory | Yes | Yes | Yes, duplicate |
| Container restart count | Yes | Yes | Yes, duplicate |
| Kubernetes events | Yes | No | No overlap |
| Container logs (stdout/stderr) | Yes | No | No overlap |
| Custom application metrics | No | Yes (if app exposes /metrics) | No overlap |
| PromQL queries | No | Yes | No overlap |
| Distributed traces | No | No | Neither covers this |
The metric duplication between Container Insights and Managed Prometheus is intentional. Container Insights feeds the Azure portal experience. Managed Prometheus feeds Grafana. You pay for both, but the Prometheus path is significantly cheaper for pure metrics workloads. If cost is a concern, reduce Container Insights to logs-only and use Prometheus for all metrics.
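One way to implement the logs-only reduction is the monitoring addon's data collection settings file. A sketch, assuming the documented stream names; verify them against the current Container Insights schema before relying on this:

```bash
# dataCollectionSettings.json: keep logs, events, and pod inventory;
# drop the perf streams that duplicate Managed Prometheus metrics
cat > dataCollectionSettings.json <<'EOF'
{
  "interval": "1m",
  "namespaceFilteringMode": "Exclude",
  "namespaces": ["kube-system"],
  "enableContainerLogV2": true,
  "streams": ["Microsoft-ContainerLogV2", "Microsoft-KubeEvents", "Microsoft-KubePodInventory"]
}
EOF

az aks enable-addons \
  --resource-group myRG \
  --name myCluster \
  --addons monitoring \
  --workspace-resource-id "<log-analytics-workspace-id>" \
  --data-collection-settings dataCollectionSettings.json
```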
## Cost considerations
Observability costs are driven by data volume. Unfiltered, a 20-node cluster can generate 50+ GB of logs per day.
### Where the money goes
| Source | Cost driver | Typical impact |
|---|---|---|
| Container Insights logs | GB ingested into Log Analytics | High. This is the most expensive component. |
| Container Insights metrics | Included with logs addon | Moderate. Bundled, but still metered. |
| Managed Prometheus | Metrics samples ingested | Low. Prometheus is very efficient for time-series data. |
| Managed Grafana | Per-instance pricing | Fixed. One Grafana instance serves the whole team. |
| OpenTelemetry | Depends on the backend | Variable. Application Insights charges per event. |
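Before cutting anything, measure where the gigabytes come from. A Log Analytics query along these lines shows billable volume per namespace (assumes the `ContainerLogV2` schema and the built-in `_BilledSize` column):

```kusto
// Billable container log volume per namespace over the last day
ContainerLogV2
| where TimeGenerated > ago(1d)
| summarize IngestedMB = round(sum(_BilledSize) / 1024 / 1024, 1) by PodNamespace
| order by IngestedMB desc
```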
### Reduce log volume immediately
Do not collect everything. Filter aggressively from day one.
```yaml
# ConfigMap to exclude noisy namespaces from Container Insights
# Apply as: kubectl apply -f container-insights-config.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: container-azm-ms-agentconfig
  namespace: kube-system
data:
  schema-version: v1
  config-version: v1
  log-data-collection-settings: |-
    [log_collection_settings]
      [log_collection_settings.stdout]
        enabled = true
        exclude_namespaces = ["kube-system","gatekeeper-system","azure-arc"]
      [log_collection_settings.stderr]
        enabled = true
        exclude_namespaces = ["kube-system","gatekeeper-system"]
      [log_collection_settings.env_var]
        enabled = false
```
Start by excluding kube-system logs. These are high-volume and rarely useful for application debugging. If you need kube-system data, query it through Kubernetes events in Container Insights instead of raw container logs.
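For example, warning-level events from kube-system remain queryable even with its container logs excluded (column names assume the default `KubeEvents` schema):

```kusto
// kube-system warnings without ingesting its container logs
KubeEvents
| where Namespace == "kube-system" and KubeEventType == "Warning"
| project TimeGenerated, Name, Reason, Message
| order by TimeGenerated desc
```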
### Cost comparison at scale
| Cluster size | Container Insights (logs + metrics) | Managed Prometheus only | Savings with Prometheus for metrics |
|---|---|---|---|
| 5 nodes, light workloads | ~$150/month | ~$30/month | 80% on metrics cost |
| 20 nodes, moderate workloads | ~$800/month | ~$80/month | 90% on metrics cost |
| 50+ nodes, heavy workloads | ~$3,000+/month | ~$150/month | 95% on metrics cost |
These are estimates. Actual costs depend on log volume, retention, and query frequency.
## Migration path
Do not try to set up everything at once. Follow this order:
### Phase 1: Container Insights only
Enable Container Insights. Get logs, basic metrics, and portal dashboards working. This takes 10 minutes and covers 80% of what you need on day one.
### Phase 2: add Managed Prometheus and Grafana
When you need better dashboards, custom metrics, or PromQL alerting, add Managed Prometheus. Import community Grafana dashboards for Kubernetes. Consider reducing Container Insights to logs-only to avoid paying for duplicate metrics.
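Community dashboards can be imported by their grafana.com ID through the Azure Managed Grafana CLI extension. A sketch; dashboard 315 (a widely used Kubernetes cluster dashboard) is an example, not a recommendation:

```bash
# Import a community dashboard into Azure Managed Grafana by grafana.com ID
az extension add --name amg
az grafana dashboard import \
  --name myGrafana \
  --resource-group myRG \
  --definition 315
```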
### Phase 3: add OpenTelemetry
When you have multiple microservices and need to trace requests across them, instrument your application with the OpenTelemetry SDK. Deploy the OTel Collector to export traces to Application Insights or Jaeger.
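What instrumenting the application looks like, as a minimal Python sketch; the collector address and service name are placeholders, and real services would typically layer auto-instrumentation libraries on top of this:

```python
# Minimal manual instrumentation: emit one span to the in-cluster OTel Collector.
# Requires: opentelemetry-sdk, opentelemetry-exporter-otlp
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

# Placeholder address: the collector Service created by the Helm install
exporter = OTLPSpanExporter(endpoint="http://otel-collector.otel-system:4317", insecure=True)
provider = TracerProvider()
provider.add_span_processor(BatchSpanProcessor(exporter))
trace.set_tracer_provider(provider)

tracer = trace.get_tracer("checkout-service")
with tracer.start_as_current_span("process-order"):
    pass  # application work happens here; child spans nest automatically
```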
Most teams never need phase 3. If you run a monolith or a small number of services, distributed tracing adds complexity without proportional value. Add it when you cannot debug cross-service latency issues with logs and metrics alone.
## Anti-patterns
Avoid these common mistakes.
| Anti-pattern | Why it is wrong | What to do instead |
|---|---|---|
| Running self-hosted Prometheus alongside Managed Prometheus | Double the operational burden, double the cost, data in two places | Use Managed Prometheus. It is fully compatible and Azure handles availability. |
| Collecting all logs from all namespaces without filtering | Log Analytics costs scale linearly with volume. You will get a surprise bill. | Exclude kube-system, gatekeeper-system, and any other infrastructure namespace you do not need. |
| Using Container Insights for metrics alerting | KQL-based metric alerts have higher latency and cost more than Prometheus alert rules | Use Prometheus recording rules and Grafana alerts for metrics. Use Container Insights alerts only for log-based conditions. |
| Deploying Jaeger, Zipkin, and Application Insights simultaneously | Three tracing backends means three places to look and none of them have complete data | Pick one tracing backend. Application Insights integrates natively with Azure. Jaeger is better if you want to stay vendor-neutral. |
| Skipping resource limits on the OTel Collector | The collector can consume unbounded memory during traffic spikes | Always set memory limits on the collector pod and configure the memory limiter processor in the OTel pipeline |
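For the last row, a sketch of what bounded collector memory looks like in the Helm values; the numbers are illustrative, not sizing guidance:

```yaml
# Helm values fragment: cap the collector pod and enable the memory_limiter processor
resources:
  limits:
    memory: 512Mi
config:
  processors:
    memory_limiter:
      check_interval: 1s
      limit_percentage: 80        # start refusing data at 80% of the pod limit
      spike_limit_percentage: 25
  service:
    pipelines:
      traces:
        processors: [memory_limiter, batch]  # memory_limiter must run first
```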
## Decision tree: which tool do I need right now?
Question 1: Do you have any observability on this cluster?
- No: Enable Container Insights. Stop here until you have a real need for more.
Question 2: Do you need better dashboards or custom metrics alerting?
- Yes: Add Managed Prometheus + Grafana.
Question 3: Are you debugging latency across multiple services?
- Yes: Add OpenTelemetry with distributed tracing.
Question 4: Is your Container Insights bill too high?
- Yes: Filter log collection, switch metrics to Prometheus, reduce Log Analytics retention to 30 days.
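The retention change in the last answer is a one-line workspace update (resource names are placeholders):

```bash
# Reduce Log Analytics interactive retention to 30 days
az monitor log-analytics workspace update \
  --resource-group myRG \
  --workspace-name myWorkspace \
  --retention-time 30
```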