## Horizontal pod autoscaler
ALWAYS configure HPA for production workloads. A fixed replica count means you are either wasting money during low traffic or about to crash during high traffic. There is no middle ground.
### How HPA works
The HPA controller checks metrics every 15 seconds (configurable). It compares current utilization against your target and adjusts replica count accordingly:
```
desiredReplicas = ceil(currentReplicas * (currentMetric / targetMetric))
```
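For example, 4 replicas averaging 80% CPU against a 65% target gives ceil(4 * 80/65) = ceil(4.92) = 5 replicas. The controller also applies a tolerance (10% by default): if the current/target ratio falls within 0.9-1.1, it leaves the replica count alone to avoid churn.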
HPA uses requests-based utilization, not actual node capacity. If your pod requests 100m CPU and uses 80m, HPA sees 80% utilization. Get your requests wrong and HPA makes wrong decisions. Always right-size requests first.
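As a concrete sketch (container name and values are illustrative, matching the example Deployment later in this section), utilization is measured against `resources.requests`:

```yaml
# Pod spec fragment. HPA percentages are computed against requests:
# with cpu: 200m requested, 130m of actual usage reads as 65% utilization.
containers:
  - name: api-server
    resources:
      requests:
        cpu: 200m
        memory: 256Mi
      limits:
        memory: 512Mi
```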
### Essential configuration
| Parameter | Recommendation | Why |
|---|---|---|
| minReplicas | 2+ for production | Single replica = zero availability during restarts |
| maxReplicas | Set to what your budget allows | Unbounded scaling will bankrupt you |
| targetCPUUtilization | 60-70% | Leaves headroom for traffic spikes before new pods are ready |
| stabilizationWindowSeconds | 300 (scale-down) | Prevents thrashing during fluctuating load |
### Production-ready HPA example
```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api-server
  minReplicas: 3
  maxReplicas: 20
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 65
    - type: Pods
      pods:
        metric:
          name: http_requests_per_second
        target:
          type: AverageValue
          averageValue: "1000"
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
        - type: Percent
          value: 10
          periodSeconds: 60
    scaleUp:
      stabilizationWindowSeconds: 0
      policies:
        - type: Percent
          value: 100
          periodSeconds: 15
```
Scale up aggressively, scale down conservatively. The cost of over-provisioning for a few minutes is far less than the cost of dropping requests.
### Custom metrics: when CPU is not enough
CPU-based scaling is table stakes. Production workloads need custom metrics:
- Queue depth -- scale workers based on pending messages
- Request latency -- scale when p95 latency exceeds SLO
- Active connections -- scale WebSocket or gRPC services
- Business metrics -- orders per minute, active sessions
Use Prometheus Adapter or KEDA to expose custom metrics to HPA.
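As an illustration of the queue-depth case, here is a minimal KEDA sketch, assuming KEDA is installed and a worker Deployment consumes a RabbitMQ queue (the names `order-worker`, `orders`, and `RABBITMQ_URL` are hypothetical):

```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: order-worker-scaler
spec:
  scaleTargetRef:
    name: order-worker          # hypothetical worker Deployment
  minReplicaCount: 0            # KEDA can scale to zero, unlike plain HPA
  maxReplicaCount: 30
  triggers:
    - type: rabbitmq
      metadata:
        queueName: orders       # hypothetical queue name
        mode: QueueLength       # scale on pending messages
        value: "50"             # target ~50 pending messages per replica
        hostFromEnv: RABBITMQ_URL
```

Under the hood, KEDA creates and manages the HPA for you; you describe the trigger, and it handles the metrics plumbing.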
### Common mistakes
**Setting min = max replicas.** This disables autoscaling entirely. If you want a fixed count, just set `replicas` on the Deployment and skip HPA.

**Not setting resource requests.** Without requests, HPA has no baseline to calculate utilization and will not scale at all. Every container must have CPU and memory requests.

**Targeting 90%+ utilization.** By the time you hit 90%, new pods take 30-60 seconds to become ready, and your existing pods are saturated and dropping requests during that window.

**Ignoring scale-down behavior.** The default scale-down is aggressive: a brief dip in traffic can trigger scale-down, and when traffic returns you wait for new pods to start. Set the stabilization window to at least 5 minutes.

**Using HPA with VPA on the same metric.** HPA and Vertical Pod Autoscaler fight each other if both target CPU. If you need both, restrict VPA to memory.
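If you do run both, here is a minimal sketch of a memory-only VPA, assuming the VPA CRDs are installed and targeting the same Deployment as the HPA example above (the HPA then owns CPU-driven scaling):

```yaml
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: api-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api-server
  updatePolicy:
    updateMode: Auto
  resourcePolicy:
    containerPolicies:
      - containerName: "*"
        controlledResources: ["memory"]   # leave CPU to the HPA
```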
### Verifying HPA status
```bash
kubectl get hpa api-hpa
kubectl describe hpa api-hpa

# Check if metrics are being collected
kubectl top pods -l app=api-server
```
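HPA also reports status conditions (AbleToScale, ScalingActive, ScalingLimited) that pinpoint why scaling is stuck:

```bash
# ScalingActive=False with reason FailedGetResourceMetric usually means
# missing resource requests or a broken metrics-server
kubectl get hpa api-hpa -o jsonpath='{.status.conditions}'

# Watch the replica count react to load in real time
kubectl get hpa api-hpa --watch
```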
### Decision: when to use HPA vs alternatives
| Scenario | Use |
|---|---|
| Steady traffic with predictable patterns | HPA on CPU |
| API with SLO targets | HPA on latency custom metric |
| Queue-based workers | KEDA (scale to zero) |
| Batch jobs triggered by events | KEDA |
| Sudden unpredictable bursts | HPA + Cluster Autoscaler |