
Model Inference Serving

Start with KAITO for simplicity. Graduate to vLLM when you need to squeeze maximum throughput from expensive GPUs.

Serving framework comparison

| Framework | Throughput | Ease of Setup | Best For |
| --- | --- | --- | --- |
| KAITO | Good (default settings) | Trivial (1 YAML) | Getting started, standard models |
| vLLM | Highest (PagedAttention) | Medium (manual deploy) | Production LLM serving at scale |
| TGI | High | Medium | HuggingFace ecosystem models |
| Triton | High (multi-framework) | Complex | Multi-model, multi-framework serving |
| Custom | Varies | Hard | Proprietary models, special requirements |
Opinion

KAITO for simplicity. vLLM for maximum throughput on production LLM serving. This is the decision tree for 90% of teams.
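
The "1 YAML" in the table is literal: a KAITO Workspace declares a GPU node size and a model preset, and the operator provisions both. A minimal sketch, assuming the falcon-7b preset and an NC-series instance type (available presets and instance types depend on your KAITO version):

```yaml
apiVersion: kaito.sh/v1alpha1
kind: Workspace
metadata:
  name: workspace-falcon-7b
resource:
  instanceType: "Standard_NC12s_v3"   # GPU VM size KAITO will provision
  labelSelector:
    matchLabels:
      apps: falcon-7b
inference:
  preset:
    name: "falcon-7b"                 # preset name; check your KAITO release
```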

vLLM on AKS

vLLM uses PagedAttention and continuous batching to achieve the highest tokens/second on GPU hardware. If you're serving LLMs at scale, this is the framework to use.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-llama2
spec:
  replicas: 1
  selector:
    matchLabels:
      app: vllm-llama2
  template:
    metadata:
      labels:
        app: vllm-llama2
    spec:
      tolerations:
        - key: "sku"
          operator: "Equal"
          value: "gpu"
          effect: "NoSchedule"
      containers:
        - name: vllm
          image: vllm/vllm-openai:latest
          args:
            - "--model"
            - "meta-llama/Llama-2-7b-chat-hf"
            - "--tensor-parallel-size"
            - "1"
            - "--max-model-len"
            - "4096"
          resources:
            limits:
              nvidia.com/gpu: 1
          ports:
            - containerPort: 8000
          env:
            - name: HUGGING_FACE_HUB_TOKEN
              valueFrom:
                secretKeyRef:
                  name: hf-token
                  key: token
      nodeSelector:
        workload: gpu
---
apiVersion: v1
kind: Service
metadata:
  name: vllm-llama2
spec:
  selector:
    app: vllm-llama2
  ports:
    - port: 8000
      targetPort: 8000
  type: ClusterIP
```
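
Loading a 7B model can take minutes, so the pod needs probes that tolerate a slow start. A container-level sketch for the vllm container above, using the /health route vLLM's OpenAI-compatible server exposes (the thresholds are illustrative):

```yaml
startupProbe:
  httpGet:
    path: /health          # vLLM OpenAI server health route
    port: 8000
  periodSeconds: 10
  failureThreshold: 60     # allow up to ~10 minutes for weight loading
readinessProbe:
  httpGet:
    path: /health
    port: 8000
  periodSeconds: 10
```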

Autoscaling inference

Use KEDA with custom metrics to scale inference replicas based on actual demand.

```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: vllm-scaler
spec:
  scaleTargetRef:
    name: vllm-llama2
  minReplicaCount: 1
  maxReplicaCount: 8
  triggers:
    - type: prometheus
      metadata:
        serverAddress: http://prometheus:9090
        metricName: vllm_pending_requests
        query: sum(vllm_num_requests_waiting)
        threshold: "10"
```
Info

Scale on queue depth (pending requests), not GPU utilization. A busy GPU reports near-100% utilization whether or not it is keeping up with demand, so utilization can't distinguish healthy load from saturation. Queue depth tells you when users are actually waiting.
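
For the query above to return anything, Prometheus has to scrape vLLM's /metrics endpoint (and verify the exact metric name against your /metrics output; recent vLLM releases prefix metrics with vllm:). A sketch using the Prometheus Operator's ServiceMonitor, assuming the vllm-llama2 Service itself carries the app: vllm-llama2 label:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: vllm-llama2
spec:
  selector:
    matchLabels:
      app: vllm-llama2     # assumes the Service (not just its pods) has this label
  endpoints:
    - targetPort: 8000     # the Service port above is unnamed, so match by number
      path: /metrics
      interval: 15s
```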

Multi-model GPU sharing

One GPU per model is wasteful for small models or low-traffic endpoints. Options:

| Strategy | How It Works | When to Use |
| --- | --- | --- |
| Time-slicing | Round-robin GPU access between containers | Multiple small models, acceptable latency |
| MIG (Multi-Instance GPU) | Physically partition an A100 into independent slices | Isolation between models, A100/H100 only |
| Multiple models in one process | vLLM/Triton serve multiple models | Same framework, shared memory |
```yaml
# NVIDIA time-slicing configuration (ConfigMap)
apiVersion: v1
kind: ConfigMap
metadata:
  name: nvidia-device-plugin
  namespace: kube-system
data:
  config.yaml: |
    version: v1
    sharing:
      timeSlicing:
        resources:
          - name: nvidia.com/gpu
            replicas: 4
```

This makes each physical GPU appear as four schedulable GPUs. Pods share it via time-slicing, but with no memory or fault isolation between them -- use MIG when you need that.
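
With MIG, each slice is scheduled as its own resource type rather than a fraction of nvidia.com/gpu. A container-level fragment, assuming an A100 80GB node with the 1g.10gb profile enabled (profile names and sizes differ per GPU model):

```yaml
resources:
  limits:
    nvidia.com/mig-1g.10gb: 1   # one MIG slice; resource name depends on the configured profile
```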

Performance optimization checklist

  1. Enable continuous batching -- vLLM does this by default; TGI does too, tuned via --max-batch-prefill-tokens.
  2. Set appropriate max model length -- Shorter context = more concurrent requests.
  3. Use quantization for inference -- AWQ or GPTQ reduces memory, minimal quality loss.
  4. Tensor parallelism for large models -- Split across GPUs when model doesn't fit in one.
  5. Preload models -- Use init containers or PVCs with cached weights. Don't download on every pod start.
Common Mistake

Downloading model weights from HuggingFace on every pod restart. A 13B model is 26GB. Use a PVC with pre-downloaded weights or an init container that caches to a shared volume.
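
A sketch of the PVC approach: mount pre-downloaded weights and point --model at the local path, which skips the HuggingFace download entirely. The PVC name and mount path here are illustrative:

```yaml
# Pod-spec fragment: serve weights from a PVC instead of downloading.
containers:
  - name: vllm
    image: vllm/vllm-openai:latest
    args:
      - "--model"
      - "/models/llama-2-7b-chat-hf"   # local directory, not a HF model ID
    volumeMounts:
      - name: model-weights
        mountPath: /models
        readOnly: true
volumes:
  - name: model-weights
    persistentVolumeClaim:
      claimName: model-weights          # hypothetical PVC with cached weights
```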

When to graduate from KAITO to custom

| Signal | Action |
| --- | --- |
| Need custom quantization (AWQ/GPTQ) | Deploy vLLM directly with quantized model |
| Throughput is insufficient | vLLM with tuned batch sizes and parallelism |
| Model not in KAITO supported list | Manual deployment required |
| Need request routing between models | Deploy with Triton or custom gateway |
| Need to serve 5+ models on shared GPUs | Time-slicing + custom deployment |
