# KAITO: AI model inference
KAITO is the fastest path from model selection to serving on AKS. Use it unless you need custom inference frameworks or maximum throughput optimization.
## What KAITO does

KAITO (Kubernetes AI Toolchain Operator) deploys large language models on AKS with a single custom resource. It handles the hard parts: GPU node provisioning, model download, serving setup, and health management. Don't reinvent this. If you're deploying a supported model, KAITO saves weeks of infrastructure work.
## Supported models
| Model Family | Examples | GPU Requirement |
|---|---|---|
| Llama | Llama-2-7b, Llama-2-13b, Llama-2-70b | 1-8 GPUs depending on size |
| Mistral | Mistral-7b, Mixtral-8x7b | 1-4 GPUs |
| Falcon | Falcon-7b, Falcon-40b | 1-4 GPUs |
| Phi | Phi-2, Phi-3-mini | 1 GPU |
## Deploying a custom model from HuggingFace
KAITO is not limited to preset models. You can deploy any compatible model from HuggingFace:
```yaml
apiVersion: kaito.sh/v1alpha1
kind: Workspace
metadata:
  name: custom-model
resource:
  instanceType: Standard_NC24ads_A100_v4
  count: 1
  labelSelector:
    matchLabels:
      apps: custom-model
inference:
  model:
    name: "SmolLM2-1.7B-Instruct"
    registry: "HuggingFace"
```
## Deploying a preset model
```yaml
apiVersion: kaito.sh/v1alpha1
kind: Workspace
metadata:
  name: llama2-7b
  annotations:
    kaito.sh/enablelb: "True"
resource:
  instanceType: Standard_NC24ads_A100_v4
  count: 1
  labelSelector:
    matchLabels:
      apps: llama2-7b
inference:
  preset:
    name: llama-2-7b-chat
```
That's the entire deployment. Apply this YAML and KAITO:
- Provisions a GPU node (if none available)
- Downloads the model weights from the configured source
- Starts a vLLM or Hugging Face inference server
- Creates a Service for API access
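A minimal apply-and-watch sequence, assuming you saved the manifest above as `llama2-7b.yaml` (the filename is arbitrary):

```bash
# Create the workspace and watch it progress toward ready
kubectl apply -f llama2-7b.yaml
kubectl get workspace llama2-7b -w
```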
## Installation

```bash
# Enable KAITO on your AKS cluster
az aks update \
  --resource-group myrg \
  --name myaks \
  --enable-ai-toolchain-operator

# Verify KAITO pods are running
kubectl get pods -n kube-system -l app=ai-toolchain-operator
```
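If the label selector returns nothing, the operator components may carry different labels depending on the add-on version; a broader check (assuming the component names contain "kaito") is:

```bash
# List KAITO-related deployments in kube-system
kubectl get deployments -n kube-system | grep -i kaito
```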
## Accessing the model

```bash
# Get the service endpoint
kubectl get svc llama2-7b -o jsonpath='{.status.loadBalancer.ingress[0].ip}'

# Test inference
curl -X POST http://<SERVICE_IP>/chat \
  -H "Content-Type: application/json" \
  -d '{"prompt": "What is Kubernetes?", "max_tokens": 200}'
```
## How KAITO compares
| Feature | KAITO | Manual Deployment |
|---|---|---|
| Time to deploy | Minutes | Days to weeks |
| Node provisioning | Automatic | Manual nodepool creation |
| Model download | Handled | You manage storage + download |
| Serving framework | Pre-configured | You pick, configure, tune |
| Customization | Preset models + custom models via HuggingFace | Full control |
| Throughput tuning | Default settings | You optimize batch size, quantization |
## When not to use KAITO

KAITO supports custom models from HuggingFace (e.g., SmolLM2-1.7B-Instruct) in addition to preset models, so an unsupported model alone is no longer a blocker. Skip KAITO when:

- You need custom quantization (GPTQ, AWQ, GGUF)
- You need maximum throughput optimization (custom vLLM configs)
- You need multi-model serving on shared GPUs

In these cases, deploy vLLM or TGI directly. See Inference Serving.
## Workspace management

```bash
# Check workspace status
kubectl get workspace llama2-7b

# View inference logs
kubectl logs -l apps=llama2-7b --tail=50

# Delete workspace (removes model + node)
kubectl delete workspace llama2-7b
```
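Deleting the workspace also releases the provisioned GPU node, though teardown isn't instant. A quick sanity check (no KAITO-specific flags assumed):

```bash
# Confirm the GPU node count drops after the workspace is deleted
kubectl get nodes -o wide
```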
## Common mistakes
- Not checking GPU quota -- KAITO provisions GPU nodes. If your subscription lacks quota, it fails silently. Check your quota before deploying (see the command after this list).
- Deploying 70B models on 1 GPU -- Large models need multiple GPUs. Check model requirements.
- Leaving workspaces running -- GPU nodes are expensive. Delete workspaces when not in use.
- Expecting production throughput from defaults -- KAITO optimizes for simplicity, not maximum tokens/second.
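One way to check quota, assuming the `Standard_NC24ads_A100_v4` instance type from the examples above (the family name in the quota table and your region will vary):

```bash
# List GPU vCPU quota usage in your region; look for the NCADS A100 v4 family
az vm list-usage --location eastus --output table | grep -i "NCADS"
```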