Deploying Dynamo on Kubernetes — NVIDIA Dynamo Documentation
Deploying Dynamo on Kubernetes #
High-level guide to Dynamo Kubernetes deployments. Start here, then dive into specific guides.
Important Terminology #
Kubernetes Namespace: The K8s namespace where your DynamoGraphDeployment resource is created.
-
Used for: Resource isolation, RBAC, organizing deployments
-
Example:
dynamo-system,dynamo-cloud,team-a-namespace
Dynamo Namespace: The logical namespace used by Dynamo components for service discovery.
-
Used for: Runtime component communication, service discovery
-
Specified in:
.spec.services.<ServiceName>.dynamoNamespacefield -
Example:
my-llm,production-model,dynamo-dev
These are independent. A single Kubernetes namespace can host multiple Dynamo namespaces, and vice versa.
Pre-deployment Checks #
Before deploying the platform, it is recommended to run the pre-deployment checks to ensure the cluster is ready for deployment. Please refer to the pre-deployment checks for more details.
1. Install Platform First #
# 1. Set environment
export NAMESPACE=dynamo-system
export RELEASE_VERSION=0.x.x # any version of Dynamo 0.3.2+ listed at https://github.com/ai-dynamo/dynamo/releases
# 2. Install CRDs (skip if on shared cluster where CRDs already exist)
helm fetch https://helm.ngc.nvidia.com/nvidia/ai-dynamo/charts/dynamo-crds-${RELEASE_VERSION}.tgz
helm install dynamo-crds dynamo-crds-${RELEASE_VERSION}.tgz --namespace default
# 3. Install Platform
helm fetch https://helm.ngc.nvidia.com/nvidia/ai-dynamo/charts/dynamo-platform-${RELEASE_VERSION}.tgz
helm install dynamo-platform dynamo-platform-${RELEASE_VERSION}.tgz --namespace ${NAMESPACE} --create-namespace
Copy to clipboard
For Shared/Multi-Tenant Clusters:
If your cluster has namespace-restricted Dynamo operators, add this flag to step 3:
--set dynamo-operator.namespaceRestriction.enabled=true
Copy to clipboard
For more details or customization options (including multinode deployments), see Installation Guide for Dynamo Kubernetes Platform.
2. Choose Your Backend #
Each backend has deployment examples and configuration options:
| Backend | Aggregated | Aggregated + Router | Disaggregated | Disaggregated + Router | Disaggregated + Planner | Disaggregated Multi-node |
|---|---|---|---|---|---|---|
| SGLang | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
| TensorRT-LLM | ✅ | ✅ | ✅ | ✅ | 🚧 | ✅ |
| vLLM | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
3. Deploy Your First Model #
export NAMESPACE=dynamo-system
kubectl create namespace ${NAMESPACE}
# to pull model from HF
export HF_TOKEN=<Token-Here>
kubectl create secret generic hf-token-secret \
--from-literal=HF_TOKEN="$HF_TOKEN" \
-n ${NAMESPACE};
# Deploy any example (this uses vLLM with Qwen model using aggregated serving)
kubectl apply -f examples/backends/vllm/deploy/agg.yaml -n ${NAMESPACE}
# Check status
kubectl get dynamoGraphDeployment -n ${NAMESPACE}
# Test it
kubectl port-forward svc/vllm-agg-frontend 8000:8000 -n ${NAMESPACE}
curl http://localhost:8000/v1/models
Copy to clipboard
For SLA-based autoscaling, see SLA Planner Quick Start Guide.
Understanding Dynamo’s Custom Resources #
Dynamo provides two main Kubernetes Custom Resources for deploying models:
DynamoGraphDeploymentRequest (DGDR) - Simplified SLA-Driven Configuration #
The recommended approach for generating optimal configurations. DGDR provides a high-level interface where you specify:
-
Model name and backend framework
-
SLA targets (latency requirements)
-
GPU type (optional)
Dynamo automatically handles profiling and generates an optimized DGD spec in the status. Perfect for:
-
SLA-driven configuration generation
-
Automated resource optimization
-
Users who want simplicity over control
Note: DGDR generates a DGD spec which you can then use to deploy.
DynamoGraphDeployment (DGD) - Direct Configuration #
A lower-level interface that defines your complete inference pipeline:
-
Model configuration
-
Resource allocation (GPUs, memory)
-
Scaling policies
-
Frontend/backend connections
Use this when you need fine-grained control or have already completed profiling.
Refer to the API Reference and Documentation for more details.
📖 API Reference & Documentation #
For detailed technical specifications of Dynamo’s Kubernetes resources:
-
API Reference - Complete CRD field specifications for all Dynamo resources
-
Create Deployment - Step-by-step deployment creation with DynamoGraphDeployment
-
Operator Guide - Dynamo operator configuration and management
Choosing Your Architecture Pattern #
When creating a deployment, select the architecture pattern that best fits your use case:
-
Development / Testing - Use
agg.yamlas the base configuration -
Production with Load Balancing - Use
agg_router.yamlto enable scalable, load-balanced inference -
High Performance / Disaggregated - Use
disagg_router.yamlfor maximum throughput and modular scalability
Frontend and Worker Components #
You can run the Frontend on one machine (e.g., a CPU node) and workers on different machines (GPU nodes). The Frontend serves as a framework-agnostic HTTP entry point that:
-
Provides OpenAI-compatible
/v1/chat/completionsendpoint -
Auto-discovers backend workers via service discovery (Kubernetes-native by default)
-
Routes requests and handles load balancing
-
Validates and preprocesses requests
Customizing Your Deployment #
Example structure:
apiVersion: nvidia.com/v1alpha1
kind: DynamoGraphDeployment
metadata:
name: my-llm
spec:
services:
Frontend:
dynamoNamespace: my-llm
componentType: frontend
replicas: 1
extraPodSpec:
mainContainer:
image: your-image
VllmDecodeWorker: # or SGLangDecodeWorker, TrtllmDecodeWorker
dynamoNamespace: dynamo-dev
componentType: worker
replicas: 1
envFromSecret: hf-token-secret # for HuggingFace models
resources:
limits:
gpu: "1"
extraPodSpec:
mainContainer:
image: your-image
command: ["/bin/sh", "-c"]
args:
- python3 -m dynamo.vllm --model YOUR_MODEL [--your-flags]
Copy to clipboard
Worker command examples per backend:
# vLLM worker
args:
- python3 -m dynamo.vllm --model Qwen/Qwen3-0.6B
# SGLang worker
args:
- >-
python3 -m dynamo.sglang
--model-path deepseek-ai/DeepSeek-R1-Distill-Llama-8B
--tp 1
--trust-remote-code
# TensorRT-LLM worker
args:
- python3 -m dynamo.trtllm
--model-path deepseek-ai/DeepSeek-R1-Distill-Llama-8B
--served-model-name deepseek-ai/DeepSeek-R1-Distill-Llama-8B
--extra-engine-args /workspace/examples/backends/trtllm/engine_configs/deepseek-r1-distill-llama-8b/agg.yaml
Copy to clipboard
Key customization points include:
-
Model Configuration: Specify model in the args command
-
Resource Allocation: Configure GPU requirements under
resources.limits -
Scaling: Set
replicasfor number of worker instances -
Routing Mode: Enable KV-cache routing by setting
DYN_ROUTER_MODE=kvin Frontend envs -
Worker Specialization: Add
--is-prefill-workerflag for disaggregated prefill workers
Additional Resources #
-
Examples - Complete working examples
-
Create Custom Deployments - Build your own CRDs
-
Managing Models with DynamoModel - Deploy LoRA adapters and manage models
-
Operator Documentation - How the platform works
-
Service Discovery - Discovery backends and configuration
-
Helm Charts - For advanced users
-
GitOps Deployment with FluxCD - For advanced users
-
Logging - For logging setup
-
Multinode Deployment - For multinode deployment
-
Grove - For grove details and custom installation
-
Monitoring - For monitoring setup
-
Model Caching with Fluid - For model caching with Fluid