Deploying Dynamo on Kubernetes#

High-level guide to Dynamo Kubernetes deployments. Start here, then dive into specific guides.

Important Terminology#

Kubernetes Namespace: The K8s namespace where your DynamoGraphDeployment resource is created.

  • Used for: Resource isolation, RBAC, organizing deployments

  • Example: dynamo-system, dynamo-cloud, team-a-namespace

Dynamo Namespace: The logical namespace used by Dynamo components for service discovery via etcd.

  • Used for: Runtime component communication, service discovery

  • Specified in: .spec.services.<ServiceName>.dynamoNamespace field

  • Example: my-llm, production-model, dynamo-dev

These are independent. A single Kubernetes namespace can host multiple Dynamo namespaces, and vice versa.
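
For example, the two appear in different places in a deployment manifest (a fragment using the DGD fields shown later in this guide):

    apiVersion: nvidia.com/v1alpha1
    kind: DynamoGraphDeployment
    metadata:
      name: my-llm
      namespace: dynamo-system      # Kubernetes namespace (isolation, RBAC)
    spec:
      services:
        Frontend:
          dynamoNamespace: my-llm   # Dynamo namespace (etcd service discovery)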

Pre-deployment Checks#

Before deploying the platform, run the pre-deployment checks to verify that your cluster is ready. See the pre-deployment checks guide for details.
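
The linked checks are authoritative; as a quick manual sanity check (assuming the NVIDIA device plugin advertises GPUs via the nvidia.com/gpu resource), you can confirm cluster access and visible GPU capacity:

    # Confirm cluster access and node readiness
    kubectl get nodes

    # List allocatable GPUs per node (empty output means no GPUs are advertised)
    kubectl get nodes -o custom-columns='NAME:.metadata.name,GPUS:.status.allocatable.nvidia\.com/gpu'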

1. Install Platform First#

1. Set environment

    export NAMESPACE=dynamo-system
    export RELEASE_VERSION=0.x.x  # any version of Dynamo 0.3.2+ listed at https://github.com/ai-dynamo/dynamo/releases

2. Install CRDs (skip if on shared cluster where CRDs already exist)

    helm fetch https://helm.ngc.nvidia.com/nvidia/ai-dynamo/charts/dynamo-crds-${RELEASE_VERSION}.tgz
    helm install dynamo-crds dynamo-crds-${RELEASE_VERSION}.tgz --namespace default
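
To confirm the CRDs registered before moving on (exact CRD names may vary by release):

    kubectl get crds | grep -i dynamo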

3. Install Platform

    helm fetch https://helm.ngc.nvidia.com/nvidia/ai-dynamo/charts/dynamo-platform-${RELEASE_VERSION}.tgz
    helm install dynamo-platform dynamo-platform-${RELEASE_VERSION}.tgz --namespace ${NAMESPACE} --create-namespace
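
Before continuing, check that the platform pods come up (the wait command is a convenience, not a requirement):

    kubectl get pods -n ${NAMESPACE}
    kubectl wait --for=condition=Ready pods --all -n ${NAMESPACE} --timeout=300s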

For Shared/Multi-Tenant Clusters:

If your cluster runs namespace-restricted Dynamo operators (typical for shared clusters), add this flag in step 3:

    --set dynamo-operator.namespaceRestriction.enabled=true
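
That is, the full install command from step 3 becomes:

    helm install dynamo-platform dynamo-platform-${RELEASE_VERSION}.tgz --namespace ${NAMESPACE} --create-namespace \
      --set dynamo-operator.namespaceRestriction.enabled=true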

For more details or customization options (including multinode deployments), see Installation Guide for Dynamo Kubernetes Platform.

2. Choose Your Backend#

Each backend has deployment examples and configuration options:

The deployment patterns covered are Aggregated, Aggregated + Router, Disaggregated, Disaggregated + Router, Disaggregated + Planner, and Disaggregated Multi-node, for each of the following backends:

  • SGLang

  • TensorRT-LLM (🚧 support for some patterns is still in progress)

  • vLLM

3. Deploy Your First Model#

    export NAMESPACE=dynamo-system
    kubectl create namespace ${NAMESPACE}

To pull a model from Hugging Face, create a secret with your token:

    export HF_TOKEN=<Token-Here>
    kubectl create secret generic hf-token-secret \
      --from-literal=HF_TOKEN="$HF_TOKEN" \
      -n ${NAMESPACE}

Deploy an example (this one uses vLLM to serve a Qwen model with aggregated serving):

    kubectl apply -f components/backends/vllm/deploy/agg.yaml -n ${NAMESPACE}

Check status

    kubectl get dynamoGraphDeployment -n ${NAMESPACE}

Test it

    kubectl port-forward svc/vllm-agg-frontend 8000:8000 -n ${NAMESPACE}

    # in a separate terminal:
    curl http://localhost:8000/v1/models

For SLA-based autoscaling, see SLA Planner Quick Start Guide.

Understanding Dynamo’s Custom Resources#

Dynamo provides two main Kubernetes Custom Resources for deploying models:

DynamoGraphDeploymentRequest (DGDR) - Simplified SLA-Driven Configuration#

DGDR is the recommended approach for generating optimal configurations. It provides a high-level interface where you specify:

  • Model name and backend framework

  • SLA targets (latency requirements)

  • GPU type (optional)

Dynamo automatically handles profiling and generates an optimized DGD spec in the status. Perfect for:

  • SLA-driven configuration generation

  • Automated resource optimization

  • Users who want simplicity over control

Note: DGDR generates a DGD spec which you can then use to deploy.
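
As an illustration of the shape only (the field names under spec here are placeholders, not the authoritative schema; see the API Reference below):

    # Illustrative sketch -- placeholder field names, not the authoritative DGDR schema
    apiVersion: nvidia.com/v1alpha1
    kind: DynamoGraphDeploymentRequest
    metadata:
      name: my-llm-request
    spec:
      model: Qwen/Qwen3-0.6B   # model name (placeholder field)
      backend: vllm            # backend framework (placeholder field)
      # SLA targets (latency requirements) and an optional GPU type
      # are also specified here; consult the API Reference for the real fields.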

DynamoGraphDeployment (DGD) - Direct Configuration#

A lower-level interface that defines your complete inference pipeline:

  • Model configuration

  • Resource allocation (GPUs, memory)

  • Scaling policies

  • Frontend/backend connections

Use this when you need fine-grained control or have already completed profiling.

Refer to the API Reference and Documentation for more details.

📖 API Reference & Documentation#

Refer to the API Reference for detailed technical specifications of the DGD and DGDR resources.

Choosing Your Architecture Pattern#

When creating a deployment, select the architecture pattern that best fits your use case; each is applied with a single kubectl apply, as shown below:

  • Development / Testing - Use agg.yaml as the base configuration

  • Production with Load Balancing - Use agg_router.yaml to enable scalable, load-balanced inference

  • High Performance / Disaggregated - Use disagg_router.yaml for maximum throughput and modular scalability
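
Whichever pattern you choose, the deployment file is applied the same way (assuming the repository layout from the quick start above):

    # e.g. the production load-balanced pattern for vLLM
    kubectl apply -f components/backends/vllm/deploy/agg_router.yaml -n ${NAMESPACE}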

Frontend and Worker Components#

You can run the Frontend on one machine (e.g., a CPU node) and workers on different machines (GPU nodes). The Frontend serves as a framework-agnostic HTTP entry point that:

  • Provides an OpenAI-compatible /v1/chat/completions endpoint (example request after this list)

  • Auto-discovers backend workers via etcd

  • Routes requests and handles load balancing

  • Validates and preprocesses requests
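
Once the Frontend is reachable (for example via the port-forward shown earlier), any OpenAI-compatible client can talk to it. A minimal curl request, assuming the Qwen model from the quick start:

    curl http://localhost:8000/v1/chat/completions \
      -H "Content-Type: application/json" \
      -d '{
        "model": "Qwen/Qwen3-0.6B",
        "messages": [{"role": "user", "content": "Hello!"}],
        "max_tokens": 32
      }'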

Customizing Your Deployment#

Example structure:

    apiVersion: nvidia.com/v1alpha1
    kind: DynamoGraphDeployment
    metadata:
      name: my-llm
    spec:
      services:
        Frontend:
          dynamoNamespace: my-llm
          componentType: frontend
          replicas: 1
          extraPodSpec:
            mainContainer:
              image: your-image
        VllmDecodeWorker:  # or SGLangDecodeWorker, TrtllmDecodeWorker
          dynamoNamespace: my-llm  # must match the Frontend's Dynamo namespace for discovery
          componentType: worker
          replicas: 1
          envFromSecret: hf-token-secret  # for HuggingFace models
          resources:
            limits:
              gpu: "1"
          extraPodSpec:
            mainContainer:
              image: your-image
              command: ["/bin/sh", "-c"]
              args:
                - python3 -m dynamo.vllm --model YOUR_MODEL [--your-flags]

Worker command examples per backend:

vLLM worker

    args:
      - python3 -m dynamo.vllm --model Qwen/Qwen3-0.6B

SGLang worker

    args:
      - python3 -m dynamo.sglang --model-path deepseek-ai/DeepSeek-R1-Distill-Llama-8B --tp 1 --trust-remote-code

TensorRT-LLM worker

    args:
      - python3 -m dynamo.trtllm --model-path deepseek-ai/DeepSeek-R1-Distill-Llama-8B --served-model-name deepseek-ai/DeepSeek-R1-Distill-Llama-8B --extra-engine-args /workspace/recipes/deepseek-r1-distill-llama-8b/agg.yaml

Key customization points include:

  • Model Configuration: Specify model in the args command

  • Resource Allocation: Configure GPU requirements under resources.limits

  • Scaling: Set replicas for number of worker instances

  • Routing Mode: Enable KV-cache routing by setting DYN_ROUTER_MODE=kv in the Frontend envs (see the sketch after this list)

  • Worker Specialization: Add --is-prefill-worker flag for disaggregated prefill workers
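
For example, KV-cache routing on the Frontend (a minimal sketch; assuming envs entries follow the standard Kubernetes name/value form):

    services:
      Frontend:
        dynamoNamespace: my-llm
        componentType: frontend
        replicas: 1
        envs:
          - name: DYN_ROUTER_MODE   # enables KV-cache routing
            value: "kv"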
