LLM Deployment using TensorRT-LLM — NVIDIA Dynamo Documentation
LLM Deployment using TensorRT-LLM #
This directory contains examples and reference implementations for deploying Large Language Models (LLMs) in various configurations using TensorRT-LLM.
Use the Latest Release #
We recommend using the latest stable release of dynamo to avoid breaking changes:
You can find the latest release here and check out the corresponding branch with:
git checkout $(git describe --tags $(git rev-list --tags --max-count=1))
Copy to clipboard
Table of Contents #
Feature Support Matrix #
Core Dynamo Features #
| Feature | TensorRT-LLM | Notes |
|---|---|---|
| Disaggregated Serving | ✅ | |
| Conditional Disaggregation | 🚧 | Not supported yet |
| KV-Aware Routing | ✅ | |
| SLA-Based Planner | ✅ | |
| Load Based Planner | 🚧 | Planned |
| KVBM | ✅ |
Large Scale P/D and WideEP Features #
| Feature | TensorRT-LLM | Notes |
|---|---|---|
| WideEP | ✅ | |
| DP Rank Routing | ✅ | |
| GB200 Support | ✅ |
TensorRT-LLM Quick Start #
Below we provide a guide that lets you run all of our the common deployment patterns on a single node.
Start Infrastructure Services (Local Development Only) #
For local/bare-metal development, start etcd and optionally NATS using Docker Compose:
docker compose -f deploy/docker-compose.yml up -d
Copy to clipboard
Note
-
etcd is optional but is the default local discovery backend. You can also use
--kv_store fileto use file system based discovery. -
NATS is optional - only needed if using KV routing with events (default). You can disable it with
--no-kv-eventsflag for prediction-based routing -
On Kubernetes, neither is required when using the Dynamo operator, which explicitly sets
DYN_DISCOVERY_BACKEND=kubernetesto enable native K8s service discovery (DynamoWorkerMetadata CRD)
Build container #
# TensorRT-LLM uses git-lfs, which needs to be installed in advance.
apt-get update && apt-get -y install git git-lfs
# On an x86 machine:
./container/build.sh --framework trtllm
# On an ARM machine:
./container/build.sh --framework trtllm --platform linux/arm64
# Build the container with the default experimental TensorRT-LLM commit
# WARNING: This is for experimental feature testing only.
# The container should not be used in a production environment.
./container/build.sh --framework trtllm --tensorrtllm-git-url https://github.com/NVIDIA/TensorRT-LLM.git --tensorrtllm-commit main
Copy to clipboard
Run container #
./container/run.sh --framework trtllm -it
Copy to clipboard
Single Node Examples #
Important
Below we provide some simple shell scripts that run the components for each configuration. Each shell script is simply running the python3 -m dynamo.frontend <args> to start up the ingress and using python3 -m dynamo.trtllm <args> to start up the workers. You can easily take each command and run them in separate terminals.
For detailed information about the architecture and how KV-aware routing works, see the KV Cache Routing documentation.
Aggregated #
cd $DYNAMO_HOME/examples/backends/trtllm
./launch/agg.sh
Copy to clipboard
Aggregated with KV Routing #
cd $DYNAMO_HOME/examples/backends/trtllm
./launch/agg_router.sh
Copy to clipboard
Disaggregated #
cd $DYNAMO_HOME/examples/backends/trtllm
./launch/disagg.sh
Copy to clipboard
Disaggregated with KV Routing #
Important
In disaggregated workflow, requests are routed to the prefill worker to maximize KV cache reuse.
cd $DYNAMO_HOME/examples/backends/trtllm
./launch/disagg_router.sh
Copy to clipboard
Aggregated with Multi-Token Prediction (MTP) and DeepSeek R1 #
cd $DYNAMO_HOME/examples/backends/trtllm
export AGG_ENGINE_ARGS=./engine_configs/deepseek-r1/agg/mtp/mtp_agg.yaml
export SERVED_MODEL_NAME="nvidia/DeepSeek-R1-FP4"
# nvidia/DeepSeek-R1-FP4 is a large model
export MODEL_PATH="nvidia/DeepSeek-R1-FP4"
./launch/agg.sh
Copy to clipboard
Notes:
-
There is a noticeable latency for the first two inference requests. Please send warm-up requests before starting the benchmark.
-
MTP performance may vary depending on the acceptance rate of predicted tokens, which is dependent on the dataset or queries used while benchmarking. Additionally,
ignore_eosshould generally be omitted or set tofalsewhen using MTP to avoid speculating garbage outputs and getting unrealistic acceptance rates.
Advanced Examples #
Below we provide a selected list of advanced examples. Please open up an issue if you’d like to see a specific example!
Multinode Deployment #
For comprehensive instructions on multinode serving, see the multinode-examples.md guide. It provides step-by-step deployment examples and configuration tips for running Dynamo with TensorRT-LLM across multiple nodes. While the walkthrough uses DeepSeek-R1 as the model, you can easily adapt the process for any supported model by updating the relevant configuration files. You can see Llama4+eagle guide to learn how to use these scripts when a single worker fits on the single node.
Speculative Decoding #
Kubernetes Deployment #
For complete Kubernetes deployment instructions, configurations, and troubleshooting, see TensorRT-LLM Kubernetes Deployment Guide.
Client #
See client section to learn how to send request to the deployment.
NOTE: To send a request to a multi-node deployment, target the node which is running python3 -m dynamo.frontend <args>.
Benchmarking #
To benchmark your deployment with AIPerf, see this utility script, configuring the
model name and host based on your deployment: perf.sh
KV Cache Transfer in Disaggregated Serving #
Dynamo with TensorRT-LLM supports two methods for transferring KV cache in disaggregated serving: UCX (default) and NIXL (experimental). For detailed information and configuration instructions for each method, see the KV cache transfer guide.
Request Migration #
You can enable request migration to handle worker failures gracefully. Use the --migration-limit flag to specify how many times a request can be migrated to another worker:
# For decode and aggregated workers
python3 -m dynamo.trtllm ... --migration-limit=3
Copy to clipboard
Important
Prefill workers do not support request migration and must use --migration-limit=0 (the default). Prefill workers only process prompts and return KV cache state - they don’t maintain long-running generation requests that would benefit from migration.
See the Request Migration Architecture documentation for details on how this works.
Request Cancellation #
When a user cancels a request (e.g., by disconnecting from the frontend), the request is automatically cancelled across all workers, freeing compute resources for other requests.
Cancellation Support Matrix #
| Prefill | Decode | |
|---|---|---|
| Aggregated | ✅ | ✅ |
| Disaggregated | ✅ | ✅ |
For more details, see the Request Cancellation Architecture documentation.
Client #
See client section to learn how to send request to the deployment.
NOTE: To send a request to a multi-node deployment, target the node which is running python3 -m dynamo.frontend <args>.
Benchmarking #
To benchmark your deployment with AIPerf, see this utility script, configuring the
model name and host based on your deployment: perf.sh
Multimodal support #
Dynamo with the TensorRT-LLM backend supports multimodal models, enabling you to process both text and images (or pre-computed embeddings) in a single request. For detailed setup instructions, example requests, and best practices, see the TensorRT-LLM Multimodal Guide.
Logits Processing #
Logits processors let you modify the next-token logits at every decoding step (e.g., to apply custom constraints or sampling transforms). Dynamo provides a backend-agnostic interface and an adapter for TensorRT-LLM so you can plug in custom processors.
How it works #
-
Interface: Implement
dynamo.logits_processing.BaseLogitsProcessorwhich defines__call__(input_ids, logits)and modifieslogitsin-place. -
TRT-LLM adapter: Use
dynamo.trtllm.logits_processing.adapter.create_trtllm_adapters(...)to convert Dynamo processors into TRT-LLM-compatible processors and assign them toSamplingParams.logits_processor. -
Examples: See example processors in
lib/bindings/python/src/dynamo/logits_processing/examples/( temperature, hello_world).
Quick test: HelloWorld processor #
You can enable a test-only processor that forces the model to respond with “Hello world!”. This is useful to verify the wiring without modifying your model or engine code.
cd $DYNAMO_HOME/examples/backends/trtllm
export DYNAMO_ENABLE_TEST_LOGITS_PROCESSOR=1
./launch/agg.sh
Copy to clipboard
Notes:
-
When enabled, Dynamo initializes the tokenizer so the HelloWorld processor can map text to token IDs.
-
Expected chat response contains “Hello world”.
Bring your own processor #
Implement a processor by conforming to BaseLogitsProcessor and modify logits in-place. For example, temperature scaling:
from typing import Sequence
import torch
from dynamo.logits_processing import BaseLogitsProcessor
class TemperatureProcessor(BaseLogitsProcessor):
def __init__(self, temperature: float = 1.0):
if temperature <= 0:
raise ValueError("Temperature must be positive")
self.temperature = temperature
def __call__(self, input_ids: Sequence[int], logits: torch.Tensor):
if self.temperature == 1.0:
return
logits.div_(self.temperature)
Copy to clipboard
Wire it into TRT-LLM by adapting and attaching to SamplingParams:
from dynamo.trtllm.logits_processing.adapter import create_trtllm_adapters
from dynamo.logits_processing.examples import TemperatureProcessor
processors = [TemperatureProcessor(temperature=0.7)]
sampling_params.logits_processor = create_trtllm_adapters(processors)
Copy to clipboard
Current limitations #
-
Per-request processing only (batch size must be 1); beam width > 1 is not supported.
-
Processors must modify logits in-place and not return a new tensor.
-
If your processor needs tokenization, ensure the tokenizer is initialized (do not skip tokenizer init).
Performance Sweep #
For detailed instructions on running comprehensive performance sweeps across both aggregated and disaggregated serving configurations, see the TensorRT-LLM Benchmark Scripts for DeepSeek R1 model. This guide covers recommended benchmarking setups, usage of provided scripts, and best practices for evaluating system performance.
Dynamo KV Block Manager Integration #
Dynamo with TensorRT-LLM currently supports integration with the Dynamo KV Block Manager. This integration can significantly reduce time-to-first-token (TTFT) latency, particularly in usage patterns such as multi-turn conversations and repeated long-context requests.
Here is the instruction: Running KVBM in TensorRT-LLM .