LLM Deployment using TensorRT-LLM #

This directory contains examples and reference implementations for deploying Large Language Models (LLMs) in various configurations using TensorRT-LLM.

Use the Latest Release #

We recommend using the latest stable release of dynamo to avoid breaking changes:

You can find the latest release here and check out the corresponding branch with:

git checkout $(git describe --tags $(git rev-list --tags --max-count=1))

Copy to clipboard

Table of Contents #

Feature Support Matrix #

Core Dynamo Features #

Feature	TensorRT-LLM	Notes
Disaggregated Serving	✅
Conditional Disaggregation	🚧	Not supported yet
KV-Aware Routing	✅
SLA-Based Planner	✅
Load Based Planner	🚧	Planned
KVBM	✅

Large Scale P/D and WideEP Features #

Feature	TensorRT-LLM	Notes
WideEP	✅
DP Rank Routing	✅
GB200 Support	✅

TensorRT-LLM Quick Start #

Below we provide a guide that lets you run all of our the common deployment patterns on a single node.

Start Infrastructure Services (Local Development Only) #

For local/bare-metal development, start etcd and optionally NATS using Docker Compose:

docker compose -f deploy/docker-compose.yml up -d

Copy to clipboard

Note

etcd is optional but is the default local discovery backend. You can also use --kv_store file to use file system based discovery.
NATS is optional - only needed if using KV routing with events (default). You can disable it with --no-kv-events flag for prediction-based routing
On Kubernetes, neither is required when using the Dynamo operator, which explicitly sets DYN_DISCOVERY_BACKEND=kubernetes to enable native K8s service discovery (DynamoWorkerMetadata CRD)

Build container #

# TensorRT-LLM uses git-lfs, which needs to be installed in advance.
apt-get update && apt-get -y install git git-lfs

# On an x86 machine:
./container/build.sh --framework trtllm

# On an ARM machine:
./container/build.sh --framework trtllm --platform linux/arm64

# Build the container with the default experimental TensorRT-LLM commit
# WARNING: This is for experimental feature testing only.
# The container should not be used in a production environment.
./container/build.sh --framework trtllm --tensorrtllm-git-url https://github.com/NVIDIA/TensorRT-LLM.git --tensorrtllm-commit main

Copy to clipboard

Run container #

./container/run.sh --framework trtllm -it

Copy to clipboard

Single Node Examples #

Important

Below we provide some simple shell scripts that run the components for each configuration. Each shell script is simply running the python3 -m dynamo.frontend <args> to start up the ingress and using python3 -m dynamo.trtllm <args> to start up the workers. You can easily take each command and run them in separate terminals.

For detailed information about the architecture and how KV-aware routing works, see the KV Cache Routing documentation.

Aggregated #

cd $DYNAMO_HOME/examples/backends/trtllm
./launch/agg.sh

Copy to clipboard

Aggregated with KV Routing #

cd $DYNAMO_HOME/examples/backends/trtllm
./launch/agg_router.sh

Copy to clipboard

Disaggregated #

cd $DYNAMO_HOME/examples/backends/trtllm
./launch/disagg.sh

Copy to clipboard

Disaggregated with KV Routing #

Important

In disaggregated workflow, requests are routed to the prefill worker to maximize KV cache reuse.

cd $DYNAMO_HOME/examples/backends/trtllm
./launch/disagg_router.sh

Copy to clipboard

Aggregated with Multi-Token Prediction (MTP) and DeepSeek R1 #

cd $DYNAMO_HOME/examples/backends/trtllm

export AGG_ENGINE_ARGS=./engine_configs/deepseek-r1/agg/mtp/mtp_agg.yaml
export SERVED_MODEL_NAME="nvidia/DeepSeek-R1-FP4"
# nvidia/DeepSeek-R1-FP4 is a large model
export MODEL_PATH="nvidia/DeepSeek-R1-FP4"
./launch/agg.sh

Copy to clipboard

Notes:

There is a noticeable latency for the first two inference requests. Please send warm-up requests before starting the benchmark.
MTP performance may vary depending on the acceptance rate of predicted tokens, which is dependent on the dataset or queries used while benchmarking. Additionally, ignore_eos should generally be omitted or set to false when using MTP to avoid speculating garbage outputs and getting unrealistic acceptance rates.

Advanced Examples #

Below we provide a selected list of advanced examples. Please open up an issue if you’d like to see a specific example!

Multinode Deployment #

For comprehensive instructions on multinode serving, see the multinode-examples.md guide. It provides step-by-step deployment examples and configuration tips for running Dynamo with TensorRT-LLM across multiple nodes. While the walkthrough uses DeepSeek-R1 as the model, you can easily adapt the process for any supported model by updating the relevant configuration files. You can see Llama4+eagle guide to learn how to use these scripts when a single worker fits on the single node.

Speculative Decoding #

Llama 4 Maverick Instruct + Eagle Speculative Decoding

Kubernetes Deployment #

For complete Kubernetes deployment instructions, configurations, and troubleshooting, see TensorRT-LLM Kubernetes Deployment Guide.

Client #

See client section to learn how to send request to the deployment.

NOTE: To send a request to a multi-node deployment, target the node which is running python3 -m dynamo.frontend <args>.

Benchmarking #

To benchmark your deployment with AIPerf, see this utility script, configuring the model name and host based on your deployment: perf.sh

KV Cache Transfer in Disaggregated Serving #

Dynamo with TensorRT-LLM supports two methods for transferring KV cache in disaggregated serving: UCX (default) and NIXL (experimental). For detailed information and configuration instructions for each method, see the KV cache transfer guide.

Request Migration #

You can enable request migration to handle worker failures gracefully. Use the --migration-limit flag to specify how many times a request can be migrated to another worker:

# For decode and aggregated workers
python3 -m dynamo.trtllm ... --migration-limit=3

Copy to clipboard

Important

Prefill workers do not support request migration and must use --migration-limit=0 (the default). Prefill workers only process prompts and return KV cache state - they don’t maintain long-running generation requests that would benefit from migration.

See the Request Migration Architecture documentation for details on how this works.

Request Cancellation #

When a user cancels a request (e.g., by disconnecting from the frontend), the request is automatically cancelled across all workers, freeing compute resources for other requests.

Cancellation Support Matrix #

	Prefill	Decode
Aggregated	✅	✅
Disaggregated	✅	✅

For more details, see the Request Cancellation Architecture documentation.

Client #

See client section to learn how to send request to the deployment.

NOTE: To send a request to a multi-node deployment, target the node which is running python3 -m dynamo.frontend <args>.

Benchmarking #

To benchmark your deployment with AIPerf, see this utility script, configuring the model name and host based on your deployment: perf.sh

Multimodal support #

Dynamo with the TensorRT-LLM backend supports multimodal models, enabling you to process both text and images (or pre-computed embeddings) in a single request. For detailed setup instructions, example requests, and best practices, see the TensorRT-LLM Multimodal Guide.

Logits Processing #

Logits processors let you modify the next-token logits at every decoding step (e.g., to apply custom constraints or sampling transforms). Dynamo provides a backend-agnostic interface and an adapter for TensorRT-LLM so you can plug in custom processors.

How it works #

Interface: Implement dynamo.logits_processing.BaseLogitsProcessor which defines __call__(input_ids, logits) and modifies logits in-place.
TRT-LLM adapter: Use dynamo.trtllm.logits_processing.adapter.create_trtllm_adapters(...) to convert Dynamo processors into TRT-LLM-compatible processors and assign them to SamplingParams.logits_processor.
Examples: See example processors in lib/bindings/python/src/dynamo/logits_processing/examples/ ( temperature, hello_world).

Quick test: HelloWorld processor #

You can enable a test-only processor that forces the model to respond with “Hello world!”. This is useful to verify the wiring without modifying your model or engine code.

cd $DYNAMO_HOME/examples/backends/trtllm
export DYNAMO_ENABLE_TEST_LOGITS_PROCESSOR=1
./launch/agg.sh

Copy to clipboard

Notes:

When enabled, Dynamo initializes the tokenizer so the HelloWorld processor can map text to token IDs.
Expected chat response contains “Hello world”.

Bring your own processor #

Implement a processor by conforming to BaseLogitsProcessor and modify logits in-place. For example, temperature scaling:

from typing import Sequence
import torch
from dynamo.logits_processing import BaseLogitsProcessor

class TemperatureProcessor(BaseLogitsProcessor):
    def __init__(self, temperature: float = 1.0):
        if temperature <= 0:
            raise ValueError("Temperature must be positive")
        self.temperature = temperature

    def __call__(self, input_ids: Sequence[int], logits: torch.Tensor):
        if self.temperature == 1.0:
            return
        logits.div_(self.temperature)

Copy to clipboard

Wire it into TRT-LLM by adapting and attaching to SamplingParams:

from dynamo.trtllm.logits_processing.adapter import create_trtllm_adapters
from dynamo.logits_processing.examples import TemperatureProcessor

processors = [TemperatureProcessor(temperature=0.7)]
sampling_params.logits_processor = create_trtllm_adapters(processors)

Copy to clipboard

Current limitations #

Per-request processing only (batch size must be 1); beam width > 1 is not supported.
Processors must modify logits in-place and not return a new tensor.
If your processor needs tokenization, ensure the tokenizer is initialized (do not skip tokenizer init).

Performance Sweep #

For detailed instructions on running comprehensive performance sweeps across both aggregated and disaggregated serving configurations, see the TensorRT-LLM Benchmark Scripts for DeepSeek R1 model. This guide covers recommended benchmarking setups, usage of provided scripts, and best practices for evaluating system performance.

Dynamo KV Block Manager Integration #

Dynamo with TensorRT-LLM currently supports integration with the Dynamo KV Block Manager. This integration can significantly reduce time-to-first-token (TTFT) latency, particularly in usage patterns such as multi-turn conversations and repeated long-context requests.

Here is the instruction: Running KVBM in TensorRT-LLM .

LLM Deployment using TensorRT-LLM #

Use the Latest Release #

Table of Contents #

Feature Support Matrix #

Core Dynamo Features #

Large Scale P/D and WideEP Features #

TensorRT-LLM Quick Start #

Start Infrastructure Services (Local Development Only) #

Build container #

Run container #

Single Node Examples #

Aggregated #

Aggregated with KV Routing #

Disaggregated #

Disaggregated with KV Routing #

Aggregated with Multi-Token Prediction (MTP) and DeepSeek R1 #

Advanced Examples #

Multinode Deployment #

Speculative Decoding #

Kubernetes Deployment #

Client #

Benchmarking #

KV Cache Transfer in Disaggregated Serving #

Request Migration #

Request Cancellation #

Cancellation Support Matrix #

Client #

Benchmarking #

Multimodal support #

Logits Processing #

How it works #

Quick test: HelloWorld processor #

Bring your own processor #

Current limitations #

Performance Sweep #

Dynamo KV Block Manager Integration #

Related Articles