Multimodal Inference in Dynamo: — NVIDIA Dynamo Documentation
URL Source: https://docs.nvidia.com/dynamo/latest/multimodal/multimodal_intro.html
Published Time: Fri, 07 Nov 2025 17:51:25 GMT
Multimodal Inference in Dynamo:#
You can find example workflows and reference implementations for deploying a multimodal model using Dynamo in multimodal examples.
EPD vs. PD Disaggregation#
Dynamo supports two primary approaches for processing multimodal inputs, which differ in how the initial media encoding step is handled relative to the main LLM inference engine.
1. EPD (Encode-Prefill-Decode) Disaggregation#
The EPD approach introduces an explicit separation of the media encoding step, maximizing the utilization of specialized hardware and increasing overall system efficiency for large multimodal models.
- Media Input: Image, video, audio, or an embedding URL is provided.
- Process Flow:
  - A dedicated Encode Worker is launched separately to handle the embedding extraction from the media input.
  - The extracted embeddings are transferred to the main engine via the NVIDIA Inference Xfer Library (NIXL).
  - The main Engine performs the remaining Prefill Decode Disaggregation steps to generate the output.
- Benefit: This disaggregation allows for the decoupling of media encoding hardware/resources from the main LLM serving engine, making the serving of large multimodal models more efficient.
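The EPD process flow above can be sketched as three separate stages. This is a minimal illustration only, not Dynamo's API: `encode_worker`, `transfer`, `prefill_decode`, and `run_epd` are hypothetical names, the "encoder" is a toy function, and `transfer` is a plain in-process copy standing in for the NIXL transfer.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Embeddings:
    """Container for the vectors produced by the encode stage."""
    vectors: List[List[float]]


def encode_worker(media: bytes) -> Embeddings:
    """Hypothetical encode worker: turn raw media bytes into embeddings.

    Toy encoder (assumption): one 2-dimensional vector per 4-byte chunk.
    """
    chunks = [media[i:i + 4] for i in range(0, len(media), 4)]
    return Embeddings([[float(len(c)), float(sum(c) % 7)] for c in chunks])


def transfer(embeddings: Embeddings) -> Embeddings:
    """Stand-in for the NIXL transfer; here it only copies the data."""
    return Embeddings([list(v) for v in embeddings.vectors])


def prefill_decode(prompt: str, embeddings: Embeddings) -> str:
    """Hypothetical main engine: consumes embeddings, never raw media."""
    return f"{prompt} [{len(embeddings.vectors)} media embeddings]"


def run_epd(prompt: str, media: bytes) -> str:
    """Chain the stages in the order described: encode, transfer, prefill/decode."""
    embeddings = encode_worker(media)      # runs on the encode worker
    received = transfer(embeddings)        # moved to the main engine
    return prefill_decode(prompt, received)


if __name__ == "__main__":
    print(run_epd("Describe the image:", b"12345678"))
```

The point of the sketch is the boundary: the main engine only ever receives embeddings, so the encode stage could in principle be placed and scaled on different resources.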
2. PD (Prefill-Decode) Disaggregation#
The PD approach is the more traditional method: media encoding is not split out into a separate worker, and the inference engine handles the entire process.
- Media Input: Image, video, or audio is loaded.
- Process Flow:
  - The main Engine receives the media input.
  - The Engine executes the full sequence: Encode + Prefill + Decode.
- Note: In this approach, the encoding step is executed within the same pipeline as the prefill and decode phases.
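The PD process flow above can be sketched as a single engine object that runs all three steps itself. This is an illustrative sketch under stated assumptions, not Dynamo's implementation: the class `AggregatedEngine` and its methods are hypothetical, and each step is a toy placeholder that only records the order of execution.

```python
from typing import List


class AggregatedEngine:
    """Hypothetical engine that runs encode, prefill, and decode in one pipeline."""

    def __init__(self) -> None:
        # Records which steps ran, in order, for inspection.
        self.trace: List[str] = []

    def _encode(self, media: bytes) -> List[float]:
        self.trace.append("encode")
        # Toy encoder (assumption): one float per input byte.
        return [float(b) for b in media]

    def _prefill(self, prompt: str, embeddings: List[float]) -> int:
        self.trace.append("prefill")
        # Toy context length: prompt words plus media embeddings.
        return len(prompt.split()) + len(embeddings)

    def _decode(self, context_len: int) -> str:
        self.trace.append("decode")
        return f"<output conditioned on {context_len} positions>"

    def generate(self, prompt: str, media: bytes) -> str:
        """Run Encode + Prefill + Decode inside the same engine."""
        embeddings = self._encode(media)
        context_len = self._prefill(prompt, embeddings)
        return self._decode(context_len)


if __name__ == "__main__":
    engine = AggregatedEngine()
    print(engine.generate("What is this", b"\x01\x02"))
    print(engine.trace)
```

Unlike the EPD flow, the raw media reaches the engine directly, so encoding shares the engine's resources with prefill and decode.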
Inference Framework Support Matrix#
Dynamo supports multimodal capabilities across leading LLM inference backends, including vLLM, TensorRT-LLM (TRT-LLM), and SGLang. The table below details the current support level for EPD/PD and various media types for each stack.
| Stack | EPD Support | PD Support | Image | Video | Audio |
|---|---|---|---|---|---|
| vLLM | ✅ | ✅ | ✅ | ✅ | 🚧 |
| TRT-LLM | ✅ (Currently via precomputed Embeddings URL) | ✅ | ✅ | ❌ | ❌ |
| SGLang | ✅ | ❌ | ✅ | ❌ | ❌ |
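For scripting a deployment choice, the table above can be transcribed into a small lookup. This is a convenience sketch, not part of Dynamo: `SUPPORT` and `is_supported` are hypothetical names, 🚧 is assumed to mean "in progress" and is stored as `None`, and the helper treats the mode and media columns as independent, which the table itself does not guarantee.

```python
from typing import Dict, Optional

# Transcription of the support matrix. True = supported, False = not supported,
# None = marked as in progress (assumed meaning of the construction-sign symbol).
SUPPORT: Dict[str, Dict[str, Optional[bool]]] = {
    "vLLM": {"EPD": True, "PD": True, "image": True, "video": True, "audio": None},
    # TRT-LLM EPD is listed as currently via a precomputed embeddings URL.
    "TRT-LLM": {"EPD": True, "PD": True, "image": True, "video": False, "audio": False},
    "SGLang": {"EPD": True, "PD": False, "image": True, "video": False, "audio": False},
}


def is_supported(stack: str, mode: str, media: str) -> bool:
    """Return True only if both the mode and the media type are marked supported.

    In-progress entries (None) count as not supported.
    """
    row = SUPPORT[stack]
    return row[mode] is True and row[media] is True


if __name__ == "__main__":
    for stack in SUPPORT:
        print(stack, is_supported(stack, "EPD", "image"))
```

Because the matrix changes between releases, a lookup like this should be regenerated from the current table rather than treated as fixed.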
Copyright © 2024-2025, NVIDIA CORPORATION & AFFILIATES.