Multimodal Inference in Dynamo: — NVIDIA Dynamo Documentation

Last updated: 11/7/2025

URL Source: https://docs.nvidia.com/dynamo/latest/multimodal/multimodal_intro.html

# Multimodal Inference in Dynamo

You can find example workflows and reference implementations for deploying multimodal models with Dynamo in the multimodal examples.

## EPD vs. PD Disaggregation

Dynamo supports two primary approaches for processing multimodal inputs, which differ in where the initial media encoding step runs relative to the main LLM inference engine.

### 1. EPD (Encode-Prefill-Decode) Disaggregation

The EPD approach introduces an explicit separation of the media encoding step, maximizing the utilization of specialized hardware and increasing overall system efficiency for large multimodal models.

  • Media Input: Image, video, audio, or an embedding URL is provided.

  • Process Flow:

    1. A dedicated Encode Worker is launched separately to handle the embedding extraction from the media input.

    2. The extracted embeddings are transferred to the main engine via the NVIDIA Inference Xfer Library (NIXL).

    3. The main Engine performs the remaining prefill and decode steps (as in standard Prefill/Decode Disaggregation) to generate the output.

  • Benefit: This disaggregation allows for the decoupling of media encoding hardware/resources from the main LLM serving engine, making the serving of large multimodal models more efficient.
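Regardless of which disaggregation mode a deployment uses, clients typically reach it through the same OpenAI-compatible HTTP frontend. Below is a minimal sketch of a multimodal chat request payload; the model name, image URL, and endpoint path are placeholders for your deployment, not values taken from this page:

```python
import json

# Placeholder model name and image URL -- substitute values from your deployment.
payload = {
    "model": "llava-hf/llava-1.5-7b-hf",
    "messages": [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe this image."},
                {"type": "image_url", "image_url": {"url": "http://example.com/cat.png"}},
            ],
        }
    ],
    "max_tokens": 128,
}

# POST this body to the frontend's OpenAI-compatible route, e.g.:
#   requests.post("http://localhost:8000/v1/chat/completions", json=payload)
print(json.dumps(payload, indent=2))
```

The frontend decides internally whether the image is routed to a dedicated encode worker (EPD) or encoded inline by the engine (PD); the client payload is the same either way.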

### 2. PD (Prefill-Decode) Disaggregation

The PD approach is the more traditional method: media encoding stays aggregated with the inference engine, which handles the entire process itself.

  • Media Input: Image, video, or audio is loaded.

  • Process Flow:

    1. The main Engine receives the media input.

    2. The Engine executes the full sequence: Encode + Prefill + Decode.

  • Note: In this approach, the encoding step is executed within the same pipeline as the prefill and decode phases.
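The contrast between the two flows can be sketched conceptually. In this sketch, `encode`, `prefill`, and `decode` are stand-in functions for illustration only, not Dynamo APIs:

```python
def encode(media: str) -> list[float]:
    """Stand-in for the vision/audio encoder that produces embeddings."""
    return [float(len(media))]  # dummy embedding

def prefill(embeddings: list[float], prompt: str) -> dict:
    """Stand-in for prefill: builds engine state from prompt + embeddings."""
    return {"kv_cache": (prompt, tuple(embeddings))}

def decode(state: dict) -> str:
    """Stand-in for autoregressive decoding."""
    return f"response for {state['kv_cache'][0]}"

# PD (aggregated encoding): one engine runs encode + prefill + decode in sequence.
def pd_engine(media: str, prompt: str) -> str:
    return decode(prefill(encode(media), prompt))

# EPD: a separate encode worker produces embeddings, which are transferred
# (via NIXL in Dynamo) to the main engine, which only prefills and decodes.
def encode_worker(media: str) -> list[float]:
    return encode(media)

def main_engine(embeddings: list[float], prompt: str) -> str:
    return decode(prefill(embeddings, prompt))
```

Both paths produce the same result; EPD simply moves the encode step onto separate hardware so the main engine never touches raw media.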

## Inference Framework Support Matrix

Dynamo supports multimodal capabilities across leading LLM inference backends, including vLLM, TensorRT-LLM (TRT-LLM), and SGLang. The table below details the current support level for EPD/PD and various media types for each stack.

| Stack | EPD Support | PD Support | Image | Video | Audio |
|-------|-------------|------------|-------|-------|-------|
| vLLM | 🚧 | | | | |
| TRT-LLM | ✅ (currently via precomputed embeddings URL) | | | | |
| SGLang | | | | | |


Copyright © 2024-2025, NVIDIA CORPORATION & AFFILIATES.
