Multimodal Inference in Dynamo: — NVIDIA Dynamo Documentation

Last updated: 11/7/2025

URL Source: https://docs.nvidia.com/dynamo/latest/multimodal/multimodal_intro.html

# Multimodal Inference in Dynamo

You can find example workflows and reference implementations for deploying multimodal models with Dynamo in the multimodal examples.

## EPD vs. PD Disaggregation

Dynamo supports two primary approaches for processing multimodal inputs, which differ in where the initial media encoding step runs relative to the main LLM inference engine.

### 1. EPD (Encode-Prefill-Decode) Disaggregation

The EPD approach introduces an explicit separation of the media encoding step, maximizing the utilization of specialized hardware and increasing overall system efficiency for large multimodal models.

  • Media Input: Image, video, audio, or an embedding URL is provided.

  • Process Flow:

    1. A dedicated Encode Worker is launched separately to handle the embedding extraction from the media input.

    2. The extracted embeddings are transferred to the main engine via the NVIDIA Inference Xfer Library (NIXL).

    3. The main Engine performs the remaining prefill and decode steps (as in standard Prefill/Decode Disaggregation) to generate the output.

  • Benefit: This disaggregation allows for the decoupling of media encoding hardware/resources from the main LLM serving engine, making the serving of large multimodal models more efficient.
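Regardless of which disaggregation mode a deployment uses, clients typically reach it through the same OpenAI-compatible HTTP frontend. Below is a minimal sketch of a multimodal chat request payload; the model name, image URL, and endpoint path are placeholders for your deployment, not values taken from this page:

```python
import json

# Placeholder model name and image URL -- substitute values from your deployment.
payload = {
    "model": "llava-hf/llava-1.5-7b-hf",
    "messages": [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe this image."},
                {"type": "image_url", "image_url": {"url": "http://example.com/cat.png"}},
            ],
        }
    ],
    "max_tokens": 128,
}

# POST this body to the frontend's OpenAI-compatible route, e.g.:
#   requests.post("http://localhost:8000/v1/chat/completions", json=payload)
print(json.dumps(payload, indent=2))
```

The frontend decides internally whether the image is routed to a dedicated encode worker (EPD) or encoded inline by the engine (PD); the client payload is the same either way.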

### 2. PD (Prefill-Decode) Disaggregation

The PD approach is the more traditional method: media encoding stays aggregated with the inference engine, which handles the entire process itself.

  • Media Input: Image, video, or audio is loaded.

  • Process Flow:

    1. The main Engine receives the media input.

    2. The Engine executes the full sequence: Encode + Prefill + Decode.

  • Note: In this approach, the encoding step is executed within the same pipeline as the prefill and decode phases.
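The contrast between the two flows can be sketched conceptually. In this sketch, `encode`, `prefill`, and `decode` are stand-in functions for illustration only, not Dynamo APIs:

```python
def encode(media: str) -> list[float]:
    """Stand-in for the vision/audio encoder that produces embeddings."""
    return [float(len(media))]  # dummy embedding

def prefill(embeddings: list[float], prompt: str) -> dict:
    """Stand-in for prefill: builds engine state from prompt + embeddings."""
    return {"kv_cache": (prompt, tuple(embeddings))}

def decode(state: dict) -> str:
    """Stand-in for autoregressive decoding."""
    return f"response for {state['kv_cache'][0]}"

# PD (aggregated encoding): one engine runs encode + prefill + decode in sequence.
def pd_engine(media: str, prompt: str) -> str:
    return decode(prefill(encode(media), prompt))

# EPD: a separate encode worker produces embeddings, which are transferred
# (via NIXL in Dynamo) to the main engine, which only prefills and decodes.
def encode_worker(media: str) -> list[float]:
    return encode(media)

def main_engine(embeddings: list[float], prompt: str) -> str:
    return decode(prefill(embeddings, prompt))
```

Both paths produce the same result; EPD simply moves the encode step onto separate hardware so the main engine never touches raw media.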

## Inference Framework Support Matrix

Dynamo supports multimodal capabilities across leading LLM inference backends, including vLLM, TensorRT-LLM (TRT-LLM), and SGLang. The table below details the current support level for EPD/PD and various media types for each stack.

| Stack | EPD Support | PD Support | Image | Video | Audio |
|-------|-------------|------------|-------|-------|-------|
| vLLM | 🚧 | | | | |
| TRT-LLM | ✅ (currently via precomputed embeddings URL) | | | | |
| SGLang | | | | | |


Copyright © 2024-2025, NVIDIA CORPORATION & AFFILIATES.
