# Dynamo Architecture Flow

This diagram shows the NVIDIA Dynamo disaggregated inference system as implemented in `components/backends/vllm`. Color-coded flows indicate different types of operations:

## 🔵 Main Request Flow (Blue)

The primary user journey through the system:

  1. Discovery (S1): Client discovers the service endpoint

  2. Request (S2): HTTP client sends API request to Frontend (OpenAI-compatible server on port 8000)

  3. Validate (S3): Frontend forwards request to Processor for validation and routing

  4. Route (S3): Processor routes the validated request to appropriate Decode Worker
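
From the client's point of view, this whole flow is a single OpenAI-compatible call. The sketch below assumes the Frontend is reachable at `localhost:8000` (per step S2); the model name and prompt are placeholders.

```python
import requests

# S1-S2: the client resolves the service endpoint (assumed here to be
# localhost:8000) and sends an OpenAI-compatible request to the Frontend.
response = requests.post(
    "http://localhost:8000/v1/chat/completions",
    json={
        "model": "example-model",   # placeholder model id
        "messages": [{"role": "user", "content": "Hello!"}],
        "max_tokens": 64,
        "stream": False,
    },
    timeout=60,
)
response.raise_for_status()

# S3 onward happens server-side; the client only sees the final response (S13).
print(response.json()["choices"][0]["message"]["content"])
```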

## 🟠 Decision and Allocation Flow (Orange)

The system’s intelligent routing and resource allocation:

  1. Query (S4): Decode Worker queries for prefix cache hits to optimize processing

  2. Disagg Decision (S5): Based on prefill length and queue size, the system decides whether it needs remote prefill (see the sketch after this list)

     5a. Allocate (S5a): Decode Worker pre-allocates KV cache blocks in its local GPU memory

  3. Queue (S6): If remote prefill is required, the system puts the RemotePrefillRequest with block IDs into the PrefillQueue
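
The disaggregation decision in S5 can be sketched as a simple policy over prefill length and queue depth. The thresholds and field names below are illustrative assumptions, not Dynamo's actual configuration.

```python
from dataclasses import dataclass


@dataclass
class DisaggConfig:
    # Illustrative thresholds, not Dynamo's actual defaults.
    max_local_prefill_tokens: int = 512
    max_prefill_queue_depth: int = 8


def needs_remote_prefill(prefill_len: int, prefill_queue_depth: int,
                         cfg: DisaggConfig = DisaggConfig()) -> bool:
    """S5: offload long prompts to a PrefillWorker unless the prefill
    queue is already saturated (hypothetical policy)."""
    if prefill_len <= cfg.max_local_prefill_tokens:
        return False                   # short prompt: prefill locally
    return prefill_queue_depth < cfg.max_prefill_queue_depth


# S5a/S6: when this returns True, the Decode Worker pre-allocates KV blocks
# and enqueues a RemotePrefillRequest carrying their IDs.
print(needs_remote_prefill(prefill_len=2048, prefill_queue_depth=3))
```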

## 🟢 Prefill Worker Flow (Green)

The dedicated prefill processing pipeline:

  1. NATS Pull (S7): PrefillQueue uses a NATS consumer group to distribute work to available PrefillWorkers

  2. Load Metadata (S8): PrefillWorker loads NIXL metadata from ETCD to establish GPU communication

  3. Prefill (S9): Worker executes the prefill computation on the input tokens

  4. NIXL Transfer (S10): Direct GPU-to-GPU transfer writes the prefilled KV cache to the Decode Worker’s pre-allocated blocks
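
A minimal sketch of this pipeline's pull loop (S7) using the `nats-py` JetStream client; the subject name, durable-consumer name, and message payload fields are assumptions for illustration, and steps S8–S10 are left as comments.

```python
import asyncio
import json

import nats
from nats.errors import TimeoutError as NatsTimeoutError


async def prefill_worker_loop() -> None:
    # S7: attach to the PrefillQueue as a JetStream pull consumer.
    # Subject and durable names are illustrative, not Dynamo's actual names.
    nc = await nats.connect("nats://localhost:4222")
    js = nc.jetstream()
    sub = await js.pull_subscribe("prefill.requests", durable="prefill-workers")

    while True:
        try:
            msgs = await sub.fetch(batch=1, timeout=5)
        except NatsTimeoutError:
            continue                              # nothing queued right now
        for msg in msgs:
            req = json.loads(msg.data)            # RemotePrefillRequest payload (shape assumed)
            # S8: load NIXL metadata for the target Decode Worker from ETCD.
            # S9: run prefill on the request's input tokens.
            # S10: NIXL-write the prefilled KV cache into req["block_ids"].
            await msg.ack()


asyncio.run(prefill_worker_loop())
```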

## 🟣 Completion Flow (Purple)

The response generation and delivery:

  1. Notify (S11): PrefillWorker sends completion notification to Decode Worker

  2. Decode (S12): Decode Worker decodes from its local KV cache containing prefilled data

  3. Response (S13): The system sends the generated response to the Processor for post-processing, then through the Frontend to the Client
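
The handoff from S11 to S12 can be pictured as the Decode Worker waiting on a completion signal before it starts decoding. The `asyncio.Event` below is a stand-in for whatever notification mechanism the workers actually use.

```python
import asyncio


async def decode_after_prefill(prefill_done: asyncio.Event, request_id: str) -> None:
    # S11: wait for the PrefillWorker's completion notification
    # (the Event here stands in for the real signalling mechanism).
    await prefill_done.wait()
    # S12: decode from the local KV cache, which now holds the prefilled blocks.
    # S13: generated tokens flow back through Processor and Frontend to the client.
    print(f"request {request_id}: prefill complete, starting decode")


async def main() -> None:
    done = asyncio.Event()
    task = asyncio.create_task(decode_after_prefill(done, "req-42"))
    done.set()            # simulate the S11 notification arriving
    await task


asyncio.run(main())
```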

## 🔗 Infrastructure Connections (Dotted lines)

Coordination and messaging support:

### ETCD Connections (Gray, dotted)

  • Frontend, Processor, Planner: Service discovery and registration

  • Decode Worker, PrefillWorker: NIXL metadata storage for GPU communication setup
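
A sketch of the ETCD exchange for NIXL metadata using the `etcd3` Python client; the key layout and payload fields are assumptions, not Dynamo's actual schema.

```python
import json

import etcd3

client = etcd3.client(host="localhost", port=2379)

# Decode Worker side: publish NIXL metadata so PrefillWorkers can set up
# direct GPU-to-GPU transfers (key and fields are illustrative).
client.put(
    "/nixl/decode-worker-0",
    json.dumps({"agent": "decode-worker-0", "device": 0, "conn_info": "..."}),
)

# PrefillWorker side (S8): load the metadata before initiating the transfer.
value, _meta = client.get("/nixl/decode-worker-0")
metadata = json.loads(value)
```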

### NATS Connections (Teal, dotted)

  • PrefillQueue: JetStream consumer group for reliable work distribution

  • Processor: Load balancing across workers

### Planning Connections (Gold, dotted)

  • Frontend → Planner: Metrics collection for auto-scaling decisions

  • Planner → Workers: Resource scaling commands for both Decode Worker and PrefillWorker
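
A toy version of the Planner's scaling decision. The real policy uses metrics collected from the Frontend and is more involved, so the rule and numbers below are purely illustrative.

```python
import math


def plan_decode_replicas(pending_requests: int,
                         target_queue_per_worker: int = 4) -> int:
    """Toy auto-scaling rule: keep roughly `target_queue_per_worker`
    requests per Decode Worker. Numbers are illustrative only."""
    return max(1, math.ceil(pending_requests / target_queue_per_worker))


# Example: 18 queued requests with a target of 4 per worker -> 5 replicas.
print(plan_decode_replicas(pending_requests=18))
```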

## Technical Implementation Details

### NIXL (NVIDIA Inference Xfer Library)

  • Enables high-speed GPU-to-GPU data transfers using NVLink/PCIe

  • Decode Worker publishes GPU metadata to ETCD for coordination

  • PrefillWorker loads metadata to establish direct communication channels

  • Block-based transfers (64–128 tokens per block) for efficient batching
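
The per-block transfer size is easy to estimate. The sketch below assumes illustrative model dimensions (32 layers, 8 KV heads, head dimension 128, fp16); for a 64-token block this works out to about 8 MiB.

```python
def kv_block_bytes(tokens_per_block: int, num_layers: int,
                   num_kv_heads: int, head_dim: int,
                   bytes_per_elem: int = 2) -> int:
    """Approximate bytes moved per KV block over NIXL. The leading factor
    of 2 counts both the K and V tensors."""
    return 2 * tokens_per_block * num_layers * num_kv_heads * head_dim * bytes_per_elem


# 64-token block, 32 layers, 8 KV heads, head_dim 128, fp16:
# 2 * 64 * 32 * 8 * 128 * 2 bytes = 8 MiB per block.
print(kv_block_bytes(64, 32, 8, 128) / 2**20, "MiB")
```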

### Disaggregated KV Cache

  • Each Decode Worker maintains local KV cache in its GPU memory

  • No shared storage bottlenecks—all transfers are direct worker-to-worker

  • Pre-allocated blocks ensure deterministic memory layout and performance
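
Pre-allocation can be pictured as a pool of fixed-size blocks whose IDs are handed to the PrefillWorker inside the RemotePrefillRequest. The pool below is a toy sketch of that idea, not Dynamo's allocator.

```python
class BlockPool:
    """Toy pre-allocated KV block pool: block IDs map to fixed regions of
    GPU memory, so a RemotePrefillRequest can name its exact destination
    blocks up front (API and layout are assumptions for illustration)."""

    def __init__(self, num_blocks: int) -> None:
        self.free = list(range(num_blocks))

    def allocate(self, n: int) -> list[int]:
        if n > len(self.free):
            raise RuntimeError("KV cache exhausted")
        ids, self.free = self.free[:n], self.free[n:]
        return ids                     # S5a: these IDs travel in the request

    def release(self, ids: list[int]) -> None:
        self.free.extend(ids)          # blocks are reused once the request finishes


pool = BlockPool(num_blocks=1024)
block_ids = pool.allocate(16)          # e.g. a 1024-token prompt at 64 tokens/block
```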
