# Dynamo Architecture Flow
This diagram shows the NVIDIA Dynamo disaggregated inference system as implemented in `components/backends/vllm`. Color-coded flows indicate different types of operations:
## 🔵 Main Request Flow (Blue)
The primary user journey through the system:
- Discovery (S1): The client discovers the service endpoint
- Request (S2): The HTTP client sends an API request to the Frontend, an OpenAI-compatible server on port 8000 (see the client sketch after this list)
- Validate (S3): The Frontend forwards the request to the Processor for validation and routing
- Route (S3): The Processor routes the validated request to an appropriate Decode Worker
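Because the Frontend speaks the OpenAI HTTP API, steps S1–S3 can be exercised with any OpenAI-style client. Below is a minimal sketch, assuming a deployment reachable on localhost; the model name is a placeholder for whatever model the deployment actually serves:

```python
# Minimal client for the OpenAI-compatible Frontend (port 8000).
# The model name is a placeholder, not a Dynamo default.
import requests

resp = requests.post(
    "http://localhost:8000/v1/chat/completions",
    json={
        "model": "example-model",  # placeholder: use the served model's name
        "messages": [{"role": "user", "content": "Hello!"}],
        "max_tokens": 64,
        "stream": False,
    },
    timeout=60,
)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])
```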
## 🟠 Decision and Allocation Flow (Orange)
The system’s intelligent routing and resource allocation:
- Query (S4): The Decode Worker queries for prefix cache hits to optimize processing
- Disagg Decision (S5): Based on prefill length and queue size, the system decides whether remote prefill is needed (see the sketch after this list)
  - Allocate (S5a): The Decode Worker pre-allocates KV cache blocks in its local GPU memory
- Queue (S6): If remote prefill is required, the system puts a RemotePrefillRequest with the block IDs into the PrefillQueue
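A minimal sketch of the shape of the disaggregation decision in S5. The threshold names and values are illustrative assumptions, not Dynamo's actual configuration; the real policy lives in the vLLM backend:

```python
# Illustrative-only sketch of the remote-prefill decision (S5).
# Threshold names and values are hypothetical, not Dynamo defaults.
PREFILL_LENGTH_THRESHOLD = 512  # tokens; below this, prefill locally
MAX_PREFILL_QUEUE_DEPTH = 8     # back off if the PrefillQueue is saturated

def needs_remote_prefill(prefill_len: int, queue_depth: int) -> bool:
    """Decide whether to hand prefill off to a dedicated PrefillWorker.

    Short prompts are cheap enough to prefill on the Decode Worker;
    long prompts go remote unless the PrefillQueue is already backed up.
    """
    if prefill_len < PREFILL_LENGTH_THRESHOLD:
        return False
    return queue_depth < MAX_PREFILL_QUEUE_DEPTH
```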
## 🟢 Prefill Worker Flow (Green)
The dedicated prefill processing pipeline:
- NATS Pull (S7): The PrefillQueue uses a NATS consumer group to distribute work to available PrefillWorkers (see the pull-consumer sketch after this list)
- Load Metadata (S8): The PrefillWorker loads NIXL metadata from ETCD to establish GPU communication
- Prefill (S9): The worker executes the prefill computation on the input tokens
- NIXL Transfer (S10): A direct GPU-to-GPU transfer writes the prefilled KV cache into the Decode Worker's pre-allocated blocks
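For step S7, here is a minimal sketch of a JetStream pull consumer using the nats-py client. The subject and durable names are assumptions for illustration, not the names Dynamo uses internally:

```python
# Sketch of a PrefillWorker pulling work from a JetStream consumer
# group (S7). Subject/durable names are assumptions, not Dynamo's.
import asyncio
import nats
from nats.errors import TimeoutError as NatsTimeoutError

async def pull_prefill_requests():
    nc = await nats.connect("nats://localhost:4222")
    js = nc.jetstream()
    # Workers sharing one durable name form a consumer group: each
    # RemotePrefillRequest is delivered to exactly one PrefillWorker.
    sub = await js.pull_subscribe("prefill.requests", durable="prefill-workers")
    while True:
        try:
            msgs = await sub.fetch(batch=1, timeout=5)
        except NatsTimeoutError:
            continue  # no work queued; poll again
        for msg in msgs:
            print(f"got prefill request: {msg.data!r}")  # run prefill here
            await msg.ack()  # ack so JetStream does not redeliver

asyncio.run(pull_prefill_requests())
```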
## 🟣 Completion Flow (Purple)
The response generation and delivery:
- Notify (S11): The PrefillWorker sends a completion notification to the Decode Worker (see the schematic after this list)
- Decode (S12): The Decode Worker decodes from its local KV cache, which now contains the prefilled data
- Response (S13): The system sends the generated response to the Processor for post-processing, then through the Frontend to the client
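To make the hand-off concrete, here is a schematic of the S10–S12 synchronization point. It is a simplified in-process sketch using an asyncio event; in Dynamo the notification travels over the transport layer, and decoding is done by the vLLM engine:

```python
# Simplified schematic of the prefill-to-decode hand-off (S10-S12).
# An asyncio.Event stands in for Dynamo's actual notification path.
import asyncio

async def prefill_worker(done: asyncio.Event):
    await asyncio.sleep(0.1)  # stand-in for prefill + NIXL write (S9-S10)
    done.set()                # S11: completion notification

async def decode_worker(done: asyncio.Event, block_ids: list[int]):
    await done.wait()         # S11: wait for the PrefillWorker's signal
    # S12: the pre-allocated blocks now hold the prefilled KV cache,
    # so token generation reads only local GPU memory.
    print(f"decoding from pre-allocated blocks {block_ids}")

async def main():
    done = asyncio.Event()
    await asyncio.gather(prefill_worker(done), decode_worker(done, [0, 1, 2]))

asyncio.run(main())
```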
## 🔗 Infrastructure Connections (Dotted lines)
Coordination and messaging support:
### ETCD Connections (Gray, dotted)
- Frontend, Processor, Planner: service discovery and registration
- Decode Worker, PrefillWorker: NIXL metadata storage for GPU communication setup (see the sketch below)
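A sketch of the metadata exchange these connections enable, using the etcd3 Python client. The key layout and payload are assumptions for illustration; Dynamo's actual schema differs:

```python
# Illustrative NIXL-metadata exchange via ETCD. The key layout and
# payload are hypothetical, not Dynamo's actual schema.
import json
import etcd3

client = etcd3.client(host="localhost", port=2379)

# Decode Worker side: publish the metadata peers need to open a channel.
client.put(
    "/nixl/metadata/decode-worker-0",  # hypothetical key
    json.dumps({"agent": "decode-worker-0", "descriptors": "..."}),
)

# PrefillWorker side: load that metadata to set up GPU-to-GPU transfer.
value, _ = client.get("/nixl/metadata/decode-worker-0")
metadata = json.loads(value)
print(f"establishing NIXL channel to {metadata['agent']}")
```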
### NATS Connections (Teal, dotted)
- PrefillQueue: JetStream consumer group for reliable work distribution
- Processor: load balancing across workers
### Planning Connections (Gold, dotted)
- Frontend → Planner: metrics collection for auto-scaling decisions
- Planner → Workers: resource scaling commands for both the Decode Worker and the PrefillWorker
## Technical Implementation Details
### NIXL (NVIDIA Inference Xfer Library)
- Enables high-speed GPU-to-GPU data transfers over NVLink/PCIe
- The Decode Worker publishes its GPU metadata to ETCD for coordination
- The PrefillWorker loads that metadata to establish direct communication channels
- Block-based transfers (64–128 tokens per block) enable efficient batching (see the sizing sketch below)
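Since transfers are block-granular, the number of blocks moved per request follows directly from the prompt length. A small sketch, using a block size from the 64–128 range above:

```python
# Block-granular transfer sizing: prompts are split into fixed-size
# KV blocks, so a transfer moves ceil(tokens / block_size) blocks.
import math

def blocks_needed(num_tokens: int, block_size: int = 64) -> int:
    return math.ceil(num_tokens / block_size)

# A 1000-token prompt with 64-token blocks needs 16 blocks
# (15 full blocks plus one partially filled block).
print(blocks_needed(1000))       # -> 16
print(blocks_needed(1000, 128))  # -> 8
```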
### Disaggregated KV Cache
- Each Decode Worker maintains a local KV cache in its GPU memory
- No shared-storage bottlenecks: all transfers are direct worker-to-worker
- Pre-allocated blocks ensure a deterministic memory layout and predictable performance (see the allocator sketch below)
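A minimal sketch of what pre-allocation buys: the Decode Worker hands out concrete block IDs before prefill starts, so the PrefillWorker knows exactly where to write. The allocator below is a toy free-list, not Dynamo's implementation:

```python
# Toy free-list allocator illustrating pre-allocation (S5a). Not
# Dynamo's implementation; it shows why destination block IDs are
# known before the remote prefill begins.
class BlockAllocator:
    def __init__(self, num_blocks: int):
        self.free = list(range(num_blocks))

    def allocate(self, n: int) -> list[int]:
        """Reserve n blocks; the returned IDs are a stable destination
        that the PrefillWorker can target with a NIXL write."""
        if n > len(self.free):
            raise MemoryError("not enough free KV cache blocks")
        ids, self.free = self.free[:n], self.free[n:]
        return ids

    def release(self, ids: list[int]) -> None:
        self.free.extend(ids)  # return blocks when the request finishes

alloc = BlockAllocator(num_blocks=1024)
block_ids = alloc.allocate(16)  # e.g. a 1000-token prompt at 64 tokens/block
print(block_ids[:4])            # [0, 1, 2, 3]
```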