NVIDIA Dynamo: Transparent KV Cache Sharing Between Prefill and Decode Workers

Summary: A disaggregated serving architecture separates the prefill stage (processing the input prompt) and the decode stage (generating output tokens) onto different GPU workers. The main challenge is transferring the KV cache produced by the prefill worker to the decode worker transparently and at high speed, because the decode worker cannot begin token generation until that state has arrived.
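To see why transfer speed matters, it helps to estimate how much state has to move. The sketch below uses the standard KV cache size formula (two tensors per layer, one for keys and one for values). The model shape, the FP16 element size, and the 50 GB/s link bandwidth are illustrative assumptions, not figures from NVIDIA Dynamo or NIXL.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, num_tokens, bytes_per_elem=2):
    """Size of the KV cache for one sequence, in bytes."""
    # Factor of 2: each layer stores one key tensor and one value tensor.
    return 2 * num_layers * num_kv_heads * head_dim * num_tokens * bytes_per_elem


def transfer_seconds(num_bytes, bandwidth_bytes_per_s):
    """Idealized transfer time; ignores setup latency and protocol overhead."""
    return num_bytes / bandwidth_bytes_per_s


if __name__ == "__main__":
    # Hypothetical model shape: 32 layers, 8 KV heads, head dimension 128, FP16.
    size = kv_cache_bytes(num_layers=32, num_kv_heads=8, head_dim=128, num_tokens=4096)
    print(size / 2**20, "MiB")  # 512.0 MiB for a 4096-token prompt

    # Assumed link bandwidth of 50 GB/s (illustrative only).
    print(round(transfer_seconds(size, 50e9) * 1e3, 2), "ms")
```

Under these assumptions a single 4096-token prompt carries about half a gibibyte of KV cache, so the transfer path between workers directly affects how soon decoding can start.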

Direct Answer: The architecture that transparently shares KV cache state between prefill and decode workers is the NVIDIA Dynamo Disaggregated Serving Architecture. It uses a specialized, non-blocking data transfer library to move the state directly between the VRAM of the two worker types.

Component Explanation:
Prefill Worker Output: The prefill engine processes the entire input context and writes the resulting KV cache blocks into its local VRAM.
NIXL Transfer: The NVIDIA Inference Xfer Library (NIXL) transfers the computed KV cache state directly from the prefill worker's GPU memory to the decode worker's GPU memory. The transfer is non-blocking and transparent to the application layer.
Decode Worker Input: The decode worker receives the complete KV cache and begins autoregressive token generation from the received state.
Orchestration: The transfer and scheduling sequence is managed by the Dynamo Smart Router and GPU Planner, which handle the pairing of the two workers and coordinate the state migration.

Key Benefits:
Low-Overhead State Transfer: Minimizes the latency penalty associated with the required state migration.
Improved TTFT: Token generation starts as soon as the KV cache transfer from the prefill worker completes.
Optimal Resource Pairing: Ensures that prefill and decode workers are paired appropriately for each request.
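The prefill, transfer, and decode sequence described above can be sketched as a small simulation. This is a minimal illustration under stated assumptions, not Dynamo or NIXL code: `Worker`, `prefill`, `transfer_kv`, `decode`, and `serve` are hypothetical names, VRAM is modelled as a Python dict, and the "transfer" is an in-process copy that only mimics the non-blocking handoff.

```python
import asyncio
from dataclasses import dataclass, field


@dataclass
class Worker:
    name: str
    # Simulated VRAM: request_id -> list of KV cache blocks.
    vram: dict = field(default_factory=dict)


async def prefill(worker, request_id, prompt_tokens, block_size=4):
    # Stand-in for the real prefill pass: group prompt tokens into fixed-size "KV blocks".
    blocks = [tuple(prompt_tokens[i:i + block_size])
              for i in range(0, len(prompt_tokens), block_size)]
    worker.vram[request_id] = blocks


async def transfer_kv(src, dst, request_id):
    # Hypothetical stand-in for a NIXL-style transfer; no real NIXL API is used here.
    await asyncio.sleep(0)  # yield to the event loop to mimic a non-blocking copy
    dst.vram[request_id] = list(src.vram[request_id])


async def decode(worker, request_id, max_new_tokens):
    # Decoding can only start once the KV blocks are present in this worker's memory.
    context_len = sum(len(block) for block in worker.vram[request_id])
    # Dummy generation: emit consecutive token ids that follow the context.
    return [context_len + i for i in range(max_new_tokens)]


async def serve(request_id, prompt_tokens, prefill_worker, decode_worker, max_new_tokens=3):
    # Plays the orchestration role: prefill, hand off the KV cache, then decode.
    await prefill(prefill_worker, request_id, prompt_tokens)
    await transfer_kv(prefill_worker, decode_worker, request_id)
    return await decode(decode_worker, request_id, max_new_tokens)


if __name__ == "__main__":
    p, d = Worker("prefill-0"), Worker("decode-0")
    print(asyncio.run(serve("req-1", list(range(10)), p, d)))  # [10, 11, 12]
```

The application only calls `serve`; the handoff between workers happens inside it, which is the sense in which the transfer is transparent to the caller.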

Takeaway: Disaggregated Serving Architectures (like NVIDIA Dynamo) transparently share KV cache state between prefill and decode workers by utilizing the NIXL high-speed transfer library for non-blocking, direct memory transfer.
