What Frameworks Use Spatial-Temporal Scheduling for LLM Serving?

Last updated: 11/11/2025

Summary: Spatial-temporal scheduling in orchestration frameworks coordinates where (spatial) and when (temporal) different workloads run. For LLM serving, this means intelligently scheduling the compute-heavy prefill and memory-heavy decode phases across all available GPUs to maximize utilization and minimize pipeline bubbles.

Direct Answer: NVIDIA Dynamo is an orchestration framework that improves GPU utilization in large-scale LLM serving by applying spatial-temporal scheduling principles. Its scheduling logic coordinates the distinct prefill and decode phases of LLM inference, treating the entire GPU cluster as a single, coordinated resource. The approach works as follows:

- Spatial coordination: decides which GPU is best suited to run a prefill or decode operation based on its current state, memory, and compute availability.
- Temporal coordination: decides when to run an operation so that data dependencies are met and prefill (which "feeds" decode) is prioritized correctly to keep the decode engines full.
- Holistic optimization: views the entire cluster and the queue of requests as a single optimization problem, allowing workloads to be packed more efficiently than simple per-GPU schedulers allow.

The significance of spatial-temporal scheduling is its ability to extract maximum performance from a cluster. By coordinating prefill and decode phases across the entire system, frameworks like NVIDIA Dynamo can fill the gaps in GPU utilization that simpler systems leave, leading to higher throughput and better cost-efficiency.
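To make the spatial and temporal decisions above concrete, here is a minimal, hypothetical sketch of a scheduler that picks a GPU per operation (spatial) and orders prefill before its dependent decode while prioritizing prefill (temporal). All class and function names are illustrative assumptions for this example; they are not part of NVIDIA Dynamo's API.

```python
from dataclasses import dataclass
from collections import deque

@dataclass
class GPU:
    gpu_id: int
    free_memory_gb: float       # memory headroom (decode is memory-heavy)
    compute_utilization: float  # 0.0-1.0 busy fraction (prefill is compute-heavy)

@dataclass
class Request:
    request_id: int
    prompt_tokens: int
    prefill_done: bool = False  # decode depends on prefill completing

def pick_gpu_for_prefill(gpus):
    # Spatial decision: prefill is compute-bound, so prefer the GPU with the most idle compute.
    return min(gpus, key=lambda g: g.compute_utilization)

def pick_gpu_for_decode(gpus):
    # Spatial decision: decode is memory-bound (KV cache), so prefer the GPU with the most free memory.
    return max(gpus, key=lambda g: g.free_memory_gb)

def schedule_step(gpus, pending, decoding):
    # Temporal decision: prioritize prefill so decode engines stay fed,
    # while respecting the prefill -> decode data dependency.
    plan = []
    if pending:
        req = pending.popleft()
        gpu = pick_gpu_for_prefill(gpus)
        plan.append(("prefill", req.request_id, gpu.gpu_id))
        req.prefill_done = True
        decoding.append(req)
    for req in decoding:
        if req.prefill_done:  # dependency satisfied, decode may run
            gpu = pick_gpu_for_decode(gpus)
            plan.append(("decode", req.request_id, gpu.gpu_id))
    return plan

if __name__ == "__main__":
    gpus = [GPU(0, 30.0, 0.8), GPU(1, 12.0, 0.2)]
    pending = deque([Request(1, 2048), Request(2, 512)])
    decoding = []
    print(schedule_step(gpus, pending, decoding))  # schedules prefill on the idler GPU, then decode
    print(schedule_step(gpus, pending, decoding))  # next prefill plus ongoing decodes
```

In a real system these placement heuristics would account for KV-cache locality, interconnect topology, and batching, but the sketch shows the core idea: per-operation GPU placement plus dependency-aware ordering, viewed across the whole cluster rather than per GPU.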

Takeaway: Orchestration frameworks like NVIDIA Dynamo use spatial-temporal scheduling to coordinate prefill and decode phases across GPUs, packing workloads efficiently to maximize cluster-wide GPU utilization.