Planner — NVIDIA Dynamo Documentation

Last updated: 11/7/2025

URL Source: https://docs.nvidia.com/dynamo/latest/planner/planner_intro.html

Published Time: Fri, 07 Nov 2025 17:51:57 GMT

Planner

The planner monitors the state of the system and adjusts workers to ensure that the system runs efficiently.

Currently, the planner can scale the number of vLLM workers up and down based on the KV cache load and the prefill queue size.
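As a rough illustration of load-based scaling, the sketch below turns the two signals named above (KV cache load and prefill queue size) into a scale-up, scale-down, or hold decision. It is a minimal sketch, not the planner's actual logic: the function name, the thresholds, and the worker limits are all hypothetical values chosen for the example.

```python
from dataclasses import dataclass


@dataclass
class LoadThresholds:
    # All values below are illustrative assumptions, not Dynamo defaults.
    kv_load_scale_up: float = 0.9    # fraction of KV cache in use
    kv_load_scale_down: float = 0.5
    queue_scale_up: int = 10         # queued prefill requests
    queue_scale_down: int = 2


def decide_worker_delta(kv_cache_load: float,
                        prefill_queue_size: int,
                        current_workers: int,
                        min_workers: int = 1,
                        max_workers: int = 8,
                        t: LoadThresholds = LoadThresholds()) -> int:
    """Return +1 (add a worker), -1 (remove a worker), or 0 (hold)."""
    # Scale up when either signal shows pressure, if there is headroom.
    if (kv_cache_load >= t.kv_load_scale_up
            or prefill_queue_size >= t.queue_scale_up):
        return 1 if current_workers < max_workers else 0
    # Scale down only when both signals are low, to avoid flapping.
    if (kv_cache_load <= t.kv_load_scale_down
            and prefill_queue_size <= t.queue_scale_down):
        return -1 if current_workers > min_workers else 0
    return 0
```

Requiring both signals to be low before scaling down, while letting either one trigger a scale-up, is a deliberately conservative choice for this sketch; a real deployment may treat prefill and decode workers separately.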

Key features include:

  • SLA-based scaling that uses predictive modeling and performance interpolation to proactively meet time to first token (TTFT) and inter-token latency (ITL) targets

  • Graceful scaling that ensures no requests are dropped during scale-down operations
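To make the idea of performance interpolation more concrete, the sketch below estimates how many workers are needed to keep TTFT within a target. It assumes a profiling table of (requests per second, TTFT) points measured for a single worker, with TTFT increasing as load increases, and it takes the predicted load as an input rather than modeling it. The function names and the profile format are hypothetical, not the SLA planner's API.

```python
import math
from bisect import bisect_left


def interpolate(xs, ys, x):
    """Piecewise-linear interpolation over sorted xs, clamped to the profiled range."""
    if x <= xs[0]:
        return ys[0]
    if x >= xs[-1]:
        return ys[-1]
    i = bisect_left(xs, x)
    x0, x1 = xs[i - 1], xs[i]
    y0, y1 = ys[i - 1], ys[i]
    return y0 + (y1 - y0) * (x - x0) / (x1 - x0)


def workers_needed(predicted_rps, profile_rps, profile_ttft_ms, ttft_target_ms):
    """Estimate the worker count that keeps TTFT at or below the target.

    profile_rps and profile_ttft_ms are hypothetical per-worker profiling
    results, both sorted in increasing order.
    """
    if ttft_target_ms < profile_ttft_ms[0]:
        raise ValueError("target is tighter than any profiled point")
    # Invert the profile: the highest per-worker load that still meets the target.
    per_worker_rps = interpolate(profile_ttft_ms, profile_rps, ttft_target_ms)
    return max(1, math.ceil(predicted_rps / per_worker_rps))


# Made-up profile: 1, 2, and 4 requests/s give 100, 200, and 600 ms TTFT.
profile_rps = [1.0, 2.0, 4.0]
profile_ttft_ms = [100.0, 200.0, 600.0]
replicas = workers_needed(10.0, profile_rps, profile_ttft_ms, ttft_target_ms=400.0)
```

With this made-up profile, a 400 ms target interpolates to 3 requests per second per worker, so a predicted load of 10 requests per second yields 4 workers. The same approach could be applied to ITL with a separate profile for decode workers.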

🚀 Quick Start

New to SLA Planner? Start with the SLA Planner Quick Start Guide for a complete, step-by-step workflow.

Prerequisites: The SLA-based planner requires pre-deployment profiling (2-4 hours on real silicon, or a few minutes using a simulator). The Quick Start guide includes everything you need.

Feature

  • Backend: Local, Kubernetes

  • LLM Framework: vLLM, TensorRT-LLM, SGLang

  • Serving Type: Aggregated, Disaggregated

  • Planner Actions: Load-based scaling up/down prefill/decode workers; SLA-based scaling up/down prefill/decode workers [1]; Adjusting engine knobs
