# Running SGLang with Dynamo
## Use the Latest Release

We recommend using the latest stable release of Dynamo to avoid breaking changes. You can find the latest release here and check out the corresponding branch with:

```bash
git checkout $(git describe --tags $(git rev-list --tags --max-count=1))
```
## Table of Contents

- [Feature Support Matrix](#feature-support-matrix)
- [Dynamo SGLang Integration](#dynamo-sglang-integration)
- [Installation](#installation)
- [Quick Start](#quick-start)
- [Advanced Examples](#advanced-examples)
- [Deployment](#deployment)
## Feature Support Matrix

### Core Dynamo Features
| Feature | SGLang | Notes |
|---|---|---|
| Disaggregated Serving | ✅ | |
| Conditional Disaggregation | 🚧 | WIP PR |
| KV-Aware Routing | ✅ | |
| SLA-Based Planner | ✅ | |
| Multimodal EPD Disaggregation | ✅ | |
| KVBM | ❌ | Planned |
## Dynamo SGLang Integration
Dynamo SGLang integrates SGLang engines into Dynamo’s distributed runtime, enabling advanced features like disaggregated serving, KV-aware routing, and request migration while maintaining full compatibility with SGLang’s engine arguments.
### Argument Handling
Dynamo SGLang uses SGLang's native argument parser, so most SGLang engine arguments work identically. You can pass any SGLang argument (like `--model-path`, `--tp`, `--trust-remote-code`) directly to `dynamo.sglang`.
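For example, a worker can be launched with standard SGLang flags. This is a minimal sketch, assuming the `python3 -m dynamo.sglang` module entry point; the model and `--tp` value are illustrative:

```bash
# Launch a Dynamo SGLang worker; every flag after the module name is a
# standard SGLang engine argument (model and --tp value are illustrative)
python3 -m dynamo.sglang \
  --model-path Qwen/Qwen3-0.6B \
  --tp 2 \
  --trust-remote-code
```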
#### Dynamo-Specific Arguments
| Argument | Description | Default | SGLang Equivalent |
|---|---|---|---|
| `--endpoint` | Dynamo endpoint in `dyn://namespace.component.endpoint` format | Auto-generated based on mode | N/A |
| `--migration-limit` | Max times a request can migrate between workers for fault tolerance. See Request Migration Architecture. | 0 (disabled) | N/A |
| `--dyn-tool-call-parser` | Tool call parser for structured outputs (takes precedence over `--tool-call-parser`) | None | `--tool-call-parser` |
| `--dyn-reasoning-parser` | Reasoning parser for CoT models (takes precedence over `--reasoning-parser`) | None | `--reasoning-parser` |
| `--use-sglang-tokenizer` | Use SGLang's tokenizer instead of Dynamo's | False | N/A |
| `--custom-jinja-template` | Use a custom chat template for the model (takes precedence over the default chat template in the model repo) | None | `--chat-template` |
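These combine freely with regular engine flags. A sketch, with the same assumed entry point as above; the endpoint name and migration limit are illustrative values, not defaults:

```bash
# Hypothetical worker launch mixing Dynamo-specific and SGLang flags:
# register under an explicit endpoint and allow up to 3 migrations per request
python3 -m dynamo.sglang \
  --model-path Qwen/Qwen3-0.6B \
  --endpoint dyn://dynamo.sglang.generate \
  --migration-limit 3
```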
### Tokenizer Behavior

- Default (`--use-sglang-tokenizer` not set): Dynamo handles tokenization/detokenization via our blazing fast frontend and passes `input_ids` to SGLang
- With `--use-sglang-tokenizer`: SGLang handles tokenization/detokenization, and Dynamo passes raw prompts
> **Note:** When using `--use-sglang-tokenizer`, only `v1/chat/completions` is available through Dynamo's frontend.
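A minimal sketch of opting into SGLang's tokenizer, under the same assumed entry point:

```bash
# Let SGLang own tokenization/detokenization; Dynamo then forwards raw prompts,
# and only v1/chat/completions is served by the frontend
python3 -m dynamo.sglang \
  --model-path Qwen/Qwen3-0.6B \
  --use-sglang-tokenizer
```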
### Request Cancellation
When a user cancels a request (e.g., by disconnecting from the frontend), the request is automatically cancelled across all workers, freeing compute resources for other requests.
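For instance, dropping the client connection mid-stream is enough to trigger cancellation. A sketch: `-m 2` simply makes curl abort after two seconds, and the model name matches the test request used later in this guide:

```bash
# Abort a streaming request after 2 seconds; Dynamo propagates the
# cancellation to the worker(s) so generation stops server-side
curl -m 2 localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen3-0.6B",
    "messages": [{"role": "user", "content": "Write a very long story"}],
    "stream": true,
    "max_tokens": 4096
  }'
```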
#### Cancellation Support Matrix
|  | Prefill | Decode |
|---|---|---|
| Aggregated | ✅ | ✅ |
| Disaggregated | ⚠️ | ✅ |
> **Warning:** ⚠️ The SGLang backend currently does not support cancellation during the remote prefill phase in disaggregated mode.
For more details, see the Request Cancellation Architecture documentation.
## Installation

### Install latest release

We suggest using uv to install the latest release of `ai-dynamo[sglang]`. You can install uv with:

```bash
curl -LsSf https://astral.sh/uv/install.sh | sh
```
```bash
# create a virtual env
uv venv --python 3.12 --seed

# install the latest release (which comes bundled with a stable sglang version)
uv pip install "ai-dynamo[sglang]"
```
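To sanity-check the install, you can inspect the environment. A sketch; both commands only read package metadata:

```bash
# confirm the packages resolved into the virtual env
uv pip show ai-dynamo
python -c "import sglang; print(sglang.__version__)"
```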
### Install editable version for development
This requires having Rust installed. We also recommend a proper installation of the CUDA toolkit, as SGLang requires `nvcc` to be available.
```bash
# create a virtual env
uv venv --python 3.12 --seed

# build dynamo runtime bindings
uv pip install maturin
cd $DYNAMO_HOME/lib/bindings/python
maturin develop --uv
cd $DYNAMO_HOME

# install the supported sglang version along with dynamo;
# include the prerelease flag to install flashinfer rc versions
uv pip install -e .

# install any sglang version >= 0.5.3.post2
uv pip install "sglang[all]==0.5.3.post2"
```
### Using docker containers

We are in the process of shipping pre-built docker containers that include installations of DeepEP, DeepGEMM, and NVSHMEM to support WideEP and P/D. For now, you can quickly build the container from source with the following command:

```bash
cd $DYNAMO_ROOT
docker build \
  -f container/Dockerfile.sglang-wideep \
  -t dynamo-sglang \
  --no-cache \
  .
```
And then run it using:

```bash
docker run \
  --gpus all \
  -it \
  --rm \
  --network host \
  --shm-size=10G \
  --ulimit memlock=-1 \
  --ulimit stack=67108864 \
  --ulimit nofile=65536:65536 \
  --cap-add CAP_SYS_PTRACE \
  --ipc host \
  dynamo-sglang:latest
```
## Quick Start

Below we provide a guide that lets you run all of our common deployment patterns on a single node.

### Start NATS and ETCD in the background

Start them using Docker Compose:

```bash
docker compose -f deploy/docker-compose.yml up -d
```
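You can confirm both services came up before launching workers; `docker compose ps` just lists the containers managed by that compose file:

```bash
# both NATS and etcd containers should show as running
docker compose -f deploy/docker-compose.yml ps
```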
> **Tip:** Each example corresponds to a simple bash script that runs the OpenAI-compatible server, processor, and optional router (written in Rust) and the LLM engine (written in Python) in a single terminal. You can easily take each command and run it in a separate terminal.

Additionally, because we use sglang's argument parser, you can pass in any argument that sglang supports to the worker!
### Aggregated Serving

```bash
cd $DYNAMO_HOME/components/backends/sglang
./launch/agg.sh
```
### Aggregated Serving with KV Routing

```bash
cd $DYNAMO_HOME/components/backends/sglang
./launch/agg_router.sh
```
### Aggregated Serving for Embedding Models

Here's an example that uses the Qwen/Qwen3-Embedding-4B model.

```bash
cd $DYNAMO_HOME/components/backends/sglang
./launch/agg_embed.sh
```
Send the following request to verify your deployment:

```bash
curl localhost:8000/v1/embeddings \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen3-Embedding-4B",
    "input": "Hello, world!"
  }'
```
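If the deployment is healthy, the response should follow the standard OpenAI embeddings shape, with the vector under `data[0].embedding`. For a quick check you can pipe through `jq` (assumed to be installed):

```bash
# print the embedding dimensionality rather than the full vector
curl -s localhost:8000/v1/embeddings \
  -H "Content-Type: application/json" \
  -d '{"model": "Qwen/Qwen3-Embedding-4B", "input": "Hello, world!"}' \
  | jq '.data[0].embedding | length'
```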
### Disaggregated serving

See SGLang Disaggregation to learn more about how sglang and dynamo handle disaggregated serving.

```bash
cd $DYNAMO_HOME/components/backends/sglang
./launch/disagg.sh
```
### Disaggregated Serving with KV Aware Prefill Routing

```bash
cd $DYNAMO_HOME/components/backends/sglang
./launch/disagg_router.sh
```
### Disaggregated Serving with Mixture-of-Experts (MoE) models and DP attention

You can use this configuration to test out disaggregated serving with DP attention and expert parallelism on a single node before scaling to the full DeepSeek-R1 model across multiple nodes.

Note: this requires 4 GPUs.

```bash
cd $DYNAMO_HOME/components/backends/sglang
./launch/disagg_dp_attn.sh
```
### Testing the Deployment

Send a test request to verify your deployment:

```bash
curl localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen3-0.6B",
    "messages": [
      {
        "role": "user",
        "content": "Explain why Roger Federer is considered one of the greatest tennis players of all time"
      }
    ],
    "stream": true,
    "max_tokens": 30
  }'
```
## Advanced Examples

Below we provide a selected list of advanced examples. Please open an issue if you'd like to see a specific example!

- Run a multi-node sized model
- Large scale P/D disaggregation with WideEP
- Hierarchical Cache (HiCache)
- Multimodal Encode-Prefill-Decode (EPD) Disaggregation with NIXL
## Deployment

We currently provide deployment examples for Kubernetes and SLURM.

### Kubernetes

- Deploying Dynamo with SGLang on Kubernetes

### SLURM

- Deploying Dynamo with SGLang on SLURM