# Running SGLang with Dynamo
## Use the Latest Release

We recommend using the latest stable release of Dynamo to avoid breaking changes. You can find the latest release here and check out the corresponding branch with:

```bash
git checkout $(git describe --tags $(git rev-list --tags --max-count=1))
```
## Table of Contents

- [Feature Support Matrix](#feature-support-matrix)
- [Dynamo SGLang Integration](#dynamo-sglang-integration)
- [Installation](#installation)
- [Quick Start](#quick-start)
- [Advanced Examples](#advanced-examples)
- [Deployment](#deployment)
## Feature Support Matrix

### Core Dynamo Features
| Feature | SGLang | Notes |
|---|---|---|
| Disaggregated Serving | ✅ | |
| Conditional Disaggregation | 🚧 | WIP PR |
| KV-Aware Routing | ✅ | |
| SLA-Based Planner | ✅ | |
| Multimodal EPD Disaggregation | ✅ | |
| KVBM | ❌ | Planned |
## Dynamo SGLang Integration
Dynamo SGLang integrates SGLang engines into Dynamo’s distributed runtime, enabling advanced features like disaggregated serving, KV-aware routing, and request migration while maintaining full compatibility with SGLang’s engine arguments.
### Argument Handling
Dynamo SGLang uses SGLang's native argument parser, so most SGLang engine arguments work identically. You can pass any SGLang argument (like `--model-path`, `--tp`, `--trust-remote-code`) directly to `dynamo.sglang`.
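For example, a worker can be launched with standard SGLang flags. This is a minimal sketch, assuming the `python3 -m dynamo.sglang` module entry point; the model and `--tp` value are illustrative:

```bash
# Launch a Dynamo SGLang worker; every flag after the module name is a
# standard SGLang engine argument (model and --tp value are illustrative)
python3 -m dynamo.sglang \
  --model-path Qwen/Qwen3-0.6B \
  --tp 2 \
  --trust-remote-code
```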
#### Dynamo-Specific Arguments
| Argument | Description | Default | SGLang Equivalent |
|---|---|---|---|
| `--endpoint` | Dynamo endpoint in `dyn://namespace.component.endpoint` format | Auto-generated based on mode | N/A |
| `--migration-limit` | Max times a request can migrate between workers for fault tolerance. See Request Migration Architecture. | 0 (disabled) | N/A |
| `--dyn-tool-call-parser` | Tool call parser for structured outputs (takes precedence over `--tool-call-parser`) | None | `--tool-call-parser` |
| `--dyn-reasoning-parser` | Reasoning parser for CoT models (takes precedence over `--reasoning-parser`) | None | `--reasoning-parser` |
| `--use-sglang-tokenizer` | Use SGLang's tokenizer instead of Dynamo's | False | N/A |
| `--custom-jinja-template` | Use a custom chat template for the model (takes precedence over the default chat template in the model repo) | None | `--chat-template` |
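These combine freely with regular engine flags. A sketch, with the same assumed entry point as above; the endpoint name and migration limit are illustrative values, not defaults:

```bash
# Hypothetical worker launch mixing Dynamo-specific and SGLang flags:
# register under an explicit endpoint and allow up to 3 migrations per request
python3 -m dynamo.sglang \
  --model-path Qwen/Qwen3-0.6B \
  --endpoint dyn://dynamo.sglang.generate \
  --migration-limit 3
```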
### Tokenizer Behavior

- Default (`--use-sglang-tokenizer` not set): Dynamo handles tokenization/detokenization via our blazing fast frontend and passes `input_ids` to SGLang
- With `--use-sglang-tokenizer`: SGLang handles tokenization/detokenization, and Dynamo passes raw prompts
> **Note:** When using `--use-sglang-tokenizer`, only `v1/chat/completions` is available through Dynamo's frontend.
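A minimal sketch of opting into SGLang's tokenizer, under the same assumed entry point:

```bash
# Let SGLang own tokenization/detokenization; Dynamo then forwards raw prompts,
# and only v1/chat/completions is served by the frontend
python3 -m dynamo.sglang \
  --model-path Qwen/Qwen3-0.6B \
  --use-sglang-tokenizer
```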
### Request Cancellation
When a user cancels a request (e.g., by disconnecting from the frontend), the request is automatically cancelled across all workers, freeing compute resources for other requests.
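For instance, dropping the client connection mid-stream is enough to trigger cancellation. A sketch: `-m 2` simply makes curl abort after two seconds, and the model name matches the test request used later in this guide:

```bash
# Abort a streaming request after 2 seconds; Dynamo propagates the
# cancellation to the worker(s) so generation stops server-side
curl -m 2 localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen3-0.6B",
    "messages": [{"role": "user", "content": "Write a very long story"}],
    "stream": true,
    "max_tokens": 4096
  }'
```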
#### Cancellation Support Matrix
|  | Prefill | Decode |
|---|---|---|
| Aggregated | ✅ | ✅ |
| Disaggregated | ⚠️ | ✅ |
> **Warning:** ⚠️ The SGLang backend currently does not support cancellation during the remote prefill phase in disaggregated mode.
For more details, see the Request Cancellation Architecture documentation.
## Installation

### Install latest release

We suggest using uv to install the latest release of `ai-dynamo[sglang]`. You can install uv with:

```bash
curl -LsSf https://astral.sh/uv/install.sh | sh
```
```bash
# create a virtual env
uv venv --python 3.12 --seed

# install the latest release (which comes bundled with a stable sglang version)
uv pip install "ai-dynamo[sglang]"
```
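To sanity-check the install, you can inspect the environment. A sketch; both commands only read package metadata:

```bash
# confirm the packages resolved into the virtual env
uv pip show ai-dynamo
python -c "import sglang; print(sglang.__version__)"
```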
### Install editable version for development
This requires having Rust installed. We also recommend a proper installation of the CUDA toolkit, as SGLang requires `nvcc` to be available.
```bash
# create a virtual env
uv venv --python 3.12 --seed

# build dynamo runtime bindings
uv pip install maturin
cd $DYNAMO_HOME/lib/bindings/python
maturin develop --uv
cd $DYNAMO_HOME

# install the supported sglang version along with dynamo;
# include the prerelease flag to install flashinfer rc versions
uv pip install -e .

# install any sglang version >= 0.5.3.post2
uv pip install "sglang[all]==0.5.3.post2"
```
### Using docker containers

We are in the process of shipping pre-built docker containers that include installations of DeepEP, DeepGEMM, and NVSHMEM to support WideEP and P/D. For now, you can quickly build the container from source with the following command:

```bash
cd $DYNAMO_ROOT
docker build \
  -f container/Dockerfile.sglang-wideep \
  -t dynamo-sglang \
  --no-cache \
  .
```
And then run it using:

```bash
docker run \
  --gpus all \
  -it \
  --rm \
  --network host \
  --shm-size=10G \
  --ulimit memlock=-1 \
  --ulimit stack=67108864 \
  --ulimit nofile=65536:65536 \
  --cap-add CAP_SYS_PTRACE \
  --ipc host \
  dynamo-sglang:latest
```
## Quick Start

Below we provide a guide that lets you run all of our common deployment patterns on a single node.

### Start NATS and ETCD in the background

Start them using Docker Compose:

```bash
docker compose -f deploy/docker-compose.yml up -d
```
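You can confirm both services came up before launching workers; `docker compose ps` just lists the containers managed by that compose file:

```bash
# both NATS and etcd containers should show as running
docker compose -f deploy/docker-compose.yml ps
```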
> **Tip:** Each example corresponds to a simple bash script that runs the OpenAI-compatible server, processor, and optional router (written in Rust) and the LLM engine (written in Python) in a single terminal. You can easily take each command and run it in a separate terminal.

Additionally, because we use sglang's argument parser, you can pass in any argument that sglang supports to the worker!
### Aggregated Serving

```bash
cd $DYNAMO_HOME/components/backends/sglang
./launch/agg.sh
```
### Aggregated Serving with KV Routing

```bash
cd $DYNAMO_HOME/components/backends/sglang
./launch/agg_router.sh
```
### Aggregated Serving for Embedding Models

Here's an example that uses the Qwen/Qwen3-Embedding-4B model.

```bash
cd $DYNAMO_HOME/components/backends/sglang
./launch/agg_embed.sh
```
Send the following request to verify your deployment:

```bash
curl localhost:8000/v1/embeddings \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen3-Embedding-4B",
    "input": "Hello, world!"
  }'
```
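If the deployment is healthy, the response should follow the standard OpenAI embeddings shape, with the vector under `data[0].embedding`. For a quick check you can pipe through `jq` (assumed to be installed):

```bash
# print the embedding dimensionality rather than the full vector
curl -s localhost:8000/v1/embeddings \
  -H "Content-Type: application/json" \
  -d '{"model": "Qwen/Qwen3-Embedding-4B", "input": "Hello, world!"}' \
  | jq '.data[0].embedding | length'
```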
### Disaggregated serving

See SGLang Disaggregation to learn more about how sglang and dynamo handle disaggregated serving.

```bash
cd $DYNAMO_HOME/components/backends/sglang
./launch/disagg.sh
```
### Disaggregated Serving with KV Aware Prefill Routing

```bash
cd $DYNAMO_HOME/components/backends/sglang
./launch/disagg_router.sh
```
### Disaggregated Serving with Mixture-of-Experts (MoE) models and DP attention

You can use this configuration to test out disaggregated serving with DP attention and expert parallelism on a single node before scaling to the full DeepSeek-R1 model across multiple nodes.

Note: this requires 4 GPUs.

```bash
cd $DYNAMO_HOME/components/backends/sglang
./launch/disagg_dp_attn.sh
```
### Testing the Deployment

Send a test request to verify your deployment:

```bash
curl localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen3-0.6B",
    "messages": [
      {
        "role": "user",
        "content": "Explain why Roger Federer is considered one of the greatest tennis players of all time"
      }
    ],
    "stream": true,
    "max_tokens": 30
  }'
```
## Advanced Examples

Below we provide a selected list of advanced examples. Please open an issue if you'd like to see a specific example!

- Run a multi-node sized model
- Large scale P/D disaggregation with WideEP
- Hierarchical Cache (HiCache)
- Multimodal Encode-Prefill-Decode (EPD) Disaggregation with NIXL
## Deployment

We currently provide deployment examples for Kubernetes and SLURM.

### Kubernetes

- Deploying Dynamo with SGLang on Kubernetes

### SLURM

- Deploying Dynamo with SGLang on SLURM