| Component | Language | Description | Key Features |
|---|---|---|---|
| dlengine | Python/C++ | LLM inference engine | Prefill/decode engines, KV cache management, continuous batching, Ray-based distributed workers |
| dlengine-router | Rust | HTTP load balancer | OpenAI-compatible API, tool calls, routing strategies, engine discovery |

| Model | Component | Architecture |
|---|---|---|
| DeepSeek-V3 | dlengine | MLA + MoE |
| DeepSeek-V3.2 | dlengine | MLA + MoE + NSA |
| DeepSeek-V4 | dlengine | MLA + MoE + DSA + SWA |
| GLM-5 | dlengine | MLA + MoE + NSA |
| Kimi-K2 | dlengine | MLA + MoE |
| Qwen3 | dlengine | GQA (Dense) |
| Qwen3-MoE | dlengine | GQA + MoE |
| Qwen3.5-MoE | dlengine | GQA + GDN + MoE |
| Qwen3-VL | dlengine.vl | GQA + MoE + ViT |

| Feature | Description |
|---|---|
| ✅ Chunked Prefill | Split long prompts into chunks to overlap with decode batches. |
| ✅ Continuous Batching | Dynamic request scheduling with paged KV cache. |
| ✅ CUDA Graph | Captured decode kernels for low-latency token generation. |
| ✅ Encoder-Prefill-Decode (EPD) Disaggregation | Separate encoder, prefill and decode across GPU nodes with GPUDirect RDMA KV migration. |
| ✅ FP8 KV Cache | Float8 (E4M3) paged KV cache; roughly 50% less KV memory than a 16-bit (FP16/BF16) cache. |
| ✅ Gated Delta Net (GDN) | Linear attention for Qwen3.5-MoE hybrid full/linear layers. |
| ✅ Multi-head Latent Attention (MLA) | Compressed KV cache with low-rank projection for DeepSeek-V3 family. |
| ✅ Multi-Token Prediction (MTP) | Speculative decoding with model-native MTP heads. |
| ✅ Native Sparse Attention (NSA) | FP8 sparse decode with block-level indexing for DeepSeek-V3.2. |
| ✅ Node Discovery | Automatic engine registration and heartbeat via the DLSlime control plane (dlslime-ctrl). |
| ✅ Prefix Caching | Reuse KV cache of shared prompt prefixes across requests. |
| ✅ Tensor Parallelism (TP) | Split weight matrices across GPUs for large model inference. |
| ✅ Wide Expert Parallelism | MoE EP across all GPUs with attention data-parallel (attention_dp × ffn_ep). |
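
The chunked-prefill row above can be illustrated with a small scheduling sketch. This is not DLEngine's scheduler: `chunk_prompt`, `plan_steps`, and the per-step token budget are hypothetical names chosen for illustration, and the budget is only assumed to play a role similar to a setting such as `--max_num_batched_tokens`. The sketch shows the general idea of giving each running decode sequence one token per step and spending the remaining budget on the next slice of a long prompt.

```python
from typing import Iterator, List


def chunk_prompt(token_ids: List[int], chunk_size: int) -> Iterator[List[int]]:
    """Yield consecutive slices of at most chunk_size tokens."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    for start in range(0, len(token_ids), chunk_size):
        yield token_ids[start:start + chunk_size]


def plan_steps(prompt_len: int, num_decode_seqs: int, token_budget: int) -> List[dict]:
    """Plan engine steps that mix decode tokens with prefill chunks.

    Hypothetical policy: every step spends one token per running decode
    sequence and hands the rest of the budget to the pending prompt.
    """
    if token_budget <= num_decode_seqs:
        raise ValueError("token_budget must exceed the number of decode sequences")
    steps = []
    done = 0
    while done < prompt_len:
        room = token_budget - num_decode_seqs
        take = min(room, prompt_len - done)
        steps.append({"decode_tokens": num_decode_seqs, "prefill_tokens": take})
        done += take
    return steps
```

With a budget of 4096 tokens and 96 running decode sequences, a 10,000-token prompt is spread over three steps instead of stalling decode for one very large prefill.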

graph TB
Client[Client Layer<br/>HTTP Requests / OpenAI SDK]
Route[dlengine-router<br/>Rust/HTTP<br/>Load Balancer]
VL[dlengine.vl<br/>Vision Encoder]
Prefill[Prefill Engine<br/>Python/C++]
Decode[Decode Engine<br/>Python/C++]
Ctrl[dlslime-ctrl<br/>Redis<br/>Service Registry<br/>from DLSlime]
Client -->|HTTP| Route
Route -->|ZMQ| VL
Route -->|ZMQ| Prefill
Route -->|ZMQ| Decode
VL -->|RDMA<br/>Embeddings| Prefill
Prefill -->|RDMA<br/>KV Migration| Decode
VL -->|Register/Heartbeat| Ctrl
Prefill -->|Register/Heartbeat| Ctrl
Decode -->|Register/Heartbeat| Ctrl
Route -->|Engine Discovery| Ctrl
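The register/heartbeat/discovery edges in the diagram can be pictured with a toy in-memory registry. This is only a sketch of the general pattern, not the dlslime-ctrl API or its Redis schema: `ToyRegistry`, `EngineRecord`, the role labels, and the TTL value are all hypothetical.

```python
import time
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class EngineRecord:
    engine_id: str
    role: str          # hypothetical labels: "encoder", "prefill", "decode"
    address: str
    last_seen: float


class ToyRegistry:
    """In-memory stand-in for a service registry with heartbeat expiry."""

    def __init__(self, ttl_seconds: float = 10.0,
                 clock: Callable[[], float] = time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._engines: Dict[str, EngineRecord] = {}

    def register(self, engine_id: str, role: str, address: str) -> None:
        self._engines[engine_id] = EngineRecord(engine_id, role, address, self._clock())

    def heartbeat(self, engine_id: str) -> bool:
        """Refresh an engine's timestamp; False if it never registered."""
        record = self._engines.get(engine_id)
        if record is None:
            return False
        record.last_seen = self._clock()
        return True

    def discover(self, role: str) -> List[EngineRecord]:
        """Return engines of the given role whose heartbeat is still fresh."""
        now = self._clock()
        return [r for r in self._engines.values()
                if r.role == role and now - r.last_seen <= self._ttl]
```

A router built on this pattern would call `discover("prefill")` and `discover("decode")` before each routing decision, so engines that stop sending heartbeats drop out without manual cleanup.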
A prebuilt CUDA 12.8 development image bundles all build dependencies (PyTorch,
DeepEP/DeepGEMM/FlashMLA/FlashInfer, flash-attn, DLSlime, Rust toolchain), plus an
optional variant with 3FS USRBIO support. See docker/README.md
for the pinned dependency versions, build/run commands, and the 3FS image.
The DeepSeek kernels require SM90+ (NVIDIA Hopper) GPUs. To install the key dependencies manually instead of using the image:
# Each build runs in a subshell so the working directory is restored afterwards
(cd DeepEP && pip install .)
(cd DeepGEMM && pip install .)
(cd FlashMLA && pip install .)
pip install flashinfer-python==0.6.9
pip install dlslime==0.1.16

pip install ".[all]"

pip install ".[dlengine]"    # DLEngine inference engine only
pip install ".[dlenginevl]"  # DLEngine + vision-language extras (dlengine.vl subpackage)

The control-plane server (dlslime-ctrl) and its Python client
(dlslime.ctrl.NanoCtrlClient) now live in the DLSlime repo. Install them via:

pip install dlslime       # PeerAgent + NanoCtrlClient (data-plane wheel)
pip install dlslime-ctrl  # Rust control-plane server binary
# or, from a DLSlime checkout:
pip install -e ./dlslime ./dlslime-ctrl

# Build DLEngine C++ extensions in-place (GPU kernels ship in dlengine.kernel)
cd dlengine && pip install -e . && cd ..
# Build dlengine-router (Rust)
cd dlengine-router && cargo build --release && cd ..
# Build dlslime-ctrl (Rust) from the DLSlime checkout
cd /path/to/DLSlime/dlslime-ctrl && cargo build --release && cd -

Prefill-Decode disaggregation splits prompt processing (prefill) and token generation (decode) across separate GPU nodes connected via RDMA.

- 2 nodes with NVIDIA GPUs (SM90+ for FP8), RDMA-capable NICs
- Redis, Ray cluster, Rust toolchain
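
Before launching, it can help to confirm that each node's GPUs meet the SM90 requirement. The following is a minimal sketch, assuming PyTorch is installed on the node; `meets_sm90` and `local_capabilities` are hypothetical helper names, and the check only covers compute capability, not RDMA or driver setup.

```python
from typing import List, Tuple

MIN_CAPABILITY = (9, 0)  # SM90, i.e. NVIDIA Hopper


def meets_sm90(capability: Tuple[int, int]) -> bool:
    """True if a (major, minor) compute capability is SM90 or newer."""
    return tuple(capability) >= MIN_CAPABILITY


def local_capabilities() -> List[Tuple[int, int]]:
    """Return (major, minor) per visible GPU, or [] without PyTorch/CUDA."""
    try:
        import torch
    except ImportError:
        return []
    if not torch.cuda.is_available():
        return []
    return [torch.cuda.get_device_capability(i)
            for i in range(torch.cuda.device_count())]


if __name__ == "__main__":
    for index, cap in enumerate(local_capabilities()):
        print(f"GPU {index}: sm_{cap[0]}{cap[1]} ok={meets_sm90(cap)}")
```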
# Node 0 (head)
ray start --head --port=7078 --dashboard-host=0.0.0.0
# Node 1 (multi-node only)
ray start --address <node0-ip>:7078

Batch generation without HTTP serving.

python dlengine/examples/non_disagg.py \
--model /models/Qwen3-235B-A22B \
--ray_address <node0-ip>:7078 \
--master_address <node0-ip>:6006 \
--attention_dp 8 --ffn_ep 8 \
--kvcache_block_size 256 \
--prompt "1+1=?" --max_tokens 128

redis-server --bind 0.0.0.0 --port 6379
dlslime-ctrl server --redis-url redis://127.0.0.1:6379

python dlengine/examples/disagg.py \
--model /models/Qwen3-235B-A22B \
--ray_address <node0-ip>:7078 \
--ctrl_address <node0-ip>:4479 \
--attention_dp 8 --ffn_ep 8 \
--prefill.master_address <node0-ip>:6006 \
--decode.master_address <node1-ip>:6006

For single-node hybrid deployment (prefill + decode in one process), use the
dlengine serve command. It runs the engine in-process and exposes an
OpenAI-compatible HTTP API directly, in the spirit of vllm serve — no
dlengine-router and no ZMQ engine servers required:
# Same Config flags as engine_server.py (--host/--port bind HTTP for serve)
dlengine serve /path/to/model \
--host 0.0.0.0 --port 8100 \
--served-model-name Qwen3-4B \
--ray_address 127.0.0.1:7078

Endpoints: GET /health, GET /v1/models, POST /v1/completions,
POST /v1/chat/completions (streaming and non-streaming).
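
For callers that prefer Python over curl, the chat endpoint can be reached with the standard library alone. This is a sketch under stated assumptions: the `stream` field and the `data: ...` / `[DONE]` line format follow the usual OpenAI conventions, which the server is assumed (not confirmed here) to mirror, and `build_chat_request` and `parse_sse_line` are hypothetical helper names.

```python
import json
import urllib.request
from typing import Optional


def build_chat_request(base_url: str, model: str, user_text: str,
                       stream: bool = False) -> urllib.request.Request:
    """Build (but do not send) a POST to /v1/chat/completions."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": user_text}],
        "stream": stream,
    }
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def parse_sse_line(line: str) -> Optional[dict]:
    """Decode one 'data: {...}' line; None for blanks, comments and [DONE]."""
    line = line.strip()
    if not line.startswith("data:"):
        return None
    body = line[len("data:"):].strip()
    if body == "[DONE]":
        return None
    return json.loads(body)
```

To send the request, pass the result to `urllib.request.urlopen(...)` and read the JSON body; for a streaming call, feed each decoded response line to `parse_sse_line`.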
curl http://127.0.0.1:8100/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{"model": "Qwen3-4B", "messages": [{"role": "user", "content": "Hello"}]}'

To make the node discoverable by a router (e.g. DLRouter) via dlslime-ctrl,
point it at a running control plane; the server then registers its HTTP
endpoint (entity kind dlengine) and keeps a heartbeat:
# Control plane (Redis + dlslime-ctrl)
redis-server --bind 0.0.0.0 --port 6379 &
dlslime-ctrl server --redis-url redis://127.0.0.1:6379 &
dlengine serve /path/to/model \
--host 0.0.0.0 --port 8100 \
--served-model-name Qwen3-4B \
--ctrl-address 127.0.0.1:4479

dlengine serve also supports Prefill-Decode disaggregation over HTTP.
Launch one engine with --mode prefill and another with --mode decode
(both pointed at the same dlslime-ctrl). They register their role with the
control plane and connect their PeerAgents for KV migration. A PD-aware
gateway such as DLRouter then
orchestrates the two-stage handoff: it asks the prefill node to process the
prompt and return an opaque KV migration payload (via kv_transfer_params),
hands that payload to the decode node, which RDMA-pulls the KV cache and
streams the completion, and finally releases the prefill-side KV blocks
(POST /pd/free).
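
The two-stage handoff described above can be sketched with in-memory stand-ins. This is not DLRouter's code and not the real wire format: `StubPrefillNode`, `StubDecodeNode`, `handle_request`, and the contents of the payload are hypothetical. Only the `kv_transfer_params` key and the final free step come from the description; the stubs replace HTTP calls and the RDMA pull with plain Python.

```python
from typing import Dict, List


class StubPrefillNode:
    """Stands in for the prefill server; tracks which KV blocks it holds."""

    def __init__(self) -> None:
        self.held: Dict[str, List[int]] = {}

    def prefill(self, request_id: str, prompt: str) -> dict:
        blocks = list(range(len(prompt.split())))  # pretend: one block per word
        self.held[request_id] = blocks
        # The gateway treats this payload as opaque and only forwards it.
        return {"kv_transfer_params": {"request_id": request_id, "blocks": blocks}}

    def free(self, request_id: str) -> None:
        # Plays the role of POST /pd/free.
        self.held.pop(request_id, None)


class StubDecodeNode:
    """Stands in for the decode server."""

    def decode(self, kv_transfer_params: dict, max_tokens: int) -> List[str]:
        # A real decode node would RDMA-pull the KV cache at this point.
        available = len(kv_transfer_params["blocks"])
        return [f"tok{i}" for i in range(min(max_tokens, available))]


def handle_request(prefill: StubPrefillNode, decode: StubDecodeNode,
                   request_id: str, prompt: str, max_tokens: int) -> List[str]:
    """Gateway logic: prefill, hand the payload to decode, then free."""
    payload = prefill.prefill(request_id, prompt)["kv_transfer_params"]
    try:
        return decode.decode(payload, max_tokens)
    finally:
        # Release prefill-side blocks even if decoding fails.
        prefill.free(request_id)
```

Releasing in a `finally` block is a design choice of this sketch: it keeps prefill-side KV blocks from leaking when the decode stage raises.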
# Control plane (Redis + dlslime-ctrl)
redis-server --bind 0.0.0.0 --port 6379 &
dlslime-ctrl server --redis-url redis://127.0.0.1:6379 &
# Prefill node
dlengine serve /path/to/model \
--host 0.0.0.0 --port 8101 \
--served-model-name Qwen3-4B \
--mode prefill \
--ctrl-address 127.0.0.1:4479 \
--ray_address 127.0.0.1:7078
# Decode node
dlengine serve /path/to/model \
--host 0.0.0.0 --port 8102 \
--served-model-name Qwen3-4B \
--mode decode \
--ctrl-address 127.0.0.1:4479 \
--ray_address 127.0.0.1:7078

DLRouter discovers both nodes (entity kind dlengine) via dlslime-ctrl,
maps their roles to PREFILL/DECODE, and serves an OpenAI-compatible API.
Point clients at DLRouter; requests transparently flow prefill → KV
migration → decode. When the prefill node fully answers a request on its own
(e.g. the first sampled token is EOS), the completion is returned directly
without a decode handoff. Prefill and decode engines may co-locate on the
same node when GPU resources allow.
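
The role mapping and the "prefill answered on its own" shortcut can be sketched as follows. The record fields (`mode`, `url`), the `finish_reason` check, and the round-robin pairing are assumptions made for illustration; DLRouter's actual data model and routing strategy may differ.

```python
from enum import Enum
from itertools import cycle
from typing import Dict, Iterable, Iterator, List, Tuple


class Role(Enum):
    PREFILL = "prefill"
    DECODE = "decode"


def group_by_role(nodes: Iterable[dict]) -> Dict[Role, List[str]]:
    """Map discovered node records to PREFILL/DECODE pools by their mode."""
    groups: Dict[Role, List[str]] = {Role.PREFILL: [], Role.DECODE: []}
    for node in nodes:
        try:
            role = Role(node["mode"])
        except ValueError:
            continue  # skip nodes in other modes
        groups[role].append(node["url"])
    return groups


def pair_round_robin(groups: Dict[Role, List[str]]) -> Iterator[Tuple[str, str]]:
    """Endlessly yield (prefill_url, decode_url) pairs; empty if a pool is empty."""
    return zip(cycle(groups[Role.PREFILL]), cycle(groups[Role.DECODE]))


def needs_decode(prefill_response: dict) -> bool:
    """False when prefill already finished the request (e.g. first token was EOS)."""
    return prefill_response.get("finish_reason") is None
```

A gateway following this sketch would take the next pair for each request, call the prefill node, and contact the decode node only when `needs_decode` returns True.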
ZMQ engine servers with OpenAI-compatible HTTP API via dlengine-router.
redis-server --bind 0.0.0.0 --port 6379
dlslime-ctrl server --redis-url redis://127.0.0.1:6379

cd dlengine-router && cargo run --release  # edit config.toml to set ctrl_address

# Terminal 1: Decode engine
python dlengine/dlengine/server/engine_server.py \
--model /models/Qwen3-235B-A22B \
--mode decode \
--ray_address <node0-ip>:7078 \
--ctrl_address <node0-ip>:4479 \
--ctrl_scope nanoctrl-0 \
--master_address <node1-ip>:6006 \
--host <node0-ip> --port 6001 \
--attention_dp 8 --ffn_ep 8 \
--kvcache_block_size 64 \
--max_num_batched_tokens 16384 --max_model_len 16384
# Terminal 2: Prefill engine
python dlengine/dlengine/server/engine_server.py \
--model /models/Qwen3-235B-A22B \
--mode prefill \
--ray_address <node0-ip>:7078 \
--ctrl_address <node0-ip>:4479 \
--ctrl_scope nanoctrl-0 \
--master_address <node0-ip>:6006 \
--host <node0-ip> --port 6002 \
--attention_dp 8 --ffn_ep 8 \
--kvcache_block_size 64 \
--max_num_batched_tokens 16384 --max_model_len 16384

curl http://<node0-ip>:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{"model": "/models/Qwen3-235B-A22B", "messages": [{"role": "user", "content": "Hello"}]}'

See individual component license.

- Issues: GitHub Issues
- Documentation: Check component READMEs