
Commit 94fb016

Permanently move links to well lit paths from old post
Signed-off-by: Pete Cheslock <[email protected]>
1 parent: 02052b9

File tree

1 file changed: +3, -3 lines

blog/2025-07-29_llm-d-v0.2-our-first-well-lit-paths.md

Lines changed: 3 additions & 3 deletions
@@ -27,9 +27,9 @@ Our deployments have been tested and benchmarked on recent GPUs, such as H200 no
 
 We’ve defined and improved three well-lit paths that form the foundation of this release:
 
-* [**Intelligent inference scheduling over any vLLM deployment**](https://github.com/llm-d-incubation/llm-d-infra/tree/main/quickstart/examples/inference-scheduling): support for precise prefix-cache aware routing with no additional infrastructure, out-of-the-box load-aware scheduling for better tail latency that “just works”, and a new configurable scheduling profile system enable teams to see immediate latency wins and still customize scheduling behavior for their workloads and infrastructure.
-* [**P/D disaggregation**:](https://github.com/llm-d-incubation/llm-d-infra/tree/main/quickstart/examples/pd-disaggregation) support for separating prefill and decode workloads to improve latency and GPU utilization for long-context scenarios.
-* [**Wide expert parallelism for DeepSeek R1 (EP/DP)**](https://github.com/llm-d-incubation/llm-d-infra/tree/main/quickstart/examples/wide-ep-lws): support for large-scale multi-node deployments using expert and data parallelism patterns for MoE models. This includes optimized deployments leveraging NIXL+UCX for inter-node communication, with fixes and improvements to reduce latency, and demonstrates the use of LeaderWorkerSet for Kubernetes-native inference orchestration.
+* [**Intelligent inference scheduling over any vLLM deployment**](https://github.com/llm-d/llm-d/tree/main/guides/inference-scheduling): support for precise prefix-cache aware routing with no additional infrastructure, out-of-the-box load-aware scheduling for better tail latency that “just works”, and a new configurable scheduling profile system enable teams to see immediate latency wins and still customize scheduling behavior for their workloads and infrastructure.
+* [**P/D disaggregation**:](https://github.com/llm-d/llm-d/tree/main/guides/pd-disaggregation) support for separating prefill and decode workloads to improve latency and GPU utilization for long-context scenarios.
+* [**Wide expert parallelism for DeepSeek R1 (EP/DP)**](https://github.com/llm-d/llm-d/tree/main/guides/wide-ep-lws): support for large-scale multi-node deployments using expert and data parallelism patterns for MoE models. This includes optimized deployments leveraging NIXL+UCX for inter-node communication, with fixes and improvements to reduce latency, and demonstrates the use of LeaderWorkerSet for Kubernetes-native inference orchestration.
 
 All of these scenarios are reproducible: we provide reference hardware specs, workloads, and benchmarking harness support, so others can evaluate, reproduce, and extend these benchmarks easily. This also reflects improvements to our deployment tooling and benchmarking framework, a new "machinery" that allows users to set up, test, and analyze these scenarios consistently.
 
0 commit comments