[Feature Request] Add Kubernetes-native runner for distributed inference benchmarking (llm-d) #1045

@cemigo114

Is your feature request related to a problem? Please describe.

Today, the runners/ directory is Slurm-centric for multi-node setups — e.g., launch_b200-dgxc-slurm.sh, launch_h100-dgxc-slurm.sh, launch_h200-dgxc-slurm.sh. Slurm is great for HPC-style clusters, but it's limiting for reproducing these benchmarks on the cloud-native stacks that most production LLM serving actually runs on: Kubernetes on EKS/GKE/AKS/OpenShift and on-prem K8s GPU fleets.

The repo already demonstrates disaggregated serving via NVIDIA Dynamo on Slurm (e.g., launch_gb200-nv.sh + PR #1008 for Kimi K2.5 NVFP4 GB200 disaggregated vLLM), so disaggregation itself is supported. The gap is K8s-native orchestration of the same patterns: disaggregated prefill/decode (P/D), KV-cache-aware routing, wide expert parallelism (wide-EP), and autoscaling. Without that, community users can't easily reproduce InferenceX results in their own K8s environments, and newer serving patterns that are first-class in K8s-native stacks are harder to cover.

Describe the solution you'd like

Add a first-class Kubernetes-native runner targeting llm-d as a reference, analogous to the existing Slurm runners. Concretely:

  • New runner(s) under runners/ (e.g., launch_b200-k8s-llmd.sh, launch_mi355x-k8s-llmd.sh) that stand up llm-d on a K8s cluster and drive benchmarks through the existing harness.
  • Reuse llm-d's upstream Helm charts and reproducible benchmark workflows (shipped in llm-d v0.5, Feb 2026), which already include validated B200 numbers (~3.1k tok/s per decode GPU on wide-EP; up to 50k output tok/s on a 16×16 B200 P/D topology). This minimizes new orchestration code on the InferenceX side.
  • Integration with benchmarks/ so K8s-native results are directly comparable to Slurm-based runs on the same metrics (TTFT, ITL, throughput, goodput, per-GPU utilization).
  • Support the serving patterns llm-d exposes natively: disaggregated prefill/decode via NIXL, KV-cache-aware inference scheduling via the Gateway API, wide-EP for MoE models (DeepSeek, Qwen3.5, gpt-oss), and tiered KV offload.
  • Docs for running InferenceX benchmarks on a K8s cluster (GB200 NVL72 / B200 / H100 / MI355X) using llm-d as the orchestration layer.
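To make "directly comparable" concrete, here is a minimal sketch of how the shared metrics could be computed from per-request token timestamps, independent of whether the run came from a Slurm or a K8s runner. The `RequestTrace` schema, the function names, and the SLO thresholds are hypothetical and not part of the existing harness. Goodput is taken here as requests per second that meet both a TTFT and a mean-ITL target, which is one common definition and may differ from what benchmarks/ uses.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class RequestTrace:
    """Timestamps in seconds for one streamed request (hypothetical schema)."""
    sent_at: float
    token_times: list[float]  # arrival time of each output token, ascending


def ttft(trace: RequestTrace) -> float:
    """Time to first token: first token arrival minus request send time."""
    return trace.token_times[0] - trace.sent_at


def mean_itl(trace: RequestTrace) -> float:
    """Mean inter-token latency: average gap between consecutive tokens."""
    gaps = [b - a for a, b in zip(trace.token_times, trace.token_times[1:])]
    return mean(gaps) if gaps else 0.0


def summarize(traces, num_gpus, ttft_slo_s, itl_slo_s):
    """Aggregate one benchmark run into runner-agnostic metrics."""
    start = min(t.sent_at for t in traces)
    end = max(t.token_times[-1] for t in traces)
    wall = end - start  # wall-clock span of the whole run
    total_tokens = sum(len(t.token_times) for t in traces)
    # Assumed goodput definition: requests meeting both SLOs, per second.
    good = [
        t for t in traces
        if ttft(t) <= ttft_slo_s and mean_itl(t) <= itl_slo_s
    ]
    return {
        "mean_ttft_s": mean(ttft(t) for t in traces),
        "mean_itl_s": mean(mean_itl(t) for t in traces),
        "output_tok_per_s": total_tokens / wall,
        "output_tok_per_s_per_gpu": total_tokens / wall / num_gpus,
        "goodput_req_per_s": len(good) / wall,
    }
```

If both runner types emit traces in the same shape, the same `summarize` call can be applied to each run, so any difference in the numbers reflects the serving stack and not the measurement code.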

Describe alternatives you've considered

  • Slurm-only (status quo): works for the current set of supported clusters, but limits reproducibility for the broader K8s-based community and makes it harder to benchmark K8s-native patterns (Gateway-API-based smart routing, HPA/VPA autoscaling, workload-variant autoscaler).
  • Raw Kubernetes Deployments/StatefulSets without llm-d: workable, but reinvents disaggregated serving, KV-cache-aware routing, and autoscaling that llm-d already provides on top of vLLM/SGLang.
  • Ray Serve / KServe / NVIDIA Dynamo on K8s: viable alternatives — could be added as additional K8s runners later. llm-d seems like a strong first target because it's purpose-built for distributed LLM inference, aligns with the vLLM/SGLang stack already used here, is Apache-2.0, and has an existing reproducible benchmark workflow that can be leveraged directly.

Additional context

  • llm-d: https://github.com/llm-d/llm-d — Kubernetes-native distributed inference stack with disaggregated P/D, KV-cache-aware scheduling, wide-EP, and native vLLM/SGLang support. Supported accelerators per their docs include NVIDIA A100+, AMD MI250+, Intel GPU Max, and Google TPU v5e+ — overlapping well with InferenceX's hardware coverage.
  • A K8s-native runner would also make it easier to onboard new accelerators/clouds without waiting for Slurm integration on each provider.
  • Happy to help prototype a runner if maintainers are interested and can point at a preferred starting cluster (B200 or MI355X).
