
[experimental chore/agentx-v0.2-aiperf-testing branch] agentic launcher: TOTAL_CPU_DRAM_GB=2000 hardcoded, OOMs on 1.5 TB MI355X nodes #1358

@andyluo7


Summary

benchmarks/single_node/agentic/minimaxm2.5_fp8_mi355x.sh hardcodes TOTAL_CPU_DRAM_GB=2000 inside the CPU offload branch, overriding any caller-provided value. On MI355X nodes with less than ~2 TB of host RAM (e.g. AAC1 cluster nodes have 1.5 TB), the requested offload buffer exceeds host memory, and one or more vLLM TP workers are OOM-killed during SimpleCPUOffloadConnector initialization.

Repro

On a 1.5 TB MI355X node (e.g. AAC1 smci355-ccs-aus-g12-*):

podman run ... \
  -e MODEL=MiniMaxAI/MiniMax-M2.5 -e TP=4 -e CONC=16 \
  -e OFFLOADING=cpu -e TOTAL_CPU_DRAM_GB=1200 \
  ... vllm/vllm-openai-rocm:nightly-51f22dcfd0... \
  /workspace/benchmarks/single_node/agentic/minimaxm2.5_fp8_mi355x.sh

Even though the environment passes TOTAL_CPU_DRAM_GB=1200, the launcher overwrites it with 2000. Each of the 4 TP workers then tries to allocate 2000 / 4 = 500 GB of pinned host memory, for a total of 2000 GB against 1500 GB available. Worker_TP2 dies during init, and EngineCore reports "Worker proc VllmWorker-2 died unexpectedly".
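The arithmetic above can be checked with a small sketch. The helper name check_offload_budget is hypothetical and not part of the launcher. It assumes the budget is split evenly across TP workers, as the log below suggests, and it ignores every other consumer of host RAM, so "fits" here is only a necessary condition.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not in the repo): compare a CPU offload budget to host RAM.
# Arguments: total offload budget in GB, TP size, host RAM in GB.
check_offload_budget() {
    local total_gb=$1 tp=$2 host_gb=$3
    # Assumption: the budget is divided evenly across TP workers.
    local per_worker_gb=$(( total_gb / tp ))
    echo "per-worker: ${per_worker_gb} GB, total: ${total_gb} GB, host: ${host_gb} GB"
    if (( total_gb > host_gb )); then
        echo "over budget by $(( total_gb - host_gb )) GB"
        return 1
    fi
    echo "fits"
}

# Hardcoded value on a 1.5 TB node: 500 GB per worker, 500 GB over budget.
check_offload_budget 2000 4 1500 || true
# Caller-provided value on the same node: 300 GB per worker, fits.
check_offload_budget 1200 4 1500
```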

Server log:

(Worker_TP3) INFO ... [worker.py:144] SimpleCPUOffloadWorker: 124 unique GPU KV tensors, allocating 528516 CPU blocks (500.00 GB)
(Worker_TP0) ...
(Worker_TP1) ...
(Worker_TP2) ...
(EngineCore) INFO ... [shm_broadcast.py:681] No available shared memory broadcast block found in 60 seconds.
(EngineCore) INFO ... [shm_broadcast.py:681] No available shared memory broadcast block found in 60 seconds.
(EngineCore) ERROR ... Worker proc VllmWorker-2 died unexpectedly, shutting down executor.

Suggested fix

Make the value respect a caller-provided env var, falling back to 2000 only when unset:

-        TOTAL_CPU_DRAM_GB=2000
+        # Respect env override; AAC1 MI355X nodes have only 1.5 TB.
+        TOTAL_CPU_DRAM_GB=${TOTAL_CPU_DRAM_GB:-2000}
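For reference, a minimal sketch of how the `${VAR:-default}` expansion in the patch behaves. The function name resolve_dram_gb is hypothetical and exists only for illustration. Note that the `:-` form also falls back when the variable is set but empty.

```shell
#!/usr/bin/env bash
# Hypothetical wrapper around the expansion used in the suggested patch.
resolve_dram_gb() {
    echo "${TOTAL_CPU_DRAM_GB:-2000}"
}

# Each case runs in a subshell so the assignments do not leak.
( unset TOTAL_CPU_DRAM_GB; resolve_dram_gb )   # prints 2000 (unset, default used)
( TOTAL_CPU_DRAM_GB=1200; resolve_dram_gb )    # prints 1200 (caller value kept)
( TOTAL_CPU_DRAM_GB=""; resolve_dram_gb )      # prints 2000 (empty, default used)
```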

After this patch, a run with TOTAL_CPU_DRAM_GB=1200 completes cleanly (verified with a full 30-minute CONC=16 sweep: 1116 requests, 1.88% error rate).

The same hardcoded assignment likely exists in minimaxm2.5_fp8_mi300x.sh and minimaxm2.5_fp8_mi325x.sh and should get the same fix.
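One way to confirm which launchers still carry the literal assignment is a grep over the directory. This is a sketch: the function name find_hardcoded_dram is hypothetical, and the pattern assumes the assignment sits alone on its line with a purely numeric value, as in the diff above.

```shell
#!/usr/bin/env bash
# Hypothetical helper: list lines that assign a literal number to
# TOTAL_CPU_DRAM_GB. A patched line (value starts with "${") does not match.
# Argument: directory holding the launcher scripts.
find_hardcoded_dram() {
    local dir=${1:-benchmarks/single_node/agentic}
    grep -HnE '^[[:space:]]*(export[[:space:]]+)?TOTAL_CPU_DRAM_GB=[0-9]+[[:space:]]*$' "$dir"/*.sh
}
```

Running `find_hardcoded_dram` from the repository root prints file:line:text for each remaining hardcode, and exits non-zero when none are found.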

Environment

  • Branch: chore/agentx-v0.2-aiperf-testing (tip c8dfb585)
  • Image: vllm/vllm-openai-rocm:nightly-51f22dcfd068fe8f1e3192da2a1e825b930223cf
  • Hardware: AAC1 MI355X partition 256C8G1H_MI355X_Ubuntu22, 1.5 TB RAM/node
