Code Review
This pull request removes the stage-b-cpu CI stage, migrating its tests to stage-a-cpu, and introduces comprehensive documentation for the CI system under docs/ci/, including stage definitions, label behaviors, and a contributor guide. The review feedback suggests minor improvements to the documentation: clarifying the naming pattern for CPU stages in 00-stage.md and using python3 instead of python in the contributor guide for consistency and compatibility.
- A `suite=` with no matching job never runs.
- A stage job whose suite no test uses runs zero tests and exits 0 (intended during incremental migration).

Stage names follow `stage-<tier>-<gpus>-<hw>` (e.g. `stage-c-4-gpu-h200`): `tier ∈ {a, b, c}` classifies cost/role, `gpus` is the GPU count the test needs, `hw ∈ {cpu, h100, h200}` is the hardware class.
The naming pattern stage-<tier>-<gpus>-<hw> does not perfectly fit CPU stages like stage-a-cpu because they omit the <gpus> count. Clarifying this in the pattern description makes the documentation more accurate.
Suggested change:
Stage names follow `stage-<tier>-<gpus>-<hw>` (e.g. `stage-c-4-gpu-h200`): `tier ∈ {a, b, c}` classifies cost/role, `gpus` is the GPU count the test needs, `hw ∈ {cpu, h100, h200}` is the hardware class.
Stage names follow `stage-<tier>-<gpus>-<hw>` (or `stage-<tier>-<hw>` for CPU, e.g. `stage-a-cpu`): `tier ∈ {a, b, c}` classifies cost/role, `gpus` is the GPU count the test needs, `hw ∈ {cpu, h100, h200}` is the hardware class.
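To make the two naming shapes concrete, here is a minimal sketch of a parser for them. It is not code from this PR: `parse_stage` and `Stage` are hypothetical names, and the literal `-gpu` after the count is an assumption inferred from the `stage-c-4-gpu-h200` example.

```python
import re
from typing import NamedTuple, Optional

# Assumption: the <gpus> segment is spelled "<count>-gpu", as in
# "stage-c-4-gpu-h200", and is omitted entirely for CPU stages.
STAGE_RE = re.compile(
    r"^stage-(?P<tier>[abc])"
    r"(?:-(?P<gpus>\d+)-gpu)?"
    r"-(?P<hw>cpu|h100|h200)$"
)


class Stage(NamedTuple):
    tier: str
    gpus: Optional[int]  # None for CPU stages
    hw: str


def parse_stage(name: str) -> Stage:
    """Split a stage name into tier, GPU count, and hardware class."""
    m = STAGE_RE.match(name)
    if m is None:
        raise ValueError(f"not a stage name: {name!r}")
    gpus = m.group("gpus")
    is_cpu = m.group("hw") == "cpu"
    # CPU stages must omit the GPU count; GPU stages must carry one.
    if is_cpu != (gpus is None):
        raise ValueError(f"GPU count and hardware class disagree: {name!r}")
    return Stage(m.group("tier"), int(gpus) if gpus else None, m.group("hw"))
```

Under these assumptions, `parse_stage("stage-a-cpu")` yields a `Stage` with `gpus=None`, while a name such as `stage-a-h100` is rejected because a GPU hardware class needs a count.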
stage-b-cpu was a separate ubuntu-latest job holding only three slower CPU tests (generate_hub single/multi-turn, pretokenized_via_tito). Running a whole extra runner — with its own checkout + dependency install — for three tests isn't worth the isolation it bought, so collapse them into stage-a-cpu's existing 4-shard lane and delete the job, its PER_COMMIT_SUITES entry, and its doc roster row. The three tests keep their explicit est_time (220/130/120s) so the LPT shard balancer still spreads them correctly. They now sit on the GPU-gating path (stage-a-cpu gates the GPU fleet), making the slowest a long-pole on its shard — accepted in exchange for dropping the redundant job.
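For readers unfamiliar with LPT (longest processing time first) balancing, the idea can be sketched as below. This is an illustration of the general greedy technique, not the repository's balancer: `lpt_shards` is a hypothetical name, the fast tests and their times are invented for the example, and pairing 220/130/120s with the three tests in listed order is an assumption.

```python
import heapq


def lpt_shards(est_times: dict[str, float], num_shards: int) -> list[list[str]]:
    """Greedy LPT: place each test, longest first, on the least-loaded shard."""
    # Min-heap of (current load, shard index); all shards start empty.
    heap = [(0.0, i) for i in range(num_shards)]
    shards: list[list[str]] = [[] for _ in range(num_shards)]
    # Sort by descending estimate; the name breaks ties deterministically.
    for name, t in sorted(est_times.items(), key=lambda kv: (-kv[1], kv[0])):
        load, idx = heapq.heappop(heap)
        shards[idx].append(name)
        heapq.heappush(heap, (load + t, idx))
    return shards


# Hypothetical inputs: three slow tests plus two invented fast ones.
example = {
    "generate_hub_single_turn": 220,
    "generate_hub_multi_turn": 130,
    "pretokenized_via_tito": 120,
    "fast_a": 30,
    "fast_b": 30,
}
assignment = lpt_shards(example, num_shards=4)
```

With four shards the three slow tests each land on their own shard, so the 220s test alone sets the wall-clock floor for its shard, which is the long-pole effect the commit message describes.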
The cu13 variant now builds linux/amd64 + linux/arm64 in one buildx run and pushes a single manifest; build.py gains a per-variant `platforms` field that drives `--platform`, and single-arch cu13-x86 / cu13-aarch64 / cu12-x86 replace the old primary / cu129-arm64 / cu13-arm64 / debug set. The Dockerfile drops the single WHEELS_TAG and instead picks WHEELS_TAG_X86 or WHEELS_TAG_ARM64 by TARGETARCH, so each arch in a multi-arch build installs its own wheels release. Docs cover the new variant table and the Dockerfile build-arg contract. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
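A rough sketch of how a per-variant `platforms` field could drive `--platform`, and how a wheels tag could be picked by `TARGETARCH`. Only the variant names, the `platforms` field, and the `WHEELS_TAG_X86` / `WHEELS_TAG_ARM64` split come from the commit message; the table contents, function names, and image name below are assumptions and do not reproduce the real `build.py` or Dockerfile.

```python
# Hypothetical variant table: the platform lists are inferred from the
# variant names, not copied from build.py.
VARIANTS = {
    "cu13": {"platforms": ["linux/amd64", "linux/arm64"]},
    "cu13-x86": {"platforms": ["linux/amd64"]},
    "cu13-aarch64": {"platforms": ["linux/arm64"]},
    "cu12-x86": {"platforms": ["linux/amd64"]},
}


def buildx_command(variant: str, image: str, push: bool = False,
                   context: str = ".") -> list[str]:
    """Assemble a `docker buildx build` argv for one variant."""
    platforms = VARIANTS[variant]["platforms"]
    cmd = ["docker", "buildx", "build",
           "--platform", ",".join(platforms),
           "-t", image]
    if push:
        # A multi-platform result is pushed as a single manifest list.
        cmd.append("--push")
    cmd.append(context)
    return cmd


def wheels_tag(targetarch: str, x86_tag: str, arm64_tag: str) -> str:
    """Mirror the Dockerfile's choice: one wheels tag per TARGETARCH value."""
    tags = {"amd64": x86_tag, "arm64": arm64_tag}
    if targetarch not in tags:
        raise ValueError(f"unsupported TARGETARCH: {targetarch!r}")
    return tags[targetarch]
```

Docker sets `TARGETARCH` to `amd64` or `arm64` separately for each platform in a multi-platform build, which is why a single Dockerfile can install a different wheels release per architecture.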
Reverts the earlier fold of stage-b-cpu into stage-a-cpu. Folding had moved the three slower CPU tests (generate_hub single/multi-turn, pretokenized_via_tito; 220/130/120s) onto stage-a-cpu's 4-shard lane, which gates the GPU fleet — so the slowest sat as a long-pole on the GPU-gating critical path. Restoring the separate stage-b-cpu job gives them a dedicated ubuntu-latest runner with no dependency, off the GPU gating path, at the cost of one extra runner. Re-adds the stage-b-cpu job in pr-test.yml, its PER_COMMIT_SUITES entry, the three tests' suite=, and the doc roster row + gating note in 00-stage.md. The job/doc now describe the bucket as holding slower CPU tests rather than the stale "reserved, currently empty".
…hanges The "Manual build & push" section in 02-docker-build.md said enabling commit-pinning from the workflow needs only "a small build.py change". It also needs new workflow_dispatch inputs in docker-build.yml — the dispatch currently exposes no commit/branch input, so a build.py passthrough alone would have nothing to read from. Reworded the sentence to state both required changes.
Commit 71811fd dropped sglang-router from requirements.txt, but miles/utils/arguments.py still imports sglang_router.launch_router at module top level, and tests/conftest.py loads that module for every CPU shard. pytest collection therefore failed with `ModuleNotFoundError: No module named 'sglang_router'` (exit 4), erroring all stage-a-cpu / stage-b-cpu jobs before any test ran. Restore the dep. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…latest Scheduled and Dockerfile-push builds produced only cu13, leaving the cu12.9 legacy image (radixark/miles:dev-cu12) stale until built by hand. Automatic runs now also build+push cu12-x86: a second build step gated on empty inputs.variant, a cu129 wheels fingerprint in the upstream-change gate, a latest-cu12 pointer, and an independent prune series for dev-cu12-<ts>. Also drop simulate_schedule from the latest-pointing step: moving the published latest tag mutates the registry like prune does, so a [DEBUG] dry-run must not do it — only the real cron now advances latest / latest-cu12. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The doc buried how the remote build is triggered, what tags move where, and who can push. Reorganize into automatic-vs-manual triggers, a tags/registry table, a "trigger a build yourself" recipe (web + gh CLI), and a push-auth section; fix the broken YAML frontmatter; sync it to the cu12-x86 automation. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The Docker build workflow now gates only real scheduled runs on upstream changes, so simulate_schedule remains a manual build that does not move latest or prune. It also polls the Megatron branch and wheels releases actually baked by the Dockerfiles, with the CI Docker docs and doc-dev sentinels updated to keep the contract explicit.
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
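The upstream-change gate depends on noticing that a wheels release changed even when its tag did not. One way to get that property is sketched below, assuming release assets are described by `name`, `size`, and `updated_at`; `release_fingerprint` is a hypothetical helper, and the real `check-upstream` step may compute its fingerprint differently.

```python
import hashlib
import json


def release_fingerprint(tag: str, assets: list[dict]) -> str:
    """Hash the tag plus per-asset metadata, independent of asset order."""
    # Assumption: each asset dict carries "name", "size", and "updated_at".
    rows = sorted((a["name"], a["size"], a["updated_at"]) for a in assets)
    payload = json.dumps({"tag": tag, "assets": rows}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def needs_rebuild(previous: str, tag: str, assets: list[dict]) -> bool:
    """True when the stored fingerprint no longer matches the release."""
    return release_fingerprint(tag, assets) != previous
```

Because the asset timestamps feed the hash, re-uploading a wheel to the same tag changes the fingerprint, whereas comparing tag names alone would miss it.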
By default CI fails fast on two levels:

- Cross-stage: GPU stages skip if `stage-a-cpu` fails (`!failure()` in their `if`).
01-label.md documents a `!failure()` gate that doesn't exist.
L58/L63 say the cross-stage gate is `!failure()`, but `grep '!failure()' pr-test.yml` finds nothing. The real gate (pr-test.yml:113-118) uses `needs.stage-a-cpu.result == 'success'`. These differ: `!failure()` is true when stage-a-cpu is skipped; `== 'success'` is not. The workflow is fine; just fix the doc.
Fix: docs/ci/01-label.md
L58:
Current: - Cross-stage: GPU stages skip if `stage-a-cpu` fails (`!failure()` in their `if`).
Proposed: - Cross-stage: GPU stages run only when `stage-a-cpu` succeeds: the `if` requires `needs.stage-a-cpu.result == 'success'`.
L63:
Current: - Cross-stage: GPU stages skip if `stage-a-cpu` fails (`!failure()` in their `if`).
Proposed: - Cross-stage: each GPU stage's check becomes `(needs.stage-a-cpu.result == 'success' || (needs.stage-a-cpu.result == 'failure' && contains(..., 'bypass-fastfail')))`, so GPU stages run even after `stage-a-cpu` fails.
from cc for your reference
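The difference between the two gates is easiest to see as a truth table. The sketch below is a simplified model of the expressions quoted in this comment, not an evaluation of the actual workflow: the function names are hypothetical, and the elided `contains(...)` argument is modeled as a plain set of PR labels.

```python
RESULTS = ("success", "failure", "cancelled", "skipped")


def documented_gate(result: str) -> bool:
    """Model of `!failure()`: run unless stage-a-cpu failed."""
    return result != "failure"


def actual_gate(result: str, labels: frozenset = frozenset()) -> bool:
    """Model of the explicit result check, with the bypass-fastfail escape."""
    if result == "success":
        return True
    # Assumption: the label lookup behaves like simple set membership.
    return result == "failure" and "bypass-fastfail" in labels


# Results on which the two gates disagree when no label is set.
disagreements = [r for r in RESULTS if documented_gate(r) != actual_gate(r)]
```

In this model the gates disagree for `skipped` and `cancelled`: `!failure()` would let GPU stages run, while the explicit `== 'success'` check would not.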
Reviewer feedback pointed out two doc mismatches: CPU stages omit the GPU-count segment, and GPU stage gating uses explicit stage-a-cpu result checks rather than !failure().
Summary
Document the CI system and make the docker build multi-arch (`cu13`: amd64 + arm64).

Motivation
The docker build carried four ad-hoc variants (`primary`, `cu129-arm64`, `cu13-arm64`, `debug`), with arm64 and CUDA-12.9 each built as a separate single-arch image, never a unified manifest. Nothing documented how a test reaches a CI stage, how labels gate a run, or which Dockerfile builds what. A standalone `release-docker.yaml` and stale `docker/{justfile,version.txt,README.md}` no longer matched how images are built.

Usage
New docs answer contributor questions directly:
- `docs/ci/contributor-guide.md` (add a test, read a red check)
- `docs/ci/00-stage.md` (suite→stage map)
- `docs/ci/02-docker-build.md` (variants plus build script)

Design Notes
- `TARGETARCH`: `cu13` runs one `buildx --platform linux/amd64,linux/arm64`; `Dockerfile` picks `WHEELS_TAG_X86` or `WHEELS_TAG_ARM64` per arch, installing each verbatim.
- `build.py` is the single source of truth: each variant pins its `platforms`, `tag_prefix`, optional `dockerfile` (ROCm), build-args; the `Dockerfile` stays variant-agnostic.
- `docker-build.yml` `check-upstream` fingerprints the `miles-wheels` release, so a re-upload to a fixed tag still triggers a rebuild.

Verification
- `python docker/build.py --variant cu13 --image-tag dev --push` produces one amd64 + arm64 manifest; run in CI by `docker-build.yml`.
- `requirements.txt` resolves without `sglang-router`; it ships in the `miles-wheels` release the image installs.

Review Focus
`doc-driven-principle` correctness and clarity