[doc, CI] doc driven CI and docker flow refactor, bugfix by guapisolo · Pull Request #1312 · radixark/miles

guapisolo · 2026-06-09T00:00:25Z

Summary

Document the CI system and make the docker build multi-arch (cu13: amd64 + arm64).

Motivation

The docker build carried four ad-hoc variants — primary, cu129-arm64, cu13-arm64, debug — with arm64 and CUDA-12.9 each built as a separate single-arch image, never a unified manifest. Nothing documented how a test reaches a CI stage, how labels gate a run, or which Dockerfile builds what. A standalone release-docker.yaml and stale docker/{justfile,version.txt,README.md} no longer matched how images are built.

Usage

# daily multi-arch image: amd64 + arm64 as one manifest
python docker/build.py --variant cu13 --image-tag dev --push
# single-arch / CUDA-12.9 legacy
python docker/build.py --variant cu13-x86 --image-tag dev --push
python docker/build.py --variant cu12-x86 --image-tag latest

New docs answer contributor questions directly: docs/ci/contributor-guide.md (add a test, read a red check), docs/ci/00-stage.md (suite→stage map), docs/ci/02-docker-build.md (variants plus build script).

Design Notes

Multi-arch by TARGETARCH: cu13 runs one buildx --platform linux/amd64,linux/arm64; Dockerfile picks WHEELS_TAG_X86 or WHEELS_TAG_ARM64 per arch, installing each verbatim.
build.py is the single source of truth: each variant pins its platforms, tag_prefix, optional dockerfile (ROCm), build-args; the Dockerfile stays variant-agnostic.
Rebuild trigger tracks wheels: docker-build.yml check-upstream fingerprints the miles-wheels release, so a re-upload to a fixed tag still triggers a rebuild.
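
A minimal sketch of how the design notes above could fit together, not the actual build.py. The variant names, the cu13 platform pair, and the WHEELS_TAG_X86 / WHEELS_TAG_ARM64 build-arg names come from this PR; the `Variant` class, the default Dockerfile path, the placeholder tag values, and both helper functions are assumptions for illustration. `tag_prefix` is left out because its format is not shown here.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Variant:
    """One build variant (hypothetical shape, not the real build.py schema)."""
    platforms: tuple[str, ...]
    dockerfile: str = "docker/Dockerfile"  # assumed default path
    build_args: dict[str, str] = field(default_factory=dict)


# Placeholder wheels tags; the real values live in build.py.
WHEELS = {"WHEELS_TAG_X86": "<x86-wheels-tag>", "WHEELS_TAG_ARM64": "<arm64-wheels-tag>"}

VARIANTS = {
    "cu13": Variant(("linux/amd64", "linux/arm64"), build_args=WHEELS),
    "cu13-x86": Variant(("linux/amd64",), build_args=WHEELS),
    "cu12-x86": Variant(("linux/amd64",), build_args=WHEELS),
}


def buildx_command(variant_name: str, image_ref: str, push: bool = False) -> list[str]:
    """Assemble a `docker buildx build` argv from the variant table."""
    v = VARIANTS[variant_name]
    cmd = ["docker", "buildx", "build",
           "--platform", ",".join(v.platforms),
           "-f", v.dockerfile,
           "-t", image_ref]
    for key, value in v.build_args.items():
        cmd += ["--build-arg", f"{key}={value}"]
    if push:
        cmd.append("--push")
    cmd.append(".")  # build context
    return cmd


def wheels_tag_for(targetarch: str, x86_tag: str, arm64_tag: str) -> str:
    """Mirror the per-arch choice the Dockerfile makes from TARGETARCH."""
    if targetarch == "amd64":
        return x86_tag
    if targetarch == "arm64":
        return arm64_tag
    raise ValueError(f"unsupported TARGETARCH: {targetarch!r}")


if __name__ == "__main__":
    print(" ".join(buildx_command("cu13", "example/miles:dev", push=True)))
```

Keeping platforms and build-args in one table is what lets the Dockerfile stay variant-agnostic: it only sees TARGETARCH (set by buildx to `amd64` or `arm64`) and the two wheels tags.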

Verification

No automated tests added; this is a build-infra plus documentation change.
Image path: python docker/build.py --variant cu13 --image-tag dev --push produces one amd64 + arm64 manifest; run in CI by docker-build.yml.
requirements.txt resolves without sglang-router — it ships in the miles-wheels release the image installs.

Review Focus

Correctness and clarity of the doc-driven principle.
Whether the docs and the code disagree anywhere.
Whether the current docker flow is easy to use.

gemini-code-assist

Code Review

This pull request removes the stage-b-cpu CI stage, migrating its tests to stage-a-cpu, and introduces comprehensive documentation for the CI system under docs/ci/, including stage definitions, label behaviors, and a contributor guide. The review feedback suggests minor improvements to the documentation: clarifying the naming pattern for CPU stages in 00-stage.md and using python3 instead of python in the contributor guide for consistency and compatibility.

gemini-code-assist · 2026-06-09T00:01:34Z

+- A `suite=` with no matching job never runs.
+- A stage job whose suite no test uses runs zero tests and exits 0 (intended during incremental migration).
+
+Stage names follow `stage-<tier>-<gpus>-<hw>` (e.g. `stage-c-4-gpu-h200`): `tier ∈ {a, b, c}` classifies cost/role, `gpus` is the GPU count the test needs, `hw ∈ {cpu, h100, h200}` is the hardware class.


The naming pattern stage-<tier>-<gpus>-<hw> does not perfectly fit CPU stages like stage-a-cpu because they omit the <gpus> count. Clarifying this in the pattern description makes the documentation more accurate.

Suggested change

-Stage names follow `stage-<tier>-<gpus>-<hw>` (e.g. `stage-c-4-gpu-h200`): `tier ∈ {a, b, c}` classifies cost/role, `gpus` is the GPU count the test needs, `hw ∈ {cpu, h100, h200}` is the hardware class.
+Stage names follow `stage-<tier>-<gpus>-<hw>` (or `stage-<tier>-<hw>` for CPU, e.g. `stage-a-cpu`): `tier ∈ {a, b, c}` classifies cost/role, `gpus` is the GPU count the test needs, `hw ∈ {cpu, h100, h200}` is the hardware class.
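
For reference, the two naming forms in the suggestion can be checked with a small parser. This is a sketch under assumptions: the `<n>-gpu-` spelling of the GPU segment is inferred from the single example `stage-c-4-gpu-h200`, and `Stage` / `parse_stage` are hypothetical names, not anything in the repo.

```python
import re
from typing import NamedTuple, Optional

# GPU form: stage-<tier>-<n>-gpu-<hw>; CPU form: stage-<tier>-cpu (no GPU count).
STAGE_RE = re.compile(
    r"^stage-(?P<tier>[abc])-"
    r"(?:(?P<gpus>\d+)-gpu-(?P<gpu_hw>h100|h200)|(?P<cpu_hw>cpu))$"
)


class Stage(NamedTuple):
    tier: str
    gpus: Optional[int]  # None for CPU stages
    hw: str


def parse_stage(name: str) -> Stage:
    """Split a stage name into tier, GPU count, and hardware class."""
    m = STAGE_RE.match(name)
    if m is None:
        raise ValueError(f"not a valid stage name: {name!r}")
    if m.group("cpu_hw"):
        return Stage(m.group("tier"), None, "cpu")
    return Stage(m.group("tier"), int(m.group("gpus")), m.group("gpu_hw"))


if __name__ == "__main__":
    for name in ("stage-c-4-gpu-h200", "stage-a-cpu"):
        print(name, parse_stage(name))
```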

stage-b-cpu was a separate ubuntu-latest job holding only three slower CPU tests (generate_hub single/multi-turn, pretokenized_via_tito). Running a whole extra runner — with its own checkout + dependency install — for three tests isn't worth the isolation it bought, so collapse them into stage-a-cpu's existing 4-shard lane and delete the job, its PER_COMMIT_SUITES entry, and its doc roster row. The three tests keep their explicit est_time (220/130/120s) so the LPT shard balancer still spreads them correctly. They now sit on the GPU-gating path (stage-a-cpu gates the GPU fleet), making the slowest a long-pole on its shard — accepted in exchange for dropping the redundant job.
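
To illustrate why the explicit est_time values matter, here is a generic longest-processing-time-first (LPT) balancer. It is a textbook sketch, not the repo's sharding code: `lpt_shards` and the test names in the example are hypothetical, and only the 220/130/120s estimates and the 4-shard count come from the commit message.

```python
import heapq


def lpt_shards(est_times: dict[str, float], num_shards: int) -> list[list[str]]:
    """Greedy LPT: place each test, slowest first, on the least-loaded shard."""
    # Min-heap of (current load in seconds, shard index).
    heap = [(0.0, i) for i in range(num_shards)]
    heapq.heapify(heap)
    shards: list[list[str]] = [[] for _ in range(num_shards)]
    # Sort by descending time; the name breaks ties so the result is deterministic.
    for name, seconds in sorted(est_times.items(), key=lambda kv: (-kv[1], kv[0])):
        load, idx = heapq.heappop(heap)
        shards[idx].append(name)
        heapq.heappush(heap, (load + seconds, idx))
    return shards


if __name__ == "__main__":
    # Three slow tests with the est_time values from the commit message,
    # plus six made-up 30s tests standing in for the rest of the suite.
    times = {"slow_220": 220, "slow_130": 130, "slow_120": 120}
    times.update({f"fast_{i}": 30 for i in range(6)})
    for i, shard in enumerate(lpt_shards(times, 4)):
        print(i, sum(times[t] for t in shard), shard)
```

With estimates in place the three slow tests end up on different shards, so the wall-clock time of the lane is bounded by the single slowest test rather than by an unlucky pairing.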

The cu13 variant now builds linux/amd64 + linux/arm64 in one buildx run and pushes a single manifest; build.py gains a per-variant `platforms` field that drives `--platform`, and single-arch cu13-x86 / cu13-aarch64 / cu12-x86 replace the old primary / cu129-arm64 / cu13-arm64 / debug set. The Dockerfile drops the single WHEELS_TAG and instead picks WHEELS_TAG_X86 or WHEELS_TAG_ARM64 by TARGETARCH, so each arch in a multi-arch build installs its own wheels release. Docs cover the new variant table and the Dockerfile build-arg contract. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Reverts the earlier fold of stage-b-cpu into stage-a-cpu. Folding had moved the three slower CPU tests (generate_hub single/multi-turn, pretokenized_via_tito; 220/130/120s) onto stage-a-cpu's 4-shard lane, which gates the GPU fleet — so the slowest sat as a long-pole on the GPU-gating critical path. Restoring the separate stage-b-cpu job gives them a dedicated ubuntu-latest runner with no dependency, off the GPU gating path, at the cost of one extra runner. Re-adds the stage-b-cpu job in pr-test.yml, its PER_COMMIT_SUITES entry, the three tests' suite=, and the doc roster row + gating note in 00-stage.md. The job/doc now describe the bucket as holding slower CPU tests rather than the stale "reserved, currently empty".

…hanges The "Manual build & push" section in 02-docker-build.md said enabling commit-pinning from the workflow needs only "a small build.py change". It also needs new workflow_dispatch inputs in docker-build.yml — the dispatch currently exposes no commit/branch input, so a build.py passthrough alone would have nothing to read from. Reworded the sentence to state both required changes.

Commit 71811fd dropped sglang-router from requirements.txt, but miles/utils/arguments.py still imports sglang_router.launch_router at module top level, and tests/conftest.py loads that module for every CPU shard. pytest collection therefore failed with `ModuleNotFoundError: No module named 'sglang_router'` (exit 4), erroring all stage-a-cpu / stage-b-cpu jobs before any test ran. Restore the dep. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…latest Scheduled and Dockerfile-push builds produced only cu13, leaving the cu12.9 legacy image (radixark/miles:dev-cu12) stale until built by hand. Automatic runs now also build+push cu12-x86: a second build step gated on empty inputs.variant, a cu129 wheels fingerprint in the upstream-change gate, a latest-cu12 pointer, and an independent prune series for dev-cu12-<ts>. Also drop simulate_schedule from the latest-pointing step: moving the published latest tag mutates the registry like prune does, so a [DEBUG] dry-run must not do it — only the real cron now advances latest / latest-cu12. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

The doc buried how the remote build is triggered, what tags move where, and who can push. Reorganize into automatic-vs-manual triggers, a tags/registry table, a "trigger a build yourself" recipe (web + gh CLI), and a push-auth section; fix the broken YAML frontmatter; sync it to the cu12-x86 automation. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

The Docker build workflow now gates only real scheduled runs on upstream changes, so simulate_schedule remains a manual build that does not move latest or prune. It also polls the Megatron branch and wheels releases actually baked by the Dockerfiles, with the CI Docker docs and doc-dev sentinels updated to keep the contract explicit.
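
One way the upstream-change gate could fingerprint a wheels release so that a re-upload under the same tag still counts as a change. This is a sketch only: the input dict follows the shape of a GitHub REST API release object (`tag_name`, `assets[].name`, `size`, `updated_at`), but which fields docker-build.yml actually hashes is an assumption, and `release_fingerprint` is a hypothetical name.

```python
import hashlib
import json


def release_fingerprint(release: dict) -> str:
    """Hash the tag plus per-asset metadata that changes on re-upload."""
    assets = sorted(
        (a["name"], a["size"], a["updated_at"]) for a in release.get("assets", [])
    )
    payload = json.dumps({"tag": release["tag_name"], "assets": assets}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


if __name__ == "__main__":
    before = {
        "tag_name": "example-tag",
        "assets": [{"name": "a.whl", "size": 10, "updated_at": "2026-06-01T00:00:00Z"}],
    }
    after = {
        "tag_name": "example-tag",
        "assets": [{"name": "a.whl", "size": 10, "updated_at": "2026-06-02T00:00:00Z"}],
    }
    # Same tag, re-uploaded asset: the fingerprints differ, so a rebuild would trigger.
    print(release_fingerprint(before) != release_fingerprint(after))
```

Comparing only the tag name would miss this case, which is the situation the design note calls out.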

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>

mintlify · 2026-06-17T01:46:19Z

Preview deployment for your docs. Learn more about Mintlify Previews.

Project	Status	Preview	Updated (UTC)
radixark	🟢 Ready	View Preview	Jun 17, 2026, 1:46 AM

Shi-Dong

Approve to unblock.

yushengsu-thu · 2026-06-17T09:39:41Z

+
+By default CI fails fast on two levels:
+
+- Cross-stage: GPU stages skip if `stage-a-cpu` fails (`!failure()` in their `if`).


01-label.md documents a !failure() gate that doesn't exist.
L58/L63 say the cross-stage gate is !failure(), but grep '!failure()' pr-test.yml finds nothing. The real gate (pr-test.yml:113-118) uses needs.stage-a-cpu.result == 'success'. These differ: !failure() is true when stage-a-cpu is skipped; == 'success' is not. Workflow is fine — just fix the doc.

Fix — docs/ci/01-label.md

L58:

Suggested change

-- Cross-stage: GPU stages skip if `stage-a-cpu` fails (`!failure()` in their `if`).
+- Cross-stage: GPU stages run only when `stage-a-cpu` succeeds — the `if` requires `needs.stage-a-cpu.result == 'success'`.

L63:

Suggested change

-- Cross-stage: GPU stages skip if `stage-a-cpu` fails (`!failure()` in their `if`).
+- Cross-stage: each GPU stage's check becomes `(needs.stage-a-cpu.result == 'success' || (needs.stage-a-cpu.result == 'failure' && contains(..., 'bypass-fastfail')))`, so GPU stages run even after `stage-a-cpu` fails.

from cc for your reference
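
The difference between the two gates is easiest to see as a truth table. The functions below are a simplified Python model, not workflow code: they treat `stage-a-cpu` as the only dependency, ignore cancellation, and the function names are made up for this sketch. The result strings are the job result values GitHub Actions reports.

```python
def not_failure_gate(cpu_result: str) -> bool:
    """Model of `!failure()`: true unless the upstream job failed."""
    return cpu_result != "failure"


def documented_gate(cpu_result: str, bypass_fastfail: bool = False) -> bool:
    """Model of the suggested wording: success, or failure with the bypass label."""
    return cpu_result == "success" or (cpu_result == "failure" and bypass_fastfail)


if __name__ == "__main__":
    for result in ("success", "failure", "skipped"):
        print(
            result,
            not_failure_gate(result),
            documented_gate(result),
            documented_gate(result, bypass_fastfail=True),
        )
```

The only row where the two disagree without the label is `skipped`: `!failure()` would let GPU stages run, while the explicit result check does not.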

Reviewer feedback pointed out two doc mismatches: CPU stages omit the GPU-count segment, and GPU stage gating uses explicit stage-a-cpu result checks rather than !failure().

guapisolo requested review from yueming-yuan and yushengsu-thu as code owners June 9, 2026 00:00

gemini-code-assist Bot reviewed Jun 9, 2026

Base automatically changed from bypass-fastfail to main June 9, 2026 00:59

guapisolo force-pushed the ci/doc branch from f96b483 to 730dbf0 Compare June 9, 2026 01:37

guapisolo added 13 commits June 12, 2026 23:13

stage classfiy

6a2aded

tmp

134a862

upd doc

cda941a

remove release docker

0e00bc2

02 doc 1st version

bad7a9f

fix doc

57618da

remove old

272c8c9

fix doc

8447128

upd docker tag

56e5699

doc dev fix

83cf501

fix doc

2652fcd

remove router and torch ft arm

71811fd

guapisolo force-pushed the ci/doc branch from 5ee2e90 to 71811fd Compare June 13, 2026 00:02

guapisolo and others added 5 commits June 15, 2026 01:51

doc first

f00a466

guapisolo changed the title ~~[doc, CI] doc driven CI~~ [doc, CI] doc driven CI and docker flow refactor Jun 15, 2026

guapisolo changed the title ~~[doc, CI] doc driven CI and docker flow refactor~~ [doc, CI] doc driven CI and docker flow refactor, bugfix Jun 15, 2026

guapisolo and others added 4 commits June 15, 2026 22:15

Update docs/ci/contributor-guide.md

eb16bff

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>

mintlify Bot deployed to staging - docs June 17, 2026 01:46

Shi-Dong approved these changes Jun 17, 2026

yushengsu-thu approved these changes Jun 17, 2026

Clarify CI stage naming and fast-fail gate docs

ce1d956

guapisolo merged commit 1daf707 into main Jun 17, 2026
15 checks passed

guapisolo deleted the ci/doc branch June 17, 2026 18:47

guapisolo merged 23 commits into main from ci/doc

3 participants
