[doc, CI] doc driven CI and docker flow refactor, bugfix #1312

Merged
guapisolo merged 23 commits into main from ci/doc on Jun 17, 2026

@guapisolo guapisolo commented Jun 9, 2026

Collaborator

Summary

Document the CI system and make the docker build multi-arch (cu13: amd64 + arm64).

Motivation

The docker build carried four ad-hoc variants — primary, cu129-arm64, cu13-arm64, debug — with arm64 and CUDA-12.9 each built as a separate single-arch image, never a unified manifest. Nothing documented how a test reaches a CI stage, how labels gate a run, or which Dockerfile builds what. A standalone release-docker.yaml and stale docker/{justfile,version.txt,README.md} no longer matched how images are built.

Usage

# daily multi-arch image: amd64 + arm64 as one manifest
python docker/build.py --variant cu13 --image-tag dev --push
# single-arch / CUDA-12.9 legacy
python docker/build.py --variant cu13-x86 --image-tag dev --push
python docker/build.py --variant cu12-x86 --image-tag latest

New docs answer contributor questions directly: docs/ci/contributor-guide.md (add a test, read a red check), docs/ci/00-stage.md (suite→stage map), docs/ci/02-docker-build.md (variants plus build script).

Design Notes

  • Multi-arch by TARGETARCH: cu13 runs one buildx --platform linux/amd64,linux/arm64; Dockerfile picks WHEELS_TAG_X86 or WHEELS_TAG_ARM64 per arch, installing each verbatim.
  • build.py is the single source of truth: each variant pins its platforms, tag_prefix, optional dockerfile (ROCm), build-args; the Dockerfile stays variant-agnostic.
  • Rebuild trigger tracks wheels: docker-build.yml check-upstream fingerprints the miles-wheels release, so a re-upload to a fixed tag still triggers a rebuild.
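The first two notes can be sketched roughly as follows. This is not the actual `docker/build.py`: the `Variant` class, `buildx_command`, and `pick_wheels_tag` are hypothetical names, the platform lists are inferred from the variant names in this PR, and tag composition from `tag_prefix` is left out because the PR does not spell it out. The Dockerfile-side selection is modeled in Python only to show the logic.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Variant:
    """Hypothetical per-variant record; field names follow the PR description."""
    platforms: tuple[str, ...]            # joined into buildx --platform
    dockerfile: str | None = None         # only set when a variant needs its own Dockerfile
    build_args: dict[str, str] = field(default_factory=dict)


# Assumed table: platforms are inferred from the variant names, not copied from build.py.
VARIANTS = {
    "cu13": Variant(platforms=("linux/amd64", "linux/arm64")),
    "cu13-x86": Variant(platforms=("linux/amd64",)),
    "cu13-aarch64": Variant(platforms=("linux/arm64",)),
    "cu12-x86": Variant(platforms=("linux/amd64",)),
}


def buildx_command(variant: str, tag: str, push: bool = False) -> list[str]:
    """Assemble one `docker buildx build` invocation for a variant."""
    v = VARIANTS[variant]
    cmd = ["docker", "buildx", "build", "--platform", ",".join(v.platforms)]
    if v.dockerfile is not None:
        cmd += ["-f", v.dockerfile]
    for key, value in sorted(v.build_args.items()):
        cmd += ["--build-arg", f"{key}={value}"]
    cmd += ["-t", tag]
    if push:
        cmd.append("--push")
    cmd.append(".")  # build context
    return cmd


def pick_wheels_tag(targetarch: str, x86_tag: str, arm64_tag: str) -> str:
    """Model of the per-arch choice the Dockerfile makes from TARGETARCH."""
    if targetarch == "amd64":
        return x86_tag
    if targetarch == "arm64":
        return arm64_tag
    raise ValueError(f"unsupported TARGETARCH: {targetarch}")
```

Keeping the platform list in the variant table is what lets a single Dockerfile serve both the multi-arch manifest and the single-arch builds.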

Verification

  • No automated tests added; this is a build-infra plus documentation change.
  • Image path: python docker/build.py --variant cu13 --image-tag dev --push produces one amd64 + arm64 manifest; run in CI by docker-build.yml.
  • requirements.txt resolves without sglang-router — it ships in the miles-wheels release the image installs.

Review Focus

  • doc-driven-principle correctness and clarity
  • Whether there is any mismatch between the docs and the code.
  • Whether the current docker flow is easy to use.

@gemini-code-assist gemini-code-assist Bot left a comment

Contributor

Code Review

This pull request removes the stage-b-cpu CI stage, migrating its tests to stage-a-cpu, and introduces comprehensive documentation for the CI system under docs/ci/, including stage definitions, label behaviors, and a contributor guide. The review feedback suggests minor improvements to the documentation: clarifying the naming pattern for CPU stages in 00-stage.md and using python3 instead of python in the contributor guide for consistency and compatibility.

Comment thread docs/ci/00-stage.md Outdated
- A `suite=` with no matching job never runs.
- A stage job whose suite no test uses runs zero tests and exits 0 (intended during incremental migration).

Stage names follow `stage-<tier>-<gpus>-<hw>` (e.g. `stage-c-4-gpu-h200`): `tier ∈ {a, b, c}` classifies cost/role, `gpus` is the GPU count the test needs, `hw ∈ {cpu, h100, h200}` is the hardware class.

Contributor

medium

The naming pattern stage-<tier>-<gpus>-<hw> does not perfectly fit CPU stages like stage-a-cpu because they omit the <gpus> count. Clarifying this in the pattern description makes the documentation more accurate.

Suggested change
Stage names follow `stage-<tier>-<gpus>-<hw>` (e.g. `stage-c-4-gpu-h200`): `tier ∈ {a, b, c}` classifies cost/role, `gpus` is the GPU count the test needs, `hw ∈ {cpu, h100, h200}` is the hardware class.
Stage names follow `stage-<tier>-<gpus>-<hw>` (or `stage-<tier>-<hw>` for CPU, e.g. `stage-a-cpu`): `tier ∈ {a, b, c}` classifies cost/role, `gpus` is the GPU count the test needs, `hw ∈ {cpu, h100, h200}` is the hardware class.
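To make the suggested wording concrete, here is a small parser sketch for the naming rule. `parse_stage` and `Stage` are hypothetical helpers, not part of the repo, and the literal `-gpu-` inside the GPU segment is an assumption drawn from the `stage-c-4-gpu-h200` example.

```python
import re
from typing import NamedTuple


class Stage(NamedTuple):
    tier: str   # "a", "b" or "c"
    gpus: int   # 0 for CPU stages
    hw: str     # "cpu", "h100" or "h200"


# GPU segment is optional so that CPU stages such as stage-a-cpu still match.
_STAGE_RE = re.compile(
    r"^stage-(?P<tier>[abc])-(?:(?P<gpus>\d+)-gpu-)?(?P<hw>cpu|h100|h200)$"
)


def parse_stage(name: str) -> Stage:
    """Split a stage name into tier, GPU count and hardware class."""
    m = _STAGE_RE.match(name)
    if m is None:
        raise ValueError(f"not a stage name: {name!r}")
    gpus, hw = m["gpus"], m["hw"]
    # CPU stages must omit the GPU count; GPU stages must carry one.
    if (hw == "cpu") != (gpus is None):
        raise ValueError(f"GPU count and hardware class disagree: {name!r}")
    return Stage(m["tier"], int(gpus) if gpus is not None else 0, hw)
```

Under these assumptions `parse_stage("stage-a-cpu")` yields a zero GPU count, while a GPU stage without a count is rejected.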

Comment thread docs/ci/contributor-guide.md Outdated
Base automatically changed from bypass-fastfail to main June 9, 2026 00:59
guapisolo added 13 commits June 12, 2026 23:13
stage-b-cpu was a separate ubuntu-latest job holding only three slower CPU
tests (generate_hub single/multi-turn, pretokenized_via_tito). Running a
whole extra runner — with its own checkout + dependency install — for three
tests isn't worth the isolation it bought, so collapse them into
stage-a-cpu's existing 4-shard lane and delete the job, its PER_COMMIT_SUITES
entry, and its doc roster row.

The three tests keep their explicit est_time (220/130/120s) so the LPT shard
balancer still spreads them correctly. They now sit on the GPU-gating path
(stage-a-cpu gates the GPU fleet), making the slowest a long-pole on its
shard — accepted in exchange for dropping the redundant job.
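For readers unfamiliar with LPT (longest processing time first), a minimal sketch of the greedy balancing idea follows. It is not the repo's balancer: `lpt_shards` is a hypothetical function, and the test names and the three fast tests in the usage note are made up for illustration.

```python
from __future__ import annotations

import heapq


def lpt_shards(est_times: dict[str, float], num_shards: int) -> list[list[str]]:
    """Assign tests to shards, longest first, always onto the lightest shard."""
    heap = [(0.0, i) for i in range(num_shards)]  # (current load, shard index)
    heapq.heapify(heap)
    shards: list[list[str]] = [[] for _ in range(num_shards)]
    # Sort by descending estimate; break ties by name for a stable result.
    for name, seconds in sorted(est_times.items(), key=lambda kv: (-kv[1], kv[0])):
        load, idx = heapq.heappop(heap)
        shards[idx].append(name)
        heapq.heappush(heap, (load + seconds, idx))
    return shards
```

With four shards and estimates of 220, 130 and 120 seconds plus a few short tests, the 220-second test ends up alone on its shard and sets the stage's wall time, which is the long-pole effect the commit message accepts.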
guapisolo and others added 5 commits June 15, 2026 01:51
The cu13 variant now builds linux/amd64 + linux/arm64 in one buildx run
and pushes a single manifest; build.py gains a per-variant `platforms`
field that drives `--platform`, and single-arch cu13-x86 / cu13-aarch64
/ cu12-x86 replace the old primary / cu129-arm64 / cu13-arm64 / debug
set. The Dockerfile drops the single WHEELS_TAG and instead picks
WHEELS_TAG_X86 or WHEELS_TAG_ARM64 by TARGETARCH, so each arch in a
multi-arch build installs its own wheels release. Docs cover the new
variant table and the Dockerfile build-arg contract.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Reverts the earlier fold of stage-b-cpu into stage-a-cpu. Folding had moved the
three slower CPU tests (generate_hub single/multi-turn, pretokenized_via_tito;
220/130/120s) onto stage-a-cpu's 4-shard lane, which gates the GPU fleet — so the
slowest sat as a long-pole on the GPU-gating critical path. Restoring the separate
stage-b-cpu job gives them a dedicated ubuntu-latest runner with no dependency,
off the GPU gating path, at the cost of one extra runner.

Re-adds the stage-b-cpu job in pr-test.yml, its PER_COMMIT_SUITES entry, the three
tests' suite=, and the doc roster row + gating note in 00-stage.md. The job/doc now
describe the bucket as holding slower CPU tests rather than the stale
"reserved, currently empty".
…hanges

The "Manual build & push" section in 02-docker-build.md said enabling
commit-pinning from the workflow needs only "a small build.py change". It also
needs new workflow_dispatch inputs in docker-build.yml — the dispatch currently
exposes no commit/branch input, so a build.py passthrough alone would have
nothing to read from. Reworded the sentence to state both required changes.
Commit 71811fd dropped sglang-router from requirements.txt, but
miles/utils/arguments.py still imports sglang_router.launch_router at
module top level, and tests/conftest.py loads that module for every CPU
shard. pytest collection therefore failed with
`ModuleNotFoundError: No module named 'sglang_router'` (exit 4), erroring
all stage-a-cpu / stage-b-cpu jobs before any test ran. Restore the dep.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@guapisolo guapisolo changed the title from "[doc, CI] doc driven CI" to "[doc, CI] doc driven CI and docker flow refactor" on Jun 15, 2026
@guapisolo guapisolo changed the title from "[doc, CI] doc driven CI and docker flow refactor" to "[doc, CI] doc driven CI and docker flow refactor, bugfix" on Jun 15, 2026
guapisolo and others added 4 commits June 15, 2026 22:15
…latest

Scheduled and Dockerfile-push builds produced only cu13, leaving the cu12.9
legacy image (radixark/miles:dev-cu12) stale until built by hand. Automatic
runs now also build+push cu12-x86: a second build step gated on empty
inputs.variant, a cu129 wheels fingerprint in the upstream-change gate, a
latest-cu12 pointer, and an independent prune series for dev-cu12-<ts>.

Also drop simulate_schedule from the latest-pointing step: moving the published
latest tag mutates the registry like prune does, so a [DEBUG] dry-run must not
do it — only the real cron now advances latest / latest-cu12.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The doc buried how the remote build is triggered, what tags move where, and who
can push. Reorganize into automatic-vs-manual triggers, a tags/registry table, a
"trigger a build yourself" recipe (web + gh CLI), and a push-auth section; fix the
broken YAML frontmatter; sync it to the cu12-x86 automation.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The Docker build workflow now gates only real scheduled runs on upstream changes, so simulate_schedule remains a manual build that does not move latest or prune. It also polls the Megatron branch and wheels releases actually baked by the Dockerfiles, with the CI Docker docs and doc-dev sentinels updated to keep the contract explicit.
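A rough sketch of the fingerprint-based gate described here, under stated assumptions: `release_fingerprint` and `should_build` are hypothetical names, and the asset fields (`name`, `size`, `updated_at`) are an assumed input shape rather than what `docker-build.yml` actually reads.

```python
from __future__ import annotations

import hashlib
import json


def release_fingerprint(assets: list[dict]) -> str:
    """Hash the asset metadata so a re-upload under the same tag changes the result."""
    canonical = sorted((a["name"], a["size"], a["updated_at"]) for a in assets)
    blob = json.dumps(canonical, separators=(",", ":")).encode("utf-8")
    return hashlib.sha256(blob).hexdigest()


def should_build(is_real_schedule: bool, previous: str | None, current: str) -> bool:
    """Only real scheduled runs are gated on an upstream change."""
    if not is_real_schedule:
        return True  # manual dispatch, including simulate_schedule, always builds
    return previous != current
```

Hashing metadata rather than the tag name is the point of the design: the tag can stay fixed while its contents move.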
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
@mintlify

mintlify Bot commented Jun 17, 2026

Preview deployment for your docs. Learn more about Mintlify Previews.

Project: radixark · Status: 🟢 Ready · Updated (UTC): Jun 17, 2026, 1:46 AM

@Shi-Dong Shi-Dong left a comment

Contributor

Approve to unblock.

Comment thread docs/ci/01-label.md Outdated

By default CI fails fast on two levels:

- Cross-stage: GPU stages skip if `stage-a-cpu` fails (`!failure()` in their `if`).

Collaborator

01-label.md documents a !failure() gate that doesn't exist.
▎ L58/L63 say the cross-stage gate is !failure(), but grep '!failure()' pr-test.yml finds
▎ nothing. The real gate (pr-test.yml:113-118) uses needs.stage-a-cpu.result == 'success'.
▎ These differ: !failure() is true when stage-a-cpu is skipped; == 'success' is not. Workflow
▎ is fine — just fix the doc.

Fix — docs/ci/01-label.md

L58:

Suggested change
- Cross-stage: GPU stages skip if `stage-a-cpu` fails (`!failure()` in their `if`).
- Cross-stage: GPU stages run only when `stage-a-cpu` succeeds — the `if` requires
`needs.stage-a-cpu.result == 'success'`.

L63:

Suggested change
- Cross-stage: GPU stages skip if `stage-a-cpu` fails (`!failure()` in their `if`).
- Cross-stage: each GPU stage's check becomes `(needs.stage-a-cpu.result == 'success' ||
(needs.stage-a-cpu.result == 'failure' && contains(..., 'bypass-fastfail')))`, so GPU stages
run even after `stage-a-cpu` fails.

from cc for your reference
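The difference between the two gates can be checked with a tiny truth-table model. This is a simplification, not GitHub Actions itself: it looks only at the result of `stage-a-cpu`, and the function names are hypothetical.

```python
def gate_not_failure(stage_a_result: str) -> bool:
    """Model of `!failure()` against a single upstream job."""
    return stage_a_result != "failure"


def gate_explicit(stage_a_result: str, bypass_fastfail: bool = False) -> bool:
    """Model of the explicit result check, with the bypass-fastfail label."""
    return stage_a_result == "success" or (
        stage_a_result == "failure" and bypass_fastfail
    )


def differing_results() -> list[str]:
    """Upstream results where the two gates disagree (no bypass label)."""
    return [
        result
        for result in ("success", "failure", "skipped")
        if gate_not_failure(result) != gate_explicit(result)
    ]
```

In this model the two gates disagree only for a skipped `stage-a-cpu`, which matches the point made in the review comment.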

Reviewer feedback pointed out two doc mismatches: CPU stages omit the GPU-count segment, and GPU stage gating uses explicit stage-a-cpu result checks rather than !failure().
@guapisolo guapisolo merged commit 1daf707 into main Jun 17, 2026
15 checks passed
@guapisolo guapisolo deleted the ci/doc branch June 17, 2026 18:47

3 participants