Add Flash-MoE SSD backend support #475

Merged
ndizazzo merged 1 commit into
Mesh-LLM:main from
IvGolovach:codex/flash-moe-ssd-backend
May 16, 2026

@IvGolovach
Collaborator

Summary

Users can now run Flash-MoE as a built-in mesh-llm backend for single-node SSD expert streaming: giant MoE models can live on local NVMe while mesh-llm handles backend lifecycle, endpoint discovery, and OpenAI-compatible routing.

Supported modes:

  • Managed process mode: mesh-llm starts and supervises the Flash-MoE infer binary, allocates the local serving port, and appends --serve <port> itself.
  • Existing endpoint mode: mesh-llm attaches to an already-running Flash-MoE /v1 endpoint and advertises it through the existing plugin inference path.
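
As a rough illustration of managed process mode, the sketch below shows one way a host could pick a free local port and append `--serve <port>` after the user's arguments. It is not the code from this PR: `pick_local_port`, `managed_args`, and the `--hypothetical-flag` argument are names invented for the example, and only the `--serve <port>` suffix comes from the description above.

```rust
use std::net::TcpListener;

/// Ask the OS for a currently free loopback port by binding to port 0.
/// The listener is dropped on return, so another process could in principle
/// take the port before the child binds it; a real supervisor has to
/// tolerate that race.
fn pick_local_port() -> std::io::Result<u16> {
    let listener = TcpListener::bind(("127.0.0.1", 0))?;
    Ok(listener.local_addr()?.port())
}

/// Build the child's argument list: user-supplied args first, then the
/// host-owned `--serve <port>` pair.
fn managed_args(user_args: &[String], port: u16) -> Vec<String> {
    let mut args = user_args.to_vec();
    args.push("--serve".to_string());
    args.push(port.to_string());
    args
}

fn main() -> std::io::Result<()> {
    let port = pick_local_port()?;
    // `--hypothetical-flag` is a placeholder, not a real Flash-MoE option.
    let args = managed_args(&["--hypothetical-flag".to_string()], port);
    println!("infer {}", args.join(" "));
    Ok(())
}
```

Keeping the port in the host's hands is what lets the host register the endpoint without the user having to repeat the port anywhere else.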

Why

This follows the roadmap direction for SSD expert streaming without pushing SSD streaming into llama.cpp internals and without changing Skippy, model packages, or mesh protocol compatibility.

Diff Scope

  • Added built-in flash-moe plugin support in mesh-llm-host-runtime.
  • Registered the plugin in built-in dispatch and plugin resolution.
  • Added plugin-scoped env wiring for built-in backend adapters.
  • Added config validation: exactly one of command or url, args only with command, and no user-provided --serve because mesh-llm owns the port.
  • Added docs in README.md, docs/plugins/flash-moe.md, docs/plugins/README.md, ROADMAP.md, and crates/mesh-llm/TODO.md.
  • Updated one stale host-runtime test expectation to match the existing dashboard-mode behavior.
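
The config validation rules listed above can be sketched as a small pure function. This is a minimal illustration under assumptions, not the validation code added in this PR: the `FlashMoeConfig` struct, the `ConfigError` variants, and the rejection of a `--serve=<port>` spelling are all hypothetical; only the three rules themselves come from the list.

```rust
/// Hypothetical config shape. The field names follow the PR text
/// (`command`, `url`, `args`); the struct itself is illustrative.
#[derive(Debug, Default)]
struct FlashMoeConfig {
    command: Option<String>,
    url: Option<String>,
    args: Vec<String>,
}

#[derive(Debug, PartialEq)]
enum ConfigError {
    NeitherCommandNorUrl,
    BothCommandAndUrl,
    ArgsWithoutCommand,
    UserProvidedServe,
}

fn validate(cfg: &FlashMoeConfig) -> Result<(), ConfigError> {
    // Rule 1: exactly one of `command` or `url`.
    match (&cfg.command, &cfg.url) {
        (None, None) => return Err(ConfigError::NeitherCommandNorUrl),
        (Some(_), Some(_)) => return Err(ConfigError::BothCommandAndUrl),
        _ => {}
    }
    // Rule 2: `args` only make sense when a managed command is configured.
    if cfg.command.is_none() && !cfg.args.is_empty() {
        return Err(ConfigError::ArgsWithoutCommand);
    }
    // Rule 3: the host owns the port, so the user must not pass `--serve`.
    // Also rejecting `--serve=<port>` is an assumption of this sketch.
    let has_serve = cfg
        .args
        .iter()
        .any(|a| a.as_str() == "--serve" || a.starts_with("--serve="));
    if has_serve {
        return Err(ConfigError::UserProvidedServe);
    }
    Ok(())
}

fn main() {
    let cfg = FlashMoeConfig {
        command: Some("infer".to_string()),
        url: None,
        args: vec!["--serve".to_string(), "9000".to_string()],
    };
    println!("{:?}", validate(&cfg));
}
```

Returning a distinct error per rule makes it straightforward to attach a specific hint to each misconfiguration.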

Architecture / Protocol

Flash-MoE is integrated as a host-runtime plugin and registers an OpenAI-compatible endpoint with endpoint id flash-moe.

No mesh wire protocol, protobuf, ALPN, gossip schema, Skippy stage protocol, model-package format, or llama.cpp patch queue changes are included.

Branch / Commit Integrity

  • Base branch: main
  • Validated base SHA: 8d12c0be26fb3af4ed309fde6df65acfabff0162
  • git rev-list --left-right --count origin/main...HEAD: 0 1
  • Merge-base SHA: 8d12c0be26fb3af4ed309fde6df65acfabff0162
  • Introduced commit: d923e037bd7ffb38f632307228d64d61c4eb0640 Add Flash-MoE SSD backend plugin

Validation

Validation mode: Tier 3 — shared backend/runtime integration.

Local proof:

  • cargo fmt --all -- --check: PASS
  • LLAMA_STAGE_BUILD_DIR=.deps/llama.cpp/build-stage-abi-metal cargo test -p mesh-llm-host-runtime flash_moe: PASS, 10 passed, 0 failed
  • LLAMA_STAGE_BUILD_DIR=.deps/llama.cpp/build-stage-abi-metal cargo test -p mesh-llm-host-runtime --lib: PASS, 1242 passed, 0 failed, 5 ignored
  • LLAMA_STAGE_BUILD_DIR=.deps/llama.cpp/build-stage-abi-metal cargo check -p mesh-llm: PASS
  • /tmp/mesh-llm-just-tool/bin/just build: PASS
  • git diff --check origin/main...HEAD: PASS, no output

Remote CI is pending until this PR is opened. This PR should not be considered merge-ready until required GitHub checks pass on the final PR SHA.

Runtime Safety

  • No new mesh-wide blocking locks.
  • No new unbounded queues or background buffering paths.
  • Managed Flash-MoE process ownership is scoped to the plugin lifecycle.
  • Health probing uses a bounded timeout and keeps warm-up states non-fatal unless the child process exits.
  • No mesh protocol invariant is removed.

No invariant regression introduced.
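
The health-probing rule above (bounded timeout, warm-up non-fatal, child exit fatal) can be sketched as follows. This is an assumption-laden illustration rather than the plugin's actual probe: the `Health` enum, `classify`, and `probe_once` are hypothetical names, and a plain TCP connect with a timeout stands in for whatever HTTP check the real adapter performs.

```rust
use std::net::{SocketAddr, TcpStream};
use std::process::Child;
use std::time::Duration;

#[derive(Debug, PartialEq)]
enum Health {
    Ready,
    WarmingUp,
    Failed(String),
}

/// Pure decision rule: a dead child is the only fatal case; an unreachable
/// endpoint with a live child is treated as warm-up.
fn classify(child_exit: Option<String>, endpoint_reachable: bool) -> Health {
    match (child_exit, endpoint_reachable) {
        (Some(reason), _) => Health::Failed(reason),
        (None, true) => Health::Ready,
        (None, false) => Health::WarmingUp,
    }
}

/// One bounded probe. `connect_timeout` caps how long a single attempt can
/// block; a TCP connect is only a stand-in for a real health request.
#[allow(dead_code)]
fn probe_once(child: &mut Child, addr: SocketAddr, timeout: Duration) -> Health {
    let child_exit = match child.try_wait() {
        Ok(Some(status)) => Some(format!("child exited: {status}")),
        Ok(None) => None,
        Err(err) => Some(format!("could not poll child: {err}")),
    };
    // Skip the network probe entirely once the child is known to be gone.
    let reachable =
        child_exit.is_none() && TcpStream::connect_timeout(&addr, timeout).is_ok();
    classify(child_exit, reachable)
}

fn main() {
    // Live child, endpoint not yet listening: keep waiting.
    println!("{:?}", classify(None, false));
    // Child gone: report failure regardless of the endpoint state.
    println!("{:?}", classify(Some("child exited".to_string()), true));
}
```

Separating the decision rule from the I/O keeps the fatal-versus-warm-up policy testable without spawning a process.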

Rollback Plan

Rollback: git revert <post_merge_commit_sha>.

DB downgrade: not applicable. Data repair: not applicable. Operational caveat: rollback removes the built-in Flash-MoE adapter and restores the previous plugin/backend surface.

Known Residual Risks

  • A real-hardware Flash-MoE smoke test with Qwen3.5-397B was not run locally because the required Flash-MoE binary and model artifacts are not available in this environment.
  • This intentionally does not automate Flash-MoE artifact preparation or model conversion; those remain separate follow-up work.

@IvGolovach IvGolovach force-pushed the codex/flash-moe-ssd-backend branch from a348483 to 4d46085 Compare May 8, 2026 21:30
@i386 i386 self-assigned this May 8, 2026
@i386 i386 self-requested a review May 8, 2026 21:55
@ndizazzo ndizazzo assigned IvGolovach and unassigned i386 May 9, 2026
Comment thread .github/workflows/ci.yml Outdated
else
echo "ready=false" >> "$GITHUB_OUTPUT"
fi
- name: Bootstrap rustup
Collaborator

Curious why this and bash are needed for CI? I guess it is for testing the plugin? Seems OK then.

Comment thread ROADMAP.md Outdated
Today mesh-llm has two MoE modes: **solo** (model fits in memory, run it whole) and **split** (model doesn't fit, shard experts across nodes). SSD streaming would be a third mode: model doesn't fit in memory but *does* fit on one node's SSD. No mesh coordination, no cross-node traffic, no splitting — just one machine streaming experts from disk.

**Plan:** Use flash-moe directly as an alternative backend, not hack SSD streaming into llama.cpp. llama.cpp's `ggml_mul_mat_id` assumes all expert weights resident in one contiguous tensor — changing that is deep surgery across ggml, the Metal backend, and the model loader. Flash-moe is a working engine. Mesh-llm spawns it like it spawns llama-server — process management + HTTP wrapper.
**Plan:** Use flash-moe directly as an alternative backend, not hack SSD streaming into llama.cpp. llama.cpp's `ggml_mul_mat_id` assumes all expert weights resident in one contiguous tensor — changing that is deep surgery across ggml, the Metal backend, and the model loader. Flash-moe is a working engine. Mesh-llm integrates it through a built-in `flash-moe` plugin: process management, OpenAI-compatible endpoint registration, model discovery, and routing.
Collaborator

could probably drop this now as this implements it?

@michaelneale
Collaborator

Yeah, this is a pretty good use of plugins; I think it makes sense as a plugin. It would be good to round out the install experience (and make it clear that Flash-MoE has to be installed separately, right?).

The Flash-MoE repo hasn't had any commits in 2 months, however, so I wonder whether that repo is still alive or has moved somewhere else. I could see a lot of people benefiting from it, assuming Flash-MoE is solid enough.

@i386
Collaborator

i386 commented May 9, 2026

Do we publish enough to crates.io that it could live in its own repo?

@IvGolovach IvGolovach requested a review from michaelneale May 9, 2026 19:22
IvGolovach added a commit to IvGolovach/mesh-llm that referenced this pull request May 11, 2026
Validation
* Validation tier: Tier 4 — CI workflow correction, because this resolves the PR Mesh-LLM#475 ROCm workflow rebase conflict without reintroducing the removed llama cache-hit topology.
* git diff --check: PASS
* git diff --cached --check: PASS
* ruby -e 'require "yaml"; YAML.load_file(".github/workflows/ci.yml")': PASS
* cargo fmt --all -- --check: PASS
* LLAMA_STAGE_BUILD_DIR=.deps/llama.cpp/build-stage-abi-metal cargo check -p mesh-llm-host-runtime: PASS
* LLAMA_STAGE_BUILD_DIR=.deps/llama.cpp/build-stage-abi-metal cargo check -p mesh-llm: PASS
* LLAMA_STAGE_BUILD_DIR=.deps/llama.cpp/build-stage-abi-metal cargo test -p mesh-llm --lib: PASS
* Ledger: not applicable — not required for selected validation tier/change family.
* Version: not applicable — not required for selected validation tier/change family.
* Not run: full local GitHub Actions matrix — not required locally for selected validation tier; required remote CI will rerun on the pushed PR SHA.

Rollback
* git revert HEAD
@IvGolovach IvGolovach force-pushed the codex/flash-moe-ssd-backend branch 2 times, most recently from 8e73b4b to 1a10816 Compare May 13, 2026 02:14
@IvGolovach
Collaborator Author

I refreshed this against current main and narrowed the PR back to the Flash-MoE adapter itself. The unrelated CI changes are gone, the README/runtime conflicts are resolved, and the old review threads are now outdated. Local Flash-MoE tests pass and the full GitHub matrix is green now. Thanks again for the guidance here — I think this is a much cleaner review unit.

Validation
* Validation tier: Tier 2R - post-review conflict/base refresh of an existing shared runtime integration PR; the manual conflict scope was README.md, with targeted Flash-MoE/runtime checks rerun on the final rebased diff.
* git fetch --no-tags origin main:refs/remotes/origin/main: PASS
* git rebase origin/main: PASS, resolved conflict in README.md.
* git diff --check origin/main...HEAD: PASS
* git diff --cached --check: PASS
* cargo fmt --all -- --check: PASS
* LLAMA_STAGE_BUILD_DIR=/Users/Funtland/Downloads/mesh-llm/.deps/llama-build/build-stage-abi-metal rustup run stable cargo test -p mesh-llm-host-runtime flash_moe --lib: PASS, 13 passed, 0 failed
* LLAMA_STAGE_BUILD_DIR=/Users/Funtland/Downloads/mesh-llm/.deps/llama-build/build-stage-abi-metal rustup run stable cargo check -p mesh-llm: PASS
* Ledger: not applicable - not required for selected validation tier/change family.
* Version: not applicable - not required for selected validation tier/change family.
* Not run: just build - not required for this conflict/base-refresh tier; targeted Rust checks covered the affected runtime paths and GitHub CI will rerun the final PR SHA.
* Not run: full cargo test -p mesh-llm-host-runtime --lib - not required for selected tier; targeted Flash-MoE tests covered the changed plugin/runtime path.

Rollback
* git revert HEAD
@IvGolovach IvGolovach force-pushed the codex/flash-moe-ssd-backend branch from 1a10816 to a673062 Compare May 14, 2026 05:58
@ndizazzo ndizazzo self-requested a review May 16, 2026 00:00
Collaborator

@ndizazzo ndizazzo left a comment

@IvGolovach Agent caught one thing:

Medium: Flash-MoE managed backend startup failures happen after the plugin runtime already sends InitializeResponse, so missing/bad commands can look like plugin crash/restart loops instead of clean startup failures with install hints.

This wouldn't happen in all cases, so marking this approved and you can fast-follow with anything to address it.

@ndizazzo ndizazzo merged commit 2ffdc48 into Mesh-LLM:main May 16, 2026
21 checks passed
@ndizazzo
Collaborator

@IvGolovach fire up a PR for the results in this comment, trying to get our PR backlog down

4 participants