Add Flash-MoE SSD backend support by IvGolovach · Pull Request #475 · Mesh-LLM/mesh-llm

IvGolovach · 2026-05-08T20:40:20Z

Summary

Users can now run Flash-MoE as a built-in mesh-llm backend for single-node SSD expert streaming: giant MoE models can live on local NVMe while mesh-llm handles backend lifecycle, endpoint discovery, and OpenAI-compatible routing.

Supported modes:

Managed process mode: mesh-llm starts and supervises the Flash-MoE infer binary, allocates the local serving port, and appends --serve <port> itself.
Existing endpoint mode: mesh-llm attaches to an already-running Flash-MoE /v1 endpoint and advertises it through the existing plugin inference path.
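The two modes can be sketched in a few lines of Rust. This is a minimal illustration, not mesh-llm's code: the `FlashMoeMode` enum, `allocate_local_port`, and `managed_argv` are hypothetical names. The sketch shows one common way to get a free local port (bind to port 0 and read back the assigned port) and to append `--serve <port>` after the user's arguments. `model-dir` is a placeholder argument, not a real Flash-MoE flag.

```rust
use std::net::TcpListener;

/// Hypothetical model of the two Flash-MoE modes (names are illustrative).
#[derive(Debug, Clone, PartialEq)]
pub enum FlashMoeMode {
    /// mesh-llm spawns and supervises the Flash-MoE `infer` binary.
    Managed { command: String, args: Vec<String> },
    /// mesh-llm attaches to an already-running `/v1` endpoint.
    Existing { url: String },
}

/// Ask the OS for a free loopback port by binding to port 0.
/// The listener is dropped on return, so another process could in
/// principle grab the port before the child binds it.
pub fn allocate_local_port() -> std::io::Result<u16> {
    let listener = TcpListener::bind("127.0.0.1:0")?;
    Ok(listener.local_addr()?.port())
}

/// Build the managed-mode argv: command, user args, then `--serve <port>`.
/// Returns `None` in existing-endpoint mode, where nothing is spawned.
pub fn managed_argv(mode: &FlashMoeMode, port: u16) -> Option<Vec<String>> {
    match mode {
        FlashMoeMode::Managed { command, args } => {
            let mut argv = Vec::with_capacity(args.len() + 3);
            argv.push(command.clone());
            argv.extend(args.iter().cloned());
            // mesh-llm owns the port, so `--serve` is always appended last.
            argv.push("--serve".to_string());
            argv.push(port.to_string());
            Some(argv)
        }
        FlashMoeMode::Existing { .. } => None,
    }
}
```

In managed mode the resulting argv could then be handed to `std::process::Command::new(&argv[0]).args(&argv[1..])`; in existing-endpoint mode only the URL is recorded.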

Why

This follows the roadmap direction for SSD expert streaming without pushing SSD streaming into llama.cpp internals and without changing Skippy, model packages, or mesh protocol compatibility.

Diff Scope

Added built-in flash-moe plugin support in mesh-llm-host-runtime.
Registered the plugin in built-in dispatch and plugin resolution.
Added plugin-scoped env wiring for built-in backend adapters.
Added config validation: exactly one of command or url must be set, args are allowed only with command, and a user-provided --serve is rejected because mesh-llm owns the port.
Added docs in README.md, docs/plugins/flash-moe.md, docs/plugins/README.md, ROADMAP.md, and crates/mesh-llm/TODO.md.
Updated one stale host-runtime test expectation to match the existing dashboard-mode behavior.
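The config validation rules listed above can be sketched as a small Rust function. The `FlashMoeConfig` struct, its field names, and the `ConfigError` variants are hypothetical; rejecting the `--serve=<port>` spelling in addition to a bare `--serve` is an assumption about how strict the real check is.

```rust
/// Hypothetical plugin config shape; field names are illustrative.
#[derive(Debug, Default)]
pub struct FlashMoeConfig {
    pub command: Option<String>,
    pub url: Option<String>,
    pub args: Vec<String>,
}

#[derive(Debug, PartialEq)]
pub enum ConfigError {
    NeitherCommandNorUrl,
    BothCommandAndUrl,
    ArgsWithoutCommand,
    UserProvidedServe,
}

pub fn validate(cfg: &FlashMoeConfig) -> Result<(), ConfigError> {
    // Exactly one of `command` (managed mode) or `url` (existing endpoint).
    match (&cfg.command, &cfg.url) {
        (None, None) => return Err(ConfigError::NeitherCommandNorUrl),
        (Some(_), Some(_)) => return Err(ConfigError::BothCommandAndUrl),
        _ => {}
    }
    // `args` only make sense when mesh-llm spawns the process itself.
    if cfg.command.is_none() && !cfg.args.is_empty() {
        return Err(ConfigError::ArgsWithoutCommand);
    }
    // mesh-llm owns the serving port, so reject `--serve` and `--serve=<port>`.
    if cfg.args.iter().any(|a| a == "--serve" || a.starts_with("--serve=")) {
        return Err(ConfigError::UserProvidedServe);
    }
    Ok(())
}
```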

Architecture / Protocol

Flash-MoE is integrated as a host-runtime plugin and registers an OpenAI-compatible endpoint with endpoint id flash-moe.
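As a rough illustration of what registering an OpenAI-compatible endpoint under the id `flash-moe` involves, here is a hypothetical in-memory registry. `EndpointRegistry` and its methods are invented for this sketch; `/models` and `/chat/completions` are standard OpenAI API paths, and whether Flash-MoE serves each of them is an assumption.

```rust
use std::collections::HashMap;

/// Endpoint id used by the Flash-MoE plugin (from the PR description).
pub const FLASH_MOE_ENDPOINT_ID: &str = "flash-moe";

/// Hypothetical registry: endpoint id -> OpenAI-compatible base URL.
#[derive(Default)]
pub struct EndpointRegistry {
    endpoints: HashMap<String, String>,
}

impl EndpointRegistry {
    /// Register (or replace) an endpoint. Trailing slashes are trimmed so
    /// paths can be joined consistently.
    pub fn register(&mut self, id: &str, base_url: &str) {
        self.endpoints
            .insert(id.to_string(), base_url.trim_end_matches('/').to_string());
    }

    /// Build the full URL for an OpenAI-style path such as "/chat/completions".
    pub fn url_for(&self, id: &str, path: &str) -> Option<String> {
        self.endpoints
            .get(id)
            .map(|base| format!("{}/{}", base, path.trim_start_matches('/')))
    }
}
```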

No mesh wire protocol, protobuf, ALPN, gossip schema, Skippy stage protocol, model-package format, or llama.cpp patch queue changes are included.

Branch / Commit Integrity

Base branch: main
Validated base SHA: 8d12c0be26fb3af4ed309fde6df65acfabff0162
git rev-list --left-right --count origin/main...HEAD: 0 1
Merge-base SHA: 8d12c0be26fb3af4ed309fde6df65acfabff0162
Introduced commit: d923e037bd7ffb38f632307228d64d61c4eb0640 Add Flash-MoE SSD backend plugin

Validation

Validation mode: Tier 3 — shared backend/runtime integration.

Local proof:

cargo fmt --all -- --check: PASS
LLAMA_STAGE_BUILD_DIR=.deps/llama.cpp/build-stage-abi-metal cargo test -p mesh-llm-host-runtime flash_moe: PASS, 10 passed, 0 failed
LLAMA_STAGE_BUILD_DIR=.deps/llama.cpp/build-stage-abi-metal cargo test -p mesh-llm-host-runtime --lib: PASS, 1242 passed, 0 failed, 5 ignored
LLAMA_STAGE_BUILD_DIR=.deps/llama.cpp/build-stage-abi-metal cargo check -p mesh-llm: PASS
/tmp/mesh-llm-just-tool/bin/just build: PASS
git diff --check origin/main...HEAD: PASS, no output

Remote CI is pending until this PR is opened. This PR should not be considered merge-ready until required GitHub checks pass on the final PR SHA.

Runtime Safety

No new mesh-wide blocking locks.
No new unbounded queues or background buffering paths.
Managed Flash-MoE process ownership is scoped to the plugin lifecycle.
Health probing uses a bounded timeout and keeps warm-up states non-fatal unless the child process exits.
No mesh protocol invariant is removed.

No invariant regression introduced.
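The health-probing rule above (bounded timeout, warm-up is non-fatal, child exit is fatal) can be sketched as follows. The probe is a plain TCP connect with `TcpStream::connect_timeout`, a simplification of an HTTP health check. `BackendState`, `ChildStatus`, `probe`, and `classify` are hypothetical names; `ChildStatus::Exited` stands in for what `std::process::Child::try_wait` reports once the managed process has exited.

```rust
use std::net::{SocketAddr, TcpStream};
use std::time::Duration;

#[derive(Debug, PartialEq)]
pub enum BackendState {
    Ready,
    WarmingUp,
    Failed(String),
}

/// Hypothetical view of the managed child, e.g. derived from `Child::try_wait`.
#[derive(Debug, Clone, Copy)]
pub enum ChildStatus {
    Running,
    Exited(Option<i32>),
}

/// Bounded readiness probe: never blocks longer than `timeout`.
pub fn probe(addr: SocketAddr, timeout: Duration) -> bool {
    TcpStream::connect_timeout(&addr, timeout).is_ok()
}

pub fn classify(child: ChildStatus, probe_ok: bool) -> BackendState {
    match child {
        // A dead child is fatal regardless of what the probe saw.
        ChildStatus::Exited(code) => {
            BackendState::Failed(format!("flash-moe exited (code {:?})", code))
        }
        // Child alive and endpoint answering: safe to route traffic.
        ChildStatus::Running if probe_ok => BackendState::Ready,
        // Child alive but not answering yet: still loading, not an error.
        ChildStatus::Running => BackendState::WarmingUp,
    }
}
```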

Rollback Plan

Rollback: git revert <post_merge_commit_sha>.

DB downgrade: not applicable.
Data repair: not applicable.
Operational caveat: rollback removes the built-in Flash-MoE adapter and restores the previous plugin/backend surface.

Known Residual Risks

Real Flash-MoE Qwen3.5-397B hardware smoke was not run locally because the required Flash-MoE binary and model artifacts are not available in this environment.
This intentionally does not automate Flash-MoE artifact preparation or model conversion; those remain separate follow-up work.

michaelneale · 2026-05-09T06:49:58Z

          else
            echo "ready=false" >> "$GITHUB_OUTPUT"
          fi
+      - name: Bootstrap rustup


Curious why this (and bash) was needed for CI? I guess it's for testing the plugin? Seems OK then.

michaelneale · 2026-05-09T06:50:18Z

 Today mesh-llm has two MoE modes: **solo** (model fits in memory, run it whole) and **split** (model doesn't fit, shard experts across nodes). SSD streaming would be a third mode: model doesn't fit in memory but *does* fit on one node's SSD. No mesh coordination, no cross-node traffic, no splitting — just one machine streaming experts from disk.

-**Plan:** Use flash-moe directly as an alternative backend, not hack SSD streaming into llama.cpp. llama.cpp's `ggml_mul_mat_id` assumes all expert weights resident in one contiguous tensor — changing that is deep surgery across ggml, the Metal backend, and the model loader. Flash-moe is a working engine. Mesh-llm spawns it like it spawns llama-server — process management + HTTP wrapper.
+**Plan:** Use flash-moe directly as an alternative backend, not hack SSD streaming into llama.cpp. llama.cpp's `ggml_mul_mat_id` assumes all expert weights resident in one contiguous tensor — changing that is deep surgery across ggml, the Metal backend, and the model loader. Flash-moe is a working engine. Mesh-llm integrates it through a built-in `flash-moe` plugin: process management, OpenAI-compatible endpoint registration, model discovery, and routing.


Could probably drop this now, since this PR implements it?
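The roadmap text quoted above distinguishes three MoE modes by where the model fits. The hypothetical sketch below expresses only that three-way distinction using sizes; it is not mesh-llm's selection logic (this PR adds a configured plugin, not automatic mode selection), and preferring SSD streaming over splitting when both are possible is an assumption.

```rust
/// Illustrative modes from the roadmap text; not a mesh-llm type.
#[derive(Debug, PartialEq)]
pub enum MoeMode {
    Solo,
    SsdStream,
    Split,
}

/// Hypothetical chooser based only on sizes in bytes.
pub fn choose_mode(model_bytes: u64, node_memory_bytes: u64, node_ssd_free_bytes: u64) -> MoeMode {
    if model_bytes <= node_memory_bytes {
        // Fits in memory: run it whole.
        MoeMode::Solo
    } else if model_bytes <= node_ssd_free_bytes {
        // Too big for memory but fits on one node's SSD: stream experts from disk.
        MoeMode::SsdStream
    } else {
        // Fits on neither: shard experts across nodes.
        MoeMode::Split
    }
}
```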

michaelneale · 2026-05-09T06:53:03Z

Yeah, this is a pretty good use of plugins; I think it makes sense as a plugin. It would be good to round out the install experience (and make it clear that Flash-MoE has to be installed separately, right?).

The Flash-MoE repo hasn't had any commits in 2 months, however, so I wonder whether it is still alive or has moved somewhere else. I could see a lot of people benefiting from it, assuming Flash-MoE is solid enough.

i386 · 2026-05-09T06:55:12Z

Do we publish enough to crates.io that it could live in its own repo?

Validation
* Validation tier: Tier 4 (CI workflow correction), because this resolves the PR Mesh-LLM#475 ROCm workflow rebase conflict without reintroducing the removed llama cache-hit topology.
* git diff --check: PASS
* git diff --cached --check: PASS
* ruby -e 'require "yaml"; YAML.load_file(".github/workflows/ci.yml")': PASS
* cargo fmt --all -- --check: PASS
* LLAMA_STAGE_BUILD_DIR=.deps/llama.cpp/build-stage-abi-metal cargo check -p mesh-llm-host-runtime: PASS
* LLAMA_STAGE_BUILD_DIR=.deps/llama.cpp/build-stage-abi-metal cargo check -p mesh-llm: PASS
* LLAMA_STAGE_BUILD_DIR=.deps/llama.cpp/build-stage-abi-metal cargo test -p mesh-llm --lib: PASS
* Ledger: not applicable - not required for selected validation tier/change family.
* Version: not applicable - not required for selected validation tier/change family.
* Not run: full local GitHub Actions matrix - not required locally for selected validation tier; required remote CI will rerun on the pushed PR SHA.

Rollback
* git revert HEAD

IvGolovach · 2026-05-13T03:22:41Z

I refreshed this against current main and narrowed the PR back to the Flash-MoE adapter itself. The unrelated CI changes are gone, the README/runtime conflicts are resolved, and the old review threads are now outdated. Local Flash-MoE tests pass and the full GitHub matrix is green now. Thanks again for the guidance here — I think this is a much cleaner review unit.

Validation
* Validation tier: Tier 2R - post-review conflict/base refresh of an existing shared runtime integration PR; the manual conflict scope was README.md, with targeted Flash-MoE/runtime checks rerun on the final rebased diff.
* git fetch --no-tags origin main:refs/remotes/origin/main: PASS
* git rebase origin/main: PASS, resolved conflict in README.md.
* git diff --check origin/main...HEAD: PASS
* git diff --cached --check: PASS
* cargo fmt --all -- --check: PASS
* LLAMA_STAGE_BUILD_DIR=/Users/Funtland/Downloads/mesh-llm/.deps/llama-build/build-stage-abi-metal rustup run stable cargo test -p mesh-llm-host-runtime flash_moe --lib: PASS, 13 passed, 0 failed
* LLAMA_STAGE_BUILD_DIR=/Users/Funtland/Downloads/mesh-llm/.deps/llama-build/build-stage-abi-metal rustup run stable cargo check -p mesh-llm: PASS
* Ledger: not applicable - not required for selected validation tier/change family.
* Version: not applicable - not required for selected validation tier/change family.
* Not run: just build - not required for this conflict/base-refresh tier; targeted Rust checks covered the affected runtime paths and GitHub CI will rerun the final PR SHA.
* Not run: full cargo test -p mesh-llm-host-runtime --lib - not required for selected tier; targeted Flash-MoE tests covered the changed plugin/runtime path.

Rollback
* git revert HEAD

ndizazzo

@IvGolovach Agent caught one thing:

Medium: Flash-MoE managed backend startup failures happen after the plugin runtime already sends InitializeResponse, so missing/bad commands can look like plugin crash/restart loops instead of clean startup failures with install hints.

This won't happen in all cases, so I'm marking this approved; you can fast-follow with anything needed to address it.
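One possible fast-follow for this finding is a preflight check that resolves the configured command before the plugin reports a successful initialize, so a missing binary becomes a single clear error with an install hint instead of a restart loop. This is a hypothetical sketch: `resolve_command` and `preflight` are invented names, and the check does not handle Windows `PATHEXT` or verify the executable bit.

```rust
use std::env;
use std::path::{Path, PathBuf};

/// Resolve a command roughly the way a shell would: absolute paths or paths
/// with more than one component are checked directly; bare names are
/// searched on PATH. Does not check the executable bit.
pub fn resolve_command(cmd: &str) -> Option<PathBuf> {
    let candidate = Path::new(cmd);
    if candidate.is_absolute() || candidate.components().count() > 1 {
        return if candidate.is_file() { Some(candidate.to_path_buf()) } else { None };
    }
    let path_var = env::var_os("PATH")?;
    env::split_paths(&path_var)
        .map(|dir| dir.join(cmd))
        .find(|p| p.is_file())
}

/// Hypothetical preflight run before the initialize response is sent.
pub fn preflight(cmd: &str) -> Result<PathBuf, String> {
    resolve_command(cmd).ok_or_else(|| {
        format!(
            "flash-moe command `{cmd}` not found; Flash-MoE is installed separately \
             from mesh-llm, so install it and set `command` to its infer binary"
        )
    })
}
```

Surfacing this error during initialization, rather than after it, would let the host report a clean startup failure instead of treating the exit as a plugin crash.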

ndizazzo · 2026-05-16T23:10:13Z

@IvGolovach fire up a PR for the results in this comment, trying to get our PR backlog down

IvGolovach force-pushed the codex/flash-moe-ssd-backend branch from a348483 to 4d46085 May 8, 2026 21:30

i386 self-assigned this May 8, 2026

i386 self-requested a review May 8, 2026 21:55

ndizazzo assigned IvGolovach and unassigned i386 May 9, 2026

michaelneale reviewed May 9, 2026


IvGolovach requested a review from michaelneale May 9, 2026 19:22

IvGolovach force-pushed the codex/flash-moe-ssd-backend branch 2 times, most recently from 8e73b4b to 1a10816 May 13, 2026 02:14

IvGolovach force-pushed the codex/flash-moe-ssd-backend branch from 1a10816 to a673062 May 14, 2026 05:58

ndizazzo self-requested a review May 16, 2026 00:00

ndizazzo approved these changes May 16, 2026


ndizazzo merged commit 2ffdc48 into Mesh-LLM:main May 16, 2026
21 checks passed

ndizazzo merged 1 commit into Mesh-LLM:main from IvGolovach:codex/flash-moe-ssd-backend

4 participants
