Add Flash-MoE SSD backend support #475
a348483 to 4d46085
| else | ||
| echo "ready=false" >> "$GITHUB_OUTPUT" | ||
| fi | ||
| - name: Bootstrap rustup |
Curious why this and bash are needed for CI? I guess it is for testing the plugin? Seems OK then.
| Today mesh-llm has two MoE modes: **solo** (model fits in memory, run it whole) and **split** (model doesn't fit, shard experts across nodes). SSD streaming would be a third mode: model doesn't fit in memory but *does* fit on one node's SSD. No mesh coordination, no cross-node traffic, no splitting — just one machine streaming experts from disk. | ||
| **Plan:** Use flash-moe directly as an alternative backend, not hack SSD streaming into llama.cpp. llama.cpp's `ggml_mul_mat_id` assumes all expert weights resident in one contiguous tensor — changing that is deep surgery across ggml, the Metal backend, and the model loader. Flash-moe is a working engine. Mesh-llm spawns it like it spawns llama-server — process management + HTTP wrapper. | ||
| **Plan:** Use flash-moe directly as an alternative backend, not hack SSD streaming into llama.cpp. llama.cpp's `ggml_mul_mat_id` assumes all expert weights resident in one contiguous tensor — changing that is deep surgery across ggml, the Metal backend, and the model loader. Flash-moe is a working engine. Mesh-llm integrates it through a built-in `flash-moe` plugin: process management, OpenAI-compatible endpoint registration, model discovery, and routing. |
Could probably drop this now, since this PR implements it?
Yeah, this is a pretty good use of plugins; I think it makes sense as a plugin. It would be good to round out the install experience (and make it clear that flash-moe has to be installed separately, right?). The flash-moe repo hasn't had any commits in 2 months, however, so I wonder whether that repo is still alive or has moved somewhere else. I could see a lot of people benefiting from it, assuming flash-moe is solid enough.
Do we publish enough to crates.io that it could live in its own repo?
Validation
* Validation tier: Tier 4 - CI workflow correction, because this resolves the PR Mesh-LLM#475 ROCm workflow rebase conflict without reintroducing the removed llama cache-hit topology.
* `git diff --check`: PASS
* `git diff --cached --check`: PASS
* `ruby -e 'require "yaml"; YAML.load_file(".github/workflows/ci.yml")'`: PASS
* `cargo fmt --all -- --check`: PASS
* `LLAMA_STAGE_BUILD_DIR=.deps/llama.cpp/build-stage-abi-metal cargo check -p mesh-llm-host-runtime`: PASS
* `LLAMA_STAGE_BUILD_DIR=.deps/llama.cpp/build-stage-abi-metal cargo check -p mesh-llm`: PASS
* `LLAMA_STAGE_BUILD_DIR=.deps/llama.cpp/build-stage-abi-metal cargo test -p mesh-llm --lib`: PASS
* Ledger: not applicable - not required for selected validation tier/change family.
* Version: not applicable - not required for selected validation tier/change family.
* Not run: full local GitHub Actions matrix - not required locally for selected validation tier; required remote CI will rerun on the pushed PR SHA.
Rollback
* `git revert HEAD`
8e73b4b to 1a10816
I refreshed this against current main and narrowed the PR back to the Flash-MoE adapter itself. The unrelated CI changes are gone, the README/runtime conflicts are resolved, and the old review threads are now outdated. Local Flash-MoE tests pass and the full GitHub matrix is green now. Thanks again for the guidance here; I think this is a much cleaner review unit.
Validation
* Validation tier: Tier 2R - post-review conflict/base refresh of an existing shared runtime integration PR; the manual conflict scope was README.md, with targeted Flash-MoE/runtime checks rerun on the final rebased diff.
* `git fetch --no-tags origin main:refs/remotes/origin/main`: PASS
* `git rebase origin/main`: PASS, resolved conflict in README.md.
* `git diff --check origin/main...HEAD`: PASS
* `git diff --cached --check`: PASS
* `cargo fmt --all -- --check`: PASS
* `LLAMA_STAGE_BUILD_DIR=/Users/Funtland/Downloads/mesh-llm/.deps/llama-build/build-stage-abi-metal rustup run stable cargo test -p mesh-llm-host-runtime flash_moe --lib`: PASS, 13 passed, 0 failed
* `LLAMA_STAGE_BUILD_DIR=/Users/Funtland/Downloads/mesh-llm/.deps/llama-build/build-stage-abi-metal rustup run stable cargo check -p mesh-llm`: PASS
* Ledger: not applicable - not required for selected validation tier/change family.
* Version: not applicable - not required for selected validation tier/change family.
* Not run: `just build` - not required for this conflict/base-refresh tier; targeted Rust checks covered the affected runtime paths and GitHub CI will rerun the final PR SHA.
* Not run: full `cargo test -p mesh-llm-host-runtime --lib` - not required for selected tier; targeted Flash-MoE tests covered the changed plugin/runtime path.
Rollback
* `git revert HEAD`
1a10816 to a673062
@IvGolovach Agent caught one thing:
Medium: Flash-MoE managed backend startup failures happen after the plugin runtime already sends InitializeResponse, so missing/bad commands can look like plugin crash/restart loops instead of clean startup failures with install hints.
This wouldn't happen in all cases, so marking this approved and you can fast-follow with anything to address it.
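One possible shape for that fast-follow, offered only as a sketch: resolve the configured command before the plugin acknowledges initialization, so a missing binary surfaces as a clean error with an install hint. The function names `resolve_command` and `preflight` and the hint wording are hypothetical and not taken from the PR; the check only tests that a file exists, not that it is executable.

```rust
use std::env;
use std::path::{Path, PathBuf};

/// Resolve `command` to an existing file. A value containing a path separator
/// is used as given; a bare name is searched in each directory of `path_var`.
/// NOTE: hypothetical helper, not the PR's code. It does not check the
/// executable bit.
fn resolve_command(command: &str, path_var: &str) -> Option<PathBuf> {
    let direct = Path::new(command);
    if direct.components().count() > 1 {
        return direct.is_file().then(|| direct.to_path_buf());
    }
    env::split_paths(path_var)
        .map(|dir| dir.join(command))
        .find(|candidate| candidate.is_file())
}

/// Fail early with an install hint instead of letting a later spawn error
/// look like a plugin crash. The message text is an assumption.
fn preflight(command: &str, path_var: &str) -> Result<PathBuf, String> {
    resolve_command(command, path_var).ok_or_else(|| {
        format!(
            "Flash-MoE command `{command}` not found; install flash-moe \
             separately or set `command` to the full path of its binary"
        )
    })
}
```

Running a check like this before `InitializeResponse` is sent would let the host report a startup failure rather than a restart loop; where exactly it would hook into the plugin runtime is left open here.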
@IvGolovach fire up a PR for the results in this comment, trying to get our PR backlog down
Summary
Users can now run Flash-MoE as a built-in mesh-llm backend for single-node SSD expert streaming: giant MoE models can live on local NVMe while mesh-llm handles backend lifecycle, endpoint discovery, and OpenAI-compatible routing.
Supported modes:
* `infer` binary, allocates the local serving port, and appends `--serve <port>` itself.
* `/v1` endpoint and advertises it through the existing plugin inference path.
Why
This follows the roadmap direction for SSD expert streaming without pushing SSD streaming into llama.cpp internals and without changing Skippy, model packages, or mesh protocol compatibility.
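As a rough illustration of the mode in which mesh-llm launches the `infer` binary, the sketch below picks a free loopback port and appends `--serve <port>` to the user's arguments. The function names `allocate_local_port` and `managed_args` are hypothetical, and binding to port 0 is only one common way to obtain a free port; the PR may allocate it differently.

```rust
use std::net::TcpListener;

/// Ask the OS for a free loopback port by binding to port 0.
/// The listener is dropped on return, so another process could take the port
/// before the backend binds it; real code would need to tolerate that race.
/// NOTE: hypothetical helper, not the PR's implementation.
fn allocate_local_port() -> std::io::Result<u16> {
    let listener = TcpListener::bind(("127.0.0.1", 0))?;
    Ok(listener.local_addr()?.port())
}

/// Build the final argument list: user-supplied args first, then
/// `--serve <port>` appended by the host, which owns the port.
fn managed_args(user_args: &[String], port: u16) -> Vec<String> {
    let mut args = user_args.to_vec();
    args.push("--serve".to_string());
    args.push(port.to_string());
    args
}
```

The resulting vector could then be passed to `std::process::Command::new(...).args(...)`. Keeping the port out of user configuration is what lets the host register the matching endpoint without parsing user flags.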
Diff Scope
* `flash-moe` plugin support in `mesh-llm-host-runtime`.
* `command` or `url`, `args` only with `command`, and no user-provided `--serve` because mesh-llm owns the port.
* `README.md`, `docs/plugins/flash-moe.md`, `docs/plugins/README.md`, `ROADMAP.md`, and `crates/mesh-llm/TODO.md`.
Architecture / Protocol
Flash-MoE is integrated as a host-runtime plugin and registers an OpenAI-compatible endpoint with endpoint id `flash-moe`.
No mesh wire protocol, protobuf, ALPN, gossip schema, Skippy stage protocol, model-package format, or llama.cpp patch queue changes are included.
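The configuration rules listed under Diff Scope could be checked along the following lines. This is a sketch under stated assumptions: the struct `FlashMoeConfig`, the function `validate`, the error messages, the reading of "`command` or `url`" as exactly one of the two, and the rejection of a `--serve=` form are all illustrative guesses rather than the PR's actual code.

```rust
/// Hypothetical config shape; the field names mirror the options named in the
/// PR description, but the struct itself is an assumption.
#[derive(Debug, Default)]
struct FlashMoeConfig {
    command: Option<String>,
    url: Option<String>,
    args: Vec<String>,
}

/// Apply the three rules: one of `command`/`url` (assumed: exactly one),
/// `args` only together with `command`, and no user-provided `--serve`.
fn validate(cfg: &FlashMoeConfig) -> Result<(), String> {
    match (&cfg.command, &cfg.url) {
        (Some(_), Some(_)) => return Err("set either `command` or `url`, not both".into()),
        (None, None) => return Err("set one of `command` or `url`".into()),
        _ => {}
    }
    if cfg.command.is_none() && !cfg.args.is_empty() {
        return Err("`args` is only valid together with `command`".into());
    }
    // Rejecting the `--serve=<port>` spelling as well is an assumption.
    if cfg.args.iter().any(|a| a == "--serve" || a.starts_with("--serve=")) {
        return Err("do not pass `--serve`; mesh-llm owns the port".into());
    }
    Ok(())
}
```

Validating up front keeps the later launch path simple: by the time a process is spawned or a URL is registered, the host already knows which of the two modes applies.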
Branch / Commit Integrity
* `main`
* `8d12c0be26fb3af4ed309fde6df65acfabff0162`
* `git rev-list --left-right --count origin/main...HEAD`: `0 1`
* `8d12c0be26fb3af4ed309fde6df65acfabff0162`
* `d923e037bd7ffb38f632307228d64d61c4eb0640` Add Flash-MoE SSD backend plugin
Validation
Validation mode: Tier 3 — shared backend/runtime integration.
Local proof:
* `cargo fmt --all -- --check`: PASS
* `LLAMA_STAGE_BUILD_DIR=.deps/llama.cpp/build-stage-abi-metal cargo test -p mesh-llm-host-runtime flash_moe`: PASS, 10 passed, 0 failed
* `LLAMA_STAGE_BUILD_DIR=.deps/llama.cpp/build-stage-abi-metal cargo test -p mesh-llm-host-runtime --lib`: PASS, 1242 passed, 0 failed, 5 ignored
* `LLAMA_STAGE_BUILD_DIR=.deps/llama.cpp/build-stage-abi-metal cargo check -p mesh-llm`: PASS
* `/tmp/mesh-llm-just-tool/bin/just build`: PASS
* `git diff --check origin/main...HEAD`: PASS, no output
Remote CI is pending until this PR is opened. This PR should not be considered merge-ready until required GitHub checks pass on the final PR SHA.
Runtime Safety
No invariant regression introduced.
Rollback Plan
Rollback:
* `git revert <post_merge_commit_sha>`.
DB downgrade: not applicable.
Data repair: not applicable.
Operational caveat: rollback removes the built-in Flash-MoE adapter and restores the previous plugin/backend surface.
Known Residual Risks