Explain Skippy split capacity shortfalls #498

Merged: i386 merged 1 commit into Mesh-LLM:main from IvGolovach:codex/skippy-split-readiness on May 10, 2026

@IvGolovach
Collaborator

Summary

Skippy split startup failures now explain the capacity gap instead of only reporting aggregate totals.

  • Reports how much aggregate memory the mesh is short by when a split model cannot fit across eligible participants.
  • Includes the eligible participant set used for the capacity calculation, with per-node capacity, cache, missing-artifact, RTT, and transfer signals.
  • Carries excluded peer reasons from the readiness wait into the final capacity error, so operators can distinguish missing capacity from peers that were filtered out.
  • Keeps placement behavior unchanged; this is runtime diagnostics plus focused test coverage.

Why

During recent relay/join testing, a fresh invite successfully joined the mesh and the node could see a peer with relay RTT. The remaining startup failure was no longer connectivity; it was capacity:

aggregate split capacity for meshllm/Qwen3-Coder-480B-A35B-Instruct-UD-Q4_K_XL-layers requires 302.1GB, mesh has 154.6GB across 2 participant(s)

That message proved the mesh was connected, but it did not answer the operational questions that matter next: how far short is the mesh, which participants were counted, and whether any visible peers were excluded from the split candidate set.

This PR makes that failure mode actionable without changing the split planner's placement decisions.

Example

The aggregate-capacity error now carries the full readiness context:

aggregate split capacity for <model> requires 302.1GB, mesh has 154.6GB across 2 participant(s), short by 147.5GB; participants [...]; excluded [...]

The participant labels include capacity/cache/missing-artifact/RTT/transfer details, and excluded peers include the reason they were not eligible for the split.
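The arithmetic behind the new "short by" field can be sketched as follows. This is an illustrative sketch only, not the code in this PR: `CapacityShortfall`, `format_gb`, and the field names are hypothetical, and the participant and excluded labels are assumed to be rendered elsewhere. It assumes decimal gigabytes (1 GB = 10^9 bytes) with one fractional digit, which matches the figures in the example message.

```rust
/// Format a byte count as decimal gigabytes (1 GB = 10^9 bytes), one decimal place.
fn format_gb(bytes: u64) -> String {
    format!("{:.1}GB", bytes as f64 / 1e9)
}

/// Hypothetical inputs for the aggregate-capacity message; not the PR's real type.
struct CapacityShortfall {
    model: String,
    required_bytes: u64,
    available_bytes: u64,
    participants: Vec<String>, // pre-rendered participant labels
    excluded: Vec<String>,     // pre-rendered "peer (reason)" labels
}

impl CapacityShortfall {
    /// Bytes the mesh is missing; zero when the model already fits.
    fn short_by(&self) -> u64 {
        self.required_bytes.saturating_sub(self.available_bytes)
    }

    /// Render the message in the shape shown in the example above.
    fn message(&self) -> String {
        format!(
            "aggregate split capacity for {} requires {}, mesh has {} across {} participant(s), \
             short by {}; participants [{}]; excluded [{}]",
            self.model,
            format_gb(self.required_bytes),
            format_gb(self.available_bytes),
            self.participants.len(),
            format_gb(self.short_by()),
            self.participants.join(", "),
            self.excluded.join(", "),
        )
    }
}
```

With 302.1GB required and 154.6GB available, `short_by()` is 147.5GB, the same gap reported in the example message. The `saturating_sub` call keeps the shortfall at zero rather than underflowing if the mesh has enough capacity.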

What changed

  • wait_for_split_participants now returns a SplitParticipantSnapshot, preserving both eligible participants and excluded peers after the wait loop.
  • Aggregate split capacity failures are formatted through SplitCapacityReadinessReport.
  • Split topology planning accepts excluded-peer context for diagnostics while preserving the existing planner behavior.
  • Participant labels now use the same decimal GB formatting as the rest of the readiness message.
  • Added focused tests for shortfall reporting, participant labels, and excluded-peer reporting.
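The snapshot idea in the list above, keeping both sides of the eligibility filter so the final error can report them, might look roughly like this. The types below (`PeerView`, `ParticipantSnapshot`) and their fields are hypothetical stand-ins chosen for illustration; the real `SplitParticipantSnapshot` in `local.rs` may be shaped differently.

```rust
/// Hypothetical view of a peer seen during the readiness wait.
#[derive(Debug, Clone, PartialEq)]
struct PeerView {
    id: String,
    capacity_bytes: u64,
    /// `None` means eligible; `Some(reason)` explains the exclusion.
    excluded_reason: Option<String>,
}

/// Hypothetical snapshot that preserves eligible and excluded peers together.
#[derive(Debug, Default)]
struct ParticipantSnapshot {
    eligible: Vec<PeerView>,
    excluded: Vec<(String, String)>, // (peer id, reason)
}

impl ParticipantSnapshot {
    /// Partition peers once, so excluded reasons survive past the wait loop.
    fn from_peers(peers: Vec<PeerView>) -> Self {
        let mut snapshot = Self::default();
        for peer in peers {
            match peer.excluded_reason.clone() {
                Some(reason) => snapshot.excluded.push((peer.id, reason)),
                None => snapshot.eligible.push(peer),
            }
        }
        snapshot
    }

    /// Capacity counted toward the split: eligible participants only.
    fn aggregate_capacity_bytes(&self) -> u64 {
        self.eligible.iter().map(|p| p.capacity_bytes).sum()
    }
}
```

Returning one value that holds both lists means the capacity check and the error formatter read from the same data, so the reported participant count and the excluded reasons cannot drift apart.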

Related

Compatibility

  • No protobuf schema changes.
  • No ALPN, stream type, or wire-format changes.
  • No gossip compatibility impact.
  • No UI or embedded asset changes.
  • Internal return-type changes stay inside the local runtime split startup path.

Branch integrity

  • Base branch: main
  • Validated base SHA: 9cfef9008e952cfc221dcd486073fd920fc6924f
  • Head SHA: e17e8a6c9a57afdb16912fed0c7effa281d986b8
  • git rev-list --left-right --count origin/main...HEAD: 0 1
  • Diff scope: crates/mesh-llm-host-runtime/src/runtime/local.rs

Validation

Validation tier: Tier 2 - narrow runtime diagnostics. The diff only changes Skippy split readiness error reporting and focused tests in the local runtime path.

git diff --check origin/main...HEAD
PASS

cargo fmt --all -- --check
PASS

LLAMA_STAGE_BUILD_DIR=.deps/llama-build/build-stage-abi-metal cargo test -p mesh-llm-host-runtime split_topology --lib
PASS, 8 passed

LLAMA_STAGE_BUILD_DIR=.deps/llama-build/build-stage-abi-metal cargo test -p mesh-llm-host-runtime aggregate_split_capacity_error_reports_excluded_peers --lib
PASS, 1 passed

LLAMA_STAGE_BUILD_DIR=.deps/llama-build/build-stage-abi-metal cargo check -p mesh-llm-host-runtime
PASS

LLAMA_STAGE_BUILD_DIR=.deps/llama-build/build-stage-abi-metal cargo check -p mesh-llm
PASS

Not run: just build - not required for this validation tier because this is a Rust-only runtime diagnostics change with no UI or embedded asset changes.

Remote CI: pending until the PR is submitted.

Rollback plan

Rollback: revert this PR.

DB downgrade: not applicable.
Data repair: not applicable.
Operational caveats: rollback restores the previous less-detailed aggregate split capacity error.

Known residual risks

  • This does not add capacity to the mesh or change split placement. It only makes insufficient-capacity failures easier to diagnose.
  • Full multi-node inference was not rerun for this diagnostics-only change; the triggering manual test already demonstrated the connected-but-insufficient-capacity state this PR improves.

Validation
* Validation tier: Tier 2 - narrow runtime diagnostics, because this only changes Skippy split readiness error reporting and tests in the local runtime path.
* git diff --check: PASS
* git diff --cached --check: PASS
* cargo fmt --all -- --check: PASS
* LLAMA_STAGE_BUILD_DIR=.deps/llama-build/build-stage-abi-metal cargo test -p mesh-llm-host-runtime split_topology --lib: PASS, 8 passed
* LLAMA_STAGE_BUILD_DIR=.deps/llama-build/build-stage-abi-metal cargo test -p mesh-llm-host-runtime aggregate_split_capacity_error_reports_excluded_peers --lib: PASS, 1 passed
* LLAMA_STAGE_BUILD_DIR=.deps/llama-build/build-stage-abi-metal cargo check -p mesh-llm-host-runtime: PASS
* LLAMA_STAGE_BUILD_DIR=.deps/llama-build/build-stage-abi-metal cargo check -p mesh-llm: PASS
* Ledger: not applicable - repository has no ledger requirement for this change family.
* Version: not applicable - no release metadata or public protocol changed.
* Not run: just build - not required for selected validation tier; Rust-only runtime diagnostics change with no UI/assets touched.

Rollback
* git revert HEAD
@i386
Collaborator

i386 commented May 10, 2026

Nice, love it!

@i386 i386 merged commit 043319c into Mesh-LLM:main May 10, 2026
17 checks passed
@IvGolovach IvGolovach deleted the codex/skippy-split-readiness branch May 10, 2026 16:55