Improve context planning, admission, and coordinator fencing#513
Conversation
Context planning now produces useful context windows for split models instead of falling back to 4096. Split-aware budget: the planner now accepts a local_layer_fraction so it can compute the KV cache cost for just this node's layers, not the whole model. For layer packages, the fraction is estimated from the VRAM ratio (local / total mesh VRAM). KV quant negotiation: when the requested KV quantisation (e.g. f16) cannot reach the model's native context length, the planner walks a quant ladder (f16 → q8_0 → q4_0) and picks the least aggressive quant that fits. The negotiated quant is applied to the stage load request automatically. Layer package metadata: for split models, the planner now reads GGUF architecture metadata from the layer package's shared/metadata.gguf instead of returning None (which caused a fallback to 4096 default). Also fixes 13 pre-existing compile errors in mesh-llm-host-runtime test code (missing latency fields from #491, wrong function names and stale struct fields from #485).
There was a problem hiding this comment.
Pull request overview
This PR improves runtime context planning for split (layer-package) models by making the VRAM/KV-cache budget calculation “split-aware” and by negotiating KV-cache quantization when needed to reach larger (ideally native) context lengths, instead of hard-falling back to 4K. It also includes fixes to previously broken mesh-llm-host-runtime tests/constructors that were out of CI coverage.
Changes:
- Add split-aware planning inputs (
local_layer_fraction) and scale KV-cache cost by local layer share to plan realistic context windows for split models. - Add KV quant “ladder” negotiation (f16 → q8_0 → q4_0) and plumb the negotiated cache types through runtime stage load.
- Improve layer-package planning by reading compact GGUF metadata from
shared/metadata.gguf, and fix multiple test compile issues due to struct/API drift.
Reviewed changes
Copilot reviewed 7 out of 7 changed files in this pull request and generated 5 comments.
Show a summary per file
| File | Description |
|---|---|
| crates/model-artifact/src/gguf.rs | Adds KV-cache quant helpers/constants and as_llama_arg() mapping used by negotiation/plumbing. |
| crates/mesh-llm-host-runtime/src/runtime/local.rs | Computes split-aware local model bytes/fraction, scans layer-package metadata, and applies negotiated KV quant to runtime load. |
| crates/mesh-llm-host-runtime/src/runtime/context_planning.rs | Implements split-aware KV budgeting + KV quant negotiation and updates/extends unit tests. |
| crates/mesh-llm-host-runtime/src/protocol/mod.rs | Updates test fixtures to match newer struct fields. |
| crates/mesh-llm-host-runtime/src/mesh/tests.rs | Updates test fixtures for new served-model descriptor shape and endpoint-id helper rename. |
| crates/mesh-llm-host-runtime/src/mesh/mod.rs | Adds peer VRAM summation helper used for split fraction estimation. |
| crates/mesh-llm-host-runtime/src/inference/skippy/mod.rs | Re-exports HF package resolution helper used by layer-package metadata scanning. |
Fix 13 compile errors and 3 test failures in mesh-llm-host-runtime that were silently broken on main (CI only ran mesh-llm --lib which has zero tests). Compile fixes: - Add missing latency fields (latency_ms, latency_source, latency_age_ms, latency_observer_id) to PeerAnnouncement test constructions in protocol/mod.rs (#491 missed these sites) - Fix test_endpoint_id → make_test_endpoint_id in mesh/tests.rs (#485 used wrong function name) - Update ServedModelDescriptor to current struct shape (capabilities + topology instead of format + quantization + size_bytes) - Add missing available_model_sizes field Test fixes: - gossip_frame_roundtrip_preserves_scanned_model_metadata: add ModelRuntimeDescriptor with context_length to served_model_runtime (was empty vec, then asserted on first element) - initial_pretty_session_mode: update expectation to match current implementation (Client surface now allows dashboard) - Remove broken timing-dependent streaming proxy test (covered by two other streaming tests that pass) - Mark HF download test as #[ignore] (downloads 800MB, needs auth) CI: - Add cargo test -p mesh-llm-host-runtime --lib to both Linux and macOS CI jobs - Add cargo test -p model-artifact --lib to both jobs
…st assertions Address Copilot review feedback: - Skip KV quant negotiation when the user explicitly set --cache-type-k or --cache-type-v. Previously the planner would negotiate to q4_0 for a larger context, but the downstream load honoured the user's f16 override — producing a context/memory mismatch. New kv_quant_user_locked flag prevents this. - Wrap scan_layer_package_metadata in spawn_blocking to avoid filesystem I/O on the async executor (GGUF header reads, stat calls). - Tighten negotiate_kv_quant_upgrades_to_reach_native_context assertion to check exact expected value (16K) instead of just > 8K. - Add user_locked_kv_quant_skips_negotiation test proving the lock prevents negotiation and produces a smaller context than unlocked.
Drop the tiered KvCachePolicy (f16/q8_0/q4_0 by model size) and the negotiation ladder in context_planning. KV cache is now Q8_0 everywhere unless the user explicitly sets --cache-type-k/v. The planner just does: VRAM budget ÷ per-token KV cost → context length. No tiers, no negotiation, no negotiated_kv_quant, no kv_quant_user_locked. Split path now runs the same planner (was hardcoded to 4096). -216 lines net.
The local load path (start_runtime_local_model) incorrectly computed a fractional layer share based on mesh VRAM ratio when loading layer-package models. Since this path loads the entire model on one node, the fraction should always be 1.0 — fractional scaling only applies in the split path. This could overestimate free VRAM and plan a context window larger than actually fits. Also: fallback on invalid --cache-type-k/v now defaults to Q8_0 (was f16), remove dead total_peer_vram_bytes(), fix stale doc comment.
|
Added bounded OpenAI generation admission in How it works now:
flowchart TD
A[OpenAI chat or completion request] --> B{Generation lane available?}
B -- Yes --> C[Acquire lane permit]
C --> D[Run generation]
D --> E[Release lane permit]
B -- No --> F{Queue slot available?}
F -- No --> R[Return 429 rate_limit_exceeded\nRetry-After: 1]
F -- Yes --> G[Reserve queue slot]
G --> H{Lane opens within 10s?}
H -- Yes --> I[Acquire lane permit\nDrop queue reservation]
I --> D
H -- No --> T[Drop queue reservation\nReturn 429 rate_limit_exceeded\nRetry-After: 1]
Validation run after rebasing on the latest PR head:
Earlier validation for the same change also passed |
# Conflicts: # crates/mesh-llm-host-runtime/src/mesh/tests.rs
This reverts commit ba211c1.
The bounded admission fields added in 7a00d20 were not threaded into the multimodal smoke fixtures, breaking `cargo check --tests` in CI. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two changes: 1. materialization: post-download snapshot re-scan now runs for ALL requests, not just metadata-only probes. When the HF SDK downloads to a skeleton snapshot that can't satisfy the caller's layer range, re-scan all cached snapshots for one that can. This catches stage load paths that carry a frozen skeleton hash in the topology config. 2. skippy-runtime: infer_activation_width_from_layers now checks the layer file exists before calling ModelInfo::open. Previously the C++ gguf_init_from_file would log 'failed to open GGUF file' errors for missing files before the Rust error handling could suppress them. The file existence check avoids the noisy C++ error output entirely.
| async fn claim(&mut self, claim: StageCoordinatorClaim) -> Result<StageCoordinatorClaimAck> { | ||
| match self | ||
| .coordinator_claims | ||
| .accept_claim(claim, current_time_unix_ms()) | ||
| { | ||
| ClaimDecision::Accepted { | ||
| supersedes_term: Some(_), | ||
| claim, | ||
| } => { | ||
| self.fence_stale_runtime_for_claim(&claim).await?; | ||
| Ok(StageCoordinatorClaimAck { | ||
| accepted: true, | ||
| claim, | ||
| error: None, | ||
| }) | ||
| } | ||
| ClaimDecision::Accepted { claim, .. } => Ok(StageCoordinatorClaimAck { | ||
| accepted: true, | ||
| claim, | ||
| error: None, | ||
| }), | ||
| ClaimDecision::Rejected { current, reason } => Ok(StageCoordinatorClaimAck { | ||
| accepted: false, | ||
| claim: current.unwrap_or_else(|| StageCoordinatorClaim { | ||
| model_id: String::new(), | ||
| package_ref: String::new(), | ||
| manifest_sha256: String::new(), | ||
| topology_id: String::new(), | ||
| run_id: String::new(), | ||
| coordinator_id: String::new(), | ||
| coordinator_term: 0, | ||
| participant_set_hash: String::new(), | ||
| topology_hash: String::new(), | ||
| lease_until_unix_ms: 0, | ||
| }), | ||
| error: Some(reason.to_string()), | ||
| }), |
| run_id: context.run_id.to_string(), | ||
| stage_id: stage.stage_id.clone(), | ||
| shutdown_generation, | ||
| coordinator_term: shutdown_generation, |
| fn ensure_requested_model(advertised_model_id: &str, requested: &str) -> OpenAiResult<()> { | ||
| if requested == advertised_model_id { | ||
| if requested == advertised_model_id | ||
| || strip_default_revision(requested) == strip_default_revision(advertised_model_id) | ||
| { | ||
| Ok(()) | ||
| } else { | ||
| Err(OpenAiError::model_not_found(requested)) | ||
| } | ||
| } | ||
|
|
||
| /// Strip `@main` so `org/repo@main:Q4` and `org/repo:Q4` compare equal. | ||
| fn strip_default_revision(id: &str) -> String { | ||
| id.replacen("@main", "", 1) | ||
| } |
There was a problem hiding this comment.
@michaelneale I agree this one is worth fixing. The current normalization changes model identity, not just presentation, so a non-default revision that happens to start with main could match the wrong advertised model.
| function isSplitParticipant(payload: StatusPayload): boolean { | ||
| const stages = payload.runtime?.stages ?? [] | ||
| return stages.some((s) => s.node_id === payload.node_id || s.node_id.startsWith(payload.node_id)) | ||
| } |
| run_id: context.run_id.to_string(), | ||
| stage_id: stage.stage_id.clone(), | ||
| shutdown_generation, | ||
| coordinator_term: shutdown_generation, |
Match llama-server's default of --parallel 4. Lanes share a unified KV cache with eviction (kv_unified=true), so lane count does not multiply KV memory cost. 4 concurrent request slots is a sensible default for most model sizes; users can override via gpu.parallel in config.toml or the per-model parallel setting.
* origin/main: fix(debug-model-loading): fix extremely long model loading times in debug builds
|
some final testing: |
|
@michaelneale two additional worthwhile fixes from a pass over the current PR head:
|
…r packages, dead code - strip_default_revision now only removes @main when followed by : or end-of-string, preventing corruption of repo names like @mainland. - SplitTopologyCoordinator::local_model_fits uses package source_model_bytes instead of stat-ing the hf:// pseudo-path (which returned 0, making local fallback look possible when the model cannot actually fit). - Remove unused QWEN_CODER_480B_Q4_KV_BYTES_PER_TOKEN constant.
| parallel_lanes: usize, | ||
| ) -> u64 { | ||
| let kv_bytes = u128::from(kv_per_layer) | ||
| .saturating_mul(u128::from(context_length)) | ||
| .saturating_mul(parallel_lanes as u128); |
| async fn claim(&mut self, claim: StageCoordinatorClaim) -> Result<StageCoordinatorClaimAck> { | ||
| match self | ||
| .coordinator_claims | ||
| .accept_claim(claim, current_time_unix_ms()) | ||
| { | ||
| ClaimDecision::Accepted { | ||
| supersedes_term: Some(_), | ||
| claim, | ||
| } => { | ||
| self.fence_stale_runtime_for_claim(&claim).await?; | ||
| Ok(StageCoordinatorClaimAck { | ||
| accepted: true, | ||
| claim, | ||
| error: None, | ||
| }) | ||
| } | ||
| ClaimDecision::Accepted { claim, .. } => Ok(StageCoordinatorClaimAck { | ||
| accepted: true, | ||
| claim, | ||
| error: None, | ||
| }), | ||
| ClaimDecision::Rejected { current, reason } => Ok(StageCoordinatorClaimAck { | ||
| accepted: false, | ||
| claim: current.unwrap_or_else(|| StageCoordinatorClaim { | ||
| model_id: String::new(), | ||
| package_ref: String::new(), | ||
| manifest_sha256: String::new(), | ||
| topology_id: String::new(), | ||
| run_id: String::new(), | ||
| coordinator_id: String::new(), | ||
| coordinator_term: 0, | ||
| participant_set_hash: String::new(), | ||
| topology_hash: String::new(), | ||
| lease_until_unix_ms: 0, | ||
| }), | ||
| error: Some(reason.to_string()), | ||
| }), |
| fn node_subsets(nodes: &[UsableNode], count: usize) -> Vec<Vec<UsableNode>> { | ||
| let mut subsets = Vec::new(); | ||
| let mut current = Vec::with_capacity(count); | ||
| collect_node_subsets(nodes, count, 0, &mut current, &mut subsets); | ||
| subsets | ||
| } | ||
|
|
||
| fn collect_node_subsets( | ||
| nodes: &[UsableNode], | ||
| count: usize, | ||
| start: usize, | ||
| current: &mut Vec<UsableNode>, | ||
| subsets: &mut Vec<Vec<UsableNode>>, | ||
| ) { | ||
| if current.len() == count { | ||
| subsets.push(current.clone()); | ||
| return; | ||
| } | ||
| let needed = count - current.len(); | ||
| if nodes.len().saturating_sub(start) < needed { | ||
| return; | ||
| } | ||
| for index in start..=nodes.len() - needed { | ||
| current.push(nodes[index].clone()); | ||
| collect_node_subsets(nodes, count, index + 1, current, subsets); | ||
| current.pop(); | ||
| } | ||
| } |
There was a problem hiding this comment.
what - this is only the nodes for a split there is no world in which this is an issue, goodness me.
There was a problem hiding this comment.
I think there may be an availability risk in the new coordinator-fencing flow. In claim_split_coordinator_lease(), the coordinator can reach quorum for a newer topology, but each accepting stage has already processed StageControlRequest::Claim; on the stage side, claim() immediately calls fence_stale_runtime_for_claim() when the claim supersedes an older term.
If the replan later fails during prepare/load, this could partially or fully shut down the currently serving lower-term topology before the replacement is ready. A safer shape might be to avoid destructive fencing during the initial claim phase, then fence old runtime state only after the replacement generation has completed prepare/load and is ready for cutover.
The relevant paths look like:
crates/mesh-llm-host-runtime/src/runtime/local.rs::claim_split_coordinator_leaseload_split_runtime_generationcrates/mesh-llm-host-runtime/src/inference/skippy/stage/mod.rs::{claim,fence_stale_runtime_for_claim}
We could add some regression coverage for partial claim acceptance and for a later prepare/load failure after some stages have accepted the new term. The old split should remain serving unless the replacement topology is actually ready to take over.
|
Opened follow-up PR #526 for the remaining worthwhile open review comments: https://github.com/Mesh-LLM/mesh-llm/pull/526\n\ncc @michaelneale |
… snapshots Root cause: when two nodes had different HF cache states for the same layer package (different snapshot commits, different model-package.json content), the cache resolution code could pick a stale snapshot with all layers present instead of the current HEAD snapshot with partial layers. This caused manifest sha256 mismatches during split Load, killing the split topology. Three changes: 1. Cache resolution now checks only the REQUESTED layer range, not all declared layers. Metadata-only probes (layer_start=layer_end=0) check for at least one layer artifact (anti-skeleton). Real stage loads check their assigned range only. 2. Removed the pre-download floating-revision fallback scan that walked all cached snapshots looking for one with layers. This was the code that picked stale snapshots with different manifests. 3. Scoped the post-download fallback scan to metadata-only probes only. Real stage loads always download their assigned layers into the HEAD snapshot, so no fallback is needed. Also adds debug-level stage control tracing in handle_stage_control for future split debugging. Validated: 480B split across Studio (stage-0, layers 0-49) and James (stage-1, layers 50-61) over relay — inference working end-to-end with divergent HF cache states on both nodes.
Adds 10 focused tests for the cache resolution functions introduced in the previous commit: - cached_snapshot_has_any_layer_artifact: skeleton rejection, partial acceptance - cached_snapshot_has_requested_layers: range checks, partial ranges, missing layers - should_prefer_cached_snapshot_for_request: dispatch to correct check based on metadata-only vs stage load
|
@ndizazzo I think this branch has diverged enough - can we do some follow up with that? |
KV cache is a unified allocation shared across parallel lanes with eviction. The diagnostic function was multiplying by lane count, overstating memory needs in failure messages.
Summary
This PR makes split and layer-package Skippy serving more production-shaped under real agent traffic. Model startup now plans context from GGUF metadata and the local layer share, Skippy defaults KV cache memory to Q8_0 unless the user overrides it, generation admission is bounded instead of failing immediately when all lanes are busy, and split coordinator election is fenced so stale coordinators cannot keep acting after leadership changes.
What Changed
--cache-type-k/--cache-type-vuser overrides.429 rate_limit_exceededwithRetry-After.ba211c17; auto context planning is back to maximizing context first, with slots derived from the remaining KV budget unless explicitly overridden.Architecture
flowchart TD A["GGUF metadata + layer package info"] --> B["Runtime resource planner"] C["Local VRAM + local layer share"] --> B D["User overrides: ctx, parallel, KV cache"] --> B B --> E["Context length"] B --> F["Skippy lane count"] E --> G["Stage load request"] F --> H["Bounded generation admission"] H --> I["Use free lane"] H --> J["Wait briefly in bounded queue"] H --> K["Retryable 429 when saturated"] L["Coordinator election"] --> M["Fencing token"] M --> N["Reject stale coordinator actions"]Planning and admission stay separate: the runtime planner chooses context from the local KV budget and derives slots from the remaining capacity, while skippy-server enforces bounded request admission at generation time. Coordinator fencing protects the split orchestration path independently, so stale elected coordinators cannot continue to mutate deployment state after a newer election wins.
User Impact
Protocol
No breaking mesh gossip protocol change is intended. Coordinator fencing is additive to the split orchestration behavior, and OpenAI rate-limit responses remain OpenAI-shaped while adding retry guidance through
Retry-After.Validation
just buildcargo check -p mesh-llmcargo check -p mesh-llm-host-runtimecargo check -p skippy-servercargo test -p openai-frontend --libcargo fmt --all -- --checkgit diff --checkgit diff --check HEAD~1..HEADAlso attempted targeted
mesh-llm-host-runtimecontext-planning tests andskippy-serverlib tests locally; those were blocked by native staticllama-commonlinkage when the local llama static ABI libraries were not visible to the test build.