
Improve context planning, admission, and coordinator fencing #513

Merged
michaelneale merged 46 commits into main from micn/context-smarts on May 12, 2026
Conversation

@michaelneale
Collaborator

@michaelneale michaelneale commented May 11, 2026

Summary

This PR makes split and layer-package Skippy serving more production-shaped under real agent traffic. Model startup now plans context from GGUF metadata and the local layer share, Skippy defaults KV cache memory to Q8_0 unless the user overrides it, generation admission is bounded instead of failing immediately when all lanes are busy, and split coordinator election is fenced so stale coordinators cannot keep acting after leadership changes.

What Changed

  • Added split-aware context planning based on GGUF architecture metadata, KV bytes per token, local layer fraction, and available VRAM.
  • Switched Skippy KV cache defaults to Q8_0 for both K and V while preserving explicit --cache-type-k / --cache-type-v user overrides.
  • Removed the earlier KV quant negotiation ladder in favor of simpler budget-based planning.
  • Unified solo, split, and layer-package loads through the same runtime resource planning path.
  • Fixed solo layer-package planning so full local loads no longer scale model bytes by peer VRAM.
  • Added bounded OpenAI generation admission: free lanes are used immediately, brief overlap can queue behind busy lanes, and sustained overload returns retryable 429 rate_limit_exceeded with Retry-After.
  • Added split coordinator fencing so coordinator election changes invalidate stale coordinator work instead of allowing old leaders to continue launching or materializing stages.
  • Repaired host-runtime tests that had drifted from current structures and added host-runtime/model-artifact test coverage to CI.
  • Reverted the context/concurrency co-planning change from ba211c17; auto context planning is back to maximizing context first, with slots derived from the remaining KV budget unless explicitly overridden.

Architecture

flowchart TD
    A["GGUF metadata + layer package info"] --> B["Runtime resource planner"]
    C["Local VRAM + local layer share"] --> B
    D["User overrides: ctx, parallel, KV cache"] --> B
    B --> E["Context length"]
    B --> F["Skippy lane count"]
    E --> G["Stage load request"]
    F --> H["Bounded generation admission"]
    H --> I["Use free lane"]
    H --> J["Wait briefly in bounded queue"]
    H --> K["Retryable 429 when saturated"]
    L["Coordinator election"] --> M["Fencing token"]
    M --> N["Reject stale coordinator actions"]

Planning and admission stay separate: the runtime planner chooses context from the local KV budget and derives slots from the remaining capacity, while skippy-server enforces bounded request admission at generation time. Coordinator fencing protects the split orchestration path independently, so stale elected coordinators cannot continue to mutate deployment state after a newer election wins.
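
The fencing idea can be illustrated with a minimal Rust sketch. It assumes each stage remembers the highest coordinator term it has accepted and refuses anything older. `StageFence`, `ClaimOutcome`, and `allows_action` are hypothetical names used only for illustration, not the types in this PR; the real claim path also carries leases, topology hashes, and run IDs.

```rust
/// Outcome of presenting a coordinator claim to a stage (hypothetical type).
#[derive(Debug, PartialEq)]
enum ClaimOutcome {
    /// Claim accepted; `supersedes_term` is set when an older term was replaced.
    Accepted { supersedes_term: Option<u64> },
    /// Claim is older than the term this stage already follows.
    Rejected { current_term: u64 },
}

/// Per-stage fence that tracks the newest accepted coordinator term.
#[derive(Default)]
struct StageFence {
    current_term: Option<u64>,
}

impl StageFence {
    /// Accept a claim only if its term is not older than the current one.
    fn accept_claim(&mut self, term: u64) -> ClaimOutcome {
        match self.current_term {
            Some(current) if term < current => ClaimOutcome::Rejected { current_term: current },
            // Assumption: re-presenting the current term is a renewal, not a takeover.
            Some(current) if term == current => ClaimOutcome::Accepted { supersedes_term: None },
            previous => {
                self.current_term = Some(term);
                ClaimOutcome::Accepted { supersedes_term: previous }
            }
        }
    }

    /// Coordinator actions are honored only when they carry the current term.
    fn allows_action(&self, term: u64) -> bool {
        self.current_term == Some(term)
    }
}
```

In this sketch, an `Accepted` outcome with `supersedes_term: Some(_)` marks the moment at which work launched under the older term would be invalidated.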

User Impact

  • Split models can start with useful long context instead of falling back to 4K.
  • Agent clients such as Goose get smoother behavior during short request overlaps.
  • Sustained overload remains bounded and retryable rather than becoming an unbounded queue.
  • Explicit user settings for context, parallelism, and KV cache types still take precedence.
  • Split orchestration is safer across coordinator churn and leadership races.

Protocol

No breaking mesh gossip protocol change is intended. Coordinator fencing is additive to the split orchestration behavior, and OpenAI rate-limit responses remain OpenAI-shaped while adding retry guidance through Retry-After.

Validation

  • just build
  • cargo check -p mesh-llm
  • cargo check -p mesh-llm-host-runtime
  • cargo check -p skippy-server
  • cargo test -p openai-frontend --lib
  • cargo fmt --all -- --check
  • git diff --check
  • git diff --check HEAD~1..HEAD

Also attempted targeted mesh-llm-host-runtime context-planning tests and skippy-server lib tests locally; those were blocked by native static llama-common linkage when the local llama static ABI libraries were not visible to the test build.

Context planning now produces useful context windows for split models
instead of falling back to 4096.

Split-aware budget: the planner now accepts a local_layer_fraction so
it can compute the KV cache cost for just this node's layers, not the
whole model. For layer packages, the fraction is estimated from the
VRAM ratio (local / total mesh VRAM).
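
As a rough illustration of the split-aware budget, the sketch below divides the local KV budget by the per-token KV cost of only this node's layers. The `PlanInputs` struct, its field names, and the clamp to the native context length are assumptions made for this example, not the planner's actual API.

```rust
/// Inputs for a hypothetical split-aware context plan.
struct PlanInputs {
    /// VRAM left for the KV cache after local weights, in bytes.
    kv_budget_bytes: u64,
    /// KV cache bytes per token for the whole model (all layers).
    kv_bytes_per_token: u64,
    /// Share of the model's layers hosted on this node, in (0, 1].
    local_layer_fraction: f64,
    /// Native context length reported by the GGUF metadata.
    native_context: u32,
}

/// Context length = local KV budget / per-token KV cost of the local layers.
fn plan_context(inputs: &PlanInputs) -> u32 {
    // Keep the fraction in a sane range so a bad input cannot divide by zero.
    let fraction = inputs.local_layer_fraction.clamp(f64::MIN_POSITIVE, 1.0);
    let local_cost = (inputs.kv_bytes_per_token as f64 * fraction).ceil().max(1.0);
    let tokens = (inputs.kv_budget_bytes as f64 / local_cost).floor();
    // Assumption: never plan beyond the model's native context length.
    tokens.min(f64::from(inputs.native_context)) as u32
}
```

With a fraction of 1.0 this reduces to the solo case, where the node pays the KV cost for every layer.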

KV quant negotiation: when the requested KV quantisation (e.g. f16)
cannot reach the model's native context length, the planner walks a
quant ladder (f16 → q8_0 → q4_0) and picks the least aggressive
quant that fits. The negotiated quant is applied to the stage load
request automatically.

Layer package metadata: for split models, the planner now reads GGUF
architecture metadata from the layer package's shared/metadata.gguf
instead of returning None (which caused a fallback to 4096 default).

Also fixes 13 pre-existing compile errors in mesh-llm-host-runtime
test code (missing latency fields from #491, wrong function names and
stale struct fields from #485).
Copilot AI review requested due to automatic review settings May 11, 2026 01:17
Contributor

Copilot AI left a comment

Pull request overview

This PR improves runtime context planning for split (layer-package) models by making the VRAM/KV-cache budget calculation “split-aware” and by negotiating KV-cache quantization when needed to reach larger (ideally native) context lengths, instead of hard-falling back to 4K. It also includes fixes to previously broken mesh-llm-host-runtime tests/constructors that were out of CI coverage.

Changes:

  • Add split-aware planning inputs (local_layer_fraction) and scale KV-cache cost by local layer share to plan realistic context windows for split models.
  • Add KV quant “ladder” negotiation (f16 → q8_0 → q4_0) and plumb the negotiated cache types through runtime stage load.
  • Improve layer-package planning by reading compact GGUF metadata from shared/metadata.gguf, and fix multiple test compile issues due to struct/API drift.

Reviewed changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated 5 comments.

File | Description
--- | ---
crates/model-artifact/src/gguf.rs | Adds KV-cache quant helpers/constants and as_llama_arg() mapping used by negotiation/plumbing.
crates/mesh-llm-host-runtime/src/runtime/local.rs | Computes split-aware local model bytes/fraction, scans layer-package metadata, and applies negotiated KV quant to runtime load.
crates/mesh-llm-host-runtime/src/runtime/context_planning.rs | Implements split-aware KV budgeting + KV quant negotiation and updates/extends unit tests.
crates/mesh-llm-host-runtime/src/protocol/mod.rs | Updates test fixtures to match newer struct fields.
crates/mesh-llm-host-runtime/src/mesh/tests.rs | Updates test fixtures for new served-model descriptor shape and endpoint-id helper rename.
crates/mesh-llm-host-runtime/src/mesh/mod.rs | Adds peer VRAM summation helper used for split fraction estimation.
crates/mesh-llm-host-runtime/src/inference/skippy/mod.rs | Re-exports HF package resolution helper used by layer-package metadata scanning.

Comment thread crates/mesh-llm-host-runtime/src/mesh/mod.rs Outdated
Comment thread crates/mesh-llm-host-runtime/src/runtime/local.rs Outdated
Comment thread crates/mesh-llm-host-runtime/src/runtime/local.rs
Comment thread crates/mesh-llm-host-runtime/src/runtime/context_planning.rs Outdated
Comment thread crates/mesh-llm-host-runtime/src/runtime/local.rs
Fix 13 compile errors and 3 test failures in mesh-llm-host-runtime
that were silently broken on main (CI only ran mesh-llm --lib which
has zero tests).

Compile fixes:
- Add missing latency fields (latency_ms, latency_source,
  latency_age_ms, latency_observer_id) to PeerAnnouncement test
  constructions in protocol/mod.rs (#491 missed these sites)
- Fix test_endpoint_id → make_test_endpoint_id in mesh/tests.rs
  (#485 used wrong function name)
- Update ServedModelDescriptor to current struct shape (capabilities
  + topology instead of format + quantization + size_bytes)
- Add missing available_model_sizes field

Test fixes:
- gossip_frame_roundtrip_preserves_scanned_model_metadata: add
  ModelRuntimeDescriptor with context_length to served_model_runtime
  (was empty vec, then asserted on first element)
- initial_pretty_session_mode: update expectation to match current
  implementation (Client surface now allows dashboard)
- Remove broken timing-dependent streaming proxy test (covered by
  two other streaming tests that pass)
- Mark HF download test as #[ignore] (downloads 800MB, needs auth)

CI:
- Add cargo test -p mesh-llm-host-runtime --lib to both Linux and
  macOS CI jobs
- Add cargo test -p model-artifact --lib to both jobs
…st assertions

Address Copilot review feedback:

- Skip KV quant negotiation when the user explicitly set --cache-type-k
  or --cache-type-v. Previously the planner would negotiate to q4_0 for
  a larger context, but the downstream load honoured the user's f16
  override — producing a context/memory mismatch. New kv_quant_user_locked
  flag prevents this.

- Wrap scan_layer_package_metadata in spawn_blocking to avoid filesystem
  I/O on the async executor (GGUF header reads, stat calls).

- Tighten negotiate_kv_quant_upgrades_to_reach_native_context assertion
  to check exact expected value (16K) instead of just > 8K.

- Add user_locked_kv_quant_skips_negotiation test proving the lock
  prevents negotiation and produces a smaller context than unlocked.
Copilot AI review requested due to automatic review settings May 11, 2026 01:42
Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 10 out of 10 changed files in this pull request and generated 3 comments.

Comment thread crates/mesh-llm-host-runtime/src/mesh/mod.rs Outdated
Comment thread crates/mesh-llm-host-runtime/src/runtime/mod.rs
Comment thread crates/mesh-llm-host-runtime/src/runtime/context_planning.rs Outdated
i386 and others added 2 commits May 11, 2026 12:11
Drop the tiered KvCachePolicy (f16/q8_0/q4_0 by model size) and the
negotiation ladder in context_planning.  KV cache is now Q8_0 everywhere
unless the user explicitly sets --cache-type-k/v.

The planner just does: VRAM budget ÷ per-token KV cost → context length.
No tiers, no negotiation, no negotiated_kv_quant, no kv_quant_user_locked.

Split path now runs the same planner (was hardcoded to 4096).

-216 lines net.
@michaelneale michaelneale changed the title from "Split-aware context planning with KV quant negotiation" to "Split-aware context planning, universal Q8_0 KV default" on May 11, 2026
Copilot AI review requested due to automatic review settings May 11, 2026 02:21
Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 12 out of 12 changed files in this pull request and generated 4 comments.

Comment thread crates/mesh-llm-host-runtime/src/runtime/local.rs Outdated
Comment thread crates/mesh-llm-host-runtime/src/runtime/local.rs Outdated
Comment thread crates/mesh-llm-host-runtime/src/runtime/context_planning.rs
Comment thread crates/model-artifact/src/gguf.rs Outdated
michaelneale and others added 2 commits May 11, 2026 13:06
The local load path (start_runtime_local_model) incorrectly computed a
fractional layer share based on mesh VRAM ratio when loading layer-package
models.  Since this path loads the entire model on one node, the fraction
should always be 1.0 — fractional scaling only applies in the split path.

This could overestimate free VRAM and plan a context window larger than
actually fits.

Also: fallback on invalid --cache-type-k/v now defaults to Q8_0 (was f16),
remove dead total_peer_vram_bytes(), fix stale doc comment.
Copilot AI review requested due to automatic review settings May 11, 2026 03:10

@i386
Collaborator

i386 commented May 11, 2026

Added bounded OpenAI generation admission in 7a00d20e.

How it works now:

  • A request still enters immediately when a generation lane permit is free.
  • If all lanes are busy, the server allows a bounded waiting area: one queued request per configured generation lane.
  • A queued request waits up to 10 seconds for a permit.
  • If the queue is already full, or the 10-second wait expires, the response is a retryable OpenAI-shaped 429 rate_limit_exceeded with Retry-After: 1.
  • This gives agent clients like Goose enough grace for brief overlap without turning lane exhaustion into an unbounded request pileup.
flowchart TD
    A[OpenAI chat or completion request] --> B{Generation lane available?}
    B -- Yes --> C[Acquire lane permit]
    C --> D[Run generation]
    D --> E[Release lane permit]

    B -- No --> F{Queue slot available?}
    F -- No --> R[Return 429 rate_limit_exceeded\nRetry-After: 1]

    F -- Yes --> G[Reserve queue slot]
    G --> H{Lane opens within 10s?}
    H -- Yes --> I[Acquire lane permit\nDrop queue reservation]
    I --> D

    H -- No --> T[Drop queue reservation\nReturn 429 rate_limit_exceeded\nRetry-After: 1]
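
The admission flow above can be sketched with standard-library primitives. This is a thread-based illustration only: the `Admission`, `LanePermit`, and `Rejected` names are hypothetical, and the actual server may well use async permits instead of a `Condvar`.

```rust
use std::sync::{Arc, Condvar, Mutex};
use std::time::Duration;

/// Why a request was turned away; both cases would map to an HTTP 429.
#[derive(Debug, PartialEq)]
pub enum Rejected {
    QueueFull,
    WaitTimedOut,
}

#[derive(Default)]
struct Counts {
    active: usize,
    queued: usize,
}

/// Bounded admission gate: `lanes` running requests plus `lanes` waiters.
pub struct Admission {
    lanes: usize,
    max_wait: Duration,
    counts: Mutex<Counts>,
    lane_freed: Condvar,
}

/// Held while a generation runs; dropping it frees the lane.
pub struct LanePermit {
    gate: Arc<Admission>,
}

impl Admission {
    pub fn new(lanes: usize, max_wait: Duration) -> Arc<Self> {
        Arc::new(Self {
            lanes,
            max_wait,
            counts: Mutex::new(Counts::default()),
            lane_freed: Condvar::new(),
        })
    }

    pub fn acquire(self: &Arc<Self>) -> Result<LanePermit, Rejected> {
        let mut counts = self.counts.lock().unwrap();
        // Fast path: a lane is free, so enter immediately.
        if counts.active < self.lanes {
            counts.active += 1;
            return Ok(LanePermit { gate: Arc::clone(self) });
        }
        // One queued request per lane; beyond that, reject right away.
        if counts.queued >= self.lanes {
            return Err(Rejected::QueueFull);
        }
        counts.queued += 1;
        // Wait until a lane frees up or the bounded wait expires.
        let (mut counts, _) = self
            .lane_freed
            .wait_timeout_while(counts, self.max_wait, |c| c.active >= self.lanes)
            .unwrap();
        counts.queued -= 1;
        if counts.active < self.lanes {
            counts.active += 1;
            Ok(LanePermit { gate: Arc::clone(self) })
        } else {
            Err(Rejected::WaitTimedOut)
        }
    }
}

impl Drop for LanePermit {
    fn drop(&mut self) {
        let mut counts = self.gate.counts.lock().unwrap();
        counts.active -= 1;
        // Wake one waiter so it can take the freed lane.
        self.gate.lane_freed.notify_one();
    }
}
```

In a request handler, either rejection would translate into the 429 `rate_limit_exceeded` response with `Retry-After: 1` described above, and `max_wait` would be the 10-second limit.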

Validation run after rebasing on the latest PR head:

  • cargo check -p skippy-server

Earlier validation for the same change also passed cargo check -p mesh-llm, cargo fmt --all -- --check, and cargo test -p openai-frontend --lib. cargo test -p skippy-server --lib is still blocked locally by the missing native llama-common static library.

Copilot AI review requested due to automatic review settings May 11, 2026 03:27

@i386 i386 changed the title from "Split-aware context planning, universal Q8_0 KV default" to "Improve context planning, concurrency, and coordinator fencing" on May 11, 2026
@i386 i386 changed the title from "Improve context planning, concurrency, and coordinator fencing" to "Improve context planning, admission, and coordinator fencing" on May 11, 2026
The bounded admission fields added in 7a00d20 were not threaded into
the multimodal smoke fixtures, breaking `cargo check --tests` in CI.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings May 11, 2026 03:58
Two changes:

1. materialization: post-download snapshot re-scan now runs for ALL
   requests, not just metadata-only probes.  When the HF SDK downloads
   to a skeleton snapshot that can't satisfy the caller's layer range,
   re-scan all cached snapshots for one that can.  This catches stage
   load paths that carry a frozen skeleton hash in the topology config.

2. skippy-runtime: infer_activation_width_from_layers now checks the
   layer file exists before calling ModelInfo::open.  Previously the
   C++ gguf_init_from_file would log 'failed to open GGUF file' errors
   for missing files before the Rust error handling could suppress them.
   The file existence check avoids the noisy C++ error output entirely.
Copilot AI review requested due to automatic review settings May 12, 2026 02:52
Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 49 out of 50 changed files in this pull request and generated 3 comments.

Comment thread crates/skippy-coordinator/src/topology.rs
Comment on lines +107 to +143
async fn claim(&mut self, claim: StageCoordinatorClaim) -> Result<StageCoordinatorClaimAck> {
    match self
        .coordinator_claims
        .accept_claim(claim, current_time_unix_ms())
    {
        ClaimDecision::Accepted {
            supersedes_term: Some(_),
            claim,
        } => {
            self.fence_stale_runtime_for_claim(&claim).await?;
            Ok(StageCoordinatorClaimAck {
                accepted: true,
                claim,
                error: None,
            })
        }
        ClaimDecision::Accepted { claim, .. } => Ok(StageCoordinatorClaimAck {
            accepted: true,
            claim,
            error: None,
        }),
        ClaimDecision::Rejected { current, reason } => Ok(StageCoordinatorClaimAck {
            accepted: false,
            claim: current.unwrap_or_else(|| StageCoordinatorClaim {
                model_id: String::new(),
                package_ref: String::new(),
                manifest_sha256: String::new(),
                topology_id: String::new(),
                run_id: String::new(),
                coordinator_id: String::new(),
                coordinator_term: 0,
                participant_set_hash: String::new(),
                topology_hash: String::new(),
                lease_until_unix_ms: 0,
            }),
            error: Some(reason.to_string()),
        }),
run_id: context.run_id.to_string(),
stage_id: stage.stage_id.clone(),
shutdown_generation,
coordinator_term: shutdown_generation,
* origin/main:
  Hide cold models from chat selector (#520)
  Decompose skippy-prompt main (#517)
  Stream layer package artifacts from HF jobs (#515)
Copilot AI review requested due to automatic review settings May 12, 2026 03:49
Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 49 out of 50 changed files in this pull request and generated 5 comments.

Comment thread crates/skippy-coordinator/src/topology.rs
Comment thread crates/skippy-coordinator/src/topology.rs
Comment on lines 5333 to +5346
 fn ensure_requested_model(advertised_model_id: &str, requested: &str) -> OpenAiResult<()> {
-    if requested == advertised_model_id {
+    if requested == advertised_model_id
+        || strip_default_revision(requested) == strip_default_revision(advertised_model_id)
+    {
         Ok(())
     } else {
         Err(OpenAiError::model_not_found(requested))
     }
 }
+
+/// Strip `@main` so `org/repo@main:Q4` and `org/repo:Q4` compare equal.
+fn strip_default_revision(id: &str) -> String {
+    id.replacen("@main", "", 1)
+}
Collaborator

@michaelneale I agree this one is worth fixing. The current normalization changes model identity, not just presentation, so a non-default revision that happens to start with main could match the wrong advertised model.

Comment on lines +175 to +178
function isSplitParticipant(payload: StatusPayload): boolean {
  const stages = payload.runtime?.stages ?? []
  return stages.some((s) => s.node_id === payload.node_id || s.node_id.startsWith(payload.node_id))
}
run_id: context.run_id.to_string(),
stage_id: stage.stage_id.clone(),
shutdown_generation,
coordinator_term: shutdown_generation,
Match llama-server's default of --parallel 4.  Lanes share a unified KV
cache with eviction (kv_unified=true), so lane count does not multiply
KV memory cost.  4 concurrent request slots is a sensible default for
most model sizes; users can override via gpu.parallel in config.toml or
the per-model parallel setting.
* origin/main:
  fix(debug-model-loading): fix extremely long model loading times in debug builds
@michaelneale
Collaborator Author

some final testing:

- All tests pass: 25 (skippy-coordinator) + 1292 (host-runtime) + 62 (skippy-server)
- Pushed to micn/context-smarts
- Release build + deploy to Studio
- Split running: Studio stage-0 (layers 0..63) + Local stage-1 (layers 63..64)
- 4 lanes, 40K context (down from 16 lanes)
- Inference works through both Studio and local proxy
- Zero snapshot errors
- Console at http://localhost:3131
 │ Mode                              │ Result                                                    │
 ├───────────────────────────────────┼───────────────────────────────────────────────────────────┤
 │ Solo (Qwen3-4B local)             │ ✅ Serving, inference works                               │
 ├───────────────────────────────────┼───────────────────────────────────────────────────────────┤
 │ Client --auto (public mesh)       │ ✅ 19 peers, routed to Qwen3-8B, "Blue."                  │
 ├───────────────────────────────────┼───────────────────────────────────────────────────────────┤
 │ Split (Studio + Local, Qwen3-32B) │ ✅ 4 lanes, 40K ctx, inference works both paths, 0 errors │
 └───────────────────────────────────┴───────────────────────────────────────────────────────────┘

@i386
Collaborator

i386 commented May 12, 2026

@michaelneale two additional worthwhile fixes from a pass over the current PR head:

  1. SplitTopologyCoordinator::local_model_fits() currently calls election::total_model_bytes(&self.model_path). For HF layer-package refs, self.model_path is an hf://... pseudo-path, so that returns 0 and makes local fallback look possible even when the source model cannot fit locally. That can cause stage-loss recovery to choose local fallback, tear down the split handle, then fail to load local. This should use self.package.source_model_bytes or the layer-package-aware planning bytes instead.

  2. cargo test -p skippy-coordinator --lib passes but emits a dead-code warning for QWEN_CODER_480B_Q4_KV_BYTES_PER_TOKEN in crates/skippy-coordinator/src/topology.rs. The repo checklist says not to leave warnings in touched code, so either remove it or add coverage that uses it.

…r packages, dead code

- strip_default_revision now only removes @main when followed by : or
  end-of-string, preventing corruption of repo names like @mainland.
- SplitTopologyCoordinator::local_model_fits uses package source_model_bytes
  instead of stat-ing the hf:// pseudo-path (which returned 0, making local
  fallback look possible when the model cannot actually fit).
- Remove unused QWEN_CODER_480B_Q4_KV_BYTES_PER_TOKEN constant.
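
One way to express the boundary-aware rule from the first bullet is sketched below. It is an illustration of the described behavior (remove `@main` only when it is followed by `:` or the end of the string), not the merged implementation.

```rust
/// Strip a default `@main` revision so `org/repo@main:Q4` and `org/repo:Q4`
/// compare equal, while leaving names such as `@mainland` untouched.
fn strip_default_revision(id: &str) -> String {
    const MARKER: &str = "@main";
    if let Some(pos) = id.find(MARKER) {
        let rest = &id[pos + MARKER.len()..];
        // Only treat `@main` as a revision when it ends the id or precedes `:`.
        if rest.is_empty() || rest.starts_with(':') {
            return format!("{}{}", &id[..pos], rest);
        }
    }
    id.to_string()
}
```

Unlike the earlier `replacen` version, this leaves any identifier unchanged when `@main` is merely a prefix of a longer revision or name.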
Copilot AI review requested due to automatic review settings May 12, 2026 05:59
Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 49 out of 50 changed files in this pull request and generated 3 comments.

Comment on lines +277 to +281
parallel_lanes: usize,
) -> u64 {
let kv_bytes = u128::from(kv_per_layer)
.saturating_mul(u128::from(context_length))
.saturating_mul(parallel_lanes as u128);
Comment on lines +205 to +232
fn node_subsets(nodes: &[UsableNode], count: usize) -> Vec<Vec<UsableNode>> {
    let mut subsets = Vec::new();
    let mut current = Vec::with_capacity(count);
    collect_node_subsets(nodes, count, 0, &mut current, &mut subsets);
    subsets
}

fn collect_node_subsets(
    nodes: &[UsableNode],
    count: usize,
    start: usize,
    current: &mut Vec<UsableNode>,
    subsets: &mut Vec<Vec<UsableNode>>,
) {
    if current.len() == count {
        subsets.push(current.clone());
        return;
    }
    let needed = count - current.len();
    if nodes.len().saturating_sub(start) < needed {
        return;
    }
    for index in start..=nodes.len() - needed {
        current.push(nodes[index].clone());
        collect_node_subsets(nodes, count, index + 1, current, subsets);
        current.pop();
    }
}
Collaborator Author

What? These are only the nodes for a split; there is no world in which this is an issue, goodness me.

Collaborator

@ndizazzo ndizazzo left a comment

I think there may be an availability risk in the new coordinator-fencing flow. In claim_split_coordinator_lease(), the coordinator can reach quorum for a newer topology, but each accepting stage has already processed StageControlRequest::Claim; on the stage side, claim() immediately calls fence_stale_runtime_for_claim() when the claim supersedes an older term.

If the replan later fails during prepare/load, this could partially or fully shut down the currently serving lower-term topology before the replacement is ready. A safer shape might be to avoid destructive fencing during the initial claim phase, then fence old runtime state only after the replacement generation has completed prepare/load and is ready for cutover.

The relevant paths look like:

  • crates/mesh-llm-host-runtime/src/runtime/local.rs::claim_split_coordinator_lease
  • load_split_runtime_generation
  • crates/mesh-llm-host-runtime/src/inference/skippy/stage/mod.rs::{claim,fence_stale_runtime_for_claim}

We could add some regression coverage for partial claim acceptance and for a later prepare/load failure after some stages have accepted the new term. The old split should remain serving unless the replacement topology is actually ready to take over.

@i386
Collaborator

i386 commented May 12, 2026

Opened follow-up PR #526 for the remaining worthwhile open review comments: https://github.com/Mesh-LLM/mesh-llm/pull/526

cc @michaelneale

… snapshots

Root cause: when two nodes had different HF cache states for the same
layer package (different snapshot commits, different model-package.json
content), the cache resolution code could pick a stale snapshot with all
layers present instead of the current HEAD snapshot with partial layers.
This caused manifest sha256 mismatches during split Load, killing the
split topology.

Three changes:

1. Cache resolution now checks only the REQUESTED layer range, not all
   declared layers. Metadata-only probes (layer_start=layer_end=0) check
   for at least one layer artifact (anti-skeleton). Real stage loads
   check their assigned range only.

2. Removed the pre-download floating-revision fallback scan that walked
   all cached snapshots looking for one with layers. This was the code
   that picked stale snapshots with different manifests.

3. Scoped the post-download fallback scan to metadata-only probes only.
   Real stage loads always download their assigned layers into the HEAD
   snapshot, so no fallback is needed.

Also adds debug-level stage control tracing in handle_stage_control for
future split debugging.

Validated: 480B split across Studio (stage-0, layers 0-49) and James
(stage-1, layers 50-61) over relay — inference working end-to-end with
divergent HF cache states on both nodes.
Adds 10 focused tests for the cache resolution functions introduced in
the previous commit:
- cached_snapshot_has_any_layer_artifact: skeleton rejection, partial acceptance
- cached_snapshot_has_requested_layers: range checks, partial ranges, missing layers
- should_prefer_cached_snapshot_for_request: dispatch to correct check based on metadata-only vs stage load
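
The dispatch between the two checks can be sketched as follows. The function names come from the commit message, but the signatures, the `BTreeSet<u32>` stand-in for a cached snapshot's layer files, and the half-open `layer_start..layer_end` range are assumptions made for this example.

```rust
use std::collections::BTreeSet;

/// Anti-skeleton check: a usable snapshot holds at least one layer artifact.
fn cached_snapshot_has_any_layer_artifact(cached_layers: &BTreeSet<u32>) -> bool {
    !cached_layers.is_empty()
}

/// True when every layer in the half-open range `start..end` is cached.
fn cached_snapshot_has_requested_layers(
    cached_layers: &BTreeSet<u32>,
    start: u32,
    end: u32,
) -> bool {
    start < end && (start..end).all(|layer| cached_layers.contains(&layer))
}

/// Metadata-only probes (start == end == 0) use the anti-skeleton check;
/// real stage loads check only their assigned layer range.
fn should_prefer_cached_snapshot_for_request(
    cached_layers: &BTreeSet<u32>,
    layer_start: u32,
    layer_end: u32,
) -> bool {
    let metadata_only = layer_start == 0 && layer_end == 0;
    if metadata_only {
        cached_snapshot_has_any_layer_artifact(cached_layers)
    } else {
        cached_snapshot_has_requested_layers(cached_layers, layer_start, layer_end)
    }
}
```

Under these assumptions, a snapshot with only some layers still satisfies a stage whose assigned range it fully covers, which matches the partial-acceptance cases the tests describe.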
Copilot AI review requested due to automatic review settings May 12, 2026 07:29
@michaelneale
Collaborator Author

@ndizazzo I think this branch has diverged enough - can we do some follow up with that?

Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 49 out of 50 changed files in this pull request and generated 3 comments.

Comment thread crates/mesh-llm-host-runtime/src/inference/skippy/mod.rs
Comment thread crates/mesh-llm-host-runtime/src/runtime/split_planning.rs Outdated
Comment thread crates/mesh-llm-host-runtime/src/inference/skippy/stage/mod.rs
KV cache is a unified allocation shared across parallel lanes with
eviction. The diagnostic function was multiplying by lane count,
overstating memory needs in failure messages.
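
Because the cache is one shared allocation, a diagnostic estimate depends on context length but not on lane count. A minimal sketch, assuming a hypothetical `estimated_kv_bytes` helper that takes a per-token KV cost in bytes (the real function's name and parameters may differ):

```rust
/// Estimate unified KV cache size: per-token cost times context length.
/// Lane count is deliberately absent because lanes share one allocation.
fn estimated_kv_bytes(kv_bytes_per_token: u64, context_length: u32) -> u64 {
    let bytes = u128::from(kv_bytes_per_token).saturating_mul(u128::from(context_length));
    // Saturate instead of wrapping if the product does not fit in u64.
    u64::try_from(bytes).unwrap_or(u64::MAX)
}
```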
@michaelneale michaelneale merged commit d136df1 into main May 12, 2026
42 of 43 checks passed
@michaelneale michaelneale deleted the micn/context-smarts branch May 12, 2026 08:04

Labels

blocker blocking other PRs


4 participants