Improve context planning, admission, and coordinator fencing by michaelneale · Pull Request #513 · Mesh-LLM/mesh-llm

michaelneale · 2026-05-11T01:17:41Z

Summary

This PR makes split and layer-package Skippy serving more production-shaped under real agent traffic. Model startup now plans context from GGUF metadata and the local layer share, Skippy defaults KV cache memory to Q8_0 unless the user overrides it, generation admission is bounded instead of failing immediately when all lanes are busy, and split coordinator election is fenced so stale coordinators cannot keep acting after leadership changes.

What Changed

Added split-aware context planning based on GGUF architecture metadata, KV bytes per token, local layer fraction, and available VRAM.
Switched Skippy KV cache defaults to Q8_0 for both K and V while preserving explicit --cache-type-k / --cache-type-v user overrides.
Removed the earlier KV quant negotiation ladder in favor of simpler budget-based planning.
Unified solo, split, and layer-package loads through the same runtime resource planning path.
Fixed solo layer-package planning so full local loads no longer scale model bytes by peer VRAM.
Added bounded OpenAI generation admission: free lanes are used immediately, brief overlap can queue behind busy lanes, and sustained overload returns retryable 429 rate_limit_exceeded with Retry-After.
Added split coordinator fencing so coordinator election changes invalidate stale coordinator work instead of allowing old leaders to continue launching or materializing stages.
Repaired host-runtime tests that had drifted from current structures and added host-runtime/model-artifact test coverage to CI.
Reverted the context/concurrency co-planning change from ba211c17; auto context planning is back to maximizing context first, with slots derived from the remaining KV budget unless explicitly overridden.

Architecture

flowchart TD
    A["GGUF metadata + layer package info"] --> B["Runtime resource planner"]
    C["Local VRAM + local layer share"] --> B
    D["User overrides: ctx, parallel, KV cache"] --> B
    B --> E["Context length"]
    B --> F["Skippy lane count"]
    E --> G["Stage load request"]
    F --> H["Bounded generation admission"]
    H --> I["Use free lane"]
    H --> J["Wait briefly in bounded queue"]
    H --> K["Retryable 429 when saturated"]
    L["Coordinator election"] --> M["Fencing token"]
    M --> N["Reject stale coordinator actions"]

Planning and admission stay separate: the runtime planner chooses context from the local KV budget and derives slots from the remaining capacity, while skippy-server enforces bounded request admission at generation time. Coordinator fencing protects the split orchestration path independently, so stale elected coordinators cannot continue to mutate deployment state after a newer election wins.

User Impact

Split models can start with useful long context instead of falling back to 4K.
Agent clients such as Goose get smoother behavior during short request overlaps.
Sustained overload remains bounded and retryable rather than becoming an unbounded queue.
Explicit user settings for context, parallelism, and KV cache types still take precedence.
Split orchestration is safer across coordinator churn and leadership races.

Protocol

No breaking mesh gossip protocol change is intended. Coordinator fencing is additive to the split orchestration behavior, and OpenAI rate-limit responses remain OpenAI-shaped while adding retry guidance through Retry-After.

Validation

just build
cargo check -p mesh-llm
cargo check -p mesh-llm-host-runtime
cargo check -p skippy-server
cargo test -p openai-frontend --lib
cargo fmt --all -- --check
git diff --check
git diff --check HEAD~1..HEAD

Also attempted targeted mesh-llm-host-runtime context-planning tests and skippy-server lib tests locally; those were blocked by native static llama-common linkage when the local llama static ABI libraries were not visible to the test build.

Context planning now produces useful context windows for split models instead of falling back to 4096. Split-aware budget: the planner now accepts a local_layer_fraction so it can compute the KV cache cost for just this node's layers, not the whole model. For layer packages, the fraction is estimated from the VRAM ratio (local / total mesh VRAM). KV quant negotiation: when the requested KV quantisation (e.g. f16) cannot reach the model's native context length, the planner walks a quant ladder (f16 → q8_0 → q4_0) and picks the least aggressive quant that fits. The negotiated quant is applied to the stage load request automatically. Layer package metadata: for split models, the planner now reads GGUF architecture metadata from the layer package's shared/metadata.gguf instead of returning None (which caused a fallback to 4096 default). Also fixes 13 pre-existing compile errors in mesh-llm-host-runtime test code (missing latency fields from #491, wrong function names and stale struct fields from #485).

Copilot

Pull request overview

This PR improves runtime context planning for split (layer-package) models by making the VRAM/KV-cache budget calculation “split-aware” and by negotiating KV-cache quantization when needed to reach larger (ideally native) context lengths, instead of hard-falling back to 4K. It also includes fixes to previously broken mesh-llm-host-runtime tests/constructors that were out of CI coverage.

Changes:

Add split-aware planning inputs (local_layer_fraction) and scale KV-cache cost by local layer share to plan realistic context windows for split models.
Add KV quant “ladder” negotiation (f16 → q8_0 → q4_0) and plumb the negotiated cache types through runtime stage load.
Improve layer-package planning by reading compact GGUF metadata from shared/metadata.gguf, and fix multiple test compile issues due to struct/API drift.

Reviewed changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated 5 comments.

Show a summary per file

File	Description
crates/model-artifact/src/gguf.rs	Adds KV-cache quant helpers/constants and `as_llama_arg()` mapping used by negotiation/plumbing.
crates/mesh-llm-host-runtime/src/runtime/local.rs	Computes split-aware local model bytes/fraction, scans layer-package metadata, and applies negotiated KV quant to runtime load.
crates/mesh-llm-host-runtime/src/runtime/context_planning.rs	Implements split-aware KV budgeting + KV quant negotiation and updates/extends unit tests.
crates/mesh-llm-host-runtime/src/protocol/mod.rs	Updates test fixtures to match newer struct fields.
crates/mesh-llm-host-runtime/src/mesh/tests.rs	Updates test fixtures for new served-model descriptor shape and endpoint-id helper rename.
crates/mesh-llm-host-runtime/src/mesh/mod.rs	Adds peer VRAM summation helper used for split fraction estimation.
crates/mesh-llm-host-runtime/src/inference/skippy/mod.rs	Re-exports HF package resolution helper used by layer-package metadata scanning.

Fix 13 compile errors and 3 test failures in mesh-llm-host-runtime that were silently broken on main (CI only ran mesh-llm --lib which has zero tests). Compile fixes: - Add missing latency fields (latency_ms, latency_source, latency_age_ms, latency_observer_id) to PeerAnnouncement test constructions in protocol/mod.rs (#491 missed these sites) - Fix test_endpoint_id → make_test_endpoint_id in mesh/tests.rs (#485 used wrong function name) - Update ServedModelDescriptor to current struct shape (capabilities + topology instead of format + quantization + size_bytes) - Add missing available_model_sizes field Test fixes: - gossip_frame_roundtrip_preserves_scanned_model_metadata: add ModelRuntimeDescriptor with context_length to served_model_runtime (was empty vec, then asserted on first element) - initial_pretty_session_mode: update expectation to match current implementation (Client surface now allows dashboard) - Remove broken timing-dependent streaming proxy test (covered by two other streaming tests that pass) - Mark HF download test as #[ignore] (downloads 800MB, needs auth) CI: - Add cargo test -p mesh-llm-host-runtime --lib to both Linux and macOS CI jobs - Add cargo test -p model-artifact --lib to both jobs

…st assertions Address Copilot review feedback: - Skip KV quant negotiation when the user explicitly set --cache-type-k or --cache-type-v. Previously the planner would negotiate to q4_0 for a larger context, but the downstream load honoured the user's f16 override — producing a context/memory mismatch. New kv_quant_user_locked flag prevents this. - Wrap scan_layer_package_metadata in spawn_blocking to avoid filesystem I/O on the async executor (GGUF header reads, stat calls). - Tighten negotiate_kv_quant_upgrades_to_reach_native_context assertion to check exact expected value (16K) instead of just > 8K. - Add user_locked_kv_quant_skips_negotiation test proving the lock prevents negotiation and produces a smaller context than unlocked.

Copilot

Pull request overview

Copilot reviewed 10 out of 10 changed files in this pull request and generated 3 comments.

Drop the tiered KvCachePolicy (f16/q8_0/q4_0 by model size) and the negotiation ladder in context_planning. KV cache is now Q8_0 everywhere unless the user explicitly sets --cache-type-k/v. The planner just does: VRAM budget ÷ per-token KV cost → context length. No tiers, no negotiation, no negotiated_kv_quant, no kv_quant_user_locked. Split path now runs the same planner (was hardcoded to 4096). -216 lines net.

Copilot

Pull request overview

Copilot reviewed 12 out of 12 changed files in this pull request and generated 4 comments.

The local load path (start_runtime_local_model) incorrectly computed a fractional layer share based on mesh VRAM ratio when loading layer-package models. Since this path loads the entire model on one node, the fraction should always be 1.0 — fractional scaling only applies in the split path. This could overestimate free VRAM and plan a context window larger than actually fits. Also: fallback on invalid --cache-type-k/v now defaults to Q8_0 (was f16), remove dead total_peer_vram_bytes(), fix stale doc comment.

Copilot

Copilot was unable to review this pull request because the user who requested the review is ineligible. To be eligible to request a review, you need a paid Copilot license, or your organization must enable Copilot code review.

i386 · 2026-05-11T03:11:10Z

Added bounded OpenAI generation admission in 7a00d20e.

How it works now:

A request still enters immediately when a generation lane permit is free.
If all lanes are busy, the server allows a bounded waiting area: one queued request per configured generation lane.
A queued request waits up to 10 seconds for a permit.
If the queue is already full, or the 10-second wait expires, the response is a retryable OpenAI-shaped 429 rate_limit_exceeded with Retry-After: 1.
This gives agent clients like Goose enough grace for brief overlap without turning lane exhaustion into an unbounded request pileup.

flowchart TD
    A[OpenAI chat or completion request] --> B{Generation lane available?}
    B -- Yes --> C[Acquire lane permit]
    C --> D[Run generation]
    D --> E[Release lane permit]

    B -- No --> F{Queue slot available?}
    F -- No --> R[Return 429 rate_limit_exceeded\nRetry-After: 1]

    F -- Yes --> G[Reserve queue slot]
    G --> H{Lane opens within 10s?}
    H -- Yes --> I[Acquire lane permit\nDrop queue reservation]
    I --> D

    H -- No --> T[Drop queue reservation\nReturn 429 rate_limit_exceeded\nRetry-After: 1]

Validation run after rebasing on the latest PR head:

cargo check -p skippy-server

Earlier validation for the same change also passed cargo check -p mesh-llm, cargo fmt --all -- --check, and cargo test -p openai-frontend --lib. cargo test -p skippy-server --lib is still blocked locally by the missing native llama-common static library.

# Conflicts: # crates/mesh-llm-host-runtime/src/mesh/tests.rs

Copilot

Copilot was unable to review this pull request because the user who requested the review is ineligible. To be eligible to request a review, you need a paid Copilot license, or your organization must enable Copilot code review.

This reverts commit ba211c1.

The bounded admission fields added in 7a00d20 were not threaded into the multimodal smoke fixtures, breaking `cargo check --tests` in CI. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Two changes: 1. materialization: post-download snapshot re-scan now runs for ALL requests, not just metadata-only probes. When the HF SDK downloads to a skeleton snapshot that can't satisfy the caller's layer range, re-scan all cached snapshots for one that can. This catches stage load paths that carry a frozen skeleton hash in the topology config. 2. skippy-runtime: infer_activation_width_from_layers now checks the layer file exists before calling ModelInfo::open. Previously the C++ gguf_init_from_file would log 'failed to open GGUF file' errors for missing files before the Rust error handling could suppress them. The file existence check avoids the noisy C++ error output entirely.

Copilot

Pull request overview

Copilot reviewed 49 out of 50 changed files in this pull request and generated 3 comments.

+    async fn claim(&mut self, claim: StageCoordinatorClaim) -> Result<StageCoordinatorClaimAck> {
+        match self
+            .coordinator_claims
+            .accept_claim(claim, current_time_unix_ms())
+        {
+            ClaimDecision::Accepted {
+                supersedes_term: Some(_),
+                claim,
+            } => {
+                self.fence_stale_runtime_for_claim(&claim).await?;
+                Ok(StageCoordinatorClaimAck {
+                    accepted: true,
+                    claim,
+                    error: None,
+                })
+            }
+            ClaimDecision::Accepted { claim, .. } => Ok(StageCoordinatorClaimAck {
+                accepted: true,
+                claim,
+                error: None,
+            }),
+            ClaimDecision::Rejected { current, reason } => Ok(StageCoordinatorClaimAck {
+                accepted: false,
+                claim: current.unwrap_or_else(|| StageCoordinatorClaim {
+                    model_id: String::new(),
+                    package_ref: String::new(),
+                    manifest_sha256: String::new(),
+                    topology_id: String::new(),
+                    run_id: String::new(),
+                    coordinator_id: String::new(),
+                    coordinator_term: 0,
+                    participant_set_hash: String::new(),
+                    topology_hash: String::new(),
+                    lease_until_unix_ms: 0,
+                }),
+                error: Some(reason.to_string()),
+            }),


        run_id: context.run_id.to_string(),
        stage_id: stage.stage_id.clone(),
        shutdown_generation,
+        coordinator_term: shutdown_generation,


* origin/main: Hide cold models from chat selector (#520) Decompose skippy-prompt main (#517) Stream layer package artifacts from HF jobs (#515)

Copilot

Pull request overview

Copilot reviewed 49 out of 50 changed files in this pull request and generated 5 comments.

i386 · 2026-05-12T05:50:16Z

 fn ensure_requested_model(advertised_model_id: &str, requested: &str) -> OpenAiResult<()> {
-    if requested == advertised_model_id {
+    if requested == advertised_model_id
+        || strip_default_revision(requested) == strip_default_revision(advertised_model_id)
+    {
        Ok(())
    } else {
        Err(OpenAiError::model_not_found(requested))
    }
 }

+/// Strip `@main` so `org/repo@main:Q4` and `org/repo:Q4` compare equal.
+fn strip_default_revision(id: &str) -> String {
+    id.replacen("@main", "", 1)
+}


@michaelneale I agree this one is worth fixing. The current normalization changes model identity, not just presentation, so a non-default revision that happens to start with main could match the wrong advertised model.

+function isSplitParticipant(payload: StatusPayload): boolean {
+  const stages = payload.runtime?.stages ?? []
+  return stages.some((s) => s.node_id === payload.node_id || s.node_id.startsWith(payload.node_id))
+}


        run_id: context.run_id.to_string(),
        stage_id: stage.stage_id.clone(),
        shutdown_generation,
+        coordinator_term: shutdown_generation,


Match llama-server's default of --parallel 4. Lanes share a unified KV cache with eviction (kv_unified=true), so lane count does not multiply KV memory cost. 4 concurrent request slots is a sensible default for most model sizes; users can override via gpu.parallel in config.toml or the per-model parallel setting.

* origin/main: fix(debug-model-loading): fix extremely long model loading times in debug builds

michaelneale · 2026-05-12T04:41:13Z

some final testing:

- All tests pass: 25 (skippy-coordinator) + 1292 (host-runtime) + 62 (skippy-server)
 - Pushed to micn/context-smarts
 - Release build + deploy to Studio
 - Split running: Studio stage-0 (layers 0..63) + Local stage-1 (layers 63..64)
 - 4 lanes, 40K context — down from 16 lanes
 - Inference works through both Studio and local proxy
 - Zero snapshot errors
 - Console at http://localhost:3131
 │ Mode                              │ Result                                                    │
 ├───────────────────────────────────┼───────────────────────────────────────────────────────────┤
 │ Solo (Qwen3-4B local)             │ ✅ Serving, inference works                               │
 ├───────────────────────────────────┼───────────────────────────────────────────────────────────┤
 │ Client --auto (public mesh)       │ ✅ 19 peers, routed to Qwen3-8B, "Blue."                  │
 ├───────────────────────────────────┼───────────────────────────────────────────────────────────┤
 │ Split (Studio + Local, Qwen3-32B) │ ✅ 4 lanes, 40K ctx, inference works both paths, 0 errors │
 └───────────────────────────────────┴───────────────────────────────────────────────────────────┘

i386 · 2026-05-12T05:50:40Z

@michaelneale two additional worthwhile fixes from a pass over the current PR head:

SplitTopologyCoordinator::local_model_fits() currently calls election::total_model_bytes(&self.model_path). For HF layer-package refs, self.model_path is an hf://... pseudo-path, so that returns 0 and makes local fallback look possible even when the source model cannot fit locally. That can cause stage-loss recovery to choose local fallback, tear down the split handle, then fail to load local. This should use self.package.source_model_bytes or the layer-package-aware planning bytes instead.
cargo test -p skippy-coordinator --lib passes but emits a dead-code warning for QWEN_CODER_480B_Q4_KV_BYTES_PER_TOKEN in crates/skippy-coordinator/src/topology.rs. The repo checklist says not to leave warnings in touched code, so either remove it or add coverage that uses it.

@main

…r packages, dead code - strip_default_revision now only removes @main when followed by : or end-of-string, preventing corruption of repo names like @mainland. - SplitTopologyCoordinator::local_model_fits uses package source_model_bytes instead of stat-ing the hf:// pseudo-path (which returned 0, making local fallback look possible when the model cannot actually fit). - Remove unused QWEN_CODER_480B_Q4_KV_BYTES_PER_TOKEN constant.

Copilot

Pull request overview

Copilot reviewed 49 out of 50 changed files in this pull request and generated 3 comments.

+    parallel_lanes: usize,
+) -> u64 {
+    let kv_bytes = u128::from(kv_per_layer)
+        .saturating_mul(u128::from(context_length))
+        .saturating_mul(parallel_lanes as u128);


+    async fn claim(&mut self, claim: StageCoordinatorClaim) -> Result<StageCoordinatorClaimAck> {
+        match self
+            .coordinator_claims
+            .accept_claim(claim, current_time_unix_ms())
+        {
+            ClaimDecision::Accepted {
+                supersedes_term: Some(_),
+                claim,
+            } => {
+                self.fence_stale_runtime_for_claim(&claim).await?;
+                Ok(StageCoordinatorClaimAck {
+                    accepted: true,
+                    claim,
+                    error: None,
+                })
+            }
+            ClaimDecision::Accepted { claim, .. } => Ok(StageCoordinatorClaimAck {
+                accepted: true,
+                claim,
+                error: None,
+            }),
+            ClaimDecision::Rejected { current, reason } => Ok(StageCoordinatorClaimAck {
+                accepted: false,
+                claim: current.unwrap_or_else(|| StageCoordinatorClaim {
+                    model_id: String::new(),
+                    package_ref: String::new(),
+                    manifest_sha256: String::new(),
+                    topology_id: String::new(),
+                    run_id: String::new(),
+                    coordinator_id: String::new(),
+                    coordinator_term: 0,
+                    participant_set_hash: String::new(),
+                    topology_hash: String::new(),
+                    lease_until_unix_ms: 0,
+                }),
+                error: Some(reason.to_string()),
+            }),


michaelneale · 2026-05-12T06:14:33Z

+fn node_subsets(nodes: &[UsableNode], count: usize) -> Vec<Vec<UsableNode>> {
+    let mut subsets = Vec::new();
+    let mut current = Vec::with_capacity(count);
+    collect_node_subsets(nodes, count, 0, &mut current, &mut subsets);
+    subsets
+}
+
+fn collect_node_subsets(
+    nodes: &[UsableNode],
+    count: usize,
+    start: usize,
+    current: &mut Vec<UsableNode>,
+    subsets: &mut Vec<Vec<UsableNode>>,
+) {
+    if current.len() == count {
+        subsets.push(current.clone());
+        return;
+    }
+    let needed = count - current.len();
+    if nodes.len().saturating_sub(start) < needed {
+        return;
+    }
+    for index in start..=nodes.len() - needed {
+        current.push(nodes[index].clone());
+        collect_node_subsets(nodes, count, index + 1, current, subsets);
+        current.pop();
+    }
+}


what - this is only the nodes for a split there is no world in which this is an issue, goodness me.

ndizazzo

I think there may be an availability risk in the new coordinator-fencing flow. In claim_split_coordinator_lease(), the coordinator can reach quorum for a newer topology, but each accepting stage has already processed StageControlRequest::Claim; on the stage side, claim() immediately calls fence_stale_runtime_for_claim() when the claim supersedes an older term.

If the replan later fails during prepare/load, this could partially or fully shut down the currently serving lower-term topology before the replacement is ready. A safer shape might be to avoid destructive fencing during the initial claim phase, then fence old runtime state only after the replacement generation has completed prepare/load and is ready for cutover.

The relevant paths look like:

crates/mesh-llm-host-runtime/src/runtime/local.rs::claim_split_coordinator_lease
load_split_runtime_generation
crates/mesh-llm-host-runtime/src/inference/skippy/stage/mod.rs::{claim,fence_stale_runtime_for_claim}

We could add some regression coverage for partial claim acceptance and for a later prepare/load failure after some stages have accepted the new term. The old split should remain serving unless the replacement topology is actually ready to take over.

i386 · 2026-05-12T06:37:07Z

Opened follow-up PR #526 for the remaining worthwhile open review comments: https://github.com/Mesh-LLM/mesh-llm/pull/526\n\ncc @michaelneale

… snapshots Root cause: when two nodes had different HF cache states for the same layer package (different snapshot commits, different model-package.json content), the cache resolution code could pick a stale snapshot with all layers present instead of the current HEAD snapshot with partial layers. This caused manifest sha256 mismatches during split Load, killing the split topology. Three changes: 1. Cache resolution now checks only the REQUESTED layer range, not all declared layers. Metadata-only probes (layer_start=layer_end=0) check for at least one layer artifact (anti-skeleton). Real stage loads check their assigned range only. 2. Removed the pre-download floating-revision fallback scan that walked all cached snapshots looking for one with layers. This was the code that picked stale snapshots with different manifests. 3. Scoped the post-download fallback scan to metadata-only probes only. Real stage loads always download their assigned layers into the HEAD snapshot, so no fallback is needed. Also adds debug-level stage control tracing in handle_stage_control for future split debugging. Validated: 480B split across Studio (stage-0, layers 0-49) and James (stage-1, layers 50-61) over relay — inference working end-to-end with divergent HF cache states on both nodes.

Adds 10 focused tests for the cache resolution functions introduced in the previous commit: - cached_snapshot_has_any_layer_artifact: skeleton rejection, partial acceptance - cached_snapshot_has_requested_layers: range checks, partial ranges, missing layers - should_prefer_cached_snapshot_for_request: dispatch to correct check based on metadata-only vs stage load

michaelneale · 2026-05-12T07:33:52Z

@ndizazzo I think this branch has diverged enough - can we do some follow up with that?

Copilot

Pull request overview

Copilot reviewed 49 out of 50 changed files in this pull request and generated 3 comments.

KV cache is a unified allocation shared across parallel lanes with eviction. The diagnostic function was multiplying by lane count, overstating memory needs in failure messages.

Copilot AI review requested due to automatic review settings May 11, 2026 01:17

Copilot started reviewing on behalf of michaelneale May 11, 2026 01:18 View session

Copilot AI reviewed May 11, 2026

View reviewed changes

michaelneale added 2 commits May 11, 2026 11:32

Copilot AI review requested due to automatic review settings May 11, 2026 01:42

Copilot started reviewing on behalf of michaelneale May 11, 2026 01:43 View session

Copilot AI reviewed May 11, 2026

View reviewed changes

Comment thread crates/mesh-llm-host-runtime/src/mesh/mod.rs Outdated

Comment thread crates/mesh-llm-host-runtime/src/runtime/mod.rs

Comment thread crates/mesh-llm-host-runtime/src/runtime/context_planning.rs Outdated

i386 and others added 2 commits May 11, 2026 12:11

Add split coordinator fencing

04986df

michaelneale changed the title ~~Split-aware context planning with KV quant negotiation~~ Split-aware context planning, universal Q8_0 KV default May 11, 2026

fix: rename test to match actual behavior (copilot review)

4fd65e0

Copilot AI review requested due to automatic review settings May 11, 2026 02:21

Copilot started reviewing on behalf of michaelneale May 11, 2026 02:21 View session

Copilot AI reviewed May 11, 2026

View reviewed changes

Comment thread crates/mesh-llm-host-runtime/src/runtime/local.rs Outdated

Comment thread crates/mesh-llm-host-runtime/src/runtime/local.rs Outdated

Comment thread crates/mesh-llm-host-runtime/src/runtime/context_planning.rs

Comment thread crates/model-artifact/src/gguf.rs Outdated

michaelneale and others added 2 commits May 11, 2026 13:06

Bound skippy OpenAI generation admission

7a00d20

Copilot AI review requested due to automatic review settings May 11, 2026 03:10

Copilot AI reviewed May 11, 2026

View reviewed changes

i386 added 2 commits May 11, 2026 13:15

Co-plan context and skippy concurrency

ba211c1

Merge branch 'micn/context-smarts' into jd/coordinator-election-fencing

99459ff

# Conflicts: # crates/mesh-llm-host-runtime/src/mesh/tests.rs

Copilot AI review requested due to automatic review settings May 11, 2026 03:27

Copilot AI reviewed May 11, 2026

View reviewed changes

i386 changed the title ~~Split-aware context planning, universal Q8_0 KV default~~ Improve context planning, concurrency, and coordinator fencing May 11, 2026

Revert "Co-plan context and skippy concurrency"

54e3089

This reverts commit ba211c1.

i386 changed the title ~~Improve context planning, concurrency, and coordinator fencing~~ Improve context planning, admission, and coordinator fencing May 11, 2026

fix: add generation queue fields to openai test fixtures

272f65d

The bounded admission fields added in 7a00d20 were not threaded into the multimodal smoke fixtures, breaking `cargo check --tests` in CI. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Copilot AI review requested due to automatic review settings May 11, 2026 03:58

Copilot AI review requested due to automatic review settings May 12, 2026 02:52

Copilot started reviewing on behalf of michaelneale May 12, 2026 02:53 View session

Copilot AI reviewed May 12, 2026

View reviewed changes

michaelneale added 2 commits May 12, 2026 13:44

fix: collapse nested if to satisfy clippy collapsible_if

2c7d0a5

Merge remote-tracking branch 'origin/main' into micn/context-smarts

c726034

* origin/main: Hide cold models from chat selector (#520) Decompose skippy-prompt main (#517) Stream layer package artifacts from HF jobs (#515)

Copilot AI review requested due to automatic review settings May 12, 2026 03:49

Copilot started reviewing on behalf of michaelneale May 12, 2026 03:50 View session

Copilot AI reviewed May 12, 2026

View reviewed changes

michaelneale added 2 commits May 12, 2026 14:24

Merge remote-tracking branch 'origin/main' into micn/context-smarts

819d554

* origin/main: fix(debug-model-loading): fix extremely long model loading times in debug builds

Copilot AI review requested due to automatic review settings May 12, 2026 05:59

Copilot started reviewing on behalf of michaelneale May 12, 2026 06:00 View session

Copilot AI reviewed May 12, 2026

View reviewed changes

ndizazzo reviewed May 12, 2026

View reviewed changes

i386 mentioned this pull request May 12, 2026

Fix PR 513 review follow-ups #526

Merged

ndizazzo mentioned this pull request May 12, 2026

Report backend GPUs and bundle GPU benchmarks #509

Merged

michaelneale added 2 commits May 12, 2026 17:22

Copilot AI review requested due to automatic review settings May 12, 2026 07:29

Copilot started reviewing on behalf of michaelneale May 12, 2026 07:30 View session

Copilot AI reviewed May 12, 2026

View reviewed changes

Comment thread crates/mesh-llm-host-runtime/src/inference/skippy/mod.rs

Comment thread crates/mesh-llm-host-runtime/src/runtime/split_planning.rs Outdated

Comment thread crates/mesh-llm-host-runtime/src/inference/skippy/stage/mod.rs

fix: split diagnostic bytes_per_layer should not multiply KV by lanes

e1cb5dd

KV cache is a unified allocation shared across parallel lanes with eviction. The diagnostic function was multiplying by lane count, overstating memory needs in failure messages.

michaelneale merged commit d136df1 into main May 12, 2026
42 of 43 checks passed

michaelneale deleted the micn/context-smarts branch May 12, 2026 08:04

Conversation

michaelneale commented May 11, 2026 • edited by i386 Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

What Changed

Architecture

User Impact

Protocol

Validation

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Uh oh!

i386 commented May 11, 2026

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Uh oh!

Uh oh!

i386 May 12, 2026

Choose a reason for hiding this comment

Uh oh!

michaelneale commented May 12, 2026

Uh oh!

i386 commented May 12, 2026

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

michaelneale May 12, 2026

Choose a reason for hiding this comment

Uh oh!

ndizazzo left a comment • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

i386 commented May 12, 2026

Uh oh!

michaelneale commented May 12, 2026

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

michaelneale commented May 11, 2026 •

edited by i386

Loading

ndizazzo left a comment •

edited

Loading