Enable adaptive Skippy speculation #446

Closed
i386 wants to merge 3 commits into main from jd/skippy-speculative-auto

@i386 (Collaborator) commented May 6, 2026

Summary

  • Add a new skippy-speculative crate for reusable n-gram policy, verification-span classification, repair strategy, and MTP notes.
  • Add an OpenAI server flag, --openai-ngram-auto, with prompt-shape and repeated-suffix gating, shared acceptance/cost telemetry, and adaptive repair behavior.
  • Default the embedded mesh Skippy serving preset to adaptive n-gram auto mode while keeping standalone skippy-server opt-in.
  • Add scripts/skippy-openai-ngram-bench.sh and README/Mermaid diagrams that document how n-gram, draft-model, and fallback decode are compared.
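
For readers unfamiliar with n-gram speculation, here is a minimal sketch of the general technique (often called prompt-lookup drafting): find the most recent earlier occurrence of the last n tokens and propose the tokens that followed it, then count how many proposed tokens the target model confirms. This is an illustration only. The function names, the u32 token type, and the parameters are assumptions and are not taken from the skippy-speculative crate.

```rust
/// Hypothetical helper, not the crate's API: propose draft tokens by
/// matching the last `n` tokens of `history` against earlier context.
pub fn propose_ngram(history: &[u32], n: usize, max_draft: usize) -> Vec<u32> {
    if n == 0 || max_draft == 0 || history.len() <= n {
        return Vec::new();
    }
    let suffix = &history[history.len() - n..];
    // Scan candidate start positions from most recent to oldest,
    // excluding the position of the suffix itself.
    for start in (0..history.len() - n).rev() {
        if &history[start..start + n] == suffix {
            let follow = start + n;
            let end = (follow + max_draft).min(history.len());
            return history[follow..end].to_vec();
        }
    }
    Vec::new()
}

/// Hypothetical helper: number of leading draft tokens that agree with
/// the tokens the target model produced during verification.
pub fn accepted_prefix_len(draft: &[u32], verified: &[u32]) -> usize {
    draft
        .iter()
        .zip(verified)
        .take_while(|(d, v)| d == v)
        .count()
}
```

With history [1, 2, 3, 4, 1, 2], n = 2, and max_draft = 3, the suffix [1, 2] matches at the start of the history, so the proposal is [3, 4, 1]. If the target then verifies [3, 4, 9], two tokens are accepted and the third must be repaired.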

Architecture

  • skippy-speculative owns speculation policy and telemetry-shaped decisions; skippy-server remains responsible for target verification and streaming.
  • Auto n-gram mode turns on immediately when the prompt shape is favorable, or after repeated suffix hits; otherwise it stays in cold observation to avoid verifier overhead.
  • Draft-model speculation remains explicit (opt-in) because the realistic target/draft pair tested was slower despite being functionally compatible.
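
The gating described above can be sketched as a two-state machine. This is a rough illustration under stated assumptions, not the PR's implementation: the type and method names are hypothetical, the prompt-shape check is reduced to a boolean supplied by the caller, and "repeated suffix hits" is assumed to mean a cumulative count reaching a threshold.

```rust
/// Hypothetical gate state; names are illustrative, not the crate's API.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum GateState {
    /// Observe suffix hits without paying verifier overhead.
    ColdObservation,
    /// N-gram speculation is active.
    Enabled,
}

#[derive(Debug)]
pub struct NgramAutoGate {
    state: GateState,
    suffix_hits: u32,
    hit_threshold: u32,
}

impl NgramAutoGate {
    /// Start enabled when the caller judges the prompt shape favorable,
    /// otherwise start in cold observation.
    pub fn new(prompt_is_favorable: bool, hit_threshold: u32) -> Self {
        let state = if prompt_is_favorable {
            GateState::Enabled
        } else {
            GateState::ColdObservation
        };
        Self { state, suffix_hits: 0, hit_threshold }
    }

    /// Record whether the latest suffix matched earlier context.
    /// Assumption: hits are cumulative, not required to be consecutive.
    pub fn observe(&mut self, suffix_hit: bool) -> GateState {
        if self.state == GateState::ColdObservation && suffix_hit {
            self.suffix_hits += 1;
            if self.suffix_hits >= self.hit_threshold {
                self.state = GateState::Enabled;
            }
        }
        self.state
    }

    pub fn should_speculate(&self) -> bool {
        self.state == GateState::Enabled
    }
}
```

A gate created with NgramAutoGate::new(false, 2) stays cold through misses and the first hit, then switches to Enabled on the second hit. A gate created with a favorable prompt is enabled from the start.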

Protocol

  • No mesh wire/protobuf/gossip protocol changes.
  • CLI and OpenAI server flags are additive; older mesh nodes continue to interoperate as before.

Validation

  • LLAMA_STAGE_BUILD_DIR=.deps/llama.cpp/build-stage-abi-metal cargo test -p skippy-speculative -p skippy-server --lib
  • LLAMA_STAGE_BUILD_DIR=.deps/llama.cpp/build-stage-abi-metal cargo test -p mesh-llm inference::skippy --lib
  • LLAMA_STAGE_BUILD_DIR=.deps/llama.cpp/build-stage-abi-metal cargo check -p skippy-server -p mesh-llm -p skippy-bench
  • bash -n scripts/skippy-openai-ngram-bench.sh
  • cargo fmt --all -- --check
  • git diff --check

Benchmark Evidence

  • Qwen 3 4B depth-1 mixed, 64 max tokens: ngram-auto 32.31 tok/s vs baseline 30.89 tok/s, 1.05x.
  • Qwen 3 4B depth-1 coding-warm, 64 max tokens: ngram-auto 41.19 tok/s vs baseline 33.52 tok/s, 1.23x.
  • Qwen 3 4B depth-4 stress: zero errors; mixed 1.01x and coding-warm 0.99x, so the mode is safe and throughput-neutral under concurrency rather than a throughput win.
  • Llama 3.2 3B target plus 1B draft loaded successfully, but draft-adaptive was slower, so draft stays explicit and pair-dependent.
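
The speedup figures above are the ratio of ngram-auto throughput to baseline throughput, rounded to two decimals. A small sketch of that arithmetic follows; the helper names are hypothetical and are not part of scripts/skippy-openai-ngram-bench.sh.

```rust
/// Ratio of candidate throughput to baseline throughput (both in tok/s).
/// Returns None when the inputs cannot produce a meaningful ratio.
pub fn speedup(candidate_tok_s: f64, baseline_tok_s: f64) -> Option<f64> {
    if candidate_tok_s.is_finite() && baseline_tok_s.is_finite() && baseline_tok_s > 0.0 {
        Some(candidate_tok_s / baseline_tok_s)
    } else {
        None
    }
}

/// Format a ratio in the style used in the list above, e.g. "1.05x".
pub fn format_speedup(ratio: f64) -> String {
    format!("{:.2}x", ratio)
}
```

For the depth-1 mixed run, 32.31 / 30.89 is about 1.046, which formats as 1.05x; for coding-warm, 41.19 / 33.52 is about 1.229, which formats as 1.23x.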

@i386 force-pushed the jd/skippy-speculative-auto branch from b49348a to 9aa0094, May 6, 2026 11:06
@i386 marked this pull request as ready for review, May 6, 2026 11:06
@i386 force-pushed the jd/skippy-speculative-auto branch from 9aa0094 to 252a538, May 6, 2026 11:20
@i386 force-pushed the jd/skippy-speculative-auto branch from 252a538 to dc235d2, May 6, 2026 11:44

@michaelneale (Collaborator) commented

Worth trying with 10x parameter size models?

Is there some threshold where draft helps? Would you use ngram and draft?

@github-actions (Contributor) commented

This pull request has not been updated in at least 5 days. It will be closed after 7 days of inactivity to keep the active review queue current. Please update it within 2 days if the changes are still moving forward.

@github-actions bot added the stale label, May 12, 2026
@github-actions bot removed the stale label, May 13, 2026
@ndizazzo added the stale label, May 16, 2026

@github-actions (Contributor) commented

Closing this pull request because it has not been updated in at least 7 days. Please reopen or create a fresh pull request when the changes are ready to continue.

@github-actions bot closed this, May 19, 2026