feat(llm): add --low-vram to share a GPU with another large model #662
brettdavies wants to merge 1 commit
Adds `lowVram` to `LlamaCppConfig` (also a `--low-vram` CLI flag and a `QMD_LOW_VRAM=1` env var). When enabled, the heavy generate (~2 GB) and rerank (~2.3 GB) models are disposed immediately after each use, while the tiny embed model (~320 MB) stays resident. Peak VRAM drops from ~5.4 GB to ~2.6 GB at the cost of per-stage load latency (~3 s → ~5.6 s on a typical GPU).

This makes qmd usable on GPUs where loading all three models at once exhausts free VRAM, for example when sharing a 24 GB GPU with a ~20 GB Ollama instance. Addresses the failure mode tracked in tobi#275 across all entry points that construct a `LlamaCpp` instance: `qmd query`, `qmd mcp`, and the upcoming `qmd serve`.

The pipeline stages (expand → embed → search → rerank) are inherently sequential, so disposing between them only costs reload time, not correctness.

Concurrency: when `lowVram` is on, `expandQuery` and `rerank` calls serialize through per-method promise chains so a dispose can never race with another caller's in-flight use of the same model. `embed` and `embedBatch` remain parallel. The two chains are independent: expand and rerank against their separate heavy models can run in parallel against each other.

The flag is global (`--low-vram` works on any subcommand that constructs a `LlamaCpp`), so `qmd query`, `qmd mcp`, and other one-shot commands all benefit, not just long-lived daemons. Naming follows the existing engine-knob pattern (`QMD_FORCE_CPU`, `QMD_LLAMA_GPU`).
Hi @tobi, friendly nudge on this one when you have a review cycle. Single opt-in flag with no behaviour change in the default path: The win is concrete and matches the case in the body: on a 24 GB GPU sharing space with a ~20 GB Ollama, Rebased onto current Happy to rename the flag (the body's open question on
`--low-vram`: share a GPU with another large model

Why
`qmd query` (and any other path that runs the full pipeline) loads three GGUF models into a single process: embed (~320 MB), generate (~2 GB), rerank (~2.3 GB). Once all three are resident, peak VRAM sits around 5.4 GB. On a GPU that's already shared with another model (say a 24 GB card running a ~20 GB Ollama), there isn't enough free VRAM left, and rerank context creation fails with `Failed to create any rerank context`. That's the failure mode tracked in #275 (closed during the v2.5.1 backlog cleanup; reproducible on hardware ranging from a 2 GB GTX 960M to a 6 GB RTX 3060 per reporters there, plus the 24 GB / Ollama-coexistence case below).

The pipeline stages (expand → embed → search → rerank) are inherently sequential, so the win is straightforward: keep the tiny embed model resident, dispose the heavy generate and rerank models after each use, and reload them on demand. Peak drops to ~2.6 GB. The cost is per-stage load latency.
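As a rough check on the ~2.6 GB figure, the arithmetic can be sketched from the approximate model sizes quoted above. This counts model weights only and assumes expand and rerank do not overlap, as in the sequential pipeline. The function names are hypothetical. The weights-only sum for the default mode comes out below the reported ~5.4 GB peak; the source does not break down the remainder.

```typescript
// Approximate model sizes in GB, taken from the figures quoted in this PR.
const EMBED_GB = 0.32;
const GENERATE_GB = 2.0;
const RERANK_GB = 2.3;

// Default mode: all three models are resident at once (weights only).
function weightsAllResident(): number {
  return EMBED_GB + GENERATE_GB + RERANK_GB;
}

// Low-VRAM mode: embed stays resident and at most one heavy model is loaded
// at a time (assumption: expand and rerank do not overlap).
function weightsLowVram(): number {
  return EMBED_GB + Math.max(GENERATE_GB, RERANK_GB);
}

console.log(weightsAllResident().toFixed(2)); // 4.62
console.log(weightsLowVram().toFixed(2)); // 2.62
```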
Measured on an RTX 3090 Ti running alongside Ollama Gemma 4 26B (~20.3 GB VRAM):
Configurations compared: `qmd query` (cold), `qmd query` (warm, default), `qmd query --low-vram`.

22K-file collection, hybrid query, full pipeline (expand + vec + rerank).
What this PR does
Adds `lowVram` to `LlamaCppConfig`, surfaced as a global `--low-vram` CLI flag (any subcommand) and a `QMD_LOW_VRAM=1` env var. Naming follows the existing engine-knob pattern (`QMD_FORCE_CPU`, `QMD_LLAMA_GPU`).

When enabled, `LlamaCpp`:

- Disposes the generate model in a `finally` block after each `expandQuery` call.
- Disposes the rerank model in a `finally` block after each `rerank` call.
- Serializes `expandQuery` calls through an internal promise chain so a `dispose` call can never race with another caller's in-flight use of the same model. Same for `rerank`.
- Keeps `embed` and `embedBatch` running in parallel as before: the embed model stays resident.

The two chains are independent: `expandQuery` and `rerank` against their separate heavy models can run in parallel against each other.

Because the flag is global and the constructor reads `QMD_LOW_VRAM`, it works everywhere `LlamaCpp` is constructed (`qmd query`, `qmd embed`, `qmd mcp --http --daemon`, `qmd vsearch`, etc.) without per-subcommand plumbing.

Architecture
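The serialize-then-dispose behaviour described under "What this PR does" could look roughly like the sketch below. It is not the code in `src/llm.ts`: `HeavyModel`, `SerializedModelSlot`, and `demo` are hypothetical names, and the stub stands in for a real model handle. It only illustrates a promise chain combined with a `finally` dispose.

```typescript
// Hypothetical stand-in for a heavy model handle; not the real LlamaCpp API.
interface HeavyModel {
  run(input: string): Promise<string>;
  dispose(): Promise<void>;
}

class SerializedModelSlot {
  private loading: Promise<HeavyModel> | null = null;
  // Tail of the chain; each low-VRAM call waits for the previous one to settle.
  private chain: Promise<unknown> = Promise.resolve();
  private readonly load: () => Promise<HeavyModel>;
  private readonly lowVram: boolean;

  constructor(load: () => Promise<HeavyModel>, lowVram: boolean) {
    this.load = load;
    this.lowVram = lowVram;
  }

  use(input: string): Promise<string> {
    // Default path: no serialization, no dispose.
    if (!this.lowVram) return this.runOnce(input, false);
    const result = this.chain.then(() => this.runOnce(input, true));
    // Swallow rejections on the chain so one failed call does not block later ones.
    this.chain = result.catch(() => undefined);
    return result;
  }

  private async runOnce(input: string, disposeAfter: boolean): Promise<string> {
    const pending = this.loading ?? (this.loading = this.load());
    let model: HeavyModel;
    try {
      model = await pending;
    } catch (err) {
      // Do not cache a failed load.
      if (this.loading === pending) this.loading = null;
      throw err;
    }
    try {
      return await model.run(input);
    } finally {
      if (disposeAfter) {
        this.loading = null;
        await model.dispose();
      }
    }
  }
}

// Two concurrent callers against a stub that records what happened, in order.
async function demo(): Promise<string[]> {
  const events: string[] = [];
  const load = async (): Promise<HeavyModel> => {
    events.push("load");
    return {
      run: async (input: string): Promise<string> => {
        await Promise.resolve();
        events.push(`run:${input}`);
        return input.toUpperCase();
      },
      dispose: async (): Promise<void> => {
        events.push("dispose");
      },
    };
  };
  const slot = new SerializedModelSlot(load, true);
  await Promise.all([slot.use("a"), slot.use("b")]);
  return events;
}
```

With `lowVram` on, `demo()` resolves to `load, run:a, dispose, load, run:b, dispose`: the second caller never touches a model that the first caller is about to dispose. Two separate slots (one per heavy model) would keep the chains independent of each other.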
Single commit. The diff is:
- `src/llm.ts`: `LlamaCppConfig.lowVram?: boolean` (+ doc comment), two private dispose helpers (`disposeGenerateModel`, `disposeRerankModel`), two chain fields (`generateChain`, `rerankChain`). Public `expandQuery` and `rerank` route through their chain + finally-dispose when `this.lowVram` is true and fall through to the existing impl otherwise. Existing method bodies extracted into `expandQueryImpl` / `rerankImpl`; no behavioural change in the default path.
- `src/cli/qmd.ts`: `--low-vram` boolean option added to the global parser; `process.env.QMD_LOW_VRAM = "1"` set after parsing (mirrors how `--no-gpu` sets `QMD_FORCE_CPU`). Help text + env-var docs updated.
- `test/llm-low-vram.test.ts`: new, 6 tests covering the concurrency contract without loading real models.

Tests
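For readers unfamiliar with testing a concurrency contract without real models, the in-flight counting idea can be sketched as follows. This is not the content of `test/llm-low-vram.test.ts`; `makeChain`, `makeProbe`, and `measure` are hypothetical helpers that only illustrate the technique.

```typescript
// Minimal stand-in for a per-method promise chain (hypothetical helper).
function makeChain(): <T>(task: () => Promise<T>) => Promise<T> {
  let tail: Promise<unknown> = Promise.resolve();
  return <T>(task: () => Promise<T>): Promise<T> => {
    const result = tail.then(task);
    tail = result.catch(() => undefined);
    return result;
  };
}

// Stubbed call that records the highest number of overlapping invocations.
function makeProbe(): { call: () => Promise<void>; max: () => number } {
  let inFlight = 0;
  let max = 0;
  const call = async (): Promise<void> => {
    inFlight++;
    max = Math.max(max, inFlight);
    await Promise.resolve(); // yield so other callers get a chance to overlap
    inFlight--;
  };
  return { call, max: () => max };
}

// Runs `callers` concurrent calls, either through the chain or directly,
// and reports the peak number that were in flight at the same time.
async function measure(serialized: boolean, callers: number): Promise<number> {
  const probe = makeProbe();
  const chain = makeChain();
  const calls = Array.from({ length: callers }, () =>
    serialized ? chain(probe.call) : probe.call(),
  );
  await Promise.all(calls);
  return probe.max();
}
```

Under these assumptions, `measure(true, 3)` resolves to 1 (serialized) and `measure(false, 3)` resolves to 3 (max equals caller concurrency).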
`test/llm-low-vram.test.ts` (new, 6 tests):

- `expandQuery` calls serialize correctly (the bug we're guarding against is a `dispose` racing with an in-flight call; verified by recording the event sequence on a stubbed `LlamaCpp`).
- `rerank` calls serialize correctly.
- `expandQuery` and `rerank` chains are independent: they run in parallel against each other.
- Default path (`lowVram=false`) behaves exactly as today: no serialization, no dispose. Verified by counting in-flight calls (max = caller concurrency, not 1).
- `QMD_LOW_VRAM=1` env is read by the constructor when no explicit config is given.

`tsc --noEmit` clean. Full suite passes apart from the pre-existing GPU/model-load-dependent failures that also reproduce on bare `upstream/main` in the same environment (rerank context creation under VRAM pressure, MCP hybridQuery, skill-bundle paths).

Backwards compatibility
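The opt-in resolution can be illustrated with a small sketch. The source only states that the constructor reads `QMD_LOW_VRAM=1` when no explicit config is given and that `lowVram` defaults to `false`; the rule that an explicit config value overrides the env var is an assumption here, and `resolveLowVram` and `LowVramOptions` are hypothetical names.

```typescript
// Hypothetical subset of the config; only the field discussed in this PR.
interface LowVramOptions {
  lowVram?: boolean;
}

// Assumption: an explicit config value wins; the env var is only a fallback.
function resolveLowVram(
  config: LowVramOptions | undefined,
  env: Record<string, string | undefined>,
): boolean {
  if (config?.lowVram !== undefined) return config.lowVram;
  return env["QMD_LOW_VRAM"] === "1";
}

console.log(resolveLowVram(undefined, {})); // false
console.log(resolveLowVram(undefined, { QMD_LOW_VRAM: "1" })); // true
```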
- `lowVram` code paths are only entered when `--low-vram` / `QMD_LOW_VRAM=1` is explicitly set. `LlamaCppConfig.lowVram` defaults to `false`. Existing callers see no behaviour change.
- The `LLM` interface is unchanged. `expandQuery` and `rerank` keep their signatures; the new dispose helpers are private.
- The CLI flag (`--low-vram`) and env var (`QMD_LOW_VRAM`) are additive.
- `LLMSession` wrappers (`session.expandQuery`, `session.rerank`) flow through the same public methods, so they pick up the chain automatically when `lowVram` is on; no session-level changes needed.

Things I deliberately didn't do
- `qmd doctor` could surface a recommendation in a follow-up.
- `LLMSession`. Looked at it; the session manager tracks ref-count + abort signals, not per-model lifecycle. Adding chain state there would couple two unrelated concerns. Keeping it inside `LlamaCpp` keeps the engine self-contained.

Open questions for you
- `--low-vram` describes the user-visible effect. An earlier draft called it `--sequential` (describes the mechanism); readable to people who already know the pipeline structure, less so otherwise. Went with `--low-vram` for discoverability. Open to either.
- Should `qmd doctor` recommend `--low-vram` when free VRAM is below some threshold? Easy to add but felt like a separate concern.