feat(llm): add --low-vram to share a GPU with another large model#662

Open
brettdavies wants to merge 1 commit into tobi:main from brettdavies:feat/llama-low-vram

@brettdavies brettdavies commented May 19, 2026

--low-vram: share a GPU with another large model

Why

qmd query (and any other path that runs the full pipeline) loads three GGUF models into a single process: embed (~320 MB), generate (~2 GB), and rerank (~2.3 GB). Once all three are resident, peak VRAM sits around 5.4 GB. On a GPU that is already shared with another model (say, a 24 GB card running a ~20 GB Ollama model), there isn't enough free VRAM left, and rerank context creation fails with "Failed to create any rerank context". That's the failure mode tracked in #275 (closed during the v2.5.1 backlog cleanup; reporters there reproduced it on hardware ranging from a 2 GB GTX 960M to a 6 GB RTX 3060, in addition to the 24 GB / Ollama-coexistence case below).

The pipeline stages (expand → embed → search → rerank) are inherently sequential, so the win is straightforward: keep the tiny embed model resident, dispose the heavy generate and rerank models after each use, and reload them on demand. Peak VRAM drops to ~2.6 GB. The cost is per-stage load latency.

Measured on an RTX 3090 Ti running alongside Ollama Gemma 4 26B (~20.3 GB VRAM):

| Mode | Query time | Peak qmd VRAM | Coexists with 20 GB Ollama? |
| --- | --- | --- | --- |
| Bare qmd query (cold) | 33 s | ~3 GB (churning) | partially; CPU offload thrash + occasional rerank failures |
| qmd query (warm, default) | 3.0 s | 5.4 GB | no (25.7 GB total > 24 GB card) |
| qmd query --low-vram | 5.6 s | 2.6 GB | yes (22.9 GB total, headroom) |

22K-file collection, hybrid query, full pipeline (expand + vec + rerank).
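
The coexistence column is plain budget arithmetic. A minimal sketch, assuming the other model and qmd are the only meaningful VRAM consumers on the card (the helper name is illustrative and not part of qmd):

```typescript
// Illustrative helper (not part of qmd): does qmd's peak fit next to another model?
function fitsAlongside(cardGb: number, otherModelGb: number, qmdPeakGb: number): boolean {
  // Assumes nothing else on the GPU holds a meaningful amount of VRAM.
  return otherModelGb + qmdPeakGb <= cardGb;
}

// Figures taken from the measurements above: 24 GB card, ~20.3 GB Ollama.
const defaultFits = fitsAlongside(24, 20.3, 5.4); // measured peak, all three models resident
const lowVramFits = fitsAlongside(24, 20.3, 2.6); // measured peak with --low-vram
console.log({ defaultFits, lowVramFits });
```

With these figures the default mode does not fit and the low-VRAM mode does, which matches the last column of the table.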

What this PR does

Adds lowVram to LlamaCppConfig, surfaced as a global --low-vram CLI flag (any subcommand) and a QMD_LOW_VRAM=1 env var. Naming follows the existing engine-knob pattern (QMD_FORCE_CPU, QMD_LLAMA_GPU).

When enabled, LlamaCpp:

  1. Disposes the generate model in a finally block after each expandQuery call.
  2. Disposes the rerank model + ranking contexts in a finally block after each rerank call.
  3. Serializes overlapping expandQuery calls through an internal promise chain so a dispose call can never race with another caller's in-flight use of the same model. Same for rerank.
  4. Leaves embed and embedBatch running in parallel as before: the embed model stays resident.

The two chains are independent: expandQuery and rerank against their separate heavy models can run in parallel against each other.
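
The serialize-then-dispose behaviour described above can be sketched roughly as follows. This is a simplified stand-in, not the code in src/llm.ts: SerialChain and StubEngine are hypothetical names, and the stub records events instead of loading a model. Only expandQuery, expandQueryImpl, disposeGenerateModel, generateChain, and lowVram are names taken from the PR.

```typescript
// Hypothetical stand-in for the PR's chain fields; not the actual src/llm.ts code.
class SerialChain {
  // Tail of the queue. It never rejects, so one failed call cannot block later callers.
  private tail: Promise<void> = Promise.resolve();

  run<T>(task: () => Promise<T>, dispose: () => Promise<void>): Promise<T> {
    const result = this.tail.then(async () => {
      try {
        return await task();
      } finally {
        await dispose(); // free the heavy model before the next caller starts
      }
    });
    this.tail = result.then(() => undefined, () => undefined);
    return result;
  }
}

// Stub engine that records events instead of loading real models.
class StubEngine {
  readonly events: string[] = [];
  private readonly lowVram: boolean;
  private readonly generateChain = new SerialChain();

  constructor(lowVram: boolean) {
    this.lowVram = lowVram;
  }

  expandQuery(query: string): Promise<string[]> {
    if (!this.lowVram) return this.expandQueryImpl(query); // default path, no chain
    return this.generateChain.run(
      () => this.expandQueryImpl(query),
      () => this.disposeGenerateModel(),
    );
  }

  private async expandQueryImpl(query: string): Promise<string[]> {
    this.events.push(`start:${query}`);
    await Promise.resolve(); // stands in for model load + inference
    this.events.push(`end:${query}`);
    return [query];
  }

  private async disposeGenerateModel(): Promise<void> {
    this.events.push("dispose");
  }
}
```

Two overlapping expandQuery calls on a StubEngine built with lowVram = true record start, end, dispose for the first call before the second call starts. A rerank path would use a second, separate SerialChain, which is what lets the two chains proceed independently.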

Because the flag is global and the constructor reads QMD_LOW_VRAM, it works everywhere LlamaCpp is constructed (qmd query, qmd embed, qmd mcp --http --daemon, qmd vsearch, etc.) without per-subcommand plumbing.
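
A rough sketch of how a constructor could resolve the setting. The precedence (explicit config first, env var as fallback) follows the test description further down; the interface name, the function name, and treating only "1" as enabled are assumptions made for illustration.

```typescript
// Hypothetical shape; the real LlamaCppConfig has more fields than shown here.
interface LowVramConfig {
  lowVram?: boolean;
}

// Assumed precedence: an explicit config value wins, the env var is the fallback.
function resolveLowVram(
  config: LowVramConfig,
  env: Record<string, string | undefined>,
): boolean {
  if (config.lowVram !== undefined) return config.lowVram;
  return env["QMD_LOW_VRAM"] === "1"; // only "1" is treated as on in this sketch
}
```

In a Node process the second argument would be process.env; passing it in explicitly keeps the function easy to test.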

Architecture

29073bc  feat(llm): add low-vram mode that loads one heavy model at a time

Single commit. The diff is:

  • src/llm.ts: adds LlamaCppConfig.lowVram?: boolean (with doc comment), two private dispose helpers (disposeGenerateModel, disposeRerankModel), and two chain fields (generateChain, rerankChain). The public expandQuery and rerank methods route through their chain with a finally-dispose when this.lowVram is true, and fall through to the existing implementation otherwise. The existing method bodies are extracted into expandQueryImpl / rerankImpl; no behavioural change in the default path.
  • src/cli/qmd.ts: --low-vram boolean option added to the global parser; process.env.QMD_LOW_VRAM = "1" set after parsing (mirrors how --no-gpu sets QMD_FORCE_CPU). Help text + env-var docs updated.
  • test/llm-low-vram.test.ts: new, 6 tests covering the concurrency contract without loading real models.
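
The flag-to-env-var hand-off in src/cli/qmd.ts can be pictured like this. It is an illustrative sketch only: the real global parser is not shown here, and applyLowVramFlag is a hypothetical name.

```typescript
// Illustrative only: the real parser in src/cli/qmd.ts is not reproduced here.
function applyLowVramFlag(
  argv: string[],
  env: Record<string, string | undefined>,
): string[] {
  const rest = argv.filter((arg) => arg !== "--low-vram");
  if (rest.length !== argv.length) {
    env["QMD_LOW_VRAM"] = "1"; // picked up later wherever the engine is constructed
  }
  return rest; // remaining args go on to subcommand parsing
}
```

Setting the env var rather than threading a parameter through each subcommand is what lets every code path that constructs the engine see the flag.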

Tests

test/llm-low-vram.test.ts (new, 6 tests):

  • Overlapping expandQuery calls serialize correctly (the bug we're guarding against is a dispose racing with an in-flight call; verified by recording the event sequence on a stubbed LlamaCpp).
  • Overlapping rerank calls serialize correctly.
  • expandQuery and rerank chains are independent: they run in parallel against each other.
  • A failing call still releases the chain, so the next caller doesn't deadlock.
  • Default (lowVram=false) behaves exactly as today: no serialization, no dispose. Verified by counting in-flight calls (max = caller concurrency, not 1).
  • QMD_LOW_VRAM=1 env is read by the constructor when no explicit config is given.
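
Two of these checks (max in-flight count, and a failure releasing the chain) can be sketched without any model or test framework. The helpers below are hypothetical and only mirror the idea; they are not the contents of test/llm-low-vram.test.ts.

```typescript
// Hypothetical minimal chain, standing in for generateChain / rerankChain.
function makeChain() {
  let tail: Promise<void> = Promise.resolve();
  return function run<T>(task: () => Promise<T>): Promise<T> {
    const result = tail.then(task);
    tail = result.then(() => undefined, () => undefined); // never rejects
    return result;
  };
}

// Runs `callers` overlapping tasks and reports the highest number in flight at once.
async function maxInFlight(callers: number, serialized: boolean): Promise<number> {
  const run = makeChain();
  let inFlight = 0;
  let peak = 0;
  const task = async (): Promise<void> => {
    inFlight++;
    peak = Math.max(peak, inFlight);
    await Promise.resolve(); // yield so overlapping callers can interleave
    inFlight--;
  };
  const calls = Array.from({ length: callers }, () => (serialized ? run(task) : task()));
  await Promise.all(calls);
  return peak;
}

// A rejected task must not wedge the chain for the next caller.
async function failureReleasesChain(): Promise<boolean> {
  const run = makeChain();
  const failed = run(async () => {
    throw new Error("boom");
  }).then(() => false, () => true);
  const next = run(async () => "ok");
  return (await failed) && (await next) === "ok";
}
```

With three callers, the serialized variant peaks at one in-flight task and the unserialized variant peaks at three, which is the distinction the default-path test relies on.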

tsc --noEmit clean. Full suite passes apart from the pre-existing GPU/model-load-dependent failures that also reproduce on bare upstream/main in the same environment (rerank context creation under VRAM pressure, MCP hybridQuery, skill-bundle paths).

Backwards compatibility

  • Behaviour with no flag set is bit-for-bit unchanged — the lowVram code paths are only entered when --low-vram / QMD_LOW_VRAM=1 is explicitly set.
  • LlamaCppConfig.lowVram defaults to false. Existing callers see no behaviour change.
  • The public LLM interface is unchanged. expandQuery and rerank keep their signatures; the new dispose helpers are private.
  • New flag (--low-vram) and env var (QMD_LOW_VRAM) are additive.
  • The LLMSession wrappers (session.expandQuery, session.rerank) flow through the same public methods, so they pick up the chain automatically when lowVram is on; no session-level changes needed.

Things I deliberately didn't do

  • Don't dispose the embed model. It's small (~320 MB) and hot because embed is the most-called method. Disposing it would thrash any indexing or search loop.
  • Don't auto-detect "should I enable lowVram?". Picking the threshold (free VRAM headroom, model-size sum) is policy I'd rather let the user set explicitly than guess. qmd doctor could surface a recommendation in a follow-up.
  • Don't move serialization into LLMSession. Looked at it; the session manager tracks ref-count + abort signals, not per-model lifecycle. Adding chain state there would couple two unrelated concerns. Keeping it inside LlamaCpp keeps the engine self-contained.
  • No new context-size or batch-size tuning. Out of scope.

Open questions for you

  1. Naming: --low-vram describes the user-visible effect. An earlier draft called it --sequential (describes the mechanism); readable to people who already know the pipeline structure, less so otherwise. Went with --low-vram for discoverability. Open to either.
  2. Default behaviour: should qmd doctor recommend --low-vram when free VRAM is below some threshold? Easy to add but felt like a separate concern.

Adds `lowVram` to `LlamaCppConfig` (also `--low-vram` CLI flag and
`QMD_LOW_VRAM=1` env). When enabled, the heavy generate (~2 GB) and
rerank (~2.3 GB) models are disposed immediately after each use,
while the tiny embed model (~320 MB) stays resident. Peak VRAM drops
from ~5.4 GB to ~2.6 GB at the cost of per-stage load latency
(~3 s → ~5.6 s as measured on an RTX 3090 Ti).

This makes qmd usable on GPUs where loading all three models at once
exhausts free VRAM, for example when sharing a 24 GB GPU with a ~20 GB
Ollama instance. Addresses the failure mode tracked in tobi#275 across
all entry points that construct a LlamaCpp instance: `qmd query`,
`qmd mcp`, and the upcoming `qmd serve`.

The pipeline stages (expand → embed → search → rerank) are inherently
sequential, so disposing between them only costs reload time, not
correctness.

Concurrency: when lowVram is on, expandQuery and rerank calls
serialize through per-method promise chains so a dispose can never
race with another caller's in-flight use of the same model. embed
and embedBatch remain parallel. The two chains are independent —
expand and rerank against their separate heavy models can run in
parallel against each other.

The flag is global (`--low-vram` works on any subcommand that
constructs a LlamaCpp), so qmd query, qmd mcp, and other one-shot
commands all benefit, not just long-lived daemons. Naming follows
the existing engine-knob pattern (QMD_FORCE_CPU, QMD_LLAMA_GPU).
@brettdavies brettdavies force-pushed the feat/llama-low-vram branch from c9d6c27 to 29073bc Compare May 26, 2026 03:39
@brettdavies brettdavies changed the title feat(llm): --low-vram mode for memory-constrained GPUs feat(llm): add --low-vram to share a GPU with another large model May 26, 2026
@brettdavies
Author

Hi @tobi, friendly nudge on this one when you have a review cycle.

Single opt-in flag with no behaviour change in the default path: LlamaCppConfig.lowVram defaults to false, so existing callers, indexes, and the qmd query / qmd embed / qmd mcp paths run identically when the flag isn't set. The lowVram code paths are only entered when --low-vram / QMD_LOW_VRAM=1 is explicitly passed.

The win is concrete and matches the case in the body: on a 24 GB GPU sharing space with a ~20 GB Ollama model, qmd query currently fails the rerank stage with "Failed to create any rerank context". With --low-vram, peak VRAM drops from 5.4 GB to 2.6 GB and the full pipeline runs (3.0 s → 5.6 s; the table in the body has the measurements). The same fix applies to the smaller-card scenarios reported in #275.

Rebased onto current main (commit 29073bc); tsc --noEmit clean; six new concurrency tests in test/llm-low-vram.test.ts pass against a stubbed LlamaCpp so they run without a GPU and cover the dispose-mid-call race the chain is meant to prevent. The same engine flag underpins the qmd serve --low-vram passthrough in #663, but this PR stands alone.

Happy to rename the flag (the body's open question on --low-vram vs --sequential) or adjust scope.
