feat(llm): add `--low-vram` to share a GPU with another large model by brettdavies · Pull Request #662 · tobi/qmd

brettdavies · 2026-05-19T23:10:09Z

`--low-vram`: share a GPU with another large model

Why

`qmd query` (and any other path that runs the full pipeline) loads three GGUF models into a single process: embed (~320 MB), generate (~2 GB), rerank (~2.3 GB). Once all three are resident, peak VRAM sits around 5.4 GB. On a GPU that's already shared with another model (say a 24 GB card running a ~20 GB Ollama model), there isn't enough free VRAM left, and rerank context creation fails with `Failed to create any rerank context`. That's the failure mode tracked in #275 (closed during the v2.5.1 backlog cleanup; reproducible on hardware ranging from a 2 GB GTX 960M to a 6 GB RTX 3060 per reporters there, plus the 24 GB / Ollama-coexistence case below).

The pipeline stages (expand → embed → search → rerank) are inherently sequential, so the win is straightforward: keep the tiny embed model resident, dispose the heavy generate and rerank models after each use, and reload them on demand. Peak drops to ~2.6 GB. The cost is per-stage load latency.
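As a rough back-of-the-envelope check, the peak under sequential disposal is the resident model plus the largest heavy model, rather than the sum of all three. The sketch below uses the approximate model sizes quoted above (weights only, in MB) and assumes a single query running its stages in order. The function names are illustrative and not part of qmd. Real peaks also include context and scratch buffers, which is presumably why the measured default peak is higher than the weight-only sum.

```typescript
// Approximate weight sizes from the PR description, in MB (weights only).
const EMBED_MB = 320;
const GENERATE_MB = 2000;
const RERANK_MB = 2300;

/** Default mode: every model stays resident, so the sizes add up. */
function peakAllResident(residentMb: number, heavyMb: number[]): number {
  return heavyMb.reduce((sum, mb) => sum + mb, residentMb);
}

/** Low-VRAM mode, single pipeline: only one heavy model is loaded at a time. */
function peakOneHeavyAtATime(residentMb: number, heavyMb: number[]): number {
  return residentMb + Math.max(0, ...heavyMb);
}

const allResidentMb = peakAllResident(EMBED_MB, [GENERATE_MB, RERANK_MB]);
const lowVramMb = peakOneHeavyAtATime(EMBED_MB, [GENERATE_MB, RERANK_MB]);
```

With these inputs the weight-only estimate for low-VRAM mode is 2620 MB, which lines up with the ~2.6 GB figure above.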

Measured on an RTX 3090 Ti running alongside Ollama Gemma 4 26B (~20.3 GB VRAM):

| Mode | Query time | Peak qmd VRAM | Coexists with 20 GB Ollama? |
| --- | --- | --- | --- |
| Bare `qmd query` (cold) | 33 s | ~3 GB (churning) | partially; CPU offload thrash + occasional rerank failures |
| `qmd query` (warm, default) | 3.0 s | 5.4 GB | no (25.7 GB total > 24 GB card) |
| `qmd query --low-vram` | 5.6 s | 2.6 GB | yes (22.9 GB total, headroom) |

22K-file collection, hybrid query, full pipeline (expand + vec + rerank).
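The last column of the table is simple subtraction, which can be made explicit. This is a minimal sketch using the measured figures above converted to MB; `headroomMb` is a hypothetical helper, not something qmd provides.

```typescript
/** Free VRAM left on the card, in MB. A negative value means the workload does not fit. */
function headroomMb(cardMb: number, otherModelMb: number, qmdPeakMb: number): number {
  return cardMb - otherModelMb - qmdPeakMb;
}

const CARD_MB = 24000;   // 24 GB card
const OLLAMA_MB = 20300; // ~20.3 GB used by the Ollama model

// Measured qmd peaks from the table: 5.4 GB (default) and 2.6 GB (--low-vram).
const defaultHeadroomMb = headroomMb(CARD_MB, OLLAMA_MB, 5400);
const lowVramHeadroomMb = headroomMb(CARD_MB, OLLAMA_MB, 2600);
```

The default mode comes out negative (over budget), while the low-VRAM mode leaves roughly a gigabyte free.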

What this PR does

Adds lowVram to LlamaCppConfig, surfaced as a global --low-vram CLI flag (any subcommand) and a QMD_LOW_VRAM=1 env var. Naming follows the existing engine-knob pattern (QMD_FORCE_CPU, QMD_LLAMA_GPU).

When enabled, LlamaCpp:

Disposes the generate model in a finally block after each expandQuery call.
Disposes the rerank model + ranking contexts in a finally block after each rerank call.
Serializes overlapping expandQuery calls through an internal promise chain so a dispose call can never race with another caller's in-flight use of the same model. Same for rerank.
Leaves embed and embedBatch running in parallel as before: the embed model stays resident.

The two chains are independent: expandQuery and rerank against their separate heavy models can run in parallel against each other.
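The chain-plus-finally pattern described above can be sketched in isolation. This is not the code from `src/llm.ts`: `HeavyModelSlot`, `Model`, and the `events` log are hypothetical stand-ins so the idea runs without loading a real model.

```typescript
// Hypothetical stand-in for a loaded model handle; not qmd's real type.
type Model = { name: string; disposed: boolean };

class HeavyModelSlot {
  private model: Model | null = null;
  private chain: Promise<unknown> = Promise.resolve();
  private readonly name: string;
  private readonly lowVram: boolean;
  readonly events: string[] = [];

  constructor(name: string, lowVram: boolean) {
    this.name = name;
    this.lowVram = lowVram;
  }

  // Lazily "loads" the model; a real engine would read GGUF weights here.
  private async load(): Promise<Model> {
    if (this.model === null) {
      this.events.push(`load ${this.name}`);
      this.model = { name: this.name, disposed: false };
    }
    return this.model;
  }

  private dispose(): void {
    if (this.model !== null) {
      this.model.disposed = true;
      this.model = null;
      this.events.push(`dispose ${this.name}`);
    }
  }

  use<T>(work: (model: Model) => Promise<T>): Promise<T> {
    const run = async (): Promise<T> => work(await this.load());
    // Default path: no serialization, the model stays resident.
    if (!this.lowVram) return run();

    // Low-VRAM path: queue behind earlier callers, dispose when done.
    const result = this.chain.then(async () => {
      try {
        return await run();
      } finally {
        this.dispose();
      }
    });
    // Keep the chain alive after a failure so the next caller is not blocked.
    this.chain = result.catch(() => undefined);
    return result;
  }
}
```

Two slots (one for generate, one for rerank) would each own their own chain, which is what lets the two heavy models proceed independently while calls against the same model queue up.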

Because the flag is global and the constructor reads QMD_LOW_VRAM, it works everywhere LlamaCpp is constructed (qmd query, qmd embed, qmd mcp --http --daemon, qmd vsearch, etc.) without per-subcommand plumbing.

Architecture

29073bc  feat(llm): add low-vram mode that loads one heavy model at a time

Single commit. The diff is:

src/llm.ts: adds LlamaCppConfig.lowVram?: boolean (with a doc comment), two private dispose helpers (disposeGenerateModel, disposeRerankModel), and two chain fields (generateChain, rerankChain). The public expandQuery and rerank route through their chain and dispose in a finally block when this.lowVram is true, and fall through to the existing implementation otherwise. The existing method bodies are extracted into expandQueryImpl / rerankImpl; no behavioural change in the default path.
src/cli/qmd.ts: --low-vram boolean option added to the global parser; process.env.QMD_LOW_VRAM = "1" set after parsing (mirrors how --no-gpu sets QMD_FORCE_CPU). Help text + env-var docs updated.
test/llm-low-vram.test.ts: new, 6 tests covering the concurrency contract without loading real models.
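The flag-to-env-to-config flow described above might look roughly like the following. This is a sketch under stated assumptions: `resolveLowVram` and `applyCliFlag` are hypothetical names, the environment is passed in as a plain object rather than read from `process.env`, and treating only the exact value `"1"` as enabled is an assumption, since the PR text only mentions `QMD_LOW_VRAM=1`.

```typescript
interface LowVramConfig {
  lowVram?: boolean;
}

type Env = Record<string, string | undefined>;

/** An explicit config value wins; otherwise fall back to the env var. */
function resolveLowVram(config: LowVramConfig | undefined, env: Env): boolean {
  if (config?.lowVram !== undefined) return config.lowVram;
  // Assumption: only the literal "1" turns the mode on.
  return env["QMD_LOW_VRAM"] === "1";
}

/** CLI side: a global --low-vram flag is translated into the env var after parsing. */
function applyCliFlag(argv: string[], env: Env): void {
  if (argv.includes("--low-vram")) env["QMD_LOW_VRAM"] = "1";
}

// Usage: the flag is set once at the CLI layer, then any later construction sees it.
const env: Env = {};
applyCliFlag(["query", "--low-vram", "some search"], env);
const enabled = resolveLowVram(undefined, env);
```

Routing the flag through the environment is what removes the need for per-subcommand plumbing: every place that constructs the engine resolves the same value.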

Tests

test/llm-low-vram.test.ts (new, 6 tests):

Overlapping expandQuery calls serialize correctly (the bug we're guarding against is a dispose racing with an in-flight call; verified by recording the event sequence on a stubbed LlamaCpp).
Overlapping rerank calls serialize correctly.
expandQuery and rerank chains are independent: they run in parallel against each other.
A failing call still releases the chain, so the next caller doesn't deadlock.
Default (lowVram=false) behaves exactly as today: no serialization, no dispose. Verified by counting in-flight calls (max = caller concurrency, not 1).
QMD_LOW_VRAM=1 env is read by the constructor when no explicit config is given.
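The in-flight counting technique mentioned above can be illustrated without any model or test framework. The sketch below is not taken from `test/llm-low-vram.test.ts`; `makeTracker` and `serialized` are hypothetical helpers that show how a maximum in-flight count distinguishes serialized from unserialized calls.

```typescript
/** Records how many calls are running at once and remembers the maximum. */
function makeTracker() {
  let inFlight = 0;
  let max = 0;
  return {
    async run(): Promise<void> {
      inFlight++;
      max = Math.max(max, inFlight);
      await Promise.resolve(); // yield, so overlapping callers can interleave
      inFlight--;
    },
    max: (): number => max,
  };
}

/** Wraps fn so that calls run one at a time, even if an earlier call fails. */
function serialized<T>(fn: () => Promise<T>): () => Promise<T> {
  let chain: Promise<unknown> = Promise.resolve();
  return () => {
    const result = chain.then(fn);
    chain = result.catch(() => undefined);
    return result;
  };
}

/** Fires `callers` overlapping calls and reports the peak concurrency observed. */
async function measureMaxInFlight(callers: number, serialize: boolean): Promise<number> {
  const tracker = makeTracker();
  const call = serialize ? serialized(() => tracker.run()) : () => tracker.run();
  await Promise.all(Array.from({ length: callers }, () => call()));
  return tracker.max();
}
```

With three overlapping callers, the unserialized path should peak at the caller concurrency and the serialized path at one, mirroring the default and low-VRAM expectations listed above.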

tsc --noEmit clean. Full suite passes apart from the pre-existing GPU/model-load-dependent failures that also reproduce on bare upstream/main in the same environment (rerank context creation under VRAM pressure, MCP hybridQuery, skill-bundle paths).

Backwards compatibility

Behaviour with no flag set is bit-for-bit unchanged — the lowVram code paths are only entered when --low-vram / QMD_LOW_VRAM=1 is explicitly set.
LlamaCppConfig.lowVram defaults to false. Existing callers see no behaviour change.
The public LLM interface is unchanged. expandQuery and rerank keep their signatures; the new dispose helpers are private.
New flag (--low-vram) and env var (QMD_LOW_VRAM) are additive.
The LLMSession wrappers (session.expandQuery, session.rerank) flow through the same public methods, so they pick up the chain automatically when lowVram is on; no session-level changes needed.

Things I deliberately didn't do

Don't dispose the embed model. It's small (~320 MB) and hot because embed is the most-called method. Disposing it would thrash any indexing or search loop.
Don't auto-detect "should I enable lowVram?". Picking the threshold (free VRAM headroom, model-size sum) is policy I'd rather let the user set explicitly than guess. qmd doctor could surface a recommendation in a follow-up.
Don't move serialization into LLMSession. Looked at it; the session manager tracks ref-count + abort signals, not per-model lifecycle. Adding chain state there would couple two unrelated concerns. Keeping it inside LlamaCpp keeps the engine self-contained.
No new context-size or batch-size tuning. Out of scope.

Open questions for you

Naming: --low-vram describes the user-visible effect. An earlier draft called it --sequential (describes the mechanism); readable to people who already know the pipeline structure, less so otherwise. Went with --low-vram for discoverability. Open to either.
Default behaviour: should qmd doctor recommend --low-vram when free VRAM is below some threshold? Easy to add but felt like a separate concern.

Adds `lowVram` to `LlamaCppConfig` (also exposed as the `--low-vram` CLI flag and the `QMD_LOW_VRAM=1` env var). When enabled, the heavy generate (~2 GB) and rerank (~2.3 GB) models are disposed immediately after each use, while the tiny embed model (~320 MB) stays resident. Peak VRAM drops from ~5.4 GB to ~2.6 GB at the cost of per-stage load latency (query time ~3 s → ~5.6 s on an RTX 3090 Ti).

This makes qmd usable on GPUs where loading all three models at once exhausts free VRAM, for example when sharing a 24 GB GPU with a ~20 GB Ollama instance. Addresses the failure mode tracked in tobi#275 across all entry points that construct a LlamaCpp instance: `qmd query`, `qmd mcp`, and the upcoming `qmd serve`. The pipeline stages (expand → embed → search → rerank) are inherently sequential, so disposing between them only costs reload time, not correctness.

Concurrency: when lowVram is on, expandQuery and rerank calls serialize through per-method promise chains so a dispose can never race with another caller's in-flight use of the same model. embed and embedBatch remain parallel. The two chains are independent: expand and rerank against their separate heavy models can run in parallel against each other.

The flag is global (`--low-vram` works on any subcommand that constructs a LlamaCpp), so qmd query, qmd mcp, and other one-shot commands all benefit, not just long-lived daemons. Naming follows the existing engine-knob pattern (QMD_FORCE_CPU, QMD_LLAMA_GPU).

brettdavies · 2026-05-26T04:26:09Z

Hi @tobi, friendly nudge on this one when you have a review cycle.

Single opt-in flag with no behaviour change in the default path: LlamaCppConfig.lowVram defaults to false, so existing callers, indexes, and the qmd query / qmd embed / qmd mcp paths run identically when the flag isn't set. The lowVram code paths are only entered when --low-vram / QMD_LOW_VRAM=1 is explicitly passed.

The win is concrete and matches the case in the body: on a 24 GB GPU sharing space with a ~20 GB Ollama, qmd query currently fails the rerank stage with Failed to create any rerank context. With --low-vram, peak VRAM drops from 5.4 GB to 2.6 GB and the full pipeline runs (3.0s → 5.6s; table in the body has measurements). Same fix applies to the smaller-card scenarios reported on #275.

Rebased onto current main (commit 29073bc); tsc --noEmit clean; six new concurrency tests in test/llm-low-vram.test.ts pass against a stubbed LlamaCpp so they run without a GPU and cover the dispose-mid-call race the chain is meant to prevent. The same engine flag underpins the qmd serve --low-vram passthrough in #663, but this PR stands alone.

Happy to rename the flag (the body's open question on --low-vram vs --sequential) or adjust scope.

brettdavies mentioned this pull request May 19, 2026

feat(serve): qmd serve — shared model server, continuing #511 #663

Open

brettdavies marked this pull request as ready for review May 20, 2026 21:40

brettdavies force-pushed the feat/llama-low-vram branch from c9d6c27 to 29073bc May 26, 2026 03:39

brettdavies changed the title ~~feat(llm): --low-vram mode for memory-constrained GPUs~~ feat(llm): add --low-vram to share a GPU with another large model May 26, 2026

brettdavies wants to merge 1 commit into tobi:main from brettdavies:feat/llama-low-vram
