
feat: add cpu_offload option for low-VRAM model loading#40

Open
omnificate wants to merge 1 commit into Overworldai:main from omnificate:feat/cpu-quantize

Conversation


@omnificate omnificate commented Apr 12, 2026

When cpu_offload=True, the model is built and patched on CPU before being moved to GPU. Quantization runs on GPU after the move to ensure compatibility with all backends (FP8, INT8/GemLite, NVFP4).

This reduces peak VRAM during model initialization, making it feasible to run on systems with limited GPU memory.

Changes to WorldEngine.__init__:

  • New parameter: cpu_offload: bool = False
  • When enabled: the model is created and patched inside a torch.device('cpu') context, then moved to the target device with .to()
  • Quantization always runs on the target device (after move) for full backend compatibility
  • When disabled: zero behavioral change from the existing code path
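A minimal sketch of the load path described above, assuming PyTorch 2.x (the `build_model`, `patch_model`, and `quantize_model` names are illustrative stand-ins, not the actual WorldEngine internals):

```python
import torch
import torch.nn as nn


def build_model() -> nn.Module:
    # Stand-in for the real model constructor.
    return nn.Linear(8, 8)


def patch_model(model: nn.Module) -> None:
    # Stand-in for the patching step the PR mentions.
    pass


def quantize_model(model: nn.Module) -> None:
    # Stand-in for FP8 / INT8-GemLite / NVFP4 quantization.
    pass


def load_model(cpu_offload: bool = False, device: str = "cuda") -> nn.Module:
    if cpu_offload:
        # Build and patch inside a CPU device context, so no weights
        # touch VRAM during initialization.
        with torch.device("cpu"):
            model = build_model()
            patch_model(model)
        # Single transfer to the target device afterwards.
        model = model.to(device)
    else:
        # cpu_offload=False: unchanged existing path, everything
        # happens directly on the target device.
        with torch.device(device):
            model = build_model()
            patch_model(model)
    # Quantization always runs after the move, on the target device,
    # so every backend sees target-device tensors.
    quantize_model(model)
    return model
```

The key ordering constraint is that quantization is deliberately left out of the CPU context: it runs only after `.to()`, which is what keeps the offload path compatible with GPU-only quantization backends.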

Companion PR: Overworldai/Biome#97

omnificate pushed a commit to omnificate/Biome that referenced this pull request Apr 12, 2026
Adds a 'CPU Model Loading' checkbox in Performance settings that sends
cpu_offload in the WebSocket init message. When enabled, the world_engine
server builds the model on CPU before moving to GPU, reducing peak VRAM
during initialization. Essential for systems with limited GPU memory.

Changes:
- Top-level cpu_offload setting (default: false)
- Checkbox in Performance section with i18n (en/ja/zh/goose)
- WebSocket init message includes cpu_offload flag
- Lifecycle model key encodes cpu_offload so toggling triggers reconnect
- Mode-switch modal shown when toggling during active streaming
- Server passes cpu_offload through to WorldEngine constructor
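The server-side pass-through in the last bullet could look roughly like this. The init-message schema and the commented-out `WorldEngine` call site are assumptions for illustration; only the `cpu_offload` flag itself comes from this PR:

```python
import json


def engine_kwargs_from_init(raw: str) -> dict:
    """Extract WorldEngine kwargs from a WebSocket init message (hypothetical schema)."""
    msg = json.loads(raw)
    return {
        # Defaulting to False preserves the existing behavior when the
        # client omits the flag (e.g. an older Biome build).
        "cpu_offload": bool(msg.get("cpu_offload", False)),
    }


# engine = WorldEngine(**engine_kwargs_from_init(raw))  # hypothetical call site
```

Defaulting the flag to `False` on the server mirrors the `cpu_offload: bool = False` signature, so clients that never send the field get the unchanged code path.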

Companion PR: Overworldai/world_engine#40
@omnificate omnificate changed the title from "feat: add cpu_quantize option for low-VRAM systems" to "feat: add cpu_offload option for low-VRAM model loading" Apr 12, 2026