Adds Quark as Biome inference engine backend #132
Merged
…ctor
Re-applies the Apple Silicon backend swap (originally
``e7a2850`` on ``feat/quark-engine-macos``) against the refactored
``server-components/`` layout. The refactor went CUDA-only at the
device-helpers layer (``engine/devices.py``) and dropped the
``Platform`` / ``available_quants`` / ``MLX`` machinery; this commit
re-introduces just enough platform multiplexing to keep the existing
CUDA path untouched while routing Apple Silicon through ``quark.Engine``.
Three files change:
* ``server-components/engine/devices.py`` — device helpers were
pure CUDA. Detect Apple Silicon (``sys.platform == "darwin"`` +
``arm64``); on that platform:
- ``WORLD_ENGINE_DEVICE`` / ``SCENE_AUTHORING_DEVICE`` /
``SAFETY_DEVICE`` route to ``"cpu"`` so every existing
``frame.to(device=WORLD_ENGINE_DEVICE)`` call site stays a
no-op without per-call branching. Quark owns its own Metal
allocator and consumes torch tensors / numpy at the API
boundary, so torch tensors stay on CPU until they cross into
``quark.Engine``.
- ``OutOfMemoryError`` aliases to plain ``MemoryError``
(``torch.cuda.OutOfMemoryError`` doesn't exist when CUDA
isn't built into the active wheel; the ``except
devices.OutOfMemoryError`` blocks just never trigger on this
path).
- ``pynvml`` import is soft-guarded — Apple Silicon ships no
NVML and importing the package would raise. Every NVML
caller (``open_nvml_handle`` / ``driver_version_via_nvml`` /
``utilization_via_nvml``) short-circuits when
``pynvml is None``. The existing NVML-call try/except blocks
already returned sentinels on failure, so the broader
contract is unchanged.
The other helpers (``is_available`` / ``memory_allocated`` /
``synchronize`` / ``empty_cache`` / ``reset_compiled_graphs``)
already short-circuit gracefully via ``torch.cuda.is_available()``;
they remain CUDA-only and just no-op on Apple Silicon.
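A minimal sketch of the shape this describes (runnable without CUDA or NVML installed; names beyond those quoted above are illustrative):

```python
import platform
import sys

# Apple Silicon detection: darwin + arm64.
_IS_DARWIN_ARM64 = sys.platform == "darwin" and platform.machine() == "arm64"

# Soft-guarded NVML import: Apple Silicon ships no NVML and the
# import itself would raise there.
try:
    import pynvml
except ImportError:
    pynvml = None

if _IS_DARWIN_ARM64:
    # Route device constants to "cpu" so every existing
    # frame.to(device=WORLD_ENGINE_DEVICE) call site becomes a no-op.
    WORLD_ENGINE_DEVICE = "cpu"
    # torch.cuda.OutOfMemoryError doesn't exist in CPU-only wheels;
    # alias so `except devices.OutOfMemoryError` still resolves.
    OutOfMemoryError = MemoryError
else:
    WORLD_ENGINE_DEVICE = "cuda"
    # Stand-in to keep this sketch importable without torch; the real
    # module uses torch.cuda.OutOfMemoryError here.
    OutOfMemoryError = MemoryError


def driver_version_via_nvml():
    # Every NVML caller short-circuits when pynvml is None; the
    # pre-existing try/except already returned sentinels on failure,
    # so the broader contract is unchanged.
    if pynvml is None:
        return None
    try:
        pynvml.nvmlInit()
        return pynvml.nvmlSystemGetDriverVersion()
    except Exception:
        return None
```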
* ``server-components/engine/manager.py`` — module-level conditional
import of the engine class:
if _IS_DARWIN_ARM64:
from quark import CtrlInput, Engine as WorldEngine
else:
from world_engine import CtrlInput, WorldEngine
Aliasing ``quark.Engine`` to the local ``WorldEngine`` symbol means
every existing type annotation / construction site reads as before.
In ``load_engine``, branch on ``_IS_DARWIN_ARM64`` to skip the
dtype-fallback loop (quark is bf16-only on Metal — no native fp8 in
MSL, no int8 KV path today) and skip the OOM-retry loop (no CUDA
allocator pressure). The Apple branch passes
``quant="bf16"`` and ignores the client's ``requested_quant``.
TAEHV cache plumbing: read ``BIOME_TAEHV_CACHE_DIR`` from the env
and pass it as ``taehv_cache_dir=`` to ``quark.Engine(...)``. The
Electron host sets this env var when spawning the server so that
pre-built CoreML artifacts pulled from HF land inside Biome's app
data dir — uninstall / "clear cache" flows can ``rm -rf`` it
without leaving artifacts in ``~/.cache/quark/taehv/``. Unset
falls through to quark's default (``~/.cache/quark/taehv``).
Electron-side wiring is a follow-up; the server-side hook is
ready.
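The ``load_engine`` branching and cache-dir plumbing might look like this (a sketch: the factory parameter and surrounding structure are illustrative, only the kwargs and env var quoted above are from the change):

```python
import os


def load_engine(engine_cls, model_uri, requested_quant, is_darwin_arm64):
    """engine_cls stands in for the aliased WorldEngine import."""
    kwargs = {"model_uri": model_uri}
    if is_darwin_arm64:
        # quark is bf16-only on Metal (no native fp8 in MSL, no int8
        # KV path), so skip the dtype-fallback loop and ignore the
        # client's requested_quant.
        kwargs["quant"] = "bf16"
        # The Electron host sets this so CoreML artifacts land in
        # Biome's app data dir; unset falls through to quark's
        # ~/.cache/quark/taehv default.
        cache_dir = os.environ.get("BIOME_TAEHV_CACHE_DIR")
        if cache_dir is not None:
            kwargs["taehv_cache_dir"] = cache_dir
        return engine_cls(**kwargs)
    # CUDA path: unchanged dtype-fallback + OOM-retry loops (elided).
    kwargs["quant"] = requested_quant
    return engine_cls(**kwargs)
```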
* ``server-components/pyproject.toml`` — drop unconditional pins on
the CUDA-specific deps (``nvidia-ml-py``, ``bitsandbytes``,
``llama-cpp-python``, ``gguf``); mark them ``sys_platform !=
'darwin'`` so ``uv sync`` on Apple Silicon doesn't pull them. Add
``"quark[engine] ; sys_platform == 'darwin'"`` to the dependency
list and a ``[tool.uv.sources]`` pin for the
``experimental/biome-base`` integration branch on quark
(rev ``c2a87ba`` — carries the ``taehv_cache_dir`` kwarg + the
HF-fetched CoreML artifact pipeline that's needed for app-managed
cache storage). ``torch`` / ``torchvision`` already came from the
``pytorch-cu128`` index — that index is now marked ``sys_platform
!= 'darwin'`` so Apple resolves torch from PyPI's default index
(no CUDA build available there for darwin/arm64 anyway).
Add ``sys_platform == 'darwin' and platform_machine == 'arm64'``
to ``[tool.uv].environments`` so the lockfile resolves under that
marker too.
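Abridged, the marker shape looks like this (the git URL and exact table contents are assumptions beyond the quoted markers and rev):

```toml
[project]
dependencies = [
    "nvidia-ml-py ; sys_platform != 'darwin'",
    "quark[engine] ; sys_platform == 'darwin'",
]

[tool.uv]
environments = [
    "sys_platform == 'darwin' and platform_machine == 'arm64'",
    # ... existing CUDA environments
]

[tool.uv.sources]
quark = { git = "https://github.com/Overworldai/quark", rev = "c2a87ba" }
```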
CUDA path: identical to ``the-great-server-refactor`` HEAD. None of
the platform-conditional code in this commit fires when
``sys.platform != "darwin"``; every CUDA call site reaches the same
``WorldEngine`` import, the same ``WORLD_ENGINE_DEVICE = "cuda"``,
the same dtype-fallback loop, and the same OOM-retry path it had
before.
Apple Silicon path: still requires the upstream
``Overworld-Models/taehv1_5-coreml`` HF repo to be populated (the
publish flow lives at ``scripts/publish_taehv_coreml.py`` in
quark). Until that repo exists, the runtime fall-back error message
points to the publish script.
Verified: ``ast.parse`` clean on both modified Python files. Full
runtime verification needs ``uv sync`` on a Mac with the quark hash
``c2a87ba`` reachable from origin — pushing
``Overworldai/quark experimental/biome-base`` is the prereq.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ARK toggle
Generalises the Apple-Silicon-only quark integration so the same
``quark.Engine`` path runs on CUDA hosts too. quark's ``Engine.__new__``
factory dispatches to ``EngineCUDA`` (Linux / Windows / CUDA Macs) or
``EngineMetal`` (Apple Silicon), with both subclasses accepting the same
``model_uri / quant / device / dtype / taehv_cache_dir`` kwargs.
A top-level ``USE_QUARK`` constant in ``manager.py`` selects between
``quark.Engine`` and the legacy ``world_engine.WorldEngine``; hard-coded
to ``True`` for now, slated to become a runtime setting once the quark
CUDA path stabilises. Imports are aliased so the rest of the file stays
backend-agnostic.
``load_engine`` collapses the two prior branches (Apple bf16-only quark
vs. CUDA dtype-fallback world_engine) into one unified body that shares
the dtype/OOM fallback loop and the per-failure cleanup;
``taehv_cache_dir`` (a quark-only kwarg) is gated on ``USE_QUARK``.
``pyproject.toml`` drops the ``sys_platform == 'darwin'`` marker on
``quark[engine]`` so the package installs everywhere.
…_interval - gen ms from input latency
When ``cap_inference_fps=True``, the per-iteration order was
sleep(~frame_interval − gen) → read input → submit → flush_pending → wait → stash
so a frame stashed at the end of iter K-1 sat in memory through iter
K's full pacing sleep before being encoded and sent. The sleep window
was pure idle CPU time that could have been spent encoding.
Now the loop also flushes at the top, before the sleep — gated on
``cap_inference_fps`` so uncapped (benchmark) mode keeps its existing
encode-during-gen overlap. Pending is empty by the time the
post-submit flush runs, so the second call is a deliberate no-op.
End-to-end: shaves ~(frame_interval − gen_time) ms off the
client-observed ``inputLatency`` for the pacing path. On wp1.5 360p
@ 60fps that's ~37 ms (66.67 ms interval − 30 ms gen). Server
compute (INFER / SYNC / ENC / MTRC / OVER in the perf overlay) is
unchanged; the win shows up as a drop in XMIT.
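In outline (a sketch; `ops` bundles the loop's real calls so the ordering is explicit, and all names are illustrative):

```python
def paced_iteration(ops, cap_inference_fps, frame_interval, gen_time):
    """One iteration of the stream loop with the reordered flush."""
    if cap_inference_fps:
        # Flush the frame stashed at the end of the previous iteration
        # BEFORE the pacing sleep, so the sleep window is no longer
        # dead time sitting in front of the encode.
        ops.flush_pending()
        ops.sleep(max(0.0, frame_interval - gen_time))
    ops.read_input()
    ops.submit()
    # On the capped path, pending is already empty here, so this
    # second flush is a deliberate no-op; uncapped (benchmark) mode
    # keeps its encode-during-gen overlap via this call.
    ops.flush_pending()
    ops.wait()
    ops.stash()
```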
CONTRIBUTING.md asks for pyproject + uv.lock changes to land together;
the prior commit (`6fe8f36 update(pyproj): update quark ref`) updated
the pin without the regenerated lockfile, which would trip the
`uv lock --check` step in CI.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Settings whose changes need a restart used to live in three drifting hand-written lists (useEngineRespawn for process-class, useSessionInit for session/live, plus EngineTab.hasChangesRequiringRestart for the modal trigger). Adding engine_backend to the session list was missed, so toggling it mid-session saved without a reset and dropped back to the pause menu. SETTING_CLASSES in types/settings.ts is now the single source of truth, with helpers in utils/settingsClassifier consuming it. Top-level keys cover whole subtrees; dot-paths split when siblings differ (debug_overlays.action_logging is live; siblings are UI-only).
Toggling a process-class setting mid-stream raced: useEngineRespawn called stopServer() fire-and-forget then transitioned to LOADING immediately. The warm flow read stale isServerRunning=true from React state, picked attachToRunningStandalone, and threw "Server exited before becoming ready" while the doomed server finished dying. Awaiting stopServer lets the IPC "server exited" event propagate to isServerRunning=false before LOADING fires, so the warm flow correctly picks bootStandalone and starts a fresh server with the new env vars.
In offline mode uv can't reach the index to verify the lockfile and fails with "No solution found when resolving dependencies" — even though the venv is already correctly populated from the previous online sync. Setting UV_NO_SYNC=1 (which also implies --frozen) inside getOfflineEnv skips both the resolve and sync passes on uv run, so the server just exec's python in the existing venv. Online mode is untouched: uv run still auto-syncs when pyproject changes.
The connectionLost flag was only cleared in MAIN_MENU, so once set it stuck through LOADING and back into STREAMING. Hit visibly when a process-class respawn (useEngineRespawn) fires its disconnect from outside the lifecycle reducer — the brief failed-connection render inside STREAMING latched connectionLost true, and entering LOADING didn't clear it. Clearing on LOADING entry is symmetric with the existing clearEngineErrorOnLoadingEntry: any LOADING transition is "starting fresh", so prior overlays should reset. Also fixes the click-reconnect recovery path, which previously kept the overlay up until the user backed out to main menu. (Follow-up: the overlay still flashes for a frame during a respawn because the disconnect lands in a STREAMING render before LOADING transitions in. To suppress it entirely, the reducer needs to know the respawn is intentional — separate change.)
Toggling a process-class setting mid-stream briefly flashed the
"Connection Lost" overlay. The session-class reconnect path already
suppresses the overlay via `intentionalReconnectInProgress`, but
process-class respawns are driven by useEngineRespawn (outside the
reducer) and never set that flag.
Rather than adding a parallel `intentionalRespawnInProgress` boolean
plus parallel `currentProcessSig` / `lastAppliedProcess` payload
fields plus parallel detection branches, collapse all of it through a
single discriminator:
- New `RestartSignatures = { session, process }` bundle in the
classifier; `getRestartSignatures(settings)` builds both at once.
- `useSessionInit` returns `lastApplied: RestartSignatures | null`
(one piece of state instead of two).
- Payload carries `currentSignatures` + `lastAppliedSignatures`.
- Reducer state replaces `intentionalReconnectInProgress` with
`intentionalRestart: 'reconnect' | 'respawn' | null`. A new
`computeIntentionalRestart()` picks the strongest applicable
intent (process beats session) and returns null when nothing's
pending.
- Side-effect dispatch keys off the discriminator: `'reconnect'`
fires the existing effect chain, `'respawn'` is silent because
useEngineRespawn owns those side effects. Suppression checks
become `intentionalRestart !== null`.
Consistent settings behaviour
Two corrections to the Pydantic→TS codegen so the on-disk file
round-trips through both `codegen --check` and `prettier --check` on
every platform:
- `Path.write_text` defaults to translating `\n` → `os.linesep` on
Windows, so the codegen was silently writing CRLF on dev machines
while prettier expected LF; the resulting on-disk file failed prettier
even though the in-memory comparison passed (read_text translates CRLF
back). Pinning `newline="\n"` writes LF unconditionally.
- `render_enum` always emitted multi-line `z.enum([...])`, but
prettier collapses short enums onto a single line under the project's
120-char `printWidth`. Mirror that behaviour so the codegen-emitted
shape matches the formatted shape — short enums inline, long ones
(e.g. `ServerStageId`) break across lines with two-space indent. The
`_PRINT_WIDTH` constant is the gate.
Surfaced while a new short enum (`EngineBackendSchema`) landed in the
generated output and tripped both gates at once on Windows.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
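A sketch of both fixes (`render_enum` and `_PRINT_WIDTH` are quoted above; the rest of the shape is illustrative):

```python
from pathlib import Path

_PRINT_WIDTH = 120  # matches the project's prettier printWidth


def render_enum(name, values):
    # Mirror prettier: short enums inline, long ones break across
    # lines with two-space indent.
    inline = (
        f"export const {name} = z.enum(["
        + ", ".join(f'"{v}"' for v in values)
        + "]);"
    )
    if len(inline) <= _PRINT_WIDTH:
        return inline
    body = ",\n".join(f'  "{v}"' for v in values)
    return f"export const {name} = z.enum([\n{body},\n]);"


def write_generated(path: Path, src: str) -> None:
    # newline="\n" pins LF; the default translates "\n" -> os.linesep
    # on Windows, which is what tripped prettier.
    path.write_text(src, newline="\n")
```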
…os-v2
# Conflicts:
#	electron/ipc/server.ts
#	server-components/server/routes.py
#	src/components/settings/EngineTab.tsx
#	src/context/streaming/StreamingContext.tsx
#	src/context/streaming/streamingWarmConnection.ts
#	src/hooks/engine/useEngineApi.ts
#	src/hooks/streaming/useWarmConnection.ts
#	src/i18n/en.ts
#	src/i18n/goose.ts
#	src/i18n/he.ts
#	src/i18n/ja.ts
#	src/i18n/zh.ts
#	src/types/ipc.ts
…p ~frame_interval - gen ms from input latency" This reverts commit 55c71ad.
…n set
Quark only supports waypoint-1.5 (NotImplementedError on wp-1 configs:
no TAEHV VAE, different scheduler shape). The picker now hides those
rows when backend=quark, and the settings panel refuses to leave on
save if the saved model is incompatible with the in-flight backend.
Server: /api/models accepts ?backend=… and drops rows whose model_type
falls outside COMPATIBLE_MODEL_TYPES_BY_BACKEND. Each PickerModel
carries its resolved model_type so unresolvable rows (offline / HF
outage / malformed config) pass through and stay backend-agnostic
rather than silently emptying the picker. _scan_cache now returns the
cached model_type per repo; uncached collection entries get a
TTL-cached HF config.yaml fetch (~1KB) via _resolve_model_type.
Renderer: list-models IPC takes backend, threads it into the query
string. EngineTab's loader refires on backend toggle and tracks
menuWorldModelAvailable; validateBeforeSave shows a new "Incompatible
Model" confirm modal and blocks the save when the saved model fell off
the filtered list.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…etection of models
Contributor
Quick test seems to work. Will look over the code more thoroughly and run it through the gauntlet of tests, but I think we'll be shipping this tomorrow 🎉
HF ids with reserved characters (`#`, `?`, `&`, whitespace) typed via
the custom model picker would corrupt the URL. Split on `/`, encode each
segment, rejoin — preserves the slash the FastAPI `{model_id:path}`
matcher relies on.
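The fix amounts to per-segment percent-encoding; a minimal sketch (function name illustrative):

```python
from urllib.parse import quote


def encode_model_id(model_id: str) -> str:
    """Percent-encode each path segment of an HF id while preserving
    the '/' separators the FastAPI `{model_id:path}` matcher needs."""
    return "/".join(quote(seg, safe="") for seg in model_id.split("/"))
```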
With Quark in play, "which backend was active" is part of the diagnostic surface — a wp-1 + quark or intw8a8 + quark mismatch is the likeliest failure mode for early users. Thread `requested_backend` through the diagnostics payload alongside `requested_model` and `requested_quant` at all three call sites (connection-lost overlay, terminal-display loading error overlay, debug-tab copy-to-clipboard).
`useEngineRespawn` hand-checked `engine_mode`/`server_url`/`offline_mode` to skip a respawn when only `offline_mode` flipped in server mode. A future process-class field added to `SETTING_CLASSES` would be silently swallowed by that bespoke check. Pull the logic up: a new `pathsThatDiffer` helper returns the changed paths in a class, and the hook drops to "the only delta is `offline_mode` in non-standalone" against that set, so new process-class fields fall into the respawn branch by default.
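The real helper is TypeScript; in outline the logic is (a Python sketch, with `engine_mode` values assumed):

```python
def paths_that_differ(paths, old, new):
    """Return the classified dot-paths whose values differ between
    two settings snapshots."""
    def resolve(tree, path):
        node = tree
        for part in path.split("."):
            if not isinstance(node, dict):
                return None
            node = node.get(part)
        return node
    return {p for p in paths if resolve(old, p) != resolve(new, p)}


def should_respawn(process_paths, old, new, engine_mode):
    # New process-class fields fall into the respawn branch by
    # default; only the known "offline_mode flipped in non-standalone"
    # delta is skipped.
    changed = paths_that_differ(process_paths, old, new)
    if engine_mode != "standalone" and changed == {"offline_mode"}:
        return False
    return bool(changed)
```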
`OutOfMemoryError = MemoryError` swept up unrelated CPU-side memory faults on macOS, which would falsely trigger the dtype-downgrade retry in `WorldEngineManager.load_engine`. Replace with a private `_UnreachableOOM(BaseException)` sentinel — `except devices.OutOfMemoryError` still type-checks and resolves at runtime, but never matches anything on Apple Silicon (where there's no CUDA allocator pressure to surface in the first place).
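The sentinel is small enough to show in full; a sketch matching the description above:

```python
class _UnreachableOOM(BaseException):
    """Never raised. Aliased to devices.OutOfMemoryError on Apple
    Silicon so existing `except` blocks still resolve and type-check,
    while matching nothing at runtime — unlike the old
    `OutOfMemoryError = MemoryError` alias, which also caught
    unrelated CPU-side memory faults."""


OutOfMemoryError = _UnreachableOOM
```

Deriving from `BaseException` rather than `Exception` keeps even a broad `except Exception` from accidentally matching it.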
The fallback function it references doesn't exist — the real fallback is just empty dropdowns until a probe lands. Tighten the comment to describe what the code actually does.
`customLabel` was optional, so a caller that set `allowCustom={true}`
without a label would silently render a blank "Custom..." footer. Split
the props into a discriminated union: `customLabel` is required iff
`allowCustom` is true, and forbidden otherwise so accidental passes
where it's irrelevant fail at compile time.
`handleCustomModelBlur` used to persist `custom_models` immediately but leave `engine_model` for the Back-click — a mid-edit app crash between the two writes would land a custom id in the saved list without the corresponding selection. Save both fields together so the two advance as one. Other pending menu fields (backend, quant) still wait for Back; cleaning up that asymmetry is a separate refactor.
A user with `engine_backend: 'world_engine'` on a quark-only host
(Apple Silicon, or a remote that only advertises quark) would see the
wire silently clamp to a valid value on every session, but the saved
setting on disk stayed stale — so every menu open while streaming
surfaced the EngineTab snap-effect rewrite as a "settings changed,
restart?" modal even when the user touched nothing.
Introduce `useClampedSettings` as the single seam where the clamp
policy is applied: it returns the effective settings to consumers
*and* writes them back to disk on first divergence.
`buildSessionConfig` drops its internal clamp and trusts the caller to
pass effective settings; `useSessionInit` drops `serverCapabilities`
and signs `lastApplied` straight off `settings`. With one upstream
derivation feeding the wire, the lifecycle signature, and the
persisted state, the three can't drift and re-introduce the
post-clamp-save race-into-reconnect bug.
Reuses the raw settings reference when the clamp is a no-op so
downstream `useEffect` deps don't churn on every render that just
touches `useSettings()`.
Treats the standalone server as a guaranteed-available resource for
the duration of the app session rather than a per-stream process.
Settings menu, model picker, capability probe etc. all depend on a
live `/health`; making the server come and go forced every consumer to
handle a death they couldn't recover from.
- `EngineLifecycle.restartServer` becomes the only verb that touches
the process. Atomic kill+spawn — no public stop. Refreshes engine
status post-spawn so consumers see the new port and `isServerRunning`
immediately.
- `useConnectionActions` drops the stop calls from `cancelConnection`
and `prepareReturnToMainMenu`. User-facing teardown leaves the server
alive.
- `useEngineRespawn` calls `restartServer` for genuine process-class
changes (offline_mode, server_url in standalone), and skips it for
`engine_mode` flips because the lifecycle's own orchestration effect
already handles that — avoids a redundant double-spawn race.
- `useLoadingFailureCleanup` collapses to just `runWarmConnection`.
Server cleanup on failed load happens server-side in
`_unload_engine_sync`, so the client only needs to re-establish the
WS; the engine-error overlay persists across the reconnect.
- New lifecycle reconciliation effect detects "state.kind=ready but
isServerRunning=false" and auto-fires `restartServer`. Covers external
crashes (Python OOM, user pkill) that pre-existing flows couldn't
recover from. Status is re-polled on MAIN_MENU entry so the
reconciliation has fresh data when the user is most likely to next
touch settings.
- Bootstrap bails while engineError is set; session-class settings
changes clear engineError so a retry can fire against the still-warm
WS.
- `clearEngineErrorOnLoadingEntry` reducer effect dropped: every
explicit-action path (cancelConnection, useEngineRespawn) already
clears engineError, and the recovery path now deliberately preserves
it across the LOADING re-entry.
`_resolve_model_type` fetches `config.yaml` on demand for collection entries via `hf_hub_download`, which materialises a repo directory with only that file. `_scan_cache` then sees the repo, computes `total = 0` (no `.safetensors`), and registers it in `cached_sizes` anyway — so the picker reports the model as `is_local: true, size_bytes: 0`. Skip the registration when no weight files exist; the picker falls back to the HF size lookup and renders the row as downloadable.
With Quark in play, "World Engine" the generic-concept is ambiguous
against "World Engine" the specific backend. The renderer's surfaces
collapse to "Engine":
- `WorldEngineSection` component → `EngineSection`, file renamed too.
- i18n keys `app.settings.worldEngine.*` → `app.settings.engine.*`.
- `DEFAULT_WORLD_ENGINE_MODEL` → `DEFAULT_ENGINE_MODEL`.
- Comments referring to "WorldEngine server" → "engine server".
Settings-page copy also resolves a redundancy (every section was
prefixed with "Engine" inside the Engine tab) and tightens the
question-style subtitles:
- "Engine Mode" → "Mode"; subtitle now points at the engine
("where will the engine run?") rather than the model.
- "Engine" status section → "Local Engine"; subtitle becomes a
question ("how's the engine doing?") whose answer is the
status dot beside it.
- "Simulation" subtitle moves from "how will your world be
simulated?" to "what should simulate your world?" so it umbrellas
both the World Model dropdown and the Backend dropdown.
The literal backend name "World Engine" stays — that refers to the
specific upstream package, not the generic concept. Server-side
Python (`WorldEngineManager`, `WORLD_ENGINE_DEVICE`, etc.) is
unchanged: both backends implement a WorldEngine-style interface
and the manager's name reflects that contract.
Two near-simultaneous `/api/models` calls (e.g. EngineTab's
snap-effect re-firing the loader once the capability probe lands) both
missed the TTL cache and both fired the underlying HF requests in
parallel — one collection fetch and one `model_info` per model,
doubled.
Add `get_or_fetch(key, fetcher)` that tracks in-flight `Future`s per
key. Concurrent misses share a single fetch; the cache is populated
exactly once and every coalesced waiter resolves off the same result.
Failures propagate to all waiters and aren't cached, so callers can
soft-fall a transient outage without pinning the failure for the full
TTL.
Refactor `_get_size`, `_get_model_type`, `_fetch_waypoint_ids`, and
`get_model_info` onto the new primitive. For the two that have
fallback-without-caching semantics (waypoint collection, transient HF
errors on `model-info`), let the fetcher raise and catch outside
`get_or_fetch` so the cache stays empty for the retry.
Also tightens `model_type_cache` typing from
`TtlCache[str, str | None]` to `TtlCache[str, str]` — None was never
stored, the sentinel string was. The old annotation was misleading and
would have masked a real None value if one ever slipped through.
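A sketch of the primitive under those semantics (the locking scheme is an assumption; the real implementation builds on `TtlCache`, whose expiry is omitted here):

```python
import threading
from concurrent.futures import Future


class CoalescingCache:
    """get_or_fetch(key, fetcher): concurrent misses on the same key
    share one in-flight fetch; failures propagate to every waiter and
    are not cached, so the next call retries."""

    def __init__(self):
        self._lock = threading.Lock()
        self._values = {}
        self._inflight = {}

    def get_or_fetch(self, key, fetcher):
        with self._lock:
            if key in self._values:
                return self._values[key]
            fut = self._inflight.get(key)
            if fut is None:
                # First miss for this key: become the leader.
                fut = Future()
                self._inflight[key] = fut
                leader = True
            else:
                leader = False
        if not leader:
            # Coalesced waiter: block on the leader's fetch; a failure
            # re-raises here without being cached.
            return fut.result()
        try:
            value = fetcher()
        except BaseException as exc:
            fut.set_exception(exc)
            with self._lock:
                del self._inflight[key]  # failure isn't cached
            raise
        with self._lock:
            self._values[key] = value  # populated exactly once
            del self._inflight[key]
        fut.set_result(value)
        return value
```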
Two blank lines before the `_UnreachableOOM` class definition. Missed by the original commit since the relevant style check only kicked in once the file had passed through `ruff format`.
Fixes #125. Wires up quark as a backend dependency, with selectable UI option.