Adds Quark as Biome inference engine backend #132

Merged
philpax merged 47 commits into main from feat/quark-engine-macos-v2 on May 13, 2026

Conversation

Clydingus (Collaborator) commented May 7, 2026

Fixes #125. Wires up quark as a backend dependency, with a selectable UI option.

Clydingus and others added 8 commits May 6, 2026 23:31
…ctor

Re-applies the Apple Silicon backend swap (originally
``e7a2850`` on ``feat/quark-engine-macos``) against the refactored
``server-components/`` layout. The refactor went CUDA-only at the
device-helpers layer (``engine/devices.py``) and dropped the
``Platform`` / ``available_quants`` / ``MLX`` machinery; this commit
re-introduces just enough platform multiplexing to keep the existing
CUDA path untouched while routing Apple Silicon through ``quark.Engine``.

Three files change:

* ``server-components/engine/devices.py`` — device helpers were
  pure CUDA. Detect Apple Silicon (``sys.platform == "darwin"`` +
  ``arm64``); on that platform:
    - ``WORLD_ENGINE_DEVICE`` / ``SCENE_AUTHORING_DEVICE`` /
      ``SAFETY_DEVICE`` route to ``"cpu"`` so every existing
      ``frame.to(device=WORLD_ENGINE_DEVICE)`` call site stays a
      no-op without per-call branching. Quark owns its own Metal
      allocator and consumes torch tensors / numpy at the API
      boundary, so torch tensors stay on CPU until they cross into
      ``quark.Engine``.
    - ``OutOfMemoryError`` aliases to plain ``MemoryError``
      (``torch.cuda.OutOfMemoryError`` doesn't exist when CUDA
      isn't built into the active wheel; the ``except
      devices.OutOfMemoryError`` blocks just never trigger on this
      path).
    - ``pynvml`` import is soft-guarded — Apple Silicon ships no
      NVML and importing the package would raise. Every NVML
      caller (``open_nvml_handle`` / ``driver_version_via_nvml`` /
      ``utilization_via_nvml``) short-circuits when
      ``pynvml is None``. The existing NVML-call try/except blocks
      already returned sentinels on failure, so the broader
      contract is unchanged.

  The other helpers (``is_available`` / ``memory_allocated`` /
  ``synchronize`` / ``empty_cache`` / ``reset_compiled_graphs``)
  already short-circuit gracefully via ``torch.cuda.is_available()``;
  they remain CUDA-only and just no-op on Apple Silicon.
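
  A minimal sketch of the guard described above (an illustration, not
  the literal ``devices.py`` diff; the CUDA branch keeps the values it
  had before the refactor):

      import platform
      import sys

      import torch

      _IS_DARWIN_ARM64 = sys.platform == "darwin" and platform.machine() == "arm64"

      if _IS_DARWIN_ARM64:
          # torch tensors stay on CPU; quark owns the Metal allocator.
          WORLD_ENGINE_DEVICE = SCENE_AUTHORING_DEVICE = SAFETY_DEVICE = "cpu"
          # torch.cuda.OutOfMemoryError may be absent from a CUDA-less wheel.
          OutOfMemoryError = MemoryError
          try:
              import pynvml  # no NVML on Apple Silicon; callers check for None
          except Exception:
              pynvml = None
      else:
          WORLD_ENGINE_DEVICE = SCENE_AUTHORING_DEVICE = SAFETY_DEVICE = "cuda"
          OutOfMemoryError = torch.cuda.OutOfMemoryError
          import pynvml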

* ``server-components/engine/manager.py`` — module-level conditional
  import of the engine class:

      if _IS_DARWIN_ARM64:
          from quark import CtrlInput, Engine as WorldEngine
      else:
          from world_engine import CtrlInput, WorldEngine

  Aliasing ``quark.Engine`` to the local ``WorldEngine`` symbol means
  every existing type annotation / construction site reads as before.
  In ``load_engine``, branch on ``_IS_DARWIN_ARM64`` to skip the
  dtype-fallback loop (quark is bf16-only on Metal — no native fp8 in
  MSL, no int8 KV path today) and skip the OOM-retry loop (no CUDA
  allocator pressure). The Apple branch passes
  ``quant="bf16"`` and ignores the client's ``requested_quant``.

  TAEHV cache plumbing: read ``BIOME_TAEHV_CACHE_DIR`` from the env
  and pass it as ``taehv_cache_dir=`` to ``quark.Engine(...)``. The
  Electron host sets this env var when spawning the server so that
  pre-built CoreML artifacts pulled from HF land inside Biome's app
  data dir — uninstall / "clear cache" flows can ``rm -rf`` it
  without leaving artifacts in ``~/.cache/quark/taehv/``. Unset
  falls through to quark's default (``~/.cache/quark/taehv``).
  Electron-side wiring is a follow-up; the server-side hook is
  ready.
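
  A rough sketch of the server-side hook (``build_quark_kwargs`` is a
  hypothetical helper; only the env var, ``quant="bf16"``, and the
  ``taehv_cache_dir`` kwarg come from this commit):

      import os

      def build_quark_kwargs(model_uri: str) -> dict:
          # bf16 only on Metal; the client's requested_quant is ignored here.
          kwargs = {"model_uri": model_uri, "quant": "bf16"}
          # Electron sets BIOME_TAEHV_CACHE_DIR when spawning the server so
          # CoreML artifacts land in Biome's app data dir; unset falls through
          # to quark's default ~/.cache/quark/taehv.
          cache_dir = os.environ.get("BIOME_TAEHV_CACHE_DIR")
          if cache_dir:
              kwargs["taehv_cache_dir"] = cache_dir
          return kwargs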

* ``server-components/pyproject.toml`` — drop unconditional pins on
  the CUDA-specific deps (``nvidia-ml-py``, ``bitsandbytes``,
  ``llama-cpp-python``, ``gguf``); mark them ``sys_platform !=
  'darwin'`` so ``uv sync`` on Apple Silicon doesn't pull them. Add
  ``"quark[engine] ; sys_platform == 'darwin'"`` to the dependency
  list and a ``[tool.uv.sources]`` pin for the
  ``experimental/biome-base`` integration branch on quark
  (rev ``c2a87ba`` — carries the ``taehv_cache_dir`` kwarg + the
  HF-fetched CoreML artifact pipeline that's needed for app-managed
  cache storage). ``torch`` / ``torchvision`` already came from the
  ``pytorch-cu128`` index — that index is now marked ``sys_platform
  != 'darwin'`` so Apple resolves torch from PyPI's default index
  (no CUDA build available there for darwin/arm64 anyway).

  Add ``sys_platform == 'darwin' and platform_machine == 'arm64'``
  to ``[tool.uv].environments`` so the lockfile resolves under that
  marker too.

CUDA path: identical to ``the-great-server-refactor`` HEAD. None of
the platform-conditional code in this commit fires when
``sys.platform != "darwin"``; every CUDA call site reaches the same
``WorldEngine`` import, the same ``WORLD_ENGINE_DEVICE = "cuda"``,
the same dtype-fallback loop, and the same OOM-retry path it had
before.

Apple Silicon path: still requires the upstream
``Overworld-Models/taehv1_5-coreml`` HF repo to be populated (the
publish flow lives at ``scripts/publish_taehv_coreml.py`` in
quark). Until that repo exists, the runtime fall-back error message
points to the publish script.

Verified: ``ast.parse`` clean on both modified Python files. Full
runtime verification needs ``uv sync`` on a Mac with the quark hash
``c2a87ba`` reachable from origin — pushing
``Overworldai/quark experimental/biome-base`` is the prereq.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…ARK toggle

Generalises the Apple-Silicon-only quark integration so the same
``quark.Engine`` path runs on CUDA hosts too. quark's
``Engine.__new__`` factory dispatches to ``EngineCUDA`` (Linux /
Windows / CUDA Macs) or ``EngineMetal`` (Apple Silicon), with both
subclasses accepting the same ``model_uri / quant / device / dtype /
taehv_cache_dir`` kwargs.

A top-level ``USE_QUARK`` constant in ``manager.py`` selects between
``quark.Engine`` and the legacy ``world_engine.WorldEngine``; hard-
coded to ``True`` for now, slated to become a runtime setting once
the quark CUDA path stabilises. Imports are aliased so the rest of
the file stays backend-agnostic. ``load_engine`` collapses the two
prior branches (Apple bf16-only quark vs. CUDA dtype-fallback
world_engine) into one unified body that shares the dtype/OOM
fallback loop and the per-failure cleanup; ``taehv_cache_dir`` (a
quark-only kwarg) is gated on ``USE_QUARK``.
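
A compressed, hypothetical sketch of that unified body (the candidate
list, helper names, and control flow are illustrative; only
``USE_QUARK``, the aliased import, ``devices.OutOfMemoryError``,
``devices.empty_cache``, and the ``taehv_cache_dir`` gating come from
the text above):

    from engine import devices  # illustrative import path

    USE_QUARK = True  # hard-coded for now; runtime setting later

    if USE_QUARK:
        from quark import CtrlInput, Engine as WorldEngine
    else:
        from world_engine import CtrlInput, WorldEngine

    def load_engine(model_uri: str, quant_candidates: list[str],
                    taehv_cache_dir: str | None = None):
        last_exc = None
        for quant in quant_candidates:            # shared dtype/OOM fallback loop
            kwargs = {"model_uri": model_uri, "quant": quant}
            if USE_QUARK and taehv_cache_dir:
                kwargs["taehv_cache_dir"] = taehv_cache_dir  # quark-only kwarg
            try:
                return WorldEngine(**kwargs)
            except devices.OutOfMemoryError as exc:
                last_exc = exc
                devices.empty_cache()             # per-failure cleanup; no-op off CUDA
        raise RuntimeError("no quant candidate could be loaded") from last_exc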

``pyproject.toml`` drops the ``sys_platform == 'darwin'`` marker on
``quark[engine]`` so the package installs everywhere.

Brings in PR #128's StreamingContext hook split + src/ reorganisation
so the quark engine integration on this branch lands on top of the new
context structure. Pure-frontend merge (all PR #128 changes are under
src/); no overlap with the server-components/ Python touched by this
branch.

…_interval - gen ms from input latency

When ``cap_inference_fps=True``, the per-iteration order was

    sleep(~frame_interval − gen)  →  read input  →  submit  →  flush_pending  →  wait  →  stash

so a frame stashed at the end of iter K-1 sat in memory through iter
K's full pacing sleep before being encoded and sent. The sleep window
was pure idle CPU time that could have been spent encoding.

Now the loop also flushes at the top, before the sleep — gated on
``cap_inference_fps`` so uncapped (benchmark) mode keeps its existing
encode-during-gen overlap. Pending is empty by the time the
post-submit flush runs, so the second call is a deliberate no-op.

End-to-end: shaves ~(frame_interval − gen_time) ms off the
client-observed ``inputLatency`` for the pacing path. On wp1.5 360p
@ 60fps that's ~37 ms (66.67 ms interval − 30 ms gen). Server
compute (INFER / SYNC / ENC / MTRC / OVER in the perf overlay) is
unchanged; the win shows up as a drop in XMIT.
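
A schematic of the reordered iteration, written as a self-contained
helper (the callables stand in for the real loop's internals, which
this commit message doesn't name):

    import time

    def stream_loop(running, read_input, submit, flush_pending,
                    wait_for_generation, stash_frame,
                    frame_interval: float, cap_inference_fps: bool) -> None:
        gen_time = 0.0
        while running():
            if cap_inference_fps:
                flush_pending()   # top-of-loop flush: encode during what was idle sleep
                time.sleep(max(0.0, frame_interval - gen_time))
            submit(read_input())
            flush_pending()       # pending already drained when capped: deliberate no-op
            gen_time = wait_for_generation()
            stash_frame()
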
Clydingus requested a review from philpax on May 7, 2026 at 18:48
Clydingus and others added 21 commits May 8, 2026 02:50

CONTRIBUTING.md asks for pyproject + uv.lock changes to land
together; the prior commit (`6fe8f36 update(pyproj): update
quark ref`) updated the pin without the regenerated lockfile,
which would trip the `uv lock --check` step in CI.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Settings whose changes need a restart used to live in three drifting
hand-written lists (useEngineRespawn for process-class, useSessionInit
for session/live, plus EngineTab.hasChangesRequiringRestart for the
modal trigger). Adding engine_backend to the session list was missed,
so toggling it mid-session saved without a reset and dropped back to
the pause menu.

SETTING_CLASSES in types/settings.ts is now the single source of
truth, with helpers in utils/settingsClassifier consuming it.
Top-level keys cover whole subtrees; dot-paths split when siblings
differ (debug_overlays.action_logging is live; siblings are UI-only).

Toggling a process-class setting mid-stream raced: useEngineRespawn
called stopServer() fire-and-forget then transitioned to LOADING
immediately. The warm flow read stale isServerRunning=true from React
state, picked attachToRunningStandalone, and threw "Server exited
before becoming ready" while the doomed server finished dying.

Awaiting stopServer lets the IPC "server exited" event propagate to
isServerRunning=false before LOADING fires, so the warm flow correctly
picks bootStandalone and starts a fresh server with the new env vars.

In offline mode uv can't reach the index to verify the lockfile and
fails with "No solution found when resolving dependencies" — even
though the venv is already correctly populated from the previous
online sync.

Setting UV_NO_SYNC=1 (which also implies --frozen) inside getOfflineEnv
skips both the resolve and sync passes on uv run, so the server just
exec's python in the existing venv. Online mode is untouched: uv run
still auto-syncs when pyproject changes.

The connectionLost flag was only cleared in MAIN_MENU, so once set it
stuck through LOADING and back into STREAMING. Hit visibly when a
process-class respawn (useEngineRespawn) fires its disconnect from
outside the lifecycle reducer — the brief failed-connection render
inside STREAMING latched connectionLost true, and entering LOADING
didn't clear it.

Clearing on LOADING entry is symmetric with the existing
clearEngineErrorOnLoadingEntry: any LOADING transition is "starting
fresh", so prior overlays should reset. Also fixes the click-reconnect
recovery path, which previously kept the overlay up until the user
backed out to main menu.

(Follow-up: the overlay still flashes for a frame during a respawn
because the disconnect lands in a STREAMING render before LOADING
transitions in. To suppress it entirely, the reducer needs to know the
respawn is intentional — separate change.)

Toggling a process-class setting mid-stream briefly flashed the
"Connection Lost" overlay. The session-class reconnect path already
suppresses the overlay via `intentionalReconnectInProgress`, but
process-class respawns are driven by useEngineRespawn (outside the
reducer) and never set that flag.

Rather than adding a parallel `intentionalRespawnInProgress` boolean
plus parallel `currentProcessSig` / `lastAppliedProcess` payload
fields plus parallel detection branches, collapse all of it through a
single discriminator:

- New `RestartSignatures = { session, process }` bundle in the
  classifier; `getRestartSignatures(settings)` builds both at once.
- `useSessionInit` returns `lastApplied: RestartSignatures | null`
  (one piece of state instead of two).
- Payload carries `currentSignatures` + `lastAppliedSignatures`.
- Reducer state replaces `intentionalReconnectInProgress` with
  `intentionalRestart: 'reconnect' | 'respawn' | null`. A new
  `computeIntentionalRestart()` picks the strongest applicable
  intent (process beats session) and returns null when nothing's
  pending.
- Side-effect dispatch keys off the discriminator: `'reconnect'`
  fires the existing effect chain, `'respawn'` is silent because
  useEngineRespawn owns those side effects. Suppression checks
  become `intentionalRestart !== null`.

Two corrections to the Pydantic→TS codegen so the on-disk file
round-trips through both `codegen --check` and `prettier --check`
on every platform:

- `Path.write_text` defaults to translating `\n` → `os.linesep` on
  Windows, so the codegen was silently writing CRLF on dev machines
  while prettier expected LF; the resulting on-disk file failed
  prettier even though the in-memory comparison passed (read_text
  translates CRLF back). Pinning `newline="\n"` writes LF
  unconditionally.

- `render_enum` always emitted multi-line `z.enum([...])`, but
  prettier collapses short enums onto a single line under the
  project's 120-char `printWidth`. Mirror that behaviour so the
  codegen-emitted shape matches the formatted shape — short enums
  inline, long ones (e.g. `ServerStageId`) break across lines with
  two-space indent. The `_PRINT_WIDTH` constant is the gate.

Surfaced while a new short enum (`EngineBackendSchema`) landed in
the generated output and tripped both gates at once on Windows.
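
A minimal sketch of the newline pin (the surrounding codegen names are
illustrative; only the `newline="\n"` argument is the fix described
above):

    from pathlib import Path

    def write_generated(path: Path, content: str) -> None:
        # newline="\n" disables the default universal-newlines translation,
        # so the emitted file is LF even on Windows and round-trips through
        # both `codegen --check` and `prettier --check`.
        path.write_text(content, encoding="utf-8", newline="\n")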

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…os-v2

# Conflicts:
#	electron/ipc/server.ts
#	server-components/server/routes.py
#	src/components/settings/EngineTab.tsx
#	src/context/streaming/StreamingContext.tsx
#	src/context/streaming/streamingWarmConnection.ts
#	src/hooks/engine/useEngineApi.ts
#	src/hooks/streaming/useWarmConnection.ts
#	src/i18n/en.ts
#	src/i18n/goose.ts
#	src/i18n/he.ts
#	src/i18n/ja.ts
#	src/i18n/zh.ts
#	src/types/ipc.ts

…p ~frame_interval - gen ms from input latency"

This reverts commit 55c71ad.

…n set

Quark only supports waypoint-1.5 (NotImplementedError on wp-1 configs:
no TAEHV VAE, different scheduler shape). The picker now hides those
rows when backend=quark, and the settings panel refuses to leave on
save if the saved model is incompatible with the in-flight backend.

Server: /api/models accepts ?backend=… and drops rows whose model_type
falls outside COMPATIBLE_MODEL_TYPES_BY_BACKEND. Each PickerModel
carries its resolved model_type so unresolvable rows (offline / HF
outage / malformed config) pass through and stay backend-agnostic
rather than silently emptying the picker. _scan_cache now returns the
cached model_type per repo; uncached collection entries get a
TTL-cached HF config.yaml fetch (~1KB) via _resolve_model_type.
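
A hypothetical sketch of the server-side filter (the mapping values
and row shape are illustrative; only COMPATIBLE_MODEL_TYPES_BY_BACKEND
and the pass-through for unresolved model_type come from this commit):

    COMPATIBLE_MODEL_TYPES_BY_BACKEND = {
        "quark": {"waypoint-1.5"},                       # wp-1 raises NotImplementedError in quark
        "world_engine": {"waypoint-1", "waypoint-1.5"},
    }

    def filter_models(rows: list[dict], backend: str | None) -> list[dict]:
        allowed = COMPATIBLE_MODEL_TYPES_BY_BACKEND.get(backend or "")
        if allowed is None:
            return rows  # no ?backend= (or an unknown one): leave the list untouched
        # Rows whose model_type couldn't be resolved (offline, HF outage,
        # malformed config) pass through rather than emptying the picker.
        return [r for r in rows if r.get("model_type") is None or r["model_type"] in allowed]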

Renderer: list-models IPC takes backend, threads it into the query
string. EngineTab's loader refires on backend toggle and tracks
menuWorldModelAvailable; validateBeforeSave shows a new "Incompatible
Model" confirm modal and blocks the save when the saved model fell
off the filtered list.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

philpax (Contributor) commented May 12, 2026

Quick test seems to work. Will look over the code more thoroughly and run it through the gauntlet of tests, but I think we'll be shipping this tomorrow 🎉

philpax added 13 commits May 13, 2026 18:17

HF ids with reserved characters (`#`, `?`, `&`, whitespace) typed via
the custom model picker would corrupt the URL. Split on `/`, encode each
segment, rejoin — preserves the slash the FastAPI `{model_id:path}`
matcher relies on.
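
A sketch of the per-segment encoding, shown in Python for illustration
(the actual change lives in the renderer's picker code):

    from urllib.parse import quote

    def encode_model_id(model_id: str) -> str:
        # Encode each segment so reserved characters (#, ?, &, whitespace)
        # survive, while keeping the literal "/" that the FastAPI
        # `{model_id:path}` matcher relies on.
        return "/".join(quote(segment, safe="") for segment in model_id.split("/"))

    # encode_model_id("org/my model?v2") == "org/my%20model%3Fv2"
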
With Quark in play, "which backend was active" is part of the
diagnostic surface — a wp-1 + quark or intw8a8 + quark mismatch is the
likeliest failure mode for early users. Thread `requested_backend`
through the diagnostics payload alongside `requested_model` and
`requested_quant` at all three call sites (connection-lost overlay,
terminal-display loading error overlay, debug-tab copy-to-clipboard).

`useEngineRespawn` hand-checked `engine_mode`/`server_url`/`offline_mode`
to skip a respawn when only `offline_mode` flipped in server mode. A
future process-class field added to `SETTING_CLASSES` would be silently
swallowed by that bespoke check. Pull the logic up: a new `pathsThatDiffer`
helper returns the changed paths in a class, and the hook drops to
"the only delta is `offline_mode` in non-standalone" against that set,
so new process-class fields fall into the respawn branch by default.

`OutOfMemoryError = MemoryError` swept up unrelated CPU-side memory
faults on macOS, which would falsely trigger the dtype-downgrade retry
in `WorldEngineManager.load_engine`. Replace with a private
`_UnreachableOOM(BaseException)` sentinel — `except devices.OutOfMemoryError`
still type-checks and resolves at runtime, but never matches anything
on Apple Silicon (where there's no CUDA allocator pressure to surface
in the first place).
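
A minimal sketch of the sentinel (the guard variable and the CUDA-side
assignment are surrounding context, not part of this commit's diff):

    import platform
    import sys

    import torch

    _IS_DARWIN_ARM64 = sys.platform == "darwin" and platform.machine() == "arm64"

    class _UnreachableOOM(BaseException):
        """Never raised. Keeps `except devices.OutOfMemoryError` valid on Apple
        Silicon without also catching ordinary CPU-side MemoryError."""

    OutOfMemoryError = _UnreachableOOM if _IS_DARWIN_ARM64 else torch.cuda.OutOfMemoryError
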
The fallback function it references doesn't exist — the real fallback
is just empty dropdowns until a probe lands. Tighten the comment to
describe what the code actually does.

`customLabel` was optional, so a caller that set `allowCustom={true}`
without a label would silently render a blank "Custom..." footer. Split
the props into a discriminated union: `customLabel` is required iff
`allowCustom` is true, and forbidden otherwise so accidental passes
where it's irrelevant fail at compile time.

`handleCustomModelBlur` used to persist `custom_models` immediately
but leave `engine_model` for the Back-click — a mid-edit app crash
between the two writes would land a custom id in the saved list
without the corresponding selection. Save both fields together so
the two advance as one. Other pending menu fields (backend, quant)
still wait for Back; cleaning up that asymmetry is a separate
refactor.

A user with `engine_backend: 'world_engine'` on a quark-only host
(Apple Silicon, or a remote that only advertises quark) would see the
wire silently clamp to a valid value on every session, but the saved
setting on disk stayed stale — so every menu open while streaming
surfaced the EngineTab snap-effect rewrite as a "settings changed,
restart?" modal even when the user touched nothing.

Introduce `useClampedSettings` as the single seam where the clamp
policy is applied: it returns the effective settings to consumers
*and* writes them back to disk on first divergence. `buildSessionConfig`
drops its internal clamp and trusts the caller to pass effective
settings; `useSessionInit` drops `serverCapabilities` and signs
`lastApplied` straight off `settings`. With one upstream derivation
feeding the wire, the lifecycle signature, and the persisted state,
the three can't drift and re-introduce the post-clamp-save
race-into-reconnect bug.

Reuses the raw settings reference when the clamp is a no-op so
downstream `useEffect` deps don't churn on every render that just
touches `useSettings()`.

Treats the standalone server as a guaranteed-available resource for
the duration of the app session rather than a per-stream process.
Settings menu, model picker, capability probe etc. all depend on a
live `/health`; making the server come and go forced every consumer
to handle a death they couldn't recover from.

- `EngineLifecycle.restartServer` becomes the only verb that touches
  the process. Atomic kill+spawn — no public stop. Refreshes engine
  status post-spawn so consumers see the new port and `isServerRunning`
  immediately.
- `useConnectionActions` drops the stop calls from `cancelConnection`
  and `prepareReturnToMainMenu`. User-facing teardown leaves the
  server alive.
- `useEngineRespawn` calls `restartServer` for genuine process-class
  changes (offline_mode, server_url in standalone), and skips it for
  `engine_mode` flips because the lifecycle's own orchestration effect
  already handles that — avoids a redundant double-spawn race.
- `useLoadingFailureCleanup` collapses to just `runWarmConnection`.
  Server cleanup on failed load happens server-side in
  `_unload_engine_sync`, so the client only needs to re-establish the
  WS; the engine-error overlay persists across the reconnect.
- New lifecycle reconciliation effect detects "state.kind=ready but
  isServerRunning=false" and auto-fires `restartServer`. Covers
  external crashes (Python OOM, user pkill) that pre-existing flows
  couldn't recover from. Status is re-polled on MAIN_MENU entry so the
  reconciliation has fresh data when the user is most likely to next
  touch settings.
- Bootstrap bails while engineError is set; session-class settings
  changes clear engineError so a retry can fire against the still-warm
  WS.
- `clearEngineErrorOnLoadingEntry` reducer effect dropped: every
  explicit-action path (cancelConnection, useEngineRespawn) already
  clears engineError, and the recovery path now deliberately
  preserves it across the LOADING re-entry.

`_resolve_model_type` fetches `config.yaml` on demand for collection
entries via `hf_hub_download`, which materialises a repo directory
with only that file. `_scan_cache` then sees the repo, computes
`total = 0` (no `.safetensors`), and registers it in `cached_sizes`
anyway — so the picker reports the model as `is_local: true,
size_bytes: 0`. Skip the registration when no weight files exist;
the picker falls back to the HF size lookup and renders the row
as downloadable.
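
A small, hypothetical sketch of the guard (names are placeholders for
`_scan_cache`'s internals):

    def register_cached_repo(repo_id: str, weight_file_sizes: list[int],
                             cached_sizes: dict[str, int]) -> None:
        total = sum(weight_file_sizes)
        # A config.yaml-only repo (materialised by _resolve_model_type) has no
        # .safetensors files; leaving it unregistered lets the picker fall back
        # to the HF size lookup and render the row as downloadable.
        if total > 0:
            cached_sizes[repo_id] = total
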
With Quark in play, "World Engine" the generic-concept is ambiguous
against "World Engine" the specific backend. The renderer's surfaces
collapse to "Engine":

- `WorldEngineSection` component → `EngineSection`, file renamed too.
- i18n keys `app.settings.worldEngine.*` → `app.settings.engine.*`.
- `DEFAULT_WORLD_ENGINE_MODEL` → `DEFAULT_ENGINE_MODEL`.
- Comments referring to "WorldEngine server" → "engine server".

Settings-page copy also resolves a redundancy (every section was
prefixed with "Engine" inside the Engine tab) and tightens the
question-style subtitles:

- "Engine Mode" → "Mode"; subtitle now points at the engine
  ("where will the engine run?") rather than the model.
- "Engine" status section → "Local Engine"; subtitle becomes a
  question ("how's the engine doing?") whose answer is the
  status dot beside it.
- "Simulation" subtitle moves from "how will your world be
  simulated?" to "what should simulate your world?" so it umbrellas
  both the World Model dropdown and the Backend dropdown.

The literal backend name "World Engine" stays — that refers to the
specific upstream package, not the generic concept. Server-side
Python (`WorldEngineManager`, `WORLD_ENGINE_DEVICE`, etc.) is
unchanged: both backends implement a WorldEngine-style interface
and the manager's name reflects that contract.

Two near-simultaneous `/api/models` calls (e.g. EngineTab's snap-effect
re-firing the loader once the capability probe lands) both missed the
TTL cache and both fired the underlying HF requests in parallel — one
collection fetch and one `model_info` per model, doubled.

Add `get_or_fetch(key, fetcher)` that tracks in-flight `Future`s per key.
Concurrent misses share a single fetch; the cache is populated exactly
once and every coalesced waiter resolves off the same result. Failures
propagate to all waiters and aren't cached, so callers can soft-fall a
transient outage without pinning the failure for the full TTL.

Refactor `_get_size`, `_get_model_type`, `_fetch_waypoint_ids`, and
`get_model_info` onto the new primitive. For the two that have
fallback-without-caching semantics (waypoint collection, transient
HF errors on `model-info`), let the fetcher raise and catch outside
`get_or_fetch` so the cache stays empty for the retry.

Also tightens `model_type_cache` typing from `TtlCache[str, str | None]`
to `TtlCache[str, str]` — None was never stored, the sentinel string
was. The old annotation was misleading and would have masked a real
None value if one ever slipped through.
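
A minimal sketch of the coalescing primitive, assuming a thread-based
server and a plain get/put TTL store underneath (the real `TtlCache`
API and the project's concurrency model may differ):

    import threading
    from concurrent.futures import Future

    class CoalescingCache:
        """TTL cache front where concurrent misses on a key share one fetch."""

        def __init__(self, ttl_get, ttl_put):
            self._get, self._put = ttl_get, ttl_put   # underlying TTL store
            self._inflight: dict[str, Future] = {}
            self._lock = threading.Lock()

        def get_or_fetch(self, key, fetcher):
            cached = self._get(key)
            if cached is not None:
                return cached
            with self._lock:
                fut = self._inflight.get(key)
                owner = fut is None
                if owner:
                    fut = self._inflight[key] = Future()
            if not owner:
                # Coalesced waiter: share the owner's result; failures propagate too.
                return fut.result()
            try:
                value = fetcher()
                self._put(key, value)        # cache populated exactly once
                fut.set_result(value)
                return value
            except Exception as exc:
                fut.set_exception(exc)       # failure not cached; next caller retries
                raise
            finally:
                with self._lock:
                    self._inflight.pop(key, None)
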
Two blank lines before the `_UnreachableOOM` class definition. Missed
by the original commit since the relevant style check only kicked in
once the file had passed through `ruff format`.

philpax merged commit a3aa792 into main on May 13, 2026
12 checks passed