
ABX-375: All-in-HV + FEX runtime #293

Merged
AprilNEA merged 35 commits into master from feat/fex64-hv-runtime-plan
Jun 10, 2026


@AprilNEA

Member

ABX-375: All-in-HV + FEX64 for linux/amd64 — status & handoff

Goal: run linux/amd64 OCI containers through FEX64 (binfmt) inside the single
HV utility VM, so no VZ/Rosetta runtime VM is needed. VZ/Rosetta kept only as an
optional build backend / fallback (ABX-374, PR #291, branch
feat/dual-utility-vm-routing — preserved, do not delete).

Status: ⛔ BLOCKED at Gate A on Apple Silicon (Apple M5 Max / macOS 26)

  • ✅ arm64 native containers: PASS.
  • linux/amd64 via FEX64: FEX SIGILLs — does not run.

Root cause (diagnosed live)

The HVF guest is advertised phantom SVE it cannot execute. /proc/cpuinfo in
the guest reports sve2 sve2p1 svebf16 …, yet a single SVE instruction (rdvl)
SIGILLs. FEX trusts HWCAP_SVE, runs SVE, and traps. This appears specific to
M5 + macOS-26 HVF (a real M1 advertises no SVE → FEX uses NEON and works).
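
The advertised side of this check can be sketched in Rust. This is a minimal illustration, not code from the branch: the function name `sve_features` is hypothetical, and it only reports what `/proc/cpuinfo` claims, which says nothing about whether an SVE instruction actually executes.

```rust
/// Returns the SVE-related flags on the first `Features` line of a
/// `/proc/cpuinfo` dump. The function name is hypothetical.
fn sve_features(cpuinfo: &str) -> Vec<&str> {
    cpuinfo
        .lines()
        // Only the first CPU entry is inspected, mirroring `grep -m1`.
        .find(|line| line.starts_with("Features"))
        .and_then(|line| line.split_once(':'))
        .map(|(_, flags)| {
            flags
                .split_whitespace()
                .filter(|flag| flag.starts_with("sve"))
                .collect()
        })
        .unwrap_or_default()
}
```

An empty result corresponds to the "0 SVE features" state; a non-empty result only means the kernel advertises SVE to userspace.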

Fixes attempted

  1. arm64.nosve arm64.nosme guest cmdline (this branch, 514f36b,
    app/arcbox-core/src/vm_lifecycle/mod.rs): ✅ verified, guest → 0 SVE
    features. Correct hardening regardless of the amd64 decision; keep.
  2. apple-m1 FEX build (boot-assets PR #23): removed FEX's auto-vectorized
    SVE, but FEX-2605 also emits explicit, unconditional SVE (gdb:
    ld1rd {z5.d}, p5/z in FEX's own code) that runs and traps even with guest
    SVE disabled. So neither fix, nor both together, makes FEX run amd64 here.

Decision point

Per the ABX-375 plan, a Gate-A failure → resume ABX-374 (VZ Rosetta) for amd64
rather than duct-taping. Options:

  • A (recommended): route linux/amd64 runtime to VZ Rosetta (proven); keep the
    HV path + arm64.nosve for arm64. FEX stays optional/experimental.
  • B: patch FEX-2605 to stop emitting unconditional SVE (or properly gate it),
    or try a newer FEX — uncertain; may be a FEX-vs-M5/HVF incompatibility.

What's in this branch

  • ABX-375 routing pivot — amd64→FEX translator, fail-closed when FEX absent, no VZ
    runtime VM (f4387f9, app/arcbox-docker/src/routing.rs + handlers).
  • assets.lock pinned to boot-assets v0.5.9 (static FEX at
    /arcbox/runtime/bin/FEX) — c9e2e6b.
  • Reproducible gate harness — tests/fex/validate-fex64.sh.
  • arm64.nosve guest fix — 514f36b.

Reproduce the blocker

# in the guest (via any container):
grep -m1 Features /proc/cpuinfo | tr ' ' '\n' | grep sve   # -> sve2p1, svebf16, ...
# a one-line `rdvl` program compiled with -march=armv8-a+sve SIGILLs (exit 132).
# amd64 path:
docker --context arcbox run --rm --platform linux/amd64 alpine uname -m   # SIGILL

AprilNEA and others added 21 commits May 28, 2026 20:05
Records the chosen utility VM role against the canonical container ID
returned by `POST /containers/create` (and the exec ID returned by
`POST /containers/{id}/exec`) so every follow-up lifecycle call —
start, stop, kill, restart, rename, remove, inspect, logs, top, stats,
changes, wait, pause/unpause, attach, exec start/resize/inspect — is
proxied to the same role.

The registry is process-local; lookups for pre-existing or
post-restart workloads return `None` and callers fall back to the
native default. Durable persistence and strict fail-closed behavior
are deferred to a later slice once the connector layer actually
resolves each role to a distinct guest dockerd endpoint.
Address routing gaps that would surface once `native` and `rosetta`
resolve to distinct guest dockerds:

- WorkloadRoleRegistry now tracks container `--name` aliases (with
  rename propagation) and resolves short hex IDs by canonical prefix,
  so `docker start web` and `docker logs ab12c3` land on the same role
  as the canonical entry instead of falling through to native.
- proxy_fallback resolves the role from the URI (container ID, then
  exec ID, then native default), so unrouted endpoints like
  `/containers/{id}/archive`, `/containers/{id}/update`, and
  `/exec/{id}/resize` follow the workload's VM.
- Module docs reworded to make clear the registry tracks bindings
  in-process rather than persisting them; durable persistence remains
  out of scope.

BuildKit `/session` routing is intentionally not addressed in this
commit: the protocol opens `/session` before the matching `/build`
sets the platform, so a session-UUID lookup cannot be honored at
session-open time. A follow-up needs either lazy session forwarding
or a pending-session buffer keyed by `X-Docker-Expose-Session-Uuid`.
Same goes for per-role host port forwarding, which still uses
`runtime.default_machine_name()` and needs a Runtime API for role →
machine identity.
…rship consistent

Two correctness fixes on the workload role registry, before native and
rosetta resolve to distinct guest dockerds:

- Short hex prefix lookup now collects every canonical that matches the
  requested prefix. If those canonicals agree on a role, that role is
  returned; if they disagree the registry returns `None` so callers
  fall back to the native default instead of silently choosing whichever
  HashMap iteration order surfaced first.
- WorkloadRoleRegistry gained an `alias_owner` reverse map. `add_alias`
  and `rename_alias` now detach an alias from any previous owner before
  reassigning it, so forgetting the previous canonical can no longer
  delete a binding that has since been adopted by another canonical.
  `forget` honors the reverse map symmetrically. Duplicate alias adds
  for the same canonical are deduped.

`query_param` docstring now flags that values are returned raw; the
current callers only use ASCII-safe identifiers (`platform`, `name`),
so percent-decoding stays deferred until something that may actually
carry encoded bytes wants this helper.
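
The prefix rule described above can be sketched as follows. This is an illustrative reduction, assuming a flat map from canonical ID to role; `Role` and `role_for_prefix` are hypothetical names, not the registry's real API.

```rust
use std::collections::HashMap;

/// Stand-in for the utility VM role (hypothetical name).
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum Role {
    Native,
    Rosetta,
}

/// Resolves a short hex prefix to a role only when every matching
/// canonical ID agrees; disagreement or no match yields `None`.
fn role_for_prefix(bindings: &HashMap<String, Role>, prefix: &str) -> Option<Role> {
    let mut roles = bindings
        .iter()
        .filter(|(id, _)| id.starts_with(prefix))
        .map(|(_, role)| *role);
    // No match at all: nothing to return.
    let first = roles.next()?;
    // Every remaining match must agree with the first one.
    if roles.all(|role| role == first) {
        Some(first)
    } else {
        None
    }
}
```

Because every match is collected before deciding, the result does not depend on `HashMap` iteration order.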
PLAN.md step 3. Lift UtilityVmRole into arcbox-core (workload module)
so it is the shared currency between the daemon, runtime, and Docker
compatibility layer; arcbox-docker now re-exports it.

Runtime gains three role-keyed accessors:

- `ensure_role_ready(role)` — role-aware boot/ready hook. Both roles
  resolve to the existing default VM today; the rosetta branch
  diverges once the dual lifecycle lands.
- `machine_name_for_role(role)` — machine name to address.
- `guest_docker_vsock_port_for_role(role)` — dockerd vsock port.

VsockConnector::connect_for(role) consumes the new lookups so the
machine + port chosen for every connection follows the requested
role. The Docker handler and fallback paths now drive the role-aware
ensure hook and bubble the role into ensure errors, so a failure on
the rosetta VM surfaces as such instead of a generic native error.

No behavior change yet — the lookups still alias both roles to the
default VM. The seam is now in place for the dual-VM lifecycle slice.
PLAN.md step 4 prep. Threads the machine name and persistent dockerd
data image through VmLifecycleManager so a single struct can drive
either the default native machine or a secondary VZ Rosetta machine.

- New for_machine() constructor that takes the machine name and the
  docker.img filename. The existing new() delegates to it with the
  default values so all current callers (Runtime, daemon startup)
  behave identically.
- Internal create_default_machine / start_default_vm / wait_for_agent
  / idle monitor / event payloads now use self.machine_name instead
  of DEFAULT_MACHINE_NAME so a rosetta lifecycle reports the right
  machine in events and logs.
- data_image_path() yields the absolute path of this manager's
  docker.img, replacing the hard-coded DOCKER_DATA_IMAGE_NAME join in
  create_default_machine.

No behavior change: callers still construct one lifecycle on the
"default" machine. Adding a second VmLifecycleManager for the rosetta
role and wiring it into Runtime is the next slice.
PLAN.md step 4 prep. Replaces the hard-coded `VmBackend::Hv` in
`VmManager::build_vmm_config` with a per-machine backend so the
rosetta utility VM can run on VZ while the native one keeps running
on HV.

- `VmConfig` gains a `backend: VmBackend` field defaulting to `Hv`.
  `build_vmm_config` now reads from it.
- `MachineConfig` gains matching `backend` and `enable_rosetta`
  fields so callers can set them at create-time. `MachineManager::create`
  threads both into the underlying `VmConfig`.
- `VmLifecycleConfig` gains `backend` so each per-role lifecycle
  decides its own backend. The `create_default_machine` path now
  feeds it (plus `default_vm.rosetta`) into the `MachineConfig` it
  builds.
- `arcbox-core` re-exports `VmBackend` so downstream crates
  (`arcbox-api`'s gRPC machine handler) can construct
  `MachineConfig` without taking a direct `arcbox-vmm` dependency.

Existing single-VM behavior is preserved: every constructor and
default keeps `backend = Hv`. The rosetta lifecycle starts using `Vz`
when the dual-VM Runtime wiring lands.
PLAN.md step 4. Runtime now builds a per-role lifecycle slot at
construction time:

- Native (HV) slot — always present, drives the existing default
  machine and stays the eager-started utility VM.
- Rosetta (VZ) slot — present only on macOS Apple Silicon. Built with
  machine name "rosetta", docker-rosetta.img as its persistent data
  image, VmBackend::Vz, and default_vm.rosetta=true. The lifecycle is
  constructed up front so the slot's state is addressable, but the VM
  itself stays cold until `ensure_role_ready(Rosetta)` is first
  called by the Docker layer.

Role-keyed accessors now read from the slot map:

- `ensure_role_ready(role)` drives the role's container backend.
- `machine_name_for_role(role)` / `guest_docker_vsock_port_for_role(role)`
  return the slot's machine name and dockerd port.
- `lifecycle_for_role(role)` exposes the per-role lifecycle for
  diagnostics and future shared-control-plane wiring.
- `role_is_distinct(role)` lets callers tell whether a role has its
  own slot or is aliasing onto native on this platform.

The pre-existing `vm_lifecycle` / `container_backend` fields are kept
and pinned to the native slot so the daemon-wide flows (Kubernetes,
shutdown) keep behaving identically. When a role is not configured on
the host (e.g. rosetta on non-Apple-Silicon), the slot lookup falls
back to native so the Docker layer keeps working as a single-VM setup.

Daemon startup wait-for-resources still waits only on the native
docker.img; per-role XPC holder handling lands in the shared control
plane slice.
PLAN.md step 5 (partial). Replaces the single inbound listener slot
in Runtime with a per-machine map so each utility VM owns its own
listener, then teaches the Docker handler to bind a container's
published ports against the role the container was created on.

- Runtime now holds `inbound_listeners: HashMap<String,
  InboundListenerManager>` keyed by machine name and tracks the
  per-container rules as `(machine_name, rules)` so teardown reaches
  the correct listener even when the container migrated roles.
  `start_port_forwarding_macos`, `stop_port_forwarding_by_id`, and
  `stop_port_forwarding_all` follow the per-machine map.
- The Docker `setup_port_forwarding_from_inspect` path now resolves
  the machine name via `runtime.machine_name_for_role(role)` instead
  of `default_machine_name`, so an `amd64` container running on the
  rosetta VM lands on the rosetta bridge's listener.

Existing single-VM deployments are unaffected: when only the native
slot is configured every container ends up on the same machine, the
old single-listener behavior reduces to one map entry.
…roles

PLAN.md step 5. Two host-side coordination changes needed before a
real dual VM deployment is safe.

- daemon `wait_for_resources` now scans every persistent dockerd image
  owned by a configured utility VM role (native `docker.img`, rosetta
  `docker-rosetta.img`) so a stale VZ XPC holder on either image is
  drained before `init_runtime` brings up either VM. Same 10-second
  bound applies per image.
- Docker handler `ensure_role_ready` refuses requests for a role that
  is not configured on this host (e.g. Rosetta on non-Apple-Silicon)
  with a clear platform-specific error rather than silently falling
  back to native. Silent fallback would land a `linux/amd64` workload
  on the HV native VM that cannot translate x86_64, with no useful
  diagnostic.

The native default remains the fallback for Rosetta requests that
fail open elsewhere; this only short-circuits the case where Rosetta
is definitively unsupported by the host.
PLAN.md step 7. Compose-managed containers carry
`com.docker.compose.project` on every service; ArcBox now uses that
label to pin every service in a project to the same utility VM role
so DNS, port forwarding, and volume sharing remain coherent within a
project.

- `WorkloadRoleRegistry` gains `project_role(name)` and
  `record_project(name, role)`. Bindings are sticky across compose
  up/down cycles to keep group routing predictable.
- `UtilityVmRoleExt::can_host(platform)` codifies which roles can
  accept which platforms: rosetta hosts both arm64 and amd64; native
  refuses amd64. Used by the create handler to decide whether the
  next service in a project is compatible.
- `extract_compose_project(body)` reads the project label from a
  container-create payload.
- `create_container` now:
  1. Parses the routing decision from platform metadata.
  2. Reads the compose project label.
  3. If the project is already bound, uses that role when compatible
     and rejects with a 400 ("mixed-backend compose projects are not
     supported") otherwise.
  4. If not yet bound, records the first service's role as the
     project's role.

Containers without a compose label retain the per-container behavior.
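
A minimal sketch of the binding rule in steps 3 and 4, under stated assumptions: `Role`, `can_host` as a free function, and `bind_project` are hypothetical simplifications, and platforms are compared as plain strings.

```rust
use std::collections::HashMap;

/// Stand-in for the utility VM role (hypothetical name).
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum Role {
    Native,
    Rosetta,
}

/// Rosetta hosts both arm64 and amd64; native refuses amd64.
fn can_host(role: Role, platform: &str) -> bool {
    match role {
        Role::Rosetta => matches!(platform, "linux/arm64" | "linux/amd64"),
        Role::Native => platform != "linux/amd64",
    }
}

/// First service wins: an unbound project takes the requested role, and a
/// bound project either reuses its role or rejects the incompatible service.
fn bind_project(
    projects: &mut HashMap<String, Role>,
    project: &str,
    requested: Role,
    platform: &str,
) -> Result<Role, String> {
    match projects.get(project).copied() {
        Some(bound) if can_host(bound, platform) => Ok(bound),
        Some(_) => Err("mixed-backend compose projects are not supported".to_string()),
        None => {
            projects.insert(project.to_string(), requested);
            Ok(requested)
        }
    }
}
```

In this sketch the `Err` case stands in for the 400 response; a project whose first service is arm64 binds to native and then rejects a later amd64 service.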
PLAN.md step 2 follow-up. The Docker CLI opens `/session` before it
sends the matching `/build` that carries platform metadata, so the
session role cannot be derived at session-open time. ArcBox forwards
`/session` to the native (HV) utility VM by default; an `amd64`
build that needs Rosetta-side BuildKit features will not see this
session and side channels (secrets, ssh, build mounts) will fail.

Adds a doc comment explaining the limitation and a debug log of the
session UUID so operators can correlate `/session` and `/build`
forwarding. Routing both endpoints together requires lazy session
forwarding keyed by `X-Docker-Expose-Session-Uuid`; left as a
follow-up rather than landed here, since correctly buffering the
HTTP/1.1 upgrade until `/build` arrives is substantial work that
deserves its own change.
…after daemon restart

Closes the two items previously deferred under PLAN.md step 2.

BuildKit /session role routing:
- `WorkloadRoleRegistry` gains `wait_for_role(key, max_wait)` backed by
  a `tokio::sync::Notify`; `record(...)` fires the signal.
- `build_image` records `X-Docker-Expose-Session-Uuid → role` so the
  parallel `/session` request can be routed coherently.
- `session()` reads the same UUID and parks on `wait_for_role` for up
  to 30 seconds (matches BuildKit's own session-handshake timeout),
  forwarding the upgrade to the role declared by `/build`. Both
  ordering races (`/build`-first or `/session`-first) resolve
  correctly. On timeout, `/session` forwards to native so the user
  sees a BuildKit-level error rather than a hung connection.
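
The record/wait handshake can be illustrated with the standard library. This sketch deliberately swaps `tokio::sync::Notify` for `std::sync::Condvar` so it stays dependency-free and blocking; `SessionRoles` and `Role` are hypothetical names, and only `record` and `wait_for_role` mirror names from the commit.

```rust
use std::collections::HashMap;
use std::sync::{Condvar, Mutex};
use std::time::{Duration, Instant};

/// Stand-in for the utility VM role (hypothetical name).
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum Role {
    Native,
    Rosetta,
}

/// Blocking analogue of the registry: a map guarded by a mutex plus a
/// condition variable that is signalled on every `record`.
#[derive(Default)]
struct SessionRoles {
    roles: Mutex<HashMap<String, Role>>,
    recorded: Condvar,
}

impl SessionRoles {
    fn record(&self, key: &str, role: Role) {
        self.roles.lock().unwrap().insert(key.to_string(), role);
        self.recorded.notify_all();
    }

    /// Returns the role for `key`, waiting up to `max_wait` for a record.
    fn wait_for_role(&self, key: &str, max_wait: Duration) -> Option<Role> {
        let deadline = Instant::now() + max_wait;
        let mut roles = self.roles.lock().unwrap();
        loop {
            if let Some(role) = roles.get(key) {
                return Some(*role);
            }
            // `None` here means the deadline has passed: give up.
            let remaining = deadline.checked_duration_since(Instant::now())?;
            roles = self.recorded.wait_timeout(roles, remaining).unwrap().0;
        }
    }
}
```

The loop re-checks the map after every wakeup, so either ordering (record first or wait first) resolves, and spurious wakeups are harmless.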

Cross-restart durability via lazy guest-probe rebuild:
- `resolve_container_role` and `resolve_role_from_uri` now treat a
  registry miss as a recovery signal: they probe each configured
  role's guest dockerd with `GET /containers/{id}/json`, accept the
  first 200 as the workload's role, and re-record it. Native is
  always probed first because it's already up; the Rosetta probe
  triggers lazy startup on first miss after a restart, which is
  exactly the recovery behavior we want for surviving rosetta
  workloads.
- No on-disk schema is introduced; correctness is recovered from the
  guest dockerds, which already persist their own container state.
- The dead `proxy_upgrade()` helper is dropped — all upgrade paths
  now go through `proxy_upgrade_to_role`.

Tests cover the three `wait_for_role` paths (cache hit, late record
wakeup, timeout), bringing the docker-lib suite to 110 passing.
Closes the routing-correctness gaps surfaced by review: the
Missing/Ambiguous conflation, the first-hit-wins rebuild, and the
silent /session timeout fallback. Every place that previously
collapsed "ambiguous" into "missing" and quietly fell back to native
now surfaces a Docker-compatible 4xx instead.

WorkloadRoleRegistry::lookup returns a new WorkloadRoleLookup
{Found(role), Missing, Ambiguous} tri-state. Cross-VM short-ID
collisions report Ambiguous (previously None). wait_for_role
propagates the same shape, and BuildKit /session no longer routes a
timed-out session to native — it returns 400 with a clear message
naming the UUID, since silently attaching the upgraded gRPC stream
to the wrong VM would just leak the misroute into BuildKit's session
layer.

resolve_container_role / resolve_exec_role / resolve_role_from_uri
all return Result<UtilityVmRole>. Ambiguous identifiers (registry
prefix collision *or* multi-guest probe match) surface as 409
Conflict via a new ambiguous_workload_error helper. Macros and the
catch-all proxy fallback propagate the Result via `?`.

rebuild_container_role_from_guests now probes Native AND Rosetta
unconditionally and collects every hit before deciding. Zero hits =>
Missing, one hit => Found, more than one => Ambiguous. Returning on
the first match was a silent-misroute bug for cross-VM short-ID
collisions after a daemon restart — fixed.
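
The collect-then-decide rule reduces to a small match. The variant names follow the tri-state named in this commit message; the function `classify_hits` and the `UtilityVmRole` variants shown here are assumptions made for illustration.

```rust
/// Stand-in for the shared role enum (variants assumed).
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum UtilityVmRole {
    Native,
    Rosetta,
}

/// Tri-state outcome of a registry lookup or guest probe.
#[derive(Debug, PartialEq, Eq)]
enum WorkloadRoleLookup {
    Found(UtilityVmRole),
    Missing,
    Ambiguous,
}

/// Decides only after every guest has been probed: zero hits is Missing,
/// exactly one is Found, more than one is Ambiguous.
fn classify_hits(hits: &[UtilityVmRole]) -> WorkloadRoleLookup {
    match hits {
        [] => WorkloadRoleLookup::Missing,
        [only] => WorkloadRoleLookup::Found(*only),
        _ => WorkloadRoleLookup::Ambiguous,
    }
}
```

Keeping `Missing` and `Ambiguous` as separate variants is what lets callers answer an ambiguous identifier with 409 Conflict instead of quietly treating it as a miss.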

Compose project scheduling docs/comments updated to call the current
behavior what it actually is: "first-service-wins binding with
mixed-VM rejection". PLAN's stronger "any amd64 service → whole
project rosetta" requires reading the full compose file before any
service is created, which is out of scope for a per-API-request
routing layer; the limitation is documented in code and in PLAN.md
rather than papered over.

All 110 lib tests pass; existing prefix-collision test renamed to
cross_role_prefix_collision_is_ambiguous and updated to assert the
new Ambiguous outcome.
Host-driven validation script + README covering PLAN.md Decision
Gates A/B/C: arm64 native + amd64-via-FEX64 `uname -m`, representative
amd64 images (musl/glibc/busybox/node/python/go/apt), exit-status and
stderr propagation, BuildKit amd64 build, and a mixed arm64/amd64
Compose project staying in the single HV VM. Records an environment
header (macOS version, guest kernel, FEX version, binfmt status,
arcbox commit) for reproducibility, and tags every check
PASS/FAIL/UNSUPPORTED/INFRA so a real gate failure is distinguishable
from a setup problem.

The harness is executed by a developer on Apple Silicon — it cannot
run where the daemon can't boot a VM. README documents the
FEX-at-/arcbox/bin/FEX contract (registered by the boot-assets rootfs
init) and the hardware-TSO go/no-go probe.
… (ABX-375)

Pivots the default runtime to a single HV utility VM. Platform no
longer selects a utility VM role; it selects an in-guest translator:

- routing.rs: `RuntimeTranslator { Native, Fex64 }`; amd64 →
  Fex64, arm64/unspecified → Native. `RoutingDecision` carries the
  translator and always resolves the workload to the single HV VM
  (`utility_vm()` → Native). `is_admissible(decision, fex64_available)`
  is the fail-closed gate. Drops the dual-VM-only helpers
  (`utility_vm_role`, `UtilityVmRoleExt::can_host`,
  `extract_compose_project`).
- handlers: `require_amd64_runtime` rejects amd64 with a clear
  "requires FEX64 in the HV guest" error when FEX64 is not provisioned
  — never silently falling back to VZ/Rosetta or QEMU. `create_container`
  drops Compose project-role binding (single VM needs none) and the
  per-request build/session role machinery is removed; `/build` and
  `/session` go to the one HV VM. The registry rebuild probes only the
  HV VM so a lifecycle lookup can never boot the VZ build backend.
- core: `Runtime::amd64_runtime_supported()` returns whether
  `<data_dir>/bin/FEX` (guest `/arcbox/bin/FEX`) is present — the same
  artifact whose presence makes the boot-assets rootfs init register
  the x86_64 binfmt handler, so host admission and guest registration
  share one signal.
- workload.rs: drop the Compose-project and BuildKit `/session`
  role-sync machinery (unnecessary in single-VM); the registry now
  only maps IDs/aliases → role and fails closed on ambiguity.

VZ/Rosetta is demoted, not deleted: the role enum, slot, and ABX-374
machinery remain as the preserved fallback / future explicit build
backend, but the runtime path never selects Rosetta or boots the VZ
VM. Routing/admission unit tests updated (amd64 → fex64 translator,
amd64 fail-closed without FEX64, native always admissible).
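
A simplified sketch of the translator selection and the fail-closed gate. `RuntimeTranslator` and `is_admissible` are named in the commit, but the signatures here are assumptions: the real `is_admissible` takes a `RoutingDecision`, and `translator_for` is a hypothetical helper.

```rust
/// In-guest translator chosen from the requested platform.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum RuntimeTranslator {
    Native,
    Fex64,
}

/// amd64 selects Fex64; arm64 or an unspecified platform selects Native.
/// The architecture is taken from the second `/`-separated component,
/// or from the whole string when there is no separator.
fn translator_for(platform: Option<&str>) -> RuntimeTranslator {
    let arch = platform.map(|p| p.split('/').nth(1).unwrap_or(p));
    match arch {
        Some("amd64") => RuntimeTranslator::Fex64,
        _ => RuntimeTranslator::Native,
    }
}

/// Fail-closed gate: native is always admissible, and amd64 is admitted
/// only when FEX64 is provisioned in the guest.
fn is_admissible(translator: RuntimeTranslator, fex64_available: bool) -> bool {
    match translator {
        RuntimeTranslator::Native => true,
        RuntimeTranslator::Fex64 => fex64_available,
    }
}
```

There is deliberately no fallback branch: an inadmissible amd64 request is rejected rather than rerouted to another backend.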
Running the harness against a live arcbox daemon exposed a
misclassification: amd64 `exec format error` (no x86_64 binfmt
handler) and the ABX-375 fail-closed error were tagged as a Gate-A
FAIL, which per the goal would wrongly trigger "resume ABX-374".

That is the FEX64-*unavailable* state (interpreter not provisioned),
not a FEX64 gate failure. Only FEX64 actually running and
mis-executing (wrong arch / crash) is a real FAIL. The harness now:

- captures stderr and distinguishes "not provisioned"
  (exec format error / requires-FEX64 / binfmt / missing interpreter)
  → INFRA, sets an `amd64_blocked` flag;
- reserves FAIL for FEX64 running but returning the wrong result;
- reports RESULT: BLOCKED (exit 2) with explicit "do not resume
  ABX-374 on this basis" guidance when amd64 is unprovisioned;
- applies the same distinction to the Gate B image matrix.

Also corrects the README Docker context endpoint to
unix:///<home>/.arcbox/run/docker.sock.

Verified against the live `arcbox` context: arm64 PASS, amd64 now
INFRA (FEX64 not provisioned) instead of a false FAIL.
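
The harness itself is a shell script; the classification rule it applies can be sketched in Rust for clarity. The marker strings below are assumptions derived from the categories listed above, not the script's exact patterns, and `classify_amd64_run` is a hypothetical name.

```rust
/// Outcome tags used by the gate checks (subset).
#[derive(Debug, PartialEq, Eq)]
enum Verdict {
    Pass,
    Fail,
    Infra,
}

/// Classifies an amd64 `uname -m` run. "Not provisioned" errors are INFRA;
/// FAIL is reserved for FEX64 running but returning the wrong result.
fn classify_amd64_run(exit_ok: bool, stdout: &str, stderr: &str) -> Verdict {
    // Assumed markers for the "interpreter not provisioned" state.
    const NOT_PROVISIONED: [&str; 4] = [
        "exec format error",
        "requires fex64",
        "binfmt",
        "missing interpreter",
    ];
    if exit_ok && stdout.trim() == "x86_64" {
        return Verdict::Pass;
    }
    let stderr = stderr.to_lowercase();
    if NOT_PROVISIONED.iter().any(|marker| stderr.contains(marker)) {
        Verdict::Infra
    } else {
        Verdict::Fail
    }
}
```

Under this rule a crash such as a SIGILL with no provisioning marker is still a FAIL, while an `exec format error` is not.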
…t path

ArcBox installs boot-manifest runtime binaries to <data_dir>/runtime/bin
(guest /arcbox/runtime/bin, via prepare_binaries), the same set the guest
runs dockerd/containerd from. boot-assets v0.5.8+ registers the FEX
binfmt handler at /arcbox/runtime/bin/FEX. Align the host-side
amd64_runtime_supported() fail-closed gate to <data_dir>/runtime/bin/FEX
(was <data_dir>/bin/FEX, which prepare_binaries never populates), and fix
the amd64-unavailable error message to name the correct path.

Pairs with boot-assets fix/fex-static-runtime-paths (static FEX +
/arcbox/runtime/bin/FEX binfmt path).
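
The host-side probe amounts to a path join and an existence check. A minimal sketch, assuming the data directory is passed in explicitly; the real `amd64_runtime_supported()` is a `Runtime` method, and `fex_host_path` is a hypothetical helper.

```rust
use std::path::{Path, PathBuf};

/// Host path that is shared into the guest as /arcbox/runtime/bin/FEX.
fn fex_host_path(data_dir: &Path) -> PathBuf {
    data_dir.join("runtime").join("bin").join("FEX")
}

/// Fail-closed gate signal: amd64 is supported only when the FEX binary
/// has been provisioned under <data_dir>/runtime/bin.
fn amd64_runtime_supported(data_dir: &Path) -> bool {
    fex_host_path(data_dir).is_file()
}
```

Probing the same file the rootfs init checks keeps host admission and guest binfmt registration tied to one signal.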
…provisioned

Match the FEX runtime path to the boot-assets binfmt registration
(/arcbox/runtime/bin/FEX). Also short-circuit to a BLOCKED summary when
Gate A finds amd64 unprovisioned, so the runtime/build/compose amd64
sub-checks don't emit misleading FAIL lines and the verdict stays
"decision pending" rather than falsely triggering "resume ABX-374".
Bump [boot] to v0.5.9, which ships the statically-linked FEX64 x86_64
interpreter (FEX/FEXServer, arm64) staged at bin/FEX/... in the manifest.
The host syncs it to <data_dir>/runtime/bin/FEX, shared into the guest as
/arcbox/runtime/bin/FEX — the path the rootfs init registers as the
x86_64 binfmt_misc handler and that amd64_runtime_supported() probes.

Static linking makes FEX usable as a binfmt interpreter inside OCI
container mount namespaces (no external loader/library closure to resolve
against the container rootfs). Update manifest_sha256 to the published
v0.5.9 manifest so the boot-time integrity check passes.

Also correct two stale comments that listed the synced runtime binaries
without FEX: prepare_binaries downloads every manifest binary, and FEX is
optional (absent FEX does not block boot; amd64 then fails closed).
On Apple Silicon under HVF (observed on M5 Max / macOS 26) the guest is
advertised SVE feature bits it cannot execute: /proc/cpuinfo reports
`sve2 sve2p1 svebf16 ...`, yet a single SVE instruction (e.g. `rdvl`)
SIGILLs. Userspace that trusts HWCAP_SVE then crashes — glibc's
ifunc-selected SVE memcpy/str* and FEX64's SVE paths.

Append `arm64.nosve arm64.nosme` to the default machine's kernel cmdline so
the guest kernel ignores the phantom features and userspace falls back to
NEON. Verified live: guest then reports 0 SVE features and `rdvl` is no
longer selected by HWCAP-gated code.

Note: this alone does NOT make FEX64 amd64 work — FEX-2605 also emits
unconditional explicit SVE that traps regardless (see ABX-375 handoff).
@linear-code

linear-code Bot commented May 31, 2026


ABX-375

PeronGH added 7 commits June 5, 2026 22:06
v0.5.9 shipped a FEX that SIGILLs on Apple Silicon (compiler-emitted SVE)
and required a FEXServer unreachable in container namespaces. v0.5.10 ships
the fixed static-pie FEX (no SVE codegen, runs standalone), so amd64
containers route through FEX64 and run.

Manifest verified: FEX present (arm64, FEX-2605), no FEXServer.
514f36b appended `arm64.nosve arm64.nosme` to the default machine's
kernel cmdline on the theory that HVF advertised the guest phantom SVE
feature bits it could not execute. That diagnosis was wrong: the real
cause was the FEX binary being built with `-mcpu=native` on an
SVE-capable host, so FEX's own codegen emitted unconditional SVE that
trapped. That is fixed in the FEX build shipped in boot-assets 0.5.10
(ab3a218).

With the root cause fixed at the build level, the cmdline workaround is
dead weight — and unconditional (no model/macOS-version/flag gate), so
it would silently force NEON fallback and lose SVE-backed glibc ifuncs
and FEX translation paths on hardware where SVE works correctly.
f4387f9 documented the FEX probe at <data_dir>/bin/FEX; f019567 moved the
actual check and error string to <data_dir>/runtime/bin/FEX (the path
prepare_binaries populates) but missed this doc comment. Align it.
The 0.5.10 FEX build carries a small patch that strips the FEXServer
requirement and runs purely as a binfmt_misc interpreter, so no server process
needs to be reachable across container mount namespaces. Drop the stale
"FEX/FEXServer" pairing from the boot-asset comments, and correct the harness
to probe the actual binary at /arcbox/runtime/bin/FEX (not the upstream
FEXInterpreter name).
…est setup_fex()

The harness README credited a guest-agent setup_fex() for the x86_64 binfmt_misc
registration, but no such function exists in this repo — the only binfmt code in
arcbox-agent is for Rosetta. Per the boot-assets repo, the rootfs /sbin/init
trampoline checks for /arcbox/runtime/bin/FEX and registers the handler with
POCF flags. Fix the description and the scope note (runtime.rs already described
it correctly).
The default-VM drift check compared the persisted kernel only against the
boot-asset version string, so a `--kernel` override that kept the same
boot-asset version was ignored and the stale VM was reused with its old
kernel. Compare the persisted kernel against the resolved desired kernel
path (custom `--kernel` override, else the versioned cache path) so a
kernel change is detected even when the boot-asset version is unchanged.
Apple SME cores (e.g. M4 Pro) advertise SME — and SME-derived SVE — to the
guest, but plain non-streaming SVE cannot execute on this silicon (a bare
`rdvl` SIGILLs on the host). FEX's x86-64 JIT detects the feature by reading
ID_AA64PFR1_EL1/ID_AA64PFR0_EL1 directly and emits SVE that traps, so amd64
containers SIGILL. `arm64.nosve` does not help — it only clears HWCAP, which
FEX ignores.

Clear the SME field [27:24] of the guest's ID_AA64PFR1_EL1 at vCPU init using
the get/modify/set_sys_reg pattern QEMU's HVF backend uses to sanitize guest
ID registers, presenting a NEON-only guest like Virtualization.framework. This
eliminates the SIGILL; the remaining amd64 allocator fault is unrelated.
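
The masking step is a single bitwise operation on the register value. The sketch below shows only that arithmetic; reading and writing the vCPU register through Hypervisor.framework is out of scope, and the constant and function names are hypothetical.

```rust
/// Bit position of the SME field in ID_AA64PFR1_EL1 (bits [27:24]).
const ID_AA64PFR1_SME_SHIFT: u32 = 24;
/// Four-bit mask covering the SME field.
const ID_AA64PFR1_SME_MASK: u64 = 0xF << ID_AA64PFR1_SME_SHIFT;

/// Returns the register value with the SME field zeroed and every other
/// field left untouched (the "modify" step of get/modify/set).
fn clear_sme_field(id_aa64pfr1: u64) -> u64 {
    id_aa64pfr1 & !ID_AA64PFR1_SME_MASK
}
```

A zero SME field tells the guest that SME is not implemented, which is what a NEON-only guest would observe.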
PeronGH added 2 commits June 9, 2026 20:11
…le fields

Drift detection previously checked only cpus/memory and the kernel path
inline, so cmdline changes (e.g. the guest docker vsock port, or an
`arm64.nosve` toggle) silently reused the stale VM. Extract the kernel +
cmdline resolution into a shared `resolve_desired_boot` used by both
`create_default_machine` and the drift check, so the comparison can never
diverge from what would be created, and add `machine_drift_reason` as the
single place that compares every overridable persisted field (cpus,
memory_mb, kernel, cmdline). Covered by a regression test.
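
A sketch of a single comparison point over the overridable fields. `machine_drift_reason` is the name used in the commit, but the `BootSpec` struct, the signature, and the reason strings are assumptions made for illustration.

```rust
/// Overridable fields persisted for a machine (hypothetical struct).
#[derive(Clone, Debug, PartialEq, Eq)]
struct BootSpec {
    cpus: u32,
    memory_mb: u64,
    kernel: String,
    cmdline: String,
}

/// Returns the first field that differs between the persisted machine and
/// what would be created now, or `None` when the VM can be reused.
fn machine_drift_reason(persisted: &BootSpec, desired: &BootSpec) -> Option<String> {
    if persisted.cpus != desired.cpus {
        return Some(format!("cpus {} -> {}", persisted.cpus, desired.cpus));
    }
    if persisted.memory_mb != desired.memory_mb {
        return Some(format!("memory_mb {} -> {}", persisted.memory_mb, desired.memory_mb));
    }
    if persisted.kernel != desired.kernel {
        return Some(format!("kernel {} -> {}", persisted.kernel, desired.kernel));
    }
    if persisted.cmdline != desired.cmdline {
        return Some("cmdline changed".to_string());
    }
    None
}
```

If `desired` comes from the same resolver that machine creation uses, the comparison cannot diverge from what a fresh machine would get.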
@PeronGH

PeronGH commented Jun 9, 2026

Contributor

Root cause found — it's randomize_va_space, not SVE

The Gate-A blocker (linux/amd64 via FEX64) is resolved, and the "phantom SVE / SIGILL" diagnosis in the description was a red herring — those SIGILLs were collateral from a deeper allocator failure.

Root cause: the guest kernel ships CONFIG_COMPAT_BRK=y, which forces kernel.randomize_va_space=1 at boot (mmap/stack/vDSO randomized, brk/heap not). FEX's x86-64 allocator can't lay out its VA space under =1 and fails non-deterministically with Failed to map VMA region → SIGSEGV. The SIGILLs only surfaced on runs that got far enough; once the allocator is fixed they're gone.

How it was isolated: OrbStack runs the same FEX binary cleanly — its guest has randomize_va_space=2; ours had =1. Toggling the sysctl at runtime: 1 → fails, 2 → works (x86_64, 10/10) on both a 52-bit (LPA2) and a 48-bit kernel. So VA width and page size were not the cause — the arm64.nosve and VA_BITS=48 detours were dead ends.

Fix: one line in the guest kernel config — disable COMPAT_BRK so it boots randomize_va_space=2 (matching OrbStack / stock distros): arcboxlabs/kernel#7

Result: with a kernel built from that PR, docker run --platform linux/amd64 alpine uname -m → x86_64, deterministic, no SVE masking and no manual sysctls. Gate A passes.

What changes on this branch: keep the drift-detection fixes (recreate the default VM when kernel / cmdline / cpus / memory drift). The VMM SME-mask and the guest VA_BITS=48 kernel change explored during debugging are unnecessary and were reverted/dropped — COMPAT_BRK is the sole fix.
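
A guest-side sanity check for the setting can be sketched as a parser over the sysctl file contents. The function names are hypothetical; `/proc/sys/kernel/randomize_va_space` is the standard procfs location of this sysctl.

```rust
/// Parses the contents of /proc/sys/kernel/randomize_va_space.
fn parse_randomize_va_space(contents: &str) -> Option<u8> {
    contents.trim().parse().ok()
}

/// True only for mode 2 (mmap/stack/vDSO and brk/heap all randomized),
/// the mode under which FEX ran cleanly in the tests described above.
fn has_full_aslr(contents: &str) -> bool {
    parse_randomize_va_space(contents) == Some(2)
}
```

In a guest, the input would come from `std::fs::read_to_string("/proc/sys/kernel/randomize_va_space")`; a kernel built with COMPAT_BRK=y is expected to report `1` there.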

@PeronGH PeronGH marked this pull request as ready for review June 9, 2026 13:42
Copilot AI review requested due to automatic review settings June 9, 2026 13:42

Copilot AI left a comment

Contributor


Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

This PR advances ABX-375 by routing linux/amd64 runtime workloads through FEX64 (binfmt_misc) inside the single HV utility VM, while making the proxy/lifecycle stack role-aware so per-workload follow-ups consistently hit the same utility VM role (supporting the preserved ABX-374 dual-VM fallback).

Changes:

  • Add a reproducible local FEX64 validation harness (tests/fex/*) and update boot-assets pin in assets.lock.
  • Introduce role-aware routing + workload role registry in arcbox-docker (container/exec/build/fallback proxying now selects a utility VM role deterministically).
  • Extend arcbox-core VM lifecycle/runtime to support per-role machine identity (machine name, data image) and per-machine hypervisor backend selection (HV vs VZ).

Reviewed changes

Copilot reviewed 25 out of 26 changed files in this pull request and generated 11 comments.

| File | Description |
| --- | --- |
| tests/fex/validate-fex64.sh | Adds a local gate-based harness to validate FEX64 behavior for linux/amd64 under HV. |
| tests/fex/README.md | Documents how to run/interpret the FEX64 harness and the decision gates. |
| assets.lock | Updates boot-assets version/manifest pin used to provision guest runtime binaries. |
| app/arcbox-docker/src/workload.rs | Adds an in-process registry mapping workload IDs/aliases to utility VM roles with ambiguity handling. |
| app/arcbox-docker/src/routing.rs | Introduces platform parsing and a translator concept (native vs FEX64) for single-HV routing decisions. |
| app/arcbox-docker/src/proxy/upload.rs | Adds role-selectable streaming upload forwarding into guest dockerd. |
| app/arcbox-docker/src/proxy/upgrade.rs | Adds role-selectable upgrade proxying (exec attach/session) into guest dockerd. |
| app/arcbox-docker/src/proxy/mod.rs | Extends the guest connector trait to support role-aware connections. |
| app/arcbox-docker/src/proxy/forward.rs | Adds role-selectable buffered/streaming proxy helpers to guest dockerd. |
| app/arcbox-docker/src/proxy/fallback.rs | Routes unmatched Docker API requests by resolving role from URI (container/exec IDs). |
| app/arcbox-docker/src/proxy/connector.rs | Implements connect_for(role) using per-role machine name + vsock port. |
| app/arcbox-docker/src/lib.rs | Exposes new routing and workload modules. |
| app/arcbox-docker/src/handlers/mod.rs | Adds role extraction/resolution from URIs plus role-aware proxy helpers and fail-closed admission for amd64 runtime. |
| app/arcbox-docker/src/handlers/exec.rs | Records exec IDs to role on exec-create and routes exec follow-ups to the recorded role. |
| app/arcbox-docker/src/handlers/container.rs | Adds fail-closed amd64 admission, records container IDs/names to role, and routes lifecycle/networking by role. |
| app/arcbox-docker/src/handlers/build.rs | Routes builds through HV and fails closed for amd64 when FEX64 is absent. |
| app/arcbox-docker/src/api.rs | Stores the workload role registry in shared app state. |
| app/arcbox-daemon/src/startup/mod.rs | Expands startup resource-wait to scan both native and rosetta dockerd images. |
| app/arcbox-core/src/workload.rs | Defines shared UtilityVmRole enum and helpers for cross-crate role identity. |
| app/arcbox-core/src/vm_lifecycle/mod.rs | Adds per-machine identity/config, drift detection based on resolved boot params, and backend selection. |
| app/arcbox-core/src/vm.rs | Makes VM backend selection a per-machine config property instead of hardcoding HV. |
| app/arcbox-core/src/runtime.rs | Introduces per-role slots (native + optional rosetta) and per-machine inbound port-forwarding state. |
| app/arcbox-core/src/machine.rs | Adds backend + rosetta exposure flags to machine creation config. |
app/arcbox-core/src/lib.rs Re-exports VmBackend and UtilityVmRole.
app/arcbox-core/src/boot_assets.rs Updates docs for preparing all runtime binaries in the boot manifest (including optional FEX).
app/arcbox-api/src/grpc/machine.rs Updates machine creation requests to populate new backend/rosetta fields.

Comment on lines +220 to +225
fn ambiguous_workload_error(id: &str) -> DockerError {
    DockerError::Conflict(format!(
        "workload identifier '{id}' is ambiguous: it matches multiple workloads. \
         Use the full canonical container ID."
    ))
}
Comment on lines +402 to +408
/// Ensures the utility VM for `role` is running and ready.
///
/// Drives the per-role lifecycle so the native and rosetta VMs are
/// reachable independently. If `role` is not configured on this host
/// (e.g. rosetta on non-Apple-Silicon) the native slot answers as a
/// degradation path — the dockerd connector still works, but the
/// workload runs on HV instead of VZ+Rosetta.
Comment on lines +414 to +416
pub async fn ensure_role_ready(&self, role: UtilityVmRole) -> Result<u32> {
    self.slot_for(role).container_backend.ensure_ready().await
}
Comment on lines +481 to +490
/// Returns the role slot, falling back to native if `role` is not
/// configured on this host.
fn slot_for(&self, role: UtilityVmRole) -> &RoleSlot {
    if let Some(slot) = self.role_slots.get(&role) {
        return slot;
    }
    self.role_slots
        .get(&UtilityVmRole::Native)
        .expect("Native role slot must always be present")
}
Comment on lines +124 to +128
/// If the alias is currently owned by a different canonical (e.g. a
/// previous container with the same name that has not yet been
/// forgotten), the alias is detached from the previous owner first so
/// the old owner's alias list never points to a key that now resolves
/// to a different role.
Comment on lines +146 to +150
detach_alias_from_previous_owner(&mut guard, &alias);
guard.roles.insert(alias.clone(), role);
guard
.alias_owner
.insert(alias.clone(), canonical.to_string());
Comment on lines 620 to 624
 async fn get_cid(&self) -> Result<u32> {
     self.machine_manager
-        .get_cid(DEFAULT_MACHINE_NAME)
+        .get_cid(&self.machine_name)
         .ok_or_else(|| CoreError::Machine("default machine has no CID".to_string()))
 }
Comment on lines +696 to 698
match self.machine_manager.start(&self.machine_name).await {
    Ok(()) => {
        tracing::info!("Default VM started successfully");
Comment thread tests/fex/validate-fex.sh
# --- Gate A: basic viability ------------------------------------------------
section "Gate A: basic viability"

arch="$("${DC[@]}" run --rm --platform linux/arm64 alpine uname -m 2>/dev/null)"
Comment thread tests/fex/validate-fex.sh
Comment on lines +87 to +88
amd64_out="$("${DC[@]}" run --rm --platform linux/amd64 alpine uname -m 2>&1)"
if [ "$amd64_out" = "x86_64" ]; then
@greptile-apps

greptile-apps Bot commented Jun 9, 2026


Greptile Summary

This PR (ABX-375) routes all Docker runtime containers through the single HV utility VM, using FEX (binfmt_misc) for linux/amd64 workloads instead of a separate VZ/Rosetta VM. VZ/Rosetta is retained as an opt-in build backend only. The PR is explicitly marked blocked at Gate A on M5 Max/macOS-26 due to phantom SVE in HVF guests causing FEX SIGILLs; the code is correct but the hardware/firmware layer is broken.

  • Routing pivot (routing.rs, handlers/): linux/amd64 requests are admitted only when <data_dir>/runtime/bin/FEX is present; amd64 workloads fail closed with a clear error rather than silently falling back to VZ/Rosetta. Native arm64 workloads are unaffected.
  • Lifecycle refactor (vm_lifecycle/mod.rs, runtime.rs): VmLifecycleManager gains a machine_name/data_image_filename seam via for_machine() so a second (Rosetta) lifecycle can coexist; drift detection is extended to cover kernel cmdline changes.
  • WorkloadRoleRegistry (workload.rs): New in-process registry tracks container/exec IDs → utility VM role; handles short-ID prefix lookup and ambiguity detection.
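The fail-closed admission rule in the first bullet can be sketched as two small functions: one that maps a Docker platform string to a translator, and one that refuses amd64 when FEX is missing. The names `Translator`, `translator_for`, and `admit` are hypothetical and are not the identifiers used in `routing.rs`; the `fex_present` flag stands in for the check that `<data_dir>/runtime/bin/FEX` exists.

```rust
/// How a workload's instructions are executed inside the HV guest.
/// Hypothetical enum for illustration.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Translator {
    Native,
    Fex,
}

/// Maps a platform string such as "linux/amd64" to a translator.
/// Any variant suffix (for example the "v8" in "linux/arm64/v8") is ignored.
pub fn translator_for(platform: &str) -> Result<Translator, String> {
    let mut parts = platform.split('/');
    let os = parts.next().unwrap_or("");
    let arch = parts.next().unwrap_or("");
    if os != "linux" {
        return Err(format!("unsupported OS in platform '{platform}'"));
    }
    match arch {
        "arm64" | "aarch64" => Ok(Translator::Native),
        "amd64" | "x86_64" => Ok(Translator::Fex),
        other => Err(format!("unsupported architecture '{other}'")),
    }
}

/// Fail-closed admission: amd64 is rejected outright when FEX is absent,
/// instead of being silently rerouted to another backend.
pub fn admit(translator: Translator, fex_present: bool) -> Result<(), String> {
    match translator {
        Translator::Native => Ok(()),
        Translator::Fex if fex_present => Ok(()),
        Translator::Fex => Err("linux/amd64 requires FEX in the guest runtime".to_string()),
    }
}
```

Keeping the file-existence check outside `admit` (passed in as a boolean) makes the rule easy to unit test and keeps filesystem access out of the decision logic.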

Confidence Score: 3/5

Safe to continue iterating on, but two issues should be resolved before the branch is considered production-ready: the Rosetta lifecycle shutdown gap and the test harness false-FAIL path.

The routing pivot, WorkloadRoleRegistry, and FEX admission gate are all well-implemented and tested. runtime::shutdown() tears down the Rosetta VM machine directly but never calls the Rosetta lifecycle manager's own shutdown, leaving its in-memory state stale and its cancellation token uncancelled — any subsequent call to lifecycle_for_role(Rosetta).shutdown() would then target the wrong machine name. The validate-fex harness exit-code and stderr tests emit FAIL (the resume-ABX-374 signal) for transient infrastructure errors, making the gate unreliable as a decision artifact.

app/arcbox-core/src/runtime.rs (shutdown/shutdown_force Rosetta lifecycle teardown) and tests/fex/validate-fex.sh (exit-code and stderr propagation gate checks).

Important Files Changed

| Filename | Overview |
| --- | --- |
| app/arcbox-docker/src/routing.rs | New routing module: clean platform parsing, RoutingDecision, is_admissible gate, and query_param helper. All key cases covered by unit tests. |
| app/arcbox-docker/src/workload.rs | New WorkloadRoleRegistry: short-ID prefix lookup, alias tracking, and ambiguity detection are correct and well-tested. |
| app/arcbox-core/src/vm_lifecycle/mod.rs | Adds for_machine() seam and cmdline drift detection; shutdown() still uses hardcoded DEFAULT_MACHINE_NAME in graceful_stop (previously flagged), stale error string in get_cid (previously flagged). |
| app/arcbox-core/src/runtime.rs | Per-role slot map + macOS InboundListenerMap refactor look correct; amd64_runtime_supported() uses synchronous is_file() on async executor (previously flagged); Rosetta lifecycle not shut down cleanly. |
| app/arcbox-docker/src/handlers/mod.rs | New resolve_container_role, resolve_exec_role, require_amd64_runtime helpers are correct; fail-closed semantics properly implemented. |
| app/arcbox-docker/src/handlers/container.rs | create_container, start_container, remove_container correctly use require_amd64_runtime and record role bindings; exec ID cleanup gap flagged previously. |
| tests/fex/validate-fex.sh | Gate A/B/C harness is well-structured but exit-code and stderr propagation tests (lines 156-160) may emit FAIL for infrastructure errors, triggering the wrong architecture decision. |
| app/arcbox-docker/src/handlers/build.rs | build_image correctly gates on require_amd64_runtime; session handler correctly targets Native role only. |

Comments Outside Diff (1)

  1. app/arcbox-core/src/runtime.rs, lines 652-705

    P1 Rosetta lifecycle not shut down through its own lifecycle manager

    shutdown() calls self.vm_lifecycle.shutdown() for the native VM then stops remaining machines directly via machine_manager.graceful_stop. The Rosetta lifecycle manager (stored in role_slots[Rosetta]) is never told to shut down — its health-monitor cancellation token is never cancelled and its in-memory state stays at Running/Idle rather than transitioning to Stopped. Any subsequent call to lifecycle_for_role(Rosetta).shutdown() would then hit the pre-existing wrong-machine-name bug on an already-stopped VM. The same gap exists in shutdown_force. Iterating role_slots and calling each slot's lifecycle.shutdown() directly would keep lifecycle state and event publishing consistent.

Reviews (4): Last reviewed commit: "chore(vmm): drop redundant clone in drif..."

Comment thread app/arcbox-core/src/vm_lifecycle/mod.rs
Comment thread app/arcbox-core/src/runtime.rs
Comment thread app/arcbox-docker/src/handlers/container.rs

@pullfrog pullfrog Bot left a comment


ℹ️ No critical issues — the findings are latent, scoped to the not-yet-routed Rosetta path. The Native runtime path (all that this PR exercises) is sound.

Reviewed changes — the ABX-375 pivot from multi-utility-VM routing to a single HV utility VM that runs linux/amd64 via in-guest FEX64, with VZ/Rosetta demoted to an opt-in build backend.

  • New routing.rs placement layer: WorkloadPlatform parsing maps to a RuntimeTranslator (Native/Fex64); RoutingDecision::utility_vm() always returns UtilityVmRole::Native, and is_admissible() fails closed for amd64 when FEX64 is absent rather than falling back to VZ/QEMU.
  • Per-role VM abstraction in Runtime: adds RoleSlot/role_slots, ensure_role_ready, machine_name_for_role, lifecycle_for_role, and amd64_runtime_supported. The Rosetta (VZ) slot is constructed lazily on macOS aarch64 but its VM only boots on ensure_role_ready(Rosetta), which nothing currently calls.
  • VmLifecycleManager parameterized per machine: new for_machine(name, data_image) constructor; new() delegates with DEFAULT_MACHINE_NAME/docker.img; DEFAULT_MACHINE_NAME replaced with &self.machine_name across the manager. Default-VM drift detection generalized via machine_drift_reason.
  • Role-aware Docker handlers: proxy_to_role, resolve_container_role, and a workload→role registry with alias tracking and fail-closed prefix-collision handling.
  • Per-machine inbound port forwarding: inbound_listeners keyed by machine name and inbound_rules carry the owning machine so teardown reaches the right listener.
  • Daemon resource wait: wait_for_resources now scans both docker.img and docker-rosetta.img (filtered by existence).
  • Assets + tooling: assets.lock bumped to boot-assets v0.5.10 (static FEX64); tests/fex/ validation harness + README added. The guest arm64.nosve/SME disable was added then reverted later in the branch (net no cmdline change).
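The fail-closed prefix-collision handling in the workload→role registry can be illustrated with an ordered map and a prefix scan. This is a sketch under assumptions: `Registry`, `Role`, and `Lookup` are hypothetical names, alias tracking is omitted, and the real `WorkloadRoleRegistry` may be structured differently.

```rust
use std::collections::BTreeMap;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Role {
    Native,
    Rosetta,
}

#[derive(Debug, PartialEq, Eq)]
pub enum Lookup {
    Found(Role),
    NotFound,
    Ambiguous,
}

/// Maps full workload IDs to roles; an ordered map keeps IDs that share
/// a prefix adjacent to each other.
#[derive(Default)]
pub struct Registry {
    roles: BTreeMap<String, Role>,
}

impl Registry {
    pub fn record(&mut self, id: &str, role: Role) {
        self.roles.insert(id.to_string(), role);
    }

    /// Resolves an exact ID first, then a short-ID prefix. A prefix that
    /// matches more than one ID is reported as ambiguous rather than
    /// guessing a role (fail closed).
    pub fn resolve(&self, key: &str) -> Lookup {
        if let Some(role) = self.roles.get(key) {
            return Lookup::Found(*role);
        }
        if key.is_empty() {
            return Lookup::NotFound;
        }
        let mut matches = self
            .roles
            .range(key.to_string()..)
            .take_while(|(id, _)| id.starts_with(key));
        match (matches.next(), matches.next()) {
            (Some((_, role)), None) => Lookup::Found(*role),
            (Some(_), Some(_)) => Lookup::Ambiguous,
            _ => Lookup::NotFound,
        }
    }
}
```

A handler built on this sketch would translate `Lookup::Ambiguous` into a conflict error that asks the caller for the full canonical ID, matching the `ambiguous_workload_error` message quoted earlier in the review.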

ℹ️ Runtime shutdown never tears down the Rosetta role slot's lifecycle

The per-role refactor constructs a Rosetta VmLifecycleManager into role_slots on Apple Silicon, but Runtime::shutdown() / shutdown_force() only operate on self.vm_lifecycle (the native lifecycle). The Rosetta slot's lifecycle is never told to shut down, so its health-monitor task, state machine, and MachineStopped event are skipped on daemon teardown.

This is latent today: nothing routes to Rosetta (utility_vm() is always Native), so the Rosetta VM never boots and there is nothing to tear down. It becomes a real resource/teardown gap the moment the Rosetta route is re-enabled (ABX-374). shutdown()/shutdown_force() are outside this PR's diff, which is why this is in the body rather than inline.

Technical details
# Runtime shutdown skips non-native role slots

## Affected sites
- `app/arcbox-core/src/runtime.rs`: `Runtime::shutdown()` (~line 659) and `shutdown_force()` (~line 727) call only `self.vm_lifecycle.shutdown()` / `.force_stop()`.
- `role_slots[UtilityVmRole::Rosetta].lifecycle` is never shut down.

## Required outcome
- On runtime shutdown, every configured role slot's lifecycle manager is shut down (or force-stopped), not just the native one — so the health monitor is cancelled, state transitions to `Stopped`/`NotExist`, and `MachineStopped` is published per machine.

## Suggested approach (optional)
- Iterate `self.role_slots.values()` and call `lifecycle.shutdown()` / `force_stop()` on each, deduplicating the native slot which is already covered by `self.vm_lifecycle`. The generic step-3 "stop non-default machines" loop terminates the VM process but does not drive the lifecycle state machine.

## Open questions for the human
- Is Rosetta intended to stay a fully cold/dead path on this branch (handoff to ABX-374), or should the teardown wiring land here so the slot is safe to activate later?
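The suggested approach can be sketched with a simplified, synchronous model in which the runtime owns one lifecycle per role and shuts all of them down. Everything here is hypothetical: the real `VmLifecycleManager` is async and does considerably more (health-monitor cancellation, event publishing), and the machine names used in any example are placeholders.

```rust
use std::collections::HashMap;

#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
pub enum Role {
    Native,
    Rosetta,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum State {
    Running,
    Stopped,
}

/// Stand-in for a per-machine lifecycle manager.
pub struct Lifecycle {
    pub machine_name: String,
    pub state: State,
}

impl Lifecycle {
    /// Drives the state machine to `Stopped`. A real implementation would
    /// also cancel its health monitor and publish a stop event.
    pub fn shutdown(&mut self) {
        self.state = State::Stopped;
    }
}

pub struct Runtime {
    pub role_slots: HashMap<Role, Lifecycle>,
}

impl Runtime {
    /// Shuts down every configured role slot, not just the native one.
    /// Returns the sorted names of the machines stopped by this call.
    pub fn shutdown_all(&mut self) -> Vec<String> {
        let mut stopped = Vec::new();
        for lifecycle in self.role_slots.values_mut() {
            if lifecycle.state == State::Running {
                lifecycle.shutdown();
                stopped.push(lifecycle.machine_name.clone());
            }
        }
        stopped.sort();
        stopped
    }
}
```

Because each lifecycle appears exactly once in the map in this model, no deduplication is needed; in the actual code the native lifecycle is also reachable through `self.vm_lifecycle`, so a real fix would have to avoid shutting it down twice.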


     DEFAULT_MACHINE_NAME
 );
-self.machine_manager.stop(DEFAULT_MACHINE_NAME)
+self.machine_manager.stop(&self.machine_name)


The per-machine rename is incomplete in shutdown(): the graceful_stop(DEFAULT_MACHINE_NAME, …) call (line 1105) and both warning logs (lines 1112, 1119) still use the "default" constant, while these force-stop fallbacks correctly use &self.machine_name. For a non-native lifecycle (the rosetta slot), shutdown() would graceful-stop the native machine and only force-stop the right one — and the diagnostics would name the wrong machine.

Latent today (nothing calls the rosetta lifecycle's shutdown()), but it's a clear gap in the rename this branch otherwise completed.

Technical details
# shutdown() still hardcodes DEFAULT_MACHINE_NAME

## Affected sites
- `app/arcbox-core/src/vm_lifecycle/mod.rs:1105`: `graceful_stop(DEFAULT_MACHINE_NAME, …)`
- `app/arcbox-core/src/vm_lifecycle/mod.rs:1112`, `:1119`: `tracing::warn!` interpolate `DEFAULT_MACHINE_NAME`

## Required outcome
- All three sites address `&self.machine_name`, consistent with the force-stop fallbacks at 1114/1122.

 if let Some(id) = extract_container_id(&uri) {
     let _ = state.runtime.ensure_vm_ready().await;
-    if let Some(body_bytes) = inspect_container_body(&state, &id).await {
+    if let Some(body_bytes) = inspect_container_body(&state, role, &id).await {


restart_container resolves role (line 201) but the best-effort wake on line 209 calls state.runtime.ensure_vm_ready(), which always wakes the native VM, while the subsequent inspect_container_body(&state, role, &id) proxies to the role's VM. The same pattern is in resolve_canonical_from_uri (line 531). Should be state.runtime.ensure_role_ready(role).await.

Latent today since every container resolves to Native, but it would wake the wrong VM once Rosetta is routed.

// Resolve role → machine/port via the runtime. Today both
// roles still alias to the default machine; once the dual VM
// lifecycle lands the rosetta branch returns its own machine
// name and dockerd port without any change here.


This comment is already inaccurate as of this PR: machine_name_for_role(Rosetta) returns "rosetta" (not the default machine) on macOS aarch64, since the rosetta slot is populated. The dual-VM lifecycle has landed; the comment describes the prior state. Consider dropping or correcting it.

@PeronGH PeronGH changed the title from "ABX-375: All-in-HV + FEX64 runtime (BLOCKED on M5/HVF phantom SVE — handoff)" to "ABX-375: All-in-HV + FEX runtime" Jun 10, 2026
PeronGH added 2 commits June 10, 2026 20:41
The interpreter has always been plain upstream FEX; FEX64 was an
invented name. Renames RuntimeTranslator::Fex64, needs_fex64(), the
fex64 tracing label, the fail-closed error message, and the validation
script (validate-fex64.sh -> validate-fex.sh, with its 'requires fex64'
match updated in the same change). Binary paths and the binfmt entry
were already correct.

Pre-existing clippy::redundant_clone on the last use of `current`.
@AprilNEA AprilNEA merged commit 684ce18 into master Jun 10, 2026
6 checks passed
@AprilNEA AprilNEA deleted the feat/fex64-hv-runtime-plan branch June 10, 2026 14:15