ABX-375: All-in-HV + FEX runtime #293
Amp-Thread-ID: https://ampcode.com/threads/T-019e68f7-9a35-745d-b9db-c8085864e5a7 Co-authored-by: Amp <amp@ampcode.com>
Records the chosen utility VM role against the canonical container ID
returned by `POST /containers/create` (and the exec ID returned by
`POST /containers/{id}/exec`) so every follow-up lifecycle call —
start, stop, kill, restart, rename, remove, inspect, logs, top, stats,
changes, wait, pause/unpause, attach, exec start/resize/inspect — is
proxied to the same role.
The registry is process-local; lookups for pre-existing or
post-restart workloads return `None` and callers fall back to the
native default. Durable persistence and strict fail-closed behavior
are deferred to a later slice once the connector layer actually
resolves each role to a distinct guest dockerd endpoint.
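The process-local behavior described above can be sketched roughly as follows. This is a minimal illustration, not the PR's `WorkloadRoleRegistry`: the `RoleRegistry` type and the `role_or_native` helper are hypothetical names, and only the record/lookup/fallback shape is taken from the commit message.

```rust
use std::collections::HashMap;
use std::sync::RwLock;

/// Which utility VM a workload is bound to (mirrors the roles named above).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum UtilityVmRole {
    Native,
    Rosetta,
}

/// Hypothetical in-memory map from canonical container or exec ID to role.
/// Nothing is persisted, so a daemon restart starts from an empty map.
#[derive(Default)]
pub struct RoleRegistry {
    roles: RwLock<HashMap<String, UtilityVmRole>>,
}

impl RoleRegistry {
    /// Records the role chosen when the container or exec was created.
    pub fn record(&self, id: &str, role: UtilityVmRole) {
        self.roles.write().unwrap().insert(id.to_string(), role);
    }

    /// Returns the recorded role, or `None` for pre-existing or
    /// post-restart workloads that this process never saw.
    pub fn lookup(&self, id: &str) -> Option<UtilityVmRole> {
        self.roles.read().unwrap().get(id).copied()
    }

    /// Caller-side fallback: unknown workloads go to the native default.
    pub fn role_or_native(&self, id: &str) -> UtilityVmRole {
        self.lookup(id).unwrap_or(UtilityVmRole::Native)
    }
}
```

A follow-up call such as `start` or `logs` would call `role_or_native(id)` and proxy to whichever role comes back.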
Address routing gaps that would surface once `native` and `rosetta`
resolve to distinct guest dockerds:
- WorkloadRoleRegistry now tracks container `--name` aliases (with
rename propagation) and resolves short hex IDs by canonical prefix,
so `docker start web` and `docker logs ab12c3` land on the same role
as the canonical entry instead of falling through to native.
- proxy_fallback resolves the role from the URI (container ID, then
exec ID, then native default), so unrouted endpoints like
`/containers/{id}/archive`, `/containers/{id}/update`, and
`/exec/{id}/resize` follow the workload's VM.
- Module docs reworded to make clear the registry tracks bindings
in-process rather than persisting them; durable persistence remains
out of scope.
BuildKit `/session` routing is intentionally not addressed in this
commit: the protocol opens `/session` before the matching `/build`
sets the platform, so a session-UUID lookup cannot be honored at
session-open time. A follow-up needs either lazy session forwarding
or a pending-session buffer keyed by `X-Docker-Expose-Session-Uuid`.
Same goes for per-role host port forwarding, which still uses
`runtime.default_machine_name()` and needs a Runtime API for role →
machine identity.
…rship consistent
Two correctness fixes on the workload role registry, before native and rosetta resolve to distinct guest dockerds:
- Short hex prefix lookup now collects every canonical that matches the requested prefix. If those canonicals agree on a role, that role is returned; if they disagree the registry returns `None` so callers fall back to the native default instead of silently choosing whichever HashMap iteration order surfaced first.
- WorkloadRoleRegistry gained an `alias_owner` reverse map. `add_alias` and `rename_alias` now detach an alias from any previous owner before reassigning it, so forgetting the previous canonical can no longer delete a binding that has since been adopted by another canonical. `forget` honors the reverse map symmetrically. Duplicate alias adds for the same canonical are deduped.
`query_param` docstring now flags that values are returned raw; the current callers only use ASCII-safe identifiers (`platform`, `name`), so percent-decoding stays deferred until something that may actually carry encoded bytes wants this helper.
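The `alias_owner` fix can be illustrated with a small sketch. The `AliasTable` type and its field layout are assumptions made for this example; only the behavior (detach before reassign, `forget` consulting the reverse map, deduped adds) comes from the commit message.

```rust
use std::collections::HashMap;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum UtilityVmRole {
    Native,
    Rosetta,
}

/// Hypothetical stand-in for the alias-tracking part of the registry.
#[derive(Default)]
pub struct AliasTable {
    /// Canonical ID or alias -> role.
    roles: HashMap<String, UtilityVmRole>,
    /// Canonical ID -> aliases it currently owns.
    aliases: HashMap<String, Vec<String>>,
    /// Alias -> canonical ID that currently owns it (the reverse map).
    alias_owner: HashMap<String, String>,
}

impl AliasTable {
    pub fn record(&mut self, canonical: &str, role: UtilityVmRole) {
        self.roles.insert(canonical.to_string(), role);
    }

    pub fn add_alias(&mut self, canonical: &str, alias: &str) {
        let Some(role) = self.roles.get(canonical).copied() else {
            return; // unknown canonical: nothing to bind the alias to
        };
        // Detach the alias from any previous owner before reassigning it.
        if let Some(prev) = self.alias_owner.remove(alias) {
            if let Some(list) = self.aliases.get_mut(&prev) {
                list.retain(|a| a != alias);
            }
        }
        self.roles.insert(alias.to_string(), role);
        self.alias_owner
            .insert(alias.to_string(), canonical.to_string());
        let list = self.aliases.entry(canonical.to_string()).or_default();
        if !list.iter().any(|a| a == alias) {
            list.push(alias.to_string()); // duplicate adds are deduped
        }
    }

    pub fn forget(&mut self, canonical: &str) {
        self.roles.remove(canonical);
        for alias in self.aliases.remove(canonical).unwrap_or_default() {
            // Only drop aliases that this canonical still owns.
            if self.alias_owner.get(&alias).map(String::as_str) == Some(canonical) {
                self.alias_owner.remove(&alias);
                self.roles.remove(&alias);
            }
        }
    }

    pub fn lookup(&self, key: &str) -> Option<UtilityVmRole> {
        self.roles.get(key).copied()
    }
}
```

With this shape, a name reused by a newer container survives the removal of the older container that once held it.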
PLAN.md step 3. Lift UtilityVmRole into arcbox-core (workload module) so it is the shared currency between the daemon, runtime, and Docker compatibility layer; arcbox-docker now re-exports it.
Runtime gains three role-keyed accessors:
- `ensure_role_ready(role)`: role-aware boot/ready hook. Both roles resolve to the existing default VM today; the rosetta branch diverges once the dual lifecycle lands.
- `machine_name_for_role(role)`: machine name to address.
- `guest_docker_vsock_port_for_role(role)`: dockerd vsock port.
VsockConnector::connect_for(role) consumes the new lookups so the machine + port chosen for every connection follows the requested role. The Docker handler and fallback paths now drive the role-aware ensure hook and bubble the role into ensure errors, so a failure on the rosetta VM surfaces as such instead of a generic native error.
No behavior change yet: the lookups still alias both roles to the default VM. The seam is now in place for the dual-VM lifecycle slice.
PLAN.md step 4 prep. Threads the machine name and persistent dockerd data image through VmLifecycleManager so a single struct can drive either the default native machine or a secondary VZ Rosetta machine.
- New for_machine() constructor that takes the machine name and the docker.img filename. The existing new() delegates to it with the default values so all current callers (Runtime, daemon startup) behave identically.
- Internal create_default_machine / start_default_vm / wait_for_agent / idle monitor / event payloads now use self.machine_name instead of DEFAULT_MACHINE_NAME so a rosetta lifecycle reports the right machine in events and logs.
- data_image_path() yields the absolute path of this manager's docker.img, replacing the hard-coded DOCKER_DATA_IMAGE_NAME join in create_default_machine.
No behavior change: callers still construct one lifecycle on the "default" machine. Adding a second VmLifecycleManager for the rosetta role and wiring it into Runtime is the next slice.
PLAN.md step 4 prep. Replaces the hard-coded `VmBackend::Hv` in `VmManager::build_vmm_config` with a per-machine backend so the rosetta utility VM can run on VZ while the native one keeps running on HV.
- `VmConfig` gains a `backend: VmBackend` field defaulting to `Hv`. `build_vmm_config` now reads from it.
- `MachineConfig` gains matching `backend` and `enable_rosetta` fields so callers can set them at create-time. `MachineManager::create` threads both into the underlying `VmConfig`.
- `VmLifecycleConfig` gains `backend` so each per-role lifecycle decides its own backend. The `create_default_machine` path now feeds it (plus `default_vm.rosetta`) into the `MachineConfig` it builds.
- `arcbox-core` re-exports `VmBackend` so downstream crates (`arcbox-api`'s gRPC machine handler) can construct `MachineConfig` without taking a direct `arcbox-vmm` dependency.
Existing single-VM behavior is preserved: every constructor and default keeps `backend = Hv`. The rosetta lifecycle starts using `Vz` when the dual-VM Runtime wiring lands.
PLAN.md step 4. Runtime now builds a per-role lifecycle slot at construction time:
- Native (HV) slot: always present, drives the existing default machine and stays the eager-started utility VM.
- Rosetta (VZ) slot: present only on macOS Apple Silicon. Built with machine name "rosetta", docker-rosetta.img as its persistent data image, VmBackend::Vz, and default_vm.rosetta=true. The lifecycle is constructed up front so the slot's state is addressable, but the VM itself stays cold until `ensure_role_ready(Rosetta)` is first called by the Docker layer.
Role-keyed accessors now read from the slot map:
- `ensure_role_ready(role)` drives the role's container backend.
- `machine_name_for_role(role)` / `guest_docker_vsock_port_for_role(role)` return the slot's machine name and dockerd port.
- `lifecycle_for_role(role)` exposes the per-role lifecycle for diagnostics and future shared-control-plane wiring.
- `role_is_distinct(role)` lets callers tell whether a role has its own slot or is aliasing onto native on this platform.
The pre-existing `vm_lifecycle` / `container_backend` fields are kept and pinned to the native slot so the daemon-wide flows (Kubernetes, shutdown) keep behaving identically. When a role is not configured on the host (e.g. rosetta on non-Apple-Silicon), the slot lookup falls back to native so the Docker layer keeps working as a single-VM setup.
Daemon startup wait-for-resources still waits only on the native docker.img; per-role XPC holder handling lands in the shared control plane slice.
PLAN.md step 5 (partial). Replaces the single inbound listener slot in Runtime with a per-machine map so each utility VM owns its own listener, then teaches the Docker handler to bind a container's published ports against the role the container was created on.
- Runtime now holds `inbound_listeners: HashMap<String, InboundListenerManager>` keyed by machine name and tracks the per-container rules as `(machine_name, rules)` so teardown reaches the correct listener even when the container migrated roles. `start_port_forwarding_macos`, `stop_port_forwarding_by_id`, and `stop_port_forwarding_all` follow the per-machine map.
- The Docker `setup_port_forwarding_from_inspect` path now resolves the machine name via `runtime.machine_name_for_role(role)` instead of `default_machine_name`, so an `amd64` container running on the rosetta VM lands on the rosetta bridge's listener.
Existing single-VM deployments are unaffected: when only the native slot is configured every container ends up on the same machine, and the old single-listener behavior reduces to one map entry.
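A rough sketch of the per-machine bookkeeping described above. `ListenerManager` is a hypothetical stand-in for `InboundListenerManager` that just tracks port numbers, and `PortForwarding` is an invented wrapper; the point is only that each container's rules remember the machine they were installed on.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for `InboundListenerManager`: it only tracks
/// which host ports are currently open for one machine.
#[derive(Default)]
pub struct ListenerManager {
    open_ports: Vec<u16>,
}

/// Hypothetical wrapper around the two maps named in the commit message.
#[derive(Default)]
pub struct PortForwarding {
    /// Machine name -> that machine's listener.
    inbound_listeners: HashMap<String, ListenerManager>,
    /// Container ID -> (owning machine name, ports opened for it).
    inbound_rules: HashMap<String, (String, Vec<u16>)>,
}

impl PortForwarding {
    /// Opens `ports` on the listener that belongs to `machine_name`.
    pub fn start(&mut self, container_id: &str, machine_name: &str, ports: Vec<u16>) {
        let listener = self
            .inbound_listeners
            .entry(machine_name.to_string())
            .or_default();
        listener.open_ports.extend(ports.iter().copied());
        self.inbound_rules
            .insert(container_id.to_string(), (machine_name.to_string(), ports));
    }

    /// Tears down a container's ports on the machine recorded at start
    /// time, not on whatever the current default machine happens to be.
    pub fn stop_by_id(&mut self, container_id: &str) {
        let Some((machine_name, ports)) = self.inbound_rules.remove(container_id) else {
            return;
        };
        if let Some(listener) = self.inbound_listeners.get_mut(&machine_name) {
            listener.open_ports.retain(|p| !ports.contains(p));
        }
    }

    pub fn open_ports(&self, machine_name: &str) -> Vec<u16> {
        self.inbound_listeners
            .get(machine_name)
            .map(|l| l.open_ports.clone())
            .unwrap_or_default()
    }
}
```

In a single-VM deployment every call passes the same machine name, so the map holds one entry and behaves like the old single slot.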
…roles
PLAN.md step 5. Two host-side coordination changes needed before a real dual VM deployment is safe.
- daemon `wait_for_resources` now scans every persistent dockerd image owned by a configured utility VM role (native `docker.img`, rosetta `docker-rosetta.img`) so a stale VZ XPC holder on either image is drained before `init_runtime` brings up either VM. Same 10-second bound applies per image.
- Docker handler `ensure_role_ready` refuses requests for a role that is not configured on this host (e.g. Rosetta on non-Apple-Silicon) with a clear platform-specific error rather than silently falling back to native. Silent fallback would land a `linux/amd64` workload on the HV native VM that cannot translate x86_64, with no useful diagnostic. The native default remains the fallback for Rosetta requests that fail open elsewhere; this only short-circuits the case where Rosetta is definitively unsupported by the host.
PLAN.md step 7. Compose-managed containers carry
`com.docker.compose.project` on every service; ArcBox now uses that
label to pin every service in a project to the same utility VM role
so DNS, port forwarding, and volume sharing remain coherent within a
project.
- `WorkloadRoleRegistry` gains `project_role(name)` and
`record_project(name, role)`. Bindings are sticky across compose
up/down cycles to keep group routing predictable.
- `UtilityVmRoleExt::can_host(platform)` codifies which roles can
accept which platforms: rosetta hosts both arm64 and amd64; native
refuses amd64. Used by the create handler to decide whether the
next service in a project is compatible.
- `extract_compose_project(body)` reads the project label from a
container-create payload.
- `create_container` now:
1. Parses the routing decision from platform metadata.
2. Reads the compose project label.
3. If the project is already bound, uses that role when compatible
and rejects with a 400 ("mixed-backend compose projects are not
supported") otherwise.
4. If not yet bound, records the first service's role as the
project's role.
Containers without a compose label retain the per-container behavior.
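The four `create_container` steps above amount to a first-service-wins binding. A minimal sketch, assuming a plain `HashMap` in place of the registry's `project_role` / `record_project` pair and an inherent `can_host` method in place of the `UtilityVmRoleExt` trait; `Platform` and `role_for_service` are hypothetical names.

```rust
use std::collections::HashMap;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum UtilityVmRole {
    Native,
    Rosetta,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Platform {
    Arm64,
    Amd64,
}

impl UtilityVmRole {
    /// Rosetta hosts both arm64 and amd64; native refuses amd64.
    pub fn can_host(self, platform: Platform) -> bool {
        match (self, platform) {
            (UtilityVmRole::Rosetta, _) => true,
            (UtilityVmRole::Native, Platform::Arm64) => true,
            (UtilityVmRole::Native, Platform::Amd64) => false,
        }
    }
}

/// Picks the role for one service, binding the project on first use.
pub fn role_for_service(
    projects: &mut HashMap<String, UtilityVmRole>,
    project: Option<&str>,
    platform: Platform,
) -> Result<UtilityVmRole, String> {
    // Step 1: routing decision from platform metadata.
    let requested = match platform {
        Platform::Amd64 => UtilityVmRole::Rosetta,
        Platform::Arm64 => UtilityVmRole::Native,
    };
    // Step 2: no compose label means plain per-container behavior.
    let Some(name) = project else {
        return Ok(requested);
    };
    match projects.get(name).copied() {
        // Step 3: already bound, reuse when compatible, else reject (400).
        Some(bound) if bound.can_host(platform) => Ok(bound),
        Some(_) => Err("mixed-backend compose projects are not supported".to_string()),
        // Step 4: the first service decides the project's role.
        None => {
            projects.insert(name.to_string(), requested);
            Ok(requested)
        }
    }
}
```

Note the asymmetry this produces: an arm64-first project later rejects an amd64 service, while an amd64-first project accepts arm64 services on rosetta.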
PLAN.md step 2 follow-up. The Docker CLI opens `/session` before it sends the matching `/build` that carries platform metadata, so the session role cannot be derived at session-open time. ArcBox forwards `/session` to the native (HV) utility VM by default; an `amd64` build that needs Rosetta-side BuildKit features will not see this session and side channels (secrets, ssh, build mounts) will fail. Adds a doc comment explaining the limitation and a debug log of the session UUID so operators can correlate `/session` and `/build` forwarding. Routing both endpoints together requires lazy session forwarding keyed by `X-Docker-Expose-Session-Uuid`; left as a follow-up rather than landed here, since correctly buffering the HTTP/1.1 upgrade until `/build` arrives is substantial work that deserves its own change.
…after daemon restart
Closes the two items previously deferred under PLAN.md step 2.
BuildKit /session role routing:
- `WorkloadRoleRegistry` gains `wait_for_role(key, max_wait)` backed by
a `tokio::sync::Notify`; `record(...)` fires the signal.
- `build_image` records `X-Docker-Expose-Session-Uuid → role` so the
parallel `/session` request can be routed coherently.
- `session()` reads the same UUID and parks on `wait_for_role` for up
to 30 seconds (matches BuildKit's own session-handshake timeout),
forwarding the upgrade to the role declared by `/build`. Both
ordering races (`/build`-first or `/session`-first) resolve
correctly. On timeout, `/session` forwards to native so the user
sees a BuildKit-level error rather than a hung connection.
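The `/build` and `/session` handshake above is a wait-with-timeout on a shared map. The commit uses `tokio::sync::Notify`; the sketch below swaps in a blocking `std::sync::Condvar` so it runs without an async runtime, and `SessionRoles` is a hypothetical name. It shows the three paths the tests mention: cache hit, late record wakeup, and timeout.

```rust
use std::collections::HashMap;
use std::sync::{Condvar, Mutex};
use std::time::Duration;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum UtilityVmRole {
    Native,
    Rosetta,
}

/// Hypothetical blocking analogue of the registry's session-role map.
pub struct SessionRoles {
    roles: Mutex<HashMap<String, UtilityVmRole>>,
    recorded: Condvar,
}

impl SessionRoles {
    pub fn new() -> Self {
        SessionRoles {
            roles: Mutex::new(HashMap::new()),
            recorded: Condvar::new(),
        }
    }

    /// Called from the `/build` side once the platform is known.
    pub fn record(&self, key: &str, role: UtilityVmRole) {
        self.roles.lock().unwrap().insert(key.to_string(), role);
        self.recorded.notify_all(); // wake any parked `/session` waiter
    }

    /// Called from the `/session` side. Returns immediately on a cache
    /// hit, wakes on a late `record`, and yields `None` on timeout.
    pub fn wait_for_role(&self, key: &str, max_wait: Duration) -> Option<UtilityVmRole> {
        let guard = self.roles.lock().unwrap();
        let (guard, _timeout) = self
            .recorded
            .wait_timeout_while(guard, max_wait, |roles| !roles.contains_key(key))
            .unwrap();
        guard.get(key).copied()
    }
}
```

Because the predicate is checked under the lock before parking, it does not matter whether `/build` or `/session` arrives first.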
Cross-restart durability via lazy guest-probe rebuild:
- `resolve_container_role` and `resolve_role_from_uri` now treat a
registry miss as a recovery signal: they probe each configured
role's guest dockerd with `GET /containers/{id}/json`, accept the
first 200 as the workload's role, and re-record it. Native is
always probed first because it's already up; the Rosetta probe
triggers lazy startup on first miss after a restart, which is
exactly the recovery behavior we want for surviving rosetta
workloads.
- No on-disk schema is introduced; correctness is recovered from the
guest dockerds, which already persist their own container state.
- The dead `proxy_upgrade()` helper is dropped — all upgrade paths
now go through `proxy_upgrade_to_role`.
Tests cover the three `wait_for_role` paths (cache hit, late record
wakeup, timeout), bringing the docker-lib suite to 110 passing.
Closes the routing-correctness gaps surfaced by review: the
Missing/Ambiguous conflation, the first-hit-wins rebuild, and the
silent /session timeout fallback. Every place that previously
collapsed "ambiguous" into "missing" and quietly fell back to native
now surfaces a Docker-compatible 4xx instead.
WorkloadRoleRegistry::lookup returns a new WorkloadRoleLookup
{Found(role), Missing, Ambiguous} tri-state. Cross-VM short-ID
collisions report Ambiguous (previously None). wait_for_role
propagates the same shape, and BuildKit /session no longer routes a
timed-out session to native — it returns 400 with a clear message
naming the UUID, since silently attaching the upgraded gRPC stream
to the wrong VM would just leak the misroute into BuildKit's session
layer.
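The tri-state lookup can be sketched as below. The enum variants come from the commit message; the free function `lookup`, the plain `HashMap` argument, and the choice to let same-role prefix collisions still resolve (as in the earlier prefix-lookup commit) are assumptions of this example.

```rust
use std::collections::HashMap;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum UtilityVmRole {
    Native,
    Rosetta,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum WorkloadRoleLookup {
    Found(UtilityVmRole),
    Missing,
    Ambiguous,
}

/// Resolves a full ID or short hex prefix against canonical entries.
pub fn lookup(roles: &HashMap<String, UtilityVmRole>, id: &str) -> WorkloadRoleLookup {
    // An exact canonical match always wins.
    if let Some(role) = roles.get(id) {
        return WorkloadRoleLookup::Found(*role);
    }
    // Otherwise collect every canonical the prefix matches and require
    // that they agree on a role, independent of iteration order.
    let mut matched: Option<UtilityVmRole> = None;
    for (canonical, role) in roles {
        if !canonical.starts_with(id) {
            continue;
        }
        match matched {
            None => matched = Some(*role),
            Some(prev) if prev != *role => return WorkloadRoleLookup::Ambiguous,
            Some(_) => {}
        }
    }
    matched.map_or(WorkloadRoleLookup::Missing, WorkloadRoleLookup::Found)
}
```

A caller would map `Ambiguous` to an error response rather than picking a default, which is the behavior change this commit describes.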
resolve_container_role / resolve_exec_role / resolve_role_from_uri
all return Result<UtilityVmRole>. Ambiguous identifiers (registry
prefix collision *or* multi-guest probe match) surface as 409
Conflict via a new ambiguous_workload_error helper. Macros and the
catch-all proxy fallback propagate the Result via `?`.
rebuild_container_role_from_guests now probes Native AND Rosetta
unconditionally and collects every hit before deciding. Zero hits =>
Missing, one hit => Found, more than one => Ambiguous. Returning on
the first match was a silent-misroute bug for cross-VM short-ID
collisions after a daemon restart — fixed.
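The collect-then-decide rule reads naturally as a small function. In this sketch the guest probe (`GET /containers/{id}/json` in the PR) is abstracted into a caller-supplied closure, and `rebuild_role_from_probes` is a hypothetical name, not the PR's `rebuild_container_role_from_guests`.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum UtilityVmRole {
    Native,
    Rosetta,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum WorkloadRoleLookup {
    Found(UtilityVmRole),
    Missing,
    Ambiguous,
}

/// Probes every role and only then decides. `probe(role, id)` is an
/// assumed callback that returns true when that role's guest dockerd
/// knows the container.
pub fn rebuild_role_from_probes<F>(id: &str, mut probe: F) -> WorkloadRoleLookup
where
    F: FnMut(UtilityVmRole, &str) -> bool,
{
    let mut hits = Vec::new();
    // No early return: stopping at the first match is exactly the
    // silent-misroute bug the commit fixes.
    for role in [UtilityVmRole::Native, UtilityVmRole::Rosetta] {
        if probe(role, id) {
            hits.push(role);
        }
    }
    match hits.as_slice() {
        [] => WorkloadRoleLookup::Missing,
        [only] => WorkloadRoleLookup::Found(*only),
        _ => WorkloadRoleLookup::Ambiguous,
    }
}
```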
Compose project scheduling docs/comments updated to call the current
behavior what it actually is: "first-service-wins binding with
mixed-VM rejection". PLAN's stronger "any amd64 service → whole
project rosetta" requires reading the full compose file before any
service is created, which is out of scope for a per-API-request
routing layer; the limitation is documented in code and in PLAN.md
rather than papered over.
All 110 lib tests pass; existing prefix-collision test renamed to
cross_role_prefix_collision_is_ambiguous and updated to assert the
new Ambiguous outcome.
Host-driven validation script + README covering PLAN.md Decision Gates A/B/C: arm64 native + amd64-via-FEX64 `uname -m`, representative amd64 images (musl/glibc/busybox/node/python/go/apt), exit-status and stderr propagation, BuildKit amd64 build, and a mixed arm64/amd64 Compose project staying in the single HV VM. Records an environment header (macOS version, guest kernel, FEX version, binfmt status, arcbox commit) for reproducibility, and tags every check PASS/FAIL/UNSUPPORTED/INFRA so a real gate failure is distinguishable from a setup problem. The harness is executed by a developer on Apple Silicon — it cannot run where the daemon can't boot a VM. README documents the FEX-at-/arcbox/bin/FEX contract (registered by the boot-assets rootfs init) and the hardware-TSO go/no-go probe.
… (ABX-375)
Pivots the default runtime to a single HV utility VM. Platform no
longer selects a utility VM role; it selects an in-guest translator:
- routing.rs: `RuntimeTranslator { Native, Fex64 }`; amd64 →
Fex64, arm64/unspecified → Native. `RoutingDecision` carries the
translator and always resolves the workload to the single HV VM
(`utility_vm()` → Native). `is_admissible(decision, fex64_available)`
is the fail-closed gate. Drops the dual-VM-only helpers
(`utility_vm_role`, `UtilityVmRoleExt::can_host`,
`extract_compose_project`).
- handlers: `require_amd64_runtime` rejects amd64 with a clear
"requires FEX64 in the HV guest" error when FEX64 is not provisioned
— never silently falling back to VZ/Rosetta or QEMU. `create_container`
drops Compose project-role binding (single VM needs none) and the
per-request build/session role machinery is removed; `/build` and
`/session` go to the one HV VM. The registry rebuild probes only the
HV VM so a lifecycle lookup can never boot the VZ build backend.
- core: `Runtime::amd64_runtime_supported()` returns whether
`<data_dir>/bin/FEX` (guest `/arcbox/bin/FEX`) is present — the same
artifact whose presence makes the boot-assets rootfs init register
the x86_64 binfmt handler, so host admission and guest registration
share one signal.
- workload.rs: drop the Compose-project and BuildKit `/session`
role-sync machinery (unnecessary in single-VM); the registry now
only maps IDs/aliases → role and fails closed on ambiguity.
VZ/Rosetta is demoted, not deleted: the role enum, slot, and ABX-374
machinery remain as the preserved fallback / future explicit build
backend, but the runtime path never selects Rosetta or boots the VZ
VM. Routing/admission unit tests updated (amd64 → fex64 translator,
amd64 fail-closed without FEX64, native always admissible).
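A compact sketch of the single-VM routing and admission gate described above. The type and function names follow the commit message, but the field layout, the `for_platform` constructor, and the two-variant `WorkloadPlatform` are assumptions; other platforms are not modeled.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum UtilityVmRole {
    Native,
    Rosetta,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum WorkloadPlatform {
    Arm64,
    Amd64,
}

/// In-guest translator selected by the requested platform.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum RuntimeTranslator {
    Native,
    Fex64,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct RoutingDecision {
    pub translator: RuntimeTranslator,
}

impl RoutingDecision {
    /// amd64 selects Fex64; arm64 or an unspecified platform selects Native.
    pub fn for_platform(platform: Option<WorkloadPlatform>) -> Self {
        let translator = match platform {
            Some(WorkloadPlatform::Amd64) => RuntimeTranslator::Fex64,
            Some(WorkloadPlatform::Arm64) | None => RuntimeTranslator::Native,
        };
        RoutingDecision { translator }
    }

    /// Every workload lands on the single HV utility VM.
    pub fn utility_vm(&self) -> UtilityVmRole {
        UtilityVmRole::Native
    }
}

/// Fail-closed gate: amd64 is admitted only when FEX64 is provisioned.
pub fn is_admissible(decision: &RoutingDecision, fex64_available: bool) -> bool {
    match decision.translator {
        RuntimeTranslator::Native => true,
        RuntimeTranslator::Fex64 => fex64_available,
    }
}
```

A handler in the spirit of `require_amd64_runtime` would return its "requires FEX64 in the HV guest" error whenever `is_admissible` is false.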
Running the harness against a live arcbox daemon exposed a misclassification: amd64 `exec format error` (no x86_64 binfmt handler) and the ABX-375 fail-closed error were tagged as a Gate-A FAIL, which per the goal would wrongly trigger "resume ABX-374". That is the FEX64-*unavailable* state (interpreter not provisioned), not a FEX64 gate failure. Only FEX64 actually running and mis-executing (wrong arch / crash) is a real FAIL.
The harness now:
- captures stderr and distinguishes "not provisioned" (exec format error / requires-FEX64 / binfmt / missing interpreter) → INFRA, sets an `amd64_blocked` flag;
- reserves FAIL for FEX64 running but returning the wrong result;
- reports RESULT: BLOCKED (exit 2) with explicit "do not resume ABX-374 on this basis" guidance when amd64 is unprovisioned;
- applies the same distinction to the Gate B image matrix.
Also corrects the README Docker context endpoint to unix:///<home>/.arcbox/run/docker.sock. Verified against the live `arcbox` context: arm64 PASS, amd64 now INFRA (FEX64 not provisioned) instead of a false FAIL.
…t path
ArcBox installs boot-manifest runtime binaries to <data_dir>/runtime/bin (guest /arcbox/runtime/bin, via prepare_binaries), the same set the guest runs dockerd/containerd from. boot-assets v0.5.8+ registers the FEX binfmt handler at /arcbox/runtime/bin/FEX. Align the host-side amd64_runtime_supported() fail-closed gate to <data_dir>/runtime/bin/FEX (was <data_dir>/bin/FEX, which prepare_binaries never populates), and fix the amd64-unavailable error message to name the correct path. Pairs with boot-assets fix/fex-static-runtime-paths (static FEX + /arcbox/runtime/bin/FEX binfmt path).
…provisioned
Match the FEX runtime path to the boot-assets binfmt registration (/arcbox/runtime/bin/FEX). Also short-circuit to a BLOCKED summary when Gate A finds amd64 unprovisioned, so the runtime/build/compose amd64 sub-checks don't emit misleading FAIL lines and the verdict stays "decision pending" rather than falsely triggering "resume ABX-374".
Bump [boot] to v0.5.9, which ships the statically-linked FEX64 x86_64 interpreter (FEX/FEXServer, arm64) staged at bin/FEX/... in the manifest. The host syncs it to <data_dir>/runtime/bin/FEX, shared into the guest as /arcbox/runtime/bin/FEX — the path the rootfs init registers as the x86_64 binfmt_misc handler and that amd64_runtime_supported() probes. Static linking makes FEX usable as a binfmt interpreter inside OCI container mount namespaces (no external loader/library closure to resolve against the container rootfs). Update manifest_sha256 to the published v0.5.9 manifest so the boot-time integrity check passes. Also correct two stale comments that listed the synced runtime binaries without FEX: prepare_binaries downloads every manifest binary, and FEX is optional (absent FEX does not block boot; amd64 then fails closed).
On Apple Silicon under HVF (observed on M5 Max / macOS 26) the guest is advertised SVE feature bits it cannot execute: /proc/cpuinfo reports `sve2 sve2p1 svebf16 ...`, yet a single SVE instruction (e.g. `rdvl`) SIGILLs. Userspace that trusts HWCAP_SVE then crashes — glibc's ifunc-selected SVE memcpy/str* and FEX64's SVE paths. Append `arm64.nosve arm64.nosme` to the default machine's kernel cmdline so the guest kernel ignores the phantom features and userspace falls back to NEON. Verified live: guest then reports 0 SVE features and `rdvl` is no longer selected by HWCAP-gated code. Note: this alone does NOT make FEX64 amd64 work — FEX-2605 also emits unconditional explicit SVE that traps regardless (see ABX-375 handoff).
v0.5.9 shipped a FEX that SIGILLs on Apple Silicon (compiler-emitted SVE) and required a FEXServer unreachable in container namespaces. v0.5.10 ships the fixed static-pie FEX (no SVE codegen, runs standalone), so amd64 containers route through FEX64 and run. Manifest verified: FEX present (arm64, FEX-2605), no FEXServer.
514f36b appended `arm64.nosve arm64.nosme` to the default machine's kernel cmdline on the theory that HVF advertised the guest phantom SVE feature bits it could not execute. That diagnosis was wrong: the real cause was the FEX binary being built with `-mcpu=native` on an SVE-capable host, so FEX's own codegen emitted unconditional SVE that trapped. That is fixed in the FEX build shipped in boot-assets 0.5.10 (ab3a218). With the root cause fixed at the build level, the cmdline workaround is dead weight — and unconditional (no model/macOS-version/flag gate), so it would silently force NEON fallback and lose SVE-backed glibc ifuncs and FEX translation paths on hardware where SVE works correctly.
The 0.5.10 FEX build carries a small patch that strips the FEXServer requirement and runs purely as a binfmt_misc interpreter, so no server process needs to be reachable across container mount namespaces. Drop the stale "FEX/FEXServer" pairing from the boot-asset comments, and correct the harness to probe the actual binary at /arcbox/runtime/bin/FEX (not the upstream FEXInterpreter name).
…est setup_fex()
The harness README credited a guest-agent setup_fex() for the x86_64 binfmt_misc registration, but no such function exists in this repo; the only binfmt code in arcbox-agent is for Rosetta. Per the boot-assets repo, the rootfs /sbin/init trampoline checks for /arcbox/runtime/bin/FEX and registers the handler with POCF flags. Fix the description and the scope note (runtime.rs already described it correctly).
The default-VM drift check compared the persisted kernel only against the boot-asset version string, so a `--kernel` override that kept the same boot-asset version was ignored and the stale VM was reused with its old kernel. Compare the persisted kernel against the resolved desired kernel path (custom `--kernel` override, else the versioned cache path) so a kernel change is detected even when the boot-asset version is unchanged.
Apple SME cores (e.g. M4 Pro) advertise SME — and SME-derived SVE — to the guest, but plain non-streaming SVE cannot execute on this silicon (a bare `rdvl` SIGILLs on the host). FEX's x86-64 JIT detects the feature by reading ID_AA64PFR1_EL1/ID_AA64PFR0_EL1 directly and emits SVE that traps, so amd64 containers SIGILL. `arm64.nosve` does not help — it only clears HWCAP, which FEX ignores. Clear the SME field [27:24] of the guest's ID_AA64PFR1_EL1 at vCPU init using the get/modify/set_sys_reg pattern QEMU's HVF backend uses to sanitize guest ID registers, presenting a NEON-only guest like Virtualization.framework. This eliminates the SIGILL; the remaining amd64 allocator fault is unrelated.
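The register edit itself is a single mask. The sketch below covers only the bit manipulation on a value already read from `ID_AA64PFR1_EL1`; the surrounding get/set calls against the hypervisor are omitted, and the constant and function names are hypothetical.

```rust
/// Bit position of the SME field, bits [27:24] of ID_AA64PFR1_EL1.
const ID_AA64PFR1_SME_SHIFT: u32 = 24;
/// Four-bit mask covering the SME field.
const ID_AA64PFR1_SME_MASK: u64 = 0xF << ID_AA64PFR1_SME_SHIFT;

/// Returns the register value with the SME field zeroed, which reads as
/// "SME not implemented" to code that inspects the ID register directly.
/// All other fields are left untouched.
pub fn clear_sme_field(id_aa64pfr1: u64) -> u64 {
    id_aa64pfr1 & !ID_AA64PFR1_SME_MASK
}
```

In the get/modify/set pattern the commit describes, this function would be the "modify" step, applied once per vCPU at init.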
…le fields
Drift detection previously checked only cpus/memory and the kernel path inline, so cmdline changes (e.g. the guest docker vsock port, or an `arm64.nosve` toggle) silently reused the stale VM. Extract the kernel + cmdline resolution into a shared `resolve_desired_boot` used by both `create_default_machine` and the drift check, so the comparison can never diverge from what would be created, and add `machine_drift_reason` as the single place that compares every overridable persisted field (cpus, memory_mb, kernel, cmdline). Covered by a regression test.
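A minimal sketch of a single comparison point for drift. The four fields are the ones named in the commit message; the `BootParams` struct, the function signature, and the reason strings are assumptions made for this example.

```rust
/// Hypothetical bundle of the overridable fields that get persisted.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct BootParams {
    pub cpus: u32,
    pub memory_mb: u64,
    pub kernel: String,
    pub cmdline: String,
}

/// Returns a human-readable reason when the persisted machine no longer
/// matches what would be created now, or `None` when it can be reused.
pub fn machine_drift_reason(persisted: &BootParams, desired: &BootParams) -> Option<String> {
    if persisted.cpus != desired.cpus {
        return Some(format!("cpus changed: {} -> {}", persisted.cpus, desired.cpus));
    }
    if persisted.memory_mb != desired.memory_mb {
        return Some(format!(
            "memory_mb changed: {} -> {}",
            persisted.memory_mb, desired.memory_mb
        ));
    }
    if persisted.kernel != desired.kernel {
        return Some(format!("kernel changed: {} -> {}", persisted.kernel, desired.kernel));
    }
    if persisted.cmdline != desired.cmdline {
        return Some(format!("cmdline changed: {:?} -> {:?}", persisted.cmdline, desired.cmdline));
    }
    None
}
```

Feeding `desired` from the same resolver that machine creation uses is what keeps the check from diverging from what would actually be created.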
… SME cores"
This reverts commit c89de9d.
Root cause found — it's
Pull request overview
Note
Copilot was unable to run its full agentic suite in this review.
This PR advances ABX-375 by routing linux/amd64 runtime workloads through FEX64 (binfmt_misc) inside the single HV utility VM, while making the proxy/lifecycle stack role-aware so per-workload follow-ups consistently hit the same utility VM role (supporting the preserved ABX-374 dual-VM fallback).
Changes:
- Add a reproducible local FEX64 validation harness (`tests/fex/*`) and update boot-assets pin in `assets.lock`.
- Introduce role-aware routing + workload role registry in `arcbox-docker` (container/exec/build/fallback proxying now selects a utility VM role deterministically).
- Extend `arcbox-core` VM lifecycle/runtime to support per-role machine identity (machine name, data image) and per-machine hypervisor backend selection (HV vs VZ).
Reviewed changes
Copilot reviewed 25 out of 26 changed files in this pull request and generated 11 comments.
| File | Description |
|---|---|
| tests/fex/validate-fex64.sh | Adds a local gate-based harness to validate FEX64 behavior for linux/amd64 under HV. |
| tests/fex/README.md | Documents how to run/interpret the FEX64 harness and the decision gates. |
| assets.lock | Updates boot-assets version/manifest pin used to provision guest runtime binaries. |
| app/arcbox-docker/src/workload.rs | Adds an in-process registry mapping workload IDs/aliases to utility VM roles with ambiguity handling. |
| app/arcbox-docker/src/routing.rs | Introduces platform parsing and a translator concept (native vs FEX64) for single-HV routing decisions. |
| app/arcbox-docker/src/proxy/upload.rs | Adds role-selectable streaming upload forwarding into guest dockerd. |
| app/arcbox-docker/src/proxy/upgrade.rs | Adds role-selectable upgrade proxying (exec attach/session) into guest dockerd. |
| app/arcbox-docker/src/proxy/mod.rs | Extends the guest connector trait to support role-aware connections. |
| app/arcbox-docker/src/proxy/forward.rs | Adds role-selectable buffered/streaming proxy helpers to guest dockerd. |
| app/arcbox-docker/src/proxy/fallback.rs | Routes unmatched Docker API requests by resolving role from URI (container/exec IDs). |
| app/arcbox-docker/src/proxy/connector.rs | Implements connect_for(role) using per-role machine name + vsock port. |
| app/arcbox-docker/src/lib.rs | Exposes new routing and workload modules. |
| app/arcbox-docker/src/handlers/mod.rs | Adds role extraction/resolution from URIs plus role-aware proxy helpers and fail-closed admission for amd64 runtime. |
| app/arcbox-docker/src/handlers/exec.rs | Records exec IDs to role on exec-create and routes exec follow-ups to the recorded role. |
| app/arcbox-docker/src/handlers/container.rs | Adds fail-closed amd64 admission, records container IDs/names to role, and routes lifecycle/networking by role. |
| app/arcbox-docker/src/handlers/build.rs | Routes builds through HV and fails closed for amd64 when FEX64 is absent. |
| app/arcbox-docker/src/api.rs | Stores the workload role registry in shared app state. |
| app/arcbox-daemon/src/startup/mod.rs | Expands startup resource-wait to scan both native and rosetta dockerd images. |
| app/arcbox-core/src/workload.rs | Defines shared UtilityVmRole enum and helpers for cross-crate role identity. |
| app/arcbox-core/src/vm_lifecycle/mod.rs | Adds per-machine identity/config, drift detection based on resolved boot params, and backend selection. |
| app/arcbox-core/src/vm.rs | Makes VM backend selection a per-machine config property instead of hardcoding HV. |
| app/arcbox-core/src/runtime.rs | Introduces per-role slots (native + optional rosetta) and per-machine inbound port-forwarding state. |
| app/arcbox-core/src/machine.rs | Adds backend + rosetta exposure flags to machine creation config. |
| app/arcbox-core/src/lib.rs | Re-exports VmBackend and UtilityVmRole. |
| app/arcbox-core/src/boot_assets.rs | Updates docs for preparing all runtime binaries in the boot manifest (including optional FEX). |
| app/arcbox-api/src/grpc/machine.rs | Updates machine creation requests to populate new backend/rosetta fields. |

    fn ambiguous_workload_error(id: &str) -> DockerError {
        DockerError::Conflict(format!(
            "workload identifier '{id}' is ambiguous: it matches multiple workloads. \
             Use the full canonical container ID."
        ))
    }

    /// Ensures the utility VM for `role` is running and ready.
    ///
    /// Drives the per-role lifecycle so the native and rosetta VMs are
    /// reachable independently. If `role` is not configured on this host
    /// (e.g. rosetta on non-Apple-Silicon) the native slot answers as a
    /// degradation path — the dockerd connector still works, but the
    /// workload runs on HV instead of VZ+Rosetta.

    pub async fn ensure_role_ready(&self, role: UtilityVmRole) -> Result<u32> {
        self.slot_for(role).container_backend.ensure_ready().await
    }

    /// Returns the role slot, falling back to native if `role` is not
    /// configured on this host.
    fn slot_for(&self, role: UtilityVmRole) -> &RoleSlot {
        if let Some(slot) = self.role_slots.get(&role) {
            return slot;
        }
        self.role_slots
            .get(&UtilityVmRole::Native)
            .expect("Native role slot must always be present")
    }

    /// If the alias is currently owned by a different canonical (e.g. a
    /// previous container with the same name that has not yet been
    /// forgotten), the alias is detached from the previous owner first so
    /// the old owner's alias list never points to a key that now resolves
    /// to a different role.

    detach_alias_from_previous_owner(&mut guard, &alias);
    guard.roles.insert(alias.clone(), role);
    guard
        .alias_owner
        .insert(alias.clone(), canonical.to_string());

    async fn get_cid(&self) -> Result<u32> {
        self.machine_manager
    -       .get_cid(DEFAULT_MACHINE_NAME)
    +       .get_cid(&self.machine_name)
            .ok_or_else(|| CoreError::Machine("default machine has no CID".to_string()))
    }

    match self.machine_manager.start(&self.machine_name).await {
        Ok(()) => {
            tracing::info!("Default VM started successfully");

    # --- Gate A: basic viability ------------------------------------------------
    section "Gate A: basic viability"

    arch="$("${DC[@]}" run --rm --platform linux/arm64 alpine uname -m 2>/dev/null)"

    amd64_out="$("${DC[@]}" run --rm --platform linux/amd64 alpine uname -m 2>&1)"
    if [ "$amd64_out" = "x86_64" ]; then

Greptile SummaryThis PR (ABX-375) routes all Docker runtime containers through the single HV utility VM, using FEX (
Confidence Score: 3/5
Safe to continue iterating on, but two issues should be resolved before the branch is considered production-ready: the Rosetta lifecycle shutdown gap and the test harness false-FAIL path.
The routing pivot, `WorkloadRoleRegistry`, and FEX admission gate are all well-implemented and tested. `runtime::shutdown()` tears down the Rosetta VM machine directly but never calls the Rosetta lifecycle manager's own shutdown, leaving its in-memory state stale and its cancellation token uncancelled; any subsequent call to `lifecycle_for_role(Rosetta).shutdown()` would then target the wrong machine name. The validate-fex harness exit-code and stderr tests emit FAIL (the resume-ABX-374 signal) for transient infrastructure errors, making the gate unreliable as a decision artifact.
`app/arcbox-core/src/runtime.rs` (shutdown/shutdown_force Rosetta lifecycle teardown) and `tests/fex/validate-fex.sh` (exit-code and stderr propagation gate checks).
Important Files Changed
ℹ️ No critical issues — the findings are latent, scoped to the not-yet-routed Rosetta path. The Native runtime path (all that this PR exercises) is sound.
Reviewed changes — the ABX-375 pivot from multi-utility-VM routing to a single HV utility VM that runs linux/amd64 via in-guest FEX64, with VZ/Rosetta demoted to an opt-in build backend.
- New `routing.rs` placement layer: `WorkloadPlatform` parsing maps to a `RuntimeTranslator` (`Native`/`Fex64`); `RoutingDecision::utility_vm()` always returns `UtilityVmRole::Native`, and `is_admissible()` fails closed for `amd64` when FEX64 is absent rather than falling back to VZ/QEMU.
- Per-role VM abstraction in `Runtime`: adds `RoleSlot`/`role_slots`, `ensure_role_ready`, `machine_name_for_role`, `lifecycle_for_role`, and `amd64_runtime_supported`. The Rosetta (VZ) slot is constructed lazily on macOS aarch64 but its VM only boots on `ensure_role_ready(Rosetta)`, which nothing currently calls.
- `VmLifecycleManager` parameterized per machine: new `for_machine(name, data_image)` constructor; `new()` delegates with `DEFAULT_MACHINE_NAME`/`docker.img`; `DEFAULT_MACHINE_NAME` replaced with `&self.machine_name` across the manager. Default-VM drift detection generalized via `machine_drift_reason`.
- Role-aware Docker handlers: `proxy_to_role`, `resolve_container_role`, and a workload→role registry with alias tracking and fail-closed prefix-collision handling.
- Per-machine inbound port forwarding: `inbound_listeners` keyed by machine name and `inbound_rules` carry the owning machine so teardown reaches the right listener.
- Daemon resource wait: `wait_for_resources` now scans both `docker.img` and `docker-rosetta.img` (filtered by existence).
- Assets + tooling: `assets.lock` bumped to boot-assets v0.5.10 (static FEX64); `tests/fex/` validation harness + README added. The guest `arm64.nosve`/SME disable was added then reverted later in the branch (net no cmdline change).
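The placement rule in the first bullet can be sketched roughly as follows. This is a minimal illustration, not the code in `routing.rs`: the type names come from the summary above, but the enum variants of `WorkloadPlatform`, the `translator` field, `for_platform`, and the `fex_available` parameter are assumptions made for the example, and platform variants such as `linux/arm64/v8` are not handled.

```rust
// Hypothetical sketch; names mirror the review summary, bodies are assumed.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum RuntimeTranslator {
    Native,
    Fex64,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum UtilityVmRole {
    Native,
    Rosetta,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum WorkloadPlatform {
    LinuxArm64,
    LinuxAmd64,
}

impl WorkloadPlatform {
    /// Parse a Docker-style `os/arch` string; unknown values yield `None`.
    pub fn parse(s: &str) -> Option<Self> {
        match s.trim().to_ascii_lowercase().as_str() {
            "linux/arm64" | "linux/aarch64" => Some(Self::LinuxArm64),
            "linux/amd64" | "linux/x86_64" => Some(Self::LinuxAmd64),
            _ => None,
        }
    }
}

pub struct RoutingDecision {
    pub translator: RuntimeTranslator,
}

impl RoutingDecision {
    /// arm64 runs natively; amd64 needs the in-guest translator.
    pub fn for_platform(platform: WorkloadPlatform) -> Self {
        let translator = match platform {
            WorkloadPlatform::LinuxArm64 => RuntimeTranslator::Native,
            WorkloadPlatform::LinuxAmd64 => RuntimeTranslator::Fex64,
        };
        Self { translator }
    }

    /// Every workload lands on the single HV utility VM.
    pub fn utility_vm(&self) -> UtilityVmRole {
        UtilityVmRole::Native
    }

    /// Fail closed: amd64 without the translator is rejected, never rerouted.
    pub fn is_admissible(&self, fex_available: bool) -> Result<(), String> {
        match self.translator {
            RuntimeTranslator::Native => Ok(()),
            RuntimeTranslator::Fex64 if fex_available => Ok(()),
            RuntimeTranslator::Fex64 => {
                Err("linux/amd64 requires FEX64 in the utility VM".to_string())
            }
        }
    }
}
```

The point of the shape is that the role and the admission check are separate questions: the role never varies, and only admission depends on what the guest actually provides.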
ℹ️ Runtime shutdown never tears down the Rosetta role slot's lifecycle
The per-role refactor constructs a Rosetta VmLifecycleManager into role_slots on Apple Silicon, but Runtime::shutdown() / shutdown_force() only operate on self.vm_lifecycle (the native lifecycle). The Rosetta slot's lifecycle is never told to shut down, so its health-monitor task, state machine, and MachineStopped event are skipped on daemon teardown.
This is latent today: nothing routes to Rosetta (utility_vm() is always Native), so the Rosetta VM never boots and there is nothing to tear down. It becomes a real resource/teardown gap the moment the Rosetta route is re-enabled (ABX-374). shutdown()/shutdown_force() are outside this PR's diff, which is why this is in the body rather than inline.
Technical details
# Runtime shutdown skips non-native role slots
## Affected sites
- `app/arcbox-core/src/runtime.rs` — `Runtime::shutdown()` (~line 659) and `shutdown_force()` (~line 727) call only `self.vm_lifecycle.shutdown()` / `.force_stop()`.
- `role_slots[UtilityVmRole::Rosetta].lifecycle` is never shut down.
## Required outcome
- On runtime shutdown, every configured role slot's lifecycle manager is shut down (or force-stopped), not just the native one — so the health monitor is cancelled, state transitions to `Stopped`/`NotExist`, and `MachineStopped` is published per machine.
## Suggested approach (optional)
- Iterate `self.role_slots.values()` and call `lifecycle.shutdown()` / `force_stop()` on each, deduplicating the native slot which is already covered by `self.vm_lifecycle`. The generic step-3 "stop non-default machines" loop terminates the VM process but does not drive the lifecycle state machine.
## Open questions for the human
- Is Rosetta intended to stay a fully cold/dead path on this branch (handoff to ABX-374), or should the teardown wiring land here so the slot is safe to activate later?
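One way the suggested approach could look, as a synchronous sketch under stated assumptions: the real lifecycle manager is async and the native lifecycle lives in `self.vm_lifecycle`, whereas here every lifecycle (native included) sits in `role_slots`, so deduplication reduces to a `stopped` flag. `Lifecycle`, its fields, and the machine names are stand-ins for illustration only.

```rust
use std::collections::BTreeMap;

#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
pub enum UtilityVmRole {
    Native,
    Rosetta,
}

/// Stand-in for the real lifecycle manager; records whether shutdown ran.
#[derive(Debug)]
pub struct Lifecycle {
    pub machine_name: String,
    pub stopped: bool,
}

impl Lifecycle {
    pub fn new(machine_name: &str) -> Self {
        Self {
            machine_name: machine_name.to_string(),
            stopped: false,
        }
    }

    /// In the real code this would cancel the health monitor and publish
    /// the stop event; here it only flips a flag.
    pub fn shutdown(&mut self) {
        self.stopped = true;
    }
}

pub struct Runtime {
    pub role_slots: BTreeMap<UtilityVmRole, Lifecycle>,
}

impl Runtime {
    /// Shut down every configured slot, skipping any that already stopped.
    /// Returns the machine names that were shut down by this call.
    pub fn shutdown(&mut self) -> Vec<String> {
        let mut stopped = Vec::new();
        for lifecycle in self.role_slots.values_mut() {
            if !lifecycle.stopped {
                lifecycle.shutdown();
                stopped.push(lifecycle.machine_name.clone());
            }
        }
        stopped
    }
}
```

Calling `shutdown()` twice is harmless in this sketch: the second call finds every slot already stopped and returns an empty list.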
| DEFAULT_MACHINE_NAME | ||
| ); | ||
| - self.machine_manager.stop(DEFAULT_MACHINE_NAME) | ||
| + self.machine_manager.stop(&self.machine_name) |
The per-machine rename is incomplete in shutdown(): the graceful_stop(DEFAULT_MACHINE_NAME, …) call (line 1105) and both warning logs (lines 1112, 1119) still use the "default" constant, while these force-stop fallbacks correctly use &self.machine_name. For a non-native lifecycle (the rosetta slot), shutdown() would graceful-stop the native machine and only force-stop the right one — and the diagnostics would name the wrong machine.
Latent today (nothing calls the rosetta lifecycle's shutdown()), but it's a clear gap in the rename this branch otherwise completed.
Technical details
# shutdown() still hardcodes DEFAULT_MACHINE_NAME
## Affected sites
- `app/arcbox-core/src/vm_lifecycle/mod.rs:1105` — `graceful_stop(DEFAULT_MACHINE_NAME, …)`
- `app/arcbox-core/src/vm_lifecycle/mod.rs:1112`, `:1119` — `tracing::warn!` interpolate `DEFAULT_MACHINE_NAME`
## Required outcome
- All of the sites listed above address `&self.machine_name`, consistent with the force-stop fallbacks at 1114/1122.
| if let Some(id) = extract_container_id(&uri) { | ||
| let _ = state.runtime.ensure_vm_ready().await; | ||
| - if let Some(body_bytes) = inspect_container_body(&state, &id).await { | ||
| + if let Some(body_bytes) = inspect_container_body(&state, role, &id).await { |
restart_container resolves role (line 201) but the best-effort wake on line 209 calls state.runtime.ensure_vm_ready(), which always wakes the native VM, while the subsequent inspect_container_body(&state, role, &id) proxies to the role's VM. The same pattern is in resolve_canonical_from_uri (line 531). Should be state.runtime.ensure_role_ready(role).await.
Latent today since every container resolves to Native, but it would wake the wrong VM once Rosetta is routed.
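The mismatch can be illustrated with a small mock. Everything here is a simplified stand-in (synchronous, with a `woken` list in place of real VM state), and `wake_then_inspect` is a hypothetical helper rather than a function in the handlers; the only point is that the wake call and the proxied call should name the same role.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum UtilityVmRole {
    Native,
    Rosetta,
}

/// Mock runtime that records which role VMs have been woken.
#[derive(Default)]
pub struct Runtime {
    pub woken: Vec<UtilityVmRole>,
}

impl Runtime {
    /// Legacy entry point: always wakes the native VM.
    pub fn ensure_vm_ready(&mut self) {
        self.ensure_role_ready(UtilityVmRole::Native);
    }

    /// Role-aware entry point: wakes only the VM that owns the workload.
    pub fn ensure_role_ready(&mut self, role: UtilityVmRole) {
        if !self.woken.contains(&role) {
            self.woken.push(role);
        }
    }
}

/// Wake the VM for `role`, then return the role the follow-up
/// inspect call would be proxied to.
pub fn wake_then_inspect(rt: &mut Runtime, role: UtilityVmRole) -> UtilityVmRole {
    rt.ensure_role_ready(role);
    role
}
```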
| // Resolve role → machine/port via the runtime. Today both | ||
| // roles still alias to the default machine; once the dual VM | ||
| // lifecycle lands the rosetta branch returns its own machine | ||
| // name and dockerd port without any change here. |
This comment is already inaccurate as of this PR: machine_name_for_role(Rosetta) returns "rosetta" (not the default machine) on macOS aarch64, since the rosetta slot is populated. The dual-VM lifecycle has landed; the comment describes the prior state. Consider dropping or correcting it.
The interpreter has always been plain upstream FEX; FEX64 was an invented name. Renames RuntimeTranslator::Fex64, needs_fex64(), the fex64 tracing label, the fail-closed error message, and the validation script (validate-fex64.sh -> validate-fex.sh, with its 'requires fex64' match updated in the same change). Binary paths and the binfmt entry were already correct.
Pre-existing clippy::redundant_clone on the last use of `current`.

ABX-375: All-in-HV + FEX64 for `linux/amd64` (status & handoff)
Goal: run `linux/amd64` OCI containers through FEX64 (binfmt) inside the single HV utility VM, so no VZ/Rosetta runtime VM is needed. VZ/Rosetta kept only as an optional build backend / fallback (ABX-374, PR #291, branch `feat/dual-utility-vm-routing`; preserved, do not delete).
Status: ⛔ BLOCKED at Gate A on Apple Silicon (Apple M5 Max / macOS 26)
`linux/amd64` via FEX64: FEX SIGILLs, does not run.
Root cause (diagnosed live)
The HVF guest is advertised SVE support that it cannot actually execute (phantom SVE). `/proc/cpuinfo` in the guest reports `sve2 sve2p1 svebf16 …`, yet a single SVE instruction (`rdvl`) SIGILLs. FEX trusts `HWCAP_SVE`, runs SVE, and traps. This appears specific to M5 + macOS-26 HVF (a real M1 advertises no SVE → FEX uses NEON and works).
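The "advertised" half of this diagnosis can be checked mechanically. The sketch below is an illustration written for this note, not part of the branch: `advertised_sve_flags` is a hypothetical helper that lists the SVE-family flags on the `Features` line of an arm64 `/proc/cpuinfo` dump. It only reports what the kernel advertises, so it cannot by itself tell real SVE from the phantom case; that still takes executing an SVE instruction such as `rdvl`.

```rust
/// Return the SVE-family flags listed on the first `Features` line of a
/// `/proc/cpuinfo` dump (arm64 layout). Returns an empty list when the
/// line is missing or names no flag starting with `sve`.
pub fn advertised_sve_flags(cpuinfo: &str) -> Vec<String> {
    cpuinfo
        .lines()
        .find_map(|line| {
            // Each cpuinfo entry has the shape `key<TAB>: value`.
            let (key, value) = line.split_once(':')?;
            if key.trim() == "Features" {
                Some(value)
            } else {
                None
            }
        })
        .map(|features| {
            features
                .split_whitespace()
                .filter(|flag| flag.starts_with("sve"))
                .map(str::to_string)
                .collect()
        })
        .unwrap_or_default()
}
```

Inside the guest this could be fed `std::fs::read_to_string("/proc/cpuinfo")`; a non-empty result combined with a SIGILL on a one-instruction SVE probe would match the phantom-SVE condition described above.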
Fixes attempted
- `arm64.nosve arm64.nosme` guest cmdline (this branch, `514f36b`, `app/arcbox-core/src/vm_lifecycle/mod.rs`): ✅ verified, guest → 0 SVE features. Correct hardening regardless of the amd64 decision; keep.
- … SVE, but FEX-2605 also emits explicit, unconditional SVE (gdb: `ld1rd {z5.d}, p5/z` in FEX's own code) that runs and traps even with guest SVE disabled. So neither fix, nor both together, makes FEX run amd64 here.
Decision point
Per the ABX-375 plan, a Gate-A failure → resume ABX-374 (VZ Rosetta) for amd64 rather than duct-taping. Options:
- … `linux/amd64` runtime to VZ Rosetta (proven); keep the HV path + `arm64.nosve` for arm64. FEX stays optional/experimental.
- … or try a newer FEX (uncertain; may be a FEX-vs-M5/HVF incompatibility).
What's in this branch
- … runtime VM (`f4387f9`, `app/arcbox-docker/src/routing.rs` + handlers).
- `assets.lock` pinned to boot-assets v0.5.9 (static FEX at `/arcbox/runtime/bin/FEX`): `c9e2e6b`.
- `tests/fex/validate-fex64.sh`.
- `arm64.nosve` guest fix: `514f36b`.
Related resources
- … the guard correctly catches FEX's remaining explicit SVE): fix(fex): target apple-m1 to avoid SVE codegen (SIGILL on Apple Silicon) boot-assets#23
- `feat/dual-utility-vm-routing`, PR feat(docker): add utility VM routing seam #291.
Reproduce the blocker