fix(ipc): harden the agent-GUI link end to end (review follow-ups for #212/#213)#215
Conversation
…e exec failure The watcher exec'd on the first (len, mtime) difference, so a non-atomic binary replacement (a plain cp, the linker rewriting target/debug in place) could be caught mid-write and exec a truncated image — and the failure path exited(1), killing the agent permanently in exactly the no-respawner setups (autostart off, GUI closed) the module exists to protect. A change now has to hold still for two consecutive ticks before the exec, and a failed exec returns to the watch loop (exec does not fork, so the old image is intact) instead of dying. Also documents the symlink-flip (Nix profile) blind spot: current_exe resolves the store payload, which never changes in place.
The first PROTOCOL_VERSION bump strands every existing user: the deployed agent predates the self-restart watcher, never exits, and holds the singleton lock — so the freshly-spawned new agent loses and quits, launchd (which only acts on exit) never intervenes, and the new GUI refuses the old protocol and sits on its connecting screen until the next login. A new agent that loses the lock now handshakes the lock holder over IPC (protocol_version is wire-stable across every version: method 0, plain u32). If the holder is provably older it gets SIGTERM'd — under launchd a signal death is a non-successful exit, so launchd itself respawns the new binary — and the loser of the ensuing lock race exits cleanly. Debug builds never take over: a dev agent must not displace the user's production agent.
inventory_ready: bool couldn't say "enumeration is broken", which left
two real dead ends: a persistently failing HID backend kept the GUI on
'Scanning for devices…' forever with no CTA or error, and the watcher's
thread-spawn-failure fallback forged a completed empty enumeration —
the agent confidently reported 'no devices' for a check that never ran.
AgentStatus now carries InventoryHealth { Scanning, Ready, Unavailable }:
- the watcher reports Unavailable after 3 consecutive initial failures
(and keeps retrying — a later success upgrades back to Ready), and on
its spawn-failure path instead of forging an empty snapshot;
- the agent select loop maps a closed watcher channel (thread death,
e.g. a panic inside enumerate) to Unavailable as well;
- the orchestrator folds last_inventory + enumerated into one
InventoryState, so the never-checked / checked-and-empty / broken
distinction lives in the type, and mid-session failures still keep
the last good snapshot.
Wire change → PROTOCOL_VERSION 3 (the takeover/self-restart machinery
converges running agents).
accessibility_granted: Option<bool> + scanning: bool were two hand-synced
projections of the same fact (the last AgentStatus snapshot, or its
absence) — a future writer updating one but not the other would
desynchronize the gate and the scanning state. AppState now stores a
single AgentLink { Connecting, Unreachable, OutdatedGui, Ready(status) }
and the render path branches on it, so the frames can't disagree.
This also restructures render around one root element: the window
actions (⌘W / ⌘M / zoom) now work on the connecting/gate/error frames
too, which previously early-returned before the on_action handlers were
attached.
New frames (wired to the IPC client in a follow-up commit):
- Unreachable: static notice once the agent has been down well past
startup — no infinite spinner animation pinning the render loop on a
frame that can stay up indefinitely;
- OutdatedGui: the agent answered with a newer protocol (the bundle was
updated under a running GUI) — explain and offer a relaunch button
instead of a frozen live-looking UI;
- scanning-unavailable: the agent reports enumeration as broken, which
used to render as an eternal 'Scanning…' spinner.
Locale files gain the 8 new strings at the same ordered position
(including the pairing-restart message a follow-up commit uses).
The update arm fed every snapshot into refresh_inventories, including a freshly-(re)started agent's empty pre-enumeration list. INVENTORY_MISS_ GRACE = 2 was calibrated against the 2 s steady poll (~4-6 s of tolerance); at the 250 ms reconnect cadence three empty snapshots arrive within ~750 ms while a fresh enumeration takes 1.5-5 s (the Bolt arrival drain alone is 1.5 s) — so every agent self-restart wiped the device list, popped an open device detail page back to Home, and dropped the per-device DPI/SmartShift caches. Gate the merge on InventoryHealth:: Ready: a not-ready agent has nothing authoritative to merge, and the stale list rides through the restart exactly as it rode through 2 s polls before.
…t handling Several poll-loop fixes from the post-merge review of #212/#213: - poll() now fetches status BEFORE inventory. The other way around, the agent's first enumeration could land between the two RPCs and pair an empty pre-enumeration inventory with ready=true — re-creating the 'No devices connected' flash the fast poll exists to prevent, stretched to a full steady period by the cadence latch. The inverse pairing is benign (devices render regardless of the scanning state). - Cadence policy extracted into a unit-tested Pacing type, and the command-branch disconnect now re-arms the fast cadence too — it previously only reset client, contradicting the 'back to fast whenever the connection is lost' contract, so a disconnect detected by an in-flight command re-converged at 2 s instead of 250 ms. - The fast phase is capped (15 s without readiness → steady): an agent that never becomes ready, or a protocol mismatch, no longer holds the loop (and its connect+warn round) at 4 Hz for the GUI's lifetime. - New GuiUpdate signals: Unreachable (no snapshot 15 s into an outage) and OutdatedGui (agent speaks a newer protocol) now reach AppState, wiring up the static frames from the AgentLink commit — no more eternal connecting spinner, and no frozen live-looking UI after an in-place app update. - A pairing session interrupted by an agent restart synthesizes a Failed event (the agent's terminal-event guarantee dies with the process), so the Add Device window no longer sits in 'Searching…' until manually cancelled. - Both intervals use MissedTickBehavior::Delay (a stalled poll no longer bursts its missed ticks), the spawn gate skips the tick that detected a drop (don't race the agent's self-exec for the singleton lock), identical snapshots skip the full-window refresh, and the module docs drop the stale 'launchd KeepAlive reconnects us' claim.
agent_binary_path() joined a bare 'openlogi-agent' (no .exe) and then fell through to a macOS-only LoginItems path, so on Windows the spawn retry — the only thing that restarts an updated agent there, since restart() exits and the Run-key autostart only fires at login — could never locate the binary. Append std::env::consts::EXE_SUFFIX and gate the bundle-helper fallback to macOS. Also gates takeover.rs imports for the Windows stub (caught by the windows cross-lint).
…ands Every PollUpdate constructed a fresh AssetResolver — exe resolution, cache-root stats, and a full serde_json parse of index.json (652 KB / ~210 entries on a synced install) — at 4 Hz during fast-poll phases. And the per-tick rebuild didn't even deliver freshness: refresh_ inventories' unchanged-list early-return discarded the freshly resolved records unless the device set happened to change. Hoist the resolver out of the loop and rebuild it exactly once, when the background asset sync reports completion, paired with a forced list refresh so silhouettes actually upgrade to the synced photos mid-session.
The agent<->GUI frames are bincode 1.3 DefaultOptions (varint, little-endian) via tokio-serde, and bincode carries no schema: struct field order, field types, and enum variant order ARE the encoding. Nothing enforced that until now — a refactor reordering WriteError variants or wrapping a PairedDevice field in Option would decode as garbage (or worse, as the wrong variant) across an agent/GUI version skew, and no test would notice. Add golden-byte tests in openlogi-agent-core covering every type that crosses the IPC boundary, including the tarpc-generated AgentRequest — whose variant index is how trait method order becomes wire format (protocol_version must stay variant 0 forever: it is the handshake and the takeover probe). A pinned PROTOCOL_VERSION assertion makes any golden regeneration surface the version bump in the same diff. Also drop a doc marker on each wire type that lives outside openlogi-agent-core (DeviceInventory tree, Lighting, DeviceRoute, DpiInfo, SmartShift types, pairing types, WriteError) so an editor sees the contract before the test fails.
Greptile SummaryThis PR hardens the agent↔GUI IPC link end-to-end across three structural axes: a stale-agent takeover mechanism (new
Confidence Score: 5/5Safe to merge; the one noted gap (post-sync photo rebuild lost on a Scanning tick) is cosmetic and requires a narrow race to trigger. All three structural fixes are coherent and well-tested. The takeover path handles the empty-pids race (addressed in this PR), the stale-is-newer case, and the debug-build guard. The crates/openlogi-gui/src/main.rs — the Important Files Changed
Sequence DiagramsequenceDiagram
participant NewAgent as New Agent (main)
participant OldAgent as Old Agent (lock holder)
participant Lock as Singleton Lock
participant GUI
NewAgent->>Lock: acquire("agent.lock")
Lock-->>NewAgent: AlreadyRunning
NewAgent->>OldAgent: protocol_version() [timeout 2s]
OldAgent-->>NewAgent: "v2 (< v3)"
NewAgent->>OldAgent: kill -TERM
loop 20 x 200 ms
NewAgent->>Lock: acquire("agent.lock")
Lock-->>NewAgent: OK (old agent exited)
end
NewAgent->>NewAgent: run() as primary agent
Note over GUI,NewAgent: Normal poll loop
GUI->>NewAgent: protocol_version()
NewAgent-->>GUI: v3
GUI->>NewAgent: status() → InventoryHealth::Scanning
GUI->>NewAgent: inventory() → []
Note over GUI: AgentLink::Connecting → Connecting frame
GUI->>NewAgent: status() → InventoryHealth::Ready
GUI->>NewAgent: inventory() → [device]
Note over GUI: AgentLink::Ready → device gallery
Note over GUI,NewAgent: Agent update / exec
NewAgent->>NewAgent: self_restart detects binary change (2-tick settle)
NewAgent->>NewAgent: exec() replaces process image
GUI->>NewAgent: RPC error — connection dropped
Note over GUI: pacing: back to Fast (250 ms)
GUI->>NewAgent: reconnect → protocol_version()
alt newer agent (GUI is stale)
NewAgent-->>GUI: "v4 > v3"
Note over GUI: AgentLink::OutdatedGui → relaunch CTA
else same version
NewAgent-->>GUI: v3
Note over GUI: AgentLink::Connecting → converges to Ready
end
Reviews (2): Last reviewed commit: "fix(agent): don't exit as a duplicate wh..." | Re-trigger Greptile |
…y gone If the lock holder dies between the version handshake and the pgrep call, agent_pids() comes back empty, nothing gets signalled, and the takeover bailed out as a duplicate — leaving no agent running even though the lock was free (until launchd or the GUI's 30s spawn retry noticed). An empty pid list now falls through to the lock retry loop: the lock itself is the ground truth for whether the holder vacated. Found by Greptile on #215.
Fixes everything surfaced by the max-effort review of #212/#213 (15 confirmed/plausible findings), plus the cheap cleanups from the same review's cut list. One PR because the findings share three roots — the v2 migration story, inventory readiness being a lossy bool, and GUI connection state being scattered — and fixing them piecemeal would have meant re-touching the same seams repeatedly.
The three structural fixes
1. Stale-agent takeover (
fix(agent): takeover commit)The review's top finding: the self-exec watcher shipped in #213 only exists in binaries that ship it, so the first protocol bump strands every user whose pre-watcher agent is running — it never exits, holds the singleton lock so every new agent loses and quits, and the new GUI refuses the old protocol. The user parks on the connecting screen until next login.
The escape hatch runs in the new agent: on losing the lock it connects as a client and asks the holder for
protocol_version()— wire-stable across every version (method 0, plainu32). If the holder is provably older, it gets SIGTERM (no quit RPC exists in old protocols) and the new agent retries the lock. Under launchdKeepAlive {SuccessfulExit: false}, signal death also makes launchd itself respawn from the (updated) bundle path, so either path converges on exactly one current agent.connect/wrap/AgentClientthe GUI uses; the rest ispgrep/kill/lock-retry.2. Inventory readiness is now a tri-state (
feat(ipc), protocol v3)inventory_ready: boolconflated three answers and the review found bugs on both sides of it:InventoryHealth { Scanning, Ready, Unavailable }replaces the bool (hence protocol v3 — v2 never shipped in a release). The watcher reportsUnavailableonce after 3 consecutive initial failures (and keeps retrying); the orchestrator only downgrades a still-pending inventory, so a transient failure after a good scan never erases known devices. The agent's select loop also maps the watcher channel closing (thread death) toUnavailableinstead of spinning on a dead receiver. The GUI renders a static notice pointing at the agent log.3. One
AgentLinkstate for the GUI (refactor(gui))Connection state was
accessibility_granted: Option<bool>+scanning: boolacross two files, and the review traced three inconsistent frames to that split.AgentLink { Connecting, Unreachable, OutdatedGui, Ready(AgentStatus) }is now the single source of truth:cx.restart().CloseWindow/Minimize/Zoomaction trio — ⌘W/⌘M worked only on the Ready frame before.The polling/merge fixes (
fix(gui)commits)pacing::Pacingcentralizes cadence: 250ms fast phase / 2s steady, fast phase hard-capped at 15s from first contact, reopened on reconnect after an outage. Unit-tested; replaces three separate ad-hoc cadence bugs (unbounded fast poll, no fast phase after reconnect, interval drift viaMissedTickBehavior::Delay).Failed("The background service restarted — try pairing again.")instead of leaving the Add Device window waiting forever.openlogi-agentwithoutEXE_SUFFIX, so on Windows — where the spawn retry is the only thing that restarts an updated agent — it silently never worked.forcepush-through so the rebuild isn't lost to the unchanged-list early-return).Wire-format guard (
test(ipc))The frames are bincode 1.3
DefaultOptionsand bincode carries no schema: field order, field types, and enum variant order are the encoding, and nothing enforced that. New golden-byte tests pin every type that crosses the IPC boundary — including the tarpc-generatedAgentRequest, whose variant index is how trait method order becomes wire format (protocol_versionmust stay variant 0 forever; it is the handshake and the takeover probe). A pinnedPROTOCOL_VERSIONassertion forces any golden regeneration to surface the version bump in the same diff. Wire types living inopenlogi-core/openlogi-hidgot doc markers so an editor sees the contract before the test fails.Also from the review, in
self_restart: the debounce now requires two consecutive identical fingerprints before exec (no more racing a mid-write binary), a failed exec keeps the current image and retries instead of dying, and the Nix store symlink-flip limitation is documented.Deliberately not fixed
AgentStatusfield (another protocol rev); follow-up.Verification
cargo test --workspacegreen (incl. 4 new pacing tests, 3 self-restart debounce tests, 2 orchestrator health tests, 6 wire-format golden tests, i18n parity × 20 locales).-D warnings) clean; Windows cross-lint (openlogi:check-windows) clean.