Skip to content

fix(ipc): harden the agent-GUI link end to end (review follow-ups for #212/#213)#215

Merged
AprilNEA merged 11 commits into
masterfrom
fix/agent-link-hardening
Jun 11, 2026
Merged

fix(ipc): harden the agent-GUI link end to end (review follow-ups for #212/#213)#215
AprilNEA merged 11 commits into
masterfrom
fix/agent-link-hardening

Conversation

@AprilNEA

Copy link
Copy Markdown
Owner

Fixes everything surfaced by the max-effort review of #212/#213 (15 confirmed/plausible findings), plus the cheap cleanups from the same review's cut list. One PR because the findings share three roots — the v2 migration story, inventory readiness being a lossy bool, and GUI connection state being scattered — and fixing them piecemeal would have meant re-touching the same seams repeatedly.

The three structural fixes

1. Stale-agent takeover (fix(agent): takeover commit)

The review's top finding: the self-exec watcher shipped in #213 only exists in binaries that ship it, so the first protocol bump strands every user whose pre-watcher agent is running — it never exits, holds the singleton lock so every new agent loses and quits, and the new GUI refuses the old protocol. The user parks on the connecting screen until next login.

The escape hatch runs in the new agent: on losing the lock it connects as a client and asks the holder for protocol_version() — wire-stable across every version (method 0, plain u32). If the holder is provably older, it gets SIGTERM (no quit RPC exists in old protocols) and the new agent retries the lock. Under launchd KeepAlive {SuccessfulExit: false}, signal death also makes launchd itself respawn from the (updated) bundle path, so either path converges on exactly one current agent.

  • Debug builds never take over: a dev agent losing the lock to the user's production agent is the expected dev workflow.
  • Not runtime-tested against a live agent — doing so would require killing the production v0.6.6 agent on the dev machine. The handshake path is the same connect/wrap/AgentClient the GUI uses; the rest is pgrep/kill/lock-retry.

2. Inventory readiness is now a tri-state (feat(ipc), protocol v3)

inventory_ready: bool conflated three answers and the review found bugs on both sides of it:

  • The watcher's spawn-failure fallback sent an empty snapshot, forging a "checked, no devices" answer for a check that never ran — the GUI would show the empty-state CTA over a broken HID backend.
  • If enumeration never succeeds, the GUI showed "scanning…" forever with no path out.

InventoryHealth { Scanning, Ready, Unavailable } replaces the bool (hence protocol v3 — v2 never shipped in a release). The watcher reports Unavailable once after 3 consecutive initial failures (and keeps retrying); the orchestrator only downgrades a still-pending inventory, so a transient failure after a good scan never erases known devices. The agent's select loop also maps the watcher channel closing (thread death) to Unavailable instead of spinning on a dead receiver. The GUI renders a static notice pointing at the agent log.

3. One AgentLink state for the GUI (refactor(gui))

Connection state was accessibility_granted: Option<bool> + scanning: bool across two files, and the review traced three inconsistent frames to that split. AgentLink { Connecting, Unreachable, OutdatedGui, Ready(AgentStatus) } is now the single source of truth:

  • Unreachable is a real frame (was: connecting spinner forever). It appears only after 15s without a delivery, once per outage.
  • OutdatedGui (agent newer than this GUI — possible mid-update) gets a relaunch CTA wired to cx.restart().
  • Every frame, including the early ones, hangs off one root element that carries the CloseWindow/Minimize/Zoom action trio — ⌘W/⌘M worked only on the Ready frame before.
  • The settings Accessibility row now distinguishes "agent says denied" from "agent unreachable" instead of defaulting to Denied.

The polling/merge fixes (fix(gui) commits)

  • Status-first polling: inventory was fetched before status, so a Ready status could arrive with a pre-Ready (empty) inventory and flash the empty state — the exact fix: stop flashing the permission gate and empty state at startup #212 bug class, one layer down. Status is now fetched strictly first, and the GUI merges device lists only from completed enumerations.
  • pacing::Pacing centralizes cadence: 250ms fast phase / 2s steady, fast phase hard-capped at 15s from first contact, reopened on reconnect after an outage. Unit-tested; replaces three separate ad-hoc cadence bugs (unbounded fast poll, no fast phase after reconnect, interval drift via MissedTickBehavior::Delay).
  • Pairing: an agent restart mid-pairing now synthesizes a terminal Failed("The background service restarted — try pairing again.") instead of leaving the Add Device window waiting forever.
  • Windows agent path: the spawn retry looked for openlogi-agent without EXE_SUFFIX, so on Windows — where the spawn retry is the only thing that restarts an updated agent — it silently never worked.
  • Asset resolver: built once instead of per-tick, rebuilt exactly when an asset sync lands (with a force push-through so the rebuild isn't lost to the unchanged-list early-return).

Wire-format guard (test(ipc))

The frames are bincode 1.3 DefaultOptions and bincode carries no schema: field order, field types, and enum variant order are the encoding, and nothing enforced that. New golden-byte tests pin every type that crosses the IPC boundary — including the tarpc-generated AgentRequest, whose variant index is how trait method order becomes wire format (protocol_version must stay variant 0 forever; it is the handshake and the takeover probe). A pinned PROTOCOL_VERSION assertion forces any golden regeneration to surface the version bump in the same diff. Wire types living in openlogi-core/openlogi-hid got doc markers so an editor sees the contract before the test fails.

Also from the review, in self_restart: the debounce now requires two consecutive identical fingerprints before exec (no more racing a mid-write binary), a failed exec keeps the current image and retries instead of dying, and the Nix store symlink-flip limitation is documented.

Deliberately not fixed

  • Input Monitoring settings row — needs a new AgentStatus field (another protocol rev); follow-up.
  • Self-restart baseline fingerprint race (binary replaced during startup hashing) — no portable fix; window is milliseconds at process start.
  • Nix symlink flips don't change the watched inode — documented in the module doc; Nix users manage services declaratively anyway.
  • The 8 new i18n strings are hand-translated across all 20 locales (parity test enforces ordering); they should be seeded into Crowdin before the Crowdin workflow lands so the TM doesn't overwrite them.

Verification

  • cargo test --workspace green (incl. 4 new pacing tests, 3 self-restart debounce tests, 2 orchestrator health tests, 6 wire-format golden tests, i18n parity × 20 locales).
  • Full-workspace clippy (-D warnings) clean; Windows cross-lint (openlogi:check-windows) clean.
  • Protocol bump v3 exercised implicitly: a dev GUI against the running production v0.6.6 agent correctly sits on the new connecting frame (strict-equality refusal) — expected until an agent with this PR is installed.

AprilNEA added 10 commits June 11, 2026 16:08
…e exec failure

The watcher exec'd on the first (len, mtime) difference, so a non-atomic
binary replacement (a plain cp, the linker rewriting target/debug in
place) could be caught mid-write and exec a truncated image — and the
failure path exited(1), killing the agent permanently in exactly the
no-respawner setups (autostart off, GUI closed) the module exists to
protect. A change now has to hold still for two consecutive ticks before
the exec, and a failed exec returns to the watch loop (exec does not
fork, so the old image is intact) instead of dying.

Also documents the symlink-flip (Nix profile) blind spot: current_exe
resolves the store payload, which never changes in place.
The first PROTOCOL_VERSION bump strands every existing user: the
deployed agent predates the self-restart watcher, never exits, and holds
the singleton lock — so the freshly-spawned new agent loses and quits,
launchd (which only acts on exit) never intervenes, and the new GUI
refuses the old protocol and sits on its connecting screen until the
next login.

A new agent that loses the lock now handshakes the lock holder over IPC
(protocol_version is wire-stable across every version: method 0, plain
u32). If the holder is provably older it gets SIGTERM'd — under launchd
a signal death is a non-successful exit, so launchd itself respawns the
new binary — and the loser of the ensuing lock race exits cleanly.
Debug builds never take over: a dev agent must not displace the user's
production agent.
inventory_ready: bool couldn't say "enumeration is broken", which left
two real dead ends: a persistently failing HID backend kept the GUI on
'Scanning for devices…' forever with no CTA or error, and the watcher's
thread-spawn-failure fallback forged a completed empty enumeration —
the agent confidently reported 'no devices' for a check that never ran.

AgentStatus now carries InventoryHealth { Scanning, Ready, Unavailable }:
- the watcher reports Unavailable after 3 consecutive initial failures
  (and keeps retrying — a later success upgrades back to Ready), and on
  its spawn-failure path instead of forging an empty snapshot;
- the agent select loop maps a closed watcher channel (thread death,
  e.g. a panic inside enumerate) to Unavailable as well;
- the orchestrator folds last_inventory + enumerated into one
  InventoryState, so the never-checked / checked-and-empty / broken
  distinction lives in the type, and mid-session failures still keep
  the last good snapshot.

Wire change → PROTOCOL_VERSION 3 (the takeover/self-restart machinery
converges running agents).
accessibility_granted: Option<bool> + scanning: bool were two hand-synced
projections of the same fact (the last AgentStatus snapshot, or its
absence) — a future writer updating one but not the other would
desynchronize the gate and the scanning state. AppState now stores a
single AgentLink { Connecting, Unreachable, OutdatedGui, Ready(status) }
and the render path branches on it, so the frames can't disagree.

This also restructures render around one root element: the window
actions (⌘W / ⌘M / zoom) now work on the connecting/gate/error frames
too, which previously early-returned before the on_action handlers were
attached.

New frames (wired to the IPC client in a follow-up commit):
- Unreachable: static notice once the agent has been down well past
  startup — no infinite spinner animation pinning the render loop on a
  frame that can stay up indefinitely;
- OutdatedGui: the agent answered with a newer protocol (the bundle was
  updated under a running GUI) — explain and offer a relaunch button
  instead of a frozen live-looking UI;
- scanning-unavailable: the agent reports enumeration as broken, which
  used to render as an eternal 'Scanning…' spinner.

Locale files gain the 8 new strings at the same ordered position
(including the pairing-restart message a follow-up commit uses).
The update arm fed every snapshot into refresh_inventories, including a
freshly-(re)started agent's empty pre-enumeration list. INVENTORY_MISS_
GRACE = 2 was calibrated against the 2 s steady poll (~4-6 s of
tolerance); at the 250 ms reconnect cadence three empty snapshots arrive
within ~750 ms while a fresh enumeration takes 1.5-5 s (the Bolt arrival
drain alone is 1.5 s) — so every agent self-restart wiped the device
list, popped an open device detail page back to Home, and dropped the
per-device DPI/SmartShift caches. Gate the merge on InventoryHealth::
Ready: a not-ready agent has nothing authoritative to merge, and the
stale list rides through the restart exactly as it rode through 2 s
polls before.
…t handling

Several poll-loop fixes from the post-merge review of #212/#213:

- poll() now fetches status BEFORE inventory. The other way around, the
  agent's first enumeration could land between the two RPCs and pair an
  empty pre-enumeration inventory with ready=true — re-creating the 'No
  devices connected' flash the fast poll exists to prevent, stretched to
  a full steady period by the cadence latch. The inverse pairing is
  benign (devices render regardless of the scanning state).
- Cadence policy extracted into a unit-tested Pacing type, and the
  command-branch disconnect now re-arms the fast cadence too — it
  previously only reset client, contradicting the 'back to fast
  whenever the connection is lost' contract, so a disconnect detected
  by an in-flight command re-converged at 2 s instead of 250 ms.
- The fast phase is capped (15 s without readiness → steady): an agent
  that never becomes ready, or a protocol mismatch, no longer holds the
  loop (and its connect+warn round) at 4 Hz for the GUI's lifetime.
- New GuiUpdate signals: Unreachable (no snapshot 15 s into an outage)
  and OutdatedGui (agent speaks a newer protocol) now reach AppState,
  wiring up the static frames from the AgentLink commit — no more
  eternal connecting spinner, and no frozen live-looking UI after an
  in-place app update.
- A pairing session interrupted by an agent restart synthesizes a
  Failed event (the agent's terminal-event guarantee dies with the
  process), so the Add Device window no longer sits in 'Searching…'
  until manually cancelled.
- Both intervals use MissedTickBehavior::Delay (a stalled poll no
  longer bursts its missed ticks), the spawn gate skips the tick that
  detected a drop (don't race the agent's self-exec for the singleton
  lock), identical snapshots skip the full-window refresh, and the
  module docs drop the stale 'launchd KeepAlive reconnects us' claim.
agent_binary_path() joined a bare 'openlogi-agent' (no .exe) and then
fell through to a macOS-only LoginItems path, so on Windows the spawn
retry — the only thing that restarts an updated agent there, since
restart() exits and the Run-key autostart only fires at login — could
never locate the binary. Append std::env::consts::EXE_SUFFIX and gate
the bundle-helper fallback to macOS.

Also gates takeover.rs imports for the Windows stub (caught by the
windows cross-lint).
…ands

Every PollUpdate constructed a fresh AssetResolver — exe resolution,
cache-root stats, and a full serde_json parse of index.json (652 KB /
~210 entries on a synced install) — at 4 Hz during fast-poll phases.
And the per-tick rebuild didn't even deliver freshness: refresh_
inventories' unchanged-list early-return discarded the freshly resolved
records unless the device set happened to change.

Hoist the resolver out of the loop and rebuild it exactly once, when the
background asset sync reports completion, paired with a forced list
refresh so silhouettes actually upgrade to the synced photos mid-session.
The agent<->GUI frames are bincode 1.3 DefaultOptions (varint,
little-endian) via tokio-serde, and bincode carries no schema: struct
field order, field types, and enum variant order ARE the encoding.
Nothing enforced that until now — a refactor reordering WriteError
variants or wrapping a PairedDevice field in Option would decode as
garbage (or worse, as the wrong variant) across an agent/GUI version
skew, and no test would notice.

Add golden-byte tests in openlogi-agent-core covering every type that
crosses the IPC boundary, including the tarpc-generated AgentRequest —
whose variant index is how trait method order becomes wire format
(protocol_version must stay variant 0 forever: it is the handshake and
the takeover probe). A pinned PROTOCOL_VERSION assertion makes any
golden regeneration surface the version bump in the same diff.

Also drop a doc marker on each wire type that lives outside
openlogi-agent-core (DeviceInventory tree, Lighting, DeviceRoute,
DpiInfo, SmartShift types, pairing types, WriteError) so an editor sees
the contract before the test fails.
@greptile-apps

greptile-apps Bot commented Jun 11, 2026

Copy link
Copy Markdown

Greptile Summary

This PR hardens the agent↔GUI IPC link end-to-end across three structural axes: a stale-agent takeover mechanism (new takeover.rs), a tri-state InventoryHealth enum replacing the lossy inventory_ready: bool, and a unified AgentLink state replacing scattered connection flags in the GUI. It also centralises poll cadence in a tested pacing::Pacing module, fixes status-before-inventory ordering to prevent the "No devices" flash, adds golden-byte wire-format tests, and fixes the Windows EXE_SUFFIX omission in the agent spawn path.

  • Takeover (takeover.rs, main.rs): the new agent probes the lock-holder's protocol version via RPC, SIGTERMs it if provably older, and retries the singleton lock; debug builds are unconditionally excluded from displacing a production agent.
  • InventoryHealth tri-state (ipc.rs, orchestrator.rs, inventory.rs): Scanning/Ready/Unavailable replaces bool, with the watcher reporting Unavailable after three consecutive initial failures and the orchestrator preventing Unavailable from clobbering a live device set.
  • AgentLink + pacing::Pacing (state.rs, ipc_client.rs, app.rs): one enum drives all GUI frames; the Unreachable and OutdatedGui frames are new, bounded, and static; the pacing module is unit-tested with four scenarios.

Confidence Score: 5/5

Safe to merge; the one noted gap (post-sync photo rebuild lost on a Scanning tick) is cosmetic and requires a narrow race to trigger.

All three structural fixes are coherent and well-tested. The takeover path handles the empty-pids race (addressed in this PR), the stale-is-newer case, and the debug-build guard. The InventoryHealth state machine correctly prevents Unavailable from overwriting a live snapshot, and the watcher's INITIAL_FAILURE_LIMIT fires exactly once. The pacing module's four unit tests cover the cap, reconnect rearm, outage-then-recovery, and newer-agent paths. The only gap found is that synced_assets_applied is consumed before confirming the forced inventory rebuild ran — requiring SYNC_DONE to coincide with a Scanning tick, which in practice demands an agent restart mid-sync.

crates/openlogi-gui/src/main.rs — the synced_assets_applied / force_refresh sequencing around line 285.

Important Files Changed

Filename Overview
crates/openlogi-agent/src/takeover.rs New file implementing stale-agent takeover via IPC version probe + SIGTERM + lock retry; empty-pids race addressed in this PR; debug guard, Unix/Windows stubs, and retry budget are all correct.
crates/openlogi-agent-core/src/ipc.rs Protocol bumped to v3 with InventoryHealth tri-state replacing inventory_ready: bool; AgentStatus gets PartialEq/Eq needed by the GUI dedup; wire format pinned by golden tests.
crates/openlogi-agent-core/src/orchestrator.rs Internal InventoryState superset maps cleanly to wire InventoryHealth; mark_inventory_unavailable correctly downgrades only Pending, not live snapshots; both new tests cover key invariants.
crates/openlogi-agent-core/src/watchers/inventory.rs InventoryEvent enum replaces bare Vec; INITIAL_FAILURE_LIMIT counter sends Unavailable exactly once at the third consecutive initial failure and continues retrying; spawn-failure now emits Unavailable instead of forging an empty snapshot.
crates/openlogi-agent/src/self_restart.rs Two-tick debounce via assess() prevents exec of mid-write binaries; failed exec returns instead of exiting (keeps current image, disarms for fresh settle); three new unit tests cover settle, churn, and disarm paths.
crates/openlogi-gui/src/ipc_client.rs Major refactor: GuiUpdate enum surfaces Unreachable/OutdatedGui states; pacing::Pacing centralises cadence with cap, reconnect reset, and four unit tests; status-before-inventory order prevents the "No devices" flash; Windows EXE_SUFFIX fix in agent_binary_path; mid-pairing restart now synthesises Failed.
crates/openlogi-gui/src/state.rs AgentLink enum replaces scattered accessibility_granted: Option<bool> + scanning: bool; set_agent_link returns a changed-bool for dedup; refresh_inventories now returns bool and accepts force for post-sync rebuilds.
crates/openlogi-gui/src/app.rs All frames now hang off a shared root element carrying the window-action trio (⌘W/M/zoom work on every frame); Unreachable, OutdatedGui, and Unavailable frames are static (no animation loop); relaunch CTA wired to cx.restart().
crates/openlogi-gui/src/main.rs Asset resolver built once per session and rebuilt on sync completion; ipc_open gate prevents spinning on a closed channel; synced_assets_applied flag has a narrow race where it's consumed before the forced rebuild actually runs (see inline comment).
crates/openlogi-agent-core/tests/wire_format.rs New golden-byte tests pin every IPC-crossing type using the exact DefaultOptions varint encoding; protocol_version_is_pinned ties version bumps to golden regeneration; request_variant_order guards the RPC method-index-as-wire-format invariant.
crates/openlogi-agent/src/main.rs Takeover integrated cleanly; inventory_open gate prevents spin on a closed watcher channel; InventoryEvent arms in the select loop correctly map both Unavailable and channel close to mark_inventory_unavailable.

Sequence Diagram

sequenceDiagram
    participant NewAgent as New Agent (main)
    participant OldAgent as Old Agent (lock holder)
    participant Lock as Singleton Lock
    participant GUI

    NewAgent->>Lock: acquire("agent.lock")
    Lock-->>NewAgent: AlreadyRunning
    NewAgent->>OldAgent: protocol_version() [timeout 2s]
    OldAgent-->>NewAgent: "v2 (< v3)"
    NewAgent->>OldAgent: kill -TERM
    loop 20 x 200 ms
        NewAgent->>Lock: acquire("agent.lock")
        Lock-->>NewAgent: OK (old agent exited)
    end
    NewAgent->>NewAgent: run() as primary agent

    Note over GUI,NewAgent: Normal poll loop
    GUI->>NewAgent: protocol_version()
    NewAgent-->>GUI: v3
    GUI->>NewAgent: status() → InventoryHealth::Scanning
    GUI->>NewAgent: inventory() → []
    Note over GUI: AgentLink::Connecting → Connecting frame
    GUI->>NewAgent: status() → InventoryHealth::Ready
    GUI->>NewAgent: inventory() → [device]
    Note over GUI: AgentLink::Ready → device gallery

    Note over GUI,NewAgent: Agent update / exec
    NewAgent->>NewAgent: self_restart detects binary change (2-tick settle)
    NewAgent->>NewAgent: exec() replaces process image
    GUI->>NewAgent: RPC error — connection dropped
    Note over GUI: pacing: back to Fast (250 ms)
    GUI->>NewAgent: reconnect → protocol_version()
    alt newer agent (GUI is stale)
        NewAgent-->>GUI: "v4 > v3"
        Note over GUI: AgentLink::OutdatedGui → relaunch CTA
    else same version
        NewAgent-->>GUI: v3
        Note over GUI: AgentLink::Connecting → converges to Ready
    end
Loading

Reviews (2): Last reviewed commit: "fix(agent): don't exit as a duplicate wh..." | Re-trigger Greptile

Comment thread crates/openlogi-agent/src/takeover.rs
…y gone

If the lock holder dies between the version handshake and the pgrep
call, agent_pids() comes back empty, nothing gets signalled, and the
takeover bailed out as a duplicate — leaving no agent running even
though the lock was free (until launchd or the GUI's 30s spawn retry
noticed). An empty pid list now falls through to the lock retry loop:
the lock itself is the ground truth for whether the holder vacated.

Found by Greptile on #215.
@AprilNEA AprilNEA merged commit bb02290 into master Jun 11, 2026
8 checks passed
@AprilNEA AprilNEA deleted the fix/agent-link-hardening branch June 11, 2026 13:18
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant