Skip to content

hub: connectionRegistry refactor + 10 connection-handling bug fixes#77

Open
DerAndereAndi wants to merge 1 commit into
devfrom
fix/hub-connection-bugs
Open

hub: connectionRegistry refactor + 10 connection-handling bug fixes#77
DerAndereAndi wants to merge 1 commit into
devfrom
fix/hub-connection-bugs

Conversation

@DerAndereAndi

Copy link
Copy Markdown
Member

Audit identified 10 bugs in the hub's connection lifecycle, clustered around three root causes. Rather than 10 individual patches, three architectural primitives eliminate the classes by construction; four remaining bugs are targeted hygiene fixes.

What:

  • connectionRegistry (hub/connection_registry.go): owns the connections map under a private mutex. Swap() atomically applies the SHIP 12.2.2 double-connection rule and registers the new connection in one critical section, with a caller-supplied builder invoked only when the new conn wins. RemoveIfMatches returns bool so HandleConnectionClosed gates user-facing disconnect callbacks on slot ownership. shouldKeepNew is the single point of change for a future spec-compliant rule swap.
  • Hub.ctx + Hub.inflightWg: cancelled first in Shutdown; timer callbacks (prepareConnectionInitation) and in-flight dials (connectFoundService) short-circuit on ctx.Err(); inflightWg.Wait drains pending dials with a 6s timeout.
  • ShipConnectionInterface.IsAlive(): atomic.Bool flipped inside shutdownOnce.Do. Registry uses it to filter zombie entries (CloseConnection fired, HandleConnectionClosed not yet propagated).

Targeted fixes (no architectural change needed):

  • Bug 1: defer flag-clear in prepareConnectionInitation early returns
  • Bug 4: release muxCon before muxTimers in registerConnection
  • Bug 10: self-cleaning timer entries in connectionDelayTimers
  • Bug 11: drop unconditional 100ms sleep in sendWSCloseMessage

Why:
The audit's three root causes - decide-and-act split across goroutines
(TOCTOU, ungated callbacks, dual websocket close ownership), no
first-class shutdown context (timer callbacks and dials outlived
Shutdown), and no liveness signal on registered connections (zombie
entries blocked reconnects) - were properties of the primitives,
not of any one feature. Confining each class to one well-tested
primitive means future features funnelling through these paths cannot
reintroduce the same shapes.

How:
TDD per bug: failing test first, minimal fix to GREEN, full suite
re-run. Phase 4 verification tests for the architectural bugs (2, 3,
9, 12) passed on first run after the registry was wired - exactly the
signal that the architecture made the classes unrepresentable. 13
existing test files migrated to the registry API; original assertion
intent preserved. Suite passes under -race and -tags=deadlock.

Bugs 5 and 8 from the original audit are absent on dev; both are specific to feature/shippairing-2 (AddCu replacement, ServiceIdentity callback migration) and tracked in SHIP_PAIRING_2_AUDIT_FOLLOWUPS.md.

Public API unchanged. ctx is intentionally not re-armed in Shutdown (would race with concurrent reads); to restart the hub fully, create a new Hub instance.

Audit identified 10 bugs in the hub's connection lifecycle, clustered
around three root causes. Rather than 10 individual patches, three
architectural primitives eliminate the classes by construction; four
remaining bugs are targeted hygiene fixes.

What:
  - connectionRegistry (hub/connection_registry.go): owns the
    connections map under a private mutex. Swap() atomically applies
    the SHIP 12.2.2 double-connection rule and registers the new
    connection in one critical section, with a caller-supplied builder
    invoked only when the new conn wins. RemoveIfMatches returns bool
    so HandleConnectionClosed gates user-facing disconnect callbacks on
    slot ownership. shouldKeepNew is the single point of change for a
    future spec-compliant rule swap.
  - Hub.ctx + Hub.inflightWg: cancelled first in Shutdown; timer
    callbacks (prepareConnectionInitation) and in-flight dials
    (connectFoundService) short-circuit on ctx.Err(); inflightWg.Wait
    drains pending dials with a 6s timeout.
  - ShipConnectionInterface.IsAlive(): atomic.Bool flipped inside
    shutdownOnce.Do. Registry uses it to filter zombie entries
    (CloseConnection fired, HandleConnectionClosed not yet propagated).

Targeted fixes (no architectural change needed):
  - Bug 1:  defer flag-clear in prepareConnectionInitation early returns
  - Bug 4:  release muxCon before muxTimers in registerConnection
  - Bug 10: self-cleaning timer entries in connectionDelayTimers
  - Bug 11: drop unconditional 100ms sleep in sendWSCloseMessage

Why:
  The audit's three root causes - decide-and-act split across goroutines
  (TOCTOU, ungated callbacks, dual websocket close ownership), no
  first-class shutdown context (timer callbacks and dials outlived
  Shutdown), and no liveness signal on registered connections (zombie
  entries blocked reconnects) - were properties of the *primitives*,
  not of any one feature. Confining each class to one well-tested
  primitive means future features funnelling through these paths cannot
  reintroduce the same shapes.

How:
  TDD per bug: failing test first, minimal fix to GREEN, full suite
  re-run. Phase 4 verification tests for the architectural bugs (2, 3,
  9, 12) passed on first run after the registry was wired - exactly the
  signal that the architecture made the classes unrepresentable. 13
  existing test files migrated to the registry API; original assertion
  intent preserved. Suite passes under -race and -tags=deadlock.

Bugs 5 and 8 from the original audit are absent on dev; both are
specific to feature/shippairing-2 (AddCu replacement, ServiceIdentity
callback migration) and tracked in SHIP_PAIRING_2_AUDIT_FOLLOWUPS.md.

Public API unchanged. ctx is intentionally not re-armed in Shutdown
(would race with concurrent reads); to restart the hub fully, create
a new Hub instance.
@coveralls

Copy link
Copy Markdown

Coverage Status

coverage: 94.78% (-0.03%) from 94.807% — fix/hub-connection-bugs into dev

@DerAndereAndi DerAndereAndi added the bug Something isn't working label Apr 17, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug Something isn't working

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants