hub: connectionRegistry refactor + 10 connection-handling bug fixes#77
Open
DerAndereAndi wants to merge 1 commit into
Open
hub: connectionRegistry refactor + 10 connection-handling bug fixes#77DerAndereAndi wants to merge 1 commit into
DerAndereAndi wants to merge 1 commit into
Conversation
Audit identified 10 bugs in the hub's connection lifecycle, clustered
around three root causes. Rather than 10 individual patches, three
architectural primitives eliminate the classes by construction; four
remaining bugs are targeted hygiene fixes.
What:
- connectionRegistry (hub/connection_registry.go): owns the
connections map under a private mutex. Swap() atomically applies
the SHIP 12.2.2 double-connection rule and registers the new
connection in one critical section, with a caller-supplied builder
invoked only when the new conn wins. RemoveIfMatches returns bool
so HandleConnectionClosed gates user-facing disconnect callbacks on
slot ownership. shouldKeepNew is the single point of change for a
future spec-compliant rule swap.
- Hub.ctx + Hub.inflightWg: cancelled first in Shutdown; timer
callbacks (prepareConnectionInitation) and in-flight dials
(connectFoundService) short-circuit on ctx.Err(); inflightWg.Wait
drains pending dials with a 6s timeout.
- ShipConnectionInterface.IsAlive(): atomic.Bool flipped inside
shutdownOnce.Do. Registry uses it to filter zombie entries
(CloseConnection fired, HandleConnectionClosed not yet propagated).
Targeted fixes (no architectural change needed):
- Bug 1: defer flag-clear in prepareConnectionInitation early returns
- Bug 4: release muxCon before muxTimers in registerConnection
- Bug 10: self-cleaning timer entries in connectionDelayTimers
- Bug 11: drop unconditional 100ms sleep in sendWSCloseMessage
Why:
The audit's three root causes - decide-and-act split across goroutines
(TOCTOU, ungated callbacks, dual websocket close ownership), no
first-class shutdown context (timer callbacks and dials outlived
Shutdown), and no liveness signal on registered connections (zombie
entries blocked reconnects) - were properties of the *primitives*,
not of any one feature. Confining each class to one well-tested
primitive means future features funnelling through these paths cannot
reintroduce the same shapes.
How:
TDD per bug: failing test first, minimal fix to GREEN, full suite
re-run. Phase 4 verification tests for the architectural bugs (2, 3,
9, 12) passed on first run after the registry was wired - exactly the
signal that the architecture made the classes unrepresentable. 13
existing test files migrated to the registry API; original assertion
intent preserved. Suite passes under -race and -tags=deadlock.
Bugs 5 and 8 from the original audit are absent on dev; both are
specific to feature/shippairing-2 (AddCu replacement, ServiceIdentity
callback migration) and tracked in SHIP_PAIRING_2_AUDIT_FOLLOWUPS.md.
Public API unchanged. ctx is intentionally not re-armed in Shutdown
(would race with concurrent reads); to restart the hub fully, create
a new Hub instance.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Audit identified 10 bugs in the hub's connection lifecycle, clustered around three root causes. Rather than 10 individual patches, three architectural primitives eliminate the classes by construction; four remaining bugs are targeted hygiene fixes.
What:
Targeted fixes (no architectural change needed):
Why:
The audit's three root causes - decide-and-act split across goroutines
(TOCTOU, ungated callbacks, dual websocket close ownership), no
first-class shutdown context (timer callbacks and dials outlived
Shutdown), and no liveness signal on registered connections (zombie
entries blocked reconnects) - were properties of the primitives,
not of any one feature. Confining each class to one well-tested
primitive means future features funnelling through these paths cannot
reintroduce the same shapes.
How:
TDD per bug: failing test first, minimal fix to GREEN, full suite
re-run. Phase 4 verification tests for the architectural bugs (2, 3,
9, 12) passed on first run after the registry was wired - exactly the
signal that the architecture made the classes unrepresentable. 13
existing test files migrated to the registry API; original assertion
intent preserved. Suite passes under -race and -tags=deadlock.
Bugs 5 and 8 from the original audit are absent on dev; both are specific to feature/shippairing-2 (AddCu replacement, ServiceIdentity callback migration) and tracked in SHIP_PAIRING_2_AUDIT_FOLLOWUPS.md.
Public API unchanged. ctx is intentionally not re-armed in Shutdown (would race with concurrent reads); to restart the hub fully, create a new Hub instance.