retry mesh joins via relay-only address #496
Closed
i386 wants to merge 1 commit into
Collaborator
Didn't seem to solve it in this case, so WIP.
Collaborator
I think I will close this; it's not needed.
Summary
Fix mesh joins where a token advertises a valid iroh relay plus direct IP addresses that are not reachable from the joining node.
The join path now tries the full advertised EndpointAddr first, preserving the normal iroh direct-path behavior. If that connect fails and the token contains relay transports, it retries once with a relay-only EndpointAddr so unusable direct addresses do not prevent relay connectivity.
Token Breakdown
The reported token decodes to:
{
  "id": "cb1737c6fb73a5d8173a335d8af291db8f5d605874fc1bf52ad4dcc390728287",
  "addrs": [
    { "Relay": "https://aps1-1.relay.michaelneale.mesh-llm.iroh.link./" },
    { "Ip": "10.0.0.1:52398" },
    { "Ip": "100.107.22.123:52398" },
    { "Ip": "100.112.37.220:40256" },
    { "Ip": "192.168.0.2:52398" },
    { "Ip": "192.168.86.26:52398" }
  ]
}
This token was produced on James' network. From James' node, those direct IPs are meaningful candidates because they describe local or overlay interfaces available in James' network context. From Michael's network, they are not generally usable:
- 10.0.0.1 is RFC 1918 private LAN space.
- 192.168.0.2 and 192.168.86.26 are RFC 1918 private LAN space.
- 100.107.22.123 and 100.112.37.220 are in 100.64.0.0/10, commonly CGNAT or overlay VPN space such as Tailscale, and are not generally reachable from an unrelated LAN unless both sides share that routing context.
Why Michael's Node Could Fail
Michael's node was handed an address card for James' node. The expected behavior is that Michael cannot reach James via James' LAN/private/overlay IPs, but can still reach James via the advertised relay.
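To make the address breakdown concrete, here is a minimal std-only sketch that sorts direct candidates into the same buckets used above. `AddrScope`, `in_shared_space`, and `classify` are hypothetical names for illustration; none of them come from the mesh-llm or iroh code.

```rust
use std::net::{IpAddr, Ipv4Addr, SocketAddr};

/// Coarse scope of a direct address candidate (hypothetical type, illustration only).
#[derive(Debug, PartialEq, Eq)]
enum AddrScope {
    /// RFC 1918 private LAN space (10/8, 172.16/12, 192.168/16).
    PrivateLan,
    /// 100.64.0.0/10, commonly CGNAT or overlay VPN space.
    SharedOrOverlay,
    /// Anything else; may or may not be reachable from the joiner.
    Other,
}

/// True if `ip` falls inside 100.64.0.0/10.
fn in_shared_space(ip: Ipv4Addr) -> bool {
    let o = ip.octets();
    // A /10 fixes the first octet and the top two bits of the second octet.
    o[0] == 100 && (o[1] & 0xC0) == 0x40
}

/// Classify an "ip:port" string; returns None if it does not parse.
fn classify(addr: &str) -> Option<AddrScope> {
    let sock: SocketAddr = addr.parse().ok()?;
    Some(match sock.ip() {
        IpAddr::V4(ip) if ip.is_private() => AddrScope::PrivateLan,
        IpAddr::V4(ip) if in_shared_space(ip) => AddrScope::SharedOrOverlay,
        _ => AddrScope::Other,
    })
}
```

Under this sketch, every Ip entry in the reported token lands in PrivateLan or SharedOrOverlay, which matches the expectation that only the relay is usable from Michael's network.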
The suspected failure is that our join path gave iroh the full mixed address list inside one 15-second connect window. With several plausible but unreachable direct candidates present, iroh could spend that window on direct-path attempts and fail before the relay path got a clean attempt. That makes a relay-capable token appear unreachable from Michael's third LAN.
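The retry described in the Summary can be sketched roughly as follows. This is a simplified, synchronous stand-in, not the PR's code: `Transport`, `Addr`, `relay_only`, and `connect_with_relay_fallback` are hypothetical names, the real join path is presumably async and uses iroh's own EndpointAddr type, and giving the retry its own full timeout and returning the retry's error are assumptions.

```rust
use std::time::Duration;

/// Stand-in for one advertised transport (hypothetical; mirrors the token JSON).
#[derive(Debug, Clone, PartialEq, Eq)]
enum Transport {
    Relay(String),
    Ip(String),
}

/// Stand-in for an endpoint address card (hypothetical, not iroh's type).
#[derive(Debug, Clone, PartialEq, Eq)]
struct Addr {
    id: String,
    addrs: Vec<Transport>,
}

/// Copy of `full` keeping only relay transports, or None if it has none.
fn relay_only(full: &Addr) -> Option<Addr> {
    let relays: Vec<Transport> = full
        .addrs
        .iter()
        .filter(|t| matches!(t, Transport::Relay(_)))
        .cloned()
        .collect();
    if relays.is_empty() {
        None
    } else {
        Some(Addr { id: full.id.clone(), addrs: relays })
    }
}

/// Try the full address first; on failure, retry once with relay transports only.
/// `connect` is a caller-supplied closure standing in for the real connect call.
fn connect_with_relay_fallback<C, E, F>(
    full: &Addr,
    timeout: Duration,
    mut connect: F,
) -> Result<C, E>
where
    F: FnMut(&Addr, Duration) -> Result<C, E>,
{
    match connect(full, timeout) {
        Ok(conn) => Ok(conn),
        Err(first_err) => match relay_only(full) {
            // Assumption: the retry gets its own fresh timeout window.
            Some(relay) => connect(&relay, timeout),
            // No relay advertised, so there is nothing different to retry with.
            None => Err(first_err),
        },
    }
}
```

The point of the second attempt is that the relay path no longer shares a connect window with unreachable direct candidates. A token with no relay transport fails exactly as before, after a single attempt.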
Most tokens that join cleanly either contain a direct address reachable from the joiner, or iroh's normal path selection reaches the relay quickly enough after direct attempts. This token is different because it has several direct candidates that are valid for James' environment but unusable from Michael's network.
Validation
- cargo fmt --all -- --check
- cargo check -p mesh-llm
Targeted test note:
cargo test -p mesh-llm-host-runtime relay_only_endpoint_addr --lib could not link in this worktree because skippy-ffi could not find the native static library llama-common. The new unit tests compile under cargo check; running them requires the local llama static ABI build artifacts.