
retry mesh joins via relay-only address #496

Closed
i386 wants to merge 1 commit into main from codex/relay-only-join-fallback

@i386 (Collaborator) commented May 10, 2026

Summary

Fix mesh joins where a token advertises a valid iroh relay plus direct IP addresses that are not reachable from the joining node.

The join path now tries the full advertised EndpointAddr first, preserving the normal iroh direct-path behavior. If that connect fails and the token contains relay transports, it retries once with a relay-only EndpointAddr so unusable direct addresses do not prevent relay connectivity.
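The fallback described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: `Transport`, `EndpointAddr`, `relay_only`, and `connect_with_relay_fallback` are hypothetical stand-ins that only mirror the shape of the decoded token, and the `connect` callback stands in for whatever the real join path calls.

```rust
use std::net::SocketAddr;

// Hypothetical stand-ins for illustration only; these are NOT iroh's real types.
#[derive(Clone, Debug, PartialEq)]
enum Transport {
    Relay(String),
    Ip(SocketAddr),
}

#[derive(Clone, Debug, PartialEq)]
struct EndpointAddr {
    id: String,
    addrs: Vec<Transport>,
}

/// Returns a copy of `addr` that keeps only relay transports,
/// or `None` when the token advertises no relay at all.
fn relay_only(addr: &EndpointAddr) -> Option<EndpointAddr> {
    let relays: Vec<Transport> = addr
        .addrs
        .iter()
        .filter(|t| matches!(t, Transport::Relay(_)))
        .cloned()
        .collect();
    if relays.is_empty() {
        None
    } else {
        Some(EndpointAddr {
            id: addr.id.clone(),
            addrs: relays,
        })
    }
}

/// Tries the full advertised address first. If that fails and the token
/// contains a relay, retries exactly once with the relay-only address.
/// `connect` is an assumed callback, not a real iroh API.
fn connect_with_relay_fallback<C, E, F>(addr: &EndpointAddr, mut connect: F) -> Result<C, E>
where
    F: FnMut(&EndpointAddr) -> Result<C, E>,
{
    match connect(addr) {
        Ok(conn) => Ok(conn),
        Err(first_err) => match relay_only(addr) {
            Some(fallback) => connect(&fallback),
            // No relay to fall back to: surface the original error.
            None => Err(first_err),
        },
    }
}
```

Keeping the first attempt on the full address means tokens with a reachable direct address behave as before; only the failure case takes the extra relay-only attempt.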

Token Breakdown

The reported token decodes to:

{
  "id": "cb1737c6fb73a5d8173a335d8af291db8f5d605874fc1bf52ad4dcc390728287",
  "addrs": [
    { "Relay": "https://aps1-1.relay.michaelneale.mesh-llm.iroh.link./" },
    { "Ip": "10.0.0.1:52398" },
    { "Ip": "100.107.22.123:52398" },
    { "Ip": "100.112.37.220:40256" },
    { "Ip": "192.168.0.2:52398" },
    { "Ip": "192.168.86.26:52398" }
  ]
}

This token was produced on James' network. From James' node, those direct IPs are meaningful candidates because they describe local or overlay interfaces available in James' network context. From Michael's network, they are not generally usable:

  • 10.0.0.1 is RFC 1918 private LAN space.
  • 192.168.0.2 and 192.168.86.26 are RFC 1918 private LAN space.
  • 100.107.22.123 and 100.112.37.220 are in 100.64.0.0/10 (RFC 6598 shared address space), commonly used for CGNAT or overlay VPNs such as Tailscale, and are not generally reachable from an unrelated LAN unless both sides share that routing context.
  • The relay is the only address in the token that should be expected to work across James' LAN and Michael's LAN.
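The classification in the list above can be reproduced with a small helper. This is a sketch for illustration only: `Scope` and `classify` are hypothetical names and are not part of this PR or of iroh. It uses the standard library's `Ipv4Addr::is_private` for the RFC 1918 ranges and a manual mask for 100.64.0.0/10.

```rust
use std::net::Ipv4Addr;

// Hypothetical classification used only for this illustration.
#[derive(Debug, PartialEq)]
enum Scope {
    /// RFC 1918: 10.0.0.0/8, 172.16.0.0/12, 192.168.0.0/16.
    PrivateLan,
    /// 100.64.0.0/10: CGNAT or overlay VPN space.
    SharedOrOverlay,
    /// Anything else; not examined further here.
    Other,
}

fn classify(ip: Ipv4Addr) -> Scope {
    let [a, b, _, _] = ip.octets();
    if ip.is_private() {
        Scope::PrivateLan
    } else if a == 100 && (b & 0xC0) == 0x40 {
        // A /10 prefix fixes the top two bits of the second octet (01xxxxxx).
        Scope::SharedOrOverlay
    } else {
        Scope::Other
    }
}
```

Applied to the token above, the three 10.x and 192.168.x addresses fall under `PrivateLan` and the two 100.x addresses under `SharedOrOverlay`, so none of the direct candidates is expected to be routable from an unrelated network.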

Why Michael's Node Could Fail

Michael's node was handed an address card for James' node. The expected behavior is that Michael cannot reach James via James' LAN/private/overlay IPs, but can still reach James via the advertised relay.

The suspected failure is that our join path gave iroh the full mixed address list inside one 15-second connect window. With several plausible but unreachable direct candidates present, iroh could spend that window on direct-path attempts and fail before the relay path got a clean attempt. That makes a relay-capable token appear unreachable from Michael's third LAN.

Most tokens that join cleanly either contain a direct address reachable from the joiner, or iroh's normal path selection reaches the relay quickly enough after direct attempts. This token is different because it has several direct candidates that are valid for James' environment but unusable from Michael's network.

Validation

  • cargo fmt --all -- --check
  • cargo check -p mesh-llm

Targeted test note: cargo test -p mesh-llm-host-runtime relay_only_endpoint_addr --lib could not link in this worktree because skippy-ffi could not find the native static library llama-common. The new unit tests compile under cargo check; running them requires the local llama static ABI build artifacts.

@i386 changed the title from "[codex] retry mesh joins via relay-only address" to "retry mesh joins via relay-only address" on May 10, 2026
@i386 requested a review from michaelneale on May 10, 2026 01:17
@i386 marked this pull request as ready for review on May 10, 2026 01:29
@michaelneale (Collaborator) commented

didn't seem to solve it in this case, so WIP

@michaelneale (Collaborator) commented

I think I'll close this; it's not needed.
