Dynamic peer discovery — Kubernetes headless service (Option B from #5)

Scoped follow-up to #5. Covers **Option B only** (Kubernetes peer discovery for in-cluster deployments). Option A (mDNS LAN) is tracked separately in #62; Option C (gossip) remains parked under #5.

> **Pivot from #5's original spec:** #5 proposed *headless Service + DNS poll + BootstrapInfo handshake*. peat-mesh's existing `KubernetesDiscovery` takes a different (better) approach — **EndpointSlice watching + annotation-carried metadata**. This issue follows the peat-mesh design rather than reinventing.

## Context

In-cluster deployments today must pre-share Iroh endpoint IDs at sidecar startup (`--peer endpoint_id@host:port`, `src/main.rs:60`). When a Deployment autoscales (for example, a ReplicaSet scaling from 3 to 5 replicas), the new pods have no way to find existing peers, and existing peers have no way to learn about the new pods, short of an external orchestrator calling `ConnectPeer` for each new pod.
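
For reference, the static peer spec that discovery has to coexist with can be modelled as below. This is a minimal sketch of the `endpoint_id@host:port` shape only; `StaticPeer` and `parse_peer` are hypothetical names and do not reflect how `src/main.rs` actually parses the flag.

```rust
/// Hypothetical representation of one `--peer endpoint_id@host:port` value.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct StaticPeer {
    pub endpoint_id: String,
    pub host: String,
    pub port: u16,
}

/// Parse `endpoint_id@host:port`. The port is taken after the LAST ':' so
/// that bracketed IPv6 hosts such as `[::1]` survive intact.
pub fn parse_peer(spec: &str) -> Result<StaticPeer, String> {
    let (endpoint_id, addr) = spec
        .split_once('@')
        .ok_or_else(|| format!("missing '@' in peer spec: {spec}"))?;
    let (host, port) = addr
        .rsplit_once(':')
        .ok_or_else(|| format!("missing ':port' in peer spec: {spec}"))?;
    if endpoint_id.is_empty() || host.is_empty() {
        return Err(format!("empty endpoint_id or host in peer spec: {spec}"));
    }
    let port: u16 = port
        .parse()
        .map_err(|e| format!("invalid port in peer spec {spec}: {e}"))?;
    Ok(StaticPeer {
        endpoint_id: endpoint_id.to_string(),
        host: host.to_string(),
        port,
    })
}
```

Discovered peers would end up at the same `connect_peer` call as peers parsed this way, which is why the static path has to keep working unchanged.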

## Prior art in `peat-mesh =0.9.0-rc.9` (most of the work is already done — but feature-gated)

The Kubernetes implementation lives in peat-mesh behind the `kubernetes` feature flag (not currently enabled in peat-node's dep). Reusable today:

- `peat_mesh::discovery::KubernetesDiscovery` — watches `EndpointSlice` resources via the kube API
- `peat_mesh::discovery::KubernetesDiscoveryConfig`:
  - `namespace: Option<String>` — defaults to service-account mount, falls back to `default`
  - `label_selector: String` — defaults `app=peat-mesh`
  - `annotation_prefix: String` — defaults `peat.`
  - `poll_interval: Duration` — defaults 30s
- `extract_peers_from_endpoint_slice(...)`:
  - `node_id` ← `endpoint.target_ref.name` (the pod name)
  - addresses ← `endpoint.addresses`
  - port ← EndpointSlice port (defaults 8080)
  - custom metadata (e.g. `relay_url`, **and presumably iroh endpoint ID**) ← EndpointSlice annotations with the configured prefix
- Reference wiring: `peat-mesh/src/bin/peat-mesh-node.rs:152-156` (the `kubernetes` / `k8s` mode branch)
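
The extraction rules listed above can be illustrated with a small stdlib-only model. This is a sketch under stated assumptions, not peat-mesh's code: `Slice`, `SliceEndpoint`, `DiscoveredPeer`, and `extract_peers` are hypothetical stand-ins for the real Kubernetes and peat-mesh types, and skipping endpoints without a `target_ref` or without addresses is an assumption.

```rust
use std::collections::BTreeMap;

/// Hypothetical, simplified stand-in for one EndpointSlice endpoint.
#[derive(Debug, Clone)]
pub struct SliceEndpoint {
    pub target_ref_name: Option<String>, // pod name, when the endpoint targets a pod
    pub addresses: Vec<String>,
}

/// Hypothetical, simplified stand-in for an EndpointSlice resource.
#[derive(Debug, Clone)]
pub struct Slice {
    pub annotations: BTreeMap<String, String>,
    pub port: Option<u16>,
    pub endpoints: Vec<SliceEndpoint>,
}

#[derive(Debug, Clone, PartialEq, Eq)]
pub struct DiscoveredPeer {
    pub node_id: String,
    pub addresses: Vec<String>,
    pub port: u16,
    pub metadata: BTreeMap<String, String>,
}

const DEFAULT_PORT: u16 = 8080;

pub fn extract_peers(slice: &Slice, annotation_prefix: &str) -> Vec<DiscoveredPeer> {
    // Keep only annotations under the configured prefix, with the prefix stripped.
    let metadata: BTreeMap<String, String> = slice
        .annotations
        .iter()
        .filter_map(|(key, value)| {
            key.strip_prefix(annotation_prefix)
                .map(|short| (short.to_string(), value.clone()))
        })
        .collect();
    let port = slice.port.unwrap_or(DEFAULT_PORT);

    slice
        .endpoints
        .iter()
        .filter_map(|ep| {
            // Assumption: endpoints without a pod reference or addresses are skipped.
            let node_id = ep.target_ref_name.clone()?;
            if ep.addresses.is_empty() {
                return None;
            }
            Some(DiscoveredPeer {
                node_id,
                addresses: ep.addresses.clone(),
                port,
                metadata: metadata.clone(),
            })
        })
        .collect()
}
```

In this model the annotations belong to the slice, so every endpoint in it receives the same metadata map. That is the reason a per-pod value such as an Iroh endpoint ID does not fit the annotation channel without extra work.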

`AutomergeBackend::with_iroh` does **not** fold discovery into its config — peat-mesh keeps the two concerns parallel, leaving the consumer to construct discovery alongside the backend. peat-node follows the same pattern.

## Concrete gap list (what needs to be built in peat-node)

1. **Enable the `kubernetes` feature on the peat-mesh dep.** `Cargo.toml:16` currently reads `features = ["automerge-backend"]`; add `"kubernetes"`.
2. **Construct + spawn discovery in `src/node.rs`.** Instantiate `KubernetesDiscovery::new(...)`, take its event stream, call `start`, call `advertise(node_id, sidecar_port)`, and spawn an event-consumer task that maps `PeerInfo` events → `node.connect_peer(endpoint_id, &addresses, "")`. Mirrors the mDNS wiring in #62.
3. **CLI flags in `src/main.rs`:**
   ```
   --discovery-mode <none|mdns|kubernetes>   / PEAT_NODE_DISCOVERY_MODE (default: none)
   --discovery-namespace <ns>                / PEAT_NODE_DISCOVERY_NAMESPACE (auto from SA mount)
   --discovery-label-selector <sel>          / PEAT_NODE_DISCOVERY_LABEL_SELECTOR (default: app=peat-node)
   --discovery-annotation-prefix <prefix>    / PEAT_NODE_DISCOVERY_ANNOTATION_PREFIX (default: peat.)
   --discovery-interval <seconds>            / PEAT_NODE_DISCOVERY_INTERVAL (default: 30)
   ```
   Share `--discovery-mode` and `--discovery-interval` with #62.
4. **Iroh endpoint ID propagation into the EndpointSlice annotation — the key technical unknown.** peat-mesh's extractor reads annotations off the **EndpointSlice resource**, not off pods. EndpointSlices are auto-managed by the kube-controller-manager and don't inherit per-pod annotations for free. Options to investigate during design:
   - **a)** Helm chart sets a static `endpoint_id` annotation on the Service → EndpointSlice mirroring carries it through. Only works if all replicas share an endpoint_id (they don't — endpoint_id is per-instance).
   - **b)** Each pod self-patches its own EndpointSlice annotation on startup via the kube API. Requires `patch` RBAC on `endpointslices` — viable but adds a sidecar-startup dependency on the API server.
   - **c)** Each pod self-patches its own **pod** annotation (cheaper RBAC), and we extend peat-mesh's extractor to look at the pod via `target_ref.name` → pod GET → annotations. Adds roughly one pod GET per discovered peer on each discovery cycle, unless the lookups are batched.
   - **d)** Use a deterministic keypair seeded from `(formation_id, pod_name)` so endpoint_id is computable from `target_ref.name` without any annotation lookup. Cleanest but requires a deterministic-keypair option in `AutomergeBackendConfig` (peat-mesh additive change).
   First spelunk should be: **does peat-mesh's reference binary or operator solve this today, and how?** That answer dictates everything else.
5. **RBAC manifests in `chart/peat-node/templates/`:** new `ServiceAccount`, `Role` (or `ClusterRole` if cross-namespace) with `get`/`list`/`watch` on `endpointslices.discovery.k8s.io`, and `RoleBinding`. Optional additional `patch` on `endpointslices` (option 4b) or `pods` (option 4c) depending on gap 4's resolution.
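6. **Service / Deployment labels.** Put an `app=peat-node` label on the Deployment and, since the selector is presumably matched against EndpointSlices (which take their labels from the Service), on the Service as well, so the default `label_selector` works without forcing operators to think about it.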
7. **Self-filter + dedup in the event consumer.** Skip own pod, skip already-connected peers.
8. **Graceful degradation.** kube API unreachable (e.g., running outside a cluster) → log + continue, do not fail startup. The `--discovery-mode=none` path must remain the safe default.
9. **Helm chart values (`chart/peat-node/values.yaml`):** `discovery.mode`, `discovery.namespace`, `discovery.labelSelector`, `discovery.annotationPrefix`, `discovery.interval`. Default `mode: none` to stay backward-compatible.
10. **Docs:** `README.md` config table + `docs/CONFIGURATION.md` deployment example for in-cluster discovery, including a complete Deployment + Service + RBAC manifest example.
11. **Tests:**
    - Unit: `extract_peers_from_endpoint_slice` is already covered in peat-mesh; peat-node tests cover the event-consumer dedup/self-filter.
    - Integration (extend `test/cross-cluster-sync.sh` or new k3d test): 3 sidecar replicas in a Deployment with no static `--peer` flags converge to a fully-connected mesh; scale to 5 and verify new pods join within `--discovery-interval`.
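
The self-filter and dedup step from gap 7 could look roughly like the sketch below. It is stdlib-only and deliberately detached from the real types: `PeerEvent`, `PeerFilter`, and `peers_to_connect` are hypothetical names, and keying dedup on the endpoint ID assumes gap 4 is resolved so that each event carries one.

```rust
use std::collections::HashSet;

/// Hypothetical, simplified discovery event (stand-in for peat-mesh's `PeerInfo`).
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct PeerEvent {
    pub node_id: String,     // pod name
    pub endpoint_id: String, // Iroh endpoint ID, however gap 4 ends up supplying it
    pub addresses: Vec<String>,
}

/// Tracks which peers have already been handed to `connect_peer`.
pub struct PeerFilter {
    own_node_id: String,
    connected: HashSet<String>,
}

impl PeerFilter {
    pub fn new(own_node_id: impl Into<String>) -> Self {
        Self {
            own_node_id: own_node_id.into(),
            connected: HashSet::new(),
        }
    }

    /// Returns true when the caller should attempt a connection.
    /// Skips our own pod and any endpoint ID seen before.
    pub fn should_connect(&mut self, ev: &PeerEvent) -> bool {
        if ev.node_id == self.own_node_id {
            return false;
        }
        // `insert` returns false when the ID was already present.
        self.connected.insert(ev.endpoint_id.clone())
    }

    /// Forget a peer (for example after a failed connect or a disconnect)
    /// so that a later discovery cycle retries it.
    pub fn forget(&mut self, endpoint_id: &str) -> bool {
        self.connected.remove(endpoint_id)
    }
}

/// Filter one batch of events down to the peers worth connecting to.
pub fn peers_to_connect<'a>(
    filter: &mut PeerFilter,
    events: &'a [PeerEvent],
) -> Vec<&'a PeerEvent> {
    events.iter().filter(|ev| filter.should_connect(ev)).collect()
}
```

Marking a peer as connected before the connection succeeds keeps the hot path simple, at the cost of requiring a `forget` call on failure. Tracking connection state from the backend instead would be an equally reasonable choice.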

## Acceptance

- `cargo build` and `cargo test` green
- `helm template chart/peat-node` renders cleanly with `discovery.mode=kubernetes` and produces the RBAC + Service + Deployment manifests
- Existing static `--peer` flow continues to work when `--discovery-mode=none` (default)
- k3d integration test: 3-replica Deployment converges to a fully-connected mesh under the new discovery mode; scaling up adds the new pod within `discovery.interval`
- README API/config table and `docs/CONFIGURATION.md` updated; sample Deployment + RBAC manifest included

## Constraints

- Proto-first per SKILL.md: if any discovery state is exposed via gRPC (e.g., "list discovered peers"), it goes in `proto/sidecar.proto` first. Optional for the first cut.
- Discovery off by default. Failure to reach the kube API must log + degrade gracefully, never panic.
- Don't break the `endpoint_id@host:port` parsing in `main.rs` — kubernetes discovery is additive.
- Cross-namespace discovery is **out of scope** for the first cut; namespace defaults to the pod's own SA-mount namespace.
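
The "off by default" and "log + degrade, never panic" constraints above can be expressed as a small startup helper. This is a sketch only: `DiscoveryMode`, `DiscoveryError`, and `start_or_degrade` are hypothetical names, and the closure stands in for whatever actually constructs and starts discovery.

```rust
use std::str::FromStr;

/// Hypothetical enum for `--discovery-mode`; `None` is the default.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Default)]
pub enum DiscoveryMode {
    #[default]
    None,
    Mdns,
    Kubernetes,
}

impl FromStr for DiscoveryMode {
    type Err = String;

    fn from_str(s: &str) -> Result<Self, Self::Err> {
        match s.trim().to_ascii_lowercase().as_str() {
            "none" => Ok(Self::None),
            "mdns" => Ok(Self::Mdns),
            "kubernetes" => Ok(Self::Kubernetes),
            other => Err(format!("unknown discovery mode: {other}")),
        }
    }
}

/// Hypothetical error type for a failed discovery start.
#[derive(Debug)]
pub struct DiscoveryError(pub String);

/// Try to start discovery. Returns true when discovery is running.
/// A failure is logged and reported as "not running"; it never aborts startup.
pub fn start_or_degrade<F>(mode: DiscoveryMode, start: F) -> bool
where
    F: FnOnce() -> Result<(), DiscoveryError>,
{
    if mode == DiscoveryMode::None {
        return false; // discovery disabled: static --peer flow only
    }
    match start() {
        Ok(()) => true,
        Err(e) => {
            eprintln!("discovery ({mode:?}) unavailable, continuing without it: {}", e.0);
            false
        }
    }
}
```

Returning a plain `bool` keeps the caller's control flow trivial: startup continues either way, and the static `--peer` list is still honoured.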

## Dependencies

- **Probably requires a small peat-mesh PR**, depending on gap 4's resolution. If `KubernetesDiscovery` doesn't already have a story for per-pod endpoint_id propagation, the cleanest fix is option 4d (deterministic per-pod keypair seeded from `(formation_id, pod_name)`), which needs additive surface in `AutomergeBackendConfig`.
- Helm chart values addition; chart version bump per repo convention.

## Effort estimate

Medium. Larger than #62 because of the RBAC manifests and the gap-4 question. If gap 4 resolves to options (a–c) it's plumbing + chart work (~300 lines + manifests + docs). If it needs option (d), add a small peat-mesh PR first. The peat-mesh `KubernetesDiscovery` itself is done — peat-node is the wiring + RBAC + the endpoint_id propagation answer.

## References

- Parent tracker: #5
- Sibling: #62 (mDNS — shares CLI / event-consumer plumbing)
- Existing peer wiring: `src/main.rs:60`, `src/node.rs:260`
- Helm chart current state: `chart/peat-node/templates/` (no RBAC, no headless service today)
- peat-mesh prior art: `peat_mesh::discovery::{KubernetesDiscovery, KubernetesDiscoveryConfig}` (feature-gated on `kubernetes`), reference binary at `peat-mesh/src/bin/peat-mesh-node.rs:152-156`
