[ADR] gRPC listener authentication & authorization #38

## Status

**Proposed.** Design discussion — no decision yet. Once accepted, this body should be flattened into a tracked ADR file: either `peat-node/docs/adr/001-grpc-listener-auth.md` (establishing a repo-local ADR convention, currently absent) or `peat/docs/adr/NNN-peat-node-grpc-auth.md` (ecosystem-level, alongside ADRs 006, 043, 044, 048, 055, 056). Choosing the home is itself part of this decision (see Open Questions).

## Context

`peat-node` exposes `peat.sidecar.v1.PeatSidecar` over a single Connect-RPC listener that simultaneously serves Connect-over-HTTP+JSON, gRPC, and gRPC-Web. Today there is **no authentication and no authorization on the listener**:

- No TLS on inbound connections (TLS exists only on the *outbound* watcher → agent client path in `src/watcher.rs`).
- No interceptor, no bearer-token / API-key check, no caller-identity surface.
- The Helm chart defaults to `listen: "tcp://127.0.0.1:50051"`; `src/main.rs` defaults to `tcp://0.0.0.0:50051`; Unix-socket mode (`unix:///path/to/sock`) is supported.

The implicit security model is **sidecar-namespace trust**: peat-node and the consuming application share a pod (K8s) or a container (Compose), the listener binds to loopback or a UDS, and anything that reaches the socket is by definition the co-located app. The pod / container boundary is the auth boundary.
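
As a rough illustration of where that boundary sits, the sketch below classifies a listen string by how far it exposes the unauthenticated API. `Exposure` and `classify_listen` are hypothetical names, not existing peat-node code, and the sketch assumes only the `tcp://` and `unix://` forms shown above, with literal IP addresses rather than hostnames.

```rust
use std::net::SocketAddr;

/// How far a listener address exposes the unauthenticated API.
#[derive(Debug, PartialEq, Eq)]
pub enum Exposure {
    /// Unix domain socket: reachability is governed by filesystem permissions.
    UnixSocket,
    /// TCP bound to a loopback address: reachable only from the same network namespace.
    Loopback,
    /// TCP bound to any other address (including 0.0.0.0): reachable from outside.
    Network,
}

/// Classify a `tcp://ip:port` or `unix:///path` listen string.
/// Hostnames are rejected rather than resolved, so the answer never depends on DNS.
pub fn classify_listen(listen: &str) -> Result<Exposure, String> {
    if let Some(path) = listen.strip_prefix("unix://") {
        if path.is_empty() {
            return Err("unix:// address has an empty path".to_string());
        }
        return Ok(Exposure::UnixSocket);
    }
    let rest = listen
        .strip_prefix("tcp://")
        .ok_or_else(|| format!("unsupported listen scheme in {listen:?}"))?;
    let addr: SocketAddr = rest
        .parse()
        .map_err(|e| format!("invalid TCP address {rest:?}: {e}"))?;
    if addr.ip().is_loopback() {
        Ok(Exposure::Loopback)
    } else {
        Ok(Exposure::Network)
    }
}
```

Under this classification the chart default (`tcp://127.0.0.1:50051`) is `Loopback`, while the `src/main.rs` default (`tcp://0.0.0.0:50051`) is `Network`. A startup log line or warning keyed on `Network` would be one cheap way to surface the implicit trust model.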

This works for the Kubernetes sidecar deployment that `docs/DESIGN.md` describes. It breaks in at least four cases that are already real:

1. **The new Compose example** (`examples/compose/`, PR #36 / issue #35) binds `0.0.0.0:50051` inside each container and maps the port to the host. Anyone on the host can reach either node. Acceptable for a localhost demo; unacceptable beyond that.
2. **Multi-tenant containers** — two unrelated processes in the same pod / container both reach the loopback listener; only one should be authorized.
3. **Cross-host gRPC** — app and sidecar on different machines (e.g. a development laptop talking to a remote peat-node), or a peat-node fronting fleet-wide queries.
4. **`peat-gateway` integration** — per ADR-043, protocol-bridge adapters (NATS, MQTT, etc.) live in `peat-gateway`. If `peat-gateway` is the *only* trusted caller of a deployed `peat-node`, the gateway → sidecar hop currently has no caller-identity check.

At the mesh layer (peat-node ↔ peat-node over Iroh), authentication is solved: `FormationKey` handshake (`app_id` + 32-byte shared key) + QUIC encryption + ed25519 endpoint identity. That is **not** the question here. This ADR is exclusively about **inbound calls to the gRPC listener from co-located or remote clients**.

## Drivers

A listener-auth design must address all of these, or explicitly decline each:

| Driver | Why it matters |
|---|---|
| **Trust-model clarity** | Today's "loopback / UDS = trusted" model is undocumented as a security boundary. Anyone reading the chart values or the Compose example will not infer the implications. |
| **Caller identity** | Some deployments need *who* called, not just *that they could reach the socket* (audit, multi-tenant routing, quota). A shared secret only answers "allowed or not"; per-caller identity needs a client certificate or a per-caller token. |
| **Rotation** | Whatever we add must be rotatable without restarting the sidecar (or with a clearly-documented restart cost). |
| **Operational simplicity** | Sidecars deploy at fleet scale (5000+ agents per `docs/DESIGN.md`). Whatever auth gets added must be feasible to provision and rotate at that scale — manual cert-per-agent fails. |
| **Failure modes** | Issue #37 (watcher TLS partial-config silent fallback) shows the existing pattern of "set some env vars, get insecure" is real. Any new auth surface must fail closed, never open. |
| **Consistency with ecosystem invariants** | `FormationKey` is already the mesh-layer trust unit. ADR-043 carves out adapters as a separate concern (peat-gateway). The decision should harmonize with both. |
| **Observability** | If auth fails, the operator needs a discoverable signal — not a silent 401 in a log nobody reads. |
| **Backwards compatibility** | Existing deployments with no auth must keep working until they opt in (or we ship a major version that flips the default). |
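
To make the Rotation driver concrete, here is a minimal sketch of one common pattern for swapping credential material without a restart: requests take a cheap snapshot, and a reloader replaces the value behind a lock. `Rotating` is a hypothetical type for illustration only; what triggers the reload (file watcher, signal, admin RPC) is deliberately left open.

```rust
use std::sync::{Arc, RwLock};

/// Holds the current credential material (token, parsed TLS config, ...)
/// so it can be replaced while requests are in flight.
pub struct Rotating<T> {
    current: RwLock<Arc<T>>,
}

impl<T> Rotating<T> {
    pub fn new(initial: T) -> Self {
        Self { current: RwLock::new(Arc::new(initial)) }
    }

    /// Cheap snapshot for one request. It stays valid even if a rotation
    /// happens before the request finishes.
    pub fn snapshot(&self) -> Arc<T> {
        // A poisoned lock still holds a usable value, so recover it.
        let guard = self.current.read().unwrap_or_else(|e| e.into_inner());
        Arc::clone(&*guard)
    }

    /// Swap in new material after the caller has re-read and validated it.
    /// Returns the previous value so it can be logged or compared.
    pub fn rotate(&self, next: T) -> Arc<T> {
        let mut guard = self.current.write().unwrap_or_else(|e| e.into_inner());
        std::mem::replace(&mut *guard, Arc::new(next))
    }
}
```

Validating the new material before calling `rotate` keeps the fail-closed property: a bad reload leaves the old, known-good value in place instead of dropping to "no auth".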

## Options

Five candidates, none mutually exclusive. The decision is likely a *combination* (e.g. A + C), not a single pick.

### Option A — Status quo, formalize the trust model

Do nothing in code. Add a `SECURITY.md` (currently absent in this repo) that explicitly documents the namespace-trust model, calls out the Compose example as demo-only, and points operators at the available mitigations (UDS, network policy, mesh-level auth, peat-gateway in front).

- **+** Zero code change; ships immediately.
- **+** Aligns with the sidecar deployment pattern as designed.
- **−** Does nothing for cross-host or multi-tenant cases.
- **−** Doesn't address #35's Compose-example exposure: anyone on the host can write to the mesh.

### Option B — Listener TLS (server-only or mTLS)

Add `PEAT_NODE_LISTEN_TLS_CERT` / `_KEY` / `_CA` env vars mirroring the watcher's shape (and avoiding the partial-config trap of #37 from day one — error on incoherent combinations).

- **+** Standard, well-understood. The Connect / hyper stack can terminate TLS through a standard acceptor such as rustls.
- **+** mTLS gives per-caller identity (via cert CN / SANs) for free.
- **+** Composes cleanly with SPIFFE / cert-manager / Vault PKI for rotation.
- **−** Cert provisioning at 5000-agent scale is a real operational problem — needs an answer (cert-manager? SPIFFE? Pre-shared CA + short-lived issued certs?).
- **−** Doubles the TLS surface in this repo (already have outbound watcher TLS) — more code paths to keep coherent.
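
The "error on incoherent combinations" rule could be sketched roughly as below. The variable names are the ones proposed above, expanded to `PEAT_NODE_LISTEN_TLS_KEY` and `PEAT_NODE_LISTEN_TLS_CA`. Treating "CA present" as "require client certificates" is an assumption of this sketch, not a settled design, and `ListenerTls` / `listener_tls_from_env` are hypothetical names.

```rust
use std::collections::HashMap;

/// The only three coherent listener-TLS configurations in this sketch.
#[derive(Debug, PartialEq, Eq)]
pub enum ListenerTls {
    /// No TLS variables set: plaintext listener (today's behaviour).
    Disabled,
    /// Cert and key set, no CA: server-authenticated TLS.
    ServerOnly { cert: String, key: String },
    /// Cert, key and CA set: client certificates are required (assumption).
    Mutual { cert: String, key: String, ca: String },
}

/// Resolve the listener TLS mode from an environment snapshot.
/// Anything other than the three coherent shapes is a startup error.
pub fn listener_tls_from_env(env: &HashMap<String, String>) -> Result<ListenerTls, String> {
    // A variable that is set but empty is almost certainly a templating
    // mistake, so it is an error rather than "unset".
    let get = |name: &str| -> Result<Option<String>, String> {
        match env.get(name) {
            None => Ok(None),
            Some(v) if v.trim().is_empty() => Err(format!("{name} is set but empty")),
            Some(v) => Ok(Some(v.clone())),
        }
    };
    let cert = get("PEAT_NODE_LISTEN_TLS_CERT")?;
    let key = get("PEAT_NODE_LISTEN_TLS_KEY")?;
    let ca = get("PEAT_NODE_LISTEN_TLS_CA")?;

    match (cert, key, ca) {
        (None, None, None) => Ok(ListenerTls::Disabled),
        (Some(cert), Some(key), None) => Ok(ListenerTls::ServerOnly { cert, key }),
        (Some(cert), Some(key), Some(ca)) => Ok(ListenerTls::Mutual { cert, key, ca }),
        // Partial config never falls back to plaintext.
        _ => Err("incoherent listener TLS config: CERT and KEY must be set together, \
                  and CA requires both"
            .to_string()),
    }
}
```

Taking the environment as a map rather than reading `std::env` directly keeps the function easy to cover with table-driven tests for every partial combination.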

### Option C — Bearer-token interceptor (shared secret)

Add `PEAT_NODE_API_TOKEN` env var. A Connect interceptor compares `Authorization: Bearer <token>` on every request and rejects a missing or mismatched token with `unauthenticated` (HTTP 401 for Connect-protocol calls). Single shared secret per sidecar.

- **+** Cheapest implementation (~50 LOC + tests).
- **+** Works over plain HTTP (no TLS dependency).
- **+** Easy for app integrators — one env var on both sides.
- **−** No per-caller identity. No revocation without a restart unless we add token-list + reload.
- **−** Token-over-HTTP is broken if the network is untrusted (need TLS too — collapses back into Option B).
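
The core of such an interceptor is a header check along these lines. This is a framework-agnostic sketch: `check_bearer` and `AuthError` are hypothetical names, and wiring it into the actual Connect interceptor API is not shown. The hand-rolled comparison avoids an early exit on the first differing byte, but a vetted constant-time crate (for example `subtle`) would be a safer choice in real code.

```rust
/// Compare two byte strings without stopping at the first mismatch.
/// A length mismatch returns immediately, which reveals only the length.
fn constant_time_eq(a: &[u8], b: &[u8]) -> bool {
    if a.len() != b.len() {
        return false;
    }
    let mut diff = 0u8;
    for (x, y) in a.iter().zip(b) {
        diff |= x ^ y;
    }
    diff == 0
}

#[derive(Debug, PartialEq, Eq)]
pub enum AuthError {
    /// No Authorization header at all.
    Missing,
    /// Header present but not of the form `Bearer <token>`.
    Malformed,
    /// Token present but wrong, or no usable token configured.
    Mismatch,
}

/// Check an `Authorization` header value against the configured token.
pub fn check_bearer(header: Option<&str>, expected: &str) -> Result<(), AuthError> {
    let value = header.ok_or(AuthError::Missing)?;
    let (scheme, token) = value.split_once(' ').ok_or(AuthError::Malformed)?;
    if !scheme.eq_ignore_ascii_case("Bearer") {
        return Err(AuthError::Malformed);
    }
    // Fail closed: an empty configured token must never authorize anyone.
    if expected.is_empty() {
        return Err(AuthError::Mismatch);
    }
    if constant_time_eq(token.trim().as_bytes(), expected.as_bytes()) {
        Ok(())
    } else {
        Err(AuthError::Mismatch)
    }
}
```

All three error variants would map to the same rejection on the wire; keeping them distinct internally gives the operator a more useful log line or metric label.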

### Option D — SPIFFE / workload identity

Listener accepts mTLS where peer-cert is a SPIFFE ID; SPIRE handles rotation and identity issuance.

- **+** Cloud-native standard; integrates with existing UDS/K8s identity tooling.
- **+** Solves provisioning + rotation in one stroke.
- **−** Hard requirement on SPIRE agents being deployed — heavy operational dependency.
- **−** Likely overkill for non-K8s deployments (Compose, bare metal).
- **−** Larger code surface than Option B alone (SPIFFE cert validation, trust-domain config).
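
For a sense of the extra validation surface, the sketch below shows the kind of check that would run after the TLS handshake, once the TLS library has verified the chain and handed over the peer certificate's URI SAN. It implements only a simplified subset of the SPIFFE ID rules (scheme, lowercase trust domain, exact path allowlist). `SpiffeId`, `parse_spiffe_id`, and `authorize_peer` are hypothetical names, and the example IDs are made up.

```rust
#[derive(Debug, PartialEq, Eq)]
pub struct SpiffeId {
    pub trust_domain: String,
    /// Path including the leading slash, or empty if the ID has no path.
    pub path: String,
}

/// Parse `spiffe://<trust-domain>/<path>` (simplified validation).
pub fn parse_spiffe_id(uri: &str) -> Result<SpiffeId, String> {
    let rest = uri.strip_prefix("spiffe://").ok_or("not a spiffe:// URI")?;
    let (domain, path) = match rest.split_once('/') {
        Some((d, p)) => (d, format!("/{p}")),
        None => (rest, String::new()),
    };
    let domain_ok = !domain.is_empty()
        && domain
            .chars()
            .all(|c| c.is_ascii_lowercase() || c.is_ascii_digit() || matches!(c, '.' | '-' | '_'));
    if !domain_ok {
        return Err(format!("invalid trust domain {domain:?}"));
    }
    Ok(SpiffeId { trust_domain: domain.to_string(), path })
}

/// Accept a peer only if it is in our trust domain and on the allowlist.
pub fn authorize_peer(
    uri_san: &str,
    trust_domain: &str,
    allowed_paths: &[&str],
) -> Result<SpiffeId, String> {
    let id = parse_spiffe_id(uri_san)?;
    if id.trust_domain != trust_domain {
        return Err(format!("peer trust domain {:?} is not {trust_domain:?}", id.trust_domain));
    }
    if !allowed_paths.contains(&id.path.as_str()) {
        return Err(format!("peer path {:?} is not allowed", id.path));
    }
    Ok(id)
}
```

The trust domain and the allowlist are exactly the "trust-domain config" this option adds on top of Option B.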

### Option E — UDS-only mode (refuse TCP listeners)

Add a `PEAT_NODE_REQUIRE_UDS` flag (or change the default) that errors at startup if the listener is anything but a Unix socket. Auth becomes file-mode + ownership of the socket path — the kernel enforces it.

- **+** Truly simple. Trust model is filesystem permissions.
- **+** Aligns with the K8s sidecar pattern.
- **−** Doesn't work for the Compose example as written (containers don't share filesystems by default — needs a shared volume).
- **−** Doesn't work cross-host at all.
- **−** Forces a deployment-shape constraint the rest of the project may not want.
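
A minimal sketch of what the startup check and the "file-mode + ownership" enforcement might look like, using only the standard library. `uds_path` and `bind_private` are hypothetical helpers; mode `0600` is an assumption (a shared group with `0660` is equally plausible), and reliance on socket-file permissions is described here for Linux.

```rust
use std::fs;
use std::io;
use std::os::unix::fs::PermissionsExt;
use std::os::unix::net::UnixListener;
use std::path::Path;

/// Return the socket path for `unix://` addresses. With `require_uds` set,
/// any other scheme is a startup error instead of a silent TCP listener.
pub fn uds_path(listen: &str, require_uds: bool) -> Result<Option<&Path>, String> {
    match listen.strip_prefix("unix://") {
        Some(p) if !p.is_empty() => Ok(Some(Path::new(p))),
        Some(_) => Err("unix:// address has an empty path".to_string()),
        None if require_uds => Err(format!(
            "PEAT_NODE_REQUIRE_UDS is set but the listener is {listen:?}"
        )),
        None => Ok(None),
    }
}

/// Bind the socket, then restrict it to the owning user.
/// There is a short window between bind and chmod, so the socket should
/// live in a directory that is itself not world-accessible.
pub fn bind_private(path: &Path) -> io::Result<UnixListener> {
    let listener = UnixListener::bind(path)?;
    fs::set_permissions(path, fs::Permissions::from_mode(0o600))?;
    Ok(listener)
}
```

With this shape, `uds_path("tcp://0.0.0.0:50051", true)` fails at startup, while the same address with the flag unset keeps today's behaviour.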

## Tradeoff matrix

| Driver | A (status quo + doc) | B (listener TLS / mTLS) | C (bearer token) | D (SPIFFE) | E (UDS-only) |
|---|---|---|---|---|---|
| Trust-model clarity | ✅ (documented) | ✅ | ⚠️ (token != identity) | ✅ | ✅ |
| Caller identity | ❌ | ✅ (mTLS) | ❌ | ✅ | ⚠️ (process uid) |
| Rotation | n/a | ⚠️ (needs PKI) | ❌ (restart) | ✅ | n/a |
| Operational simplicity | ✅ | ⚠️ (PKI burden) | ✅ | ❌ (SPIRE dep) | ✅ |
| Failure modes (fail-closed) | n/a | depends on design | ✅ | ✅ | ✅ |
| Cross-host support | ❌ | ✅ | ⚠️ (+TLS) | ✅ | ❌ |
| Backwards compatible (opt-in) | ✅ | ✅ | ✅ | ✅ | ⚠️ (flips default) |
| Implementation cost | <1 day | 2–4 days | <1 day | 1–2 weeks | <1 day |

## Recommendation (tentative)

A **layered combination**, picking the simplest that solves each driver:

1. **Option A unconditionally.** Ship `SECURITY.md` documenting the trust model now. Tighten the `examples/compose/` README and the chart `values.yaml` comments to say "loopback or UDS only for any non-demo use." This costs nothing and removes the largest current footgun (operators not knowing the boundary exists).
2. **Option B opt-in.** Add listener-TLS env vars with mTLS support. Make partial-config a startup error from day one (don't repeat #37). This unblocks both cross-host gRPC and per-caller identity for operators who want them.
3. **Option C as a non-goal.** A bearer token without TLS is a footgun; a bearer token with TLS is redundant with mTLS. Skip unless a concrete consumer asks for it.
4. **Option D as a follow-up issue, not this ADR.** SPIFFE integration is a real value-add for fleet-scale deployments but is large enough to deserve its own ADR after Option B has shipped.
5. **Option E as documentation, not a mode.** Recommend UDS as the default in the chart for sidecar deployments; don't make TCP a startup error — too disruptive.

This is **not a decision** — it's the recommendation a future PR would carry into the ADR file once we agree.

## Consequences

If the recommendation lands:

- **`SECURITY.md`** added at repo root with the trust-model writeup and a quick-reference matrix of which deployment shapes need which mitigations.
- **`src/main.rs`** gets `PEAT_NODE_LISTEN_TLS_*` env vars (mirroring watcher TLS shape, learning from #37 — partial config = startup error).
- **`src/server.rs`** (new) or extended `src/main.rs` — TLS listener path.
- **`docs/CONFIGURATION.md`** — new section for listener TLS env vars.
- **`chart/peat-node/values.yaml`** — new `listenerTls.*` block; documented as opt-in.
- **`examples/compose/`** — second example added showing mTLS-enabled two-node mesh (or a note that the existing example deliberately runs unauthenticated for the localhost demo).
- **`tests/`** gets new integration tests for the fail-closed cases (partial config, mTLS required but not configured, malformed PEM, wrong CA, etc.).
- **No changes to**: the wire protocol (TLS is on the transport, not the proto), the mesh layer (FormationKey unchanged), the watcher TLS code (independent path).

## Open questions

1. **ADR home.** Does this live in `peat-node/docs/adr/001-...` (establishing a repo-local ADR convention — currently the repo has only `docs/DESIGN.md`) or in `peat/docs/adr/NNN-...` (ecosystem-level)? Argument for repo-local: this affects only `peat-node`'s listener, not the mesh or any other sibling. Argument for ecosystem: future SPIFFE integration (Option D) likely *does* affect the ecosystem.
2. **Cert lifecycle at fleet scale.** Even with Option B, 5000+ sidecars need cert issuance / rotation. Is the answer cert-manager (K8s-native), SPIFFE/SPIRE, a Vault PKI mount, or pre-shared root CA + on-startup CSR? This is the question that determines whether Option D becomes the right answer instead.
3. **Relationship to `peat-gateway`.** If `peat-gateway` always fronts `peat-node` in production deployments (per ADR-043 / ADR-055), is listener auth on `peat-node` redundant? Or does it still matter for defense-in-depth between gateway and sidecar?
4. **Default-strict policy.** Some future major version may want to flip the default to "require auth." When does that become reasonable, and what's the migration path for existing deployments?
5. **mTLS identity binding.** If we adopt mTLS, what's the canonical mapping from peer-cert (CN, SAN, SPIFFE ID) to a Peat-level caller identity? Does it tie back to FormationKey app_id, or is it a separate axis?

## Dependencies

- #37 — partial-config failure mode (already filed; the new listener-TLS surface should learn from it).
- ADR-006, ADR-043, ADR-044, ADR-048, ADR-055 in the `peat` repo for ecosystem-level security/trust context.
- `peat-gateway` repo posture on auth (if it already does per-caller auth, that changes the recommendation in §3 above).
- No PR-scope yet — this issue is the decision artifact. Implementation PRs branch off the chosen option(s).



