[ADR] gRPC listener authentication & authorization #38

@kitplummer

Status

Proposed. Design discussion — no decision yet. Once accepted, this body should be flattened into a tracked ADR file: either peat-node/docs/adr/001-grpc-listener-auth.md (establishing a repo-local ADR convention, currently absent) or peat/docs/adr/NNN-peat-node-grpc-auth.md (ecosystem-level, alongside ADRs 006, 043, 044, 048, 055, 056). Choosing the home is itself part of this decision (see Open Questions).

Context

peat-node exposes peat.sidecar.v1.PeatSidecar over a single Connect-RPC listener that simultaneously serves Connect-over-HTTP+JSON, gRPC, and gRPC-Web. Today there is no authentication and no authorization on the listener:

  • No TLS on inbound connections (TLS exists only on the outbound watcher → agent client path in src/watcher.rs).
  • No interceptor, no bearer-token / API-key check, no caller-identity surface.
  • The Helm chart defaults to listen: "tcp://127.0.0.1:50051"; src/main.rs defaults to tcp://0.0.0.0:50051; Unix-socket mode (unix:///path/to/sock) is supported.

The implicit security model is sidecar-namespace trust: peat-node and the consuming application share a pod (K8s) or a container (Compose), the listener binds to loopback or a UDS, and anything that reaches the socket is by definition the co-located app. The pod / container boundary is the auth boundary.
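
To make the implicit boundary concrete, the following is a minimal sketch of how a configured listen URI could be classified by reachability. The `Exposure` type and `classify_listen` function are hypothetical names introduced here for illustration and do not exist in peat-node; the URI shapes (`tcp://IP:PORT`, `unix:///path`) are the ones quoted above.

```rust
use std::net::SocketAddr;

/// How far a configured listener is reachable (hypothetical type).
#[derive(Debug, PartialEq)]
enum Exposure {
    /// Unix domain socket: access is governed by filesystem permissions.
    UnixSocket,
    /// TCP bound to a loopback address (127.0.0.0/8 or ::1).
    Loopback,
    /// TCP bound to any other address, including 0.0.0.0.
    Network,
}

/// Classify a listen URI of the form `tcp://IP:PORT` or `unix:///path`.
fn classify_listen(uri: &str) -> Result<Exposure, String> {
    if let Some(path) = uri.strip_prefix("unix://") {
        if path.is_empty() {
            return Err("unix:// listener needs a socket path".to_string());
        }
        return Ok(Exposure::UnixSocket);
    }
    if let Some(hostport) = uri.strip_prefix("tcp://") {
        let addr: SocketAddr = hostport
            .parse()
            .map_err(|e| format!("invalid tcp address {hostport:?}: {e}"))?;
        return Ok(if addr.ip().is_loopback() {
            Exposure::Loopback
        } else {
            Exposure::Network
        });
    }
    // Unknown schemes are rejected rather than guessed at.
    Err(format!("unsupported listen scheme in {uri:?}"))
}
```

Under this sketch the chart default (`tcp://127.0.0.1:50051`) classifies as `Loopback`, while the `src/main.rs` default (`tcp://0.0.0.0:50051`) classifies as `Network`, which is the case the namespace-trust model does not cover.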

This works for the Kubernetes sidecar deployment that docs/DESIGN.md describes. It breaks in at least four cases that are already real:

  1. The new Compose example (examples/compose/; PR #36 and issue #35, "docs: Docker Compose example + env-var reference for non-K8s consumers") binds 0.0.0.0:50051 inside each container and maps it to the host. Anyone on the host can hit either node. That is acceptable for a localhost demo and unacceptable beyond it.
  2. Multi-tenant containers — two unrelated processes in the same pod / container both reach the loopback listener; only one should be authorized.
  3. Cross-host gRPC — app and sidecar on different machines (e.g. a development laptop talking to a remote peat-node), or a peat-node fronting fleet-wide queries.
  4. peat-gateway integration — per ADR-043, protocol-bridge adapters (NATS, MQTT, etc.) live in peat-gateway. If peat-gateway is the only trusted caller of a deployed peat-node, the gateway → sidecar hop currently has no caller-identity check.

At the mesh layer (peat-node ↔ peat-node over Iroh), authentication is solved: FormationKey handshake (app_id + 32-byte shared key) + QUIC encryption + ed25519 endpoint identity. That is not the question here. This ADR is exclusively about inbound calls to the gRPC listener from co-located or remote clients.

Drivers

A listener-auth design must address all of these, or explicitly decline each:

| Driver | Why it matters |
| --- | --- |
| Trust-model clarity | Today's "loopback / UDS = trusted" model is undocumented as a security boundary. Anyone reading the chart values or the Compose example will not infer the implications. |
| Caller identity | Some deployments need who called, not just that they could reach the socket (audit, multi-tenant routing, quota). A shared secret gives "yes/no"; per-caller identity needs cert or token. |
| Rotation | Whatever we add must be rotatable without restarting the sidecar (or with a clearly-documented restart cost). |
| Operational simplicity | Sidecars deploy at fleet scale (5000+ agents per docs/DESIGN.md). Whatever auth gets added must be feasible to provision and rotate at that scale; manual cert-per-agent fails. |
| Failure modes | Issue #37 (watcher TLS partial-config silent fallback) shows the existing pattern of "set some env vars, get insecure" is real. Any new auth surface must fail closed, never open. |
| Consistency with ecosystem invariants | FormationKey is already the mesh-layer trust unit. ADR-043 carves out adapters as a separate concern (peat-gateway). The decision should harmonize with both. |
| Observability | If auth fails, the operator needs a discoverable signal, not a silent 401 in a log nobody reads. |
| Backwards compatibility | Existing deployments with no auth must keep working until they opt in (or we ship a major version that flips the default). |

Options

Five candidates, none mutually exclusive. The decision is likely a combination (e.g. A + C), not a single pick.

Option A — Status quo, formalize the trust model

Do nothing in code. Add a SECURITY.md (currently absent in this repo) that explicitly documents the namespace-trust model, calls out the Compose example as demo-only, and points operators at the available mitigations (UDS, network policy, mesh-level auth, peat-gateway in front).

Option B — Listener TLS (server-only or mTLS)

Add PEAT_NODE_LISTEN_TLS_CERT / _KEY / _CA env vars mirroring the watcher's shape (and avoiding the partial-config trap of #37 from day one — error on incoherent combinations).

  • + Standard, well-understood. Connect / hyper supports it natively.
  • + mTLS gives per-caller identity (via cert CN / SANs) for free.
  • + Composes cleanly with SPIFFE / cert-manager / Vault PKI for rotation.
  • − Cert provisioning at 5000-agent scale is a real operational problem and needs an answer (cert-manager? SPIFFE? Pre-shared CA + short-lived issued certs?).
  • − Doubles the TLS surface in this repo (already have outbound watcher TLS), so there are more code paths to keep coherent.
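
The "error on incoherent combinations" rule can be sketched as a single resolution step that either yields a complete TLS mode or refuses to start. This is a sketch under assumptions: the env var names are the ones proposed above, their values are assumed to be file paths, and `ListenerTls`, `resolve_listener_tls`, and `listener_tls_from_env` are hypothetical names, not existing peat-node code.

```rust
/// Listener TLS mode derived from configuration (hypothetical type).
#[derive(Debug, PartialEq)]
enum ListenerTls {
    /// No TLS variables set: today's plaintext behaviour.
    Disabled,
    /// Server certificate and key only.
    ServerOnly { cert: String, key: String },
    /// Server certificate, key, and a CA used to verify client certificates.
    Mutual { cert: String, key: String, ca: String },
}

/// Resolve the three settings into a mode, failing closed on partial config.
fn resolve_listener_tls(
    cert: Option<String>,
    key: Option<String>,
    ca: Option<String>,
) -> Result<ListenerTls, String> {
    // Treat empty or whitespace-only values as unset so `VAR=` cannot half-enable TLS.
    let norm = |v: Option<String>| v.filter(|s| !s.trim().is_empty());
    match (norm(cert), norm(key), norm(ca)) {
        (None, None, None) => Ok(ListenerTls::Disabled),
        (Some(cert), Some(key), None) => Ok(ListenerTls::ServerOnly { cert, key }),
        (Some(cert), Some(key), Some(ca)) => Ok(ListenerTls::Mutual { cert, key, ca }),
        (Some(_), None, _) => {
            Err("PEAT_NODE_LISTEN_TLS_CERT is set but PEAT_NODE_LISTEN_TLS_KEY is missing".to_string())
        }
        (None, Some(_), _) => {
            Err("PEAT_NODE_LISTEN_TLS_KEY is set but PEAT_NODE_LISTEN_TLS_CERT is missing".to_string())
        }
        (None, None, Some(_)) => {
            Err("PEAT_NODE_LISTEN_TLS_CA is set without a server cert and key".to_string())
        }
    }
}

/// Read the proposed variables from the process environment.
fn listener_tls_from_env() -> Result<ListenerTls, String> {
    let get = |name: &str| std::env::var(name).ok();
    resolve_listener_tls(
        get("PEAT_NODE_LISTEN_TLS_CERT"),
        get("PEAT_NODE_LISTEN_TLS_KEY"),
        get("PEAT_NODE_LISTEN_TLS_CA"),
    )
}
```

Keeping the decision in one exhaustive `match` means every combination of set and unset variables is either a named mode or a startup error; there is no branch that silently falls back to plaintext.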

Option C — Bearer-token interceptor (shared secret)

Add PEAT_NODE_API_TOKEN env var. Connect interceptor compares Authorization: Bearer <token> on every request, rejects with 401 on mismatch. Single shared secret per sidecar.

  • + Cheapest implementation (~50 LOC + tests).
  • + Works over plain HTTP (no TLS dependency).
  • + Easy for app integrators — one env var on both sides.
  • − No per-caller identity. No revocation without a restart unless we add token-list + reload.
  • − Token-over-HTTP is broken if the network is untrusted; it needs TLS too, which collapses back into Option B.
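
The core of the token check could look roughly like the sketch below. It shows only the header comparison, not the Connect interceptor wiring, and `constant_time_eq` and `bearer_authorized` are hypothetical helper names. It assumes the exact `Bearer ` prefix (HTTP auth schemes are case-insensitive in general, so this is a simplification), and a production implementation would more likely rely on a vetted constant-time comparison such as the `subtle` crate than on a hand-rolled loop.

```rust
/// Compare two byte strings without returning early on the first mismatch.
/// The length check still leaks the token length, which this sketch accepts.
fn constant_time_eq(a: &[u8], b: &[u8]) -> bool {
    if a.len() != b.len() {
        return false;
    }
    let mut diff = 0u8;
    for (x, y) in a.iter().zip(b.iter()) {
        diff |= x ^ y;
    }
    diff == 0
}

/// Decide whether an `Authorization` header value carries the expected token.
/// `header` is the raw header value, if the request had one.
fn bearer_authorized(header: Option<&str>, expected: &str) -> bool {
    // Fail closed: an empty configured token never authorizes anyone.
    if expected.is_empty() {
        return false;
    }
    let Some(value) = header else { return false };
    let Some(presented) = value.strip_prefix("Bearer ") else {
        return false;
    };
    constant_time_eq(presented.as_bytes(), expected.as_bytes())
}
```

A request for which this returns `false` would be rejected with 401, as described above.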

Option D — SPIFFE / workload identity

Listener accepts mTLS where peer-cert is a SPIFFE ID; SPIRE handles rotation and identity issuance.

  • + Cloud-native standard; integrates with existing UDS/K8s identity tooling.
  • + Solves provisioning + rotation in one stroke.
  • − Hard requirement on SPIRE agents being deployed, which is a heavy operational dependency.
  • − Likely overkill for non-K8s deployments (Compose, bare metal).
  • − Larger code surface than Option B alone (SPIFFE cert validation, trust-domain config).
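
As a rough illustration of the extra validation Option D implies, the sketch below parses a SPIFFE ID of the form `spiffe://<trust-domain>/<path>` (as it would appear in a peer certificate's URI SAN) and checks it against a configured trust domain. `SpiffeId`, `parse_spiffe_id`, `caller_allowed`, and the example trust domain `peat.example` are all hypothetical; extracting the SAN from the certificate and the remaining rules of the SPIFFE ID specification are out of scope here.

```rust
/// A parsed SPIFFE ID (hypothetical type for illustration).
#[derive(Debug, PartialEq)]
struct SpiffeId {
    trust_domain: String,
    /// Workload path, including the leading '/', or empty if absent.
    path: String,
}

/// Parse `spiffe://<trust-domain>/<path>`, rejecting anything else.
fn parse_spiffe_id(uri: &str) -> Result<SpiffeId, String> {
    let rest = uri
        .strip_prefix("spiffe://")
        .ok_or_else(|| format!("not a SPIFFE ID: {uri:?}"))?;
    let (trust_domain, path) = match rest.find('/') {
        Some(i) => (&rest[..i], &rest[i..]),
        None => (rest, ""),
    };
    // Trust domains are limited to lowercase letters, digits, '.', '-' and '_'.
    let valid = !trust_domain.is_empty()
        && trust_domain
            .bytes()
            .all(|b| b.is_ascii_lowercase() || b.is_ascii_digit() || matches!(b, b'.' | b'-' | b'_'));
    if !valid {
        return Err(format!("invalid trust domain in {uri:?}"));
    }
    Ok(SpiffeId {
        trust_domain: trust_domain.to_string(),
        path: path.to_string(),
    })
}

/// Accept a caller only if its ID parses and belongs to the expected trust domain.
fn caller_allowed(uri_san: &str, expected_trust_domain: &str) -> bool {
    matches!(parse_spiffe_id(uri_san), Ok(id) if id.trust_domain == expected_trust_domain)
}
```

An unparseable ID is treated the same as a foreign trust domain: the caller is rejected.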

Option E — UDS-only mode (refuse TCP listeners)

Add a PEAT_NODE_REQUIRE_UDS flag (or change the default) that errors at startup if the listener is anything but a Unix socket. Auth becomes file-mode + ownership of the socket path — the kernel enforces it.

  • + Truly simple. Trust model is filesystem permissions.
  • + Aligns with the K8s sidecar pattern.
  • − Doesn't work for the Compose example as written (containers don't share filesystems by default, so it needs a shared volume).
  • − Doesn't work cross-host at all.
  • − Forces a deployment-shape constraint the rest of the project may not want.
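
A minimal sketch of the startup check, assuming the flag has already been parsed into a boolean and the listener is given as the same URI string used elsewhere in this issue. `enforce_uds` and `socket_mode_is_restricted` are hypothetical names; the second helper illustrates one possible policy (no access bits for "other") and is not something the issue prescribes.

```rust
/// Refuse to start when UDS is required but the listener is not a Unix socket.
fn enforce_uds(listen: &str, require_uds: bool) -> Result<(), String> {
    if require_uds && !listen.starts_with("unix://") {
        return Err(format!(
            "PEAT_NODE_REQUIRE_UDS is set but the listener is {listen:?}; refusing to start"
        ));
    }
    Ok(())
}

/// True when a socket file's permission bits grant nothing to "other",
/// so only the owner and group can connect.
fn socket_mode_is_restricted(mode: u32) -> bool {
    (mode & 0o007) == 0
}
```

With the flag unset, every listener form keeps working, which preserves the opt-in behaviour the backwards-compatibility driver asks for.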

Tradeoff matrix

Driver A (status quo + doc) B (listener TLS / mTLS) C (bearer token) D (SPIFFE) E (UDS-only)
Trust-model clarity ✅ (documented) ⚠️ (token != identity)
Caller identity ✅ (mTLS) ⚠️ (process uid)
Rotation n/a ⚠️ (needs PKI) ❌ (restart) n/a
Operational simplicity ⚠️ (PKI burden) ❌ (SPIRE dep)
Failure modes (fail-closed) n/a depends on design
Cross-host support ⚠️ (+TLS)
Backwards compatible (opt-in) ⚠️ (flips default)
Implementation cost <1 day 2–4 days <1 day 1–2 weeks <1 day

Recommendation (tentative)

A layered combination, picking the simplest that solves each driver:

  1. Option A unconditionally. Ship SECURITY.md documenting the trust model now. Tighten the examples/compose/ README and the chart values.yaml comments to say "loopback or UDS only for any non-demo use." This costs nothing and removes the largest current foot-gun (operators not knowing the boundary exists).
  2. Option B opt-in. Add listener-TLS env vars with mTLS support. Make partial-config a startup error from day one (don't repeat #37, "watcher: harden build_client to error on partial mTLS configuration"). This unblocks (1) cross-host gRPC and (2) per-caller identity for operators who want it.
  3. Option C as a non-goal. A bearer token without TLS is a footgun; a bearer token with TLS is redundant with mTLS. Skip unless a concrete consumer asks for it.
  4. Option D as a follow-up issue, not this ADR. SPIFFE integration is a real value-add for fleet-scale deployments but is large enough to deserve its own ADR after Option B has shipped.
  5. Option E as documentation, not a mode. Recommend UDS as the default in the chart for sidecar deployments; don't make TCP a startup error — too disruptive.

This is not a decision — it's the recommendation a future PR would carry into the ADR file once we agree.

Consequences

If recommendation lands:

  • SECURITY.md added at repo root with the trust-model writeup and a quick-reference matrix of which deployment shapes need which mitigations.
  • src/main.rs gets PEAT_NODE_LISTEN_TLS_* env vars (mirroring the watcher TLS shape and learning from #37, "watcher: harden build_client to error on partial mTLS configuration": partial config = startup error).
  • src/server.rs (new) or extended src/main.rs — TLS listener path.
  • docs/CONFIGURATION.md — new section for listener TLS env vars.
  • chart/peat-node/values.yaml — new listenerTls.* block; documented as opt-in.
  • examples/compose/ — second example added showing mTLS-enabled two-node mesh (or a note that the existing example deliberately runs unauthenticated for the localhost demo).
  • tests/ — new integration tests for the partial-config-error cases (mTLS-required-but-not-configured, malformed PEM, wrong CA, etc.).
  • No changes to: the wire protocol (TLS is on the transport, not the proto), the mesh layer (FormationKey unchanged), the watcher TLS code (independent path).

Open questions

  1. ADR home. Does this live in peat-node/docs/adr/001-... (establishing a repo-local ADR convention — currently the repo has only docs/DESIGN.md) or in peat/docs/adr/NNN-... (ecosystem-level)? Argument for repo-local: this affects only peat-node's listener, not the mesh or any other sibling. Argument for ecosystem: future SPIFFE integration (Option D) likely does affect the ecosystem.
  2. Cert lifecycle at fleet scale. Even with Option B, 5000+ sidecars need cert issuance / rotation. Is the answer cert-manager (K8s-native), SPIFFE/SPIRE, a Vault PKI mount, or pre-shared root CA + on-startup CSR? This is the question that determines whether Option D becomes the right answer instead.
  3. Relationship to peat-gateway. If peat-gateway always fronts peat-node in production deployments (per ADR-043 / ADR-055), is listener auth on peat-node redundant? Or does it still matter for defense-in-depth between gateway and sidecar?
  4. Default-strict policy. Some future major version may want to flip the default to "require auth." When does that become reasonable, and what's the migration path for existing deployments?
  5. mTLS identity binding. If we adopt mTLS, what's the canonical mapping from peer-cert (CN, SAN, SPIFFE ID) to a Peat-level caller identity? Does it tie back to FormationKey app_id, or is it a separate axis?

Dependencies

  • #37, "watcher: harden build_client to error on partial mTLS configuration": the partial-config failure mode (already filed; the new listener-TLS surface should learn from it).
  • ADR-006, ADR-043, ADR-044, ADR-048, ADR-055 in the peat repo for ecosystem-level security/trust context.
  • peat-gateway repo posture on auth (if it already does per-caller auth, that changes the recommendation in §3 above).
  • No PR-scope yet — this issue is the decision artifact. Implementation PRs branch off the chosen option(s).
