Design identity provisioning and credential lifecycle for DDIL mesh #14

@kitplummer

Summary

Define how mesh node identity is provisioned, authenticated, and rotated — accounting for DDIL constraints where nodes may be offline during credential rotation or revocation events.

Context

The current fleet management registration flow uses Keycloak client credential exchange with mTLS for agent identity (see UDS Fleet Management: Overview - Member Registration). This works for a hub-and-spoke topology but introduces challenges for the CRDT mesh:

  1. mTLS certificate rotation in DDIL: If a node is offline when certs rotate, it can't re-authenticate. The mesh must handle nodes with stale certificates gracefully.
  2. Revocation in DDIL: CRL/OCSP checks require connectivity. A revoked node that's offline may still participate in P2P sync until it reconnects and is rejected.
  3. Mesh-specific identity: The CRDT mesh uses Iroh endpoint IDs (public keys) for transport identity, plus shared_key and app_id for formation grouping. These are separate from the Keycloak-issued agent identity.
  4. Trust boundaries: Which nodes should sync with which? Formation/app_id grouping is the current mechanism but isn't tied to the organizational identity model.

Current State

  • --shared-key (base64 32-byte key) — pre-shared secret for formation authentication
  • --app-id — formation/group identifier
  • Iroh endpoint ID — derived from ephemeral keypair, changes on restart
  • No integration with Keycloak or any external identity provider
  • No certificate lifecycle management
  • No mechanism to revoke a specific node's mesh access

Key Questions

  1. Should mesh identity be derived from or linked to Keycloak agent identity?

    • Pro: single identity model, SSO-style revocation
    • Con: a runtime dependency on Keycloak defeats the purpose of a mesh that must keep working under DDIL conditions
  2. How long should credentials be valid without hub connectivity?

    • Offline tokens with long expiry (days/weeks)?
    • Pre-provisioned credential bundles with overlap windows?
    • Grace periods after expiry?
  3. What happens when a node is revoked while offline?

    • Gossip-based revocation lists synced via the CRDT mesh itself?
    • Time-bounded trust (credentials naturally expire)?
    • Accepting the risk window as a design trade-off?
  4. How are new nodes provisioned into the mesh?

    • Hub issues formation credentials during registration (when connected)
    • Out-of-band provisioning (USB, QR code) for air-gapped deployment
    • Self-registration with approval workflow?
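
The validity questions above (long expiry, grace periods, time-bounded trust) all reduce to a check a node can run locally without hub contact. A minimal sketch of one possible shape follows. The names `CredentialStatus`, `ValidityPolicy`, and `classify` are hypothetical and not taken from the codebase, and the TTL and grace values are placeholders rather than recommendations.

```rust
use std::time::Duration;

/// Outcome of a purely local, offline credential check.
#[derive(Debug, PartialEq, Eq)]
pub enum CredentialStatus {
    /// Within the normal validity period.
    Valid,
    /// Past expiry but still inside the grace period.
    InGrace { remaining: Duration },
    /// Past expiry and grace; the peer should be rejected.
    Expired,
}

/// Hypothetical policy knobs; names and values are illustrative only.
pub struct ValidityPolicy {
    pub ttl: Duration,
    pub grace: Duration,
}

/// Classify a credential using timestamps in seconds since the Unix epoch.
/// A `now` earlier than `issued_at` is treated as age zero (assumption:
/// clock rollback is handled elsewhere).
pub fn classify(issued_at: u64, now: u64, policy: &ValidityPolicy) -> CredentialStatus {
    let age = now.saturating_sub(issued_at);
    let ttl = policy.ttl.as_secs();
    let hard_limit = ttl.saturating_add(policy.grace.as_secs());
    if age <= ttl {
        CredentialStatus::Valid
    } else if age <= hard_limit {
        CredentialStatus::InGrace {
            remaining: Duration::from_secs(hard_limit - age),
        }
    } else {
        CredentialStatus::Expired
    }
}
```

A node in `InGrace` could keep syncing while flagging itself for re-provisioning. Whether that risk window is acceptable is exactly the trade-off raised in question 3. The check also trusts the local clock, which may itself drift on long-offline nodes.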

Potential Approaches

A: Pre-Provisioned Credential Bundles

  • Hub generates long-lived credential bundles during registration
  • Bundle contains: mesh shared key, app_id, list of trusted peer fingerprints
  • Nodes carry the bundle; it works offline until expiry
  • Rotation: hub issues new bundle with overlap period; nodes pick up on next connection
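
To make the overlap period concrete, here is a rough sketch of how a node might hold two bundle generations at once. The struct layout mirrors the fields listed above, but `CredentialBundle`, `BundleStore`, the `generation` counter, and the `not_before`/`not_after` window are assumptions for illustration, not existing types.

```rust
/// Hypothetical bundle layout based on the fields listed in approach A.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct CredentialBundle {
    pub generation: u32,
    pub shared_key: [u8; 32],
    pub app_id: String,
    pub trusted_peer_fingerprints: Vec<String>,
    /// Validity window in seconds since the Unix epoch:
    /// inclusive start, exclusive end.
    pub not_before: u64,
    pub not_after: u64,
}

impl CredentialBundle {
    pub fn is_valid_at(&self, now: u64) -> bool {
        self.not_before <= now && now < self.not_after
    }
}

/// The bundles a node currently carries (old and new during rotation).
pub struct BundleStore {
    bundles: Vec<CredentialBundle>,
}

impl BundleStore {
    pub fn new(bundles: Vec<CredentialBundle>) -> Self {
        Self { bundles }
    }

    /// Bundle to present when initiating sync: the newest generation
    /// that is valid right now.
    pub fn active(&self, now: u64) -> Option<&CredentialBundle> {
        self.bundles
            .iter()
            .filter(|b| b.is_valid_at(now))
            .max_by_key(|b| b.generation)
    }

    /// Accept a peer key if it matches any currently valid bundle, so
    /// peers still on the old generation keep working during overlap.
    /// Real code would use a constant-time comparison here.
    pub fn accepts_key(&self, key: &[u8; 32], now: u64) -> bool {
        self.bundles
            .iter()
            .any(|b| b.is_valid_at(now) && &b.shared_key == key)
    }
}
```

With windows of `[0, 1000)` and `[800, 2000)`, both keys are accepted between 800 and 1000, and only the new one afterwards. The length of that overlap bounds how long a node can stay offline across a rotation without being locked out.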

B: Mesh-Native Identity (Iroh Keys)

  • Each node has a persistent Iroh keypair (not ephemeral)
  • Hub maintains an allowlist of endpoint IDs per formation
  • Allowlist syncs via CRDT (self-propagating trust)
  • Revocation: remove endpoint ID from allowlist; propagates via mesh
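
One way the self-propagating allowlist could behave is as a two-phase set, where revocations are kept as tombstones so that a stale replica cannot resurrect a revoked node on merge. The sketch below is only an illustration of that CRDT technique: `Allowlist` and `EndpointId` are hypothetical names, endpoint IDs are modeled as plain strings, and nothing here authenticates who wrote an entry (in practice updates would presumably need to be signed by the hub).

```rust
use std::collections::BTreeSet;

/// Endpoint IDs are modeled as opaque strings (assumption for illustration).
pub type EndpointId = String;

/// Two-phase set: an ID can be allowed and later revoked, never re-allowed.
#[derive(Debug, Clone, Default, PartialEq, Eq)]
pub struct Allowlist {
    added: BTreeSet<EndpointId>,
    revoked: BTreeSet<EndpointId>,
}

impl Allowlist {
    pub fn allow(&mut self, id: &str) {
        self.added.insert(id.to_string());
    }

    /// Record a tombstone; the entry stays in `added` but no longer counts.
    pub fn revoke(&mut self, id: &str) {
        self.revoked.insert(id.to_string());
    }

    pub fn is_allowed(&self, id: &str) -> bool {
        self.added.contains(id) && !self.revoked.contains(id)
    }

    /// Merge is a union of both sets, which makes it commutative,
    /// associative, and idempotent, so replicas converge in any sync order.
    pub fn merge(&mut self, other: &Allowlist) {
        self.added.extend(other.added.iter().cloned());
        self.revoked.extend(other.revoked.iter().cloned());
    }
}
```

Because revocation wins permanently in this design, a revoked node that needs to rejoin would need a new keypair and endpoint ID. A revoked node that is offline still learns nothing until it syncs, so this narrows the risk window from question 3 rather than closing it.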

C: Hybrid (Keycloak + Mesh Tokens)

  • Initial registration via Keycloak (when connected)
  • Keycloak issues a mesh-specific offline token with configurable TTL
  • Token contains formation membership, capabilities, expiry
  • Mesh validates token locally (no Keycloak connectivity required)
  • Rotation: token refresh on next hub connection
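
The local validation step might look roughly like the sketch below. It assumes the token signature has already been verified against a cached hub public key, which is not shown, and it only covers the claim checks. `MeshTokenClaims`, `TokenError`, `check_claims`, and the clock-skew parameter are hypothetical names, not Keycloak or Iroh APIs.

```rust
/// Hypothetical claims carried by a mesh token; field names are illustrative.
#[derive(Debug, Clone)]
pub struct MeshTokenClaims {
    pub subject: String,
    pub formation: String,
    pub capabilities: Vec<String>,
    /// Expiry in seconds since the Unix epoch.
    pub expires_at: u64,
}

#[derive(Debug, PartialEq, Eq)]
pub enum TokenError {
    Expired,
    WrongFormation,
    MissingCapability,
}

/// Check already-authenticated claims against local policy.
/// No network access is needed; only the local clock is consulted.
pub fn check_claims(
    claims: &MeshTokenClaims,
    local_formation: &str,
    required_capability: &str,
    now: u64,
    clock_skew_secs: u64,
) -> Result<(), TokenError> {
    // Tolerate a bounded amount of clock drift between nodes.
    if now > claims.expires_at.saturating_add(clock_skew_secs) {
        return Err(TokenError::Expired);
    }
    if claims.formation != local_formation {
        return Err(TokenError::WrongFormation);
    }
    if !claims.capabilities.iter().any(|c| c == required_capability) {
        return Err(TokenError::MissingCapability);
    }
    Ok(())
}
```

The TTL in `expires_at` is the only revocation mechanism in this sketch, so the configured TTL directly sets the worst-case window in which a revoked but offline node can still sync.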

References

  • UDS Fleet Management: Overview - Member Registration
  • src/main.rs:47-53 — current app_id and shared_key config
  • src/node.rs:20-27 — SidecarConfig identity fields
  • Iroh endpoint identity model (public key based)
