diff --git a/docs/agentcube/docs/tutorials/internal-auth-spire.md b/docs/agentcube/docs/tutorials/internal-auth-spire.md
new file mode 100644
index 00000000..16fdf22f
--- /dev/null
+++ b/docs/agentcube/docs/tutorials/internal-auth-spire.md
@@ -0,0 +1,328 @@
+# Securing Internal Traffic with SPIRE (mTLS)
+
+This task shows you how to enable mutual TLS (mTLS) between AgentCube's
+control-plane components using [SPIRE](https://spiffe.io/docs/latest/spire-about/spire-concepts/).
+By the end, every request between the Router and WorkloadManager will be
+cryptographically authenticated using short-lived X.509 certificates that rotate
+automatically.
+
+## Before you begin
+
+1. Follow the [Getting Started](../getting-started.md) guide to install
+   AgentCube on your cluster. **Do not** enable SPIRE during the initial
+   installation - this tutorial walks through that step explicitly.
+
+2. Make sure you have the following tools installed:
+   - [`kubectl`](https://kubernetes.io/docs/tasks/tools/) (v1.25+)
+   - [`helm`](https://helm.sh/docs/intro/install/) (v3.12+)
+
+3. Confirm AgentCube is running without SPIRE:
+
+   ```bash
+   kubectl get pods -n agentcube
+   ```
+
+   You should see the Router and WorkloadManager pods in `Running` state, each
+   showing `1/1` containers ready (no sidecar yet):
+
+```
+   NAME                               READY   STATUS    RESTARTS   AGE
+   agentcube-router-7fbb7b54c-7khq5   1/1     Running   0          8s
+   workloadmanager-6c44454f68-zmfcc   1/1     Running   0          8s
+```
+
+> **Tip:**
+> If you are running on a local [Kind](https://kind.sigs.k8s.io/) or
+> [Minikube](https://minikube.sigs.k8s.io/) cluster, you will need to pass two
+> extra overrides in the Helm upgrade command shown below. These are already
+> included in the instructions, so just keep them in.
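
The tool-version prerequisites listed above (kubectl v1.25+, helm v3.12+) can be checked mechanically. The following is a minimal sketch and not part of the PR: `version_ge` and `check_tool` are hypothetical helper names, and the comparison assumes a `sort` that supports `-V` (GNU coreutils and recent BSD/macOS do).

```bash
#!/usr/bin/env bash
# Hypothetical helpers for illustration; they are not shipped with AgentCube.

# version_ge A B: succeeds when version A is greater than or equal to version B.
# Assumes `sort -V` (version sort) is available.
version_ge() {
  [ "$(printf '%s\n%s\n' "$2" "$1" | sort -V | head -n1)" = "$2" ]
}

# check_tool NAME MIN FOUND: prints a one-line verdict and returns non-zero
# when FOUND is older than MIN. A leading "v" on FOUND is stripped.
check_tool() {
  local name="$1" min="$2" found="${3#v}"
  if version_ge "$found" "$min"; then
    echo "ok: $name $found (>= $min)"
  else
    echo "too old: $name $found (need >= $min)"
    return 1
  fi
}
```

For example, `check_tool helm 3.12.0 "$(helm version --template '{{.Version}}')"` checks the installed Helm client; for `kubectl`, pass the client version reported by `kubectl version --client` as the third argument.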
+
+
+## What gets deployed
+
+When you enable SPIRE, the Helm chart creates the following additional resources
+inside your cluster:
+
+| Resource | Kind | Purpose |
+|---|---|---|
+| `spire-server` | StatefulSet (1 replica) | Central certificate authority. Runs the SPIRE Controller Manager as a sidecar. |
+| `spire-agent` | DaemonSet | Runs on every node. Attests workloads and delivers certificates. |
+| `ClusterSPIFFEID` (×2) | CRD | Declarative identity registration for the Router and WorkloadManager. |
+| `spiffe-helper` sidecar | Container (injected) | Fetches and rotates certificates inside the Router and WorkloadManager pods. |
+
+The Router and WorkloadManager pods will each go from `1/1` to `2/2` containers
+(the main process + the `spiffe-helper` sidecar).
+
+## Step 1 - Install the SPIRE Controller Manager CRDs
+
+The SPIRE Controller Manager watches `ClusterSPIFFEID` custom resources. These
+CRDs must be present in the cluster **before** the Helm upgrade, otherwise the
+chart will fail to create them.
+
+```bash
+kubectl apply -k "https://github.com/spiffe/spire-controller-manager/config/crd?ref=v0.6.4"
+```
+
+Verify the CRD was installed:
+
+```bash
+kubectl get crd clusterspiffeids.spire.spiffe.io
+```
+
+Expected output:
+
+```
+NAME                               CREATED AT
+clusterspiffeids.spire.spiffe.io   2026-06-01T16:22:32Z
+```
+
+## Step 2 - Upgrade the Helm release with SPIRE enabled
+
+Run the Helm upgrade with `spire.enabled=true`. Keep `--reuse-values` so your
+existing install-time settings (for example Redis, images, RBAC, or service
+accounts) are preserved while enabling SPIRE. The extra `--set` flags for
+`insecureBootstrap` and `skipKubeletVerification` are needed for local
+development clusters (Kind / Minikube). On a production cluster with proper
+kubelet certificates, you can omit them.
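
The split between local and production flags described above can be made explicit in a wrapper. This is a rough sketch, not part of the PR: `spire_upgrade_args` is a hypothetical function name, and the release name, chart path, and `--set` keys are copied from the Step 2 command in this tutorial.

```bash
#!/usr/bin/env bash
# Hypothetical helper for illustration; not shipped with AgentCube.

# spire_upgrade_args [true|false]: prints the `helm` arguments one per line.
# The two agent overrides are only emitted when the argument is "true",
# i.e. for local Kind / Minikube clusters.
spire_upgrade_args() {
  local is_local="${1:-false}"
  printf '%s\n' upgrade agentcube manifests/charts/base \
    -n agentcube --reuse-values --set spire.enabled=true
  if [ "$is_local" = "true" ]; then
    printf '%s\n' \
      --set spire.agent.insecureBootstrap=true \
      --set spire.agent.skipKubeletVerification=true
  fi
}
```

Running `spire_upgrade_args true | xargs helm` would issue the local-cluster form of the upgrade, while `spire_upgrade_args false | xargs helm` leaves out the two insecure overrides.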
+
+```bash
+helm upgrade agentcube manifests/charts/base \
+  -n agentcube \
+  --reuse-values \
+  --set spire.enabled=true \
+  --set spire.agent.insecureBootstrap=true \
+  --set spire.agent.skipKubeletVerification=true
+```
+
+This single command deploys the full SPIRE infrastructure **and** injects the
+`spiffe-helper` sidecar into the Router and WorkloadManager pods.
+
+Wait for everything to become ready:
+
+```bash
+kubectl rollout status statefulset/spire-server -n agentcube --timeout=120s
+kubectl rollout status daemonset/spire-agent -n agentcube --timeout=120s
+kubectl rollout status deployment/agentcube-router -n agentcube --timeout=120s
+kubectl rollout status deployment/workloadmanager -n agentcube --timeout=120s
+```
+
+## Step 3 - Verify SPIRE is healthy
+
+Check that the SPIRE Server is up and has registered agents:
+
+```bash
+kubectl exec -n agentcube statefulset/spire-server -c spire-server -- \
+  /opt/spire/bin/spire-server agent list
+```
+
+You should see at least one agent entry (one per cluster node):
+
+```
+Found 1 attested agent(s):
+
+SPIFFE ID         : spiffe://cluster.local/spire/agent/k8s_psat/agentcube-cluster/67790303-3657-42d6-bf4f-c3833ec6dd5e
+Attestation type  : k8s_psat
+...
+```
+
+Next, confirm the identity registrations were picked up from the
+`ClusterSPIFFEID` resources:
+
+```bash
+kubectl exec -n agentcube statefulset/spire-server -c spire-server -- \
+  /opt/spire/bin/spire-server entry show
+```
+
+You should see entries for both the Router and WorkloadManager, with SPIFFE IDs
+following the format
+`spiffe://cluster.local/ns/agentcube/sa/`:
+
+```
+Entry ID         : bfd507ec-10d8-43e5-b984-861a3ff81167
+SPIFFE ID        : spiffe://cluster.local/ns/agentcube/sa/agentcube-router
+Parent ID        : spiffe://cluster.local/spire/agent/k8s_psat/agentcube-cluster/67790303-3657-42d6-bf4f-c3833ec6dd5e
+Revision         : 0
+
+Entry ID         : 21e3ba6f-ad13-4076-9e08-90a2d4ff518f
+SPIFFE ID        : spiffe://cluster.local/ns/agentcube/sa/workloadmanager
+Parent ID        : spiffe://cluster.local/spire/agent/k8s_psat/agentcube-cluster/67790303-3657-42d6-bf4f-c3833ec6dd5e
+Revision         : 0
+```
+
+## Step 4 - Verify the sidecar and certificates
+
+Confirm that both the Router and WorkloadManager pods now show `2/2` containers
+(the main container + the `spiffe-helper` sidecar):
+
+```bash
+kubectl get pods -n agentcube
+```
+
+Expected output:
+
+```
+NAME                               READY   STATUS    RESTARTS        AGE
+agentcube-router-574d98b76-tr2nr   2/2     Running   5 (2m24s ago)   3m17s
+spire-agent-8r9jx                  1/1     Running   3 (2m44s ago)   3m17s
+spire-server-0                     2/2     Running   0               3m17s
+workloadmanager-5797888bd4-jm2qj   2/2     Running   3 (118s ago)    3m17s
+```
+
+Check the Router logs to confirm mTLS is active. You should see a log line
+indicating it is waiting for, and then successfully loading, the certificates:
+
+```bash
+kubectl logs -n agentcube deployment/agentcube-router -c agentcube-router | grep -i mtls
+```
+
+Expected output:
+
+```
+I0601 16:25:21.444099       1 main.go:64] Waiting for Router mTLS cert/key/CA files
+I0601 16:25:21.444259       1 wait.go:46] All mTLS cert/key/CA files are present
+I0601 16:25:21.445161       1 session_manager.go:84] Using https:// for WORKLOAD_MANAGER_URL because mTLS is configured
+I0601 16:25:21.445482       1 session_manager.go:93] Router→WorkloadManager mTLS enabled: expecting server SPIFFE ID spiffe://cluster.local/ns/agentcube/sa/workloadmanager
+```
+
+Do the same for the WorkloadManager:
+
+```bash
+kubectl logs -n agentcube deployment/workloadmanager -c workloadmanager | grep -i mtls
+```
+
+Expected output:
+
+```
+I0601 16:25:22.561316       1 main.go:80] Waiting for WorkloadManager mTLS cert/key/CA files
+I0601 16:25:22.561931       1 wait.go:46] All mTLS cert/key/CA files are present
+I0601 16:25:22.678777       1 server.go:218] WorkloadManager mTLS enabled: accepting clients with valid SPIRE-provisioned certificates
+```
+
+## Step 5 - Test it end-to-end
+
+Deploy a simple agent and invoke it through the Router to confirm the full
+mTLS-secured path works:
+
+```bash
+kubectl apply -f - </ns//sa/.
 // The trust domain defaults to cluster.local and can be overridden with AGENTCUBE_SPIFFE_TRUST_DOMAIN
 // to match the SPIRE trust domain configured by deployment tooling.
-// The namespace defaults to agentcube-system and can be overridden with AGENTCUBE_NAMESPACE.
+// The namespace defaults to agentcube and can be overridden with AGENTCUBE_NAMESPACE.
 var (
 	// RouterSPIFFEID is the SPIFFE identity for the Router component.
 	RouterSPIFFEID = componentSPIFFEID(configuredTrustDomain(), configuredNamespace(), "agentcube-router")
diff --git a/pkg/mtls/spiffeid_test.go b/pkg/mtls/spiffeid_test.go
index bdd328d1..71baede8 100644
--- a/pkg/mtls/spiffeid_test.go
+++ b/pkg/mtls/spiffeid_test.go
@@ -43,8 +43,8 @@ func TestConfiguredNamespace(t *testing.T) {
 }
 
 func TestComponentSPIFFEID(t *testing.T) {
-	got := componentSPIFFEID("example.org", "agentcube-system", "agentcube-router")
-	want := "spiffe://example.org/ns/agentcube-system/sa/agentcube-router"
+	got := componentSPIFFEID("example.org", "agentcube", "agentcube-router")
+	want := "spiffe://example.org/ns/agentcube/sa/agentcube-router"
 	if got != want {
 		t.Fatalf("componentSPIFFEID() = %q, want %q", got, want)
 	}
diff --git a/pkg/mtls/wait.go b/pkg/mtls/wait.go
index 79f5cf1e..71b25876 100644
--- a/pkg/mtls/wait.go
+++ b/pkg/mtls/wait.go
@@ -21,6 +21,8 @@ import (
 	"os"
 	"strings"
 	"time"
+
+	"k8s.io/klog/v2"
 )
 
 // DefaultCertificateFileWaitTimeout bounds the startup race while spiffe-helper writes the initial SVID files.
@@ -41,6 +43,7 @@ func WaitForCertificateFiles(cfg Config, timeout time.Duration) error {
 			return fmt.Errorf("failed to access mTLS cert/key/CA files: %w", err)
 		}
 		if exist {
+			klog.Infof("All mTLS cert/key/CA files are present")
 			return nil
 		}
 		missing = currentMissing
diff --git a/pkg/router/session_manager_test.go b/pkg/router/session_manager_test.go
index 8b585f4b..0b61f28d 100644
--- a/pkg/router/session_manager_test.go
+++ b/pkg/router/session_manager_test.go
@@ -636,7 +636,7 @@ func generateTestCertsForRouter(t *testing.T, dir string) (certFile, keyFile, ca
 	if err != nil {
 		t.Fatalf("generate leaf key: %v", err)
 	}
-	spiffeURL, _ := url.Parse("spiffe://cluster.local/ns/agentcube-system/sa/agentcube-router")
+	spiffeURL, _ := url.Parse("spiffe://cluster.local/ns/agentcube/sa/agentcube-router")
 	leafTemplate := &x509.Certificate{
 		SerialNumber: big.NewInt(2),
 		Subject:      pkix.Name{Organization: []string{"Test Router"}},
diff --git a/test/e2e/run_e2e.sh b/test/e2e/run_e2e.sh
index 0c215718..e14c1c5b 100755
--- a/test/e2e/run_e2e.sh
+++ b/test/e2e/run_e2e.sh
@@ -18,7 +18,7 @@ WORKLOAD_MANAGER_IMAGE=${WORKLOAD_MANAGER_IMAGE:-workloadmanager:latest}
 ROUTER_IMAGE=${ROUTER_IMAGE:-agentcube-router:latest}
 PICOD_IMAGE=${PICOD_IMAGE:-picod:latest}
 REDIS_IMAGE=${REDIS_IMAGE:-redis:7-alpine}
-AGENTCUBE_NAMESPACE=${AGENTCUBE_NAMESPACE:-agentcube-system}
+AGENTCUBE_NAMESPACE=${AGENTCUBE_NAMESPACE:-agentcube}
 WORKLOAD_NAMESPACE=${WORKLOAD_NAMESPACE:-agentcube}
 E2E_VENV_DIR=${E2E_VENV_DIR:-/tmp/agentcube-e2e-venv}
 MCP_K8S_LOCAL_PORT=${MCP_K8S_LOCAL_PORT:-19446}