docs: add tutorial for securing internal traffic with SPIRE (mTLS)#364
docs: add tutorial for securing internal traffic with SPIRE (mTLS)#364mahil-2040 wants to merge 3 commits into
Conversation
There was a problem hiding this comment.
Code Review
This pull request adds a new tutorial documentation file explaining how SPIRE is used to secure internal traffic with mTLS in AgentCube. The reviewer pointed out two critical discrepancies between the documentation and the actual codebase: the documentation references a CertWatcher and hot-reloading mechanism that does not exist in the code, and it lists incorrect CLI flags (--mtls-* and --tls-ca) that are not supported by the current implementation.
| - **Strict Identity Enforcement**: The Router and WorkloadManager have hardcoded SPIFFE IDs. | ||
| - The WorkloadManager accepts any client presenting a valid certificate signed by the trusted CA pool (authorization is handled later at the application layer). | ||
| - However, the Router strictly verifies it's actually talking to the `WorkloadManagerSPIFFEID` before it forwards any traffic, preventing spoofed servers. | ||
| - **Zero-Downtime Rotation**: A new `CertWatcher` actively monitors the certificates on disk (using `fsnotify`). When SPIRE rotates the short-lived certs, they are hot-reloaded without dropping any active connections. |
There was a problem hiding this comment.
The documentation describes a CertWatcher that monitors certificates on disk using fsnotify for zero-downtime rotation. However, in pkg/router/server.go and pkg/workloadmanager/server.go, the servers are started using standard ListenAndServeTLS with static certificate paths, and there is no CertWatcher or fsnotify implementation in the codebase.\n\nPlease update the documentation to reflect the current capabilities, or implement the CertWatcher and hot-reloading mechanism in the server code.
| For the **Router**, use the `mtls` prefix: | ||
| ```bash | ||
| --mtls-cert=/path/to/tls.crt | ||
| --mtls-key=/path/to/tls.key | ||
| --mtls-ca=/path/to/ca.crt | ||
| ``` | ||
|
|
||
| For the **WorkloadManager**, use the `tls` prefix: | ||
| ```bash | ||
| --tls-cert=/path/to/tls.crt | ||
| --tls-key=/path/to/tls.key | ||
| --tls-ca=/path/to/ca.crt | ||
| ``` |
There was a problem hiding this comment.
The documented CLI flags do not match the actual flags implemented in the codebase.\n\nIn cmd/router/main.go and cmd/workload-manager/main.go, the defined flags are --tls-cert and --tls-key (along with --enable-tls). There are no --mtls-* or --tls-ca flags defined in the command-line parsers, nor is there any client CA verification (ClientCAs / ClientAuth) configured in the http.Server setups in pkg/router/server.go or pkg/workloadmanager/server.go.\n\nPlease update the documentation to reflect the actual flags, or implement the corresponding CLI flags and mTLS configuration in the codebase.\n\nFor example, the actual flags are:\nbash\n# For Router\n--enable-tls\n--tls-cert=/path/to/tls.crt\n--tls-key=/path/to/tls.key\n\n# For WorkloadManager\n--enable-tls\n--tls-cert=/path/to/tls.crt\n--tls-key=/path/to/tls.key\n
|
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## main #364 +/- ##
==========================================
+ Coverage 47.57% 55.56% +7.99%
==========================================
Files 30 34 +4
Lines 2819 3191 +372
==========================================
+ Hits 1341 1773 +432
+ Misses 1338 1239 -99
- Partials 140 179 +39
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
hzxuzhonghu
left a comment
There was a problem hiding this comment.
This isnot a good tutorial, think about if you are a new user, can you run it by this guide?
Signed-off-by: Mahil Patel <mahilpatel0808@gmail.com>
Signed-off-by: Mahil Patel <mahilpatel0808@gmail.com>
625d1fd to
8c07a2f
Compare
I have completely rewritten the guide with step by step instructions, commands and expected outputs of each step, PTAL! |
| 3. Confirm AgentCube is running without SPIRE: | ||
|
|
||
| ```bash | ||
| kubectl get pods -n agentcube-system |
There was a problem hiding this comment.
The linked Getting Started guide installs into the agentcube namespace, but every command here uses agentcube-system. A new user following both guides will check and upgrade the wrong release unless the namespace is made consistent or parameterized.
There was a problem hiding this comment.
I checked the inconsistency, the Getting Started guide installs into -n agentcube, but pkg/mtls/spiffeid.go defaults to agentcube-system. In practice this default is never hit in Helm deployments because both manifests inject AGENTCUBE_NAMESPACE from the pod's actual namespace via fieldRef: metadata.namespace. But it's still confusing to have two different names.
Would you be okay with me changing the Go default from agentcube-system to agentcube to make it consistent with the Getting Started guide? I'd also update the 4 unit test files and the e2e script that reference the old default.
| kubelet certificates, you can omit them. | ||
|
|
||
| ```bash | ||
| helm upgrade agentcube manifests/charts/base \ |
There was a problem hiding this comment.
Please add --reuse-values or repeat the required install values here. helm upgrade without it rebuilds values from chart defaults, so users who set Redis, images, RBAC, or service account values during install can lose them when enabling SPIRE.
There was a problem hiding this comment.
added the --reuse-values to preserve the install-time values.
| kubectl logs -n agentcube-system deployment/agentcube-router -c agentcube-router | grep -i mtls | ||
| ``` | ||
|
|
||
| Expected output: |
There was a problem hiding this comment.
The expected output includes All mTLS cert/key/CA files are present, but the current code returns silently after the files appear. Please either update the expected output or add this log, otherwise this verification step asks users to look for a line that never appears.
There was a problem hiding this comment.
Added klog.Infof to wait.go to match the expected output
| plane components, run the Helm upgrade again with `spire.enabled=false`: | ||
|
|
||
| ```bash | ||
| helm upgrade agentcube manifests/charts/base \ |
There was a problem hiding this comment.
Same issue here: disabling SPIRE should preserve the existing release values. Without --reuse-values or the original required --sets, cleanup can unintentionally reset the deployment while only trying to turn SPIRE off.
| --set spire.enabled=false | ||
| ``` | ||
|
|
||
| This removes all SPIRE resources (Server, Agent, CRDs, sidecars) and the |
There was a problem hiding this comment.
This does not remove the CRDs; they were installed manually in Step 1 and are deleted separately below. This should say it removes the SPIRE workloads, sidecars, and ClusterSPIFFEID resources from this Helm release.
There was a problem hiding this comment.
This is my mistake, accidently added CRDs, but now removed it.
- Added klog.Infof to wait.go so expected output is conistent with what actually appears in logs, matching the tutorial's expected output - Helm upgrade removes SPIRE workloads and sidecars, not CRDs (those are removed separately via kubectl) - added --reuse-values flag to preserve the install-time values Signed-off-by: Mahil Patel <mahilpatel0808@gmail.com>
|
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: The full list of commands accepted by this bot can be found here. DetailsNeeds approval from an approver in each of these files:Approvers can indicate their approval by writing |
What type of PR is this?
/kind documentation
What this PR does / why we need it:
Description
This PR adds a comprehensive user guide (
docs/tutorials/internal-auth-spire.md) for the newly implemented SPIRE mTLS internal authentication system.The tutorial explains the architecture of our zero-trust control plane and acts as a definitive reference for developers and operators.
Key Additions
CertWatcherachieves zero-downtime certificate rotation viafsnotify.--mtls-*for Router,--tls-*for WorkloadManager).spiffe-helpersidecar is deployed alongside the control plane pods to automatically provision and rotate SVIDs.PicoDandAgentRuntimesandboxes continue to use JWT authentication (avoiding TLS handshake overhead to preserve ultra-low cold-start latency, and keeping user-defined runtimes pure).Which issue(s) this PR fixes:
Fixes #
Special notes for your reviewer:
Does this PR introduce a user-facing change?:
NO