Skip to content

docs: add tutorial for securing internal traffic with SPIRE (mTLS)#364

Open
mahil-2040 wants to merge 3 commits into
volcano-sh:mainfrom
mahil-2040:docs/add-spire-auth-guide
Open

docs: add tutorial for securing internal traffic with SPIRE (mTLS)#364
mahil-2040 wants to merge 3 commits into
volcano-sh:mainfrom
mahil-2040:docs/add-spire-auth-guide

Conversation

@mahil-2040
Copy link
Copy Markdown
Contributor

What type of PR is this?

/kind documentation

What this PR does / why we need it:

Description

This PR adds a comprehensive user guide (docs/tutorials/internal-auth-spire.md) for the newly implemented SPIRE mTLS internal authentication system.

The tutorial explains the architecture of our zero-trust control plane and acts as a definitive reference for developers and operators.

Key Additions

  • Architecture Overview: Explains how the Router and WorkloadManager cryptographically verify each other's SPIFFE identities, and how the CertWatcher achieves zero-downtime certificate rotation via fsnotify.
  • Configuration Guide: Documents the exact CLI flags required to enable mTLS on both components (--mtls-* for Router, --tls-* for WorkloadManager).
  • Sidecar Mechanics: Details how the spiffe-helper sidecar is deployed alongside the control plane pods to automatically provision and rotate SVIDs.
  • Architectural Justification: Adds a section explicitly explaining why PicoD and AgentRuntime sandboxes continue to use JWT authentication (avoiding TLS handshake overhead to preserve ultra-low cold-start latency, and keeping user-defined runtimes pure).

Which issue(s) this PR fixes:
Fixes #

Special notes for your reviewer:

Does this PR introduce a user-facing change?:

NO

NONE

Copilot AI review requested due to automatic review settings May 27, 2026 17:44
@volcano-sh-bot volcano-sh-bot added the kind/documentation Improvements or additions to documentation label May 27, 2026
Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request adds a new tutorial documentation file explaining how SPIRE is used to secure internal traffic with mTLS in AgentCube. The reviewer pointed out two critical discrepancies between the documentation and the actual codebase: the documentation references a CertWatcher and hot-reloading mechanism that does not exist in the code, and it lists incorrect CLI flags (--mtls-* and --tls-ca) that are not supported by the current implementation.

- **Strict Identity Enforcement**: The Router and WorkloadManager have hardcoded SPIFFE IDs.
- The WorkloadManager accepts any client presenting a valid certificate signed by the trusted CA pool (authorization is handled later at the application layer).
- However, the Router strictly verifies it's actually talking to the `WorkloadManagerSPIFFEID` before it forwards any traffic, preventing spoofed servers.
- **Zero-Downtime Rotation**: A new `CertWatcher` actively monitors the certificates on disk (using `fsnotify`). When SPIRE rotates the short-lived certs, they are hot-reloaded without dropping any active connections.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The documentation describes a CertWatcher that monitors certificates on disk using fsnotify for zero-downtime rotation. However, in pkg/router/server.go and pkg/workloadmanager/server.go, the servers are started using standard ListenAndServeTLS with static certificate paths, and there is no CertWatcher or fsnotify implementation in the codebase.\n\nPlease update the documentation to reflect the current capabilities, or implement the CertWatcher and hot-reloading mechanism in the server code.

Comment on lines +29 to +41
For the **Router**, use the `mtls` prefix:
```bash
--mtls-cert=/path/to/tls.crt
--mtls-key=/path/to/tls.key
--mtls-ca=/path/to/ca.crt
```

For the **WorkloadManager**, use the `tls` prefix:
```bash
--tls-cert=/path/to/tls.crt
--tls-key=/path/to/tls.key
--tls-ca=/path/to/ca.crt
```
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The documented CLI flags do not match the actual flags implemented in the codebase.\n\nIn cmd/router/main.go and cmd/workload-manager/main.go, the defined flags are --tls-cert and --tls-key (along with --enable-tls). There are no --mtls-* or --tls-ca flags defined in the command-line parsers, nor is there any client CA verification (ClientCAs / ClientAuth) configured in the http.Server setups in pkg/router/server.go or pkg/workloadmanager/server.go.\n\nPlease update the documentation to reflect the actual flags, or implement the corresponding CLI flags and mTLS configuration in the codebase.\n\nFor example, the actual flags are:\nbash\n# For Router\n--enable-tls\n--tls-cert=/path/to/tls.crt\n--tls-key=/path/to/tls.key\n\n# For WorkloadManager\n--enable-tls\n--tls-cert=/path/to/tls.crt\n--tls-key=/path/to/tls.key\n

@codecov-commenter
Copy link
Copy Markdown

codecov-commenter commented May 27, 2026

⚠️ Please install the 'codecov app svg image' to ensure uploads and comments are reliably processed by Codecov.

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 55.56%. Comparing base (524e55e) to head (a145259).
⚠️ Report is 103 commits behind head on main.
❗ Your organization needs to install the Codecov GitHub app to enable full functionality.

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #364      +/-   ##
==========================================
+ Coverage   47.57%   55.56%   +7.99%     
==========================================
  Files          30       34       +4     
  Lines        2819     3191     +372     
==========================================
+ Hits         1341     1773     +432     
+ Misses       1338     1239      -99     
- Partials      140      179      +39     
Flag Coverage Δ
unittests 55.56% <100.00%> (+7.99%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Copy link
Copy Markdown
Member

@hzxuzhonghu hzxuzhonghu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This isnot a good tutorial, think about if you are a new user, can you run it by this guide?

Signed-off-by: Mahil Patel <mahilpatel0808@gmail.com>
Signed-off-by: Mahil Patel <mahilpatel0808@gmail.com>
@mahil-2040 mahil-2040 force-pushed the docs/add-spire-auth-guide branch from 625d1fd to 8c07a2f Compare May 28, 2026 08:14
@mahil-2040
Copy link
Copy Markdown
Contributor Author

This isnot a good tutorial, think about if you are a new user, can you run it by this guide?

I have completely rewritten the guide with step by step instructions, commands and expected outputs of each step, PTAL!

@mahil-2040 mahil-2040 requested a review from hzxuzhonghu May 28, 2026 08:17
3. Confirm AgentCube is running without SPIRE:

```bash
kubectl get pods -n agentcube-system
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The linked Getting Started guide installs into the agentcube namespace, but every command here uses agentcube-system. A new user following both guides will check and upgrade the wrong release unless the namespace is made consistent or parameterized.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I checked the inconsistency, the Getting Started guide installs into -n agentcube, but pkg/mtls/spiffeid.go defaults to agentcube-system. In practice this default is never hit in Helm deployments because both manifests inject AGENTCUBE_NAMESPACE from the pod's actual namespace via fieldRef: metadata.namespace. But it's still confusing to have two different names.

Would you be okay with me changing the Go default from agentcube-system to agentcube to make it consistent with the Getting Started guide? I'd also update the 4 unit test files and the e2e script that reference the old default.

kubelet certificates, you can omit them.

```bash
helm upgrade agentcube manifests/charts/base \
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add --reuse-values or repeat the required install values here. helm upgrade without it rebuilds values from chart defaults, so users who set Redis, images, RBAC, or service account values during install can lose them when enabling SPIRE.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

added the --reuse-values to preserve the install-time values.

kubectl logs -n agentcube-system deployment/agentcube-router -c agentcube-router | grep -i mtls
```

Expected output:
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The expected output includes All mTLS cert/key/CA files are present, but the current code returns silently after the files appear. Please either update the expected output or add this log, otherwise this verification step asks users to look for a line that never appears.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added klog.Infof to wait.go to match the expected output

plane components, run the Helm upgrade again with `spire.enabled=false`:

```bash
helm upgrade agentcube manifests/charts/base \
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same issue here: disabling SPIRE should preserve the existing release values. Without --reuse-values or the original required --sets, cleanup can unintentionally reset the deployment while only trying to turn SPIRE off.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed this too.

--set spire.enabled=false
```

This removes all SPIRE resources (Server, Agent, CRDs, sidecars) and the
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This does not remove the CRDs; they were installed manually in Step 1 and are deleted separately below. This should say it removes the SPIRE workloads, sidecars, and ClusterSPIFFEID resources from this Helm release.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is my mistake, accidently added CRDs, but now removed it.

- Added klog.Infof to wait.go so expected output is conistent with what actually appears in logs, matching the tutorial's expected output
- Helm upgrade removes SPIRE workloads and sidecars, not CRDs (those are removed separately via kubectl)
- added --reuse-values flag to preserve the install-time values

Signed-off-by: Mahil Patel <mahilpatel0808@gmail.com>
@volcano-sh-bot
Copy link
Copy Markdown
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign kevin-wangzefeng for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@mahil-2040 mahil-2040 requested a review from acsoto May 31, 2026 05:26
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

kind/documentation Improvements or additions to documentation size/L

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants