@bryan-cox bryan-cox commented Nov 4, 2025

Summary

Fixes Azure Disk and File CSI drivers on Azure self-managed hosted clusters by adding a token-minter sidecar container.

Problem

On Azure self-managed hosted clusters (HyperShift mode), Azure Disk and File CSI driver controllers fail to provision volumes with errors:

  • Azure-Disk: WorkloadIdentityCredential: open /var/run/secrets/openshift/serviceaccount/token: no such file or directory
  • Azure-File: failed to ensure storage account: clientFactory is nil

The CSI driver controllers run in the management cluster but need guest cluster service account tokens for Azure workload identity authentication. The token file at /var/run/secrets/openshift/serviceaccount/token does not exist because there is no mechanism to create it.

Solution

This PR adds a shared WithTokenMinter(serviceAccountName string) deployment hook function in pkg/driver/common/operator/hooks.go that both Azure Disk and File CSI driver operators use to inject a token-minter sidecar container.

The token-minter sidecar:

  • Runs the /usr/bin/control-plane-operator token-minter command
  • Creates service account tokens for the guest cluster namespace openshift-cluster-csi-drivers
  • Writes tokens to /var/run/secrets/openshift/serviceaccount/token in a shared emptyDir volume
  • Uses the service-network-admin-kubeconfig secret to access the guest cluster
  • Reads HYPERSHIFT_IMAGE env var directly (not placeholder) since deployment hooks run after asset replacement

Note: The bound-sa-token emptyDir volume and hosted-kubeconfig secret volume are already added by the HyperShift patch files (controller_add_hypershift_controller.yaml), so the hook only adds the token-minter container.
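As a rough illustration of the hook described above, here is a minimal sketch in plain Go. It is not the code from hooks.go: the `container` and `deployment` structs are simplified, hypothetical stand-ins for the Kubernetes API types, the flag names in `Args` and the kubeconfig mount path are assumptions, and the image is passed in as a parameter to keep the example self-contained.

```go
package main

import "fmt"

// volumeMount, container and deployment are simplified, hypothetical
// stand-ins for the Kubernetes corev1/appsv1 types.
type volumeMount struct {
	Name      string
	MountPath string
}

type container struct {
	Name         string
	Image        string
	Command      []string
	Args         []string
	VolumeMounts []volumeMount
}

type deployment struct {
	Containers []container
}

const (
	guestNamespace = "openshift-cluster-csi-drivers"
	tokenDir       = "/var/run/secrets/openshift/serviceaccount"
	kubeconfigDir  = "/etc/hosted-kubernetes" // assumed mount path
)

// withTokenMinter returns a hook that appends a token-minter sidecar.
// It only adds the container; the bound-sa-token and hosted-kubeconfig
// volumes are expected to exist already.
func withTokenMinter(serviceAccountName, image string) func(*deployment) error {
	return func(d *deployment) error {
		if image == "" {
			return nil // no image provided: leave the deployment unchanged
		}
		d.Containers = append(d.Containers, container{
			Name:    "token-minter",
			Image:   image,
			Command: []string{"/usr/bin/control-plane-operator", "token-minter"},
			// Flag names below are assumptions, not confirmed by this PR.
			Args: []string{
				"--service-account-namespace=" + guestNamespace,
				"--service-account-name=" + serviceAccountName,
				"--token-file=" + tokenDir + "/token",
				"--kubeconfig=" + kubeconfigDir + "/kubeconfig",
			},
			VolumeMounts: []volumeMount{
				{Name: "bound-sa-token", MountPath: tokenDir},
				{Name: "hosted-kubeconfig", MountPath: kubeconfigDir},
			},
		})
		return nil
	}
}

func main() {
	d := &deployment{Containers: []container{{Name: "csi-driver"}}}
	hook := withTokenMinter("azure-disk-csi-driver-controller-sa", "example.invalid/hypershift:latest")
	if err := hook(d); err != nil {
		panic(err)
	}
	for _, c := range d.Containers {
		fmt.Println(c.Name, c.Image)
	}
}
```

Returning a closure lets both drivers share the same hook while passing their own service account name.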

Platform-Specific Behavior

The hook is added to both Azure Disk and File drivers. The platform-specific behavior is controlled by cluster-storage-operator:

  • Self-managed Azure: cluster-storage-operator passes HYPERSHIFT_IMAGE env var to the CSI driver operators, enabling token-minter functionality
  • ARO HCP: cluster-storage-operator does NOT pass HYPERSHIFT_IMAGE, as ARO HCP uses Secret Provider Class with managed identities instead

This follows the same pattern already used by AWS EBS CSI driver and Azure Cloud Controller Manager.
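The platform split above comes down to whether HYPERSHIFT_IMAGE is set. A minimal sketch of that gate follows; the function name `tokenMinterImage` and the injectable `lookup` parameter are hypothetical and exist only to make the example testable.

```go
package main

import (
	"fmt"
	"os"
)

const hyperShiftImageEnv = "HYPERSHIFT_IMAGE"

// tokenMinterImage reports which image the sidecar should use and whether
// the sidecar should be injected at all. lookup has the same signature as
// os.LookupEnv so tests can substitute a fake environment.
func tokenMinterImage(lookup func(string) (string, bool)) (string, bool) {
	image, ok := lookup(hyperShiftImageEnv)
	if !ok || image == "" {
		return "", false
	}
	return image, true
}

func main() {
	if image, ok := tokenMinterImage(os.LookupEnv); ok {
		fmt.Println("self-managed Azure: inject token-minter using", image)
	} else {
		fmt.Println("HYPERSHIFT_IMAGE not set: leave the deployment unchanged")
	}
}
```

An unset variable and an empty value are treated the same way here, which is an assumption of this sketch.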

Changes

  • pkg/driver/common/operator/hooks.go: Added WithTokenMinter(serviceAccountName string) deployment hook (lines 257-301)
  • pkg/driver/common/operator/replacer.go: Fixed copy-paste bug checking wrong variable for HYPERSHIFT_IMAGE (line 64)
  • pkg/driver/azure-disk/azure_disk.go: Use common WithTokenMinter() with azure-disk-csi-driver-controller-sa (line 228)
  • pkg/driver/azure-file/azure_file.go: Use common WithTokenMinter() with azure-file-csi-driver-controller-sa (line 187)
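The replacer.go fix in the list above can be illustrated with a reduced example. The commit message describes the bug as gating the HYPERSHIFT_IMAGE replacement on `csiDriver != ""` instead of `hyperShiftImage != ""`. The `render` function below is a hypothetical reduction written for this illustration, not the real DefaultReplacements().

```go
package main

import (
	"fmt"
	"strings"
)

// render replaces the ${HYPERSHIFT_IMAGE} placeholder in manifest.
// fixed=false reproduces the wrong gate; fixed=true uses the corrected one.
func render(manifest, csiDriver, hyperShiftImage string, fixed bool) string {
	gate := csiDriver != "" // bug: checks the wrong variable
	if fixed {
		gate = hyperShiftImage != "" // fix: check the value being substituted
	}
	var pairs []string
	if gate {
		pairs = append(pairs, "${HYPERSHIFT_IMAGE}", hyperShiftImage)
	}
	return strings.NewReplacer(pairs...).Replace(manifest)
}

func main() {
	manifest := "image: ${HYPERSHIFT_IMAGE}"
	// The image is set but csiDriver is empty, so the buggy gate skips it.
	fmt.Println("buggy:", render(manifest, "", "example.invalid/hypershift:latest", false))
	fmt.Println("fixed:", render(manifest, "", "example.invalid/hypershift:latest", true))
}
```

With the wrong gate the placeholder survives into the rendered manifest, which matches the invalid image reference described in the commit message.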

Testing

On an Azure self-managed hosted cluster:

  1. Verify CSI driver controller deployments have the token-minter sidecar
  2. Verify the token file exists in the controller pods
  3. Create PVCs using Azure Disk and Azure File storage classes
  4. Verify PVCs reach Bound status
  5. Verify pods can successfully use the volumes
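Step 2 could be automated with a small check such as the one below, run inside a controller pod. This is only a sketch: `checkTokenFile` is a hypothetical helper, and treating a whitespace-only file as a failure is an assumption.

```go
package main

import (
	"errors"
	"fmt"
	"os"
	"strings"
)

const tokenPath = "/var/run/secrets/openshift/serviceaccount/token"

// checkTokenFile returns an error if the token file is missing or empty.
func checkTokenFile(path string) error {
	data, err := os.ReadFile(path)
	if err != nil {
		return fmt.Errorf("reading token file: %w", err)
	}
	if strings.TrimSpace(string(data)) == "" {
		return errors.New("token file is empty")
	}
	return nil
}

func main() {
	if err := checkTokenFile(tokenPath); err != nil {
		fmt.Println("token check failed:", err)
		return
	}
	fmt.Println("token file present and non-empty")
}
```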


@openshift-ci openshift-ci bot requested review from jsafrane and tsmetana November 4, 2025 15:31
@bryan-cox bryan-cox marked this pull request as draft November 4, 2025 15:32
@openshift-ci openshift-ci bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Nov 4, 2025
@bryan-cox
Member Author

/uncc @jsafrane

@openshift-ci openshift-ci bot removed the request for review from jsafrane November 4, 2025 15:32
@bryan-cox
Member Author

/uncc @tsmetana

@openshift-ci openshift-ci bot removed the request for review from tsmetana November 4, 2025 15:33
@bryan-cox bryan-cox force-pushed the HOSTEDCP-2033 branch 2 times, most recently from bef9f49 to fd11123 November 4, 2025 15:38
@openshift-merge-robot openshift-merge-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Nov 4, 2025
@openshift-merge-robot openshift-merge-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Nov 4, 2025
@bryan-cox bryan-cox changed the title from "fix(azure): add token-minter for self-managed hosted clusters" to "OCPBUGS-63698: fix(azure): add token-minter for self-managed hosted clusters" Nov 4, 2025
@openshift-ci-robot openshift-ci-robot added jira/severity-critical Referenced Jira bug's severity is critical for the branch this PR is targeting. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. labels Nov 4, 2025
@openshift-ci-robot

@bryan-cox: This pull request references Jira Issue OCPBUGS-63698, which is invalid:

  • expected the bug to target the "4.21.0" version, but no target version was set

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

The bug has been updated to refer to the pull request using the external bug tracker.

Details

In response to this:

Summary

Fixes Azure Disk and File CSI drivers on Azure self-managed hosted clusters by adding a token-minter sidecar container.

Problem

On Azure self-managed hosted clusters (HyperShift mode), Azure Disk and File CSI driver controllers fail to provision volumes with errors:

  • Azure-Disk: WorkloadIdentityCredential: open /var/run/secrets/openshift/serviceaccount/token: no such file or directory
  • Azure-File: failed to ensure storage account: clientFactory is nil

The CSI driver controllers run in the management cluster but need guest cluster service account tokens for Azure workload identity authentication. The token file at /var/run/secrets/openshift/serviceaccount/token does not exist because there is no mechanism to create it.

Solution

This PR adds a shared WithTokenMinter(serviceAccountName string) deployment hook function in pkg/driver/common/operator/hooks.go that both Azure Disk and File CSI driver operators use to:

  1. For self-managed Azure clusters: Inject a token-minter sidecar container that creates guest cluster service account tokens
  2. For ARO HCP: Continue using the existing Secret Provider Class approach with managed identities (no changes)

The conditional logic checks for the presence of ARO_HCP_SECRET_PROVIDER_CLASS_FOR_* environment variables:

  • If present → ARO HCP mode (use Secret Provider Class)
  • If absent → Self-managed Azure mode (use token-minter)
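The check described in this earlier revision of the description could look roughly like the sketch below. The function name `isAROHCP` is hypothetical, and only the presence of a variable with the given prefix is tested, not its value.

```go
package main

import (
	"fmt"
	"os"
	"strings"
)

const aroHCPPrefix = "ARO_HCP_SECRET_PROVIDER_CLASS_FOR_"

// isAROHCP reports whether any KEY=VALUE entry in environ has a key that
// starts with the ARO HCP Secret Provider Class prefix.
func isAROHCP(environ []string) bool {
	for _, kv := range environ {
		key, _, _ := strings.Cut(kv, "=")
		if strings.HasPrefix(key, aroHCPPrefix) {
			return true
		}
	}
	return false
}

func main() {
	if isAROHCP(os.Environ()) {
		fmt.Println("ARO HCP mode: use Secret Provider Class")
	} else {
		fmt.Println("self-managed Azure mode: use token-minter")
	}
}
```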

The token-minter sidecar:

  • Runs the /usr/bin/control-plane-operator token-minter command
  • Creates service account tokens for the guest cluster namespace openshift-cluster-csi-drivers
  • Writes tokens to /var/run/secrets/openshift/serviceaccount/token in a shared emptyDir volume
  • Uses the service-network-admin-kubeconfig secret to access the guest cluster

Note: The bound-sa-token emptyDir volume and hosted-kubeconfig secret volume are already added by the HyperShift patch files (controller_add_hypershift_controller.yaml), so the hook only adds the token-minter container.

This follows the same pattern already used by AWS EBS CSI driver and Azure Cloud Controller Manager.

Changes

  • pkg/driver/common/operator/hooks.go: Added WithTokenMinter(serviceAccountName string) deployment hook
  • pkg/driver/azure-disk/azure_disk.go: Use common WithTokenMinter() with azure-disk-csi-driver-controller-sa
  • pkg/driver/azure-file/azure_file.go: Use common WithTokenMinter() with azure-file-csi-driver-controller-sa

Testing

On an Azure self-managed hosted cluster:

  1. Verify CSI driver controller deployments have the token-minter sidecar
  2. Verify the token file exists in the controller pods
  3. Create PVCs using Azure Disk and Azure File storage classes
  4. Verify PVCs reach Bound status
  5. Verify pods can successfully use the volumes


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

Azure Disk and File CSI drivers fail on Azure self-managed hosted
clusters because the service account token at
/var/run/secrets/openshift/serviceaccount/token does not exist.

Add runtime deployment hooks that conditionally inject token-minter
sidecar container for self-managed Azure clusters. The token-minter
creates guest cluster service account tokens that the CSI drivers
use for Azure workload identity authentication.

ARO HCP continues to use Secret Provider Class with managed
identities and is not affected by this change.

Fixes: OCPBUGS-63698
Signed-off-by: Bryan Cox <[email protected]>
Commit-Message-Assisted-by: Claude (via Claude Code)

The token-minter image should use the ${HYPERSHIFT_IMAGE} placeholder
instead of reading os.Getenv() directly. The placeholder is replaced
at runtime by the DefaultReplacements() function when the operator
processes the deployment.

This matches the pattern used in AWS EBS static patches.

Fix copy-paste error in DefaultReplacements() where HYPERSHIFT_IMAGE
placeholder replacement was incorrectly gated on csiDriver != ""
instead of hyperShiftImage != "".

This bug prevented ${HYPERSHIFT_IMAGE} placeholders from being
replaced with the actual image value, causing token-minter containers
to have invalid image references.

Signed-off-by: Bryan Cox <[email protected]>
Commit-Message-Assisted-by: Claude (via Claude Code)

Deployment hooks run after asset placeholder replacement, so
placeholders added by hooks never get replaced. Fix by directly
reading os.Getenv("HYPERSHIFT_IMAGE") in the hook instead of using
a placeholder string.

Also add conditional behavior: if HYPERSHIFT_IMAGE is not set, skip
adding the token-minter container. This allows the same hook to work
for both self-managed Azure (where cluster-storage-operator sets
HYPERSHIFT_IMAGE) and ARO HCP (where it doesn't).

Signed-off-by: Bryan Cox <[email protected]>
Commit-Message-Assisted-by: Claude (via Claude Code)
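The ordering problem described in the last commit message can be reproduced with a small simulation. This is a sketch under simplifying assumptions: manifests are plain strings, and `build` and `hook` are hypothetical names rather than the operator's real types.

```go
package main

import (
	"fmt"
	"strings"
)

// hook mutates a manifest after placeholder replacement has already run.
type hook func(manifest string) string

// build mimics the described order: replace placeholders first, run hooks second.
func build(asset string, replacements map[string]string, hooks ...hook) string {
	var pairs []string
	for placeholder, value := range replacements {
		pairs = append(pairs, placeholder, value)
	}
	manifest := strings.NewReplacer(pairs...).Replace(asset)
	for _, h := range hooks {
		manifest = h(manifest)
	}
	return manifest
}

func main() {
	image := "example.invalid/hypershift:latest"
	replacements := map[string]string{"${HYPERSHIFT_IMAGE}": image}

	// A hook that emits a placeholder is too late: nothing replaces it.
	withPlaceholder := func(m string) string { return m + "\nsidecar: ${HYPERSHIFT_IMAGE}" }
	// A hook that uses the resolved value directly is unaffected by ordering.
	withValue := func(m string) string { return m + "\nsidecar: " + image }

	fmt.Println(build("patch: ${HYPERSHIFT_IMAGE}", replacements, withPlaceholder))
	fmt.Println(build("patch: ${HYPERSHIFT_IMAGE}", replacements, withValue))
}
```

In the first output the sidecar line still contains the literal placeholder, which is why the hook reads the environment variable directly instead.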
@bryan-cox
Member Author

/test all

@bryan-cox
Member Author

/jira refresh

@openshift-ci-robot openshift-ci-robot added jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. and removed jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. labels Dec 9, 2025
@openshift-ci-robot

@bryan-cox: This pull request references Jira Issue OCPBUGS-63698, which is valid. The bug has been moved to the POST state.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.21.0) matches configured target version for branch (4.21.0)
  • bug is in the state ASSIGNED, which is one of the valid states (NEW, ASSIGNED, POST)

No GitHub users were found matching the public email listed for the QA contact in Jira ([email protected]), skipping review request.

Details

In response to this:

/jira refresh


@openshift-ci-robot openshift-ci-robot added the verified Signifies that the PR passed pre-merge verification criteria label Jan 5, 2026
@openshift-ci-robot

@bryan-cox: This PR has been marked as verified by @duanwei33.

Details

In response to this:

/verified by @duanwei33


@openshift-ci-robot

/retest-required

Remaining retests: 0 against base HEAD 522d61f and 2 for PR HEAD 0af2834 in total

@openshift-ci-robot

/retest-required

Remaining retests: 0 against base HEAD 34fad4d and 1 for PR HEAD 0af2834 in total

@bryan-cox
Member Author

/test hypershift-e2e-aks

@openshift-ci-robot

/retest-required

Remaining retests: 0 against base HEAD 3b21c8b and 0 for PR HEAD 0af2834 in total

@openshift-ci-robot

/hold

Revision 0af2834 was retested 3 times: holding

@openshift-ci openshift-ci bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jan 6, 2026
@duanwei33
Contributor

The last e2e-azurestack-csi job failed on Dec 10; let's launch a new job to check.
/test e2e-azurestack-csi

@duanwei33
Contributor

In the latest failed job: https://prow.ci.openshift.org/view/gs/test-platform-results/pr-logs/pull/openshift_csi-operator/461/pull-ci-openshift-csi-operator-main-e2e-azurestack-csi/2008443855589871616

All storage cases (and other cases) passed; the job failed with the following error log, possibly in the cleanup phase:

{"component":"entrypoint","error":"wrapped process failed: exit status 1","file":"sigs.k8s.io/prow/pkg/entrypoint/run.go:84","func":"sigs.k8s.io/prow/pkg/entrypoint.Options.internalRun","level":"error","msg":"Error executing test process","severity":"error","time":"2026-01-06T10:57:47Z"}
error: failed to execute wrapped command: exit status 1 

I don't think it is a product issue or caused by this PR.

@bryan-cox
Member Author

/hold cancel

@openshift-ci openshift-ci bot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jan 6, 2026
@jsafrane
Contributor

jsafrane commented Jan 6, 2026

/override ci/prow/okd-scos-images

@openshift-ci
Contributor

openshift-ci bot commented Jan 6, 2026

@jsafrane: Overrode contexts on behalf of jsafrane: ci/prow/okd-scos-images

Details

In response to this:

/override ci/prow/okd-scos-images

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@openshift-ci-robot

/retest-required

Remaining retests: 0 against base HEAD 3b21c8b and 2 for PR HEAD 0af2834 in total

@bryan-cox
Member Author

/test e2e-azure-csi

@openshift-ci-robot

/retest-required

Remaining retests: 0 against base HEAD 390ff75 and 1 for PR HEAD 0af2834 in total

@bryan-cox
Member Author

/test e2e-azure-file-nfs-csi

@openshift-ci-robot

/retest-required

Remaining retests: 0 against base HEAD e27d70e and 0 for PR HEAD 0af2834 in total

@openshift-ci-robot

/hold

Revision 0af2834 was retested 3 times: holding

@openshift-ci openshift-ci bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jan 8, 2026
@bryan-cox
Member Author

/hold cancel

@openshift-ci openshift-ci bot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jan 8, 2026
@openshift-ci-robot

/retest-required

Remaining retests: 0 against base HEAD e27d70e and 2 for PR HEAD 0af2834 in total

@bryan-cox
Member Author

/test hypershift-e2e-aks

@openshift-ci
Contributor

openshift-ci bot commented Jan 8, 2026

@bryan-cox: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name | Commit | Details | Required | Rerun command
ci/prow/e2e-azurestack-csi | 0af2834 | link | false | /test e2e-azurestack-csi

Full PR test history. Your PR dashboard.

Details


@bryan-cox
Member Author

/test hypershift-e2e-aks

@openshift-merge-bot openshift-merge-bot bot merged commit 33e93a6 into openshift:main Jan 8, 2026
21 of 22 checks passed
@openshift-ci-robot

@bryan-cox: Jira Issue Verification Checks: Jira Issue OCPBUGS-63698
✔️ This pull request was pre-merge verified.
✔️ All associated pull requests have merged.
✔️ All associated, merged pull requests were pre-merge verified.

Jira Issue OCPBUGS-63698 has been moved to the MODIFIED state and will move to the VERIFIED state when the change is available in an accepted nightly payload. 🕓


@bryan-cox bryan-cox deleted the HOSTEDCP-2033 branch January 8, 2026 13:40

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. jira/severity-critical Referenced Jira bug's severity is critical for the branch this PR is targeting. jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. lgtm Indicates that a PR is ready to be merged. verified Signifies that the PR passed pre-merge verification criteria
