@patrickdillon
Contributor

Switch to using Azure marketplace images by default. This bypasses the steps where the installer uploads a VHD and creates a managed image for the cluster. Those steps will still happen on OKD (we are investigating whether a public Shared Image Gallery can achieve the same result for OKD).

The code in the last commit is ripe for refactoring, but the refactoring would be substantial and this PR is already complex enough, so that will need to be a follow-up.

@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Nov 12, 2025
@openshift-ci-robot
Contributor

openshift-ci-robot commented Nov 12, 2025

@patrickdillon: This pull request references CORS-3657 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.21.0" version, but no target version was set.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci openshift-ci bot requested review from jhixson74 and rna-afk November 12, 2025 21:28
@tthvo
Member

tthvo commented Nov 12, 2025

/cc @sadasu
/cc

@openshift-ci openshift-ci bot requested review from sadasu and tthvo November 12, 2025 21:43
Member

@tthvo tthvo left a comment

Cool 👍 I just have a few code comments above.

azi = ext.Marketplace.Azure.NoPurchasePlan.Gen2.URN()
if gen == "V1" {
if mkt.Azure.NoPurchasePlan.Gen1 == nil {
return "", fmt.Errorf("a HyperVGeneration 1 instance was selected but no Gen1 marketplace imagge is available")
Member

Suggested change
- return "", fmt.Errorf("a HyperVGeneration 1 instance was selected but no Gen1 marketplace imagge is available")
+ return "", fmt.Errorf("a HyperVGeneration 1 instance was selected but no Gen1 marketplace image is available")

nit: typo 😁
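
For readers without the diff at hand, the generation fallback under review can be sketched in isolation. This is a minimal approximation, not the installer's code: the `marketplaceImage` and `noPurchasePlan` types, the `imageURN` function, and the Gen2 nil check are hypothetical, and the sample values in `main` are purely illustrative. Only the Gen1/Gen2 branching and the corrected error text mirror the snippet above.

```go
package main

import (
	"errors"
	"fmt"
)

// marketplaceImage is a hypothetical stand-in for one marketplace image entry.
type marketplaceImage struct {
	Publisher, Offer, SKU, Version string
}

// URN renders the image as publisher:offer:sku:version.
func (m marketplaceImage) URN() string {
	return fmt.Sprintf("%s:%s:%s:%s", m.Publisher, m.Offer, m.SKU, m.Version)
}

// noPurchasePlan is a hypothetical container; Gen1 is nil when no
// Gen1 image is published for the architecture.
type noPurchasePlan struct {
	Gen1 *marketplaceImage
	Gen2 *marketplaceImage
}

// imageURN returns the Gen2 image by default and falls back to Gen1
// only when the instance type reports HyperV generation "V1".
func imageURN(imgs noPurchasePlan, gen string) (string, error) {
	if gen == "V1" {
		if imgs.Gen1 == nil {
			return "", errors.New("a HyperVGeneration 1 instance was selected but no Gen1 marketplace image is available")
		}
		return imgs.Gen1.URN(), nil
	}
	// Assumption: this nil check is added for the sketch; the reviewed
	// snippet does not show one for Gen2.
	if imgs.Gen2 == nil {
		return "", errors.New("no Gen2 marketplace image is available")
	}
	return imgs.Gen2.URN(), nil
}

func main() {
	imgs := noPurchasePlan{
		Gen1: &marketplaceImage{Publisher: "examplepub", Offer: "exampleoffer", SKU: "gen1-sku", Version: "1.0.0"},
		Gen2: &marketplaceImage{Publisher: "examplepub", Offer: "exampleoffer", SKU: "gen2-sku", Version: "1.0.0"},
	}
	urn, err := imageURN(imgs, "V1")
	fmt.Println(urn, err)
}
```

Returning an error rather than silently using the Gen2 image keeps a Gen1-only instance type from failing later at VM creation time.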

@patrickdillon
Contributor Author

Feedback has been addressed.

Member

@tthvo tthvo left a comment

/lgtm

Code looks good to me 🚀

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Nov 14, 2025
@patrickdillon
Contributor Author

@gpei For testing, I think the main areas not covered by these existing e2es are:

  • govcloud
  • install with a machinepool of an instance type that only supports hyperv gen1, such as Dsv2

@gpei
Contributor

gpei commented Nov 16, 2025

Thanks for the suggestions @patrickdillon, I ran a set of tests on features that I think are most likely affected by the boot image change. Besides the GovCloud and Gen 1 VM type (which you mentioned), these features also include the following:

  • UserManaged boot diagnostics
  • Confidentialvm vmgueststateonly
  • Confidential trustedlaunch
  • Customize Disk Types
  • Enable Disk Encryption
  • Additional custom Azure disk
  • Accelerated networking type
  • NVMe disk controller support
  • Marketplace image configured in the previous way
  • FIPS enabled
  • ARM arch installation
  • ASH

Overall Results

Most of the functions work correctly, including the two test points you specifically pointed out.

During installation, the boot image specified in the marketplace-rhcos.json will be used. The installer will not create:

  • The storage container used for uploading the image
  • The corresponding Image Gallery

Worker Nodes with HyperV Gen1 Support

For worker nodes using a type that only supports HyperV Gen1:

Test Artifacts: machines.json

The workers are provisioned correctly from the Gen1 image:

"image": {
    "offer": "aro4",
    "publisher": "azureopenshift",
    "resourceID": "",
    "sku": "aro_419",
    "type": "MarketplaceNoPlan",
    "version": "419.6.20250523"
}

MAG (Azure Gov cloud) Job

Test Artifacts: machines.json

The master and worker machines are provisioned from the image specified in marketplace-rhcos.json.

"image": {
    "offer": "aro4",
    "publisher": "azureopenshift",
    "resourceID": "",
    "sku": "419-v2",
    "type": "MarketplaceNoPlan",
    "version": "419.6.20250523"
}

Failure scenarios

Currently, two test cases have been identified as problematic: one related to ConfidentialVM and the other to ASH.


Issue #1: ConfidentialVM

Problem Description

For the ConfidentialVM testing on compute nodes, the worker machines failed to provision because the image doesn't have the ConfidentialVM setting enabled.

Error Log Location: machine-controller.log

Error Message:

E1115 07:57:35.633693       1 actuator.go:84] Machine error: failed to reconcile machine "ci-op-0m2bf8lj-1f224-5m7h7-worker-westeurope1-t7xxh": failed to create vm ci-op-0m2bf8lj-1f224-5m7h7-worker-westeurope1-t7xxh: failed to create VM: cannot create vm: PUT https://management.azure.com/subscriptions/53b8f551-f0fc-4bea-8cba-6d1fefd54c8a/resourceGroups/ci-op-0m2bf8lj-1f224-5m7h7-rg/providers/Microsoft.Compute/virtualMachines/ci-op-0m2bf8lj-1f224-5m7h7-worker-westeurope1-t7xxh
--------------------------------------------------------------------------------
RESPONSE 400: 400 Bad Request
ERROR CODE: BadRequest
--------------------------------------------------------------------------------
{
  "error": {
    "code": "BadRequest",
    "message": "Use of ConfidentialVM setting is not supported for the provided image."
  }
}

I consulted @jinyunma about this error, and she pointed me to a previous bug (OCPBUGS-41300) that had a similar error.

We updated the image's SecurityType when creating the image, so perhaps we need to update the Marketplace image we're using to support these SecurityTypes?


Issue #2: Azure Stack Hub (ASH)

Problem Description

Another issue is on ASH. I noticed that the e2e-azurestack test in this PR also failed, but there wasn't a clear gather-extra log to determine the specific reason. In my test, I saw that the same issue occurred because the worker failed to be created successfully.

Error Log Location: machine-controller.log

Error Message:

E1115 08:09:24.998866       1 actuator.go:84] Machine error: failed to reconcile machine "ci-op-8sfli1mq-9cb48-t2glj-worker-mtcazs-7t4j6": failed to create vm ci-op-8sfli1mq-9cb48-t2glj-worker-mtcazs-7t4j6: failed to create VM: cannot create vm: PUT https://management.mtcazs.wwtatc.com/subscriptions/d751283a-64fa-401b-92a1-58f1750ac0a7/resourceGroups/ci-op-8sfli1mq-9cb48/providers/Microsoft.Compute/virtualMachines/ci-op-8sfli1mq-9cb48-t2glj-worker-mtcazs-7t4j6
--------------------------------------------------------------------------------
RESPONSE 400: 400 Bad Request
ERROR CODE: BadRequest
--------------------------------------------------------------------------------
{
  "error": {
    "code": "BadRequest",
    "message": "Id /subscriptions/d751283a-64fa-401b-92a1-58f1750ac0a7/subscriptions/d751283a-64fa-401b-92a1-58f1750ac0a7/resourceGroups/ci-op-8sfli1mq-9cb48/providers/Microsoft.Compute/images/ci-op-8sfli1mq-9cb48-t2glj is not a valid resource reference."
  }
}

Failed Image Configuration in the worker machineset

"image": {
    "offer": "",
    "publisher": "",
    "resourceID": "/subscriptions/d751283a-64fa-401b-92a1-58f1750ac0a7/resourceGroups/ci-op-8sfli1mq-9cb48/providers/Microsoft.Compute/images/ci-op-8sfli1mq-9cb48-t2glj",
    "sku": "",
    "version": ""
}

Working Image Configuration (Normal Nightly payload on ASH) in the worker machineset

In a normal nightly payload ASH installation, where the workers are provisioned successfully, the image configuration is:

"image": {
    "offer": "",
    "publisher": "",
    "resourceID": "/resourceGroups/ci-op-8bfzbv3z-0da13/providers/Microsoft.Compute/images/ci-op-8bfzbv3z-0da13-97mgz",
    "sku": "",
    "version": ""
}

Note: The resourceID in the failed case already begins with a subscription ID, so the subscription ID is duplicated in the final request.
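
The duplication can be illustrated with a small sketch. It assumes, based only on the two configurations and the error message above, that whatever consumes the machineset prepends `/subscriptions/<id>` to the `resourceID` it is given. `subscriptionRelativeID` is a hypothetical helper, not the fix that landed in this PR, and it uses a case-sensitive prefix match for simplicity.

```go
package main

import (
	"fmt"
	"strings"
)

// subscriptionRelativeID strips a leading "/subscriptions/<id>" pair, if
// present, so that a consumer which prepends the subscription itself does
// not end up with it twice. IDs without that prefix are returned unchanged.
func subscriptionRelativeID(resourceID string) string {
	const prefix = "/subscriptions/"
	if !strings.HasPrefix(resourceID, prefix) {
		return resourceID
	}
	rest := strings.TrimPrefix(resourceID, prefix)
	// rest is "<id>/resourceGroups/..."; drop the "<id>" segment.
	idx := strings.Index(rest, "/")
	if idx < 0 {
		// Only a subscription was given; nothing safe to strip.
		return resourceID
	}
	return rest[idx:]
}

func main() {
	failing := "/subscriptions/d751283a-64fa-401b-92a1-58f1750ac0a7/resourceGroups/ci-op-8sfli1mq-9cb48/providers/Microsoft.Compute/images/ci-op-8sfli1mq-9cb48-t2glj"
	fmt.Println(subscriptionRelativeID(failing))
}
```

Applied to the failing value, the helper yields an ID of the same shape as the working nightly configuration, which starts at `/resourceGroups/`.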


ARM Architecture Testing

Current Status

Because the cluster-bot can't build an ARM arch payload, the ARM installation test can't be completed for now. We can only wait until this PR is merged into the nightly payload.

In the test results available so far, the bootstrap and master machines created by the installer correctly use the ARM image specified in marketplace-rhcos.json:

Output:

VMName                                Publisher       Offer    SKU      Version
------------------------------------  --------------  -------  -------  --------------
ci-op-lk2rm09d-42c7e-qg6v6-bootstrap  azureopenshift  aro4     419-arm  419.6.20250523
ci-op-lk2rm09d-42c7e-qg6v6-master-0   azureopenshift  aro4     419-arm  419.6.20250523
ci-op-lk2rm09d-42c7e-qg6v6-master-1   azureopenshift  aro4     419-arm  419.6.20250523
ci-op-lk2rm09d-42c7e-qg6v6-master-2   azureopenshift  aro4     419-arm  419.6.20250523

Summary

Feature                                       Status      Notes
--------------------------------------------  ----------  ---------------------------------------------------------------
GovCloud                                      ✅ Pass     -
Gen 1 VM type                                 ✅ Pass     Correctly uses aro_419 SKU
NVMe VM type                                  ✅ Pass     -
UserManaged boot diagnostics                  ✅ Pass     -
FIPS enabled                                  ✅ Pass     -
Confidentialvm/Trustedlaunch (compute nodes)  Fail        Image lacks ConfidentialVM support
Disk Types/Encryption/Multi Disk Setup        ✅ Pass     -
Accelerated networking                        ✅ Pass     -
Marketplace image (previous way)              ✅ Pass     -
ARM architecture                              ⚠️ Partial  Bootstrap/master use correct ARM image; full test pending merge
ASH (Azure Stack Hub)                         Fail        Invalid resource reference (duplicate subscription ID)

@patrickdillon
Contributor Author

/hold

Looks like there's a bug in how the azure stack image is specified.

Even more so, I think we are blocked until we can get confidential image support.

Thanks for the awesome help @gpei

@openshift-ci openshift-ci bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Nov 17, 2025
@openshift-ci openshift-ci bot removed the lgtm Indicates that a PR is ready to be merged. label Nov 19, 2025
@patrickdillon
Contributor Author

@gpei I checked into the confidential VM situation, and it would require separate marketplace images for both ConfidentialVM and Trusted Launch. So we are going to stick with image galleries for that scenario (and move everything else to marketplace).

The latest commit SHOULD (I admit I have not tested) handle confidential VM, and it also includes a fix for azurestack. It looks like the e2e should cover azurestack. Can you test confidential VM?
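
The split described above (image galleries for confidential VM and trusted launch, marketplace for everything else) can be sketched as a simple decision function. This is a hedged illustration only: `imageSource`, `chooseImageSource`, and `needsGallery` are hypothetical names, and the security type strings are assumed to match the values a machine pool would carry.

```go
package main

import "fmt"

// imageSource is a hypothetical label for where the boot image comes from.
type imageSource string

const (
	sourceMarketplace imageSource = "Marketplace"
	sourceGallery     imageSource = "ImageGallery"
)

// chooseImageSource keeps the installer-created gallery image for pools
// that request a security type and uses the marketplace image otherwise.
func chooseImageSource(securityType string) imageSource {
	switch securityType {
	case "ConfidentialVM", "TrustedLaunch": // assumed spellings
		return sourceGallery
	default:
		return sourceMarketplace
	}
}

// needsGallery reports whether any pool in the cluster still requires
// the installer to create an image gallery.
func needsGallery(securityTypes ...string) bool {
	for _, st := range securityTypes {
		if chooseImageSource(st) == sourceGallery {
			return true
		}
	}
	return false
}

func main() {
	fmt.Println(chooseImageSource("ConfidentialVM"))
	fmt.Println(chooseImageSource(""))
	fmt.Println(needsGallery("", "TrustedLaunch"))
}
```

Evaluating the decision per pool, then aggregating, means a single confidential machine pool is enough to trigger gallery creation while the other pools continue to use the marketplace image.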

@gpei
Contributor

gpei commented Nov 19, 2025

@patrickdillon Ack, thanks for the quick fix!

@gpei
Contributor

gpei commented Nov 19, 2025

/payload-job periodic-ci-openshift-openshift-tests-private-release-4.21-amd64-nightly-azure-ipi-confidentialvm-vmgueststateonly-mini-perm-f7

@openshift-ci
Contributor

openshift-ci bot commented Nov 19, 2025

@gpei: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command

  • periodic-ci-openshift-openshift-tests-private-release-4.21-amd64-nightly-azure-ipi-confidentialvm-vmgueststateonly-mini-perm-f7

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/7528b8c0-c4f5-11f0-9ff3-80dfb6c3839d-0

@gpei
Contributor

gpei commented Nov 19, 2025

/payload-job periodic-ci-openshift-openshift-tests-private-release-4.21-amd64-nightly-azure-ipi-confidential-trustedlaunch-mini-perm-f7

@openshift-ci
Contributor

openshift-ci bot commented Nov 19, 2025

@gpei: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command

  • periodic-ci-openshift-openshift-tests-private-release-4.21-amd64-nightly-azure-ipi-confidential-trustedlaunch-mini-perm-f7

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/a65cda20-c4f5-11f0-9f5b-3e54ea60f0e6-0

@gpei
Contributor

gpei commented Nov 19, 2025

/test e2e-azurestack

@gpei
Contributor

gpei commented Nov 19, 2025

Morning Patrick. It appears that the ASH issue has been fixed. I can see that the worker was successfully created in another test job. The ultimate failure is the API version incompatibility issue we already know about.

Regarding Confidentialvm, in addition to the problem Thuan pointed out (the custom image was not created during the confidentialvm installation), there might also be a permissions issue.

I see the following error in the azure-ipi-confidentialvm-vmgueststateonly-mini-perm job, which looks like a missing Microsoft.Compute/galleries/images/read permission.

level=warning msg=Condition Ready has status: "False", reason: "Failed", message: "virtualmachine failed to create or update. err: failed to create or update resource ci-op-1cdkhcs3-c2c65-lpfbm-rg/ci-op-1cdkhcs3-c2c65-lpfbm-bootstrap (service: virtualmachine): PUT https://management.azure.com/subscriptions/53b8f551-f0fc-4bea-8cba-6d1fefd54c8a/resourceGroups/ci-op-1cdkhcs3-c2c65-lpfbm-rg/providers/Microsoft.Compute/virtualMachines/ci-op-1cdkhcs3-c2c65-lpfbm-bootstrap\n--------------------------------------------------------------------------------\nRESPONSE 403: 403 Forbidden\nERROR CODE: LinkedAuthorizationFailed\n--------------------------------------------------------------------------------\n{\n  \"error\": {\n    \"code\": \"LinkedAuthorizationFailed\",\n    \"message\": \"The client '5dad503c-4c25-4291-a7cb-eb1e58422d56' with object id 'fc8a44c4-3dc5-4f2b-bfb8-f4f9e573641d' has permission to perform action 'Microsoft.Compute/virtualMachines/write' on scope '/subscriptions/53b8f551-f0fc-4bea-8cba-6d1fefd54c8a/resourceGroups/ci-op-1cdkhcs3-c2c65-lpfbm-rg/providers/Microsoft.Compute/virtualMachines/ci-op-1cdkhcs3-c2c65-lpfbm-bootstrap'; however, it does not have permission to perform action(s) '/read' on the linked scope(s) '/subscriptions/53b8f551-f0fc-4bea-8cba-6d1fefd54c8a/resourceGroups//providers/Microsoft.Compute/galleries/gallery_ci_op_1cdkhcs3_c2c65_lpfbm/images/ci-op-1cdkhcs3-c2c65-lpfbm-gen2' (respectively) or the linked scope(s) are invalid.\"\n  }\n}\n--------------------------------------------------------------------------------\n"

However, this permission should already be granted in our CI for the Azure minimal-permission installation tests, so this may need further investigation.
For comparison, a confidentialvm job that does not use minimal permissions reports a 400 error when looking up the expected image.

level=warning msg=Condition Ready has status: "False", reason: "Failed", message: "virtualmachine failed to create or update. err: failed to create or update resource ci-op-1ggfsp7t-56add-mw5fd-rg/ci-op-1ggfsp7t-56add-mw5fd-bootstrap (service: virtualmachine): PUT https://management.azure.com/subscriptions/53b8f551-f0fc-4bea-8cba-6d1fefd54c8a/resourceGroups/ci-op-1ggfsp7t-56add-mw5fd-rg/providers/Microsoft.Compute/virtualMachines/ci-op-1ggfsp7t-56add-mw5fd-bootstrap\n--------------------------------------------------------------------------------\nRESPONSE 400: 400 Bad Request\nERROR CODE: BadRequest\n--------------------------------------------------------------------------------\n{\n  \"error\": {\n    \"code\": \"BadRequest\",\n    \"message\": \"Id /subscriptions/53b8f551-f0fc-4bea-8cba-6d1fefd54c8a/resourceGroups//providers/Microsoft.Compute/galleries/gallery_ci_op_1ggfsp7t_56add_mw5fd/images/ci-op-1ggfsp7t-56add-mw5fd-gen2 is not a valid resource reference.\"\n  }\n}\n--------------------------------------------------------------------------------\n"
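
Both error messages above reference a gallery image ID containing `resourceGroups//providers`, that is, an empty resource group segment. A pre-flight check along the following lines would surface that before the request reaches Azure. This is a sketch under that observation only; `validateResourceID` is a hypothetical helper and is not part of the installer.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// validateResourceID rejects IDs that contain an empty path segment, such
// as ".../resourceGroups//providers/...", and names the preceding key.
func validateResourceID(id string) error {
	if !strings.HasPrefix(id, "/") {
		return fmt.Errorf("resource ID %q must start with '/'", id)
	}
	segments := strings.Split(id[1:], "/")
	for i, s := range segments {
		if s != "" {
			continue
		}
		if i == 0 {
			return errors.New("resource ID has an empty first segment")
		}
		return fmt.Errorf("resource ID has an empty value after %q", segments[i-1])
	}
	return nil
}

func main() {
	id := "/subscriptions/53b8f551-f0fc-4bea-8cba-6d1fefd54c8a/resourceGroups//providers/Microsoft.Compute/galleries/gallery_ci_op_1ggfsp7t_56add_mw5fd/images/ci-op-1ggfsp7t-56add-mw5fd-gen2"
	fmt.Println(validateResourceID(id))
}
```

An empty resource group would be consistent with both the 403 (invalid linked scope) and the 400 (invalid resource reference), although the thread does not confirm the root cause.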

@patrickdillon
Contributor Author

/payload-job periodic-ci-openshift-verification-tests-main-installation-nightly-4.21-azure-ipi-confidentialvm-vmgueststateonly-no-mini-perm-f7

@openshift-merge-robot openshift-merge-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Nov 21, 2025
@patrickdillon
Contributor Author

/payload-job periodic-ci-openshift-openshift-tests-private-release-4.21-amd64-nightly-azure-ipi-confidential-trustedlaunch-mini-perm-f7

@openshift-ci
Contributor

openshift-ci bot commented Nov 21, 2025

@patrickdillon: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command

  • periodic-ci-openshift-openshift-tests-private-release-4.21-amd64-nightly-azure-ipi-confidential-trustedlaunch-mini-perm-f7

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/e96b2fa0-c712-11f0-912f-90c85796e079-0

@patrickdillon
Contributor Author

/payload-job periodic-ci-openshift-openshift-tests-private-release-4.21-amd64-nightly-azure-ipi-confidentialvm-vmgueststateonly-f28-destructive

@openshift-ci
Contributor

openshift-ci bot commented Nov 21, 2025

@patrickdillon: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command

  • periodic-ci-openshift-openshift-tests-private-release-4.21-amd64-nightly-azure-ipi-confidentialvm-vmgueststateonly-f28-destructive

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/f97e0610-c712-11f0-90f3-77d70f3c8aca-0

Member

@tthvo tthvo left a comment

/lgtm

Waiting on testing 😁

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Nov 21, 2025

Unfortunately we have a lot of big functions in the installer, and
that is not likely to change. Therefore bumping the cyclomatic
complexity threshold so the linter starts complaining at a threshold
of 40 rather than 30.

Also remove the tenv linter as it is deprecated.

Marketplace images do not support confidential VMs or trusted launch,
so when machinesets use confidential VMs the installer will still
create an image gallery compatible with the security settings.

This commit updates default value handling when loading the
install config to set values in machine pools based on the
defaultMachinePlatform.

By populating the values directly in the install config, we can
avoid repetitive checks throughout the codebase to ensure the
default machine platform is applied to the relevant machine pool.

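
The defaulting described in this commit message can be sketched as follows. The `machinePool` type, its two fields, and `applyDefaults` are hypothetical simplifications; the real install config has many more fields, and the actual merge logic may differ.

```go
package main

import "fmt"

// machinePool is a hypothetical, heavily reduced machine pool platform.
type machinePool struct {
	InstanceType string
	DiskSizeGB   int
}

// applyDefaults fills every field the pool leaves unset from the
// defaultMachinePlatform. Values set on the pool always win.
func applyDefaults(pool *machinePool, def *machinePool) {
	if pool == nil || def == nil {
		return
	}
	if pool.InstanceType == "" {
		pool.InstanceType = def.InstanceType
	}
	if pool.DiskSizeGB == 0 {
		pool.DiskSizeGB = def.DiskSizeGB
	}
}

func main() {
	def := &machinePool{InstanceType: "Standard_D8s_v3", DiskSizeGB: 128}
	compute := machinePool{DiskSizeGB: 256} // overrides only the disk size
	applyDefaults(&compute, def)
	fmt.Printf("%+v\n", compute)
}
```

Running the merge once at load time means later code can read the pool directly instead of checking the default platform at every call site.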
@openshift-ci openshift-ci bot removed the lgtm Indicates that a PR is ready to be merged. label Nov 21, 2025
@patrickdillon
Contributor Author

/payload-job periodic-ci-openshift-openshift-tests-private-release-4.21-amd64-nightly-azure-ipi-confidentialvm-vmgueststateonly-f28-destructive

@patrickdillon
Contributor Author

/payload-job periodic-ci-openshift-openshift-tests-private-release-4.21-amd64-nightly-azure-ipi-confidential-trustedlaunch-mini-perm-f7

@openshift-ci
Contributor

openshift-ci bot commented Nov 21, 2025

@patrickdillon: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command

  • periodic-ci-openshift-openshift-tests-private-release-4.21-amd64-nightly-azure-ipi-confidentialvm-vmgueststateonly-f28-destructive

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/1af02bf0-c71a-11f0-9600-422088630755-0

@openshift-ci
Contributor

openshift-ci bot commented Nov 21, 2025

@patrickdillon: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command

  • periodic-ci-openshift-openshift-tests-private-release-4.21-amd64-nightly-azure-ipi-confidential-trustedlaunch-mini-perm-f7

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/253e91a0-c71a-11f0-9004-bb53ea4d7b62-0

Member

@tthvo tthvo left a comment

/lgtm
/approve

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Nov 21, 2025
@openshift-ci
Contributor

openshift-ci bot commented Nov 21, 2025

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: tthvo

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Nov 21, 2025
@patrickdillon
Contributor Author

/test e2e-azurestack

Job was unable to connect to the instance. If we can't confirm testing on this, let's move forward and we can fix in a followup bug.

@patrickdillon
Contributor Author

confidentialvm install passed

looks like azurestack reached install phase, will keep watching that but I'm going to go ahead and verify this to get it into the merge queue

/verified by a group effort

@openshift-ci-robot openshift-ci-robot added the verified Signifies that the PR passed pre-merge verification criteria label Nov 22, 2025
@openshift-ci-robot
Contributor

@patrickdillon: This PR has been marked as verified by a group effort.

@patrickdillon
Contributor Author

/override ci/prow/e2e-aws-ovn

rhel repo outage

@openshift-ci
Contributor

openshift-ci bot commented Nov 22, 2025

@patrickdillon: Overrode contexts on behalf of patrickdillon: ci/prow/e2e-aws-ovn

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@gpei
Contributor

gpei commented Nov 22, 2025

Confidential and ASH jobs are using the correct images; all the master and worker machines were created as expected.

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. lgtm Indicates that a PR is ready to be merged. verified Signifies that the PR passed pre-merge verification criteria

7 participants