Skip to content

Conversation

@djoshy
Copy link
Contributor

@djoshy djoshy commented Dec 9, 2025

- What I did
This PR updates the boot image controller to use an event based queue. See additional context in #5332 (comment). It also renames the controller to be more generic as it is no longer only updating machinesets.

It also has some minor verify fixes that I was encountering locally, but not seeing in CI?

- How to verify it
All boot image tests should still pass; no functional changes are expected.

These seemed to be tripping local verify runs, but
not the CI ones.
@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Dec 9, 2025
@openshift-ci openshift-ci bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Dec 9, 2025
@openshift-ci-robot
Copy link
Contributor

@djoshy: This pull request explicitly references no jira issue.

Details

In response to this:

- What I did
This PR updates the boot image controller to use an event based queue. See additional context in #5332 (comment). It also renames the controller to be more generic as it is no longer only updating machinesets.

It also has some minor verify fixes that I was encountering locally, but not seeing in CI?

- How to verify it
All boot image tests should still pass; no functional changes are expected.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci
Copy link
Contributor

openshift-ci bot commented Dec 9, 2025

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Dec 9, 2025
@djoshy
Copy link
Contributor Author

djoshy commented Dec 9, 2025

/payload-job periodic-ci-openshift-machine-config-operator-release-4.21-periodics-e2e-aws-mco-disruptive
/payload-job periodic-ci-openshift-machine-config-operator-release-4.21-periodics-e2e-azure-mco-disruptive
/payload-job periodic-ci-openshift-machine-config-operator-release-4.21-periodics-e2e-vsphere-mco-disruptive
/payload-job periodic-ci-openshift-machine-config-operator-release-4.21-periodics-e2e-gcp-mco-disruptive

@openshift-ci
Copy link
Contributor

openshift-ci bot commented Dec 9, 2025

@djoshy: trigger 4 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command

  • periodic-ci-openshift-machine-config-operator-release-4.21-periodics-e2e-aws-mco-disruptive
  • periodic-ci-openshift-machine-config-operator-release-4.21-periodics-e2e-azure-mco-disruptive
  • periodic-ci-openshift-machine-config-operator-release-4.21-periodics-e2e-vsphere-mco-disruptive
  • periodic-ci-openshift-machine-config-operator-release-4.21-periodics-e2e-gcp-mco-disruptive

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/2f4ef270-d548-11f0-8f48-888e820b0c9f-0

@djoshy
Copy link
Contributor Author

djoshy commented Dec 10, 2025

/payload-job periodic-ci-openshift-machine-config-operator-release-4.21-periodics-e2e-aws-mco-disruptive-techpreview-1of2 periodic-ci-openshift-machine-config-operator-release-4.21-periodics-e2e-aws-mco-disruptive-techpreview-2of2 periodic-ci-openshift-machine-config-operator-release-4.21-periodics-e2e-azure-mco-disruptive-techpreview-1of2 periodic-ci-openshift-machine-config-operator-release-4.21-periodics-e2e-azure-mco-disruptive-techpreview-2of2 periodic-ci-openshift-machine-config-operator-release-4.21-periodics-e2e-gcp-mco-disruptive-techpreview-1of2 periodic-ci-openshift-machine-config-operator-release-4.21-periodics-e2e-gcp-mco-disruptive-techpreview-2of2 periodic-ci-openshift-machine-config-operator-release-4.21-periodics-e2e-vsphere-mco-disruptive-techpreview-1of2 periodic-ci-openshift-machine-config-operator-release-4.21-periodics-e2e-vsphere-mco-disruptive-techpreview-2of2

@openshift-ci
Copy link
Contributor

openshift-ci bot commented Dec 10, 2025

@djoshy: trigger 8 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command

  • periodic-ci-openshift-machine-config-operator-release-4.21-periodics-e2e-aws-mco-disruptive-techpreview-1of2
  • periodic-ci-openshift-machine-config-operator-release-4.21-periodics-e2e-aws-mco-disruptive-techpreview-2of2
  • periodic-ci-openshift-machine-config-operator-release-4.21-periodics-e2e-azure-mco-disruptive-techpreview-1of2
  • periodic-ci-openshift-machine-config-operator-release-4.21-periodics-e2e-azure-mco-disruptive-techpreview-2of2
  • periodic-ci-openshift-machine-config-operator-release-4.21-periodics-e2e-gcp-mco-disruptive-techpreview-1of2
  • periodic-ci-openshift-machine-config-operator-release-4.21-periodics-e2e-gcp-mco-disruptive-techpreview-2of2
  • periodic-ci-openshift-machine-config-operator-release-4.21-periodics-e2e-vsphere-mco-disruptive-techpreview-1of2
  • periodic-ci-openshift-machine-config-operator-release-4.21-periodics-e2e-vsphere-mco-disruptive-techpreview-2of2

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/c14cbb20-d5df-11f0-9a0b-72d30d447e80-0

@djoshy djoshy marked this pull request as ready for review December 11, 2025 18:36
@djoshy
Copy link
Contributor Author

djoshy commented Dec 11, 2025

Payloads look good, opening up for review

@openshift-ci openshift-ci bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Dec 11, 2025
This commit simplifies the name of the package as
the controller now deals with more than just the
standard machineset resources.
@djoshy
Copy link
Contributor Author

djoshy commented Dec 11, 2025

/test unit

@yuqi-zhang
Copy link
Contributor

I think this warrants a tech debt card at least, even if we don't have a specific target version to merge this :) thanks for following up.

@openshift-ci
Copy link
Contributor

openshift-ci bot commented Dec 11, 2025

@djoshy: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
ci/prow/bootstrap-unit d6a4e18 link false /test bootstrap-unit

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@sergiordlr
Copy link
Contributor

One of the tests failed because it took a little bit longer than usual to update the machineset. It was updated, though.

When we created a machineset without configuring the machineconfiguration resource we saw this

I1212 10:40:39.822347       1 boot_image_controller.go:260] MAPI MachineSet cloned-tc-81403 added, reconciling enrolled machine resources
I1212 10:40:40.415505       1 boot_image_controller.go:283] MachineSet cloned-tc-81403 updated, reconciling enrolled machineset resources
E1212 10:40:42.718958       1 helpers.go:148] MachineConfiguration.Status is not up to date.
E1212 10:40:47.719781       1 helpers.go:148] MachineConfiguration.Status is not up to date.
E1212 10:40:52.719846       1 helpers.go:148] MachineConfiguration.Status is not up to date.
E1212 10:40:57.719785       1 helpers.go:148] MachineConfiguration.Status is not up to date.
E1212 10:41:02.718985       1 helpers.go:148] MachineConfiguration.Status is not up to date.
E1212 10:41:07.719274       1 helpers.go:148] MachineConfiguration.Status is not up to date.
E1212 10:41:12.719845       1 helpers.go:148] MachineConfiguration.Status is not up to date.
E1212 10:41:17.719239       1 helpers.go:148] MachineConfiguration.Status is not up to date.
E1212 10:41:22.719125       1 helpers.go:148] MachineConfiguration.Status is not up to date.
I1212 10:41:24.284946       1 boot_image_controller.go:449] Bootimages management configuration has been updated, reconciling enrolled machine resources
I1212 10:41:27.766451       1 platform_helpers.go:74] Reconciling MAPI machineset sregidor-qaws-cjqgm-worker-us-east-2a on AWS, with arch x86_64
I1212 10:41:27.772457       1 ms_helpers.go:170] No patching required for MAPI machineset sregidor-qaws-cjqgm-worker-us-east-2a
I1212 10:41:27.813004       1 platform_helpers.go:74] Reconciling MAPI machineset sregidor-qaws-cjqgm-worker-us-east-2b on AWS, with arch x86_64
I1212 10:41:27.815019       1 ms_helpers.go:170] No patching required for MAPI machineset sregidor-qaws-cjqgm-worker-us-east-2b
I1212 10:41:27.851045       1 platform_helpers.go:74] Reconciling MAPI machineset sregidor-qaws-cjqgm-worker-us-east-2c on AWS, with arch x86_64
I1212 10:41:27.852302       1 ms_helpers.go:170] No patching required for MAPI machineset sregidor-qaws-cjqgm-worker-us-east-2c
I1212 10:41:28.131547       1 platform_helpers.go:74] Reconciling MAPI machineset cloned-tc-81403 on AWS, with arch x86_64
I1212 10:41:28.132991       1 platform_helpers.go:191] Current image: us-east-2: ami-0ffec236307e00b94
I1212 10:41:28.133064       1 platform_helpers.go:192] New target boot image: us-east-2: ami-0bc8dda494f111572
I1212 10:41:28.136060       1 ms_helpers.go:167] Patching MAPI machineset cloned-tc-81403
I1212 10:41:28.149150       1 boot_image_controller.go:283] MachineSet cloned-tc-81403 updated, reconciling enrolled machineset resources
I1212 10:41:28.149187       1 ms_helpers.go:220] Successfully patched machineset cloned-tc-81403

Our test has a very tight timeout and failed.

I think that the solution is to use a slightly bigger timeout, since taking 50 seconds to update the machineset seems reasonable.

Nevertheless, could you confirm that you see no issue in the logs I pasted, please?

@djoshy
Copy link
Contributor Author

djoshy commented Dec 12, 2025

Chatted with @sergiordlr, and we think this is the after effect of a single threaded queue system vs a multiple thread system utilizing mutexes. The status from a previous test(Restore Partial MachineSet images [Disruptive] [Serial])hadn't finished processing and so the test that failed(In BootImages Machineset should update by default) didn't actually perform updates until about a minute later. Our solution was to bump the timeout on the failed test, and assess if it becomes an issue later.

/verified by @sergiordlr

@openshift-ci-robot openshift-ci-robot added the verified Signifies that the PR passed pre-merge verification criteria label Dec 12, 2025
@openshift-ci-robot
Copy link
Contributor

@djoshy: This PR has been marked as verified by @sergiordlr.

Details

In response to this:

Chatted with @sergiordlr, and we think this is the after effect of a single threaded queue system vs a multiple thread system utilizing mutexes. The status from a previous test(Restore Partial MachineSet images [Disruptive] [Serial])hadn't finished processing and so the test that failed(In BootImages Machineset should update by default) didn't actually perform updates until about a minute later. Our solution was to bump the timeout on the failed test, and assess if it becomes an issue later.

/verified by @sergiordlr

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@djoshy djoshy changed the title NO-ISSUE: Use an event based queue in the boot image controller MCO-2015: Use an event based queue in the boot image controller Dec 12, 2025
@openshift-ci-robot
Copy link
Contributor

openshift-ci-robot commented Dec 12, 2025

@djoshy: This pull request references MCO-2015 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.22.0" version, but no target version was set.

Details

In response to this:

- What I did
This PR updates the boot image controller to use an event based queue. See additional context in #5332 (comment). It also renames the controller to be more generic as it is no longer only updating machinesets.

It also has some minor verify fixes that I was encountering locally, but not seeing in CI?

- How to verify it
All boot image tests should still pass; no functional changes are expected.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

Copy link
Contributor

@pablintino pablintino left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ouch, good one, but not sure why I missed this one.


out, err := uploadCmd.CombinedOutput()
logger.Infof(string(out))
logger.Infof("%s", string(out))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good one.

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Dec 16, 2025
@openshift-ci
Copy link
Contributor

openshift-ci bot commented Dec 16, 2025

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: djoshy, pablintino

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@djoshy
Copy link
Contributor Author

djoshy commented Dec 16, 2025

Running an AWS serial payload just to be safe:
/payload-job periodic-ci-openshift-release-master-nightly-4.22-e2e-aws-ovn-serial-1of2 periodic-ci-openshift-release-master-nightly-4.22-e2e-aws-ovn-serial-2of2

@openshift-ci
Copy link
Contributor

openshift-ci bot commented Dec 16, 2025

@djoshy: trigger 2 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command

  • periodic-ci-openshift-release-master-nightly-4.22-e2e-aws-ovn-serial-1of2
  • periodic-ci-openshift-release-master-nightly-4.22-e2e-aws-ovn-serial-2of2

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/a9cc10e0-da99-11f0-88ac-11f06fe1385c-0

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. lgtm Indicates that a PR is ready to be merged. verified Signifies that the PR passed pre-merge verification criteria

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants