@tthvo (Member) commented Nov 20, 2025

Description

The SDK v2 introduces a new built-in client rate limiter [0], whose default settings break the install process under heavy stress. For example, we have seen rate limit issues when running in CI with a high number of AWS resources, especially IAM.

Thus, we explicitly disable the rate limiter in the common client config, defined in pkg/asset/installconfig/aws/sessionv2.go. Any v2 client will need to use this config.

Background

While attempting to migrate the destroy code, we previously ran into rate limiting on IAM API calls, so we reverted that migration.

However, there are other IAM calls within the install path to getOrCreate IAM roles and instance profiles. Thus, we need to make sure the IAM v2 client disables the rate limiter. Otherwise, we will run into errors such as:

time="2025-11-13T08:21:44Z" level=error msg="failed to fetch Cluster: failed to generate asset \"Cluster\": failed to create cluster: failed during pre-provisioning: failed to create IAM roles: failed to create IAM master role: failed to get master role: operation error IAM: GetRole, exceeded maximum number of attempts, 3, https response error StatusCode: 400, RequestID: 910e3801-5fc1-45d7-91cc-092bb8e3e4b1, api error Throttling: Rate exceeded"}

References

[0] https://docs.aws.amazon.com/sdk-for-go/v2/developer-guide/configure-retries-timeouts.html
@openshift-ci-robot added the jira/valid-reference label (Indicates that this PR references a valid Jira ticket of any type.) Nov 20, 2025

@openshift-ci-robot (Contributor) commented Nov 20, 2025

@tthvo: This pull request references CORS-4055 which is a valid jira issue.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@tthvo (Member, Author) commented Nov 20, 2025

/cc @barbacbd @yunjiang29 @gpei
/label platform/aws

@tthvo (Member, Author) commented Nov 20, 2025

The main fix is for the IAM client, which is more likely to hit the rate limiter via its get calls. The other changes clean up the EC2 clients.

Other clients that are either still on SDK v1 or do not have a client construct here are left untouched. They will be handled in #9907 🙏

@barbacbd (Contributor) left a comment

/approve

@openshift-ci bot (Contributor) commented Nov 20, 2025

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: barbacbd

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci bot added the approved label (Indicates a PR has been approved by an approver from all required OWNERS files.) Nov 20, 2025

@tthvo (Member, Author) commented Nov 20, 2025

Sample analysis: In ci/prow/e2e-aws-ovn-edge-zones, we can see that the AWS API is under heavy load. The CCO is failing on IAM rate limiting, but the installer was able to complete its install.

These errors are hard to detect and mostly show up when merge traffic is high near the freeze window... I am surprised it didn't happen in August during 4.20.

@tthvo (Member, Author) commented Nov 21, 2025

/retest
/test e2e-aws-ovn-public-subnets e2e-aws-ovn-public-ipv4-pool e2e-aws-ovn-custom-iam-profile e2e-aws-overlay-mtu-ovn-1200

Firing more tests to stress CI a bit more 😅

@tthvo (Member, Author) commented Nov 21, 2025

Okayy, if we compare the e2e runs here against other PRs... 👇

For example:

Here in this PR, the fix helped all jobs pass the ipi-install-install step without ever hitting the rate limiting issue.

Ideally, we should enable the rate limiter some day in the future with well-tuned parameters. For now, this preserves the behaviour of SDK v1.
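
To give a feel for which parameters such tuning would involve, below is a simplified, standard-library-only model of a retry token bucket. It is a sketch, not the SDK's implementation. The constants assume the SDK's documented defaults (a bucket of 500 tokens and a cost of 5 tokens per retry); the type and function names are hypothetical.

```go
package main

import (
	"errors"
	"fmt"
)

// Assumed defaults, mirroring retry.DefaultRetryRateTokens and
// retry.DefaultRetryCost in the AWS SDK for Go v2.
const (
	bucketCapacity uint = 500
	retryCost      uint = 5
)

var errQuotaExceeded = errors.New("retry quota exceeded")

// tokenBucket is a simplified, single-goroutine model of a client-side
// retry quota. It ignores the refill that happens on successful calls.
type tokenBucket struct {
	remaining uint
}

func newTokenBucket(capacity uint) *tokenBucket {
	return &tokenBucket{remaining: capacity}
}

// take withdraws cost tokens for one retry attempt, or fails without
// changing the bucket when too few tokens are left.
func (b *tokenBucket) take(cost uint) error {
	if b.remaining < cost {
		return errQuotaExceeded
	}
	b.remaining -= cost
	return nil
}

// retriesUntilExhausted counts how many back-to-back retries a fresh bucket
// allows before further retries are refused. It returns -1 for a zero cost,
// because a free retry never drains the bucket.
func retriesUntilExhausted(capacity, cost uint) int {
	if cost == 0 {
		return -1
	}
	b := newTokenBucket(capacity)
	n := 0
	for b.take(cost) == nil {
		n++
	}
	return n
}

func main() {
	// With the assumed defaults, 500 / 5 = 100 retries drain the bucket.
	fmt.Println(retriesUntilExhausted(bucketCapacity, retryCost)) // prints 100

	// Doubling the capacity doubles the retry budget.
	fmt.Println(retriesUntilExhausted(2*bucketCapacity, retryCost)) // prints 200
}
```

In this model, a burst of throttled calls sharing one client uses up the retry budget quickly, which is why the bucket size and the per-retry cost are the natural knobs to tune before re-enabling the limiter.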

@openshift-ci bot (Contributor) commented Nov 21, 2025

@tthvo: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| ci/prow/e2e-aws-ovn-edge-zones | 645e668 | link | false | /test e2e-aws-ovn-edge-zones |
| ci/prow/e2e-aws-ovn-single-node | 645e668 | link | false | /test e2e-aws-ovn-single-node |
| ci/prow/e2e-aws-ovn-shared-vpc-edge-zones | 645e668 | link | false | /test e2e-aws-ovn-shared-vpc-edge-zones |
| ci/prow/e2e-aws-ovn-public-ipv4-pool | 645e668 | link | false | /test e2e-aws-ovn-public-ipv4-pool |
| ci/prow/e2e-aws-ovn-heterogeneous | 645e668 | link | false | /test e2e-aws-ovn-heterogeneous |

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@sadasu (Contributor) commented Nov 21, 2025

/verified by @tthvo

@openshift-ci-robot added the verified label (Signifies that the PR passed pre-merge verification criteria) Nov 21, 2025

@openshift-ci-robot (Contributor)

@sadasu: This PR has been marked as verified by @tthvo.

@sadasu (Contributor) commented Nov 21, 2025

/lgtm

@openshift-ci bot added the lgtm label (Indicates that a PR is ready to be merged.) Nov 21, 2025

@tthvo (Member, Author) commented Nov 21, 2025

/skip

@patrickdillon changed the title from "CORS-4055: configure AWS SDK v2 clients with common config" to "OCPBUGS-65893: CORS-4055: configure AWS SDK v2 clients with common config" Nov 21, 2025

@openshift-ci-robot added the jira/valid-bug label (Indicates that a referenced Jira bug is valid for the branch this PR is targeting.) Nov 21, 2025

@patrickdillon (Contributor)

/jira refresh

@openshift-ci-robot (Contributor)

@tthvo: This pull request references Jira Issue OCPBUGS-65893, which is valid. The bug has been moved to the POST state.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.21.0) matches configured target version for branch (4.21.0)
  • bug is in the state New, which is one of the valid states (NEW, ASSIGNED, POST)

Requesting review from QA contact:
/cc @gpei

The bug has been updated to refer to the pull request using the external bug tracker.

@openshift-ci-robot (Contributor)

@patrickdillon: This pull request references Jira Issue OCPBUGS-65893, which is valid.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.21.0) matches configured target version for branch (4.21.0)
  • bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, POST)

Requesting review from QA contact:
/cc @gpei

@openshift-ci-robot (Contributor)

/retest-required

Remaining retests: 0 against base HEAD 386dca3 and 2 for PR HEAD 645e668 in total

@patrickdillon (Contributor)

/override ci/prow/e2e-aws-ovn ci/prow/e2e-aws-ovn-edge-zones-manifest-validation

@openshift-ci bot (Contributor) commented Nov 22, 2025

@patrickdillon: Overrode contexts on behalf of patrickdillon: ci/prow/e2e-aws-ovn, ci/prow/e2e-aws-ovn-edge-zones-manifest-validation

@openshift-merge-bot bot merged commit b3eccf7 into openshift:main Nov 22, 2025
30 checks passed

@openshift-ci-robot (Contributor)

@tthvo: Jira Issue Verification Checks: Jira Issue OCPBUGS-65893
✔️ This pull request was pre-merge verified.
✔️ All associated pull requests have merged.
✔️ All associated, merged pull requests were pre-merge verified.

Jira Issue OCPBUGS-65893 has been moved to the MODIFIED state and will move to the VERIFIED state when the change is available in an accepted nightly payload. 🕓

@tthvo deleted the CORS-4055-partial branch November 22, 2025 01:58