internal/controller: return stale AuthConfig deletion errors so the controller retries by SAY-5 · Pull Request #1913 · Kuadrant/kuadrant-operator

SAY-5 · 2026-04-19T05:29:16Z

What

AuthConfigsReconciler.Reconcile in internal/controller/authconfigs_reconciler.go sweeps out stale AuthConfig objects but swallows the error:

for _, authConfig := range staleAuthConfigs {
    if err := r.client.Resource(...).Delete(ctx, authConfig.GetName(), metav1.DeleteOptions{}); err != nil {
        logger.Error(err, "failed to delete authconfig object", "authconfig", ...)
        // TODO: handle error
    }
}
return nil

The controller runtime only re-queues a reconcile when Reconcile returns a non-nil error. So a failed delete was a one-shot log line and the stale resource was left in-cluster until some other event happened to trigger another reconcile pass. Under API-server pressure, a transient permission glitch, or a brief conflict, that's long enough for stale AuthConfigs to accumulate, and for Authorino to keep enforcing policy fragments the operator already tried to remove.

Per #1867.

Fix

Collect per-delete failures and return them through errors.Join at the end of the function:

var deleteErrors []error
for _, authConfig := range staleAuthConfigs {
    if err := r.client.Resource(...).Delete(...); err != nil {
        logger.Error(err, "failed to delete authconfig object", "authconfig", ...)
        deleteErrors = append(deleteErrors, fmt.Errorf("failed to delete stale authconfig %s/%s: %w",
            authConfig.GetNamespace(), authConfig.GetName(), err))
    }
}

return errors.Join(deleteErrors...)

Successful deletes in the same loop still complete; only the failed ones are surfaced.
The existing per-object logger.Error(...) call is kept so the operator log still has the structured context.
errors.Is/errors.As in any upstream error handlers continue to work because we wrap the original error with %w.

errors.Join(nil...) returns nil, so the success path is unchanged from the previous return nil.

Fixes #1867

Summary by CodeRabbit

Bug Fixes
- Improved error handling during stale configuration cleanup: deletion errors encountered while removing obsolete AuthConfig resources are now accumulated and reported collectively, while not-found cases are ignored. As a result, cleanup failures are surfaced instead of being silently skipped, and the reconciler may now return an error when stale configuration deletions fail for reasons other than not-found.

coderabbitai · 2026-04-19T05:29:32Z

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 5755be4e-6cd6-428b-8e0a-481e5d18cae9

📥 Commits

Reviewing files that changed from the base of the PR and between 1ba7d65 and f724ec8.

📒 Files selected for processing (1)

internal/controller/authconfigs_reconciler.go

💤 Files with no reviewable changes (1)

internal/controller/authconfigs_reconciler.go

📝 Walkthrough

Walkthrough

The reconciler now aggregates deletion errors when removing stale AuthConfig resources, ignores NotFound as idempotent, and returns a combined error via errors.Join(...) after attempting all deletions.

Changes

AuthConfigs reconciler

Layer / File(s)	Summary
Stale AuthConfig deletion and error aggregation `internal/controller/authconfigs_reconciler.go`	Adds `apierrors` import; collects non-NotFound deletion errors for stale AuthConfigs into a slice, wraps them, and returns an aggregated error with `errors.Join(...)` instead of only logging failures.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Poem

🐰 I found the deletes that used to hide,
Skipped the gone ones, kept the rest beside,
I gathered errors, stitched them tight,
Joined their voices into one clear bite,
Hop—reconcile—now retries with pride.

🚥 Pre-merge checks | ✅ 5

✅ Passed checks (5 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title accurately summarizes the main change: returning stale AuthConfig deletion errors to enable controller retries.
Linked Issues check	✅ Passed	The changes fully implement the requirements from issue `#1867`: collect deletion errors with contextual information and return them via errors.Join for controller retry.
Out of Scope Changes check	✅ Passed	All changes are directly scoped to implementing the deletion error handling for stale AuthConfigs as required by issue `#1867`; no unrelated modifications present.
Docstring Coverage	✅ Passed	No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@internal/controller/authconfigs_reconciler.go`:
- Around line 98-102: The delete error aggregation in Reconcile for AuthConfigs
must ignore NotFound errors—import k8s.io/apimachinery/pkg/api/errors as
apierrors and, in the block that calls
r.client.Resource(kuadrantauthorino.AuthConfigsResource).Namespace(authConfig.GetNamespace()).Delete(...),
check if apierrors.IsNotFound(err) and if so treat it as success (optionally log
at V(1) rather than appending to deleteErrors); only append to deleteErrors and
log an error when err is non-nil and not apierrors.IsNotFound(err). Update the
delete handling in authconfigs_reconciler.go around the Delete call (the
deleteErrors slice and logger usage) accordingly.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 1f69e3ca-9793-4ac5-b0da-d8bc31b9b71e

📥 Commits

Reviewing files that changed from the base of the PR and between a0a0702 and 264ca2c.

📒 Files selected for processing (1)

internal/controller/authconfigs_reconciler.go

…eting stale Per CodeRabbit review on Kuadrant#1913: sibling reconcilers (effective_dnspolicies_reconciler:311, effective_tls_policies_reconciler:131) already skip apierrors.IsNotFound when cleaning up stale policy objects. Align AuthConfigsReconciler to the same pattern so a stale config that was already deleted out of band doesn't retrigger reconciliation. Signed-off-by: SAY-5 <SAY-5@users.noreply.github.com>

…ontroller retries AuthConfigsReconciler.Reconcile loops over the stale AuthConfigs it intends to clean up and calls Delete on each one, but swallowed the error: for _, authConfig := range staleAuthConfigs { if err := r.client.Resource(...).Delete(...); err != nil { logger.Error(err, "failed to delete authconfig object", ...) // TODO: handle error } } return nil Because the controller runtime only retries a reconcile when Reconcile returns a non-nil error, a failed delete was a one-shot log line and the stale resource was left in-cluster until the next unrelated event triggered another reconcile. Under API-server pressure or transient permission glitches, that's long enough for stale AuthConfigs to accumulate and for Authorino to keep serving policy fragments the operator already tried to remove. Collect per-delete failures into a slice and return them via errors.Join, so any failed delete propagates up as a reconcile error and the runtime re-queues the work. Successes across the same loop still complete; only the failed ones are surfaced. Log line is kept so the operator still has per-object context in the logs. Fixes Kuadrant#1867 Signed-off-by: SAY-5 <SAY-5@users.noreply.github.com> Signed-off-by: Sai Asish Y <say.apm35@gmail.com>

…eting stale Per CodeRabbit review on Kuadrant#1913: sibling reconcilers (effective_dnspolicies_reconciler:311, effective_tls_policies_reconciler:131) already skip apierrors.IsNotFound when cleaning up stale policy objects. Align AuthConfigsReconciler to the same pattern so a stale config that was already deleted out of band doesn't retrigger reconciliation. Signed-off-by: SAY-5 <SAY-5@users.noreply.github.com> Signed-off-by: Sai Asish Y <say.apm35@gmail.com>

maksymvavilov

@SAY-5 up to you to act or ignore the nit. In any case this is good

maksymvavilov · 2026-06-09T12:52:02Z

+			if apierrors.IsNotFound(err) {
+				// Already gone - idempotent success, matching the pattern
+				// used in effective_dnspolicies_reconciler.go:311 and
+				// effective_tls_policies_reconciler.go:131.


Nit: a grumpy side of mine sees line numbers and yells: "This will go out of date in a month!"; It is a fairly well-known pattern, and I see no added benefit of the comments. If this were me, I'd remove them, but no engineer died of having more comments afaik - this should not hold the merger.

Signed-off-by: Sai Asish Y <say.apm35@gmail.com>

SAY-5 · 2026-06-09T21:19:34Z

Fair point, the file references would rot quickly. Dropped the comment block in f724ec8.

maksymvavilov

Good work! Thank you

guicassolato

The solution proposed here is consistent with the hints left in #1867. However, I think I would favour a more holistic approach to tackle the problem of non-blocking, final errors raised by the API server (#946).

The current State of the World reconciliation has several locations where API server errors are logged but not propagated (marked with // TODO: handle error). The challenge is:

Returning errors aborts the workflow - If you return an error from a reconciler, the entire workflow stops, preventing other reconcilers from running
Ignoring errors causes resource leaks - Stale resources accumulate when deletion/update failures are swallowed
Standard controller retry waits for new events - Without returning an error, there's no automatic retry until another API server event arrives
Risk of infinite loops - Naive retry mechanisms could create reconciliation loops

I think we should introduce a reconciliation error registry that collects all non-blocking errors during workflow execution and triggers a single deferred reconciliation after the workflow completes successfully.

It could be added to the final steps of the workflow:

func (b *BootOptionsBuilder) finalStepsWorkflow() *controller.Workflow {
	workflow := &controller.Workflow{
		Tasks: []controller.ReconcileFunc{
			// ... existing tasks ...
		},
		Postcondition: // Add error retry reconciler as the final task
			traceReconcileFunc("finalize.error_retry", NewErrorRetryReconciler().Reconcile),
	}
    
	// ... rest of the function
}

Then, in the AuthConfig reconciler (and similar):

func (r *AuthConfigsReconciler) Reconcile(ctx context.Context, _ []controller.ResourceEvent, topology *machinery.Topology, _ error, state *sync.Map) error {
	// ... 

	// Get or create error registry from state
	errorRegistry := GetOrCreateErrorRegistry(state)

	for _, authConfig := range staleAuthConfigs {
		if err := r.client.Resource(kuadrantauthorino.AuthConfigsResource).Namespace(authConfig.GetNamespace()).Delete(ctx, authConfig.GetName(), metav1.DeleteOptions{}); err != nil {
			logger.Error(…)

			// Record the error for deferred retry instead of returning it
			errorRegistry.Record(…)
	}

	// ... 
}

Finally, the implementation of a hypothetical ErrorRetryReconciler can get complex, but it shoould:

Check if the registry contains any error
Merges the tracked errors into an error tracker struct that lives outside the workflow

The external error tracker can be a wrapper for Policy Machinery's controller that implements controller-runtime's Reconcile. It:

Keeps count of the accrued number of occurrences for each type of error
Triggers a retry (by returning a controller-runtime's Result)

SAY-5 · 2026-06-11T18:55:25Z

Agreed the registry plus deferred-retry reconciler in #946 is the right long-term shape; this only patches the one stale-AuthConfig delete path that #1867 pointed at and would be redundant once the workflow-wide handling lands. Happy to keep it as an interim stopgap if useful, or close it in favour of #946 if you'd rather not carry the localized version. Your call.

SAY-5 · 2026-06-17T22:45:18Z

Thanks, that framing of the four failure modes is helpful and the registry-plus-deferred-retry approach makes sense as the real fix for the broader set of swallowed errors across the State of the World workflow, not just this one delete.

I'm happy to take that on, but it's a meaningfully larger change than this PR's scope (a new error registry in state, GetOrCreateErrorRegistry, the ErrorRetryReconciler postcondition, and the external counting/backoff tracker), so I'd rather not bolt a half version onto this one. Would you prefer I close this and open a dedicated PR that introduces the registry and converts the existing // TODO: handle error sites to it, or keep this open and grow it into that? Either works, just want to match how you'd like #946 tracked.

SAY-5 · 2026-06-18T18:46:18Z

That makes sense, a deferred error-registry that collects non-blocking API-server errors and runs one retry reconciler at the end of the workflow is a cleaner fit for #946 than patching each TODO site individually. This PR only covers the AuthConfig deletion path, so it would become one consumer of that registry rather than the whole solution. Happy to either rescope this to just wire the AuthConfig path into the registry once the shared piece exists, or close it in favor of a dedicated PR for the holistic approach, whichever you prefer.

Implements a holistic solution for handling API server errors that should not block the reconciliation workflow but require automatic retry. Previously, errors during resource cleanup (e.g., AuthConfig deletion) were logged but not handled, leading to resource leaks when operations failed due to transient issues like rate limiting, network errors, or permission issues. Returning errors would abort the entire workflow, preventing other reconcilers from running. ## Solution Architecture Implements a two-layer error tracking system with automatic retry: ### 1. ErrorRegistry (Per-Reconciliation) - Lives in workflow state (`*sync.Map`) for a single reconciliation cycle - Reconcilers record non-blocking errors instead of returning them - Workflow continues - all reconcilers run despite individual failures ### 2. PersistentErrorTracker (Cross-Reconciliation) - Persists errors across reconciliation cycles - Deduplicates by unique key: {GroupKind}/{Namespace}/{Name}/{Operation} - Tracks retry count per error - Detects resolution: errors absent from new registry are removed - Implements exponential backoff: 2s → 4s → 8s → 16s → 32s → 60s (max) ### 3. ReconciliationErrorHandler (Workflow Postcondition) - Runs at workflow end to merge errors from registry to tracker - Calculates retry delay - Schedules retry via callback with original events ### 4. KuadrantController (Retry Trigger) - Wraps policy-machinery controller - Provides ScheduleRetry callback - Creates timer that calls Propagate(events) after delay - Replays original events to ensure proper context ## Key Features - **Non-blocking**: Errors recorded, not returned; workflow completes - **Automatic retry**: Programmatic retry via timer + Propagate() - **Event replay**: Original events captured and replayed on retry - **Exponential backoff**: 2s, 4s, 8s, 16s, 32s, 60s (capped) - **Max attempts**: 5 retries then give up (prevents infinite loops) - **Auto-resolution**: Successful operations remove errors from tracker - **Deduplication**: Same error key increments count, not duplicated - **Observability**: Detailed logging at each stage ## Implementation Details Modified reconcilers (AuthConfigsReconciler, LimitadorLimitsReconciler) now: 1. Get ErrorRegistry from workflow state 2. Record failures via registry.Record() instead of returning errors 3. Continue with nil return (workflow proceeds) Error handler runs as final postcondition: 1. Reads errors from ephemeral ErrorRegistry 2. Merges into PersistentErrorTracker 3. Calls scheduleRetry(delay, events) callback KuadrantController.ScheduleRetry(): 1. Creates time.AfterFunc(delay, ...) 2. Timer fires → calls c.Controller.Propagate(events) 3. Workflow re-runs with original events 4. Failed operations retry; successful ones don't error ## Testing Manually tested both scenarios: - Max retries exceeded: Operator stops after 5 attempts (~2 min) - Error recovery: RBAC restored mid-retry, cleanup succeeds Unit tests cover: - Error recording and registry behavior - Persistent tracker deduplication and retry counting - Exponential backoff calculation - Error resolution detection - Retry scheduling callback ## Dependencies Requires policy-machinery with exported Propagate() method: - Updated to v0.8.1-0.20260611112329-244a3ec8172a Resolves: #946, #1867 Supersedes: #1913 Co-authored-by: Claude Sonnet 4.5 <noreply@anthropic.com> Signed-off-by: Guilherme Cassolato <guicassolato@gmail.com>

guicassolato · 2026-06-26T09:32:53Z

Closing in favour of #2043.

Thanks for the work, @SAY-5!

coderabbitai Bot reviewed Apr 19, 2026

View reviewed changes

Comment thread internal/controller/authconfigs_reconciler.go

SAY-5 force-pushed the fix/authconfigs-return-delete-errors branch from 61a21f6 to c7eb480 Compare May 31, 2026 02:41

SAY-5 added 2 commits June 2, 2026 13:59

SAY-5 force-pushed the fix/authconfigs-return-delete-errors branch from c7eb480 to 1ba7d65 Compare June 2, 2026 21:00

maksymvavilov previously approved these changes Jun 9, 2026

View reviewed changes

authconfigs_reconciler: drop line-number comments

f724ec8

Signed-off-by: Sai Asish Y <say.apm35@gmail.com>

SAY-5 dismissed maksymvavilov’s stale review via f724ec8 June 9, 2026 21:19

maksymvavilov assigned SAY-5 Jun 10, 2026

maksymvavilov approved these changes Jun 10, 2026

View reviewed changes

maksymvavilov enabled auto-merge June 10, 2026 09:29

guicassolato requested changes Jun 10, 2026

View reviewed changes

maksymvavilov disabled auto-merge June 10, 2026 10:16

guicassolato mentioned this pull request Jun 12, 2026

feat: implement non-blocking error handling with automatic retry #2043

Open

guicassolato closed this Jun 26, 2026

Uh oh!

Conversation

SAY-5 commented Apr 19, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What

Fix

Summary by CodeRabbit

Uh oh!

coderabbitai Bot commented Apr 19, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Estimated code review effort

Poem

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

maksymvavilov left a comment

Choose a reason for hiding this comment

Uh oh!

maksymvavilov Jun 9, 2026

Choose a reason for hiding this comment

Uh oh!

SAY-5 commented Jun 9, 2026

Uh oh!

maksymvavilov left a comment

Choose a reason for hiding this comment

Uh oh!

guicassolato left a comment

Choose a reason for hiding this comment

Uh oh!

SAY-5 commented Jun 11, 2026

Uh oh!

SAY-5 commented Jun 17, 2026

Uh oh!

SAY-5 commented Jun 18, 2026

Uh oh!

guicassolato commented Jun 26, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

SAY-5 commented Apr 19, 2026 •

edited by coderabbitai Bot

Loading

coderabbitai Bot commented Apr 19, 2026 •

edited

Loading