Skip to content

internal/controller: return stale AuthConfig deletion errors so the controller retries#1913

Closed
SAY-5 wants to merge 3 commits into
Kuadrant:mainfrom
SAY-5:fix/authconfigs-return-delete-errors
Closed

internal/controller: return stale AuthConfig deletion errors so the controller retries#1913
SAY-5 wants to merge 3 commits into
Kuadrant:mainfrom
SAY-5:fix/authconfigs-return-delete-errors

Conversation

@SAY-5

@SAY-5 SAY-5 commented Apr 19, 2026

Copy link
Copy Markdown

What

AuthConfigsReconciler.Reconcile in internal/controller/authconfigs_reconciler.go sweeps out stale AuthConfig objects but swallows the error:

for _, authConfig := range staleAuthConfigs {
    if err := r.client.Resource(...).Delete(ctx, authConfig.GetName(), metav1.DeleteOptions{}); err != nil {
        logger.Error(err, "failed to delete authconfig object", "authconfig", ...)
        // TODO: handle error
    }
}
return nil

The controller runtime only re-queues a reconcile when Reconcile returns a non-nil error. So a failed delete was a one-shot log line and the stale resource was left in-cluster until some other event happened to trigger another reconcile pass. Under API-server pressure, a transient permission glitch, or a brief conflict, that's long enough for stale AuthConfigs to accumulate, and for Authorino to keep enforcing policy fragments the operator already tried to remove.

Per #1867.

Fix

Collect per-delete failures and return them through errors.Join at the end of the function:

var deleteErrors []error
for _, authConfig := range staleAuthConfigs {
    if err := r.client.Resource(...).Delete(...); err != nil {
        logger.Error(err, "failed to delete authconfig object", "authconfig", ...)
        deleteErrors = append(deleteErrors, fmt.Errorf("failed to delete stale authconfig %s/%s: %w",
            authConfig.GetNamespace(), authConfig.GetName(), err))
    }
}

return errors.Join(deleteErrors...)
  • Successful deletes in the same loop still complete; only the failed ones are surfaced.
  • The existing per-object logger.Error(...) call is kept so the operator log still has the structured context.
  • errors.Is/errors.As in any upstream error handlers continue to work because we wrap the original error with %w.

errors.Join(nil...) returns nil, so the success path is unchanged from the previous return nil.

Fixes #1867

Summary by CodeRabbit

  • Bug Fixes
    • Improved error handling during stale configuration cleanup: deletion errors encountered while removing obsolete AuthConfig resources are now accumulated and reported collectively, while not-found cases are ignored. As a result, cleanup failures are surfaced instead of being silently skipped, and the reconciler may now return an error when stale configuration deletions fail for reasons other than not-found.

@coderabbitai

coderabbitai Bot commented Apr 19, 2026

Copy link
Copy Markdown

Review Change Stack

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 5755be4e-6cd6-428b-8e0a-481e5d18cae9

📥 Commits

Reviewing files that changed from the base of the PR and between 1ba7d65 and f724ec8.

📒 Files selected for processing (1)
  • internal/controller/authconfigs_reconciler.go
💤 Files with no reviewable changes (1)
  • internal/controller/authconfigs_reconciler.go

📝 Walkthrough

Walkthrough

The reconciler now aggregates deletion errors when removing stale AuthConfig resources, ignores NotFound as idempotent, and returns a combined error via errors.Join(...) after attempting all deletions.

Changes

AuthConfigs reconciler

Layer / File(s) Summary
Stale AuthConfig deletion and error aggregation
internal/controller/authconfigs_reconciler.go
Adds apierrors import; collects non-NotFound deletion errors for stale AuthConfigs into a slice, wraps them, and returns an aggregated error with errors.Join(...) instead of only logging failures.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Poem

🐰 I found the deletes that used to hide,
Skipped the gone ones, kept the rest beside,
I gathered errors, stitched them tight,
Joined their voices into one clear bite,
Hop—reconcile—now retries with pride.

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title accurately summarizes the main change: returning stale AuthConfig deletion errors to enable controller retries.
Linked Issues check ✅ Passed The changes fully implement the requirements from issue #1867: collect deletion errors with contextual information and return them via errors.Join for controller retry.
Out of Scope Changes check ✅ Passed All changes are directly scoped to implementing the deletion error handling for stale AuthConfigs as required by issue #1867; no unrelated modifications present.
Docstring Coverage ✅ Passed No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@internal/controller/authconfigs_reconciler.go`:
- Around line 98-102: The delete error aggregation in Reconcile for AuthConfigs
must ignore NotFound errors—import k8s.io/apimachinery/pkg/api/errors as
apierrors and, in the block that calls
r.client.Resource(kuadrantauthorino.AuthConfigsResource).Namespace(authConfig.GetNamespace()).Delete(...),
check if apierrors.IsNotFound(err) and if so treat it as success (optionally log
at V(1) rather than appending to deleteErrors); only append to deleteErrors and
log an error when err is non-nil and not apierrors.IsNotFound(err). Update the
delete handling in authconfigs_reconciler.go around the Delete call (the
deleteErrors slice and logger usage) accordingly.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 1f69e3ca-9793-4ac5-b0da-d8bc31b9b71e

📥 Commits

Reviewing files that changed from the base of the PR and between a0a0702 and 264ca2c.

📒 Files selected for processing (1)
  • internal/controller/authconfigs_reconciler.go

Comment thread internal/controller/authconfigs_reconciler.go
SAY-5 added a commit to SAY-5/kuadrant-operator that referenced this pull request Apr 20, 2026
…eting stale

Per CodeRabbit review on Kuadrant#1913: sibling reconcilers
(effective_dnspolicies_reconciler:311, effective_tls_policies_reconciler:131)
already skip apierrors.IsNotFound when cleaning up stale policy
objects. Align AuthConfigsReconciler to the same pattern so a stale
config that was already deleted out of band doesn't retrigger
reconciliation.

Signed-off-by: SAY-5 <SAY-5@users.noreply.github.com>
SAY-5 added a commit to SAY-5/kuadrant-operator that referenced this pull request May 31, 2026
…eting stale

Per CodeRabbit review on Kuadrant#1913: sibling reconcilers
(effective_dnspolicies_reconciler:311, effective_tls_policies_reconciler:131)
already skip apierrors.IsNotFound when cleaning up stale policy
objects. Align AuthConfigsReconciler to the same pattern so a stale
config that was already deleted out of band doesn't retrigger
reconciliation.

Signed-off-by: SAY-5 <SAY-5@users.noreply.github.com>
@SAY-5 SAY-5 force-pushed the fix/authconfigs-return-delete-errors branch from 61a21f6 to c7eb480 Compare May 31, 2026 02:41
SAY-5 added 2 commits June 2, 2026 13:59
…ontroller retries

AuthConfigsReconciler.Reconcile loops over the stale AuthConfigs it
intends to clean up and calls Delete on each one, but swallowed the
error:

	for _, authConfig := range staleAuthConfigs {
		if err := r.client.Resource(...).Delete(...); err != nil {
			logger.Error(err, "failed to delete authconfig object", ...)
			// TODO: handle error
		}
	}
	return nil

Because the controller runtime only retries a reconcile when Reconcile
returns a non-nil error, a failed delete was a one-shot log line and
the stale resource was left in-cluster until the next unrelated event
triggered another reconcile. Under API-server pressure or transient
permission glitches, that's long enough for stale AuthConfigs to
accumulate and for Authorino to keep serving policy fragments the
operator already tried to remove.

Collect per-delete failures into a slice and return them via
errors.Join, so any failed delete propagates up as a reconcile error
and the runtime re-queues the work. Successes across the same loop
still complete; only the failed ones are surfaced. Log line is kept
so the operator still has per-object context in the logs.

Fixes Kuadrant#1867

Signed-off-by: SAY-5 <SAY-5@users.noreply.github.com>
Signed-off-by: Sai Asish Y <say.apm35@gmail.com>
…eting stale

Per CodeRabbit review on Kuadrant#1913: sibling reconcilers
(effective_dnspolicies_reconciler:311, effective_tls_policies_reconciler:131)
already skip apierrors.IsNotFound when cleaning up stale policy
objects. Align AuthConfigsReconciler to the same pattern so a stale
config that was already deleted out of band doesn't retrigger
reconciliation.

Signed-off-by: SAY-5 <SAY-5@users.noreply.github.com>
Signed-off-by: Sai Asish Y <say.apm35@gmail.com>
@SAY-5 SAY-5 force-pushed the fix/authconfigs-return-delete-errors branch from c7eb480 to 1ba7d65 Compare June 2, 2026 21:00
maksymvavilov
maksymvavilov previously approved these changes Jun 9, 2026

@maksymvavilov maksymvavilov left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@SAY-5 up to you to act or ignore the nit. In any case this is good

if apierrors.IsNotFound(err) {
// Already gone - idempotent success, matching the pattern
// used in effective_dnspolicies_reconciler.go:311 and
// effective_tls_policies_reconciler.go:131.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: a grumpy side of mine sees line numbers and yells: "This will go out of date in a month!"; It is a fairly well-known pattern, and I see no added benefit of the comments. If this were me, I'd remove them, but no engineer died of having more comments afaik - this should not hold the merger.

Signed-off-by: Sai Asish Y <say.apm35@gmail.com>
@SAY-5

SAY-5 commented Jun 9, 2026

Copy link
Copy Markdown
Author

Fair point, the file references would rot quickly. Dropped the comment block in f724ec8.

@maksymvavilov maksymvavilov left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good work! Thank you

@maksymvavilov maksymvavilov enabled auto-merge June 10, 2026 09:29

@guicassolato guicassolato left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The solution proposed here is consistent with the hints left in #1867. However, I think I would favour a more holistic approach to tackle the problem of non-blocking, final errors raised by the API server (#946).

The current State of the World reconciliation has several locations where API server errors are logged but not propagated (marked with // TODO: handle error). The challenge is:

  • Returning errors aborts the workflow - If you return an error from a reconciler, the entire workflow stops, preventing other reconcilers from running
  • Ignoring errors causes resource leaks - Stale resources accumulate when deletion/update failures are swallowed
  • Standard controller retry waits for new events - Without returning an error, there's no automatic retry until another API server event arrives
  • Risk of infinite loops - Naive retry mechanisms could create reconciliation loops

I think we should introduce a reconciliation error registry that collects all non-blocking errors during workflow execution and triggers a single deferred reconciliation after the workflow completes successfully.

It could be added to the final steps of the workflow:

func (b *BootOptionsBuilder) finalStepsWorkflow() *controller.Workflow {
	workflow := &controller.Workflow{
		Tasks: []controller.ReconcileFunc{
			// ... existing tasks ...
		},
		Postcondition: // Add error retry reconciler as the final task
			traceReconcileFunc("finalize.error_retry", NewErrorRetryReconciler().Reconcile),
	}
    
	// ... rest of the function
}

Then, in the AuthConfig reconciler (and similar):

func (r *AuthConfigsReconciler) Reconcile(ctx context.Context, _ []controller.ResourceEvent, topology *machinery.Topology, _ error, state *sync.Map) error {
	// ... 

	// Get or create error registry from state
	errorRegistry := GetOrCreateErrorRegistry(state)

	for _, authConfig := range staleAuthConfigs {
		if err := r.client.Resource(kuadrantauthorino.AuthConfigsResource).Namespace(authConfig.GetNamespace()).Delete(ctx, authConfig.GetName(), metav1.DeleteOptions{}); err != nil {
			logger.Error(…)

			// Record the error for deferred retry instead of returning it
			errorRegistry.Record(…)
	}

	// ... 
}

Finally, the implementation of a hypothetical ErrorRetryReconciler can get complex, but it shoould:

  1. Check if the registry contains any error
  2. Merges the tracked errors into an error tracker struct that lives outside the workflow

The external error tracker can be a wrapper for Policy Machinery's controller that implements controller-runtime's Reconcile. It:

  1. Keeps count of the accrued number of occurrences for each type of error
  2. Triggers a retry (by returning a controller-runtime's Result)

@maksymvavilov maksymvavilov disabled auto-merge June 10, 2026 10:16
@SAY-5

SAY-5 commented Jun 11, 2026

Copy link
Copy Markdown
Author

Agreed the registry plus deferred-retry reconciler in #946 is the right long-term shape; this only patches the one stale-AuthConfig delete path that #1867 pointed at and would be redundant once the workflow-wide handling lands. Happy to keep it as an interim stopgap if useful, or close it in favour of #946 if you'd rather not carry the localized version. Your call.

@SAY-5

SAY-5 commented Jun 17, 2026

Copy link
Copy Markdown
Author

Thanks, that framing of the four failure modes is helpful and the registry-plus-deferred-retry approach makes sense as the real fix for the broader set of swallowed errors across the State of the World workflow, not just this one delete.

I'm happy to take that on, but it's a meaningfully larger change than this PR's scope (a new error registry in state, GetOrCreateErrorRegistry, the ErrorRetryReconciler postcondition, and the external counting/backoff tracker), so I'd rather not bolt a half version onto this one. Would you prefer I close this and open a dedicated PR that introduces the registry and converts the existing // TODO: handle error sites to it, or keep this open and grow it into that? Either works, just want to match how you'd like #946 tracked.

@SAY-5

SAY-5 commented Jun 18, 2026

Copy link
Copy Markdown
Author

That makes sense, a deferred error-registry that collects non-blocking API-server errors and runs one retry reconciler at the end of the workflow is a cleaner fit for #946 than patching each TODO site individually. This PR only covers the AuthConfig deletion path, so it would become one consumer of that registry rather than the whole solution. Happy to either rescope this to just wire the AuthConfig path into the registry once the shared piece exists, or close it in favor of a dedicated PR for the holistic approach, whichever you prefer.

guicassolato added a commit that referenced this pull request Jun 19, 2026
Implements a holistic solution for handling API server errors that should
not block the reconciliation workflow but require automatic retry.

Previously, errors during resource cleanup (e.g., AuthConfig deletion) were
logged but not handled, leading to resource leaks when operations failed due
to transient issues like rate limiting, network errors, or permission issues.
Returning errors would abort the entire workflow, preventing other reconcilers
from running.

## Solution Architecture

Implements a two-layer error tracking system with automatic retry:

### 1. ErrorRegistry (Per-Reconciliation)
- Lives in workflow state (`*sync.Map`) for a single reconciliation cycle
- Reconcilers record non-blocking errors instead of returning them
- Workflow continues - all reconcilers run despite individual failures

### 2. PersistentErrorTracker (Cross-Reconciliation)
- Persists errors across reconciliation cycles
- Deduplicates by unique key: {GroupKind}/{Namespace}/{Name}/{Operation}
- Tracks retry count per error
- Detects resolution: errors absent from new registry are removed
- Implements exponential backoff: 2s → 4s → 8s → 16s → 32s → 60s (max)

### 3. ReconciliationErrorHandler (Workflow Postcondition)
- Runs at workflow end to merge errors from registry to tracker
- Calculates retry delay
- Schedules retry via callback with original events

### 4. KuadrantController (Retry Trigger)
- Wraps policy-machinery controller
- Provides ScheduleRetry callback
- Creates timer that calls Propagate(events) after delay
- Replays original events to ensure proper context

## Key Features

- **Non-blocking**: Errors recorded, not returned; workflow completes
- **Automatic retry**: Programmatic retry via timer + Propagate()
- **Event replay**: Original events captured and replayed on retry
- **Exponential backoff**: 2s, 4s, 8s, 16s, 32s, 60s (capped)
- **Max attempts**: 5 retries then give up (prevents infinite loops)
- **Auto-resolution**: Successful operations remove errors from tracker
- **Deduplication**: Same error key increments count, not duplicated
- **Observability**: Detailed logging at each stage

## Implementation Details

Modified reconcilers (AuthConfigsReconciler, LimitadorLimitsReconciler) now:
1. Get ErrorRegistry from workflow state
2. Record failures via registry.Record() instead of returning errors
3. Continue with nil return (workflow proceeds)

Error handler runs as final postcondition:
1. Reads errors from ephemeral ErrorRegistry
2. Merges into PersistentErrorTracker
3. Calls scheduleRetry(delay, events) callback

KuadrantController.ScheduleRetry():
1. Creates time.AfterFunc(delay, ...)
2. Timer fires → calls c.Controller.Propagate(events)
3. Workflow re-runs with original events
4. Failed operations retry; successful ones don't error

## Testing

Manually tested both scenarios:
- Max retries exceeded: Operator stops after 5 attempts (~2 min)
- Error recovery: RBAC restored mid-retry, cleanup succeeds

Unit tests cover:
- Error recording and registry behavior
- Persistent tracker deduplication and retry counting
- Exponential backoff calculation
- Error resolution detection
- Retry scheduling callback

## Dependencies

Requires policy-machinery with exported Propagate() method:
- Updated to v0.8.1-0.20260611112329-244a3ec8172a

Resolves: #946, #1867
Supersedes: #1913

Co-authored-by: Claude Sonnet 4.5 <noreply@anthropic.com>
Signed-off-by: Guilherme Cassolato <guicassolato@gmail.com>
@guicassolato

Copy link
Copy Markdown
Contributor

Closing in favour of #2043.

Thanks for the work, @SAY-5!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

AuthConfigs reconciler doesn't handle deletion errors, causing potential resource leaks

3 participants