internal/controller: return stale AuthConfig deletion errors so the controller retries#1913
internal/controller: return stale AuthConfig deletion errors so the controller retries#1913SAY-5 wants to merge 3 commits into
Conversation
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Organization UI Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (1)
💤 Files with no reviewable changes (1)
📝 WalkthroughWalkthroughThe reconciler now aggregates deletion errors when removing stale AuthConfig resources, ignores NotFound as idempotent, and returns a combined error via errors.Join(...) after attempting all deletions. ChangesAuthConfigs reconciler
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes Poem
🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@internal/controller/authconfigs_reconciler.go`:
- Around line 98-102: The delete error aggregation in Reconcile for AuthConfigs
must ignore NotFound errors—import k8s.io/apimachinery/pkg/api/errors as
apierrors and, in the block that calls
r.client.Resource(kuadrantauthorino.AuthConfigsResource).Namespace(authConfig.GetNamespace()).Delete(...),
check if apierrors.IsNotFound(err) and if so treat it as success (optionally log
at V(1) rather than appending to deleteErrors); only append to deleteErrors and
log an error when err is non-nil and not apierrors.IsNotFound(err). Update the
delete handling in authconfigs_reconciler.go around the Delete call (the
deleteErrors slice and logger usage) accordingly.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 1f69e3ca-9793-4ac5-b0da-d8bc31b9b71e
📒 Files selected for processing (1)
internal/controller/authconfigs_reconciler.go
…eting stale Per CodeRabbit review on Kuadrant#1913: sibling reconcilers (effective_dnspolicies_reconciler:311, effective_tls_policies_reconciler:131) already skip apierrors.IsNotFound when cleaning up stale policy objects. Align AuthConfigsReconciler to the same pattern so a stale config that was already deleted out of band doesn't retrigger reconciliation. Signed-off-by: SAY-5 <SAY-5@users.noreply.github.com>
…eting stale Per CodeRabbit review on Kuadrant#1913: sibling reconcilers (effective_dnspolicies_reconciler:311, effective_tls_policies_reconciler:131) already skip apierrors.IsNotFound when cleaning up stale policy objects. Align AuthConfigsReconciler to the same pattern so a stale config that was already deleted out of band doesn't retrigger reconciliation. Signed-off-by: SAY-5 <SAY-5@users.noreply.github.com>
61a21f6 to
c7eb480
Compare
…ontroller retries
AuthConfigsReconciler.Reconcile loops over the stale AuthConfigs it
intends to clean up and calls Delete on each one, but swallowed the
error:
for _, authConfig := range staleAuthConfigs {
if err := r.client.Resource(...).Delete(...); err != nil {
logger.Error(err, "failed to delete authconfig object", ...)
// TODO: handle error
}
}
return nil
Because the controller runtime only retries a reconcile when Reconcile
returns a non-nil error, a failed delete was a one-shot log line and
the stale resource was left in-cluster until the next unrelated event
triggered another reconcile. Under API-server pressure or transient
permission glitches, that's long enough for stale AuthConfigs to
accumulate and for Authorino to keep serving policy fragments the
operator already tried to remove.
Collect per-delete failures into a slice and return them via
errors.Join, so any failed delete propagates up as a reconcile error
and the runtime re-queues the work. Successes across the same loop
still complete; only the failed ones are surfaced. Log line is kept
so the operator still has per-object context in the logs.
Fixes Kuadrant#1867
Signed-off-by: SAY-5 <SAY-5@users.noreply.github.com>
Signed-off-by: Sai Asish Y <say.apm35@gmail.com>
…eting stale Per CodeRabbit review on Kuadrant#1913: sibling reconcilers (effective_dnspolicies_reconciler:311, effective_tls_policies_reconciler:131) already skip apierrors.IsNotFound when cleaning up stale policy objects. Align AuthConfigsReconciler to the same pattern so a stale config that was already deleted out of band doesn't retrigger reconciliation. Signed-off-by: SAY-5 <SAY-5@users.noreply.github.com> Signed-off-by: Sai Asish Y <say.apm35@gmail.com>
c7eb480 to
1ba7d65
Compare
maksymvavilov
left a comment
There was a problem hiding this comment.
@SAY-5 up to you to act or ignore the nit. In any case this is good
| if apierrors.IsNotFound(err) { | ||
| // Already gone - idempotent success, matching the pattern | ||
| // used in effective_dnspolicies_reconciler.go:311 and | ||
| // effective_tls_policies_reconciler.go:131. |
There was a problem hiding this comment.
Nit: a grumpy side of mine sees line numbers and yells: "This will go out of date in a month!"; It is a fairly well-known pattern, and I see no added benefit of the comments. If this were me, I'd remove them, but no engineer died of having more comments afaik - this should not hold the merger.
Signed-off-by: Sai Asish Y <say.apm35@gmail.com>
|
Fair point, the file references would rot quickly. Dropped the comment block in f724ec8. |
maksymvavilov
left a comment
There was a problem hiding this comment.
Good work! Thank you
guicassolato
left a comment
There was a problem hiding this comment.
The solution proposed here is consistent with the hints left in #1867. However, I think I would favour a more holistic approach to tackle the problem of non-blocking, final errors raised by the API server (#946).
The current State of the World reconciliation has several locations where API server errors are logged but not propagated (marked with // TODO: handle error). The challenge is:
- Returning errors aborts the workflow - If you return an error from a reconciler, the entire workflow stops, preventing other reconcilers from running
- Ignoring errors causes resource leaks - Stale resources accumulate when deletion/update failures are swallowed
- Standard controller retry waits for new events - Without returning an error, there's no automatic retry until another API server event arrives
- Risk of infinite loops - Naive retry mechanisms could create reconciliation loops
I think we should introduce a reconciliation error registry that collects all non-blocking errors during workflow execution and triggers a single deferred reconciliation after the workflow completes successfully.
It could be added to the final steps of the workflow:
func (b *BootOptionsBuilder) finalStepsWorkflow() *controller.Workflow {
workflow := &controller.Workflow{
Tasks: []controller.ReconcileFunc{
// ... existing tasks ...
},
Postcondition: // Add error retry reconciler as the final task
traceReconcileFunc("finalize.error_retry", NewErrorRetryReconciler().Reconcile),
}
// ... rest of the function
}Then, in the AuthConfig reconciler (and similar):
func (r *AuthConfigsReconciler) Reconcile(ctx context.Context, _ []controller.ResourceEvent, topology *machinery.Topology, _ error, state *sync.Map) error {
// ...
// Get or create error registry from state
errorRegistry := GetOrCreateErrorRegistry(state)
for _, authConfig := range staleAuthConfigs {
if err := r.client.Resource(kuadrantauthorino.AuthConfigsResource).Namespace(authConfig.GetNamespace()).Delete(ctx, authConfig.GetName(), metav1.DeleteOptions{}); err != nil {
logger.Error(…)
// Record the error for deferred retry instead of returning it
errorRegistry.Record(…)
}
// ...
}Finally, the implementation of a hypothetical ErrorRetryReconciler can get complex, but it shoould:
- Check if the registry contains any error
- Merges the tracked errors into an error tracker struct that lives outside the workflow
The external error tracker can be a wrapper for Policy Machinery's controller that implements controller-runtime's Reconcile. It:
- Keeps count of the accrued number of occurrences for each type of error
- Triggers a retry (by returning a controller-runtime's
Result)
|
Agreed the registry plus deferred-retry reconciler in #946 is the right long-term shape; this only patches the one stale-AuthConfig delete path that #1867 pointed at and would be redundant once the workflow-wide handling lands. Happy to keep it as an interim stopgap if useful, or close it in favour of #946 if you'd rather not carry the localized version. Your call. |
|
Thanks, that framing of the four failure modes is helpful and the registry-plus-deferred-retry approach makes sense as the real fix for the broader set of swallowed errors across the State of the World workflow, not just this one delete. I'm happy to take that on, but it's a meaningfully larger change than this PR's scope (a new error registry in state, GetOrCreateErrorRegistry, the ErrorRetryReconciler postcondition, and the external counting/backoff tracker), so I'd rather not bolt a half version onto this one. Would you prefer I close this and open a dedicated PR that introduces the registry and converts the existing |
|
That makes sense, a deferred error-registry that collects non-blocking API-server errors and runs one retry reconciler at the end of the workflow is a cleaner fit for #946 than patching each TODO site individually. This PR only covers the AuthConfig deletion path, so it would become one consumer of that registry rather than the whole solution. Happy to either rescope this to just wire the AuthConfig path into the registry once the shared piece exists, or close it in favor of a dedicated PR for the holistic approach, whichever you prefer. |
Implements a holistic solution for handling API server errors that should
not block the reconciliation workflow but require automatic retry.
Previously, errors during resource cleanup (e.g., AuthConfig deletion) were
logged but not handled, leading to resource leaks when operations failed due
to transient issues like rate limiting, network errors, or permission issues.
Returning errors would abort the entire workflow, preventing other reconcilers
from running.
## Solution Architecture
Implements a two-layer error tracking system with automatic retry:
### 1. ErrorRegistry (Per-Reconciliation)
- Lives in workflow state (`*sync.Map`) for a single reconciliation cycle
- Reconcilers record non-blocking errors instead of returning them
- Workflow continues - all reconcilers run despite individual failures
### 2. PersistentErrorTracker (Cross-Reconciliation)
- Persists errors across reconciliation cycles
- Deduplicates by unique key: {GroupKind}/{Namespace}/{Name}/{Operation}
- Tracks retry count per error
- Detects resolution: errors absent from new registry are removed
- Implements exponential backoff: 2s → 4s → 8s → 16s → 32s → 60s (max)
### 3. ReconciliationErrorHandler (Workflow Postcondition)
- Runs at workflow end to merge errors from registry to tracker
- Calculates retry delay
- Schedules retry via callback with original events
### 4. KuadrantController (Retry Trigger)
- Wraps policy-machinery controller
- Provides ScheduleRetry callback
- Creates timer that calls Propagate(events) after delay
- Replays original events to ensure proper context
## Key Features
- **Non-blocking**: Errors recorded, not returned; workflow completes
- **Automatic retry**: Programmatic retry via timer + Propagate()
- **Event replay**: Original events captured and replayed on retry
- **Exponential backoff**: 2s, 4s, 8s, 16s, 32s, 60s (capped)
- **Max attempts**: 5 retries then give up (prevents infinite loops)
- **Auto-resolution**: Successful operations remove errors from tracker
- **Deduplication**: Same error key increments count, not duplicated
- **Observability**: Detailed logging at each stage
## Implementation Details
Modified reconcilers (AuthConfigsReconciler, LimitadorLimitsReconciler) now:
1. Get ErrorRegistry from workflow state
2. Record failures via registry.Record() instead of returning errors
3. Continue with nil return (workflow proceeds)
Error handler runs as final postcondition:
1. Reads errors from ephemeral ErrorRegistry
2. Merges into PersistentErrorTracker
3. Calls scheduleRetry(delay, events) callback
KuadrantController.ScheduleRetry():
1. Creates time.AfterFunc(delay, ...)
2. Timer fires → calls c.Controller.Propagate(events)
3. Workflow re-runs with original events
4. Failed operations retry; successful ones don't error
## Testing
Manually tested both scenarios:
- Max retries exceeded: Operator stops after 5 attempts (~2 min)
- Error recovery: RBAC restored mid-retry, cleanup succeeds
Unit tests cover:
- Error recording and registry behavior
- Persistent tracker deduplication and retry counting
- Exponential backoff calculation
- Error resolution detection
- Retry scheduling callback
## Dependencies
Requires policy-machinery with exported Propagate() method:
- Updated to v0.8.1-0.20260611112329-244a3ec8172a
Resolves: #946, #1867
Supersedes: #1913
Co-authored-by: Claude Sonnet 4.5 <noreply@anthropic.com>
Signed-off-by: Guilherme Cassolato <guicassolato@gmail.com>
What
AuthConfigsReconciler.Reconcileininternal/controller/authconfigs_reconciler.gosweeps out staleAuthConfigobjects but swallows the error:The controller runtime only re-queues a reconcile when
Reconcilereturns a non-nil error. So a failed delete was a one-shot log line and the stale resource was left in-cluster until some other event happened to trigger another reconcile pass. Under API-server pressure, a transient permission glitch, or a brief conflict, that's long enough for stale AuthConfigs to accumulate, and for Authorino to keep enforcing policy fragments the operator already tried to remove.Per #1867.
Fix
Collect per-delete failures and return them through
errors.Joinat the end of the function:logger.Error(...)call is kept so the operator log still has the structured context.errors.Is/errors.Asin any upstream error handlers continue to work because we wrap the original error with%w.errors.Join(nil...)returnsnil, so the success path is unchanged from the previousreturn nil.Fixes #1867
Summary by CodeRabbit