fix: cache only completed container profiles to ensure data integrity#813
Conversation
|
Warning Rate limit exceeded
You’ve run out of usage credits. Purchase more in the billing tab. ⌛ How to resolve this issue?After the wait time has elapsed, a review can be triggered using the We recommend that you space out your commits to avoid hitting the rate limit. 🚦 How do rate limits work?CodeRabbit enforces hourly rate limits for each developer per organization. Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout. Please see our FAQ for further information. ℹ️ Review info⚙️ Run configurationConfiguration used: defaults Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (3)
📝 WalkthroughWalkthroughContainer profile cache now requires the consolidated ContainerProfile to have a terminal status annotation (Completed or TooLarge) before populating or rebuilding projections; a CompletionNotifier interface and wiring let the profile manager notify the cache, which retries promoting pending entries when a CP becomes completed. Tests and fixtures are updated to reflect completed-status eligibility. ChangesContainer Profile Cache Completion-Only Gating
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes Possibly related PRs
Poem
🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@pkg/objectcache/containerprofilecache/containerprofilecache.go`:
- Around line 365-373: The code currently sets cp = nil when the retrieved
container profile's status is not Completed, but later the synthetic-CP fallback
(the overlay-based construction logic) can still populate the cache; change the
flow so a non-Completed real CP prevents any caching/overlay fallback. In
tryPopulateEntry, detect when cp exists but its
annotations[helpersv1.StatusMetadataKey] != helpersv1.Completed and either
return early or set a guard (e.g., skipCaching or allowFallback=false) so the
synthetic-CP creation path does not run; ensure the synthetic-CP/overlay logic
(the fallback starting where overlays are merged) checks that guard or that an
original cp was Completed before creating/adding an entry. Use the cp variable
and the synthetic-CP fallback code paths as the points to modify.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: f2a84457-8e70-4481-9d49-6f655e7308c6
📒 Files selected for processing (5)
pkg/objectcache/containerprofilecache/containerprofilecache.gopkg/objectcache/containerprofilecache/containerprofilecache_test.gopkg/objectcache/containerprofilecache/reconciler.gopkg/objectcache/containerprofilecache/reconciler_test.gopkg/objectcache/containerprofilecache/t8_overlay_refresh_test.go
Performance Benchmark ResultsNode-Agent Resource Usage
Dedup Effectiveness (AFTER only)
Event Counters
|
Signed-off-by: Matthias Bertschy <matthias.bertschy@gmail.com>
22f2e68 to
93156c8
Compare
Signed-off-by: Matthias Bertschy <matthias.bertschy@gmail.com>
Performance Benchmark ResultsNode-Agent Resource Usage
Dedup Effectiveness (AFTER only)
Event Counters
|
Instead of polling pending containers every 5s (6× storage pressure), wire a CompletionNotifier from containerprofilemanager to the CP cache. NotifyContainerCompleted is called on ContainerReachedMaxTime and ObjectCompletedError, launching a bounded retry goroutine (5 × 3s) that promotes the pending entry as soon as storage consolidation completes. The 30s reconciler tick retains retryPendingEntries as a safety net. Signed-off-by: Matthias Bertschy <matthias.bertschy@gmail.com> Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2814e43 to
9b767e3
Compare
…tion window Storage consolidation runs every 30s; the original 5×3s=15s retry window expired before consolidation completed, leaving the container in pending until the next 30s reconciler tick. Increasing to 20×3s=60s ensures at least one retry fires after consolidation, closing the race with Test_12_MergingProfilesTest's 10s post-completion sleep. Signed-off-by: Matthias Bertschy <matthias.bertschy@gmail.com> Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Performance Benchmark ResultsNode-Agent Resource Usage
Dedup EffectivenessNo data available. Event Counters
|
Performance Benchmark ResultsNode-Agent Resource Usage
Dedup Effectiveness (AFTER only)
Event Counters
|
TooLarge is a terminal state (manager stopped collecting; data is truncated but valid for detection). The Completed-only gate in tryPopulateEntry and refreshOneEntry incorrectly kept TooLarge containers pending indefinitely since they never transition to Completed. Introduce isTerminalCPStatus(Completed|TooLarge) and apply it in both code paths. Add TestTooLargeCP_Accepted covering the addContainer and refresh paths. Signed-off-by: Matthias Bertschy <matthias.bertschy@gmail.com> Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Wire notifyCompleted into the ObjectTooLargeError path in handleSaveProfileError so pending containers promoted to TooLarge get the same immediate cache-promotion goroutine as Completed ones. Without this, a TooLarge container stayed in the pending map for up to 30s (next reconciler tick) even though a valid truncated CP was already in storage. Add TestNotifyContainerTerminal_TooLarge to assert the fast-path promotion fires without waiting for the tick. Signed-off-by: Matthias Bertschy <matthias.bertschy@gmail.com> Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Performance Benchmark ResultsNode-Agent Resource Usage
Dedup EffectivenessNo data available. Event Counters
|
When a container exits normally the monitorContainer loop handles ContainerHasTerminatedError: it saves the final CP (Status=Completed set by lifecycle.go) but previously never called notifyCompleted, so pending cache entries were only promoted on the next 30s reconciler tick. Fire notifyCompleted when the terminal status is Completed so the fast-path promotion goroutine runs immediately. Add TestNotifyContainerTerminal_Completed: pending → Completed via notification, verified without waiting for the periodic tick. Signed-off-by: Matthias Bertschy <matthias.bertschy@gmail.com> Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Performance Benchmark ResultsNode-Agent Resource Usage
Dedup EffectivenessNo data available. |
goradd/maps.SafeMap.Load() reads m.items == nil without a lock while Set() initializes m.items under a write lock, causing a data race on the first concurrent access to a zero-value SafeMap. Pre-initialize entries and pending in NewContainerProfileCache so m.items is non-nil before any goroutine can call Load concurrently. Signed-off-by: Matthias Bertschy <matthias.bertschy@gmail.com> Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Performance Benchmark ResultsNode-Agent Resource Usage
Dedup EffectivenessNo data available. Event Counters
|
Performance Benchmark ResultsNode-Agent Resource Usage
Dedup Effectiveness (AFTER only)
Event Counters
|
Summary by CodeRabbit
Bug Fixes
New Features
Tests