Skip to content

fix: harden FlowCache lifecycle and drop redundant cache from MetadataDAO#4137

Draft
MohamadJaara wants to merge 4 commits into
developfrom
fix/flowcache-race-and-metadata-cache
Draft

fix: harden FlowCache lifecycle and drop redundant cache from MetadataDAO#4137
MohamadJaara wants to merge 4 commits into
developfrom
fix/flowcache-race-and-metadata-cache

Conversation

@MohamadJaara
Copy link
Copy Markdown
Member

Context

Follow-up to #4087. Do not merge until #4087 is merged I'll rebase this branch onto develop once that lands.

#4087 fixed two coroutine leaks (zombie launchSharing coroutines in FlowCache, and a permanent SharedFlow per observeSerializable call). Code review on that PR surfaced four additional concerns that this PR addresses, plus one simplification enabled by the per-key SQLDelight notifications from #4135.

What changes and why

1. Race in FlowCache.onCompletion FlowCache.kt

When remove(key) is called externally, it cancels the old sharingJob. Cancellation propagates asynchronously to the upstream, firing its onCompletion. If a new get(key) slips in between, the late onCompletion of the old flow would have wiped the newly-registered entry.

Fix: onCompletion now calls removeIfOwned(key, sharingJob), which only clears state if the registered sharing job is still the one firing an identity check on the Job reference. Late completions from evicted flows become no-ops.

2. Orphan SupervisorJob when flowProducer throws FlowCache.kt

In the original implementation, sharingJobs[key] = sharingJob was set before invoking flowProducer(key). If the producer threw, the SupervisorJob remained as a child of cacheScope's Job forever (or until cacheScope itself was cancelled), with no corresponding storage entry.

Fix: createFlow() now uses a try/finally with a registered flag. The job is only added to sharingJobs after shareIn succeeds; if anything throws, the orphan SupervisorJob is cancelled in finally.

3. Defensive validation: cacheScope must have a Job FlowCache.kt

SupervisorJob(cacheScope.coroutineContext[Job]) accepts a nullable parent. If a caller ever constructed cacheScope from a bare EmptyCoroutineContext, the SupervisorJob would become orphaned and cancellation from cacheScope would not propagate.

Fix: requireNotNull check on construction. Standard CoroutineScope(...) factory always provides a Job so existing callers are unaffected; this guards against future misuse.

4. Remove FlowCache from MetadataDAO MetadataDAOImpl.kt, UserDatabaseBuilder.kt

With the per-key @CustomKey / @NotifyCustomKey annotations introduced in #4135, SQLDelight now wakes up only the listener for the specific key that changed. The original reason for FlowCache here collapsing N subscribers to one upstream query listener — no longer earns its keep. Each valueByKeyFlow(key) subscriber gets its own cheap per-key SQLDelight flow.

Changes:

  • valueByKeyFlow queries selectValueByKey directly with .flowOn(readDispatcher).
  • observeSerializable likewise queries directly, then .map { decode }.distinctUntilChanged().flowOn(readDispatcher).
  • The metadataCache: FlowCache<String, String?> field is removed from MetadataDAOImpl and from UserDatabaseBuilder.

This also dissolves the LazilyWhileSubscribed(5_000) behavioral change called out in the review of #4087: observeSerializable no longer does its own shareIn, so neither sharing strategy applies.

FlowCache itself is retained UserDAOImpl, ConversationDAOImpl, and MemberDAO still depend on it.

Tests

All changes are covered by tests in FlowCacheTest.kt:

  • givenEntryRecreatedAfterEviction_whenOldUpstreamLatelyCompletes_thenNewEntryIsNotRemoved regression for the eviction race.
  • givenFlowProducerThrows_whenGetIsCalled_thenNoOrphanSharingJobIsLeftBehind fails on the unfixed code with "1 orphan job(s) remain", passes after the try/finally fix.
  • givenCacheScopeWithoutJob_whenFlowCacheIsConstructed_thenItRejectsTheScope asserts the requireNotNull rejection.
  • givenSubscribedFlow_whenCacheScopeJobIsCancelled_thenSharingJobIsAlsoCancelled regression guard on parent-child wiring (would catch a future "fix" that swaps SupervisorJob(parent) for SupervisorJob()).
  • givenMultipleCacheEvictions_thenSharingCoroutinesShouldNotAccumulate enhanced from fix: prevent zombie SharedFlow coroutines in FlowCache and MetadataDAO [WPB-25125] #4087 with per-child isCompleted/isCancelled assertions for better diagnostics if zombies ever regress.

Existing MetadataDAOTest, UserConfigDAOTest, and UserPrefsDAOTest confirm the observeSerializable / valueByKeyFlow refactor preserves caller-visible behavior.

@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented May 13, 2026

Test Results

0 tests   - 4 911   0 ✅  - 4 796   0s ⏱️ - 2m 58s
0 suites  -   806   0 💤  -   115 
0 files    -   806   0 ❌ ±    0 

Results for commit 086fa9f. ± Comparison against base commit 541dc48.

♻️ This comment has been updated with latest results.

@codecov-commenter
Copy link
Copy Markdown

Codecov Report

❌ Patch coverage is 92.59259% with 2 lines in your changes missing coverage. Please review.
✅ Project coverage is 61.56%. Comparing base (9eca021) to head (086fa9f).
⚠️ Report is 1 commits behind head on develop.

Files with missing lines Patch % Lines
...lin/com/wire/kalium/persistence/cache/FlowCache.kt 92.00% 0 Missing and 2 partials ⚠️
Additional details and impacted files
@@              Coverage Diff              @@
##             develop    #4137      +/-   ##
=============================================
+ Coverage      61.55%   61.56%   +0.01%     
- Complexity      4021     4024       +3     
=============================================
  Files           2067     2068       +1     
  Lines          67423    67447      +24     
  Branches        6650     6655       +5     
=============================================
+ Hits           41500    41523      +23     
- Misses         23276    23277       +1     
  Partials        2647     2647              
Files with missing lines Coverage Δ
...com/wire/kalium/persistence/dao/MetadataDAOImpl.kt 83.87% <100.00%> (-1.43%) ⬇️
.../wire/kalium/persistence/db/UserDatabaseBuilder.kt 79.68% <ø> (-0.32%) ⬇️
...lin/com/wire/kalium/persistence/cache/FlowCache.kt 95.12% <92.00%> (-4.88%) ⬇️

... and 5 files with indirect coverage changes


Continue to review full report in Codecov by Sentry.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 9eca021...086fa9f. Read the comment docs.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

@github-actions
Copy link
Copy Markdown
Contributor

🐰 Bencher Report

Branchfix/flowcache-race-and-metadata-cache
Testbedubuntu-latest

⚠️ WARNING: No Threshold found!

Without a Threshold, no Alerts will ever be generated.

Click here to create a new Threshold
For more information, see the Threshold documentation.
To only post results if a Threshold exists, set the --ci-only-thresholds flag.

Click to view all benchmark results
BenchmarkLatencymicroseconds (µs)
com.wire.kalium.benchmarks.logic.CoreLogicBenchmark.createObjectInFiles📈 view plot
⚠️ NO THRESHOLD
900.82 µs
com.wire.kalium.benchmarks.logic.CoreLogicBenchmark.createObjectInMemory📈 view plot
⚠️ NO THRESHOLD
330,414.00 µs
com.wire.kalium.benchmarks.persistence.MessageReadBenchmark.inboxPagingDeepPageBenchmark📈 view plot
⚠️ NO THRESHOLD
128,180.86 µs
com.wire.kalium.benchmarks.persistence.MessageReadBenchmark.inboxPagingFirstPageBenchmark📈 view plot
⚠️ NO THRESHOLD
122,748.38 µs
com.wire.kalium.benchmarks.persistence.MessageReadBenchmark.localMarkAsReadBenchmark📈 view plot
⚠️ NO THRESHOLD
3,224.62 µs
com.wire.kalium.benchmarks.persistence.MessageReadBenchmark.messagePagingDeepPageBenchmark📈 view plot
⚠️ NO THRESHOLD
27,011.37 µs
com.wire.kalium.benchmarks.persistence.MessageReadBenchmark.messagePagingFirstPageBenchmark📈 view plot
⚠️ NO THRESHOLD
11,282.66 µs
com.wire.kalium.benchmarks.persistence.MessagesNoPragmaTuneBenchmark.messageInsertionBenchmark📈 view plot
⚠️ NO THRESHOLD
1,232,475.96 µs
com.wire.kalium.benchmarks.persistence.MessagesNoPragmaTuneBenchmark.queryMessagesBenchmark📈 view plot
⚠️ NO THRESHOLD
20,376.10 µs
🐰 View full continuous benchmarking report in Bencher

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants