fix: prevent zombie SharedFlow coroutines in FlowCache and MetadataDAO [WPB-25125]#4087
fix: prevent zombie SharedFlow coroutines in FlowCache and MetadataDAO [WPB-25125]#4087sbra0902 wants to merge 3 commits into
Conversation
|
|
|
Hi, thanks for the contribution, but we enforce signed commits on this repo, and it can be merged only with signed commits. Can you please enable it and rebase your branch, and we can get it merged right after? |
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## develop #4087 +/- ##
==========================================
Coverage 61.55% 61.55%
Complexity 4021 4021
==========================================
Files 2067 2067
Lines 67423 67432 +9
Branches 6650 6651 +1
==========================================
+ Hits 41500 41510 +10
Misses 23276 23276
+ Partials 2647 2646 -1
... and 2 files with indirect coverage changes Continue to review full report in Codecov by Sentry.
🚀 New features to boost your workflow:
|
|
also for the sqlDelight listner this is something on our radar for a while and we have a POC fix for this that should make it better not only for metadata but for messages and conv lists as well. |
|
you can try the branch from this draft PR #4011 to see if this also fixes the other part of the issue |
done |
nice i suppose that will address the unecessary emits <3 But i think i won't fix the root cause for the zombies |
|
|
sorry that it took me some time to get the review Worth noting that
The user/conversation/member ones are keyed by IDs, so keys churn a lot as you navigate around the app, every eviction leaked a zombie coroutine into Also, after the recently merged SQLDelight fork update (#4135) we now have per-key custom query notifications on the Metadata table, writes to one key no longer wake listeners on other keys. So the metadata caching layer is less valuable than it was. Plan: let's merge this PR as-is to fix the leak across all 5 caches, and I'll remove the |
|
hmmm i (my claude code) just noticed something after 5s with zero subscribers the upstream now stops. A new subscriber arriving later:
this can be an issue if a function is calling one of those flows .first() which can be an issue for conversations users and other places where this is used |



Problem
Root Cause 1 — FlowCache: shareIn() internally launches a launchSharing coroutine in the provided scope. When WhileSubscribed sends STOP, the upstream collection is cancelled, but the launchSharing coroutine stays alive waiting for the next command. On cache eviction (remove()), only the flow reference was removed from the map — the coroutine in cacheScope lived on forever. Over time, zombie coroutines accumulated with each eviction cycle.
Root Cause 2 — observeSerializable(): Every call created a new SharedFlow in databaseScope with SharingStarted.Lazily, which starts on first subscriber and never stops. No caching meant repeated calls for the same key (e.g. appLockFlow(), isFileSharingEnabledFlow()) each spawned a permanent coroutine + SQLDelight query listener.
Impact
In our case, the accumulating zombie coroutines caused steadily increasing CPU load over time. After applying this fix, the rising CPU usage was eliminated and CPU load returned to expected levels.
Changes
FlowCache.kt
Each cached flow now gets its own SupervisorJob (child of cacheScope's job) and a derived sharingScope
remove() now cancels the associated sharingJob, killing the launchSharing coroutine
New sharingJobs map tracks jobs by key alongside the existing storage map
MetadataDAOImpl.kt
SharingStarted.Lazily → SharingStarted.WhileSubscribed(5_000L): upstream stops 5s after last subscriber leaves
databaseScope → child sharingScope with its own SupervisorJob
.onCompletion { sharingJob.cancel() }: when upstream completes after STOP, the child job is cancelled → launchSharing coroutine dies
Removed stale // TODO: Cache comment
FlowCacheTest.kt
New test givenMultipleCacheEvictions_thenSharingCoroutinesShouldNotAccumulate: repeats 5 get/evict cycles and asserts 0 remaining child coroutines
Review Request
Please review this PR carefully with regard to potential negative side effects — in particular around the new scope/job lifecycle management, cancellation behavior on cache eviction, and the change from Lazily to WhileSubscribed(5s).
Root Cause 2 might be also resolvable by applying the Flowcache here aswell.