
fix: prevent zombie SharedFlow coroutines in FlowCache and MetadataDAO [WPB-25125] #4087

Open

sbra0902 wants to merge 3 commits into wireapp:develop from sbra0902:develop

fix: prevent zombie SharedFlow coroutines in FlowCache and MetadataDAO [WPB-25125]#4087
sbra0902 wants to merge 3 commits into
wireapp:developfrom
sbra0902:develop

Conversation

@sbra0902

  • Fix memory/coroutine leak in FlowCache: each shareIn() now runs in a child SupervisorJob scope that is cancelled on cache eviction, preventing launchSharing coroutines from accumulating as zombies
  • Fix memory/coroutine leak in MetadataDAOImpl.observeSerializable(): replace SharingStarted.Lazily (never stops) with WhileSubscribed(5s) + child scope cleanup via onCompletion
  • Add regression test verifying that repeated cache evictions don't leave zombie coroutines behind

Problem
Root Cause 1 — FlowCache: shareIn() internally launches a launchSharing coroutine in the provided scope. When WhileSubscribed sends STOP, the upstream collection is cancelled, but the launchSharing coroutine stays alive waiting for the next command. On cache eviction (remove()), only the flow reference was removed from the map — the coroutine in cacheScope lived on forever. Over time, zombie coroutines accumulated with each eviction cycle.
Root Cause 2 — observeSerializable(): Every call created a new SharedFlow in databaseScope with SharingStarted.Lazily, which starts on first subscriber and never stops. No caching meant repeated calls for the same key (e.g. appLockFlow(), isFileSharingEnabledFlow()) each spawned a permanent coroutine + SQLDelight query listener.
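Root Cause 1 can be reproduced in isolation. A minimal sketch, assuming kotlinx.coroutines and simplified stand-in names (the map and `zombiesAfterEviction` are illustrative, not the real kalium code):

```kotlin
import kotlinx.coroutines.*
import kotlinx.coroutines.flow.*

// shareIn() launches an internal "launchSharing" coroutine directly in the
// scope it is given. Removing the map entry does not stop that coroutine.
fun zombiesAfterEviction(): Int = runBlocking {
    val cacheScope = CoroutineScope(SupervisorJob())
    val storage = mutableMapOf<String, Flow<Int>>()

    storage["key"] = flowOf(1)
        .shareIn(cacheScope, SharingStarted.WhileSubscribed(0), replay = 1)

    storage["key"]!!.first() // one subscriber comes and goes, so STOP is sent
    delay(100)

    storage.remove("key")    // old eviction: drops only the flow reference
    delay(100)

    // The launchSharing coroutine is still attached to cacheScope's job.
    val zombies = cacheScope.coroutineContext[Job]!!.children.count()
    cacheScope.cancel()
    zombies
}

fun main() {
    println(zombiesAfterEviction())
}
```

Each eviction cycle like this leaves one such coroutine behind in the long-lived scope.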

Impact
In our case, the accumulating zombie coroutines caused steadily increasing CPU load over time. After applying this fix, the rising CPU usage was eliminated and CPU load returned to expected levels.

Changes
FlowCache.kt

Each cached flow now gets its own SupervisorJob (child of cacheScope's job) and a derived sharingScope
remove() now cancels the associated sharingJob, killing the launchSharing coroutine
New sharingJobs map tracks jobs by key alongside the existing storage map
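A hedged sketch of the fixed pattern, with simplified signatures (the real FlowCache in kalium has more API surface; `get`/`remove`/`sharingJobs` follow the names above, the rest is assumed):

```kotlin
import kotlinx.coroutines.*
import kotlinx.coroutines.flow.*
import kotlinx.coroutines.sync.Mutex
import kotlinx.coroutines.sync.withLock

// Each entry gets its own child SupervisorJob so remove() can cancel
// that entry's internal launchSharing coroutine.
class FlowCache<K : Any, V>(private val cacheScope: CoroutineScope) {
    private val storage = mutableMapOf<K, Flow<V>>()
    private val sharingJobs = mutableMapOf<K, Job>()
    private val mutex = Mutex()

    suspend fun get(key: K, upstream: () -> Flow<V>): Flow<V> = mutex.withLock {
        storage.getOrPut(key) {
            val sharingJob = SupervisorJob(cacheScope.coroutineContext[Job])
            sharingJobs[key] = sharingJob
            upstream().shareIn(
                CoroutineScope(cacheScope.coroutineContext + sharingJob),
                SharingStarted.WhileSubscribed(stopTimeoutMillis = 5_000L),
                replay = 1,
            )
        }
    }

    suspend fun remove(key: K) = mutex.withLock {
        storage.remove(key)
        sharingJobs.remove(key)?.cancel() // kills the entry's launchSharing coroutine
    }
}

// Demo: after eviction, no coroutine remains attached to the cache scope.
fun childrenAfterRemove(): Int = runBlocking {
    val scope = CoroutineScope(SupervisorJob())
    val cache = FlowCache<String, Int>(scope)
    check(cache.get("k") { flowOf(42) }.first() == 42)
    cache.remove("k")
    delay(100)
    val remaining = scope.coroutineContext[Job]!!.children.count()
    scope.cancel()
    remaining
}

fun main() {
    println(childrenAfterRemove())
}
```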

MetadataDAOImpl.kt

SharingStarted.Lazily → SharingStarted.WhileSubscribed(5_000L): upstream stops 5s after last subscriber leaves
databaseScope → child sharingScope with its own SupervisorJob
.onCompletion { sharingJob.cancel() }: when upstream completes after STOP, the child job is cancelled → launchSharing coroutine dies
Removed stale // TODO: Cache comment
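The combined effect of these changes can be sketched as follows; this is a simplified stand-in (the real implementation reads from SQLDelight, hardcodes the 5s timeout, and uses different names — the `stopTimeoutMillis` parameter here exists only to make the sketch testable):

```kotlin
import kotlinx.coroutines.*
import kotlinx.coroutines.flow.*

fun <T> observeShared(
    upstream: Flow<T>,
    databaseScope: CoroutineScope,
    stopTimeoutMillis: Long = 5_000L,
): Flow<T> {
    // Child job so this SharedFlow's launchSharing coroutine can die on its own.
    val sharingJob = SupervisorJob(databaseScope.coroutineContext[Job])
    val sharingScope = CoroutineScope(databaseScope.coroutineContext + sharingJob)
    return upstream
        // Runs when the upstream collection ends (e.g. cancelled by STOP);
        // cancelling the child job terminates the launchSharing coroutine too.
        .onCompletion { sharingJob.cancel() }
        .shareIn(sharingScope, SharingStarted.WhileSubscribed(stopTimeoutMillis), replay = 1)
}

// Demo: once the upstream has completed, nothing stays attached to databaseScope.
fun zombieCountAfterStop(): Int = runBlocking {
    val databaseScope = CoroutineScope(SupervisorJob())
    val shared = observeShared(flowOf("value"), databaseScope, stopTimeoutMillis = 0L)
    check(shared.first() == "value")
    delay(200)
    val count = databaseScope.coroutineContext[Job]!!.children.count()
    databaseScope.cancel()
    count
}

fun main() {
    println(zombieCountAfterStop())
}
```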

FlowCacheTest.kt

New test givenMultipleCacheEvictions_thenSharingCoroutinesShouldNotAccumulate: repeats 5 get/evict cycles and asserts 0 remaining child coroutines
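The idea behind that test can be sketched self-contained, without the real FlowCache (names and the per-entry job scope below are stand-ins; the actual test in FlowCacheTest.kt exercises the real class):

```kotlin
import kotlinx.coroutines.*
import kotlinx.coroutines.flow.*

// After N get/evict cycles there must be no leftover sharing coroutines
// attached to the cache scope's job.
fun remainingChildrenAfterEvictions(cycles: Int): Int = runBlocking {
    val cacheScope = CoroutineScope(SupervisorJob())
    repeat(cycles) { i ->
        // "get": share an upstream in a per-entry child job scope
        val sharingJob = SupervisorJob(cacheScope.coroutineContext[Job])
        val shared = flowOf(i).shareIn(
            CoroutineScope(cacheScope.coroutineContext + sharingJob),
            SharingStarted.WhileSubscribed(5_000L),
            replay = 1,
        )
        shared.first()
        sharingJob.cancelAndJoin() // "evict": cancel the entry's job
    }
    val remaining = cacheScope.coroutineContext[Job]!!.children.count()
    cacheScope.cancel()
    remaining
}

fun main() {
    println(remainingChildrenAfterEvictions(5))
}
```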

Review Request
Please review this PR carefully with regard to potential negative side effects — in particular around the new scope/job lifecycle management, cancellation behavior on cache eviction, and the change from Lazily to WhileSubscribed(5s).
Root Cause 2 might also be resolvable by applying the FlowCache here as well.

@CLAassistant

CLAassistant commented Apr 28, 2026

CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you all sign our Contributor License Agreement before we can accept your contribution.
1 out of 2 committers have signed the CLA.

✅ sbra0902
❌ yamilmedina
You have signed the CLA already but the status is still pending? Let us recheck it.

@MohamadJaara
Member

Hi, thanks for the contribution, but we enforce signed commits on this repo, and it can be merged only with signed commits. Can you please enable it and rebase your branch, and we can get it merged right after?

@MohamadJaara MohamadJaara changed the title fix: prevent zombie SharedFlow coroutines in FlowCache and MetadataDAO fix: prevent zombie SharedFlow coroutines in FlowCache and MetadataDAO [WPB-25125] Apr 28, 2026
@codecov-commenter

codecov-commenter commented Apr 28, 2026

Codecov Report

❌ Patch coverage is 94.44444% with 1 line in your changes missing coverage. Please review.
✅ Project coverage is 61.55%. Comparing base (9eca021) to head (2899763).

Files with missing lines Patch % Lines
...lin/com/wire/kalium/persistence/cache/FlowCache.kt 92.85% 0 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             develop    #4087   +/-   ##
==========================================
  Coverage      61.55%   61.55%           
  Complexity      4021     4021           
==========================================
  Files           2067     2067           
  Lines          67423    67432    +9     
  Branches        6650     6651    +1     
==========================================
+ Hits           41500    41510   +10     
  Misses         23276    23276           
+ Partials        2647     2646    -1     
Files with missing lines Coverage Δ
...com/wire/kalium/persistence/dao/MetadataDAOImpl.kt 86.48% <100.00%> (+1.19%) ⬆️
...lin/com/wire/kalium/persistence/cache/FlowCache.kt 96.66% <92.85%> (-3.34%) ⬇️

... and 2 files with indirect coverage changes



Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 9eca021...2899763.


@MohamadJaara
Member

MohamadJaara commented Apr 28, 2026

Also, regarding the SQLDelight listener: this has been on our radar for a while, and we have a POC fix that should improve things not only for metadata but for messages and conversation lists as well.

@MohamadJaara
Member

You can try the branch from this draft PR #4011 to see if it also fixes the other part of the issue.

@sbra0902
Author

> Hi, thanks for the contribution, but we enforce signed commits on this repo, and it can be merged only with signed commits. Can you please enable it and rebase your branch, and we can get it merged right after?

done

@sbra0902
Author

> you can try the branch from this draft PR #4011 to see if this also fixes the other part of the issue

Nice, I suppose that will address the unnecessary emits <3

But I think it won't fix the root cause for the zombies.

@MohamadJaara
Member

Sorry that it took me some time to get to the review.

Worth noting that FlowCache is used in 5 places, not just metadata:

  • userCache (UserDAO)
  • conversationCache + conversationDetailsCache (ConversationDAO)
  • membersCache (MemberDAO)
  • metadataCache (MetadataDAO)

The user/conversation/member ones are keyed by IDs, so keys churn a lot as you navigate around the app; every eviction leaked a zombie coroutine into databaseScope (which lives for the whole session). That's probably the main source of the CPU creep, more than metadata.

Also, after the recently merged SQLDelight fork update (#4135) we now have per-key custom query notifications on the Metadata table, so writes to one key no longer wake listeners on other keys. This makes the metadata caching layer less valuable than it was.

Plan: let's merge this PR as-is to fix the leak across all 5 caches, and I'll remove the metadataCache in a follow-up since custom notifications make it mostly redundant there. The FlowCache fix stays useful for user/conversation/member which still benefit from shared upstreams.

@MohamadJaara
Member

Hmmm, I (well, my Claude Code) just noticed something: after 5s with zero subscribers the upstream now stops. A new subscriber arriving later:

  • gets the replay=1 cached value first (potentially stale by N minutes),
  • then gets a fresh DB read.

This can be an issue if a function calls .first() on one of these flows, which matters for conversations, users, and other places where this is used.
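The staleness concern can be demonstrated in a few lines; a sketch with stand-in names and a 0ms stop timeout instead of 5s so it runs quickly (`dbValue` stands in for the real DB row):

```kotlin
import kotlinx.coroutines.*
import kotlinx.coroutines.flow.*

// Once the upstream has stopped, a late subscriber's first() is served
// from the replay cache, not from a fresh read.
fun firstAndLateRead(): Pair<Int, Int> = runBlocking {
    val scope = CoroutineScope(SupervisorJob())
    var dbValue = 1 // stands in for the real DB row
    val shared = flow { emit(dbValue) }
        .shareIn(scope, SharingStarted.WhileSubscribed(stopTimeoutMillis = 0L), replay = 1)

    val fresh = shared.first() // triggers a real upstream read
    delay(100)                 // zero subscribers, upstream stops
    dbValue = 2                // underlying data changes in the meantime
    val late = shared.first()  // replayed cached value wins the race
    scope.cancel()
    fresh to late
}

fun main() {
    println(firstAndLateRead())
}
```

Both reads return the original value even though the underlying data changed, which is exactly the `.first()` hazard described above.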
