
fix: prevent zombie SharedFlow coroutines in FlowCache and MetadataDAO [WPB-25125] #4087

Open

sbra0902 wants to merge 3 commits into wireapp:develop from sbra0902:develop

fix: prevent zombie SharedFlow coroutines in FlowCache and MetadataDAO [WPB-25125]#4087
sbra0902 wants to merge 3 commits into
wireapp:developfrom
sbra0902:develop

Conversation

@sbra0902

  • Fix memory/coroutine leak in FlowCache: each shareIn() now runs in a child SupervisorJob scope that is cancelled on cache eviction, preventing launchSharing coroutines from accumulating as zombies
  • Fix memory/coroutine leak in MetadataDAOImpl.observeSerializable(): replace SharingStarted.Lazily (never stops) with WhileSubscribed(5s) + child scope cleanup via onCompletion
  • Add regression test verifying that repeated cache evictions don't leave zombie coroutines behind

Problem
Root Cause 1 — FlowCache: shareIn() internally launches a launchSharing coroutine in the provided scope. When WhileSubscribed sends STOP, the upstream collection is cancelled, but the launchSharing coroutine stays alive waiting for the next command. On cache eviction (remove()), only the flow reference was removed from the map — the coroutine in cacheScope lived on forever. Over time, zombie coroutines accumulated with each eviction cycle.
Root Cause 2 — observeSerializable(): Every call created a new SharedFlow in databaseScope with SharingStarted.Lazily, which starts on first subscriber and never stops. No caching meant repeated calls for the same key (e.g. appLockFlow(), isFileSharingEnabledFlow()) each spawned a permanent coroutine + SQLDelight query listener.
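Root Cause 1 can be reproduced in isolation. A minimal sketch, assuming kotlinx.coroutines and simplified stand-in names (the map and `zombiesAfterEviction` are illustrative, not the real kalium code):

```kotlin
import kotlinx.coroutines.*
import kotlinx.coroutines.flow.*

// shareIn() launches an internal "launchSharing" coroutine directly in the
// scope it is given. Removing the map entry does not stop that coroutine.
fun zombiesAfterEviction(): Int = runBlocking {
    val cacheScope = CoroutineScope(SupervisorJob())
    val storage = mutableMapOf<String, Flow<Int>>()

    storage["key"] = flowOf(1)
        .shareIn(cacheScope, SharingStarted.WhileSubscribed(0), replay = 1)

    storage["key"]!!.first() // one subscriber comes and goes, so STOP is sent
    delay(100)

    storage.remove("key")    // old eviction: drops only the flow reference
    delay(100)

    // The launchSharing coroutine is still attached to cacheScope's job.
    val zombies = cacheScope.coroutineContext[Job]!!.children.count()
    cacheScope.cancel()
    zombies
}

fun main() {
    println(zombiesAfterEviction())
}
```

Each eviction cycle like this leaves one such coroutine behind in the long-lived scope.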

Impact
In our case, the accumulating zombie coroutines caused steadily increasing CPU load over time. After applying this fix, the rising CPU usage was eliminated and CPU load returned to expected levels.

Changes
FlowCache.kt

Each cached flow now gets its own SupervisorJob (child of cacheScope's job) and a derived sharingScope
remove() now cancels the associated sharingJob, killing the launchSharing coroutine
New sharingJobs map tracks jobs by key alongside the existing storage map
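A hedged sketch of the fixed pattern, with simplified signatures (the real FlowCache in kalium has more API surface; `get`/`remove`/`sharingJobs` follow the names above, the rest is assumed):

```kotlin
import kotlinx.coroutines.*
import kotlinx.coroutines.flow.*
import kotlinx.coroutines.sync.Mutex
import kotlinx.coroutines.sync.withLock

// Each entry gets its own child SupervisorJob so remove() can cancel
// that entry's internal launchSharing coroutine.
class FlowCache<K : Any, V>(private val cacheScope: CoroutineScope) {
    private val storage = mutableMapOf<K, Flow<V>>()
    private val sharingJobs = mutableMapOf<K, Job>()
    private val mutex = Mutex()

    suspend fun get(key: K, upstream: () -> Flow<V>): Flow<V> = mutex.withLock {
        storage.getOrPut(key) {
            val sharingJob = SupervisorJob(cacheScope.coroutineContext[Job])
            sharingJobs[key] = sharingJob
            upstream().shareIn(
                CoroutineScope(cacheScope.coroutineContext + sharingJob),
                SharingStarted.WhileSubscribed(stopTimeoutMillis = 5_000L),
                replay = 1,
            )
        }
    }

    suspend fun remove(key: K) = mutex.withLock {
        storage.remove(key)
        sharingJobs.remove(key)?.cancel() // kills the entry's launchSharing coroutine
    }
}

// Demo: after eviction, no coroutine remains attached to the cache scope.
fun childrenAfterRemove(): Int = runBlocking {
    val scope = CoroutineScope(SupervisorJob())
    val cache = FlowCache<String, Int>(scope)
    check(cache.get("k") { flowOf(42) }.first() == 42)
    cache.remove("k")
    delay(100)
    val remaining = scope.coroutineContext[Job]!!.children.count()
    scope.cancel()
    remaining
}

fun main() {
    println(childrenAfterRemove())
}
```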

MetadataDAOImpl.kt

SharingStarted.Lazily → SharingStarted.WhileSubscribed(5_000L): upstream stops 5s after last subscriber leaves
databaseScope → child sharingScope with its own SupervisorJob
.onCompletion { sharingJob.cancel() }: when upstream completes after STOP, the child job is cancelled → launchSharing coroutine dies
Removed stale // TODO: Cache comment
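The combined effect of these changes can be sketched as follows; this is a simplified stand-in (the real implementation reads from SQLDelight, hardcodes the 5s timeout, and uses different names — the `stopTimeoutMillis` parameter here exists only to make the sketch testable):

```kotlin
import kotlinx.coroutines.*
import kotlinx.coroutines.flow.*

fun <T> observeShared(
    upstream: Flow<T>,
    databaseScope: CoroutineScope,
    stopTimeoutMillis: Long = 5_000L,
): Flow<T> {
    // Child job so this SharedFlow's launchSharing coroutine can die on its own.
    val sharingJob = SupervisorJob(databaseScope.coroutineContext[Job])
    val sharingScope = CoroutineScope(databaseScope.coroutineContext + sharingJob)
    return upstream
        // Runs when the upstream collection ends (e.g. cancelled by STOP);
        // cancelling the child job terminates the launchSharing coroutine too.
        .onCompletion { sharingJob.cancel() }
        .shareIn(sharingScope, SharingStarted.WhileSubscribed(stopTimeoutMillis), replay = 1)
}

// Demo: once the upstream has completed, nothing stays attached to databaseScope.
fun zombieCountAfterStop(): Int = runBlocking {
    val databaseScope = CoroutineScope(SupervisorJob())
    val shared = observeShared(flowOf("value"), databaseScope, stopTimeoutMillis = 0L)
    check(shared.first() == "value")
    delay(200)
    val count = databaseScope.coroutineContext[Job]!!.children.count()
    databaseScope.cancel()
    count
}

fun main() {
    println(zombieCountAfterStop())
}
```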

FlowCacheTest.kt

New test givenMultipleCacheEvictions_thenSharingCoroutinesShouldNotAccumulate: repeats 5 get/evict cycles and asserts 0 remaining child coroutines
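The idea behind that test can be sketched self-contained, without the real FlowCache (names and the per-entry job scope below are stand-ins; the actual test in FlowCacheTest.kt exercises the real class):

```kotlin
import kotlinx.coroutines.*
import kotlinx.coroutines.flow.*

// After N get/evict cycles there must be no leftover sharing coroutines
// attached to the cache scope's job.
fun remainingChildrenAfterEvictions(cycles: Int): Int = runBlocking {
    val cacheScope = CoroutineScope(SupervisorJob())
    repeat(cycles) { i ->
        // "get": share an upstream in a per-entry child job scope
        val sharingJob = SupervisorJob(cacheScope.coroutineContext[Job])
        val shared = flowOf(i).shareIn(
            CoroutineScope(cacheScope.coroutineContext + sharingJob),
            SharingStarted.WhileSubscribed(5_000L),
            replay = 1,
        )
        shared.first()
        sharingJob.cancelAndJoin() // "evict": cancel the entry's job
    }
    val remaining = cacheScope.coroutineContext[Job]!!.children.count()
    cacheScope.cancel()
    remaining
}

fun main() {
    println(remainingChildrenAfterEvictions(5))
}
```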

Review Request
Please review this PR carefully with regard to potential negative side effects — in particular around the new scope/job lifecycle management, cancellation behavior on cache eviction, and the change from Lazily to WhileSubscribed(5s).
Root Cause 2 might also be resolvable by applying the FlowCache here as well.

@CLAassistant

CLAassistant commented Apr 28, 2026

CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you all sign our Contributor License Agreement before we can accept your contribution.
1 out of 2 committers have signed the CLA.

✅ sbra0902
❌ yamilmedina
You have signed the CLA already but the status is still pending? Let us recheck it.

@MohamadJaara
Member

Hi, thanks for the contribution, but we enforce signed commits on this repo, and it can be merged only with signed commits. Can you please enable it and rebase your branch, and we can get it merged right after?

@MohamadJaara MohamadJaara changed the title fix: prevent zombie SharedFlow coroutines in FlowCache and MetadataDAO fix: prevent zombie SharedFlow coroutines in FlowCache and MetadataDAO [WPB-25125] Apr 28, 2026
@codecov-commenter

codecov-commenter commented Apr 28, 2026

Codecov Report

❌ Patch coverage is 94.44444% with 1 line in your changes missing coverage. Please review.
✅ Project coverage is 61.55%. Comparing base (9eca021) to head (2899763).

Files with missing lines Patch % Lines
...lin/com/wire/kalium/persistence/cache/FlowCache.kt 92.85% 0 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             develop    #4087   +/-   ##
==========================================
  Coverage      61.55%   61.55%           
  Complexity      4021     4021           
==========================================
  Files           2067     2067           
  Lines          67423    67432    +9     
  Branches        6650     6651    +1     
==========================================
+ Hits           41500    41510   +10     
  Misses         23276    23276           
+ Partials        2647     2646    -1     
Files with missing lines Coverage Δ
...com/wire/kalium/persistence/dao/MetadataDAOImpl.kt 86.48% <100.00%> (+1.19%) ⬆️
...lin/com/wire/kalium/persistence/cache/FlowCache.kt 96.66% <92.85%> (-3.34%) ⬇️

... and 2 files with indirect coverage changes



Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 9eca021...2899763.


@MohamadJaara
Member

MohamadJaara commented Apr 28, 2026

Also, regarding the SQLDelight listener: this has been on our radar for a while, and we have a POC fix that should improve things not only for metadata but for messages and conversation lists as well.

@MohamadJaara
Member

You can try the branch from this draft PR #4011 to see if it also fixes the other part of the issue.

@sbra0902
Author

> Hi, thanks for the contribution, but we enforce signed commits on this repo, and it can be merged only with signed commits. Can you please enable it and rebase your branch, and we can get it merged right after?

done

@sbra0902
Author

> you can try the branch from this draft PR #4011 to see if this also fixes the other part of the issue

Nice, I suppose that will address the unnecessary emits <3

But I think it won't fix the root cause for the zombies.

@MohamadJaara
Member

Sorry that it took me some time to get to the review.

Worth noting that FlowCache is used in 5 places, not just metadata:

  • userCache (UserDAO)
  • conversationCache + conversationDetailsCache (ConversationDAO)
  • membersCache (MemberDAO)
  • metadataCache (MetadataDAO)

The user/conversation/member ones are keyed by IDs, so keys churn a lot as you navigate around the app; every eviction leaked a zombie coroutine into databaseScope (which lives for the whole session). That's probably the main source of the CPU creep, more than metadata.

Also, after the recently merged SQLDelight fork update (#4135) we now have per-key custom query notifications on the Metadata table, so writes to one key no longer wake listeners on other keys. This makes the metadata caching layer less valuable than it was.

Plan: let's merge this PR as-is to fix the leak across all 5 caches, and I'll remove the metadataCache in a follow-up since custom notifications make it mostly redundant there. The FlowCache fix stays useful for user/conversation/member which still benefit from shared upstreams.

@MohamadJaara
Member

Hmmm, I (well, my Claude Code) just noticed something: after 5s with zero subscribers the upstream now stops. A new subscriber arriving later:

  • gets the replay=1 cached value first (potentially stale by N minutes),
  • then gets a fresh DB read.

This can be an issue if a function calls .first() on one of these flows, which matters for conversations, users, and other places where this is used.
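The staleness concern can be demonstrated in a few lines; a sketch with stand-in names and a 0ms stop timeout instead of 5s so it runs quickly (`dbValue` stands in for the real DB row):

```kotlin
import kotlinx.coroutines.*
import kotlinx.coroutines.flow.*

// Once the upstream has stopped, a late subscriber's first() is served
// from the replay cache, not from a fresh read.
fun firstAndLateRead(): Pair<Int, Int> = runBlocking {
    val scope = CoroutineScope(SupervisorJob())
    var dbValue = 1 // stands in for the real DB row
    val shared = flow { emit(dbValue) }
        .shareIn(scope, SharingStarted.WhileSubscribed(stopTimeoutMillis = 0L), replay = 1)

    val fresh = shared.first() // triggers a real upstream read
    delay(100)                 // zero subscribers, upstream stops
    dbValue = 2                // underlying data changes in the meantime
    val late = shared.first()  // replayed cached value wins the race
    scope.cancel()
    fresh to late
}

fun main() {
    println(firstAndLateRead())
}
```

Both reads return the original value even though the underlying data changed, which is exactly the `.first()` hazard described above.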
