
Cached artifact store #197

Draft
azimov wants to merge 10 commits into develop from cached-artifact-store

@azimov azimov commented May 7, 2026

This PR introduces a content-addressable caching system for runCmAnalyses() and restructures study population creation into two phases to maximize artifact reusability when adding new outcomes.

Key Changes

Content-Addressable Caching — Artifact filenames are now derived from SHA-256 hashes of all parameters that determine their content (including databaseId). This means:

  • Changing upstream settings naturally produces new filenames (no stale cache risk)
  • A new databaseId parameter is included in the checksums to prevent cross-database collisions. When running under Strategus, it will be based on Strategus's own hashing mechanism, further reducing collision risk
  • Unchanged settings reuse existing files automatically
  • No more interactive "delete old files?" prompt — the system is append-only (see issue #192, "CM v6: Silently delete old files when analysis specification changes?")

Two-Phase Study Population — createStudyPopulation() is split into two phases.

createStudyPopulation changes

  1. Base population (outcome-independent): risk window creation, censoring, minDaysAtRisk filtering. Shared across all outcomes with the same time-at-risk settings.
  2. Study population (per-outcome): prior outcome removal + outcome event counting. Cheap derivation from the cached base population.
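The two phases listed above can be sketched in a language-agnostic way. The following Python is only an illustration of the split, not the package's R implementation: the `Person` record, the function names, and the field names are all hypothetical, and the real filtering rules are more involved.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Person:
    """Hypothetical record: days are relative to the cohort index date."""
    person_id: int
    risk_start: int
    risk_end: int
    outcome_days: dict  # outcome_id -> list of event days


def create_base_population(cohort, min_days_at_risk):
    """Phase 1 (outcome-independent): apply the minDaysAtRisk-style filter.

    The result depends only on time-at-risk settings, so it can be cached
    and shared by every outcome that uses those settings.
    """
    return [
        p for p in cohort
        if p.risk_end - p.risk_start + 1 >= min_days_at_risk
    ]


def derive_study_population(base, outcome_id):
    """Phase 2 (per-outcome): remove prior outcomes, count events at risk."""
    result = []
    for p in base:
        days = p.outcome_days.get(outcome_id, [])
        if any(d < p.risk_start for d in days):
            continue  # person had the outcome before the risk window
        n_events = sum(p.risk_start <= d <= p.risk_end for d in days)
        result.append((p.person_id, n_events))
    return result


cohort = [
    Person(1, 1, 30, {10: [5], 20: [-3]}),
    Person(2, 1, 2, {10: [1]}),
    Person(3, 1, 60, {20: [40, 50]}),
]
base = create_base_population(cohort, min_days_at_risk=7)  # computed once
pop_outcome_10 = derive_study_population(base, 10)
pop_outcome_20 = derive_study_population(base, 20)  # reuses the same base
```

In this sketch, adding outcome 20 after outcome 10 only calls `derive_study_population()` again; `base` is untouched.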

This means adding a new outcome to an existing analysis only requires the lightweight per-outcome step — all expensive shared computation (data loading, base population, PS fitting, matching/stratification) is reused from cache.
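The reuse described here follows from how artifact names are derived. As a rough sketch, assuming the parameters are serialized canonically before hashing (the PR does not spell out the serialization, and the function name, filename layout, and `.rds` extension below are hypothetical), the idea looks like this in Python:

```python
import hashlib
import json


def artifact_filename(artifact_type, params, database_id, ext="rds"):
    """Derive a filename from a SHA-256 hash of everything that determines
    the artifact's content, including the database identifier."""
    payload = json.dumps(
        {"type": artifact_type, "databaseId": database_id, "params": params},
        sort_keys=True,           # key order must not affect the hash
        separators=(",", ":"),
    )
    digest = hashlib.sha256(payload.encode("utf-8")).hexdigest()
    return f"{artifact_type}_{digest}.{ext}"


# Base population settings contain no outcome ID, so its name is stable
# no matter how many outcomes are added later.
tar = {"riskWindowStart": 1, "riskWindowEnd": 30, "minDaysAtRisk": 7}
base_name = artifact_filename("basePop", tar, "db_a")

# Same settings in a different key order hash to the same name.
tar_reordered = {"minDaysAtRisk": 7, "riskWindowEnd": 30, "riskWindowStart": 1}
same_name = artifact_filename("basePop", tar_reordered, "db_a")

# Changing a setting or the database yields a new name, so nothing is overwritten.
changed_name = artifact_filename("basePop", {**tar, "minDaysAtRisk": 14}, "db_a")
other_db_name = artifact_filename("basePop", tar, "db_b")
```

Because a changed setting always maps to a new filename, old files never need to be deleted or overwritten, which is what makes the append-only behavior safe.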

Artifact stores

Pluggable ArtifactStore interface — New R6 abstract class with a LocalArtifactStore default implementation. This enables future custom storage backends (S3, shared filesystem, RDBMS blob storage). Such backends would allow multi-node execution of tasks and reuse of intermediate artifacts across nodes.
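To illustrate the shape of such an interface, here is a minimal Python analogue. The PR's classes are R6 classes and their actual methods are not listed in this description, so the method names (`exists`, `save`, `load`) and the `get_or_compute` helper are assumptions made for the sketch.

```python
import os
import tempfile
from abc import ABC, abstractmethod


class ArtifactStore(ABC):
    """Hypothetical abstract store: backends only need these three methods."""

    @abstractmethod
    def exists(self, key: str) -> bool: ...

    @abstractmethod
    def save(self, key: str, data: bytes) -> None: ...

    @abstractmethod
    def load(self, key: str) -> bytes: ...


class LocalArtifactStore(ArtifactStore):
    """Default backend in this sketch: one file per key in a local folder."""

    def __init__(self, root: str):
        self.root = root
        os.makedirs(root, exist_ok=True)

    def _path(self, key: str) -> str:
        return os.path.join(self.root, key)

    def exists(self, key: str) -> bool:
        return os.path.isfile(self._path(key))

    def save(self, key: str, data: bytes) -> None:
        if self.exists(key):
            return  # append-only: never overwrite an existing artifact
        with open(self._path(key), "wb") as f:
            f.write(data)

    def load(self, key: str) -> bytes:
        with open(self._path(key), "rb") as f:
            return f.read()


def get_or_compute(store: ArtifactStore, key: str, compute) -> bytes:
    """Return the cached artifact if present, otherwise compute and store it."""
    if store.exists(key):
        return store.load(key)
    data = compute()
    store.save(key, data)
    return data


calls = []

def expensive():
    calls.append(1)  # track how often the computation actually runs
    return b"propensity-model"

with tempfile.TemporaryDirectory() as folder:
    store = LocalArtifactStore(folder)
    first = get_or_compute(store, "ps_abc123.rds", expensive)
    second = get_or_compute(store, "ps_abc123.rds", expensive)  # cache hit
```

A remote backend would subclass `ArtifactStore` and leave `get_or_compute` unchanged, which is the point of keeping the interface small.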

codecov Bot commented May 7, 2026

Codecov Report

❌ Patch coverage is 88.03612% with 53 lines in your changes missing coverage. Please review.
✅ Project coverage is 94.06%. Comparing base (4f7dcea) to head (9a37a87).
⚠️ Report is 4 commits behind head on develop.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| R/RunAnalyses.R | 88.85% | 37 Missing ⚠️ |
| R/ArtifactStore.R | 74.28% | 9 Missing ⚠️ |
| R/StudyPopulation.R | 90.78% | 7 Missing ⚠️ |

Additional details and impacted files
@@             Coverage Diff             @@
##           develop     #197      +/-   ##
===========================================
- Coverage    94.25%   94.06%   -0.20%     
===========================================
  Files           22       23       +1     
  Lines         6531     6685     +154     
===========================================
+ Hits          6156     6288     +132     
- Misses         375      397      +22     
