From a9a757e040ea0588f5a00b5875f5f2d0e8cfa7df Mon Sep 17 00:00:00 2001
From: Luca Pinello
Date: Thu, 30 Apr 2026 08:46:21 -0400
Subject: [PATCH 1/4] fix(cdf-rebuild): protect interim NPZ + honour
 CUDA_VISIBLE_DEVICES; streamline README top
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Three changes from the 0.4.0 follow-up triage. Single PR because the
HANDOFF.md tweak depends on both script fixes.

## Bug A (P1, #71/#73): refuse to overwrite the interim NPZ

`scripts/build_backgrounds_chrombpnet.py` was writing the interim NPZ
unconditionally with `np.savez_compressed(interim_path, ...)`. The
documented two-pass flow (`--assay ATAC_DNASE` then `--assay CHIP`)
silently overwrote the first pass's 42-track interim with the second
pass's 744 CHIP tracks, producing a 744-track final NPZ instead of the
expected 786 β€” caught by a post-merge spot-check during PR #70
(~50 min of GPU time to recover).

Conservative fix per the user's pick: refuse to overwrite without
`--force`, naming the conflicting track-id sets in the SystemExit
message. The new `_check_interim_compatibility()` helper runs before
each interim write site and:

- returns silently if the path doesn't exist (first run);
- returns silently if the existing track set equals the new set
  (idempotent re-run with no data loss);
- raises SystemExit with a diff naming `len(only_existing)`,
  `len(only_new)`, and 3 example track-ids from each side, plus
  pointers to `--part merge` / `merge-incremental` / `--force`;
- returns silently with `--force`.

## Bug B (P2, #72/#74): honour pre-set CUDA_VISIBLE_DEVICES

`load_models_and_setup()` was clobbering
`os.environ["CUDA_VISIBLE_DEVICES"]` with `--gpu N` unconditionally, so
the documented parallel-launch pattern (an outer `CUDA_VISIBLE_DEVICES=N`
per terminal pinning the physical GPU) didn't work β€” both terminals
landed on physical GPU 0, fighting for memory until one OOMed.
Trivial pattern per the user's pick: only set the env var when nothing
was already set:

    if "CUDA_VISIBLE_DEVICES" not in os.environ:
        os.environ["CUDA_VISIBLE_DEVICES"] = str(args.gpu)

Plus a one-line `--help` clarification on `--gpu`: "No-op if
CUDA_VISIBLE_DEVICES is already set in the calling shell."

## README streamlining

User feedback: the first two sections of "πŸš€ Get running in one lunch
break" were too dense for a first-timer. Specifically the ~150-word
Prerequisite paragraph after Step 1 (Miniforge, ~28 GB, both AlphaGenome
backends, ChromBPNet streaming, lazy downloads, --all-chrombpnet, ENCODE
fallback) and the 5-line Step 2 blockquote duplicating backend detail.

Restructure β€” the top section is now just: a four-step intro, a "Before
you start" mini-section with three bullets (Miniforge link, ~28 GB,
platforms), and Steps 1-4, each terse and copy-pasteable. The detail
moves to existing deeper sections: the Step 2 blockquote was already
duplicated in the deeper "Two AlphaGenome backends" subsection (same
content, more detail), so it simply drops; the Prerequisite paragraph's
disk-usage detail moves into a new `#### Disk usage breakdown`
subsection at the top of `Installation β€” detailed`. Anchor links
(`#disk-usage-breakdown`, `#two-alphagenome-backends`,
`#where-the-oracle-weights-come-from`) all resolve.

## HANDOFF.md

Adds two short notes so future maintainers don't trip over the new
behaviour: a callout between Phase 1 and Phase 2 reminding to run
`--part merge` (or pass `--force`) before re-running with a different
`--assay`; and a one-line note in the parallel-launch section explaining
the `CUDA_VISIBLE_DEVICES` precedence.

## Tests

`pytest -m "not integration and not slow"` β†’ 368 passed, 1 skipped,
5 deselected β€” same count as main, no regression.
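For reference, the Bug A guard is small enough to sketch as a pure function. A minimal sketch only β€” names, signature, and message text are illustrative, not the exact implementation (the real helper also distinguishes unreadable NPZs and takes a `label` for the message):

```python
import os
import numpy as np

def check_interim_compatibility(interim_path, new_track_ids, force=False):
    """Refuse to overwrite an interim NPZ whose track-id set differs
    from the current run, unless force is set."""
    # First run, or explicit --force: write freely.
    if force or not os.path.exists(interim_path):
        return
    existing = set(np.load(interim_path)["track_ids"].astype(str))
    new = set(new_track_ids)
    if existing == new:
        return  # idempotent re-run: overwriting loses nothing
    only_existing = sorted(existing - new)
    only_new = sorted(new - existing)
    raise SystemExit(
        f"Refusing to overwrite {interim_path}: "
        f"{len(only_existing)} track(s) only in existing (e.g. {only_existing[:3]}), "
        f"{len(only_new)} only in new (e.g. {only_new[:3]}); "
        "run --part merge first, or pass --force."
    )
```

Because the branches are driven entirely by the track-id sets, each one (first run, idempotent re-run, conflicting sets, forced overwrite) can be hit with a tiny NPZ in a temp dir.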
Bug A's SystemExit-vs-overwrite branches live in a pure helper, so they
are cheap to exercise; a smoke test that builds two tiny interims and
asserts on the diff message is left as a follow-up (it would need
conda-env infra in CI to actually run the script β€” out of scope for
this PR).

Closes #71, #72, #73, #74.

Co-Authored-By: Claude Opus 4.7 (1M context)
---
 README.md                               | 25 +++++-
 .../HANDOFF.md                          | 16 +++-
 scripts/build_backgrounds_chrombpnet.py | 67 ++++++++++++++++++-
 3 files changed, 98 insertions(+), 10 deletions(-)

diff --git a/README.md b/README.md
index f0da9cd..848558b 100644
--- a/README.md
+++ b/README.md
@@ -13,6 +13,12 @@
 Four steps. Steps 1 + 2 are copy-paste. Step 3 is a runnable snippet. Step 4 hooks chorus up to Claude Code.
 
+**Before you start** β€” three things you need:
+
+- **Miniforge** (provides `mamba`) from 
+- **~28 GB free disk** β€” see [Disk usage breakdown](#disk-usage-breakdown) if you want the per-oracle / per-asset numbers
+- **Linux x86_64 or macOS** (Intel / Apple Silicon)
+
 ### 1. Install (5 minutes)
 
 ```bash
@@ -22,12 +28,8 @@
 mamba activate chorus
 python -m pip install -e .
 ```
 
-Prerequisite: **Miniforge** (provides `mamba`) from , plus **~28 GB free disk** for the default install (6 oracles + a second AlphaGenome backend for Mac MPS speed; hg38; per-oracle CDF backgrounds). Both AlphaGenome backends β€” `alphagenome` (JAX, default) and `alphagenome_pt` (PyTorch, same model + same weights, converted to safetensors) β€” install by default so Mac users automatically get the MPS-accelerated path; see [Two AlphaGenome backends](#two-alphagenome-backends). ChromBPNet/BPNet weights stream from a slim HuggingFace mirror β€” only ~50 MB pre-cached for the K562 + HepG2 DNase fast path used by every shipped notebook. Works on Linux x86_64 and macOS (Intel / Apple Silicon). A single oracle env is ~3 GB on average.
-Opting in to `chorus setup --all-chrombpnet` pre-caches every fold-0 bias-corrected ChromBPNet model from the slim mirror (~1.5 GB additional, ~5 min). Other ChromBPNet cell types download lazily on first `load_pretrained_model(...)` regardless. If you specifically need the full bias-aware `chrombpnet` variant or fold β‰  0, chorus falls back to the original ENCODE tarball for that specific model (~1.8 GB on disk per model).
-
 ### 2. Download all 6 oracles + hg38 + backgrounds (~55–75 min, unattended)
 
-> Both AlphaGenome backends β€” JAX (`alphagenome`, gated) and PyTorch (`alphagenome_pt`, public mirror of the same weights converted to safetensors) β€” install by default. The PyTorch backend adds ~2.6 GB of disk and ~10–13 min to setup; in exchange Mac users get the MPS-accelerated path automatically (5–8Γ— faster than JAX CPU at ≀600 kb windows). Linux/CUDA users will likely use the JAX backend in practice (it's faster on A100), but the PyTorch backend is still useful for portability.
-
 ```bash
 chorus setup
 ```
@@ -194,6 +196,21 @@
 The TLDR's `chorus setup` does everything you need. This section covers the edge
 
 > **Two env files, one source of truth.** The root `environment.yml` is what you install. The per-oracle files in `environments/` are consumed internally by `chorus setup --oracle ` β€” you don't install them directly.
+#### Disk usage breakdown
+
+The default `chorus setup` (all 6 oracles, both AlphaGenome backends, hg38, all CDF backgrounds) lands at **~28 GB**:
+
+| Bucket | Size |
+|---|---|
+| 6 oracle conda envs (~3 GB each) | ~18 GB |
+| `hg38` reference fasta + index | ~3 GB |
+| Per-oracle CDF backgrounds (`~/.chorus/backgrounds/`) | ~2 GB |
+| AlphaGenome PyTorch backend (`alphagenome_pt`, default-on so Mac users get MPS speed) | ~2.6 GB |
+| ChromBPNet slim HuggingFace mirror β€” fast-path pre-cache (K562 + HepG2 DNase) | ~50 MB |
+| **Total default** | **~28 GB** |
+
+Opting in via `chorus setup --all-chrombpnet` pre-caches every fold-0 bias-corrected ChromBPNet model from the slim mirror (+~1.5 GB, ~5 min). Other ChromBPNet cell types download lazily on first `load_pretrained_model(...)` regardless. If you specifically need the full bias-aware `chrombpnet` variant or fold β‰  0, chorus falls back to the original ENCODE tarball for that specific model (+~1.8 GB on disk per model). See [Where the oracle weights come from](#where-the-oracle-weights-come-from) for the full mirror map.
+
 #### Upgrading
 
 After the first install, to upgrade cleanly:
diff --git a/audits/2026-04-29_chrombpnet_cdf_rebuild/HANDOFF.md b/audits/2026-04-29_chrombpnet_cdf_rebuild/HANDOFF.md
index ad30dd2..fe7bacb 100644
--- a/audits/2026-04-29_chrombpnet_cdf_rebuild/HANDOFF.md
+++ b/audits/2026-04-29_chrombpnet_cdf_rebuild/HANDOFF.md
@@ -92,6 +92,12 @@ mamba run -n chorus-chrombpnet python scripts/build_backgrounds_chrombpnet.py \
     --model-type chrombpnet_nobias \
     2>&1 | tee logs/bg_chrombpnet_baselines_atac.log
 
+# Phase 1 finished: run `--part merge` (or `--part merge-incremental`) NOW
+# to consume the ATAC/DNASE interim NPZ before Phase 2 writes a new one.
+# Otherwise Phase 2 will refuse to overwrite (since v0.4.x's --force-gated
+# safety check, fixing #71/#73). Pass `--force` if you intentionally want
+# to discard the Phase 1 interim and rebuild Phase 2 in isolation.
+
 # === Phase 2: CHIP/BPNet models (1259 models, ~10-20 h on A100) ===
 mamba run -n chorus-chrombpnet python scripts/build_backgrounds_chrombpnet.py \
     --part variants --assay CHIP --gpu 0 \
@@ -114,17 +120,19 @@
 If you have 2 GPUs available, parallelize the CHIP phase across them:
 
 ```bash
 # Terminal 1 (GPU 0):
-... --part variants --assay CHIP --gpu 0 --shard 0 --shard-of 2 ...
-... --part baselines --assay CHIP --gpu 0 --shard 0 --shard-of 2 ...
+CUDA_VISIBLE_DEVICES=0 ... --part variants --assay CHIP --gpu 0 --shard 0 --shard-of 2 ...
+CUDA_VISIBLE_DEVICES=0 ... --part baselines --assay CHIP --gpu 0 --shard 0 --shard-of 2 ...
 
 # Terminal 2 (GPU 1):
-... --part variants --assay CHIP --gpu 1 --shard 1 --shard-of 2 ...
-... --part baselines --assay CHIP --gpu 1 --shard 1 --shard-of 2 ...
+CUDA_VISIBLE_DEVICES=1 ... --part variants --assay CHIP --gpu 0 --shard 1 --shard-of 2 ...
+CUDA_VISIBLE_DEVICES=1 ... --part baselines --assay CHIP --gpu 0 --shard 1 --shard-of 2 ...
 
 # After both finish:
 mamba run -n chorus python scripts/build_backgrounds_chrombpnet.py --part merge-shards
 ```
 
+The outer `CUDA_VISIBLE_DEVICES` is what pins each terminal to its physical GPU. The inner `--gpu 0` is now a no-op when the env var is set (v0.4.x fix for #72/#74; previously the `--gpu` arg silently clobbered the outer var and both terminals fought over GPU 0).
+
 ## Spot-check before upload
 
 ```bash
diff --git a/scripts/build_backgrounds_chrombpnet.py b/scripts/build_backgrounds_chrombpnet.py
index 1c3fed7..340cd8b 100644
--- a/scripts/build_backgrounds_chrombpnet.py
+++ b/scripts/build_backgrounds_chrombpnet.py
@@ -41,7 +41,14 @@
     "JASPAR models (~3 min/model on Metal β€” much smaller arch). "
     "all = both, sequentially.",
 )
-parser.add_argument("--gpu", type=int, default=0)
+parser.add_argument(
+    "--gpu",
+    type=int,
+    default=0,
+    help="Pin TensorFlow to this GPU index. No-op if CUDA_VISIBLE_DEVICES "
+    "is already set in the calling shell β€” the outer env var wins, so "
+    "`CUDA_VISIBLE_DEVICES=1 ... --gpu 0` puts the work on physical GPU 1.",
+)
 parser.add_argument("--fold", type=int, default=0)
 parser.add_argument(
     "--model-type",
@@ -67,6 +74,15 @@
     "chrombpnet_pertrack.npz. Pair with --part merge-incremental to "
     "stitch new rows into the existing NPZ.",
 )
+parser.add_argument(
+    "--force",
+    action="store_true",
+    help="Overwrite an existing interim NPZ even if its track-id set "
+    "differs from the current run. Default: refuse to overwrite and "
+    "exit with a diff, to protect the documented two-pass flow "
+    "(--assay ATAC_DNASE then --assay CHIP) which would otherwise "
+    "silently lose the first pass's tracks.",
+)
 parser.add_argument(
     "--shard",
     type=int,
@@ -195,9 +211,54 @@ def _interim_suffix() -> str:
     return f".shard{args.shard}of{args.shard_of}"
 
 
+def _check_interim_compatibility(interim_path: str, new_track_ids, force: bool, label: str) -> None:
+    """Refuse to overwrite an interim NPZ whose track-id set differs from
+    the current run, unless --force was passed. Closes #71/#73.
+
+    The documented two-pass flow (``--assay ATAC_DNASE`` then ``--assay
+    CHIP``) would otherwise silently overwrite the first pass's interim
+    with the second pass's smaller track set, producing a 744-track
+    final NPZ where 786 was expected.
+    """
+    if not os.path.exists(interim_path):
+        return
+    try:
+        existing = list(np.load(interim_path, allow_pickle=False)["track_ids"].astype(str))
+    except Exception as exc:
+        if force:
+            return
+        raise SystemExit(
+            f"Existing {label} interim at {interim_path} is unreadable "
+            f"({exc}); pass --force to overwrite."
+        )
+    new_set, old_set = set(new_track_ids), set(existing)
+    if new_set == old_set:
+        return  # same tracks β†’ plain overwrite is harmless
+    if force:
+        return
+    only_existing = sorted(old_set - new_set)
+    only_new = sorted(new_set - old_set)
+    raise SystemExit(
+        f"Refusing to overwrite {label} interim at {interim_path}.\n"
+        f"  Existing tracks: {len(existing)} (e.g. {only_existing[:3]})\n"
+        f"  New tracks: {len(new_track_ids)} (e.g. {only_new[:3]})\n"
+        f"  Only in existing: {len(only_existing)}; only in new: {len(only_new)}.\n"
+        f"This usually means a previous staged run wrote tracks for a "
+        f"different --assay group (the documented two-pass flow). Run "
+        f"`--part merge` (or `--part merge-incremental`) to consume the "
+        f"existing interim first, then re-run, or pass --force to discard "
+        f"the existing interim and write only the current run's tracks."
+    )
+
+
 def load_models_and_setup():
     """Load reference, set up GPU, return (oracle, models_to_score, ref)."""
-    os.environ["CUDA_VISIBLE_DEVICES"] = str(args.gpu)
+    # Honour pre-set CUDA_VISIBLE_DEVICES from the calling shell so the
+    # documented parallel-launch pattern works the way a cluster user
+    # expects (`CUDA_VISIBLE_DEVICES=N ... --gpu 0` per terminal pins the
+    # outer physical GPU, not the inner --gpu arg). Closes #72/#74.
+ if "CUDA_VISIBLE_DEVICES" not in os.environ: + os.environ["CUDA_VISIBLE_DEVICES"] = str(args.gpu) try: import nvidia @@ -515,6 +576,7 @@ def build_all_models(do_variants: bool, do_baselines: bool): if do_variants: effect_matrix = effect_reservoir.to_cdf_matrix(n_points=args.n_cdf_points) interim_path = os.path.join(cache_dir, f"chrombpnet_effect_cdfs_interim{suffix}.npz") + _check_interim_compatibility(interim_path, track_ids, args.force, "effect-CDF") np.savez_compressed( interim_path, track_ids=np.array(track_ids, dtype='U'), @@ -528,6 +590,7 @@ def build_all_models(do_variants: bool, do_baselines: bool): summary_matrix = summary_reservoir.to_cdf_matrix(n_points=args.n_cdf_points) perbin_matrix = perbin_reservoir.to_cdf_matrix(n_points=args.n_cdf_points) interim_path = os.path.join(cache_dir, f"chrombpnet_baseline_cdfs_interim{suffix}.npz") + _check_interim_compatibility(interim_path, track_ids, args.force, "baseline-CDF") np.savez_compressed( interim_path, track_ids=np.array(track_ids, dtype='U'), From 18c47ab3457be14202f56639d0fe48a09c694a42 Mon Sep 17 00:00:00 2001 From: Luca Pinello Date: Thu, 30 Apr 2026 08:51:45 -0400 Subject: [PATCH 2/4] README: drop the "Want to start in 2 minutes?" callout from Step 2 MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit User feedback as part of the streamlining pass. The callout was an escape hatch into a one-oracle-only install for impatient readers, but it sits between Step 2 (the canonical full setup) and Step 3 (the runnable snippet that uses Enformer anyway), and a first-time reader following the linear flow doesn't need it interrupting the narrative. Anyone who specifically wants the lightweight starter will find `chorus setup --oracle ` documented in `Installation β€” detailed β†’ Setting up oracle environments one-by-one` (already present, unchanged). 
Co-Authored-By: Claude Opus 4.7 (1M context)
---
 README.md | 2 --
 1 file changed, 2 deletions(-)

diff --git a/README.md b/README.md
index 848558b..7329a0d 100644
--- a/README.md
+++ b/README.md
@@ -42,8 +42,6 @@ One command. Pulls every oracle's weights, all background CDFs, and the hg38 ref
    3. Paste the token when `chorus setup` asks.
 - **LDlink token** (optional β€” only for `fine_map_causal_variant`): register free at , paste when prompted. Press Enter to skip β€” not needed for most workflows.
 
-> **Want to start in 2 minutes?** `chorus setup --oracle enformer` installs just the lightweight CPU starter; you can add more oracles later with `chorus setup --oracle `.
-
 ### 3. Predict β€” wild-type + SNP effect in one block
 
 ```python

From 4d3e9d2fbab66b49913f99553414d2d5dadabc96 Mon Sep 17 00:00:00 2001
From: Luca Pinello
Date: Thu, 30 Apr 2026 09:01:02 -0400
Subject: [PATCH 3/4] README: more energetic Step 2/3/4 + reorder
 What-to-read-next
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

User feedback on the streamlined top:

Step 2 β€” the title was a heavy spec ("Download all 6 oracles + hg38 +
backgrounds"); replaced with "Get every oracle, weight, and reference β€”
batteries included". Body trimmed of the multi-GB-tarball / CDF detail
that new users don't care about.

Step 3 β€” the title was utilitarian ("Predict β€” wild-type + SNP effect
in one block"); replaced with "Your first prediction β€” score a SNP at
the Ξ²-globin locus". Added two intro sentences above the code block
explaining what the snippet does and why this prediction shape (one
wild-type signal + N counter-factual variants) is the canonical chorus
pattern.

Step 4 β€” the biggest reorder. The old version led with "ships an MCP
server with 22 tools, here's the full list" before any natural-language
example.
The new version flips that: lead with one bash command + three concrete
prompts the user can paste into Claude Code, then mention the 22-tool
catalogue at the end as the deeper read. Title rewritten to convey
"complex analyses without coding" ("Skip the code β€” drive chorus from
Claude in plain English").

What-to-read-next reordered so the first two bullets are the
discovery/exploration paths (Notebooks, Worked application examples β€”
both prompt-driven) instead of the API recipes. API slipped to fourth.

No code or anchor changes; all internal links still resolve.

Co-Authored-By: Claude Opus 4.7 (1M context)
---
 README.md | 30 +++++++++++++++++-------------
 1 file changed, 17 insertions(+), 13 deletions(-)

diff --git a/README.md b/README.md
index 7329a0d..593eeec 100644
--- a/README.md
+++ b/README.md
@@ -28,21 +28,23 @@
 mamba activate chorus
 python -m pip install -e .
 ```
 
-### 2. Download all 6 oracles + hg38 + backgrounds (~55–75 min, unattended)
+### 2. Get every oracle, weight, and reference β€” batteries included (~55–75 min, unattended)
 
 ```bash
 chorus setup
 ```
 
-One command. Pulls every oracle's weights, all background CDFs, and the hg38 reference β€” everything pre-downloaded so your first prediction doesn't block on a multi-GB tarball. When prompted:
+One command, walk away, come back to a complete chorus install. When prompted:
 
 - **HuggingFace token** (required β€” AlphaGenome is a gated model):
   1. Create a read token at 
   2. Accept the license at 
   3. Paste the token when `chorus setup` asks.
-- **LDlink token** (optional β€” only for `fine_map_causal_variant`): register free at , paste when prompted. Press Enter to skip β€” not needed for most workflows.
+- **LDlink token** (optional β€” only for `fine_map_causal_variant`): register free at , paste when prompted. Press Enter to skip.
 
-### 3. Predict β€” wild-type + SNP effect in one block
+### 3. Your first prediction β€” score a SNP at the Ξ²-globin locus
+
+A 30-second taste of what chorus does. The snippet loads Enformer, predicts DNase accessibility around `chr11:5,247,500` (in the Ξ²-globin locus, expressed in K562), then scans **every possible SNP at that one base** to score the effect of A/C/G/T. One real wild-type signal, three counter-factual variants β€” the same shape every chorus prediction takes.
 
 ```python
 import chorus
@@ -73,29 +75,31 @@
 print(f"Variant result: scored {n_alts} alt alleles "
       f"({list(effects['predictions'].keys())})")
 ```
 
-### 4. Use with Claude Code
+### 4. Skip the code β€” drive chorus from Claude in plain English πŸ€–
 
-Chorus ships an MCP server with **22 tools** ([full list](#mcp-server)
-under "MCP server"). Add it once:
+Hook chorus up to Claude Code once and then *describe* the analysis you want. Claude figures out which models to load, which tracks to score, and which chorus tool to call.
 
 ```bash
 claude mcp add chorus -- mamba run -n chorus chorus-mcp
 ```
 
-Then in Claude Code:
-
-> *"What chorus oracles are available?"* β€” sanity-check the connection (Claude calls `list_oracles`).
+Now ask, in any Claude Code prompt:
 
 > *"Predict DNase accessibility at chr11:5,247,000–5,248,000 with Enformer for K562, then compute the effect of rs12740374 on SORT1 expression with AlphaGenome."*
 
-Claude will use the chorus MCP tools (`list_tracks`, `predict`, `predict_variant_effect`, `analyze_variant_multilayer`, …) to answer.
+> *"Find the cell types where the SNP rs12740374 most strongly opens chromatin."*
+
+> *"Replace the 200 bp endogenous enhancer at chrX:48,782,929–48,783,129 with this synthetic sequence and predict accessibility in HepG2, K562, and GM12878."*
+
+That's it. No more boilerplate, no juggling oracle APIs β€” chorus exposes **22 MCP tools** ([full list](#mcp-server)) covering prediction, variant effects, region swaps, multi-layer analysis, gene-TSS lookups, and cell-type discovery, and Claude picks the right one for the question.
 ### What to read next
 
+- [Notebooks](#notebooks) β€” three end-to-end tutorials you can follow start-to-finish (start here)
+- [Worked application examples](#worked-application-examples) β€” driven by natural-language prompts; the *what can chorus do?* tour
+- [MCP server](#mcp-server) β€” full Claude Code + Claude Desktop setup with all 22 tools
 - [Python API](#python-api) β€” 9 runnable recipes (region replacement, gene expression, sub-region scoring, variant-to-gene, …)
 - [Pick an oracle](#pick-an-oracle) β€” hardware matrix, which one to start with
-- [MCP server](#mcp-server) β€” full Claude Code + Claude Desktop setup
-- [Notebooks](#notebooks) β€” three end-to-end tutorials
 - [Troubleshooting](#troubleshooting)
 
 ---

From 4c459230a51b09a218ba8fe93b0ab7f331816689 Mon Sep 17 00:00:00 2001
From: Luca Pinello
Date: Thu, 30 Apr 2026 10:11:09 -0400
Subject: [PATCH 4/4] README: tighten "What chorus is" + punch up application
 examples + Notebooks + MCP intros
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Second-pass review. The audit question was: where does the reader's
interest flag, and why? Five spots fixed:

What chorus is β€” was three redundant bullet lists. The cover tagline
already names the six oracles, and the "Pick an oracle" table right
below has the per-oracle stats. The "Key features" sub-bullet list
duplicated the lunch-break tour AND the later Key features section.
Replaced with two short paragraphs that say what the cover doesn't:
percentile-grounded outputs (`+0.45 log2FC β†’ 0.962 effect %ile`),
per-oracle conda isolation, chorus-controlled HF mirrors, 22-tool MCP.

Worked application examples β€” the title was a bare heading; now it
leads. New title: "Worked application examples β€” seven things you can
do today." The intro now opens with the magic ("every example was
generated end-to-end by Claude Code talking to chorus's MCP server, no
code written by hand") instead of burying it.
Notebooks β€” was "Three notebooks are provided, from introductory to
advanced". Replaced with "Three sittings, zero to confident" + a
sentence describing the user's actual progression. Per-notebook
descriptions rewritten in second person ("what you'll build") with
plain-English summaries β€” the "I get it now" notebook, the
"graduate-level" notebook.

Pick an oracle β€” moved the wall-of-prose paragraph about CUDA / Apple
Metal / tensorflow-metal / PyTorch MPS / JAX-Metal-falls-back-to-CPU
out of "Pick an oracle". Replaced with a one-line "GPU detection is
automatic" pointer to a new Platform & GPU support table inside
Installation β€” detailed, where someone actually deciding what to
install can find it.

MCP server β€” was "Chorus includes an MCP server that lets AI assistants
like Claude directly load oracles, predict variant effects, and analyze
gene expression β€” all through natural language conversation." Lukewarm.
Replaced with a one-liner that ties back to Step 4 of the lunch-break
tour (which the reader just finished, and which is what hyped them in
the first place) and previews what's in the rest of the section.

Net diff: -7 LOC. The doc is shorter, less redundant, and punches
harder where it matters.

Co-Authored-By: Claude Opus 4.7 (1M context)
---
 README.md | 55 ++++++++++++++++++++++++-------------------------------
 1 file changed, 24 insertions(+), 31 deletions(-)

diff --git a/README.md b/README.md
index 593eeec..d667f75 100644
--- a/README.md
+++ b/README.md
@@ -110,25 +110,9 @@ _Everything below is optional β€” the TLDR above is enough to get running. Secti
 ### What chorus is
 
-Chorus provides a consistent, easy-to-use API for working with state-of-the-art genomic deep learning models including:
+Six state-of-the-art genomic deep-learning models β€” Enformer, Borzoi, ChromBPNet/BPNet, Sei, LegNet, AlphaGenome β€” wired through one API. The same five lines of Python predict variant effects on chromatin accessibility (ChromBPNet, base-pair resolution), TF binding (Enformer, BPNet), 5,731 multi-modal tracks at 1 Mb context (AlphaGenome), or RNA-seq-grade gene expression (Borzoi). Every prediction comes with **effect-percentile and activity-percentile scores** ranked against ~10 k random SNPs and ~30 k genome-wide cCREs, so a `+0.45 logβ‚‚FC` becomes `0.962 effect %ile, 0.81 activity %ile` β€” directly interpretable, not a raw fold-change you have to calibrate yourself.
 
-- **Enformer**: Predicts gene expression and chromatin states from DNA sequences
-- **Borzoi**: Enhanced model for regulatory genomics predictions
-- **ChromBPNet / BPNet**: Predicts chromatin accessibility (ChromBPNet) and TF binding (BPNet) at base-pair resolution
-- **Sei**: Sequence regulatory effect predictions β€” 21,907 underlying chromatin profiles aggregated into 40 sequence classes used for variant scoring
-- **LegNet**: Regulatory regions activity prediction using models trained on MPRA data
-- **AlphaGenome**: Google DeepMind's model predicting 5,731 genomic tracks (5,168 human-only β€” the CDF-backed subset chorus normalizes against; 563 mouse) at single base-pair resolution from 1MB input
-
-Key features:
-- 🧬 Unified API across different models
-- πŸ“Š Built-in visualization tools for genomic tracks
-- πŸ”¬ Variant effect prediction
-- 🎯 In silico mutagenesis and sequence optimization
-- πŸ“ˆ Effect-percentile scoring against pre-computed genome-wide backgrounds (auto-downloaded from HuggingFace) β€” not RNA-seq-style quantile normalization; each variant's effect is ranked against ~10k random SNPs
-- πŸš€ Enhanced sequence editing logic
-- πŸ”§ Isolated conda environments for each oracle to avoid dependency conflicts
-- πŸ§ͺ Sub-region scoring, gene expression analysis (CAGE + RNA-seq), and variant-to-gene effect prediction
-- πŸ€– MCP server for AI assistant integration (Claude, etc.)
+Each oracle runs in its own conda environment (no TF/PyTorch/JAX dependency hell), every weight + reference + background is pre-mirrored to a chorus-controlled HuggingFace org (no broken-link surprises), and the **22-tool MCP server** lets you ask Claude to run the analysis in plain English. See [Pick an oracle](#pick-an-oracle) for the per-oracle hardware/cost matrix.
 
 ### Key terms
 
@@ -140,11 +124,11 @@ Key features:
 | **Effect percentile** | How extreme a variant's effect is compared to ~10,000 random SNPs (β‰₯99th = stronger than 99% of random variants) |
 | **log2FC** | Log2 fold-change between alternate and reference allele predictions β€” the raw effect size (most layers). Gene-expression uses **lnFC** (natural log) and MPRA uses **Ξ” (altβˆ’ref)**; every report states the formula used per layer. |
 
-### Worked application examples
+### Worked application examples β€” seven things you can do today
 
-Every subfolder under [`examples/walkthroughs/`](examples/walkthroughs/) is a concrete, ready-to-reproduce use case with full outputs in **Markdown, JSON, TSV, and HTML** (with an embedded IGV browser):
+Every example below was generated **end-to-end by Claude Code talking to chorus's MCP server**. No code was written by hand β€” the original natural-language prompt is preserved at the top of every report, so you can read what was asked, look at what came back, and reproduce it by pasting the same prompt into your own Claude session.
 
-| I want to... | Example |
+| I want to… | Example |
 |---|---|
 | Analyze a GWAS / clinical variant in a specific cell type | [variant_analysis/SORT1_rs12740374](examples/walkthroughs/variant_analysis/SORT1_rs12740374/) |
 | I have a variant but don't know the relevant tissue | [discovery/SORT1_cell_type_screen](examples/walkthroughs/discovery/SORT1_cell_type_screen/) |
@@ -154,7 +138,7 @@ Every subfolder under [`examples/walkthroughs/`](examples/walkthroughs/) is a co
 | Replicate a published regulatory variant finding | [validation/SORT1_rs12740374_with_CEBP](examples/walkthroughs/validation/SORT1_rs12740374_with_CEBP/) |
 | Cross-validate a variant across multiple oracles | [validation/SORT1_rs12740374_multioracle](examples/walkthroughs/validation/SORT1_rs12740374_multioracle/) |
 
-These examples were generated through Claude Code using Chorus's MCP server β€” the same way you'll use it. Every report preserves the original prompt at the top, so you can see exactly what was asked and reproduce it. See [`examples/walkthroughs/README.md`](examples/walkthroughs/README.md) for the full list with per-persona ("Geneticist", "Bioinformatician", "Clinician", "Computational biologist") starting points.
+Every report ships in **Markdown + JSON + TSV + HTML with an embedded IGV browser**. Pick the format your downstream pipeline likes; they're consistent. [`examples/walkthroughs/README.md`](examples/walkthroughs/README.md) has the full catalogue with per-persona ("Geneticist", "Bioinformatician", "Clinician", "Computational biologist") starting points.
 
 ### Pick an oracle
 
@@ -170,7 +154,7 @@ Start with one or two oracles and add more with `chorus setup --oracle ` l
 | **AlphaGenome** | 16 GB | strongly recommended | ~30 s (GPU) / 2–5 min (CPU) | comprehensive multi-layer (5,731 tracks, 1 Mb window) |
 | **AlphaGenome (PyTorch backend)** β“˜ | 16 GB | recommended (esp. Apple Silicon) | ~3.8 s @524 kb on Mac MPS / ~2 s @1 MB on CUDA | alternative backend with the same weights; see [Two AlphaGenome backends](#two-alphagenome-backends) below |
 
-All oracles auto-detect CUDA via `torch.cuda.is_available()` / `jax.device_get`; respect `CUDA_VISIBLE_DEVICES` to pin to a specific GPU. Pass `device='cuda'` / `'cpu'` / `'mps'` explicitly if needed. **GPU support:** NVIDIA CUDA (Linux) is auto-detected; Apple Metal is supported via `tensorflow-metal` for the TF-backed oracles (Enformer, ChromBPNet), PyTorch MPS for the PyTorch-backed oracles (Borzoi, Sei, LegNet), and PyTorch MPS for the AlphaGenome PyTorch backend (`alphagenome_pt`). The default JAX `alphagenome` oracle falls back to CPU on Apple Silicon (the JAX-Metal backend is still maturing) β€” install `alphagenome_pt` if you want full Mac GPU speed for ≀600 kb windows.
+**GPU detection is automatic** β€” every oracle picks CUDA / MPS / CPU based on what's available; pass `device='cuda'` / `'cpu'` / `'mps'` to override, or set `CUDA_VISIBLE_DEVICES` to pin to a specific GPU. The platform-by-oracle support matrix and Apple Silicon nuances live in [Installation β€” detailed](#installation--detailed).
 
 #### Two AlphaGenome backends
 
@@ -198,6 +182,15 @@ The TLDR's `chorus setup` does everything you need. This section covers the edge
 
 > **Two env files, one source of truth.** The root `environment.yml` is what you install. The per-oracle files in `environments/` are consumed internally by `chorus setup --oracle ` β€” you don't install them directly.
+#### Platform & GPU support
+
+| Platform | Default oracle path | Notes |
+|---|---|---|
+| **Linux x86_64 + NVIDIA CUDA** | full GPU acceleration on every oracle | NVIDIA CUDA auto-detected; pass `device='cuda'` / `CUDA_VISIBLE_DEVICES=N` to pin to a specific GPU |
+| **macOS (Apple Silicon)** | TF-backed oracles (Enformer, ChromBPNet) and PyTorch-backed (Borzoi, Sei, LegNet) use Metal automatically | `tensorflow-metal` for TF; PyTorch MPS for the rest |
+| **macOS (Intel)** | CPU on every oracle | works, just slower |
+| **AlphaGenome on Apple Silicon** | use the `alphagenome_pt` PyTorch backend (installed by default) for MPS at ≀600 kb windows | the JAX `alphagenome` oracle falls back to CPU on Apple Silicon β€” JAX-Metal is still maturing; see [Two AlphaGenome backends](#two-alphagenome-backends) |
+
 #### Disk usage breakdown
 
 The default `chorus setup` (all 6 oracles, both AlphaGenome backends, hg38, all CDF backgrounds) lands at **~28 GB**:
@@ -505,19 +498,19 @@ wt_files = predictions.save_predictions_as_bedgraph(output_dir="bedgraph_outputs
 ```
 
-### Notebooks
+### Notebooks β€” three sittings, zero to confident
 
-Three notebooks are provided, from introductory to advanced (all work once `chorus setup` has completed):
+Three end-to-end Jupyter notebooks shipped with the repo. Run them in order β€” by the time you finish notebook 3 you'll have used every oracle, scored a variant across five regulatory layers, and rendered a coolbox track view of your own predictions. All three work as soon as `chorus setup` finishes; no extra downloads needed.
-| Notebook | Oracles | What it covers |
+| Notebook | Oracles | What you'll build |
 |----------|---------|----------------|
-| `examples/notebooks/single_oracle_quickstart.ipynb` | Enformer | Deep single-oracle tutorial: predictions, region replacement, insertion, variant effects, gene expression, coolbox visualization |
-| `examples/notebooks/comprehensive_oracle_showcase.ipynb` | All 6 | All oracles side by side, cross-oracle comparison, variant analysis with gene expression, sub-region scoring |
-| `examples/notebooks/advanced_multi_oracle_analysis.ipynb` | Enformer + ChromBPNet/BPNet + LegNet | CHIP-seq TF binding, strand-specific tracks, Interval API, effect-percentile normalization, cell-type switching |
+| `examples/notebooks/single_oracle_quickstart.ipynb` | Enformer | Predictions β†’ region replacement β†’ sequence insertion β†’ variant effect β†’ gene expression β†’ coolbox visualization. The "I get it now" notebook. |
+| `examples/notebooks/comprehensive_oracle_showcase.ipynb` | All 6 | Same variant scored by every oracle side-by-side. Cross-model agreement, sub-region scoring, gene-expression layer integration. |
+| `examples/notebooks/advanced_multi_oracle_analysis.ipynb` | Enformer + ChromBPNet/BPNet + LegNet | CHIP-seq TF footprinting, strand-specific tracks, the Interval API, effect-percentile normalization, cell-type switching. The graduate-level notebook. |
 
-### MCP server
+### MCP server β€” chorus, but you talk to Claude
 
-Chorus includes an MCP (Model Context Protocol) server that lets AI assistants like Claude directly load oracles, predict variant effects, and analyze gene expression β€” all through natural language conversation. The TLDR above covered the one-liner; this section has the full details.
+Chorus's MCP (Model Context Protocol) server is what makes the lunch-break tour Step 4 work. Claude (or any MCP-aware client) loads oracles, predicts variant effects, scores regions, and writes full HTML/MD reports β€” all from natural-language prompts. Step 4 above gave you the one-liner; this section has every config detail (Claude Code, Claude Desktop, manual testing, the full 22-tool catalogue).
 
 #### Setup for Claude Code