Fix/igv multi resolution normalization #79
Open
lorenzoruggerii wants to merge 2 commits into main
…GV. Add adaptive downsampling with max pooling for 1bp tracks (ChromBPNet), per-bin size calculation based on native resolution, IGV windowFunction hints, and CDF fallback for models without per-bin distributions (LegNet)
lucapinello pushed a commit that referenced this pull request on May 9, 2026
Resolves conflict in chorus/analysis/normalization.py:
- Adopt Lorenzo's _match_track_id / _find_matching_cdf helpers
- Extend _match_track_id to also strip CHIP strand suffix (:+/:-) so per-strand track IDs match merged CDF rows
- Move _has_samples guard inside _find_matching_cdf so failed-build perbin rows fall through to summary CDF instead of saturating
- Set perbin_floor_rescale_batch max_value default to 3.0 (matches _DISPLAY_MAX in IGV)

Lorenzo's other files (_igv_report.py, multi_oracle_report.py, scripts/regenerate_*.py, example artefacts) auto-merged.
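The track-ID matching described in this commit can be sketched roughly as follows. This is a minimal illustration, not the code in `normalization.py`: the function names `candidate_ids` and `match_track_id`, the order in which looser forms are tried, and the `CHIP:CTCF:K562:+` ID format in the usage note are all assumptions.

```python
import re

# Assumed strand-suffix pattern: a trailing ":+" or ":-" on the track ID.
_STRAND_SUFFIX = re.compile(r":[+-]$")


def candidate_ids(track_id):
    """Return progressively looser forms of a track ID, most specific first."""
    candidates = [track_id]
    stripped = _STRAND_SUFFIX.sub("", track_id)
    if stripped != track_id:
        candidates.append(stripped)
    # Assumed convention: an "ASSAY:CELL" track may be stored under "CELL" alone.
    if ":" in stripped:
        candidates.append(stripped.rsplit(":", 1)[1])
    return candidates


def match_track_id(track_id, cdf_ids):
    """Return the first CDF row ID that matches track_id, or None."""
    known = set(cdf_ids)
    for candidate in candidate_ids(track_id):
        if candidate in known:
            return candidate
    return None
```

With these assumptions, `match_track_id("LentiMPRA:HepG2", ["HepG2", "K562"])` resolves to the `HepG2` row, and a per-strand ID such as `CHIP:CTCF:K562:+` resolves to a merged `CHIP:CTCF:K562` row.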
lucapinello pushed a commit that referenced this pull request on May 9, 2026
…tebooks

Single source of truth for normalization semantics: every renderer now goes through `chorus.analysis._igv_report.rescale_for_display()`. By default (no extra params) all four paths produce CDF-rescaled output: 1.0 = genome-wide p99, 3.0 cap, signed layers symmetric around 0.

Key changes
- New `rescale_for_display(values, layer, normalizer, oracle_name, assay_id) → (out, cfg)` returns rescaled values + display config (ymin, ymax, signed flag) usable by any renderer.
- `apply_floor_rescale` (the IGV ref/alt wrapper) now returns a 4-tuple (rescaled, ref, alt, signed) so callers can pick symmetric vs unsigned scale_cfg.
- New `signed_floor_rescale_batch` rescales signed values to [-DISPLAY_MAX, +DISPLAY_MAX] using p99(|cdf|) so Borzoi RNA / Sei / LentiMPRA repressive effects are visible (was clipped to 0 before).
- `is_signed()` and `_match_track_id()` now share fuzzy track-id matching incl. CHIP `:+`/`:-` strand suffix stripping, so LegNet (`LentiMPRA:HepG2` → CDF row `HepG2`) correctly registers as signed.
- `OraclePrediction.add()` backfills `track.assay_id` from the dict key so CoolBox/matplotlib autoload paths can find the right CDF row on tracks that left assay_id None (notably ChromBPNet).
- CoolBox `get_coolbox_representation()` and matplotlib `render_track_figures()` now auto-load the per-track normalizer from `~/.chorus/backgrounds/` when called with no kwargs; pass `normalize=False` to opt out for raw values.
- `ChromBPNetOracle.predict_sliding()` slides the 2114-bp model across arbitrary intervals with cigar substitutions preserved, so the multi-oracle IGV panel covers the full AlphaGenome 1 Mb locus instead of 0.2 % of it. `_predict()` auto-routes wide queries to it (PR #79's wider region had been triggering a pre-existing IndexError in `_predict_direct`'s sliding formula).
- `_calculate_track_bin_size` uses `(20, "max")` for ChromBPNet (was `(20, "mean")`, a code/description mismatch in PR #79); max-pool preserves 1-bp peak heights instead of diluting them by 20x.
- Lower per-layer floors so peaks have visible base/shoulder: `chromatin_accessibility 0.95→0.90`, `promoter_activity 0.95→0.85`.
- Causal-report IGV (`causal._build_causal_igv`) now goes through the same helper as variant + multi-oracle reports.

Lorenzo's PR #79 changes preserved
- `_match_track_id` / `_find_matching_cdf` (with CHIP-strand fuzzy match added on top), `_calculate_track_bin_size`, `windowFunction: "max"` IGV hint, `(per-track norm)` LegNet label suffix, `get_max_output_size()` multi-oracle region width.

Tests: 376 passed, 1 skipped (env-gating), 5 deselected (integration). Updated `test_perbin_none_for_scalar_oracles` (perbin → summary fallback now succeeds for LegNet) and `test_apply_floor_rescale_passthrough` (4-tuple). Added `test_rescale_for_display_unified_helper`.

Annotations directory + screenshot sweeps gitignored.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
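To illustrate why the commit above switches ChromBPNet from `(20, "mean")` to `(20, "max")`, here is a small pure-Python sketch of binned downsampling. The function name `downsample` and its signature are hypothetical and are not the project's `_downsample_to_features`; only the `"mean"`/`"max"` choice and the 20-bp bin size come from the text.

```python
def downsample(values, bin_size, aggregation_method="mean"):
    """Aggregate consecutive bins of `values`; the last bin may be shorter."""
    if bin_size < 1:
        raise ValueError("bin_size must be >= 1")
    if aggregation_method == "max":
        aggregate = max
    elif aggregation_method == "mean":
        def aggregate(chunk):
            return sum(chunk) / len(chunk)
    else:
        raise ValueError(f"unknown aggregation_method: {aggregation_method!r}")
    return [
        aggregate(values[start:start + bin_size])
        for start in range(0, len(values), bin_size)
    ]


# A 40-bp track at 1-bp resolution with a single sharp peak of height 10.
signal = [0.0] * 40
signal[5] = 10.0

peak_kept = downsample(signal, 20, "max")      # peak height survives in bin 0
peak_diluted = downsample(signal, 20, "mean")  # peak is averaged over 20 bp
```

In this toy case the max-pooled first bin keeps the full peak height of 10, while the mean-pooled bin reports 0.5, which is the 20x dilution the commit message refers to.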
lucapinello pushed a commit that referenced this pull request on May 9, 2026
Documents what was tested, what passed, and one deferred follow-up (DHS-augmented chrombpnet CDF needs to be rebuilt for all 786 tracks incl. BPNet/CHIP before uploading to HuggingFace).

Also fixes two stale README claims that survived the unification work: display-rescale range is [0, 3.0] not [0, 1.5] (matches _DISPLAY_MAX in _igv_report.py). Adds a row for the new signed-layer symmetric [-3, +3] rescale semantics.

Re-regenerates SORT1 chrombpnet + multi-oracle artefacts against the HF-shipped 786-track CDF (the production CDF every fresh install gets); this drops the local DHS-only 42-track CDF that the previous regen run had used. SORT1 chrombpnet effect under HF CDF: +0.318 log2FC, ≥99th %ile, Activity %ile 0.603 (same qualitative interpretation as the local-DHS run).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
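The display ranges quoted above ([0, 3.0] unsigned, [-3, +3] signed, with 1.0 at the genome-wide p99) can be sketched as below. This is only one plausible reading of the commit messages: the linear mapping between the floor quantile and p99, the helper names, and the raw-sample representation of the background are all assumptions, and the real `perbin_floor_rescale_batch` / `signed_floor_rescale_batch` may work differently.

```python
DISPLAY_MAX = 3.0  # display cap quoted in the commits (_DISPLAY_MAX)


def quantile(samples, q):
    """Linear-interpolated quantile of non-empty `samples`, q in [0, 1]."""
    ordered = sorted(samples)
    pos = q * (len(ordered) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(ordered) - 1)
    frac = pos - lo
    return ordered[lo] * (1 - frac) + ordered[hi] * frac


def clamp(x, lo, hi):
    return max(lo, min(hi, x))


def rescale_unsigned(value, background, floor_q=0.90):
    """Assumed mapping: floor quantile -> 0.0, p99 -> 1.0, capped at DISPLAY_MAX."""
    floor_val = quantile(background, floor_q)
    p99 = quantile(background, 0.99)
    if p99 <= floor_val:  # degenerate background: nothing to scale against
        return 0.0
    return clamp((value - floor_val) / (p99 - floor_val), 0.0, DISPLAY_MAX)


def rescale_signed(value, background):
    """Scale by p99 of |background| so the output is symmetric around 0."""
    scale = quantile([abs(b) for b in background], 0.99)
    if scale <= 0:
        return 0.0
    return clamp(value / scale, -DISPLAY_MAX, DISPLAY_MAX)
```

Under this reading, lowering a layer's floor (for example 0.95 to 0.90) widens the band of values that map above zero, which is consistent with the "visible base/shoulder" rationale, and a negative effect keeps its sign instead of being clipped to 0.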
Summary
Improve IGV visualization for mixed-resolution genomic tracks and fix the normalization fallback for models without per-bin background distributions.

Changes
1. Per-track adaptive downsampling (`_downsample_to_features`)
   - Adds an `aggregation_method` parameter supporting `"mean"` (default) and `"max"`
2. Per-track bin size calculation (`_calculate_track_bin_size`)
3. IGV rendering hints for high-res tracks
   - Adds `windowFunction: "max"` to WIG track configs for high-resolution models
4. Robust normalization fallback (`apply_floor_rescale` / `perbin_floor_rescale_batch`)
   - Fuzzy track-ID matching (e.g., `"LentiMPRA:HepG2"` vs `"HepG2"`)
   - When `perbin_cdfs` is missing (e.g., LegNet), falls back to `summary_cdfs` → `effect_cdfs` automatically
   - Adds `_match_track_id()` and `_find_matching_cdf()` helper methods to keep rescaling logic clean
5. LegNet user-facing clarification
   - Adds a `(per-track norm)` suffix to LegNet track labels
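
The fallback order in item 4 (`perbin_cdfs`, then `summary_cdfs`, then `effect_cdfs`) can be sketched as follows. This is a minimal illustration under assumptions: the CDF tables are modelled as plain dicts of sample lists keyed by track ID, the function names are hypothetical, and lookup is exact rather than fuzzy to keep the example short.

```python
def has_samples(row):
    """A row is usable only if it exists and holds at least one sample."""
    return row is not None and len(row) > 0


def find_matching_cdf(track_id, perbin_cdfs, summary_cdfs, effect_cdfs):
    """Return (source_name, row) from the first table with a usable row."""
    sources = (
        ("perbin", perbin_cdfs),
        ("summary", summary_cdfs),
        ("effect", effect_cdfs),
    )
    for source_name, table in sources:
        if not table:  # table absent entirely, e.g. no per-bin CDFs for a model
            continue
        row = table.get(track_id)
        if has_samples(row):
            return source_name, row
    return None, None


# Hypothetical tables: the per-bin row exists but is empty (a failed build).
perbin = {"HepG2": []}
summary = {"HepG2": [0.1, 0.5]}
effect = {"HepG2": [0.2]}

source, row = find_matching_cdf("HepG2", perbin, summary, effect)
```

Here `source` is `"summary"`: checking for samples inside the lookup lets an empty per-bin row fall through to the next table rather than being used as-is.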