Fix/igv multi resolution normalization #79

Open
lorenzoruggerii wants to merge 2 commits into main from fix/igv-multi-resolution-normalization

@lorenzoruggerii
Collaborator

Summary

Improve IGV visualization for mixed-resolution genomic tracks and fix normalization fallback for models without per-bin background distributions.

Changes

1. Per-track adaptive downsampling (_downsample_to_features)

  • Added aggregation_method parameter supporting "mean" (default) and "max"
  • Max pooling preserves peak signals when downsampling high-resolution (1bp) tracks (e.g., ChromBPNet) over large windows, preventing signal wash-out from averaging 200+ values per bin
  • Threshold for zero-skipping adapts to aggregation method (percentile-based for mean, 1% of global max for max pooling)
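
The difference between the two aggregation methods can be illustrated with a minimal sketch. This is not the code of `_downsample_to_features`; the function name `downsample` and its signature are hypothetical, and the zero-skipping threshold is left out. It only shows why averaging roughly 200 values per bin flattens a 1 bp peak while max pooling keeps it.

```python
import math


def downsample(values, n_features, aggregation_method="mean"):
    """Aggregate `values` into at most `n_features` equal-width bins.

    Hypothetical helper for illustration only; not the project's API.
    """
    if aggregation_method not in ("mean", "max"):
        raise ValueError(f"unknown aggregation_method: {aggregation_method!r}")
    if not values:
        return []
    # Number of input positions that collapse into one output bin.
    bin_size = max(1, math.ceil(len(values) / n_features))
    out = []
    for start in range(0, len(values), bin_size):
        chunk = values[start:start + bin_size]
        if aggregation_method == "max":
            out.append(max(chunk))               # keeps the peak height
        else:
            out.append(sum(chunk) / len(chunk))  # dilutes the peak
    return out


# A single sharp 1 bp peak inside a 200 bp window, reduced to one bin.
signal = [0.0] * 200
signal[57] = 10.0
mean_bins = downsample(signal, 1, "mean")  # peak diluted to 10 / 200
max_bins = downsample(signal, 1, "max")    # peak height preserved
```

With one output bin, the mean is 200 times smaller than the original peak, whereas the max keeps the full height.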

2. Per-track bin size calculation (_calculate_track_bin_size)

  • Tracks now compute their own optimal bin size based on native resolution:
    • Binned models (≥128bp, e.g., AlphaGenome) keep native resolution (~78 features in 10kb)
    • High-resolution models (1bp, e.g., ChromBPNet) use ~50,000 features with max pooling
  • Returns both bin size and recommended aggregation method
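
A rough sketch of the bin-size rule described above, under stated assumptions: the name `calculate_track_bin_size`, its arguments, and the two constants are hypothetical, and the choice of `"mean"` for binned models is inferred from it being the default rather than stated explicitly. Resolutions between 1 bp and 128 bp are treated as high-resolution here, which is also an assumption.

```python
import math

# Figures taken from the description; the constant names are illustrative.
BINNED_MIN_RESOLUTION = 128      # binned models are >= 128 bp
HIGH_RES_MAX_FEATURES = 50_000   # ~50,000 features for 1 bp tracks


def calculate_track_bin_size(region_length, native_resolution):
    """Return (bin_size_bp, aggregation_method) for one track.

    Hypothetical sketch; the real _calculate_track_bin_size may differ.
    """
    if native_resolution >= BINNED_MIN_RESOLUTION:
        # Binned models keep their native resolution.
        return native_resolution, "mean"
    # High-resolution models: cap the feature count, never go below native.
    bin_size = max(native_resolution,
                   math.ceil(region_length / HIGH_RES_MAX_FEATURES))
    return bin_size, "max"


binned = calculate_track_bin_size(10_000, 128)     # 10 kb, 128 bp bins
high_res = calculate_track_bin_size(1_000_000, 1)  # 1 Mb, 1 bp model
n_binned_features = 10_000 // binned[0]
```

Under these assumptions a 10 kb window at 128 bp yields 78 full bins, and a 1 Mb window for a 1 bp model yields a 20 bp bin size with max pooling.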

3. IGV rendering hints for high-res tracks

  • Added windowFunction: "max" to WIG track configs for high-resolution models
  • Prevents IGV.js internal mean-based downsampling from hiding sharp peaks at zoomed-out views
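
As an illustration, a WIG track config carrying this hint might be built as follows. The helper `wig_track_config`, the track names, and the URLs are hypothetical placeholders; only the `windowFunction: "max"` entry comes from this PR.

```python
import json


def wig_track_config(name, url, high_resolution):
    """Build an igv.js-style WIG track config as a plain dict.

    Hypothetical helper; field values other than windowFunction are
    placeholders chosen for the example.
    """
    config = {
        "name": name,
        "type": "wig",
        "format": "bigwig",
        "url": url,
    }
    if high_resolution:
        # Ask IGV.js to summarise with max instead of mean when zoomed out.
        config["windowFunction"] = "max"
    return config


high_res_track = wig_track_config(
    "ChromBPNet (example)", "https://example.org/chrombpnet.bw", True)
binned_track = wig_track_config(
    "AlphaGenome (example)", "https://example.org/alphagenome.bw", False)
serialized = json.dumps(high_res_track)  # what would be embedded in the page
```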

4. Robust normalization fallback (apply_floor_rescale / perbin_floor_rescale_batch)

  • Track ID matching: Resolves format mismatches between predictions and normalizer (e.g., "LentiMPRA:HepG2" vs "HepG2")
  • CDF type fallback: When perbin_cdfs is missing (e.g., LegNet), falls back to summary_cdfs / effect_cdfs automatically
  • Added _match_track_id() and _find_matching_cdf() helper methods to keep rescaling logic clean
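
The matching and fallback logic could look roughly like the sketch below. It is not the code of `_match_track_id()` or `_find_matching_cdf()`: the function signatures, the dict layout of `cdf_tables`, and the lookup order are assumptions made for the example.

```python
def match_track_id(track_id, known_ids):
    """Return the key in `known_ids` that corresponds to `track_id`, or None.

    Hypothetical sketch: tries an exact match first, then the part after
    the last ':' (e.g. "LentiMPRA:HepG2" -> "HepG2").
    """
    if track_id in known_ids:
        return track_id
    suffix = track_id.rsplit(":", 1)[-1]
    if suffix in known_ids:
        return suffix
    return None


def find_matching_cdf(track_id, cdf_tables,
                      order=("perbin_cdfs", "summary_cdfs")):
    """Return (cdf_type, cdf) from the first table that has the track.

    `cdf_tables` maps a CDF type name to a dict of track id -> CDF,
    or to None when that CDF type was never built (assumed layout).
    """
    for cdf_type in order:
        table = cdf_tables.get(cdf_type) or {}
        matched = match_track_id(track_id, table)
        if matched is not None:
            return cdf_type, table[matched]
    return None


# LegNet-like case: no per-bin CDFs, only a summary CDF keyed by cell line.
tables = {"perbin_cdfs": None, "summary_cdfs": {"HepG2": [0.1, 0.5, 0.9]}}
result = find_matching_cdf("LentiMPRA:HepG2", tables)
missing = find_matching_cdf("LentiMPRA:K562", tables)
```

Keeping the ID normalization in one helper means the rescaling functions only deal with "a CDF was found" or "no CDF was found".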

5. LegNet user-facing clarification

  • Tracks using per-track normalization (LegNet) are labeled with (per-track norm) suffix
  • Distinguishes them from per-bin normalized tracks, since values are not directly cross-comparable

…GV. Add adaptive downsampling with max pooling for 1bp tracks (ChromBPNet), per-bin size calculation based on native resolution, IGV windowFunction hints, and CDF fallback for models without per-bin distributions (LegNet)
lucapinello pushed a commit that referenced this pull request May 9, 2026
Resolves conflict in chorus/analysis/normalization.py:
- Adopt Lorenzo's _match_track_id / _find_matching_cdf helpers
- Extend _match_track_id to also strip CHIP strand suffix (:+/:-)
  so per-strand track IDs match merged CDF rows
- Move _has_samples guard inside _find_matching_cdf so failed-build
  perbin rows fall through to summary CDF instead of saturating
- Set perbin_floor_rescale_batch max_value default to 3.0 (matches
  _DISPLAY_MAX in IGV)

Lorenzo's other files (_igv_report.py, multi_oracle_report.py, scripts/
regenerate_*.py, example artefacts) auto-merged.
lucapinello pushed a commit that referenced this pull request May 9, 2026
…tebooks

Single source of truth for normalization semantics: every renderer now
goes through `chorus.analysis._igv_report.rescale_for_display()`. By
default (no extra params) all four paths produce CDF-rescaled output —
1.0 = genome-wide p99, 3.0 cap, signed layers symmetric around 0.

Key changes
- New `rescale_for_display(values, layer, normalizer, oracle_name,
  assay_id) → (out, cfg)` returns rescaled values + display config
  (ymin, ymax, signed flag) usable by any renderer.
- `apply_floor_rescale` (the IGV ref/alt wrapper) now returns a
  4-tuple (rescaled, ref, alt, signed) so callers can pick symmetric
  vs unsigned scale_cfg.
- New `signed_floor_rescale_batch` rescales signed values to
  [-DISPLAY_MAX, +DISPLAY_MAX] using p99(|cdf|) so Borzoi RNA / Sei /
  LentiMPRA repressive effects are visible (was clipped to 0 before).
- `is_signed()` and `_match_track_id()` now share fuzzy track-id
  matching incl. CHIP `:+`/`:-` strand suffix stripping, so LegNet
  (`LentiMPRA:HepG2` → CDF row `HepG2`) correctly registers as signed.
- `OraclePrediction.add()` backfills `track.assay_id` from the dict
  key so CoolBox/matplotlib autoload paths can find the right CDF row
  on tracks that left assay_id None (notably ChromBPNet).
- CoolBox `get_coolbox_representation()` and matplotlib
  `render_track_figures()` now auto-load the per-track normalizer
  from `~/.chorus/backgrounds/` when called with no kwargs; pass
  `normalize=False` to opt out for raw values.
- `ChromBPNetOracle.predict_sliding()` slides the 2114-bp model across
  arbitrary intervals with cigar substitutions preserved, so the
  multi-oracle IGV panel covers the full AlphaGenome 1 Mb locus
  instead of 0.2 % of it.  `_predict()` auto-routes wide queries to
  it (PR #79's wider region had been triggering a pre-existing
  IndexError in `_predict_direct`'s sliding formula).
- `_calculate_track_bin_size` uses `(20, "max")` for ChromBPNet (was
  `(20, "mean")` — code/description mismatch in PR #79); max-pool
  preserves 1-bp peak heights instead of diluting them by 20x.
- Lower per-layer floors so peaks have visible base/shoulder:
  `chromatin_accessibility 0.95→0.90`, `promoter_activity 0.95→0.85`.
- Causal-report IGV (`causal._build_causal_igv`) now goes through the
  same helper as variant + multi-oracle reports.
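
The signed rescale described above can be sketched as follows, as a simplified illustration rather than the code of `signed_floor_rescale_batch`. The function name `signed_rescale`, the nearest-rank percentile, and the list-based inputs are assumptions; any floor handling implied by the real function's name is omitted because it is not described here.

```python
import math

DISPLAY_MAX = 3.0  # display cap stated in the commit message


def signed_rescale(values, background, display_max=DISPLAY_MAX):
    """Scale signed values so that p99(|background|) maps to 1.0.

    Hypothetical sketch: results are clipped to
    [-display_max, +display_max], so negative (repressive) effects stay
    visible instead of being clipped to 0.
    """
    abs_bg = sorted(abs(b) for b in background)
    if not abs_bg:
        raise ValueError("background must not be empty")
    # Nearest-rank 99th percentile of the absolute background values.
    rank = max(1, math.ceil(0.99 * len(abs_bg)))
    p99 = abs_bg[rank - 1]
    if p99 == 0:
        return [0.0 for _ in values]
    return [max(-display_max, min(display_max, v / p99)) for v in values]


background = [-2.0, -1.0, 0.5, 1.0, 2.0]  # toy background, p99(|x|) = 2.0
rescaled = signed_rescale([-1.0, 2.0, 10.0, -10.0], background)
# rescaled == [-0.5, 1.0, 3.0, -3.0]
```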

Lorenzo's PR #79 changes preserved
- `_match_track_id` / `_find_matching_cdf` (with CHIP-strand fuzzy
  match added on top), `_calculate_track_bin_size`,
  `windowFunction: "max"` IGV hint, `(per-track norm)` LegNet label
  suffix, `get_max_output_size()` multi-oracle region width.

Tests: 376 passed, 1 skipped (env-gating), 5 deselected (integration).
Updated `test_perbin_none_for_scalar_oracles` (perbin → summary
fallback now succeeds for LegNet) and `test_apply_floor_rescale_passthrough`
(4-tuple).  Added `test_rescale_for_display_unified_helper`.

Annotations directory + screenshot sweeps gitignored.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
lucapinello pushed a commit that referenced this pull request May 9, 2026
Documents what was tested, what passed, and one deferred follow-up
(DHS-augmented chrombpnet CDF needs to be rebuilt for all 786 tracks
incl. BPNet/CHIP before uploading to HuggingFace).

Also fixes two README stale claims that survived the unification work:
display-rescale range is [0, 3.0] not [0, 1.5] (matches _DISPLAY_MAX
in _igv_report.py).  Adds a row for the new signed-layer symmetric
[-3, +3] rescale semantics.

Re-regenerates SORT1 chrombpnet + multi-oracle artefacts against the
HF-shipped 786-track CDF (the production CDF every fresh install
gets) — drops the local DHS-only 42-track CDF that the previous
regen run had used.  SORT1 chrombpnet effect under HF CDF:
+0.318 log2FC, ≥99th %ile, Activity %ile 0.603 (same qualitative
interpretation as the local-DHS run).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>