docs: reduce text volume (~47%) + fix scientific correctness issues by Jaureguy760 · Pull Request #99 · mcvickerlab/WASP2

Jaureguy760 · 2026-04-19T02:36:56Z

Summary

Docs reduction pass based on multi-agent review (structural triage, editorial reviewer-simulator, scientific methods reviewer, comparison to scanpy / MACS3 / pysam norms). Takes the docs from ~6,300 lines to ~3,360 — 46.7% reduction — while fixing 8 scientific-correctness issues flagged by the methods review.

Why

Maintainer feedback: "too much text" on https://mcvickerlab.github.io/WASP2/. Field comparison (scanpy, MACS3, samtools) showed WASP2 was the only mature bioinformatics package with 5 dedicated methods pages re-deriving textbook statistics. Those peers cite the primary paper and move on; WASP2 was the outlier.

What changed

Deletions (fully redundant material):

`tutorials/statistical_methods_tutorial.ipynb` (1,516 lines) — fully duplicated `methods/statistical_models.rst`
`tutorials/atac_seq_workflow.ipynb` (944 lines) — orphaned from toctree; content covered by `bulk_workflow.rst`
`tutorials/rna_seq.rst` (203 lines) — merged into `bulk_workflow.rst`
`tutorials/quickstart_mapping.rst` (258 lines) — merged into `bulk_workflow.rst`

New consolidated page:

`tutorials/bulk_workflow.rst` (~175 lines) — end-to-end RNA-seq + ATAC-seq bulk pipeline with data-type callouts

Shrinks (bloat removed, content preserved):

`methods/fdr_correction.rst` 258 → 66 lines — cut FWER/FDR primers, BH derivation, q-value derivation; kept BH pointer + PRDS assumption + NaN-handling warning + citations
`methods/dispersion_estimation.rst` 270 → 126 lines — cut Cramer-Rao argument, unused MoM sections, unsourced CV-by-n heuristic
`methods/statistical_models.rst` 307 → 173 lines — cut "why not binomial" motivation + variance-inflation example
`tutorials/comparative_imbalance.rst` 545 → 188 lines — cut duplicate sex/treatment tutorials (same command, different barcode map)
`tutorials/scrna_seq.rst` 333 → 112 lines — cut duplicate Seurat/Scanpy barcode-export code (now single source in `user_guide/single_cell.rst`)
`development.rst` 272 → 124 lines — cut generic black/flake8/pre-commit tutorials

Scientific correctness fixes

Per the methods-reviewer pass (8 issues):

HIGH — Unsupported power table (`statistical_models.rst:276-286`): old table listed μ/N but no ρ. Beta-binomial power depends on ρ; numbers were optimistic. Removed; replaced with a note that power should be simulated at the dataset's own ρ̂.
HIGH — Missing Kumasaka 2016 (RASQUAL) citation: WASP2's BB+LRT+dispersion framework sits directly in RASQUAL's lineage. Added.
HIGH — Unsourced Rust-vs-Python benchmark tables (`mapping_filter.rst` + `counting_algorithm.rst`): no hardware spec, no reproducible script. Replaced with qualitative "substantially faster, measure on your hardware" paragraphs.
MEDIUM — "Theorem" claim in mapping_filter.rst:68-74: P(map|ref)=P(map|alt) was stated without proof or caveat. Holds only under deterministic-aligner assumption. Softened to an "under the following assumption" statement with citation pointer to van de Geijn §Methods.
MEDIUM — Unsourced "Typical Filter Rates" table: RNA 5-15%, ATAC 2-8%, ChIP 3-10%, WGS 1-5% presented as authoritative without source. Replaced with developer-experience prose.
MEDIUM — LRT clarification: old doc didn't state whether ρ is profile-MLE'd or jointly MLE'd. Added explicit note that ρ is held at its null-model MLE.
MEDIUM — NB vs BB reference mismatch: dispersion page cited Robinson 2010 + Yu 2013 (NB) to support BB estimation. Added Kumasaka 2016 as the correct BB reference; labeled Robinson/Yu as "analogous NB literature".
MEDIUM — Missing NaN warning on manual BH: added warning that `np.minimum.accumulate` silently propagates NaN; `scipy.stats.false_discovery_control` raises on it.

Test plan

`sphinx-build` completes without new warnings (only pre-existing autodoc `:class:` warnings on `optional` / `type` / `pysam.` / `polars.`, which existed before this PR and are docstring issues in `src/`)
No broken `:doc:` cross-references (grep-verified: no remaining refs to deleted pages)
`index.rst` toctree updated; `choosing_workflow.rst` decision points updated
All 9 commits are independently reviewable; each touches one page or one concern
CI: ruff + Rust tests + pytest 3.10/3.11/3.12 + CodeQL (pending)

Commits (9)

```
docs: remove redundant statistical_methods_tutorial notebook
docs(methods): shrink fdr_correction from 258 to ~60 lines
docs(methods): shrink dispersion_estimation from 270 to ~110 lines
docs(methods): shrink + fix statistical_models from 307 to ~150 lines
docs(tutorials): shrink comparative_imbalance from 545 to ~175 lines
docs(tutorials): consolidate 3 bulk workflow tutorials into one
docs(methods): scientific correctness fixes to mapping_filter + counting_algorithm
docs(tutorials): shrink scrna_seq from 333 to ~100 lines
docs: shrink development.rst from 272 to ~120 lines
```

Not changed

`methods/mapping_filter.rst` Canonical Filter Contract section — methods reviewer confirmed this is accurate post-PR rust: restore canonical WASP filter contract (van de Geijn 2015) #96 (van de Geijn 2015 contract, only 0x4 filter). Left intact as the most valuable prose on the site.
`user_guide/*.rst` — appropriately scoped already; no bloat found.
`api/*.rst` — autodoc-driven, already concise.
`scatac_workflow.rst` — 156 lines, single-cell specific, no duplication with bulk workflow.

The 1,516-line notebook covered beta-binomial PMF, LRT, MLE, and BH — all content already present in methods/statistical_models.rst, methods/dispersion_estimation.rst, and methods/fdr_correction.rst at appropriate depth. The notebook duplicated equations and prose without adding worked examples on real data. Per field norms (scanpy, MACS3, pysam), software docs should defer statistical derivations to the primary literature rather than re-teaching them alongside the API.

Keep: BH algorithm pointer, scipy API, PRDS assumption, reporting guidance, output-column table, citations. Cut: FWER/FDR definition primers, BH algorithm derivation, manual BH code block, q-value estimator derivation, discrete-FDR alternative section, threshold-selection table, "when to use stricter control" guidance. Added: NaN-propagation warning (pitfall we've hit before). Added: Storey q-value reference preserved as a citation pointer. Rationale: scanpy, MACS3, and samtools docs cite BH in a sentence; they do not re-teach it. WASP2's LRT produces continuous p-values so BH is standard — users looking for a statistics primer will find one in any multiple-testing textbook.

Keep: MLE description + single-model code, linear-model description + code, model-choice table, convergence note, references. Cut: Cramer-Rao-bound argument (textbook statistics not WASP2-specific), MoM variance-based estimator (not used in WASP2), MoM vs MLE comparison table, CV-by-sample-size table (unsourced heuristic), generic "sample-size requirements" guidance. Scientific fixes: - Add Kumasaka 2016 (RASQUAL) as primary BB-dispersion reference - Label Robinson 2010 + Yu 2013 as "analogous NB literature" (they are NB, not BB, so they do not directly support BB dispersion) - Add explicit note that rho is held at its null-model MLE when computing the LRT (removes ambiguity about profile vs joint MLE)

Keep: model definition, LRT formulation, phased/unphased treatment, output-columns table, pseudocount/min-count/aggregation notes. Cut: "Why not binomial" motivation section (covered by 2-sentence intro), variance-inflation numerical example, redundant implementation-section restatement of dispersion code (lives in dispersion_estimation). Scientific fixes (per reviewer pass): - REMOVE the unsupported power table (lines 276-286 of old file). Beta-binomial power depends on rho; the old table varied only mu/N and stated no rho, making the values optimistic for typical genomic dispersion. Replaced with a note that power should be simulated at the dataset's own rho estimate. - ADD Kumasaka 2016 citation (RASQUAL, Nat Genet). WASP2's BB + LRT + pooled-dispersion framework sits directly in RASQUAL's lineage; previous docs cited only Mayba 2014 (MBASED). - ADD van de Geijn 2015 citation (properly placed with the mapping filter pointer). - CLARIFY LRT: rho is held at its null-model MLE when evaluating L_1 (profile likelihood in mu). Removes ambiguity between profile and joint MLE noted by the methods reviewer.

Keep: one worked cell-type example, CLI reference, output-columns table, minimal volcano plot, good practices, common issues. Cut: full duplicate tutorials for sex-differences and treatment-vs- control (same command, different barcode map — one note suffices). Cut: full heatmap code block (links to analysis guide instead). Cut: duplicate Seurat barcode-export R snippet (now single source in user_guide/single_cell.rst). Cut: input-data format over-specification (moved to single_cell guide). Three near-identical tutorial sections collapsed to one example plus "other comparisons use the same command with a different barcode map."

Before: quickstart_mapping.rst (258) + rna_seq.rst (203) + atac_seq_workflow.ipynb (944, orphaned from toctree) = 1405 lines covering overlapping bulk workflows with copy-pasted make-reads / remap / filter-remapped blocks and troubleshooting sections. After: bulk_workflow.rst (~175) covers the full RNA-seq and ATAC-seq bulk pipeline end-to-end (WASP filter + count + analyze) with data-type-specific callouts (GTF vs BED, phased vs unphased). - Deleted: quickstart_mapping.rst, rna_seq.rst, atac_seq_workflow.ipynb - Updated: index.rst toctree, choosing_workflow.rst decision points - No broken cross-references remain (grep-verified). Net: ~1230 lines removed, single canonical bulk walkthrough; scATAC and scRNA tutorials untouched (genuinely distinct workflows).

…ing_algorithm mapping_filter.rst: - Soften the "Theorem" box that claimed P(map|ref) = P(map|alt) after filtering. The equality is approximate, holds under deterministic-aligner assumptions, and lacks a published proof under that framing. Replaced with an "under the following assumption..." statement and pointer to van de Geijn 2015 §Methods. - Replace the unsourced Rust-vs-Python benchmark table ("1M reads ~5min vs ~30s" etc.) with a qualitative description. The table had no hardware spec, no reproducible script, no dataset — a reviewer would flag it. - Replace the unsourced "Typical Filter Rates by Data Type" table (RNA 5-15%, ATAC 2-8%, ChIP 3-10%, WGS 1-5%) with a qualitative developer-experience paragraph. The tabulated ranges were stated as authoritative without a source. - Add the [vandeGeijn2015]_ reference definition (was cited but not defined; would trigger a Sphinx warning). counting_algorithm.rst: - Replace the unsourced counting-benchmark table (~45s vs 5s, etc.) with a qualitative paragraph. Same issue as the mapping-filter benchmark table.

Keep: command-line recipe for count → per-celltype imbalance → optional compare, interpretation snippet, troubleshooting for barcode-suffix mismatches and sparsity. Cut: duplicate Seurat + Scanpy barcode-export code blocks (these now live canonically in user_guide/single_cell.rst, referenced by link). Cut: duplicate Cell Ranger output-tree diagram and BAM CB-tag description (repeated in scatac_workflow.rst + user_guide/single_cell). Cut: overlong troubleshooting and next-steps sections.

Keep: setup, code-standards summary, test/mypy commands, Rust layer notes + maturin build recipe, project layout, release flow, branching policy (master/dev promotion, feature-branch-off-dev rule). Cut: generic black/flake8/pytest tutorials (these tools have their own docs), step-by-step PR walkthrough, AI-assisted-development section (link-to-nowhere — seqera_ai_integration doc not served), obsolete "WASP2-exp" repo paths. Added: Rust parity-test requirement, explicit PyPI-OIDC + Docker publish flow, dev → master branching policy learned this session.

…rsion/FDR Rewrite statistical_models.rst as single ~180-line page covering: - Beta-binomial model (unchanged essentials) - LRT with profile-likelihood convention (ρ held at null MLE, df=1) - Dispersion: single vs linear, AIC/BIC choice, ρ bounds (1e-6, 1-1e-6), logit clip (±10) — all WASP2-specific implementation contracts not in any paper - Phased vs unphased (uniform prior over 2^(n-1) configurations) - BH via scipy.stats.false_discovery_control - WASP2 defaults: pseudocount=1, min_count=10 - NaN-propagation warning for manual BH (matches MEMORY.md pitfall) - Output columns Extends mapping_filter.rst canonical-filter contract section with a single paragraph stating the same contract applies at the counting step (only 0x4 unmapped filter). This was previously buried in counting_algorithm.rst:69-75. Preserves four WASP2-unique contracts flagged by both the sophia decision-auditor and co-scientist hypothesis-critic: 1. Canonical filter contract at count step 2. Profile-LRT ρ convention (fix from PR #99) 3. BH-NaN trap warning 4. Default parameter values + ρ/logit bounds Next commit will delete counting_algorithm.rst, dispersion_estimation.rst, fdr_correction.rst and update toctree + cross-refs.

…e contracts) (#103) * docs(methods): consolidate statistical_models + absorb counting/dispersion/FDR Rewrite statistical_models.rst as single ~180-line page covering: - Beta-binomial model (unchanged essentials) - LRT with profile-likelihood convention (ρ held at null MLE, df=1) - Dispersion: single vs linear, AIC/BIC choice, ρ bounds (1e-6, 1-1e-6), logit clip (±10) — all WASP2-specific implementation contracts not in any paper - Phased vs unphased (uniform prior over 2^(n-1) configurations) - BH via scipy.stats.false_discovery_control - WASP2 defaults: pseudocount=1, min_count=10 - NaN-propagation warning for manual BH (matches MEMORY.md pitfall) - Output columns Extends mapping_filter.rst canonical-filter contract section with a single paragraph stating the same contract applies at the counting step (only 0x4 unmapped filter). This was previously buried in counting_algorithm.rst:69-75. Preserves four WASP2-unique contracts flagged by both the sophia decision-auditor and co-scientist hypothesis-critic: 1. Canonical filter contract at count step 2. Profile-LRT ρ convention (fix from PR #99) 3. BH-NaN trap warning 4. Default parameter values + ρ/logit bounds Next commit will delete counting_algorithm.rst, dispersion_estimation.rst, fdr_correction.rst and update toctree + cross-refs. * docs(methods): delete 3 consolidated pages + methods/index landing Removed: counting_algorithm.rst, dispersion_estimation.rst, fdr_correction.rst, methods/index.rst. All load-bearing WASP2-unique content (canonical filter contract at count step, profile-LRT ρ convention, BH-NaN trap, default values, ρ/logit bounds) was absorbed into statistical_models.rst and mapping_filter.rst in the previous commit. Updated 2 inbound cross-references: - faq.rst: methods/fdr_correction → methods/statistical_models - tutorials/bulk_workflow.rst: removed duplicate fdr_correction ref, kept single statistical_models ref Top-level index.rst toctree reduced from 6 methods entries to 2. Removed stale "Statistical Methods" landing caption was replaced by the main index.rst's own :caption: Statistical Methods toctree. Side benefit: eliminates the 3 pre-existing Sphinx duplicate-citation warnings (Kumasaka2016, vandeGeijn2015) by removing the duplicate reference definitions in methods/index.rst and the deleted pages. Methods section: 5 pages (883 lines) → 2 pages (~500 lines), a further 43% reduction on top of Tier-B + Tier-D. Agreed reframe from sophia decision-auditor + co-scientist hypothesis-critic review of the "delete all 4" proposal: consolidate instead of erase, to preserve WASP2-specific operational contracts not present in van de Geijn 2015 or in cited textbooks.

…_guide/analysis Per user direction, the consolidated statistical_models.rst is removed. Methods section is now a single page: mapping_filter.rst (canonical WASP filter contract, van de Geijn 2015). Rescued into user_guide/analysis.rst under a new "Defaults and contracts" section (~25 lines) to avoid losing the four WASP2-unique contracts that both the sophia decision-auditor and co-scientist hypothesis-critic flagged as load-bearing when the original "delete the stats stuff" proposal was evaluated: 1. pseudocount=1 + min_count=10 defaults 2. Profile-LRT ρ convention (ρ at null-MLE, df=1) — the scientific fix from PR #99 3. ρ bounds (1e-6, 1-1e-6) and linear-dispersion logit clip (±10) 4. BH-NaN trap warning (scipy raises; manual loop silently propagates) Plus pointer to mapping_filter.rst for the canonical filter contract that applies to both remapping and counting. Updated 6 inbound :doc: references: - index.rst toctree (removed entry) - faq.rst → points to user_guide/analysis - tutorials/bulk_workflow.rst (2 refs) → user_guide/analysis - tutorials/single_cell_workflow.rst → methods/mapping_filter - methods/mapping_filter.rst → user_guide/analysis Sphinx build clean, zero warnings.

…_guide/analysis (#107) Per user direction, the consolidated statistical_models.rst is removed. Methods section is now a single page: mapping_filter.rst (canonical WASP filter contract, van de Geijn 2015). Rescued into user_guide/analysis.rst under a new "Defaults and contracts" section (~25 lines) to avoid losing the four WASP2-unique contracts that both the sophia decision-auditor and co-scientist hypothesis-critic flagged as load-bearing when the original "delete the stats stuff" proposal was evaluated: 1. pseudocount=1 + min_count=10 defaults 2. Profile-LRT ρ convention (ρ at null-MLE, df=1) — the scientific fix from PR #99 3. ρ bounds (1e-6, 1-1e-6) and linear-dispersion logit clip (±10) 4. BH-NaN trap warning (scipy raises; manual loop silently propagates) Plus pointer to mapping_filter.rst for the canonical filter contract that applies to both remapping and counting. Updated 6 inbound :doc: references: - index.rst toctree (removed entry) - faq.rst → points to user_guide/analysis - tutorials/bulk_workflow.rst (2 refs) → user_guide/analysis - tutorials/single_cell_workflow.rst → methods/mapping_filter - methods/mapping_filter.rst → user_guide/analysis Sphinx build clean, zero warnings.

Jaureguy760 added 9 commits April 18, 2026 19:26

Jaureguy760 merged commit cea62f4 into dev Apr 19, 2026
6 checks passed

Jaureguy760 deleted the feat/docs-reduction-tier-b branch April 19, 2026 02:42

This was referenced Apr 19, 2026

chore: promote dev → master (docs reduction pass) #100

Merged

docs(tutorials): Tier-D reduction — 6 getting-started pages to 3 #101

Merged

chore: promote dev → master (Tier-D tutorial reduction) #102

Merged

Jaureguy760 mentioned this pull request Apr 19, 2026

docs(methods): consolidate 5 methods pages to 2 (preserve WASP2-unique contracts) #103

Merged

4 tasks

Jaureguy760 mentioned this pull request Apr 19, 2026

chore: promote dev → master (methods consolidation) #104

Merged

3 tasks

Jaureguy760 mentioned this pull request Apr 19, 2026

docs: remove methods/statistical_models page; rescue WASP2 contracts to user_guide/analysis #107

Merged

2 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

docs: reduce text volume (~47%) + fix scientific correctness issues#99

docs: reduce text volume (~47%) + fix scientific correctness issues#99
Jaureguy760 merged 9 commits intodevfrom
feat/docs-reduction-tier-b

Jaureguy760 commented Apr 19, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

Jaureguy760 commented Apr 19, 2026

Summary

Why

What changed

Scientific correctness fixes

Test plan

Commits (9)

Not changed

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant