docs: reduce text volume (~47%) + fix scientific correctness issues#99
Merged
Jaureguy760 merged 9 commits intodevfrom Apr 19, 2026
Merged
docs: reduce text volume (~47%) + fix scientific correctness issues#99Jaureguy760 merged 9 commits intodevfrom
Jaureguy760 merged 9 commits intodevfrom
Conversation
The 1,516-line notebook covered beta-binomial PMF, LRT, MLE, and BH — all content already present in methods/statistical_models.rst, methods/dispersion_estimation.rst, and methods/fdr_correction.rst at appropriate depth. The notebook duplicated equations and prose without adding worked examples on real data. Per field norms (scanpy, MACS3, pysam), software docs should defer statistical derivations to the primary literature rather than re-teaching them alongside the API.
Keep: BH algorithm pointer, scipy API, PRDS assumption, reporting guidance, output-column table, citations. Cut: FWER/FDR definition primers, BH algorithm derivation, manual BH code block, q-value estimator derivation, discrete-FDR alternative section, threshold-selection table, "when to use stricter control" guidance. Added: NaN-propagation warning (pitfall we've hit before). Added: Storey q-value reference preserved as a citation pointer. Rationale: scanpy, MACS3, and samtools docs cite BH in a sentence; they do not re-teach it. WASP2's LRT produces continuous p-values so BH is standard — users looking for a statistics primer will find one in any multiple-testing textbook.
Keep: MLE description + single-model code, linear-model description + code, model-choice table, convergence note, references. Cut: Cramer-Rao-bound argument (textbook statistics not WASP2-specific), MoM variance-based estimator (not used in WASP2), MoM vs MLE comparison table, CV-by-sample-size table (unsourced heuristic), generic "sample-size requirements" guidance. Scientific fixes: - Add Kumasaka 2016 (RASQUAL) as primary BB-dispersion reference - Label Robinson 2010 + Yu 2013 as "analogous NB literature" (they are NB, not BB, so they do not directly support BB dispersion) - Add explicit note that rho is held at its null-model MLE when computing the LRT (removes ambiguity about profile vs joint MLE)
Keep: model definition, LRT formulation, phased/unphased treatment, output-columns table, pseudocount/min-count/aggregation notes. Cut: "Why not binomial" motivation section (covered by 2-sentence intro), variance-inflation numerical example, redundant implementation-section restatement of dispersion code (lives in dispersion_estimation). Scientific fixes (per reviewer pass): - REMOVE the unsupported power table (lines 276-286 of old file). Beta-binomial power depends on rho; the old table varied only mu/N and stated no rho, making the values optimistic for typical genomic dispersion. Replaced with a note that power should be simulated at the dataset's own rho estimate. - ADD Kumasaka 2016 citation (RASQUAL, Nat Genet). WASP2's BB + LRT + pooled-dispersion framework sits directly in RASQUAL's lineage; previous docs cited only Mayba 2014 (MBASED). - ADD van de Geijn 2015 citation (properly placed with the mapping filter pointer). - CLARIFY LRT: rho is held at its null-model MLE when evaluating L_1 (profile likelihood in mu). Removes ambiguity between profile and joint MLE noted by the methods reviewer.
Keep: one worked cell-type example, CLI reference, output-columns table, minimal volcano plot, good practices, common issues. Cut: full duplicate tutorials for sex-differences and treatment-vs- control (same command, different barcode map — one note suffices). Cut: full heatmap code block (links to analysis guide instead). Cut: duplicate Seurat barcode-export R snippet (now single source in user_guide/single_cell.rst). Cut: input-data format over-specification (moved to single_cell guide). Three near-identical tutorial sections collapsed to one example plus "other comparisons use the same command with a different barcode map."
Before: quickstart_mapping.rst (258) + rna_seq.rst (203) + atac_seq_workflow.ipynb (944, orphaned from toctree) = 1405 lines covering overlapping bulk workflows with copy-pasted make-reads / remap / filter-remapped blocks and troubleshooting sections. After: bulk_workflow.rst (~175) covers the full RNA-seq and ATAC-seq bulk pipeline end-to-end (WASP filter + count + analyze) with data-type-specific callouts (GTF vs BED, phased vs unphased). - Deleted: quickstart_mapping.rst, rna_seq.rst, atac_seq_workflow.ipynb - Updated: index.rst toctree, choosing_workflow.rst decision points - No broken cross-references remain (grep-verified). Net: ~1230 lines removed, single canonical bulk walkthrough; scATAC and scRNA tutorials untouched (genuinely distinct workflows).
…ing_algorithm
mapping_filter.rst:
- Soften the "Theorem" box that claimed P(map|ref) = P(map|alt)
after filtering. The equality is approximate, holds under
deterministic-aligner assumptions, and lacks a published proof
under that framing. Replaced with an "under the following
assumption..." statement and pointer to van de Geijn 2015 §Methods.
- Replace the unsourced Rust-vs-Python benchmark table ("1M reads
~5min vs ~30s" etc.) with a qualitative description. The table
had no hardware spec, no reproducible script, no dataset — a
reviewer would flag it.
- Replace the unsourced "Typical Filter Rates by Data Type" table
(RNA 5-15%, ATAC 2-8%, ChIP 3-10%, WGS 1-5%) with a qualitative
developer-experience paragraph. The tabulated ranges were stated
as authoritative without a source.
- Add the [vandeGeijn2015]_ reference definition (was cited but
not defined; would trigger a Sphinx warning).
counting_algorithm.rst:
- Replace the unsourced counting-benchmark table (~45s vs 5s, etc.)
with a qualitative paragraph. Same issue as the mapping-filter
benchmark table.
Keep: command-line recipe for count → per-celltype imbalance → optional compare, interpretation snippet, troubleshooting for barcode-suffix mismatches and sparsity. Cut: duplicate Seurat + Scanpy barcode-export code blocks (these now live canonically in user_guide/single_cell.rst, referenced by link). Cut: duplicate Cell Ranger output-tree diagram and BAM CB-tag description (repeated in scatac_workflow.rst + user_guide/single_cell). Cut: overlong troubleshooting and next-steps sections.
Keep: setup, code-standards summary, test/mypy commands, Rust layer notes + maturin build recipe, project layout, release flow, branching policy (master/dev promotion, feature-branch-off-dev rule). Cut: generic black/flake8/pytest tutorials (these tools have their own docs), step-by-step PR walkthrough, AI-assisted-development section (link-to-nowhere — seqera_ai_integration doc not served), obsolete "WASP2-exp" repo paths. Added: Rust parity-test requirement, explicit PyPI-OIDC + Docker publish flow, dev → master branching policy learned this session.
This was referenced Apr 19, 2026
Jaureguy760
added a commit
that referenced
this pull request
Apr 19, 2026
…rsion/FDR Rewrite statistical_models.rst as single ~180-line page covering: - Beta-binomial model (unchanged essentials) - LRT with profile-likelihood convention (ρ held at null MLE, df=1) - Dispersion: single vs linear, AIC/BIC choice, ρ bounds (1e-6, 1-1e-6), logit clip (±10) — all WASP2-specific implementation contracts not in any paper - Phased vs unphased (uniform prior over 2^(n-1) configurations) - BH via scipy.stats.false_discovery_control - WASP2 defaults: pseudocount=1, min_count=10 - NaN-propagation warning for manual BH (matches MEMORY.md pitfall) - Output columns Extends mapping_filter.rst canonical-filter contract section with a single paragraph stating the same contract applies at the counting step (only 0x4 unmapped filter). This was previously buried in counting_algorithm.rst:69-75. Preserves four WASP2-unique contracts flagged by both the sophia decision-auditor and co-scientist hypothesis-critic: 1. Canonical filter contract at count step 2. Profile-LRT ρ convention (fix from PR #99) 3. BH-NaN trap warning 4. Default parameter values + ρ/logit bounds Next commit will delete counting_algorithm.rst, dispersion_estimation.rst, fdr_correction.rst and update toctree + cross-refs.
4 tasks
Jaureguy760
added a commit
that referenced
this pull request
Apr 19, 2026
…e contracts) (#103) * docs(methods): consolidate statistical_models + absorb counting/dispersion/FDR Rewrite statistical_models.rst as single ~180-line page covering: - Beta-binomial model (unchanged essentials) - LRT with profile-likelihood convention (ρ held at null MLE, df=1) - Dispersion: single vs linear, AIC/BIC choice, ρ bounds (1e-6, 1-1e-6), logit clip (±10) — all WASP2-specific implementation contracts not in any paper - Phased vs unphased (uniform prior over 2^(n-1) configurations) - BH via scipy.stats.false_discovery_control - WASP2 defaults: pseudocount=1, min_count=10 - NaN-propagation warning for manual BH (matches MEMORY.md pitfall) - Output columns Extends mapping_filter.rst canonical-filter contract section with a single paragraph stating the same contract applies at the counting step (only 0x4 unmapped filter). This was previously buried in counting_algorithm.rst:69-75. Preserves four WASP2-unique contracts flagged by both the sophia decision-auditor and co-scientist hypothesis-critic: 1. Canonical filter contract at count step 2. Profile-LRT ρ convention (fix from PR #99) 3. BH-NaN trap warning 4. Default parameter values + ρ/logit bounds Next commit will delete counting_algorithm.rst, dispersion_estimation.rst, fdr_correction.rst and update toctree + cross-refs. * docs(methods): delete 3 consolidated pages + methods/index landing Removed: counting_algorithm.rst, dispersion_estimation.rst, fdr_correction.rst, methods/index.rst. All load-bearing WASP2-unique content (canonical filter contract at count step, profile-LRT ρ convention, BH-NaN trap, default values, ρ/logit bounds) was absorbed into statistical_models.rst and mapping_filter.rst in the previous commit. Updated 2 inbound cross-references: - faq.rst: methods/fdr_correction → methods/statistical_models - tutorials/bulk_workflow.rst: removed duplicate fdr_correction ref, kept single statistical_models ref Top-level index.rst toctree reduced from 6 methods entries to 2. Removed stale "Statistical Methods" landing caption was replaced by the main index.rst's own :caption: Statistical Methods toctree. Side benefit: eliminates the 3 pre-existing Sphinx duplicate-citation warnings (Kumasaka2016, vandeGeijn2015) by removing the duplicate reference definitions in methods/index.rst and the deleted pages. Methods section: 5 pages (883 lines) → 2 pages (~500 lines), a further 43% reduction on top of Tier-B + Tier-D. Agreed reframe from sophia decision-auditor + co-scientist hypothesis-critic review of the "delete all 4" proposal: consolidate instead of erase, to preserve WASP2-specific operational contracts not present in van de Geijn 2015 or in cited textbooks.
3 tasks
Jaureguy760
added a commit
that referenced
this pull request
Apr 19, 2026
…_guide/analysis Per user direction, the consolidated statistical_models.rst is removed. Methods section is now a single page: mapping_filter.rst (canonical WASP filter contract, van de Geijn 2015). Rescued into user_guide/analysis.rst under a new "Defaults and contracts" section (~25 lines) to avoid losing the four WASP2-unique contracts that both the sophia decision-auditor and co-scientist hypothesis-critic flagged as load-bearing when the original "delete the stats stuff" proposal was evaluated: 1. pseudocount=1 + min_count=10 defaults 2. Profile-LRT ρ convention (ρ at null-MLE, df=1) — the scientific fix from PR #99 3. ρ bounds (1e-6, 1-1e-6) and linear-dispersion logit clip (±10) 4. BH-NaN trap warning (scipy raises; manual loop silently propagates) Plus pointer to mapping_filter.rst for the canonical filter contract that applies to both remapping and counting. Updated 6 inbound :doc: references: - index.rst toctree (removed entry) - faq.rst → points to user_guide/analysis - tutorials/bulk_workflow.rst (2 refs) → user_guide/analysis - tutorials/single_cell_workflow.rst → methods/mapping_filter - methods/mapping_filter.rst → user_guide/analysis Sphinx build clean, zero warnings.
Merged
2 tasks
Jaureguy760
added a commit
that referenced
this pull request
Apr 19, 2026
…_guide/analysis (#107) Per user direction, the consolidated statistical_models.rst is removed. Methods section is now a single page: mapping_filter.rst (canonical WASP filter contract, van de Geijn 2015). Rescued into user_guide/analysis.rst under a new "Defaults and contracts" section (~25 lines) to avoid losing the four WASP2-unique contracts that both the sophia decision-auditor and co-scientist hypothesis-critic flagged as load-bearing when the original "delete the stats stuff" proposal was evaluated: 1. pseudocount=1 + min_count=10 defaults 2. Profile-LRT ρ convention (ρ at null-MLE, df=1) — the scientific fix from PR #99 3. ρ bounds (1e-6, 1-1e-6) and linear-dispersion logit clip (±10) 4. BH-NaN trap warning (scipy raises; manual loop silently propagates) Plus pointer to mapping_filter.rst for the canonical filter contract that applies to both remapping and counting. Updated 6 inbound :doc: references: - index.rst toctree (removed entry) - faq.rst → points to user_guide/analysis - tutorials/bulk_workflow.rst (2 refs) → user_guide/analysis - tutorials/single_cell_workflow.rst → methods/mapping_filter - methods/mapping_filter.rst → user_guide/analysis Sphinx build clean, zero warnings.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Docs reduction pass based on multi-agent review (structural triage, editorial reviewer-simulator, scientific methods reviewer, comparison to scanpy / MACS3 / pysam norms). Takes the docs from ~6,300 lines to ~3,360 — 46.7% reduction — while fixing 8 scientific-correctness issues flagged by the methods review.
Why
Maintainer feedback: "too much text" on https://mcvickerlab.github.io/WASP2/. Field comparison (scanpy, MACS3, samtools) showed WASP2 was the only mature bioinformatics package with 5 dedicated methods pages re-deriving textbook statistics. Those peers cite the primary paper and move on; WASP2 was the outlier.
What changed
Deletions (fully redundant material):
New consolidated page:
Shrinks (bloat removed, content preserved):
Scientific correctness fixes
Per the methods-reviewer pass (8 issues):
Test plan
Commits (9)
```
docs: remove redundant statistical_methods_tutorial notebook
docs(methods): shrink fdr_correction from 258 to ~60 lines
docs(methods): shrink dispersion_estimation from 270 to ~110 lines
docs(methods): shrink + fix statistical_models from 307 to ~150 lines
docs(tutorials): shrink comparative_imbalance from 545 to ~175 lines
docs(tutorials): consolidate 3 bulk workflow tutorials into one
docs(methods): scientific correctness fixes to mapping_filter + counting_algorithm
docs(tutorials): shrink scrna_seq from 333 to ~100 lines
docs: shrink development.rst from 272 to ~120 lines
```
Not changed