Skip to content

docs: reduce text volume (~47%) + fix scientific correctness issues#99

Merged
Jaureguy760 merged 9 commits intodevfrom
feat/docs-reduction-tier-b
Apr 19, 2026
Merged

docs: reduce text volume (~47%) + fix scientific correctness issues#99
Jaureguy760 merged 9 commits intodevfrom
feat/docs-reduction-tier-b

Conversation

@Jaureguy760
Copy link
Copy Markdown
Collaborator

Summary

Docs reduction pass based on multi-agent review (structural triage, editorial reviewer-simulator, scientific methods reviewer, comparison to scanpy / MACS3 / pysam norms). Takes the docs from ~6,300 lines to ~3,360 — 46.7% reduction — while fixing 8 scientific-correctness issues flagged by the methods review.

Why

Maintainer feedback: "too much text" on https://mcvickerlab.github.io/WASP2/. Field comparison (scanpy, MACS3, samtools) showed WASP2 was the only mature bioinformatics package with 5 dedicated methods pages re-deriving textbook statistics. Those peers cite the primary paper and move on; WASP2 was the outlier.

What changed

Deletions (fully redundant material):

  • `tutorials/statistical_methods_tutorial.ipynb` (1,516 lines) — fully duplicated `methods/statistical_models.rst`
  • `tutorials/atac_seq_workflow.ipynb` (944 lines) — orphaned from toctree; content covered by `bulk_workflow.rst`
  • `tutorials/rna_seq.rst` (203 lines) — merged into `bulk_workflow.rst`
  • `tutorials/quickstart_mapping.rst` (258 lines) — merged into `bulk_workflow.rst`

New consolidated page:

  • `tutorials/bulk_workflow.rst` (~175 lines) — end-to-end RNA-seq + ATAC-seq bulk pipeline with data-type callouts

Shrinks (bloat removed, content preserved):

  • `methods/fdr_correction.rst` 258 → 66 lines — cut FWER/FDR primers, BH derivation, q-value derivation; kept BH pointer + PRDS assumption + NaN-handling warning + citations
  • `methods/dispersion_estimation.rst` 270 → 126 lines — cut Cramer-Rao argument, unused MoM sections, unsourced CV-by-n heuristic
  • `methods/statistical_models.rst` 307 → 173 lines — cut "why not binomial" motivation + variance-inflation example
  • `tutorials/comparative_imbalance.rst` 545 → 188 lines — cut duplicate sex/treatment tutorials (same command, different barcode map)
  • `tutorials/scrna_seq.rst` 333 → 112 lines — cut duplicate Seurat/Scanpy barcode-export code (now single source in `user_guide/single_cell.rst`)
  • `development.rst` 272 → 124 lines — cut generic black/flake8/pre-commit tutorials

Scientific correctness fixes

Per the methods-reviewer pass (8 issues):

  1. HIGH — Unsupported power table (`statistical_models.rst:276-286`): old table listed μ/N but no ρ. Beta-binomial power depends on ρ; numbers were optimistic. Removed; replaced with a note that power should be simulated at the dataset's own ρ̂.
  2. HIGH — Missing Kumasaka 2016 (RASQUAL) citation: WASP2's BB+LRT+dispersion framework sits directly in RASQUAL's lineage. Added.
  3. HIGH — Unsourced Rust-vs-Python benchmark tables (`mapping_filter.rst` + `counting_algorithm.rst`): no hardware spec, no reproducible script. Replaced with qualitative "substantially faster, measure on your hardware" paragraphs.
  4. MEDIUM — "Theorem" claim in mapping_filter.rst:68-74: P(map|ref)=P(map|alt) was stated without proof or caveat. Holds only under deterministic-aligner assumption. Softened to an "under the following assumption" statement with citation pointer to van de Geijn §Methods.
  5. MEDIUM — Unsourced "Typical Filter Rates" table: RNA 5-15%, ATAC 2-8%, ChIP 3-10%, WGS 1-5% presented as authoritative without source. Replaced with developer-experience prose.
  6. MEDIUM — LRT clarification: old doc didn't state whether ρ is profile-MLE'd or jointly MLE'd. Added explicit note that ρ is held at its null-model MLE.
  7. MEDIUM — NB vs BB reference mismatch: dispersion page cited Robinson 2010 + Yu 2013 (NB) to support BB estimation. Added Kumasaka 2016 as the correct BB reference; labeled Robinson/Yu as "analogous NB literature".
  8. MEDIUM — Missing NaN warning on manual BH: added warning that `np.minimum.accumulate` silently propagates NaN; `scipy.stats.false_discovery_control` raises on it.

Test plan

  • `sphinx-build` completes without new warnings (only pre-existing autodoc `:class:` warnings on `optional` / `type` / `pysam.` / `polars.`, which existed before this PR and are docstring issues in `src/`)
  • No broken `:doc:` cross-references (grep-verified: no remaining refs to deleted pages)
  • `index.rst` toctree updated; `choosing_workflow.rst` decision points updated
  • All 9 commits are independently reviewable; each touches one page or one concern
  • CI: ruff + Rust tests + pytest 3.10/3.11/3.12 + CodeQL (pending)

Commits (9)

```
docs: remove redundant statistical_methods_tutorial notebook
docs(methods): shrink fdr_correction from 258 to ~60 lines
docs(methods): shrink dispersion_estimation from 270 to ~110 lines
docs(methods): shrink + fix statistical_models from 307 to ~150 lines
docs(tutorials): shrink comparative_imbalance from 545 to ~175 lines
docs(tutorials): consolidate 3 bulk workflow tutorials into one
docs(methods): scientific correctness fixes to mapping_filter + counting_algorithm
docs(tutorials): shrink scrna_seq from 333 to ~100 lines
docs: shrink development.rst from 272 to ~120 lines
```

Not changed

  • `methods/mapping_filter.rst` Canonical Filter Contract section — methods reviewer confirmed this is accurate post-PR rust: restore canonical WASP filter contract (van de Geijn 2015) #96 (van de Geijn 2015 contract, only 0x4 filter). Left intact as the most valuable prose on the site.
  • `user_guide/*.rst` — appropriately scoped already; no bloat found.
  • `api/*.rst` — autodoc-driven, already concise.
  • `scatac_workflow.rst` — 156 lines, single-cell specific, no duplication with bulk workflow.

The 1,516-line notebook covered beta-binomial PMF, LRT, MLE, and
BH — all content already present in methods/statistical_models.rst,
methods/dispersion_estimation.rst, and methods/fdr_correction.rst
at appropriate depth. The notebook duplicated equations and prose
without adding worked examples on real data.

Per field norms (scanpy, MACS3, pysam), software docs should defer
statistical derivations to the primary literature rather than
re-teaching them alongside the API.
Keep: BH algorithm pointer, scipy API, PRDS assumption, reporting guidance,
output-column table, citations.

Cut: FWER/FDR definition primers, BH algorithm derivation, manual BH code
block, q-value estimator derivation, discrete-FDR alternative section,
threshold-selection table, "when to use stricter control" guidance.

Added: NaN-propagation warning (pitfall we've hit before).
Added: Storey q-value reference preserved as a citation pointer.

Rationale: scanpy, MACS3, and samtools docs cite BH in a sentence; they do
not re-teach it. WASP2's LRT produces continuous p-values so BH is standard —
users looking for a statistics primer will find one in any multiple-testing
textbook.
Keep: MLE description + single-model code, linear-model description +
code, model-choice table, convergence note, references.

Cut: Cramer-Rao-bound argument (textbook statistics not WASP2-specific),
MoM variance-based estimator (not used in WASP2), MoM vs MLE comparison
table, CV-by-sample-size table (unsourced heuristic), generic
"sample-size requirements" guidance.

Scientific fixes:
- Add Kumasaka 2016 (RASQUAL) as primary BB-dispersion reference
- Label Robinson 2010 + Yu 2013 as "analogous NB literature" (they
  are NB, not BB, so they do not directly support BB dispersion)
- Add explicit note that rho is held at its null-model MLE when
  computing the LRT (removes ambiguity about profile vs joint MLE)
Keep: model definition, LRT formulation, phased/unphased treatment,
output-columns table, pseudocount/min-count/aggregation notes.

Cut: "Why not binomial" motivation section (covered by 2-sentence intro),
variance-inflation numerical example, redundant implementation-section
restatement of dispersion code (lives in dispersion_estimation).

Scientific fixes (per reviewer pass):

- REMOVE the unsupported power table (lines 276-286 of old file).
  Beta-binomial power depends on rho; the old table varied only mu/N
  and stated no rho, making the values optimistic for typical genomic
  dispersion. Replaced with a note that power should be simulated at
  the dataset's own rho estimate.

- ADD Kumasaka 2016 citation (RASQUAL, Nat Genet). WASP2's BB + LRT +
  pooled-dispersion framework sits directly in RASQUAL's lineage;
  previous docs cited only Mayba 2014 (MBASED).

- ADD van de Geijn 2015 citation (properly placed with the mapping
  filter pointer).

- CLARIFY LRT: rho is held at its null-model MLE when evaluating
  L_1 (profile likelihood in mu). Removes ambiguity between profile
  and joint MLE noted by the methods reviewer.
Keep: one worked cell-type example, CLI reference, output-columns
table, minimal volcano plot, good practices, common issues.

Cut: full duplicate tutorials for sex-differences and treatment-vs-
control (same command, different barcode map — one note suffices).
Cut: full heatmap code block (links to analysis guide instead).
Cut: duplicate Seurat barcode-export R snippet (now single source in
user_guide/single_cell.rst).
Cut: input-data format over-specification (moved to single_cell guide).

Three near-identical tutorial sections collapsed to one example plus
"other comparisons use the same command with a different barcode map."
Before: quickstart_mapping.rst (258) + rna_seq.rst (203) +
atac_seq_workflow.ipynb (944, orphaned from toctree) = 1405 lines
covering overlapping bulk workflows with copy-pasted make-reads /
remap / filter-remapped blocks and troubleshooting sections.

After: bulk_workflow.rst (~175) covers the full RNA-seq and ATAC-seq
bulk pipeline end-to-end (WASP filter + count + analyze) with
data-type-specific callouts (GTF vs BED, phased vs unphased).

- Deleted: quickstart_mapping.rst, rna_seq.rst, atac_seq_workflow.ipynb
- Updated: index.rst toctree, choosing_workflow.rst decision points
- No broken cross-references remain (grep-verified).

Net: ~1230 lines removed, single canonical bulk walkthrough; scATAC
and scRNA tutorials untouched (genuinely distinct workflows).
…ing_algorithm

mapping_filter.rst:

- Soften the "Theorem" box that claimed P(map|ref) = P(map|alt)
  after filtering. The equality is approximate, holds under
  deterministic-aligner assumptions, and lacks a published proof
  under that framing. Replaced with an "under the following
  assumption..." statement and pointer to van de Geijn 2015 §Methods.

- Replace the unsourced Rust-vs-Python benchmark table ("1M reads
  ~5min vs ~30s" etc.) with a qualitative description. The table
  had no hardware spec, no reproducible script, no dataset — a
  reviewer would flag it.

- Replace the unsourced "Typical Filter Rates by Data Type" table
  (RNA 5-15%, ATAC 2-8%, ChIP 3-10%, WGS 1-5%) with a qualitative
  developer-experience paragraph. The tabulated ranges were stated
  as authoritative without a source.

- Add the [vandeGeijn2015]_ reference definition (was cited but
  not defined; would trigger a Sphinx warning).

counting_algorithm.rst:

- Replace the unsourced counting-benchmark table (~45s vs 5s, etc.)
  with a qualitative paragraph. Same issue as the mapping-filter
  benchmark table.
Keep: command-line recipe for count → per-celltype imbalance →
optional compare, interpretation snippet, troubleshooting for
barcode-suffix mismatches and sparsity.

Cut: duplicate Seurat + Scanpy barcode-export code blocks (these
now live canonically in user_guide/single_cell.rst, referenced by
link).

Cut: duplicate Cell Ranger output-tree diagram and BAM CB-tag
description (repeated in scatac_workflow.rst + user_guide/single_cell).
Cut: overlong troubleshooting and next-steps sections.
Keep: setup, code-standards summary, test/mypy commands, Rust layer
notes + maturin build recipe, project layout, release flow, branching
policy (master/dev promotion, feature-branch-off-dev rule).

Cut: generic black/flake8/pytest tutorials (these tools have their
own docs), step-by-step PR walkthrough, AI-assisted-development
section (link-to-nowhere — seqera_ai_integration doc not served),
obsolete "WASP2-exp" repo paths.

Added: Rust parity-test requirement, explicit PyPI-OIDC + Docker
publish flow, dev → master branching policy learned this session.
@Jaureguy760 Jaureguy760 merged commit cea62f4 into dev Apr 19, 2026
6 checks passed
@Jaureguy760 Jaureguy760 deleted the feat/docs-reduction-tier-b branch April 19, 2026 02:42
Jaureguy760 added a commit that referenced this pull request Apr 19, 2026
…rsion/FDR

Rewrite statistical_models.rst as single ~180-line page covering:
- Beta-binomial model (unchanged essentials)
- LRT with profile-likelihood convention (ρ held at null MLE, df=1)
- Dispersion: single vs linear, AIC/BIC choice, ρ bounds (1e-6, 1-1e-6),
  logit clip (±10) — all WASP2-specific implementation contracts not
  in any paper
- Phased vs unphased (uniform prior over 2^(n-1) configurations)
- BH via scipy.stats.false_discovery_control
- WASP2 defaults: pseudocount=1, min_count=10
- NaN-propagation warning for manual BH (matches MEMORY.md pitfall)
- Output columns

Extends mapping_filter.rst canonical-filter contract section with a
single paragraph stating the same contract applies at the counting
step (only 0x4 unmapped filter). This was previously buried in
counting_algorithm.rst:69-75.

Preserves four WASP2-unique contracts flagged by both the sophia
decision-auditor and co-scientist hypothesis-critic:
1. Canonical filter contract at count step
2. Profile-LRT ρ convention (fix from PR #99)
3. BH-NaN trap warning
4. Default parameter values + ρ/logit bounds

Next commit will delete counting_algorithm.rst, dispersion_estimation.rst,
fdr_correction.rst and update toctree + cross-refs.
Jaureguy760 added a commit that referenced this pull request Apr 19, 2026
…e contracts) (#103)

* docs(methods): consolidate statistical_models + absorb counting/dispersion/FDR

Rewrite statistical_models.rst as single ~180-line page covering:
- Beta-binomial model (unchanged essentials)
- LRT with profile-likelihood convention (ρ held at null MLE, df=1)
- Dispersion: single vs linear, AIC/BIC choice, ρ bounds (1e-6, 1-1e-6),
  logit clip (±10) — all WASP2-specific implementation contracts not
  in any paper
- Phased vs unphased (uniform prior over 2^(n-1) configurations)
- BH via scipy.stats.false_discovery_control
- WASP2 defaults: pseudocount=1, min_count=10
- NaN-propagation warning for manual BH (matches MEMORY.md pitfall)
- Output columns

Extends mapping_filter.rst canonical-filter contract section with a
single paragraph stating the same contract applies at the counting
step (only 0x4 unmapped filter). This was previously buried in
counting_algorithm.rst:69-75.

Preserves four WASP2-unique contracts flagged by both the sophia
decision-auditor and co-scientist hypothesis-critic:
1. Canonical filter contract at count step
2. Profile-LRT ρ convention (fix from PR #99)
3. BH-NaN trap warning
4. Default parameter values + ρ/logit bounds

Next commit will delete counting_algorithm.rst, dispersion_estimation.rst,
fdr_correction.rst and update toctree + cross-refs.

* docs(methods): delete 3 consolidated pages + methods/index landing

Removed: counting_algorithm.rst, dispersion_estimation.rst,
fdr_correction.rst, methods/index.rst.

All load-bearing WASP2-unique content (canonical filter contract at
count step, profile-LRT ρ convention, BH-NaN trap, default values,
ρ/logit bounds) was absorbed into statistical_models.rst and
mapping_filter.rst in the previous commit.

Updated 2 inbound cross-references:
- faq.rst: methods/fdr_correction → methods/statistical_models
- tutorials/bulk_workflow.rst: removed duplicate fdr_correction ref,
  kept single statistical_models ref

Top-level index.rst toctree reduced from 6 methods entries to 2.
Removed stale "Statistical Methods" landing caption was replaced by
the main index.rst's own :caption: Statistical Methods toctree.

Side benefit: eliminates the 3 pre-existing Sphinx duplicate-citation
warnings (Kumasaka2016, vandeGeijn2015) by removing the duplicate
reference definitions in methods/index.rst and the deleted pages.

Methods section: 5 pages (883 lines) → 2 pages (~500 lines),
a further 43% reduction on top of Tier-B + Tier-D.

Agreed reframe from sophia decision-auditor + co-scientist
hypothesis-critic review of the "delete all 4" proposal: consolidate
instead of erase, to preserve WASP2-specific operational contracts
not present in van de Geijn 2015 or in cited textbooks.
Jaureguy760 added a commit that referenced this pull request Apr 19, 2026
…_guide/analysis

Per user direction, the consolidated statistical_models.rst is removed.
Methods section is now a single page: mapping_filter.rst (canonical WASP
filter contract, van de Geijn 2015).

Rescued into user_guide/analysis.rst under a new "Defaults and contracts"
section (~25 lines) to avoid losing the four WASP2-unique contracts that
both the sophia decision-auditor and co-scientist hypothesis-critic
flagged as load-bearing when the original "delete the stats stuff"
proposal was evaluated:

1. pseudocount=1 + min_count=10 defaults
2. Profile-LRT ρ convention (ρ at null-MLE, df=1) — the scientific fix
   from PR #99
3. ρ bounds (1e-6, 1-1e-6) and linear-dispersion logit clip (±10)
4. BH-NaN trap warning (scipy raises; manual loop silently propagates)

Plus pointer to mapping_filter.rst for the canonical filter contract
that applies to both remapping and counting.

Updated 6 inbound :doc: references:
- index.rst toctree (removed entry)
- faq.rst → points to user_guide/analysis
- tutorials/bulk_workflow.rst (2 refs) → user_guide/analysis
- tutorials/single_cell_workflow.rst → methods/mapping_filter
- methods/mapping_filter.rst → user_guide/analysis

Sphinx build clean, zero warnings.
Jaureguy760 added a commit that referenced this pull request Apr 19, 2026
…_guide/analysis (#107)

Per user direction, the consolidated statistical_models.rst is removed.
Methods section is now a single page: mapping_filter.rst (canonical WASP
filter contract, van de Geijn 2015).

Rescued into user_guide/analysis.rst under a new "Defaults and contracts"
section (~25 lines) to avoid losing the four WASP2-unique contracts that
both the sophia decision-auditor and co-scientist hypothesis-critic
flagged as load-bearing when the original "delete the stats stuff"
proposal was evaluated:

1. pseudocount=1 + min_count=10 defaults
2. Profile-LRT ρ convention (ρ at null-MLE, df=1) — the scientific fix
   from PR #99
3. ρ bounds (1e-6, 1-1e-6) and linear-dispersion logit clip (±10)
4. BH-NaN trap warning (scipy raises; manual loop silently propagates)

Plus pointer to mapping_filter.rst for the canonical filter contract
that applies to both remapping and counting.

Updated 6 inbound :doc: references:
- index.rst toctree (removed entry)
- faq.rst → points to user_guide/analysis
- tutorials/bulk_workflow.rst (2 refs) → user_guide/analysis
- tutorials/single_cell_workflow.rst → methods/mapping_filter
- methods/mapping_filter.rst → user_guide/analysis

Sphinx build clean, zero warnings.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant