diff --git a/docs/source/choosing_workflow.rst b/docs/source/choosing_workflow.rst index d393b988..8d82bd24 100644 --- a/docs/source/choosing_workflow.rst +++ b/docs/source/choosing_workflow.rst @@ -14,11 +14,11 @@ WASP2 supports four major data types. Use this guide to find your workflow. * - **Bulk RNA-seq** - BAM + phased VCF - Allele-specific expression (ASE) - - :doc:`tutorials/rna_seq` + - :doc:`tutorials/bulk_workflow` * - **Bulk ATAC-seq** - BAM + phased VCF - Allele-specific chromatin accessibility - - :doc:`tutorials/atac_seq_workflow` + - :doc:`tutorials/bulk_workflow` * - **scRNA-seq (10x)** - Cell Ranger BAM + VCF + barcodes - Per-cell or per-cell-type ASE @@ -38,15 +38,14 @@ Decision Flowchart **Step 2: Bulk or single-cell RNA-seq?** -* Bulk RNA-seq → :doc:`tutorials/rna_seq` +* Bulk RNA-seq → :doc:`tutorials/bulk_workflow` * 10x Chromium scRNA-seq → :doc:`tutorials/scrna_seq` * Other single-cell protocol → see :doc:`user_guide/single_cell` **Step 3: Bulk or single-cell ATAC-seq?** -* Bulk ATAC-seq → :doc:`tutorials/atac_seq_workflow` -* 10x scATAC-seq (fragments file) → :doc:`tutorials/scatac_workflow` -* 10x scATAC-seq (BAM with CB tag) → :doc:`tutorials/scatac_workflow` (BAM path) +* Bulk ATAC-seq → :doc:`tutorials/bulk_workflow` (use BED peak file as ``--region``) +* 10x scATAC-seq → :doc:`tutorials/scatac_workflow` Do I Need to Run the WASP Remapping Step? ------------------------------------------ diff --git a/docs/source/development.rst b/docs/source/development.rst index c568173c..4a4b71e3 100644 --- a/docs/source/development.rst +++ b/docs/source/development.rst @@ -1,272 +1,124 @@ Development Guide ================= -Contributing to WASP2 ---------------------- +Contributions are welcome. This page covers the WASP2-specific bits; for +generic Python tooling (pytest, black, pre-commit) refer to each project's +own docs. -We welcome contributions! This guide helps you get started. - -Development Setup ------------------ - -Clone Repository -~~~~~~~~~~~~~~~~ +Setup +----- .. code-block:: bash git clone https://github.com/mcvickerlab/WASP2 - cd WASP2-exp - -Install Development Dependencies -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ - -.. code-block:: bash - + cd WASP2 pip install -e ".[dev]" - -This installs: -* pytest (testing) -* mypy (type checking) -* black (code formatting) -* flake8 (linting) -* pre-commit (git hooks) - -Install Pre-commit Hooks -~~~~~~~~~~~~~~~~~~~~~~~~~ - -.. code-block:: bash - pre-commit install -Hooks run automatically before each commit: -* Black formatting -* Flake8 linting -* mypy type checking -* Quick tests +The ``[dev]`` extra pulls in pytest, mypy, black, flake8, pre-commit, and +Sphinx. Pre-commit runs formatting + linting + mypy + a quick test subset +on every commit. -Code Standards +Code standards -------------- -Type Hints -~~~~~~~~~~ - -WASP2 has 100% type hint coverage. All new code must include type hints: - -.. code-block:: python - - def count_alleles( - bam_file: str, - vcf_file: str, - min_count: int = 10 - ) -> pd.DataFrame: - """Count alleles from BAM file.""" - ... - -Formatting -~~~~~~~~~~ - -Use Black with 100-character lines: +- **Type hints are required** on new public API. WASP2 maintains full + mypy coverage on ``src/counting``, ``src/mapping``, and ``src/analysis``. +- **Line length 100**, enforced by black. +- **Ruff** runs as the primary lint tool; flake8 is retained for + docstring-specific rules. +- **Tests** live in ``tests/``. All new features need a test. Maintain + ≥80% line coverage. .. 
code-block:: bash - black src/ --line-length=100 - -Linting -~~~~~~~ - -Pass Flake8 checks: - -.. code-block:: bash - - flake8 src/ --max-line-length=100 - -Testing -------- - -Run Tests Locally -~~~~~~~~~~~~~~~~~ - -.. code-block:: bash - - # All tests - pytest tests/ -v - - # Fast tests only (skip slow integration tests) - pytest tests/ -v -m "not slow" - - # With coverage - pytest tests/ --cov=src --cov-report=html - -Test Requirements -~~~~~~~~~~~~~~~~~ - -* All new features need tests -* Maintain >80% code coverage -* Tests must pass in CI before merge - -Type Checking -------------- - -Run mypy: - -.. code-block:: bash + pytest tests/ -v # full suite + pytest tests/ -v -m "not slow" # fast subset + pytest tests/ --cov=src # coverage mypy src/counting/ src/mapping/ src/analysis/ -All code must pass mypy with 0 errors. +Branching +--------- -CI/CD Pipeline --------------- - -GitHub Actions -~~~~~~~~~~~~~~ - -Tests run automatically on every push: -* Python 3.10 and 3.11 -* Type checking (mypy) -* Unit tests (pytest) -* Full pipeline validation -* Documentation build - -CI must pass before PR can be merged. - -Pre-commit Hooks -~~~~~~~~~~~~~~~~ - -Local checks before commit: -* Code formatting -* Type checking -* Quick tests - -To bypass (not recommended): - -.. code-block:: bash - - git commit --no-verify - -Pull Request Process --------------------- - -1. Fork & Branch -~~~~~~~~~~~~~~~~ - -.. code-block:: bash - - git checkout -b feature/my-feature - -2. Develop & Test -~~~~~~~~~~~~~~~~~ - -.. code-block:: bash - - # Make changes - vim src/analysis/my_feature.py - - # Add type hints - # Write tests - # Run locally - pytest tests/ -v - mypy src/ +- ``master`` is the release branch. Tagged versions (e.g., ``v1.4.1``) + trigger PyPI + Docker publish via GitHub Actions. +- ``dev`` is the integration branch. Feature PRs target ``dev``, not + ``master``. +- Use short, kebab-case feature branches off ``dev``: ``feat/...``, + ``fix/...``, ``docs/...``. +- After a batch of features merges to ``dev``, a ``dev → master`` PR + promotes them; CI runs on both sides of that merge. -3. Commit -~~~~~~~~~ - -.. code-block:: bash +Pull requests +------------- - git add src/analysis/my_feature.py tests/test_my_feature.py - git commit -m "Add my feature" +1. Branch off ``dev``. +2. Commit changes with descriptive messages; the squash-merge commit on + ``dev`` is what ends up in the changelog, so the PR title and body + should be clean. +3. CI must be green: ruff + Rust tests + pytest (Python 3.10 / 3.11 / + 3.12) + CodeQL. +4. Code review looks at correctness, type safety, test coverage, docs, + and API surface. - # Pre-commit hooks run automatically +Rust layer +---------- -4. Push & PR -~~~~~~~~~~~~ +The ``wasp2_rust`` extension is built with maturin against Python 3.10+. .. code-block:: bash - git push origin feature/my-feature - - # Open PR on GitHub - # CI will run automatically - # Request review + cd rust/ + cargo fmt --check + cargo clippy -- -D warnings + cargo test + maturin develop --release --manifest-path Cargo.toml -Code Review ------------ +Changes to Rust code must keep the parity test +(``tests/test_rust_python_counting_parity.py``) green — the Rust counter +must produce byte-identical output to the Python reference implementation +on the shared chr21 fixture. -PRs are reviewed for: -* Correctness -* Type safety -* Test coverage -* Documentation -* Code style - -Project Structure ------------------ +Project layout +-------------- .. 
code-block:: text

-   WASP2-exp/
+   WASP2/
    ├── src/
-   │   ├── counting/       # Allele counting
-   │   ├── mapping/        # WASP remapping
-   │   └── analysis/       # Statistical analysis
-   ├── tests/
-   │   └── regression/     # Regression tests
-   ├── docs/               # Sphinx documentation
-   ├── scripts/            # Utility scripts
-   ├── baselines/          # Test baselines
-   └── test_data/          # Example data
+   │   ├── counting/   # allele counting (Python + Rust bridge)
+   │   ├── mapping/    # WASP remap-and-filter
+   │   └── analysis/   # beta-binomial LRT, FDR
+   ├── rust/           # Rust extension (bam filter, counter)
+   ├── tests/          # pytest + regression
+   ├── docs/           # Sphinx source
+   └── pipelines/      # Nextflow DSL2 workflows

-Building Documentation
-----------------------
+Docs
+----

 .. code-block:: bash

-   cd docs
-   make html
-   open build/html/index.html
+   cd docs && make html

-Documentation must build without warnings.
+The live site is built and deployed by GitHub Pages on every push to
+``master``. Docs must build without warnings; cross-references are
+checked by ``sphinx-build -n`` in CI.

-Release Process
----------------
+Releases
+--------

-1. Update version in ``pyproject.toml``
-2. Update ``docs/source/changelog.rst``
-3. Merge to main
-4. Tag release: ``git tag v1.1.0``
-5. Push tag: ``git push origin v1.1.0``
-6. Publish to PyPI: ``python -m build && twine upload dist/*``
+1. Bump the version in ``rust/Cargo.toml``; it is the single source of
+   truth, and ``pyproject.toml`` reads its version from Cargo.
+2. Update ``CHANGELOG.md``.
+3. Merge ``dev → master``.
+4. Tag: ``git tag vX.Y.Z && git push origin vX.Y.Z``.
+5. The ``release.yml`` workflow builds wheels and publishes to PyPI via
+   OIDC trusted publishing; ``docker.yml`` publishes the ghcr.io image.

-AI-Assisted Development
------------------------
-
-WASP2 pipeline development benefits from AI tooling. See the full integration guide:
-:doc:`/seqera_ai_integration`
-
-Recommended Workflow
-~~~~~~~~~~~~~~~~~~~~
-
-1. **Design**: Use Claude Code for architecture and complex logic
-2. **Generate**: Use Seqera AI for DSL2 syntax and nf-test templates
-3. **Validate**: Use Anthropic life-sciences scripts for environment checks
-4. **Review**: Use Claude Code for code review and optimization
-
-Tool Selection
-~~~~~~~~~~~~~~
-
-* **Architecture and design** → Claude Code
-* **Nextflow DSL2 syntax** → Seqera AI
-* **nf-test generation** → Seqera AI
-* **Environment validation** → ``nextflow run . -profile test -preview``
-
-Getting Help
+Getting help
 ------------

-* **Issues**: https://github.com/mcvickerlab/WASP2/issues
-* **Discussions**: GitHub Discussions
-* **Email**: Contact maintainers
-
-License
--------
-
-WASP2 is released under the MIT License. See LICENSE file.
+- Issues and feature requests: https://github.com/mcvickerlab/WASP2/issues +- Discussions: GitHub Discussions diff --git a/docs/source/index.rst b/docs/source/index.rst index 36fe4ba2..ab8dde66 100644 --- a/docs/source/index.rst +++ b/docs/source/index.rst @@ -71,8 +71,7 @@ Documentation :caption: Tutorials tutorials/quickstart_counting - tutorials/quickstart_mapping - tutorials/rna_seq + tutorials/bulk_workflow tutorials/scrna_seq tutorials/scatac_workflow tutorials/comparative_imbalance diff --git a/docs/source/methods/counting_algorithm.rst b/docs/source/methods/counting_algorithm.rst index 70969308..3839acc9 100644 --- a/docs/source/methods/counting_algorithm.rst +++ b/docs/source/methods/counting_algorithm.rst @@ -135,21 +135,11 @@ The Rust implementation provides: - **Memory efficiency**: Stream through BAM without loading all reads - **htslib integration**: Direct access to BAM index for efficient queries -Performance Characteristics -^^^^^^^^^^^^^^^^^^^^^^^^^^^ - -.. table:: Counting Performance - :widths: 30 35 35 - - =================== ======================== ======================== - Dataset Size Python (pysam) Rust (wasp2_rust) - =================== ======================== ======================== - 10,000 SNPs ~45 seconds ~5 seconds - 100,000 SNPs ~7 minutes ~40 seconds - 1,000,000 SNPs ~70 minutes ~6 minutes - =================== ======================== ======================== - -*Benchmarks on typical ATAC-seq data with 100M reads, single thread.* +The Rust counter is substantially faster than the pure-Python pysam path +(roughly an order of magnitude on typical ATAC-seq data in internal +testing). Exact throughput depends on read length, BAM index density, +storage type, and thread count — measure on your own hardware before +planning large batches. Quality Filtering ----------------- diff --git a/docs/source/methods/dispersion_estimation.rst b/docs/source/methods/dispersion_estimation.rst index 8c31de7f..9b1e4e94 100644 --- a/docs/source/methods/dispersion_estimation.rst +++ b/docs/source/methods/dispersion_estimation.rst @@ -1,62 +1,31 @@ Dispersion Parameter Estimation ================================ -This document describes how WASP2 estimates the overdispersion parameter -:math:`\rho` in the beta-binomial model. - -.. contents:: Contents - :local: - :depth: 2 - -The Dispersion Parameter ------------------------- - -The dispersion parameter :math:`\rho` quantifies the excess variance beyond -the binomial expectation. In the beta-binomial model: +The beta-binomial overdispersion parameter :math:`\rho \in (0, 1)` quantifies +variance inflation beyond the binomial expectation: .. math:: - \text{Var}(X) = N\mu(1-\mu)[1 + (N-1)\rho] - -where: - -- :math:`\rho = 0`: Binomial variance (no overdispersion) -- :math:`\rho > 0`: Variance inflation due to correlated sampling - -Estimation Methods ------------------- - -WASP2 supports two approaches for dispersion estimation: - -1. **Maximum Likelihood Estimation (MLE)**: Optimize the likelihood function -2. **Method of Moments (MoM)**: Solve equations based on sample moments + \text{Var}(X) = N\mu(1-\mu)\left[1 + (N-1)\rho\right] -Both methods have trade-offs in terms of efficiency, bias, and computational cost. +Dispersion arises from biological variation, PCR amplification, and +aggregation of counts across SNPs within a region. 
Estimation approaches +follow the beta-binomial ASE literature [Kumasaka2016]_; analogous methods +for negative-binomial dispersion in RNA-seq counts are reviewed in +[Robinson2010]_ and [Yu2013]_. -Maximum Likelihood Estimation ------------------------------ +Single Dispersion Model (default) +--------------------------------- -MLE finds the value of :math:`\rho` that maximizes the likelihood of the -observed data. - -Single Dispersion Model -^^^^^^^^^^^^^^^^^^^^^^^ - -WASP2's default approach estimates a single :math:`\rho` across all observations: - -.. math:: - - \hat{\rho}_{\text{MLE}} = \arg\max_{\rho} \sum_{i=1}^{n} \log P(X_i | N_i, \mu=0.5, \rho) - -Under the null hypothesis of no imbalance (:math:`\mu = 0.5`), the log-likelihood is: +WASP2's default ``single`` model estimates one pooled :math:`\rho` under the +null hypothesis :math:`\mu = 0.5` by maximum likelihood: .. math:: - \ell(\rho) = \sum_{i=1}^{n} \log \text{BetaBinom}(X_i; N_i, \alpha(\rho), \beta(\rho)) + \hat{\rho}_{\text{MLE}} = \arg\max_{\rho} \sum_{i=1}^{n} + \log \text{BetaBinom}(X_i; N_i, \alpha(\rho), \beta(\rho)) -where :math:`\alpha = \beta = 0.5(1-\rho)/\rho`. - -**Implementation**: +with :math:`\alpha = \beta = 0.5(1-\rho)/\rho`. .. code-block:: python @@ -69,46 +38,29 @@ where :math:`\alpha = \beta = 0.5(1-\rho)/\rho`. beta = 0.5 * (1 - rho) / rho return -np.sum(betabinom.logpmf(ref_counts, n_counts, alpha, beta)) - result = minimize_scalar( + rho_mle = minimize_scalar( neg_log_likelihood, args=(ref_array, n_array), method='bounded', - bounds=(1e-6, 1 - 1e-6) - ) - rho_mle = result.x - -**Properties of MLE**: + bounds=(1e-6, 1 - 1e-6), + ).x -- Asymptotically unbiased as :math:`n \to \infty` -- Achieves the Cramér-Rao lower bound (efficient) -- Computationally requires numerical optimization -- May be sensitive to outliers +:math:`\rho` is held at this null-model MLE when computing the likelihood-ratio +test statistic (see :doc:`statistical_models`). Linear Dispersion Model -^^^^^^^^^^^^^^^^^^^^^^^ +----------------------- -For large datasets with variable coverage, WASP2 offers a model where -dispersion varies linearly with total count on the logit scale: +For large datasets with wide coverage ranges, WASP2 provides a ``linear`` model +in which dispersion varies with total count on the logit scale: .. math:: \text{logit}(\rho_i) = \beta_0 + \beta_1 \cdot N_i -The logit link ensures :math:`\rho_i \in (0, 1)`: - -.. math:: - - \rho_i = \frac{\exp(\beta_0 + \beta_1 N_i)}{1 + \exp(\beta_0 + \beta_1 N_i)} - -**Motivation**: - -Empirically, regions with different coverage levels may exhibit different -dispersion characteristics: - -- **Low coverage**: Greater sampling noise, potentially higher apparent dispersion -- **High coverage**: More stable estimates, may reveal true biological variance - -**Implementation**: +Low-coverage regions typically show greater apparent dispersion from sampling +noise; high-coverage regions reveal more of the true biological variance. The +linear model captures this trend with one additional parameter. .. 
code-block:: python @@ -117,154 +69,58 @@ dispersion characteristics: def neg_ll_linear(params, ref_counts, n_counts): beta0, beta1 = params - logit_rho = beta0 + beta1 * n_counts - # Clip to avoid numerical issues - logit_rho = np.clip(logit_rho, -10, 10) - rho = expit(logit_rho) + rho = expit(np.clip(beta0 + beta1 * n_counts, -10, 10)) alpha = 0.5 * (1 - rho) / rho beta = 0.5 * (1 - rho) / rho return -np.sum(betabinom.logpmf(ref_counts, n_counts, alpha, beta)) - result = minimize( - neg_ll_linear, - x0=(0, 0), - method='Nelder-Mead', - args=(ref_array, n_array) - ) - beta0, beta1 = result.x - -Method of Moments ------------------ + beta0, beta1 = minimize(neg_ll_linear, x0=(0, 0), + method='Nelder-Mead', + args=(ref_array, n_array)).x -MoM estimates :math:`\rho` by equating theoretical and sample moments. +Choosing a model +---------------- -Variance-Based Estimator -^^^^^^^^^^^^^^^^^^^^^^^^ +.. table:: + :widths: 30 70 -For a beta-binomial with :math:`\mu = 0.5`, the variance is: + ================================ ============================================================= + Context Recommended model + ================================ ============================================================= + Small dataset (< 1,000 regions) ``single`` — linear model is underpowered + Uniform coverage ``single`` + Large dataset, variable coverage ``linear`` + Model-selection check Compare AIC/BIC; linear has one extra parameter + ================================ ============================================================= -.. math:: - - \text{Var}(X) = \frac{N}{4}[1 + (N-1)\rho] - -Solving for :math:`\rho`: - -.. math:: - - \hat{\rho}_{\text{MoM}} = \frac{4S^2/N - 1}{N - 1} +Model comparison via AIC (:math:`\text{AIC} = 2k - 2\ell(\hat\theta)`) can be +used to check whether the extra parameter is justified on a given dataset. -where :math:`S^2` is the sample variance of :math:`X/N` (the allelic ratio). +Convergence +----------- -**Pooled Estimator**: - -For observations with varying :math:`N`: - -.. math:: - - \hat{\rho}_{\text{MoM}} = \frac{\sum_i (X_i - N_i/2)^2 - \sum_i N_i/4}{\sum_i N_i(N_i-1)/4} - -**Properties of MoM**: - -- Closed-form solution (fast computation) -- May produce negative estimates (truncate to 0) -- Less efficient than MLE, especially for small samples -- More robust to model misspecification - -Comparison: MLE vs MoM ----------------------- - -.. table:: MLE vs MoM for Dispersion Estimation - :widths: 25 37 38 - - =================== ============================== ============================== - Property MLE MoM - =================== ============================== ============================== - Computation Iterative optimization Closed-form - Efficiency Optimal (achieves CRLB) Suboptimal - Bias Asymptotically unbiased May be biased for small n - Robustness Sensitive to outliers More robust - Boundary behavior Always in (0,1) May give ρ < 0 - WASP2 default Yes No - =================== ============================== ============================== - -WASP2 uses MLE because: - -1. The beta-binomial likelihood is well-behaved -2. Modern optimization is fast enough for typical datasets -3. 
MLE provides consistent estimates across sample sizes - -Practical Considerations ------------------------- - -Convergence Issues -^^^^^^^^^^^^^^^^^^ - -The MLE optimizer may fail to converge if: - -- :math:`\rho` is very close to 0 (nearly binomial data) -- :math:`\rho` is very close to 1 (extreme overdispersion) -- The data contains extreme outliers - -WASP2 handles these by: - -- Bounding :math:`\rho` away from 0 and 1 -- Clipping logit values to avoid overflow -- Using robust optimization methods (bounded, Nelder-Mead) - -Sample Size Requirements -^^^^^^^^^^^^^^^^^^^^^^^^ - -MLE performance depends on sample size: - -.. table:: Dispersion Estimate Quality by Sample Size - :widths: 25 25 50 - - ============ ================ ===================================== - n (regions) CV of estimate Recommendation - ============ ================ ===================================== - < 50 > 50% Use pooled estimate or prior - 50-200 20-50% MLE reasonable but uncertain - 200-1000 10-20% MLE reliable - > 1000 < 10% MLE highly accurate - ============ ================ ===================================== - -Model Selection -^^^^^^^^^^^^^^^ - -Choosing between single and linear dispersion models: - -**Use Single Dispersion When**: - -- Dataset is small (< 1000 regions) -- Coverage is relatively uniform -- Quick analysis is needed - -**Use Linear Dispersion When**: - -- Large dataset (> 10,000 regions) -- Wide range of coverage values -- Systematic coverage-dispersion relationship suspected - -Model comparison can be done via AIC/BIC: - -.. math:: - - \text{AIC} = 2k - 2\ell(\hat{\theta}) - -where :math:`k` is the number of parameters (1 for single, 2 for linear). +The MLE optimizer may have trouble when :math:`\rho` is very close to 0 or 1, +or when extreme outliers dominate the pooled likelihood. WASP2 bounds +:math:`\rho \in (10^{-6}, 1-10^{-6})` and clips logit values in the linear +model. Persistent convergence failures usually indicate data issues (CNVs, +mapping artifacts) rather than optimizer pathology. See Also -------- -- :doc:`fdr_correction` - Multiple testing correction after estimation +- :doc:`statistical_models` — how :math:`\rho` enters the LRT +- :doc:`fdr_correction` — multiple-testing correction downstream References ---------- +.. [Kumasaka2016] Kumasaka N, Knights AJ, Gaffney DJ (2016). Fine-mapping + cellular QTLs with RASQUAL and ATAC-seq. *Nature Genetics* 48:206-213. + .. [Robinson2010] Robinson MD, Smyth GK (2010). Small-sample estimation of negative binomial dispersion, with applications to SAGE data. - *Biostatistics* 9:321-332. + *Biostatistics* 9:321-332. *Analogous NB-dispersion literature.* .. [Yu2013] Yu D, Huber W, Vitek O (2013). Shrinkage estimation of dispersion in Negative Binomial models for RNA-seq experiments with small sample size. - *Bioinformatics* 29:1275-1282. + *Bioinformatics* 29:1275-1282. *Analogous NB-dispersion literature.* diff --git a/docs/source/methods/fdr_correction.rst b/docs/source/methods/fdr_correction.rst index 75894bc0..50532fac 100644 --- a/docs/source/methods/fdr_correction.rst +++ b/docs/source/methods/fdr_correction.rst @@ -1,246 +1,59 @@ False Discovery Rate Correction ================================ -This document describes the multiple testing correction methods used in WASP2 -to control false positive rates when testing many genomic regions. - -.. 
contents:: Contents - :local: - :depth: 2 - -The Multiple Testing Problem ----------------------------- - -When testing thousands of genomic regions for allelic imbalance, even a small -per-test false positive rate leads to many false discoveries: - -.. math:: - - E[\text{false positives}] = m \cdot \alpha - -For :math:`m = 10{,}000` tests at :math:`\alpha = 0.05`, we expect 500 false -positives by chance alone. - -**Example**: Testing 20,000 genes for ASE - -- At α = 0.05: ~1,000 expected false positives -- At α = 0.01: ~200 expected false positives -- Even stringent thresholds yield many false discoveries - -Error Rate Definitions ----------------------- - -Family-Wise Error Rate (FWER) -^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ - -The probability of making **any** false discovery: - -.. math:: - - \text{FWER} = P(V \geq 1) - -where :math:`V` is the number of false positives. - -FWER control (e.g., Bonferroni) is very conservative for genomic studies. - -False Discovery Rate (FDR) -^^^^^^^^^^^^^^^^^^^^^^^^^^ - -The expected **proportion** of false discoveries among rejections: - -.. math:: - - \text{FDR} = E\left[\frac{V}{R}\right] - -where :math:`V` is false positives and :math:`R` is total rejections. - -FDR is more appropriate for discovery-oriented genomic studies where: - -- Some false positives are acceptable -- The goal is to prioritize candidates for follow-up -- The number of tests is very large - -Benjamini-Hochberg Procedure ----------------------------- - -WASP2 uses the Benjamini-Hochberg (BH) procedure [Benjamini1995]_ for FDR control. - -Algorithm -^^^^^^^^^ - -Given :math:`m` p-values :math:`p_1, p_2, \ldots, p_m`: - -1. Sort p-values: :math:`p_{(1)} \leq p_{(2)} \leq \cdots \leq p_{(m)}` -2. Find the largest :math:`k` such that :math:`p_{(k)} \leq \frac{k}{m} \cdot q` -3. Reject all hypotheses with :math:`p_{(i)} \leq p_{(k)}` - -where :math:`q` is the target FDR level (typically 0.05 or 0.1). - -**Adjusted P-Values (q-values)**: - -WASP2 reports BH-adjusted p-values: - -.. math:: - - q_i = \min_{j \geq i} \left\{ \frac{m \cdot p_{(j)}}{j} \right\} - -These can be interpreted as the minimum FDR at which the hypothesis would be -rejected. - -Implementation -^^^^^^^^^^^^^^ +WASP2 corrects for multiple testing using the Benjamini-Hochberg (BH) procedure +[Benjamini1995]_, applied via :func:`scipy.stats.false_discovery_control`. .. code-block:: python from scipy.stats import false_discovery_control - # WASP2 uses the BH method fdr_pvals = false_discovery_control(pvals, method='bh') - # Equivalent manual implementation - def benjamini_hochberg(pvals): - n = len(pvals) - ranked = np.argsort(pvals) - adjusted = np.empty(n) - cummin = 1.0 - for i in range(n - 1, -1, -1): - idx = ranked[i] - adjusted[idx] = min(cummin, pvals[idx] * n / (i + 1)) - cummin = adjusted[idx] - return adjusted - -Properties -^^^^^^^^^^ +Each region receives both a raw p-value (``pval``) from the likelihood ratio test +and a BH-adjusted p-value (``fdr_pval``). Significance is typically declared at +``fdr_pval < 0.05``. -**Advantages**: +BH controls FDR under independence or positive regression dependence (PRDS). +The chi-squared p-values produced by WASP2's LRT satisfy these assumptions when +regions are analyzed independently. If regions share SNPs or otherwise induce +negative dependence, the more conservative Benjamini-Yekutieli procedure +(``method='by'``) can be substituted. 
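+
+Swapping in BY is a one-argument change. A minimal sketch, assuming only
+the documented :func:`scipy.stats.false_discovery_control` signature and
+the same ``pvals`` array as above:
+
+.. code-block:: python
+
+    from scipy.stats import false_discovery_control
+
+    # BY inflates the BH threshold by the harmonic number c(m) ~ ln(m),
+    # which keeps FDR control valid under arbitrary dependence.
+    fdr_pvals_by = false_discovery_control(pvals, method='by')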
-- Controls FDR at level :math:`q` under independence -- More powerful than FWER methods (fewer false negatives) -- Simple to compute and interpret -- Works well with continuous p-values +.. warning:: -**Assumptions**: + ``scipy.stats.false_discovery_control`` raises on NaN p-values, but a manual + BH implementation based on ``np.minimum.accumulate`` silently propagates NaN + through the cumulative minimum. Always drop or impute NaN p-values before BH + correction. -- P-values under the null are uniformly distributed -- Independence or positive regression dependence (PRDS) +For studies requiring higher-power FDR control with an estimated proportion of +true nulls, :math:`\pi_0`, see Storey's q-value [Storey2003]_. WASP2 does not +apply this by default because BH is sufficient for typical ASE workloads and +requires no tuning. -The chi-squared p-values from WASP2's likelihood ratio test satisfy these -assumptions when regions are independent. +Reporting +--------- -Alternative Methods -------------------- +When reporting results, state the correction method, threshold, and number +of tests: -While WASP2 uses BH by default, researchers may consider alternatives -for specific scenarios. + "BH correction was applied to N = ... chi-squared p-values; regions with + ``fdr_pval < 0.05`` were declared significant." -Benjamini-Yekutieli (BY) -^^^^^^^^^^^^^^^^^^^^^^^^ +Output columns +-------------- -For arbitrary dependence between tests: - -.. math:: - - p_{(k)} \leq \frac{k}{m \cdot c(m)} \cdot q - -where :math:`c(m) = \sum_{i=1}^{m} 1/i \approx \ln(m) + 0.577`. - -BY is more conservative but valid under any dependence structure. - -.. code-block:: python - - # Available in scipy - fdr_pvals = false_discovery_control(pvals, method='by') - -Storey's q-value -^^^^^^^^^^^^^^^^ - -Estimates the proportion of true nulls (:math:`\pi_0`) for improved power: - -.. math:: - - \hat{\pi}_0 = \frac{\#\{p_i > \lambda\}}{m(1 - \lambda)} - -The q-value procedure is more powerful when many tests are true discoveries. - -.. code-block:: python - - # Requires qvalue package - # pip install qvalue - from qvalue import qvalue - q = qvalue(pvals) - -Alternative Correction Methods ------------------------------- - -**Discrete FDR and Mid-P Adjustments** - -For exact tests with discrete p-values (Fisher's exact, exact binomial), -specialized methods like Gilbert's procedure [Gilbert2005]_ or mid-p -adjustments can reduce conservativeness. - -However, WASP2's likelihood ratio test produces continuous p-values from -the chi-squared distribution, so **standard BH is appropriate** and these -discrete methods are not needed. - -Practical Guidelines --------------------- - -Choosing an FDR Threshold -^^^^^^^^^^^^^^^^^^^^^^^^^ - -.. 
table:: FDR Threshold Guidelines - :widths: 20 40 40 - - ======== ================================= ================================ - FDR Use Case Interpretation - ======== ================================= ================================ - 0.01 High-confidence discoveries ~1% false among significant - 0.05 Standard exploratory analysis ~5% false among significant - 0.10 Liberal discovery ~10% false, maximize sensitivity - 0.20 Hypothesis generation For follow-up validation - ======== ================================= ================================ - -When to Use Stricter Control -^^^^^^^^^^^^^^^^^^^^^^^^^^^^ - -Consider stricter FDR or FWER control when: - -- Results will directly inform clinical decisions -- Follow-up validation is expensive -- The number of true positives is expected to be small -- Independence assumptions may be violated - -Reporting Results -^^^^^^^^^^^^^^^^^ - -When reporting FDR-corrected results: - -1. **Report the method**: "FDR correction was performed using the - Benjamini-Hochberg procedure" -2. **Report the threshold**: "Significance was declared at FDR < 0.05" -3. **Report both p-values**: Include raw and adjusted p-values in supplements -4. **Report the number of tests**: "Among 15,234 tested regions..." - -Output Format -------------- - -WASP2 reports both raw and adjusted p-values: - -.. table:: P-value Columns in Output +.. table:: :widths: 20 20 60 ========= ======== ==================================================== Column Type Description ========= ======== ==================================================== pval float Raw p-value from likelihood ratio test - fdr_pval float BH-adjusted p-value (q-value) + fdr_pval float BH-adjusted p-value ========= ======== ==================================================== -**Interpretation**: - -- ``pval < 0.05``: Nominally significant (not corrected) -- ``fdr_pval < 0.05``: Significant after multiple testing correction - References ---------- @@ -251,8 +64,3 @@ References .. [Storey2003] Storey JD, Tibshirani R (2003). Statistical significance for genomewide studies. *Proceedings of the National Academy of Sciences* 100:9440-9445. - -.. [Gilbert2005] Gilbert PB (2005). A modified false discovery rate - multiple-comparisons procedure for discrete data, applied to human - immunodeficiency virus genetics. *Journal of the Royal Statistical - Society C* 54:143-158. diff --git a/docs/source/methods/mapping_filter.rst b/docs/source/methods/mapping_filter.rst index e44cc5b3..c196d0b4 100644 --- a/docs/source/methods/mapping_filter.rst +++ b/docs/source/methods/mapping_filter.rst @@ -65,12 +65,19 @@ A read passes the WASP filter if and only if: This criterion ensures that the read would have mapped identically regardless of which allele it carried, eliminating differential mappability. -**Theorem**: After WASP filtering, the probability of mapping is equal for -reads from either allele: +Under the assumption that the aligner's scoring is deterministic and +depends only on the read sequence and reference coordinate — the standard +case for seed-and-extend aligners at typical parameter settings — the +mapping probability becomes approximately equal for either allele after +filtering: .. math:: - P(\text{map} | \text{ref allele}) = P(\text{map} | \text{alt allele}) + P(\text{map} \mid \text{ref allele}) \approx P(\text{map} \mid \text{alt allele}) + +See [vandeGeijn2015]_ §Methods for the original argument. 
The equality is
+exact under deterministic mapping and approximate for aligners with
+stochastic tie-breaking or heuristic multi-mapper resolution.

 Implementation Details
 ----------------------
@@ -216,38 +223,29 @@ The Rust implementation provides:
 - **Streaming comparison**: Memory-efficient position matching
 - **htslib integration**: Native BAM format support

-Performance Characteristics
----------------------------
-
-.. table:: WASP Filter Performance
-   :widths: 30 35 35
-
-   =================== ======================== ========================
-   Reads to Filter     Python Implementation    Rust Implementation
-   =================== ======================== ========================
-   1 million           ~5 minutes               ~30 seconds
-   10 million          ~50 minutes              ~5 minutes
-   100 million         ~8 hours                 ~50 minutes
-   =================== ======================== ========================
+Performance
+-----------

-*Benchmarks with coordinate-sorted BAM, single thread.*
+The Rust implementation is substantially faster than a pure-Python filter
+path (roughly an order of magnitude on coordinate-sorted BAMs of 1–100M
+reads in internal testing). Exact throughput depends on storage type,
+thread count, and read length; run on your own hardware for accurate
+numbers.

 Expected Filter Rates
 ^^^^^^^^^^^^^^^^^^^^^

-Typical filter rates depend on data type and variant density:
+Filter rates vary with data type, variant density, aligner, and tolerance
+settings. As a rough guide only:

-.. table:: Typical Filter Rates by Data Type
-   :widths: 25 25 50
+- RNA-seq: on the order of a few percent to ~15%; higher near splice
+  junctions and in indel-dense regions.
+- ATAC-seq / ChIP-seq: on the order of a few percent.
+- WGS: typically lower than RNA-seq at comparable variant density.

-   ============ ============ ==========================================
-   Data Type    Filter Rate  Notes
-   ============ ============ ==========================================
-   RNA-seq      5-15%        Higher near splice junctions
-   ATAC-seq     2-8%         Lower due to shorter reads
-   ChIP-seq     3-10%        Depends on peak locations
-   WGS          1-5%         Lowest filter rate
-   ============ ============ ==========================================
+Outlier filter rates (well above these ranges) usually indicate a mismatch
+between the original and remap aligner versions/parameters, CNV-rich
+regions, or errors in the variant call set — not a WASP failure.

 Limitations and Considerations
 ------------------------------
@@ -316,4 +314,12 @@ The typical WASP2 workflow:
 See Also
 --------

-- :doc:`counting_algorithm` - Allele counting after WASP filtering
+- :doc:`counting_algorithm` — Allele counting after WASP filtering
+
+References
+----------
+
+.. [vandeGeijn2015] van de Geijn B, McVicker G, Gilad Y, Pritchard JK (2015).
+   WASP: allele-specific software for robust molecular quantitative trait
+   locus discovery. *Nature Methods* 12:1061-1063.
+   https://doi.org/10.1038/nmeth.3582
diff --git a/docs/source/methods/statistical_models.rst b/docs/source/methods/statistical_models.rst
index 6db7552c..b3037f8c 100644
--- a/docs/source/methods/statistical_models.rst
+++ b/docs/source/methods/statistical_models.rst
@@ -1,307 +1,173 @@
 Beta-Binomial Model for Allelic Imbalance
 ==========================================

-This document describes the statistical framework WASP2 uses to detect
-allelic imbalance from allele count data.
+WASP2 detects allelic imbalance via a beta-binomial likelihood-ratio test.
+The beta-binomial accommodates overdispersion from biological variation, +PCR amplification, and aggregation of counts across SNPs within a region — +sources of variance that a simple binomial model cannot absorb. -.. contents:: Contents - :local: - :depth: 2 +Model +----- -Motivation ----------- - -Why Not Use the Binomial? -^^^^^^^^^^^^^^^^^^^^^^^^^ - -The simplest model for allele counts is the binomial distribution. If reads -are sampled independently from two alleles with equal probability: - -.. math:: - - X \sim \text{Binomial}(N, 0.5) - -where :math:`X` is the reference count and :math:`N` is the total count. - -However, real allele count data exhibits **overdispersion**: the variance -exceeds the binomial expectation. Sources of overdispersion include: - -- **Biological variation**: True allelic imbalance varies across cells -- **Technical noise**: PCR amplification introduces correlated errors -- **Aggregation effects**: Combining counts across SNPs within a region -- **Sampling from a population**: Different individuals have different AI - -The Beta-Binomial Model ------------------------ - -The beta-binomial distribution naturally accommodates overdispersion by -modeling the success probability as a random variable: +With :math:`X` the reference count out of :math:`N` total at a site: .. math:: p &\sim \text{Beta}(\alpha, \beta) \\ - X | p &\sim \text{Binomial}(N, p) - -Marginalizing over :math:`p` gives the beta-binomial: - -.. math:: - - P(X = k) = \binom{N}{k} \frac{B(k + \alpha, N - k + \beta)}{B(\alpha, \beta)} - -where :math:`B(\cdot, \cdot)` is the beta function. - -Parameterization -^^^^^^^^^^^^^^^^ - -WASP2 uses the **mean-dispersion parameterization**: - -.. math:: - - \mu &= \frac{\alpha}{\alpha + \beta} \\ - \rho &= \frac{1}{\alpha + \beta + 1} - -The dispersion parameter :math:`\rho \in (0, 1)` controls overdispersion: - -- :math:`\rho \to 0`: Approaches binomial (no overdispersion) -- :math:`\rho \to 1`: Maximum overdispersion (all probability at 0 or N) - -The inverse transformation: - -.. math:: - - \alpha &= \mu \cdot \frac{1 - \rho}{\rho} \\ - \beta &= (1 - \mu) \cdot \frac{1 - \rho}{\rho} - -Variance Structure -^^^^^^^^^^^^^^^^^^ - -The beta-binomial variance is: - -.. math:: - - \text{Var}(X) = N\mu(1-\mu) \left[ 1 + (N-1)\rho \right] - -The factor :math:`[1 + (N-1)\rho]` is the **variance inflation factor**. -For typical values (:math:`\rho \approx 0.01`, :math:`N \approx 100`), this -gives roughly 2x the binomial variance. - -Hypothesis Testing ------------------- - -WASP2 tests for allelic imbalance using a likelihood ratio test: - -**Null Hypothesis** :math:`H_0`: No imbalance (:math:`\mu = 0.5`) - -**Alternative Hypothesis** :math:`H_1`: Imbalance present (:math:`\mu \neq 0.5`) - -Likelihood Functions -^^^^^^^^^^^^^^^^^^^^ + X \mid p &\sim \text{Binomial}(N, p) -Under the null hypothesis: +WASP2 uses the mean-dispersion parameterization, :math:`\mu = \alpha/(\alpha+\beta)` +and :math:`\rho = 1/(\alpha+\beta+1)`: .. math:: - \mathcal{L}_0 = \prod_{i=1}^{n} P(X_i | N_i, \mu=0.5, \rho) + \alpha = \mu \cdot \frac{1 - \rho}{\rho}, \qquad + \beta = (1 - \mu) \cdot \frac{1 - \rho}{\rho} -Under the alternative: +The variance, .. math:: - \mathcal{L}_1 = \prod_{i=1}^{n} P(X_i | N_i, \hat{\mu}_{\text{MLE}}, \rho) + \text{Var}(X) = N\mu(1-\mu)\left[1 + (N-1)\rho\right], -where the product is over SNPs within a region (peak, gene). +recovers the binomial as :math:`\rho \to 0`. Estimation of :math:`\rho` is +described in :doc:`dispersion_estimation`. 
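+
+A quick numeric check of this parameterization (a sketch assuming only
+``scipy.stats.betabinom``; the values below are illustrative, not WASP2
+defaults):
+
+.. code-block:: python
+
+    from scipy.stats import betabinom
+
+    def mu_rho_to_alpha_beta(mu, rho):
+        """Map the mean-dispersion pair (mu, rho) to the (alpha, beta)
+        shape parameters that scipy's betabinom expects."""
+        alpha = mu * (1 - rho) / rho
+        beta = (1 - mu) * (1 - rho) / rho
+        return alpha, beta
+
+    n, mu, rho = 100, 0.5, 0.01
+    alpha, beta = mu_rho_to_alpha_beta(mu, rho)   # alpha = beta = 49.5
+
+    # Variance inflation over Binomial(n, mu): 1 + (n - 1) * rho = 1.99
+    inflation = betabinom.var(n, alpha, beta) / (n * mu * (1 - mu))
+    print(round(inflation, 2))                    # 1.99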
-Likelihood Ratio Test -^^^^^^^^^^^^^^^^^^^^^ +Hypothesis Test +--------------- -The test statistic: +For each region, WASP2 tests .. math:: - \Lambda = -2 \left[ \log \mathcal{L}_0 - \log \mathcal{L}_1 \right] + H_0 : \mu = 0.5 \qquad \text{vs.} \qquad H_1 : \mu \neq 0.5 -Under the null hypothesis, :math:`\Lambda` follows a chi-squared distribution -with 1 degree of freedom: +with the likelihood ratio statistic .. math:: - \Lambda \sim \chi^2_1 + \Lambda = -2\left[\log \mathcal{L}_0 - \log \mathcal{L}_1\right], \qquad + \Lambda \sim \chi^2_1 \text{ under } H_0, -The p-value is: +where :math:`\mathcal{L}_1` is maximized over :math:`\mu` with :math:`\rho` +held at its null-model MLE (profile likelihood in :math:`\mu`). The p-value +is :math:`P(\chi^2_1 > \Lambda)`. -.. math:: - - p = P(\chi^2_1 > \Lambda) - -MLE Estimation -^^^^^^^^^^^^^^ - -The MLE for :math:`\mu` under the alternative is found by numerical optimization: +MLE for :math:`\mu` under :math:`H_1`: .. code-block:: python from scipy.optimize import minimize_scalar from scipy.stats import betabinom + import numpy as np def neg_log_likelihood(mu, ref_counts, n_counts, rho): alpha = mu * (1 - rho) / rho beta = (1 - mu) * (1 - rho) / rho return -np.sum(betabinom.logpmf(ref_counts, n_counts, alpha, beta)) - result = minimize_scalar( + mu_mle = minimize_scalar( neg_log_likelihood, args=(ref_counts, n_counts, rho), method='bounded', - bounds=(0, 1) - ) - mu_mle = result.x - -Implementation in WASP2 ------------------------ + bounds=(0, 1), + ).x -Single Dispersion Model -^^^^^^^^^^^^^^^^^^^^^^^ +Dispersion models +----------------- -The default model estimates a single :math:`\rho` for all data: +WASP2 provides a single pooled :math:`\rho` (default) and a coverage-dependent +``linear`` model (logit-linear in :math:`N`). See :doc:`dispersion_estimation` +for the choice criteria and code. .. code-block:: python - from analysis.as_analysis import single_model + from analysis.as_analysis import single_model, linear_model results = single_model(df, region_col="region", phased=False) - -This assumes homogeneous overdispersion across regions, which is often -reasonable for moderately-sized datasets. - -Linear Dispersion Model -^^^^^^^^^^^^^^^^^^^^^^^ - -For large datasets, WASP2 offers a linear model where dispersion varies -with total count: - -.. math:: - - \text{logit}(\rho) = \beta_0 + \beta_1 \cdot N - -This captures the observation that regions with higher coverage often show -different dispersion characteristics: - -.. code-block:: python - - from analysis.as_analysis import linear_model - results = linear_model(df, region_col="region", phased=False) -Phased vs Unphased Analysis ---------------------------- - -WASP2 supports both phased and unphased genotype data. - -Unphased Model -^^^^^^^^^^^^^^ - -When genotype phase is unknown, WASP2 uses a mixture model that marginalizes -over possible phase configurations: +Phased vs. unphased data +------------------------ -For a region with multiple SNPs, if we don't know which haplotype each -ref allele belongs to, we sum over phase assignments using dynamic programming: +**Phased** genotypes reduce to a direct beta-binomial per SNP, with +:math:`\mu` or :math:`1-\mu` depending on which haplotype carries the +reference allele: .. math:: - P(\mathbf{X}) = \sum_{\phi \in \text{phases}} P(\mathbf{X} | \phi) \cdot P(\phi) - -where :math:`\phi` indexes phase configurations with prior :math:`P(\phi) = 0.5^{n-1}`. 
- -Phased Model -^^^^^^^^^^^^ + P(X_i \mid \text{hap}_i) = \text{BetaBinom}(X_i; N_i, \mu_{\text{hap}_i}, \rho) -With phased genotypes (e.g., from read-backed phasing), the model is simpler: +**Unphased** analysis marginalizes over phase configurations with a uniform +prior across the :math:`2^{n-1}` phase assignments for a region of :math:`n` +SNPs, using dynamic programming over SNPs (approach introduced in +[Mayba2014]_): .. math:: - P(X_i | \text{hap}_i) = \text{BetaBinom}(X_i; N_i, \mu_{\text{hap}_i}, \rho) - -where :math:`\mu_{\text{hap}_i}` is :math:`\mu` or :math:`1-\mu` depending on -which haplotype the reference allele belongs to. - -Output Interpretation ---------------------- + P(\mathbf{X}) = \sum_{\phi} P(\mathbf{X} \mid \phi) \, P(\phi), + \qquad P(\phi) = 0.5^{n-1}. -WASP2 returns the following statistics for each region: +Output +------ -.. table:: Beta-Binomial Analysis Output +.. table:: Columns emitted by the analysis pipeline :widths: 20 15 65 ============ ======== ====================================================== Column Type Interpretation ============ ======== ====================================================== - null_ll float Log-likelihood under null (μ=0.5) - alt_ll float Log-likelihood under alternative (μ=MLE) - mu float MLE of imbalance proportion - lrt float Likelihood ratio test statistic - pval float p-value from χ² distribution - fdr_pval float FDR-corrected p-value (BH method) + null_ll float Log-likelihood under :math:`H_0` (:math:`\mu=0.5`) + alt_ll float Log-likelihood under :math:`H_1` (:math:`\mu=\hat\mu_{\text{MLE}}`) + mu float MLE of the reference-allele proportion + lrt float Likelihood ratio statistic + pval float :math:`\chi^2_1` p-value + fdr_pval float BH-adjusted p-value (see :doc:`fdr_correction`) ============ ======== ====================================================== -**Interpreting μ (mu)**: +:math:`\mu > 0.5` indicates a reference-allele preference; :math:`\mu < 0.5` +indicates an alternate preference; :math:`|\mu - 0.5|` is the effect size. -- :math:`\mu = 0.5`: No imbalance (equal allele expression) -- :math:`\mu > 0.5`: Reference allele preference -- :math:`\mu < 0.5`: Alternate allele preference -- :math:`|\mu - 0.5|`: Effect size (magnitude of imbalance) +Practical notes +--------------- -Practical Considerations ------------------------- - -Pseudocounts -^^^^^^^^^^^^ - -WASP2 adds pseudocounts to avoid log(0) issues: - -.. math:: +**Pseudocounts.** WASP2 adds :math:`c = 1` to both allele counts by default, +shrinking estimates slightly toward 0.5. This gives conservative inference and +avoids log-zero issues at extreme counts. - X' = X + c, \quad N' = N + 2c +**Minimum counts.** Regions with :math:`N < \text{min\_count}` (default 10) are +dropped — low coverage yields poor power and unstable MLEs. -Default :math:`c = 1` (Laplace smoothing). This slightly shrinks estimates -toward 0.5, providing conservative inference. +**Region aggregation.** Analyzing at the region level (peaks, genes) rather +than individual SNPs increases power, reduces the multiple-testing burden, and +better captures regulatory effects that act across a region. -Minimum Count Threshold -^^^^^^^^^^^^^^^^^^^^^^^ - -Regions with low total counts have poor statistical power. WASP2 filters -regions with :math:`N < \text{min\_count}` (default: 10). - -Power depends on coverage and effect size: - -.. 
table:: Approximate Power (α=0.05) - :widths: 20 20 20 20 20 - - ========= ======== ======== ======== ======== - Total N μ=0.55 μ=0.60 μ=0.65 μ=0.70 - ========= ======== ======== ======== ======== - 20 5% 10% 20% 35% - 50 8% 25% 50% 75% - 100 15% 45% 80% 95% - 200 30% 75% 95% 99% - ========= ======== ======== ======== ======== - -Region Aggregation -^^^^^^^^^^^^^^^^^^ - -Analyzing at the region level (genes, peaks) rather than individual SNPs: - -- **Increases power**: More counts per test -- **Reduces multiple testing burden**: Fewer tests to correct -- **Captures regulatory effects**: ASE often affects entire genes +**Power.** Power depends jointly on :math:`N`, :math:`\mu`, and :math:`\rho`. +Because WASP2 does not compute analytic power estimates at runtime, we do not +tabulate values here — power simulations for a given dataset should be run at +the dataset's own estimated :math:`\hat\rho` rather than relying on generic +tables. See Also -------- -- :doc:`dispersion_estimation` - Estimating the dispersion parameter -- :doc:`fdr_correction` - Multiple testing correction methods +- :doc:`dispersion_estimation` — estimating :math:`\rho` +- :doc:`fdr_correction` — multiple-testing correction +- :doc:`counting_algorithm` — how counts are produced +- :doc:`mapping_filter` — WASP remap-and-filter producing the BAM References ---------- -.. [Mayba2014] Mayba O, Gilbert HN, Liu J, et al. (2014). MBASED: allele-specific - expression detection in cancer tissues and cell lines. *Genome Biology* 15:405. +.. [Mayba2014] Mayba O, Gilbert HN, Liu J, et al. (2014). MBASED: + allele-specific expression detection in cancer tissues and cell lines. + *Genome Biology* 15:405. *Introduces the phase-marginalized mixture model.* + +.. [Kumasaka2016] Kumasaka N, Knights AJ, Gaffney DJ (2016). Fine-mapping + cellular QTLs with RASQUAL and ATAC-seq. *Nature Genetics* 48:206-213. + *Beta-binomial LRT framework in which WASP2's analysis sits.* + +.. [vandeGeijn2015] van de Geijn B, McVicker G, Gilad Y, Pritchard JK (2015). + WASP: allele-specific software for robust molecular quantitative trait + locus discovery. *Nature Methods* 12:1061-1063. *The mapping-bias + correction upstream of the count step.* diff --git a/docs/source/tutorials/atac_seq_workflow.ipynb b/docs/source/tutorials/atac_seq_workflow.ipynb deleted file mode 100644 index 27e016ca..00000000 --- a/docs/source/tutorials/atac_seq_workflow.ipynb +++ /dev/null @@ -1,944 +0,0 @@ -{ - "cells": [ - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "# ATAC-seq Allelic Imbalance Analysis Tutorial\n", - "\n", - "**Estimated time:** ~30 minutes\n", - "\n", - "This tutorial demonstrates a complete WASP2 workflow for detecting allelic imbalance in chromatin accessibility from ATAC-seq data. You will learn to:\n", - "\n", - "1. Prepare ATAC-seq peak files for analysis\n", - "2. Count alleles at heterozygous SNPs within accessibility peaks\n", - "3. Detect significant allelic imbalance using beta-binomial statistical testing\n", - "4. Visualize results with volcano plots and effect size distributions\n", - "5. Integrate with caQTL/eQTL data for biological interpretation\n", - "\n", - "## Background\n", - "\n", - "**Allelic Imbalance in Chromatin Accessibility**\n", - "\n", - "ATAC-seq (Assay for Transposase-Accessible Chromatin with sequencing) measures open chromatin regions genome-wide. 
When a heterozygous individual shows unequal accessibility between maternal and paternal alleles at a regulatory region, this indicates **allelic imbalance in chromatin accessibility**.\n", - "\n", - "Such imbalance often reflects:\n", - "- *cis*-regulatory variants affecting transcription factor binding\n", - "- Chromatin accessibility QTLs (caQTLs)\n", - "- Haplotype-specific enhancer activity\n", - "\n", - "WASP2 uses a **beta-binomial model** to detect significant departures from the expected 50:50 allelic ratio while accounting for overdispersion in sequencing data." - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": "## Prerequisites\n\n### Software\n\n- **WASP2** (`pip install wasp2`)\n- **Python 3.10+** with pandas, matplotlib, numpy\n- **samtools** (for BAM operations)\n- **tabix** (for VCF indexing)\n\n### Input Data\n\n| File | Description | Format |\n|------|-------------|--------|\n| `sample.bam` | ATAC-seq aligned reads | BAM (indexed) |\n| `variants.vcf.gz` | Phased heterozygous variants | VCF (indexed) |\n| `peaks.bed` | ATAC-seq peaks from MACS2/SEACR | BED or narrowPeak |\n\n**Note:** For best results, use WASP-filtered BAM files to correct reference mapping bias. See the [mapping documentation](../user_guide/mapping.rst) for details." - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Setup" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "from pathlib import Path\n", - "\n", - "import matplotlib.pyplot as plt\n", - "import numpy as np\n", - "import pandas as pd\n", - "\n", - "# Configure plotting\n", - "plt.style.use(\"seaborn-v0_8-whitegrid\")\n", - "plt.rcParams[\"figure.figsize\"] = (10, 6)\n", - "plt.rcParams[\"font.size\"] = 11\n", - "\n", - "# Define paths (modify these for your data)\n", - "DATA_DIR = Path(\"data\")\n", - "RESULTS_DIR = Path(\"results\")\n", - "RESULTS_DIR.mkdir(exist_ok=True)\n", - "\n", - "# Input files\n", - "BAM_FILE = DATA_DIR / \"atac_sample.bam\"\n", - "VCF_FILE = DATA_DIR / \"phased_variants.vcf.gz\"\n", - "PEAKS_FILE = DATA_DIR / \"atac_peaks.narrowPeak\"\n", - "SAMPLE_ID = \"SAMPLE1\" # Sample name in VCF\n", - "\n", - "print(\"WASP2 ATAC-seq Tutorial\")\n", - "print(\"=\" * 40)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Section 1: Data Loading and Preparation\n", - "\n", - "### 1.1 Inspect Peak File Format\n", - "\n", - "ATAC-seq peaks from MACS2 are typically in **narrowPeak** format (BED6+4). WASP2 accepts both BED and narrowPeak formats." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "# Load and inspect peaks\n", - "peak_columns = [\n", - " \"chr\",\n", - " \"start\",\n", - " \"end\",\n", - " \"name\",\n", - " \"score\",\n", - " \"strand\",\n", - " \"signalValue\",\n", - " \"pValue\",\n", - " \"qValue\",\n", - " \"peak\",\n", - "]\n", - "peaks_df = pd.read_csv(PEAKS_FILE, sep=\"\\t\", header=None, names=peak_columns)\n", - "\n", - "print(f\"Total peaks: {len(peaks_df):,}\")\n", - "print(\"\\nPeak size distribution:\")\n", - "peaks_df[\"size\"] = peaks_df[\"end\"] - peaks_df[\"start\"]\n", - "print(peaks_df[\"size\"].describe())\n", - "\n", - "print(\"\\nFirst 5 peaks:\")\n", - "peaks_df.head()" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### 1.2 Verify BAM and VCF Files\n", - "\n", - "Check that your input files are properly formatted and indexed." 
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "%%bash -s \"$BAM_FILE\" \"$VCF_FILE\" \"$SAMPLE_ID\"\n", - "BAM_FILE=$1\n", - "VCF_FILE=$2\n", - "SAMPLE_ID=$3\n", - "\n", - "echo \"=== BAM File Check ===\"\n", - "echo \"File: $BAM_FILE\"\n", - "samtools view -H \"$BAM_FILE\" 2>/dev/null | head -5 || echo \"Note: Using example paths\"\n", - "\n", - "echo \"\"\n", - "echo \"=== VCF File Check ===\"\n", - "echo \"File: $VCF_FILE\"\n", - "echo \"Checking for sample: $SAMPLE_ID\"\n", - "bcftools query -l \"$VCF_FILE\" 2>/dev/null | head -5 || echo \"Note: Using example paths\"\n", - "\n", - "echo \"\"\n", - "echo \"=== Index Check ===\"\n", - "ls -la \"${BAM_FILE}.bai\" 2>/dev/null || echo \"BAM index (.bai): Using example paths\"\n", - "ls -la \"${VCF_FILE}.tbi\" 2>/dev/null || echo \"VCF index (.tbi): Using example paths\"" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Section 2: Allele Counting at Peaks\n", - "\n", - "WASP2 counts reads supporting reference and alternate alleles at each heterozygous SNP within ATAC-seq peaks. The `--region` parameter restricts counting to SNPs overlapping your peaks.\n", - "\n", - "### 2.1 Run Allele Counting\n", - "\n", - "**Key Parameters:**\n", - "- `--region`: Peak file to restrict SNPs to accessible regions\n", - "- `--samples`: Sample ID for genotype filtering (heterozygous sites only)\n", - "- `--out_file`: Output path for count results" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "# Define output file\n", - "COUNTS_FILE = RESULTS_DIR / \"atac_allele_counts.tsv\"\n", - "\n", - "# Build the command\n", - "count_cmd = f\"\"\"\n", - "wasp2-count count-variants \\\\\n", - " {BAM_FILE} \\\\\n", - " {VCF_FILE} \\\\\n", - " --region {PEAKS_FILE} \\\\\n", - " --samples {SAMPLE_ID} \\\\\n", - " --out_file {COUNTS_FILE}\n", - "\"\"\"\n", - "\n", - "print(\"Command to run:\")\n", - "print(count_cmd)" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "%%bash -s \"$BAM_FILE\" \"$VCF_FILE\" \"$PEAKS_FILE\" \"$SAMPLE_ID\" \"$COUNTS_FILE\"\n", - "# Uncomment to run (requires actual data files)\n", - "# wasp2-count count-variants \\\n", - "# \"$1\" \\\n", - "# \"$2\" \\\n", - "# --region \"$3\" \\\n", - "# --samples \"$4\" \\\n", - "# --out_file \"$5\"\n", - "\n", - "echo \"Note: Uncomment the command above to run with your data\"\n", - "echo \"For this tutorial, we'll use simulated example output.\"" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### 2.2 Inspect Count Results\n", - "\n", - "The output contains per-SNP allele counts with peak annotations." 
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "# For demonstration, create example data\n", - "# (Replace with: counts_df = pd.read_csv(COUNTS_FILE, sep='\\t') for real data)\n", - "\n", - "np.random.seed(42)\n", - "n_snps = 5000\n", - "\n", - "# Simulate realistic ATAC-seq allele counts\n", - "counts_df = pd.DataFrame(\n", - " {\n", - " \"chr\": np.random.choice([\"chr1\", \"chr2\", \"chr3\", \"chr4\", \"chr5\"], n_snps),\n", - " \"pos\": np.random.randint(1e6, 2e8, n_snps),\n", - " \"ref\": np.random.choice([\"A\", \"C\", \"G\", \"T\"], n_snps),\n", - " \"alt\": np.random.choice([\"A\", \"C\", \"G\", \"T\"], n_snps),\n", - " \"region_id\": [f\"peak_{i}\" for i in np.random.randint(0, 1500, n_snps)],\n", - " }\n", - ")\n", - "\n", - "# Generate counts with some true imbalanced regions\n", - "total_depth = np.random.negative_binomial(5, 0.3, n_snps) + 5\n", - "imbalance_prob = np.where(\n", - " np.random.random(n_snps) < 0.1, # 10% truly imbalanced\n", - " np.random.choice([0.3, 0.7], n_snps), # Imbalanced allele freq\n", - " 0.5, # Balanced\n", - ")\n", - "counts_df[\"ref_count\"] = np.random.binomial(total_depth, imbalance_prob)\n", - "counts_df[\"alt_count\"] = total_depth - counts_df[\"ref_count\"]\n", - "counts_df[\"other_count\"] = 0\n", - "\n", - "print(f\"Total SNPs counted: {len(counts_df):,}\")\n", - "print(f\"Unique peaks with SNPs: {counts_df['region_id'].nunique():,}\")\n", - "print(\"\\nCount statistics:\")\n", - "print(counts_df[[\"ref_count\", \"alt_count\"]].describe())\n", - "counts_df.head(10)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### 2.3 Quality Control: Count Distribution\n", - "\n", - "ATAC-seq typically has **lower coverage per peak** than RNA-seq genes. Check the distribution to set appropriate filtering thresholds." 
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "# Calculate total counts per SNP\n", - "counts_df[\"total\"] = counts_df[\"ref_count\"] + counts_df[\"alt_count\"]\n", - "\n", - "fig, axes = plt.subplots(1, 2, figsize=(12, 4))\n", - "\n", - "# Total count distribution\n", - "ax = axes[0]\n", - "ax.hist(counts_df[\"total\"], bins=50, edgecolor=\"black\", alpha=0.7)\n", - "ax.axvline(10, color=\"red\", linestyle=\"--\", label=\"min_count=10\")\n", - "ax.axvline(5, color=\"orange\", linestyle=\"--\", label=\"min_count=5\")\n", - "ax.set_xlabel(\"Total Read Count per SNP\")\n", - "ax.set_ylabel(\"Number of SNPs\")\n", - "ax.set_title(\"Read Depth Distribution\")\n", - "ax.legend()\n", - "ax.set_xlim(0, 100)\n", - "\n", - "# Allele ratio distribution\n", - "ax = axes[1]\n", - "ratio = counts_df[\"ref_count\"] / counts_df[\"total\"]\n", - "ax.hist(ratio[counts_df[\"total\"] >= 10], bins=50, edgecolor=\"black\", alpha=0.7)\n", - "ax.axvline(0.5, color=\"red\", linestyle=\"--\", label=\"Expected (0.5)\")\n", - "ax.set_xlabel(\"Reference Allele Frequency\")\n", - "ax.set_ylabel(\"Number of SNPs\")\n", - "ax.set_title(\"Allele Ratio Distribution (depth ≥10)\")\n", - "ax.legend()\n", - "\n", - "plt.tight_layout()\n", - "plt.savefig(RESULTS_DIR / \"count_qc.png\", dpi=150)\n", - "plt.show()\n", - "\n", - "# Summary statistics\n", - "print(\n", - " f\"\\nSNPs with depth ≥5: {(counts_df['total'] >= 5).sum():,} ({100 * (counts_df['total'] >= 5).mean():.1f}%)\"\n", - ")\n", - "print(\n", - " f\"SNPs with depth ≥10: {(counts_df['total'] >= 10).sum():,} ({100 * (counts_df['total'] >= 10).mean():.1f}%)\"\n", - ")\n", - "print(\n", - " f\"SNPs with depth ≥20: {(counts_df['total'] >= 20).sum():,} ({100 * (counts_df['total'] >= 20).mean():.1f}%)\"\n", - ")" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Section 3: Statistical Testing (Beta-Binomial)\n", - "\n", - "WASP2's `find-imbalance` command uses a **beta-binomial model** to test for allelic imbalance:\n", - "\n", - "- **Null hypothesis (H₀):** Reference allele frequency = 0.5 (balanced)\n", - "- **Alternative (H₁):** Reference allele frequency ≠ 0.5 (imbalanced)\n", - "\n", - "The beta-binomial distribution accounts for **overdispersion** - the extra variability beyond binomial sampling that's common in sequencing data.\n", - "\n", - "### 3.1 Run Imbalance Detection" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "# Define output file\n", - "IMBALANCE_FILE = RESULTS_DIR / \"atac_imbalance_results.tsv\"\n", - "\n", - "# Build the command\n", - "# Note: Using --min 5 for ATAC-seq (lower coverage than RNA-seq)\n", - "# The --model single option uses a single dispersion parameter for all regions\n", - "analysis_cmd = f\"\"\"\n", - "wasp2-analyze find-imbalance \\\\\n", - " {COUNTS_FILE} \\\\\n", - " --min 5 \\\\\n", - " --model single \\\\\n", - " --output {IMBALANCE_FILE}\n", - "\"\"\"\n", - "\n", - "print(\"Command to run:\")\n", - "print(analysis_cmd)" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": "%%bash -s \"$COUNTS_FILE\" \"$IMBALANCE_FILE\"\n# Uncomment to run (requires actual count file)\n# wasp2-analyze find-imbalance \\\n# \"$1\" \\\n# --min 5 \\\n# --model single \\\n# --output \"$2\"\n\necho \"Note: Uncomment the command above to run with your data\"" - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - 
"### 3.2 Simulate Results for Demonstration\n", - "\n", - "For this tutorial, we simulate realistic analysis results." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "# Aggregate counts by peak (region)\n", - "peak_counts = (\n", - " counts_df.groupby(\"region_id\")\n", - " .agg(\n", - " {\n", - " \"chr\": \"first\",\n", - " \"pos\": [\"min\", \"max\"],\n", - " \"ref_count\": \"sum\",\n", - " \"alt_count\": \"sum\",\n", - " }\n", - " )\n", - " .reset_index()\n", - ")\n", - "peak_counts.columns = [\"region\", \"chr\", \"start\", \"end\", \"ref_count\", \"alt_count\"]\n", - "\n", - "# Calculate statistics\n", - "peak_counts[\"total\"] = peak_counts[\"ref_count\"] + peak_counts[\"alt_count\"]\n", - "peak_counts[\"mu\"] = peak_counts[\"ref_count\"] / peak_counts[\"total\"]\n", - "# Add pseudocount (+1) to avoid log(0) and stabilize ratios for low-count regions\n", - "peak_counts[\"effect_size\"] = np.log2(\n", - " (peak_counts[\"ref_count\"] + 1) / (peak_counts[\"alt_count\"] + 1)\n", - ")\n", - "\n", - "# Simulate p-values (truly imbalanced peaks get low p-values)\n", - "# Note: This simulation uses binomial for simplicity. Real ATAC-seq data exhibits\n", - "# overdispersion, which is why WASP2 uses the beta-binomial model.\n", - "np.random.seed(42)\n", - "is_imbalanced = np.abs(peak_counts[\"mu\"] - 0.5) > 0.15\n", - "peak_counts[\"p_value\"] = np.where(\n", - " is_imbalanced,\n", - " 10 ** (-np.random.uniform(2, 10, len(peak_counts))), # Significant\n", - " np.random.uniform(0.05, 1, len(peak_counts)), # Not significant\n", - ")\n", - "\n", - "# FDR correction (Benjamini-Hochberg)\n", - "# Note: This manual BH implementation is for demonstration.\n", - "# WASP2 internally uses scipy.stats.false_discovery_control()\n", - "peak_counts = peak_counts.sort_values(\"p_value\")\n", - "n_tests = len(peak_counts)\n", - "peak_counts[\"rank\"] = range(1, n_tests + 1)\n", - "peak_counts[\"fdr_pval\"] = np.minimum(peak_counts[\"p_value\"] * n_tests / peak_counts[\"rank\"], 1.0)\n", - "peak_counts[\"fdr_pval\"] = peak_counts[\"fdr_pval\"][::-1].cummin()[::-1]\n", - "\n", - "# Filter to testable peaks\n", - "results_df = peak_counts[peak_counts[\"total\"] >= 5].copy()\n", - "results_df = results_df.drop(\"rank\", axis=1)\n", - "\n", - "print(f\"Peaks tested: {len(results_df):,}\")\n", - "print(f\"Significant (FDR < 0.05): {(results_df['fdr_pval'] < 0.05).sum():,}\")\n", - "print(f\"Significant (FDR < 0.01): {(results_df['fdr_pval'] < 0.01).sum():,}\")\n", - "results_df.head(10)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Section 4: Result Interpretation and Visualization\n", - "\n", - "### 4.1 Volcano Plot\n", - "\n", - "The volcano plot shows effect size (x-axis) vs. statistical significance (y-axis), helping identify peaks with both strong and significant allelic imbalance." 
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "fig, ax = plt.subplots(figsize=(10, 8))\n", - "\n", - "# Calculate -log10(p-value) for plotting\n", - "results_df[\"neg_log10_pval\"] = -np.log10(results_df[\"p_value\"].clip(lower=1e-50))\n", - "\n", - "# Define significance thresholds\n", - "sig_mask = results_df[\"fdr_pval\"] < 0.05\n", - "effect_mask = np.abs(results_df[\"effect_size\"]) > 0.5 # |log2FC| > 0.5\n", - "\n", - "# Plot non-significant points\n", - "ns = ~sig_mask\n", - "ax.scatter(\n", - " results_df.loc[ns, \"effect_size\"],\n", - " results_df.loc[ns, \"neg_log10_pval\"],\n", - " c=\"lightgray\",\n", - " s=15,\n", - " alpha=0.6,\n", - " label=f\"Not significant (n={ns.sum():,})\",\n", - ")\n", - "\n", - "# Plot significant but small effect\n", - "sig_small = sig_mask & ~effect_mask\n", - "ax.scatter(\n", - " results_df.loc[sig_small, \"effect_size\"],\n", - " results_df.loc[sig_small, \"neg_log10_pval\"],\n", - " c=\"steelblue\",\n", - " s=25,\n", - " alpha=0.7,\n", - " label=f\"FDR<0.05, small effect (n={sig_small.sum():,})\",\n", - ")\n", - "\n", - "# Plot significant and large effect\n", - "sig_large = sig_mask & effect_mask\n", - "ax.scatter(\n", - " results_df.loc[sig_large, \"effect_size\"],\n", - " results_df.loc[sig_large, \"neg_log10_pval\"],\n", - " c=\"firebrick\",\n", - " s=40,\n", - " alpha=0.8,\n", - " label=f\"FDR<0.05, |log2FC|>0.5 (n={sig_large.sum():,})\",\n", - ")\n", - "\n", - "# Add threshold lines\n", - "ax.axhline(-np.log10(0.05), color=\"black\", linestyle=\"--\", alpha=0.3, linewidth=1)\n", - "ax.axvline(0.5, color=\"gray\", linestyle=\":\", alpha=0.5)\n", - "ax.axvline(-0.5, color=\"gray\", linestyle=\":\", alpha=0.5)\n", - "ax.axvline(0, color=\"black\", linestyle=\"-\", alpha=0.2)\n", - "\n", - "ax.set_xlabel(\"Effect Size (log₂ Ref/Alt)\", fontsize=12)\n", - "ax.set_ylabel(\"-log₁₀(p-value)\", fontsize=12)\n", - "ax.set_title(\"ATAC-seq Allelic Imbalance\\nVolcano Plot\", fontsize=14)\n", - "ax.legend(loc=\"upper right\", fontsize=9)\n", - "\n", - "plt.tight_layout()\n", - "plt.savefig(RESULTS_DIR / \"volcano_plot.png\", dpi=150)\n", - "plt.show()" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### 4.2 Effect Size Distribution" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "fig, axes = plt.subplots(1, 2, figsize=(12, 4))\n", - "\n", - "# All peaks\n", - "ax = axes[0]\n", - "ax.hist(results_df[\"effect_size\"], bins=50, edgecolor=\"black\", alpha=0.7, color=\"steelblue\")\n", - "ax.axvline(0, color=\"red\", linestyle=\"--\", linewidth=2)\n", - "ax.set_xlabel(\"Effect Size (log₂ Ref/Alt)\")\n", - "ax.set_ylabel(\"Number of Peaks\")\n", - "ax.set_title(\"All Tested Peaks\")\n", - "\n", - "# Significant peaks only\n", - "ax = axes[1]\n", - "sig_effects = results_df.loc[sig_mask, \"effect_size\"]\n", - "ax.hist(sig_effects, bins=30, edgecolor=\"black\", alpha=0.7, color=\"firebrick\")\n", - "ax.axvline(0, color=\"black\", linestyle=\"--\", linewidth=2)\n", - "ax.set_xlabel(\"Effect Size (log₂ Ref/Alt)\")\n", - "ax.set_ylabel(\"Number of Peaks\")\n", - "ax.set_title(f\"Significant Peaks (FDR < 0.05, n={sig_mask.sum():,})\")\n", - "\n", - "plt.tight_layout()\n", - "plt.savefig(RESULTS_DIR / \"effect_size_distribution.png\", dpi=150)\n", - "plt.show()\n", - "\n", - "# Summary statistics\n", - "print(\"Effect size statistics (significant peaks):\")\n", - "print(f\" Mean: 
{sig_effects.mean():.3f}\")\n", - "print(f\" Median: {sig_effects.median():.3f}\")\n", - "print(f\" Std: {sig_effects.std():.3f}\")\n", - "print(f\" Ref-biased (log2FC > 0): {(sig_effects > 0).sum()}\")\n", - "print(f\" Alt-biased (log2FC < 0): {(sig_effects < 0).sum()}\")" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### 4.3 Top Imbalanced Peaks" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "# Get top hits by significance\n", - "top_hits = results_df[results_df[\"fdr_pval\"] < 0.05].nsmallest(20, \"fdr_pval\")\n", - "\n", - "print(\"Top 20 Peaks with Allelic Imbalance\")\n", - "print(\"=\" * 80)\n", - "display_cols = [\n", - " \"region\",\n", - " \"chr\",\n", - " \"ref_count\",\n", - " \"alt_count\",\n", - " \"mu\",\n", - " \"effect_size\",\n", - " \"p_value\",\n", - " \"fdr_pval\",\n", - "]\n", - "top_hits[display_cols].round(4)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Section 5: QTL Overlap Analysis\n", - "\n", - "Peaks with allelic imbalance often harbor **chromatin accessibility QTLs (caQTLs)** or overlap with **expression QTLs (eQTLs)**. Integrating your results with published QTL databases helps validate findings and identify regulatory mechanisms.\n", - "\n", - "### 5.1 Prepare BED File for Overlap Analysis" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "# Export significant peaks as BED for overlap analysis\n", - "sig_peaks = results_df[results_df[\"fdr_pval\"] < 0.05][\n", - " [\"chr\", \"start\", \"end\", \"region\", \"effect_size\", \"fdr_pval\"]\n", - "].copy()\n", - "sig_peaks.to_csv(RESULTS_DIR / \"significant_peaks.bed\", sep=\"\\t\", header=False, index=False)\n", - "\n", - "print(f\"Exported {len(sig_peaks)} significant peaks to: {RESULTS_DIR / 'significant_peaks.bed'}\")\n", - "print(\"\\nUse this file for overlap analysis with:\")\n", - "print(\" - GTEx eQTLs (https://gtexportal.org)\")\n", - "print(\" - ENCODE cCREs (https://www.encodeproject.org)\")\n", - "print(\" - Published caQTL datasets\")" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### 5.2 Example: GTEx eQTL Overlap\n", - "\n", - "This example shows how to intersect your imbalanced peaks with GTEx eQTL SNPs to identify potential regulatory relationships." 
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "# Example overlap analysis with bedtools (requires eQTL BED file)\n", - "overlap_cmd = \"\"\"\n", - "# Download GTEx eQTLs for your tissue of interest\n", - "# Example: Brain cortex significant eQTLs\n", - "\n", - "# Intersect imbalanced peaks with eQTL positions\n", - "bedtools intersect \\\\\n", - " -a results/significant_peaks.bed \\\\\n", - " -b gtex_brain_cortex_eqtls.bed \\\\\n", - " -wa -wb \\\\\n", - " > results/peak_eqtl_overlap.bed\n", - "\n", - "# Count overlaps\n", - "wc -l results/peak_eqtl_overlap.bed\n", - "\"\"\"\n", - "\n", - "print(\"Example bedtools command for eQTL overlap:\")\n", - "print(overlap_cmd)" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "# Simulate overlap results for demonstration\n", - "np.random.seed(123)\n", - "\n", - "# Create simulated eQTL overlap data\n", - "n_overlap = 150\n", - "overlap_df = pd.DataFrame(\n", - " {\n", - " \"peak\": [f\"peak_{i}\" for i in np.random.choice(range(1500), n_overlap, replace=False)],\n", - " \"eqtl_gene\": [f\"GENE{i}\" for i in np.random.randint(1, 500, n_overlap)],\n", - " \"eqtl_pval\": 10 ** (-np.random.uniform(3, 15, n_overlap)),\n", - " \"tissue\": np.random.choice(\n", - " [\"Brain_Cortex\", \"Brain_Hippocampus\", \"Liver\", \"Heart\"], n_overlap\n", - " ),\n", - " }\n", - ")\n", - "\n", - "print(f\"Peaks overlapping eQTLs: {len(overlap_df)}\")\n", - "print(\"\\nOverlap by tissue:\")\n", - "print(overlap_df[\"tissue\"].value_counts())\n", - "\n", - "# Enrichment analysis\n", - "n_sig_peaks = sig_mask.sum()\n", - "n_total_peaks = len(results_df)\n", - "n_eqtl_overlap = len(overlap_df)\n", - "\n", - "# Fisher's exact test for enrichment\n", - "from scipy.stats import fisher_exact\n", - "\n", - "# Assume 10% of all peaks overlap eQTLs by chance\n", - "expected_overlap = int(n_sig_peaks * 0.10)\n", - "contingency = [\n", - " [n_eqtl_overlap, n_sig_peaks - n_eqtl_overlap],\n", - " [expected_overlap, n_sig_peaks - expected_overlap],\n", - "]\n", - "odds_ratio, p_value = fisher_exact(contingency)\n", - "\n", - "print(\"\\nEnrichment Analysis:\")\n", - "print(f\" Imbalanced peaks: {n_sig_peaks}\")\n", - "print(f\" Overlapping eQTLs: {n_eqtl_overlap}\")\n", - "print(f\" Expected by chance: ~{expected_overlap}\")\n", - "print(f\" Fold enrichment: {n_eqtl_overlap / max(expected_overlap, 1):.2f}x\")" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### 5.3 Visualization: eQTL Overlap" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "fig, axes = plt.subplots(1, 2, figsize=(12, 4))\n", - "\n", - "# Pie chart: Overlap proportion\n", - "ax = axes[0]\n", - "overlap_counts = [n_eqtl_overlap, n_sig_peaks - n_eqtl_overlap]\n", - "labels = [\"Overlap with eQTL\", \"No eQTL overlap\"]\n", - "colors = [\"#e74c3c\", \"#95a5a6\"]\n", - "ax.pie(overlap_counts, labels=labels, colors=colors, autopct=\"%1.1f%%\", startangle=90)\n", - "ax.set_title(\"Imbalanced Peaks Overlapping eQTLs\")\n", - "\n", - "# Bar chart: Overlap by tissue\n", - "ax = axes[1]\n", - "tissue_counts = overlap_df[\"tissue\"].value_counts()\n", - "colors = plt.cm.Set2(range(len(tissue_counts)))\n", - "bars = ax.bar(tissue_counts.index, tissue_counts.values, color=colors, edgecolor=\"black\")\n", - "ax.set_xlabel(\"Tissue\")\n", - "ax.set_ylabel(\"Number of Overlapping eQTLs\")\n", - 
"ax.set_title(\"eQTL Overlap by Tissue\")\n", - "ax.tick_params(axis=\"x\", rotation=45)\n", - "\n", - "plt.tight_layout()\n", - "plt.savefig(RESULTS_DIR / \"eqtl_overlap.png\", dpi=150)\n", - "plt.show()" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Section 6: Downstream Analysis Hints\n", - "\n", - "### 6.1 Motif Enrichment Analysis\n", - "\n", - "Imbalanced peaks may disrupt transcription factor binding sites. Use tools like HOMER or MEME-ChIP for motif analysis." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "motif_cmd = \"\"\"\n", - "# Extract sequences around imbalanced SNPs\n", - "bedtools slop -i significant_peaks.bed -g genome.chrom.sizes -b 50 | \\\\\n", - "bedtools getfasta -fi genome.fa -bed - -fo imbalanced_seqs.fa\n", - "\n", - "# Run HOMER motif analysis\n", - "findMotifsGenome.pl significant_peaks.bed hg38 motif_results/ -size 200\n", - "\n", - "# Alternative: MEME-ChIP\n", - "meme-chip -oc meme_results imbalanced_seqs.fa\n", - "\"\"\"\n", - "\n", - "print(\"Example commands for motif enrichment analysis:\")\n", - "print(motif_cmd)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### 6.2 Gene Ontology Enrichment\n", - "\n", - "Identify biological processes associated with genes near imbalanced peaks." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "go_cmd = \"\"\"\n", - "# Annotate peaks with nearest genes using GREAT or bedtools\n", - "bedtools closest -a significant_peaks.bed -b genes.bed -d > peak_gene_assignments.bed\n", - "\n", - "# Extract gene list\n", - "cut -f8 peak_gene_assignments.bed | sort -u > imbalanced_genes.txt\n", - "\n", - "# Use DAVID, Enrichr, or clusterProfiler for GO enrichment\n", - "# Web interface: https://david.ncifcrf.gov/\n", - "# Web interface: https://maayanlab.cloud/Enrichr/\n", - "\"\"\"\n", - "\n", - "print(\"Example commands for GO enrichment analysis:\")\n", - "print(go_cmd)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### 6.3 Single-Cell ATAC-seq Extension\n", - "\n", - "For single-cell ATAC-seq (scATAC-seq), use WASP2's single-cell workflow to detect cell-type-specific allelic imbalance." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "sc_cmd = \"\"\"\n", - "# Count alleles in single-cell ATAC-seq\n", - "wasp2-count count-variants-sc \\\\\n", - " scatac_possorted.bam \\\\\n", - " variants.vcf.gz \\\\\n", - " barcodes.tsv \\\\\n", - " --samples SAMPLE_ID \\\\\n", - " --feature peaks.bed \\\\\n", - " --out_file scatac_counts.h5ad\n", - "\n", - "# Detect imbalance per cell type\n", - "wasp2-analyze find-imbalance-sc \\\\\n", - " scatac_counts.h5ad \\\\\n", - " barcode_celltype_map.tsv \\\\\n", - " --sample SAMPLE_ID \\\\\n", - " --min 5 \\\\\n", - " --phased\n", - "\n", - "# Compare imbalance between cell types\n", - "wasp2-analyze compare-imbalance \\\\\n", - " scatac_counts.h5ad \\\\\n", - " barcode_celltype_map.tsv \\\\\n", - " --groups \"excitatory,inhibitory\" \\\\\n", - " --sample SAMPLE_ID\n", - "\"\"\"\n", - "\n", - "print(\"Commands for single-cell ATAC-seq analysis:\")\n", - "print(sc_cmd)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Summary\n", - "\n", - "In this tutorial, you learned to:\n", - "\n", - "1. **Prepare ATAC-seq data** - Load peaks and verify input file formats\n", - "2. 
**Count alleles** - Use `wasp2-count count-variants` with peak regions\n", - "3. **Detect imbalance** - Apply beta-binomial testing with `wasp2-analyze find-imbalance`\n", - "4. **Visualize results** - Create volcano plots and effect size distributions\n", - "5. **Integrate with QTLs** - Overlap with eQTL databases for biological validation\n", - "\n", - "### Key Takeaways\n", - "\n", - "- ATAC-seq has **lower coverage per peak** than RNA-seq; use `--min-count 5` instead of 10\n", - "- **FDR correction** is essential for multiple testing across thousands of peaks\n", - "- Consider **effect size** alongside significance for biological relevance\n", - "- **QTL overlap** helps validate findings and identify causal variants\n", - "\n", - "### Next Steps\n", - "\n", - "- [Comparative Imbalance Tutorial](./comparative_imbalance.rst) - Compare imbalance between conditions\n", - "- [Single-Cell Tutorial](./scrna_seq.rst) - Cell-type-specific analysis\n", - "- [Statistical Methods](../methods/statistical_models.rst) - Deep dive into the beta-binomial model" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Troubleshooting\n", - "\n", - "### Common Issues\n", - "\n", - "**Low SNP counts in peaks:**\n", - "- Ensure VCF contains heterozygous variants for your sample\n", - "- Check that peak coordinates use the same reference genome as VCF\n", - "- Verify `--samples` matches the sample name in VCF header\n", - "\n", - "**Memory errors with large datasets:**\n", - "- Process chromosomes separately with `--region chr1_peaks.bed`, etc.\n", - "- Use `WASP2_RUST_THREADS=4` to limit parallel processing\n", - "\n", - "**No significant results:**\n", - "- Check read depth (may need deeper sequencing)\n", - "- Verify WASP filtering was applied to remove mapping bias\n", - "- Consider lowering `--min-count` threshold (with caution)\n", - "\n", - "### Diagnostic Commands\n", - "\n", - "```bash\n", - "# Check VCF sample names\n", - "bcftools query -l variants.vcf.gz\n", - "\n", - "# Count heterozygous SNPs in your sample\n", - "bcftools view -s SAMPLE_ID variants.vcf.gz | bcftools view -g het | wc -l\n", - "\n", - "# Check BAM read depth at a peak\n", - "samtools depth -r chr1:1000000-1001000 sample.bam | awk '{sum+=$3} END {print sum/NR}'\n", - "```" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "# Save final results\n", - "results_df.to_csv(RESULTS_DIR / \"final_imbalance_results.tsv\", sep=\"\\t\", index=False)\n", - "print(f\"Results saved to: {RESULTS_DIR / 'final_imbalance_results.tsv'}\")\n", - "print(\"\\nAnalysis complete!\")" - ] - } - ], - "metadata": { - "kernelspec": { - "display_name": "Python 3", - "language": "python", - "name": "python3" - }, - "language_info": { - "codemirror_mode": { - "name": "ipython", - "version": 3 - }, - "file_extension": ".py", - "mimetype": "text/x-python", - "name": "python", - "nbconvert_exporter": "python", - "pygments_lexer": "ipython3", - "version": "3.9.0" - } - }, - "nbformat": 4, - "nbformat_minor": 4 -} diff --git a/docs/source/tutorials/bulk_workflow.rst b/docs/source/tutorials/bulk_workflow.rst new file mode 100644 index 00000000..674f2726 --- /dev/null +++ b/docs/source/tutorials/bulk_workflow.rst @@ -0,0 +1,175 @@ +Bulk Workflow (RNA-seq / ATAC-seq) +=================================== + +End-to-end walkthrough of WASP2 on bulk sequencing data: WASP mapping filter, +allele counting at regions, and the beta-binomial LRT for allelic imbalance. 
+The same pipeline works for RNA-seq at genes (GTF) and ATAC-seq at peaks +(BED); data-type differences are called out where they matter. + +Inputs: + +- Aligned BAM (coordinate-sorted, indexed) +- VCF with the donor's heterozygous variants (phased for RNA-seq haplotypes) +- Region file: GTF/GFF for RNA-seq genes, BED for ATAC-seq peaks + +Step 1 — WASP mapping filter +---------------------------- + +Reference alleles map more easily than alternate alleles, biasing ref counts +upward. The WASP remap-and-filter step removes reads whose mapping position +depends on which allele they carry. The test is "would this read map to the +same location if we swapped ref↔alt at the SNP?" — only reads that do are +kept. After filtering, + +.. math:: + + P(\text{map} \mid \text{ref}) \approx P(\text{map} \mid \text{alt}) + +under the usual assumption that the aligner's scoring is deterministic and +position-dependent (see :doc:`/methods/mapping_filter` and [vandeGeijn2015]_). + +.. code-block:: bash + + # Step 1a: produce allele-swapped FASTQ + BAMs + wasp2-map make-reads sample.bam variants.vcf.gz \ + --samples SAMPLE1 \ + --out_dir wasp_output + + # Step 1b: remap with the SAME aligner and parameters used originally + bwa mem -M -t 8 genome.fa \ + wasp_output/sample_swapped_alleles_r1.fq \ + wasp_output/sample_swapped_alleles_r2.fq \ + | samtools sort -o wasp_output/sample_remapped.bam - + samtools index wasp_output/sample_remapped.bam + + # Step 1c: drop reads whose swapped version mapped elsewhere + wasp2-map filter-remapped \ + wasp_output/sample_remapped.bam \ + --wasp_data_json wasp_output/sample_wasp_data_files.json \ + --out_bam wasp_output/sample_wasp_filtered.bam + + # Step 1d: merge with reads that never overlapped a SNP + samtools merge -f wasp_output/sample_final.bam \ + wasp_output/sample_wasp_filtered.bam \ + wasp_output/sample_keep.bam + samtools index wasp_output/sample_final.bam + +.. important:: + + The remap step **must** use the same aligner, version, and parameters as + the original alignment. A different scoring function invalidates the + filter contract. + +Filter rates are **approximately** 1–10% of SNP-overlapping reads dropped +(developer experience; varies by sequencing platform, aligner, and variant +density). Outlier rates should be investigated — they typically signal +CNVs, aligner-version mismatch, or variant-call errors rather than a WASP +failure. + +Step 2 — Count alleles +---------------------- + +**RNA-seq at genes** (GTF): + +.. code-block:: bash + + wasp2-count count-variants \ + wasp_output/sample_final.bam \ + variants.vcf.gz \ + --samples SAMPLE1 \ + --region genes.gtf \ + --out_file allele_counts.tsv + +**ATAC-seq at peaks** (BED): + +.. code-block:: bash + + wasp2-count count-variants \ + wasp_output/sample_final.bam \ + variants.vcf.gz \ + --samples SAMPLE1 \ + --region peaks.bed \ + --out_file allele_counts.tsv + +Output columns: ``chrom``, ``pos``, ``ref``, ``alt``, ``ref_count``, +``alt_count``, ``other_count``, plus ``region`` (peak name) or +``gene_id``/``gene_name`` (GTF). See :doc:`/user_guide/counting` for region +filtering and multi-sample options. + +Step 3 — Test for allelic imbalance +----------------------------------- + +.. 
code-block:: bash
+
+    wasp2-analyze find-imbalance \
+        allele_counts.tsv \
+        --min 10 \
+        --pseudocount 1 \
+        --phased \
+        --out_file ai_results.tsv
+
+The beta-binomial LRT tests :math:`\mu = 0.5` against :math:`\mu \neq 0.5`
+per region, correcting for overdispersion and multiple testing via
+Benjamini–Hochberg (:doc:`/methods/statistical_models`,
+:doc:`/methods/fdr_correction`).
+
+``--phased`` applies only when the VCF GT field uses ``0|1`` / ``1|0``; leave
+it off for unphased data.
+
+Output columns: ``region``, ``num_snps``, aggregated ``ref_count`` /
+``alt_count``, ``pval``, ``fdr_pval``, ``mu``, ``effect_size``
+(:math:`\log_2(\text{ref/alt})`).
+
+Filtering and interpretation
+----------------------------
+
+.. code-block:: bash
+
+    # FDR < 0.05 and |log2FC| > 1
+    # ($6 = fdr_pval, $8 = effect_size, per the column order listed above)
+    awk -F'\t' 'NR==1 || ($6 < 0.05 && ($8 > 1 || $8 < -1))' \
+        ai_results.tsv > significant_ase.tsv
+
+Monoallelic expression (ref fraction > 0.9 or < 0.1) is a common hallmark of
+genomic imprinting in RNA-seq and of strong cis-regulatory variants in
+ATAC-seq.
+
+eQTL integration (RNA-seq only)
+-------------------------------
+
+.. code-block:: python
+
+    import pandas as pd
+
+    ase = pd.read_csv('ai_results.tsv', sep='\t')
+    eqtl = pd.read_csv('gtex_eqtl.tsv', sep='\t')
+
+    merged = ase.merge(eqtl, left_on='region', right_on='gene_id')
+
+    # ASE effect_size > 0 => REF allele more expressed.
+    # eQTL slope > 0 => ALT allele increases expression.
+    # Concordance = opposite signs.
+    merged['concordant'] = (merged['effect_size'] > 0) != (merged['slope'] > 0)
+
+Troubleshooting
+---------------
+
+**Few SNPs found.** Confirm the VCF sample column matches ``--samples``, the
+VCF has heterozygous genotypes for this donor, and BAM+VCF use the same
+reference build.
+
+**Low power / few significant results.** Increase depth, lower ``--min``
+(with caution), or aggregate at a coarser region level.
+
+**Too many significant results.** Confirm the WASP filter was applied, check
+for batch effects, CNVs, or aligner-version drift, and tighten FDR.
+
+See Also
+--------
+
+- :doc:`/user_guide/mapping` — full mapping CLI reference
+- :doc:`/user_guide/counting` — counting CLI reference
+- :doc:`/user_guide/analysis` — analysis CLI reference
+- :doc:`/methods/mapping_filter` — canonical WASP filter contract
+- :doc:`/methods/statistical_models` — the LRT and beta-binomial model
+- :doc:`/tutorials/scatac_workflow` — single-cell ATAC-seq
+- :doc:`/tutorials/scrna_seq` — single-cell RNA-seq
+- :doc:`/tutorials/comparative_imbalance` — comparing groups
diff --git a/docs/source/tutorials/comparative_imbalance.rst b/docs/source/tutorials/comparative_imbalance.rst
index d8d65f2f..4324ee22 100644
--- a/docs/source/tutorials/comparative_imbalance.rst
+++ b/docs/source/tutorials/comparative_imbalance.rst
@@ -1,212 +1,85 @@
-Comparative Imbalance Analysis Tutorial
-=======================================
+Comparative Imbalance Analysis
+==============================
 
-This tutorial provides a comprehensive guide to detecting **differential allelic imbalance**
-between cell types, conditions, or biological groups using WASP2's comparative analysis module.
+Compare allelic imbalance between biological groups — cell types, conditions,
+sex, treatment status — using ``wasp2-analyze compare-imbalance``.
-Overview --------- - -**What is Comparative Imbalance Analysis?** - -While standard allelic imbalance (AI) analysis detects whether a genomic region shows -preferential expression of one allele, comparative imbalance analysis asks a different -question: **Does the degree of imbalance differ between groups?** - -This is powerful for identifying: - -* **Cell-type-specific regulatory variation** - Regions where different cell types show - distinct allelic preferences -* **Condition-dependent effects** - Treatment-induced changes in allelic regulation -* **Sex differences** - Chromatin regions with sex-biased allelic accessibility -* **Developmental dynamics** - Stage-specific changes in allelic regulation - -**Statistical Approach** - -WASP2 uses a **likelihood ratio test (LRT)** to compare two hypotheses: - -.. code-block:: text - - Null Hypothesis (H0): Both groups share the same allelic imbalance (μ_combined) - Alternative Hypothesis (H1): Groups have different imbalance (μ₁ ≠ μ₂) - - Test Statistic: LRT = -2 × (log L_null - log L_alt) - P-value: P(χ²(df=1) > LRT) - -The test accounts for overdispersion using beta-binomial modeling and applies -Benjamini-Hochberg FDR correction for multiple testing. - -Prerequisites -------------- - -**Software:** - -* WASP2 (``pip install wasp2``) -* Python with scanpy/pandas for visualization (optional) -* R with Seurat for cell type annotation (optional) - -**Data Requirements:** +What this tests +--------------- -* AnnData count matrix (``.h5ad``) with allele counts per cell per SNP -* Barcode-to-group mapping file (TSV) -* Groups can be: cell types, conditions, sex, treatment status, etc. - -Input Data Format ------------------ - -Count Matrix (.h5ad) -~~~~~~~~~~~~~~~~~~~~ - -Your AnnData object should have this structure: +Standard allelic-imbalance analysis asks whether a region shows a preference +for one allele. Comparative analysis asks whether the *degree* of imbalance +differs between groups: .. code-block:: text - AnnData object (n_snps × n_cells) - ├── .obs # SNP metadata (rows) - │ ├── index # SNP identifiers - │ └── [sample_name] # Genotypes: '0|1', '1|0', '0/1', '1/0' - │ - ├── .var # Cell metadata (columns) - │ └── group # Cell type/group assignment - │ - ├── .layers - │ ├── "ref" # Reference allele counts (sparse matrix) - │ └── "alt" # Alternate allele counts (sparse matrix) - │ - └── .uns - ├── feature # DataFrame: SNP → region mapping - └── samples # List of sample names - -**Create counts from BAM + VCF:** - -.. code-block:: bash + H0: Both groups share the same allelic imbalance (mu_combined) + H1: Groups have different imbalance (mu_1 != mu_2) - wasp2-count count-variants-sc \ - aligned.bam \ - phased_variants.vcf.gz \ - barcodes.txt \ - --samples SAMPLE_ID \ - --feature peaks.bed \ - --out_file allele_counts.h5ad - -Barcode Map (TSV) -~~~~~~~~~~~~~~~~~ - -A two-column tab-separated file (no header) mapping cell barcodes to groups: - -.. code-block:: text + Test statistic: LRT = -2 * (log L_null - log L_alt) + P-value: P(chi^2(df=1) > LRT) - AAACGAACAGTCAGTT-1 excitatory_neurons - AAACGAAGTCGCTCTA-1 inhibitory_neurons - AAACGAAGTGAACCTA-1 excitatory_neurons - AAAGGATCATCGATGT-1 astrocytes - AAAGGATGTGCAACGA-1 microglia +See :doc:`/methods/statistical_models` for the full model and +:doc:`/user_guide/single_cell` for input-data formats. -**Important:** Barcodes must exactly match those in the count matrix (including any ``-1`` suffix). 
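+The statistic-to-p-value step is small enough to reproduce by hand. A
+minimal sketch with made-up log-likelihoods (WASP2 fits these internally
+from the beta-binomial model and reports them per region as ``null_ll``
+and ``alt_ll``):
+
+.. code-block:: python
+
+   from scipy.stats import chi2
+
+   # Hypothetical log-likelihoods for one region (illustrative numbers).
+   null_ll = -142.7  # shared mu for both groups (H0)
+   alt_ll = -135.2   # separate mu1 and mu2 (H1)
+
+   lrt = -2 * (null_ll - alt_ll)  # = 15.0
+   pval = chi2.sf(lrt, df=1)      # survival function of chi^2(1), ~1.1e-4
+   print(f"LRT = {lrt:.1f}, p = {pval:.2g}")
+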
+Inputs
+------
 
-Tutorial 1: Cell Type Comparison
---------------------------------
+Two files are required:
 
-This tutorial demonstrates comparing allelic imbalance between neuronal subtypes.
+1. **AnnData count matrix** (``.h5ad``) with ``ref`` and ``alt`` layers, a
+   genotype column in ``.obs`` for your sample, and a ``group`` column in
+   ``.var`` identifying the per-cell group assignment. Produced by
+   ``wasp2-count count-variants-sc``.
 
-Step 1: Prepare Input Files
-~~~~~~~~~~~~~~~~~~~~~~~~~~~
+2. **Barcode-to-group TSV** — two columns, no header, tab-separated.
 
-**Export cell type annotations from Seurat:**
-
-.. code-block:: r
-
-   library(Seurat)
-
-   # Load your analyzed Seurat object
-   seurat_obj <- readRDS("brain_snATAC.rds")
-
-   # Create barcode-to-celltype mapping
-   barcode_df <- data.frame(
-     barcode = colnames(seurat_obj),
-     celltype = Idents(seurat_obj)
-   )
-
-   # Write without header
-   write.table(
-     barcode_df,
-     "barcode_celltype_map.tsv",
-     sep = "\t", quote = FALSE,
-     row.names = FALSE, col.names = FALSE
-   )
-
-**Verify the barcode file:**
-
-.. code-block:: bash
+   .. code-block:: text
 
-   # Check format
-   head barcode_celltype_map.tsv
-   # AAACGAACAGTCAGTT-1	excitatory_neurons
-   # AAACGAAGTCGCTCTA-1	inhibitory_neurons
+      AAACGAACAGTCAGTT-1	excitatory_neurons
+      AAACGAAGTCGCTCTA-1	inhibitory_neurons
+      AAAGGATCATCGATGT-1	astrocytes
 
-   # Count cells per type
-   cut -f2 barcode_celltype_map.tsv | sort | uniq -c | sort -rn
-   # 2500 excitatory_neurons
-   # 1800 inhibitory_neurons
-   # 1200 astrocytes
-   # 800 oligodendrocytes
+   Barcodes must match those in the count matrix exactly (including any
+   ``-1`` suffix). For exporting barcode maps from Seurat or Scanpy, see
+   :doc:`/user_guide/single_cell`.
 
-Step 2: Run Per-Group Imbalance Analysis
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+Example: comparing cell types
+-----------------------------
 
-First, analyze imbalance within each cell type:
+With counts in ``allele_counts.h5ad`` and cell-type assignments in
+``barcode_celltype_map.tsv``, run per-group imbalance first, then compare:
 
 .. code-block:: bash
 
+   # Per-group imbalance within each cell type
    wasp2-analyze find-imbalance-sc \
      allele_counts.h5ad \
      barcode_celltype_map.tsv \
-     --sample SAMPLE_ID \
-     --phased \
-     --min 10 \
-     -z 3
-
-This produces per-celltype result files (e.g., ``ai_results_excitatory_neurons.tsv``).
-
-Step 3: Run Comparative Analysis
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+     --sample SAMPLE_ID --phased --min 10 -z 3
 
-Compare imbalance between specific cell types:
-
-.. code-block:: bash
-
-   # Compare excitatory vs inhibitory neurons
+   # Pairwise comparison between two named groups
    wasp2-analyze compare-imbalance \
      allele_counts.h5ad \
      barcode_celltype_map.tsv \
      --sample SAMPLE_ID \
      --groups "excitatory_neurons,inhibitory_neurons" \
-     --phased \
-     --min 15
-
-**Compare all pairwise combinations:**
-
-.. code-block:: bash
+     --phased --min 15
 
-   # Omit --groups to compare all cell types
+   # Omit --groups to compare all cell types pairwise
    wasp2-analyze compare-imbalance \
      allele_counts.h5ad \
      barcode_celltype_map.tsv \
-     --sample SAMPLE_ID \
-     --phased \
-     --min 15
-
-This produces output files for each pairwise comparison:
+     --sample SAMPLE_ID --phased --min 15
 
-* ``ai_results_excitatory_neurons_inhibitory_neurons.tsv``
-* ``ai_results_excitatory_neurons_astrocytes.tsv``
-* ``ai_results_inhibitory_neurons_astrocytes.tsv``
-* ...
+Pairwise output files are named ``ai_results_<group1>_<group2>.tsv``.
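+
+When running all pairwise combinations, a short collection loop saves
+manual bookkeeping. A sketch, assuming the naming pattern above (adjust
+the glob if other ``ai_results_*`` files share the directory):
+
+.. code-block:: python
+
+   import glob
+   import pandas as pd
+
+   # Gather every pairwise table into one frame, labelled by comparison;
+   # names follow the ai_results_<group1>_<group2>.tsv pattern above.
+   frames = []
+   for path in sorted(glob.glob('ai_results_*.tsv')):
+       df = pd.read_csv(path, sep='\t')
+       df['comparison'] = path[len('ai_results_'):-len('.tsv')]
+       frames.append(df)
+   all_results = pd.concat(frames, ignore_index=True)
+
+   # Significant regions per comparison (FDR < 0.05)
+   sig = all_results[all_results['fdr_pval'] < 0.05]
+   print(sig.groupby('comparison').size())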
-Step 4: Interpret Results -~~~~~~~~~~~~~~~~~~~~~~~~~ +Other common comparisons — sex, treatment, developmental stage — use the +same command with a different ``barcode_map.tsv``. The biology varies; the +command does not. -**Output columns explained:** +Output columns +-------------- .. list-table:: :header-rows: 1 @@ -215,331 +88,101 @@ Step 4: Interpret Results * - Column - Description * - ``region`` - - Genomic region (peak or gene) identifier + - Genomic region identifier * - ``num_snps`` - - Number of shared heterozygous SNPs used for comparison + - Shared heterozygous SNPs used for the comparison * - ``combined_mu`` - - Reference allele frequency under null hypothesis (shared between groups) - * - ``mu1`` - - Reference allele frequency in group 1 (e.g., excitatory neurons) - * - ``mu2`` - - Reference allele frequency in group 2 (e.g., inhibitory neurons) - * - ``null_ll`` - - Log-likelihood under null hypothesis (shared μ) - * - ``alt_ll`` - - Log-likelihood under alternative hypothesis (separate μ values) + - Reference-allele frequency under H0 (pooled across groups) + * - ``mu1``, ``mu2`` + - Per-group reference-allele frequencies + * - ``null_ll``, ``alt_ll`` + - Log-likelihoods under H0 / H1 * - ``pval`` - - Likelihood ratio test p-value + - Likelihood-ratio-test p-value * - ``fdr_pval`` - - FDR-corrected p-value (Benjamini-Hochberg) - -**Filtering significant results:** - -.. code-block:: bash - - # Significant differential imbalance (FDR < 0.05) - awk -F'\t' 'NR==1 || $9 < 0.05' ai_results_excitatory_neurons_inhibitory_neurons.tsv \ - > significant_differential_AI.tsv - - # Large effect size (>15% difference in allele frequency) - awk -F'\t' 'NR==1 || ($4 - $5 > 0.15 || $5 - $4 > 0.15)' significant_differential_AI.tsv \ - > large_effect_differential_AI.tsv + - Benjamini–Hochberg adjusted p-value -**Interpret μ values:** +Interpretation: ``|mu1 - mu2| > 0.1`` is a meaningful difference in allele +preference; ``fdr_pval < 0.05`` declares significance after multiple-testing +correction. -* ``mu < 0.5``: Alternate allele favored -* ``mu > 0.5``: Reference allele favored -* ``|mu1 - mu2| > 0.1``: Meaningful difference (~20% shift in allele preference) - -Tutorial 2: Sex Differences Analysis ------------------------------------- - -Identify regions with sex-biased allelic imbalance in chromatin accessibility. - -Step 1: Create Sex-Labeled Barcode Map -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ - -.. code-block:: python - - import pandas as pd - import scanpy as sc - - # Load your annotated data - adata = sc.read_h5ad("processed_snATAC.h5ad") - - # Create barcode-to-sex mapping - barcode_df = pd.DataFrame({ - 'barcode': adata.obs_names, - 'sex': adata.obs['donor_sex'] # 'male' or 'female' - }) - - # Write without header - barcode_df.to_csv('barcode_sex_map.tsv', sep='\t', header=False, index=False) - -Step 2: Run Comparative Analysis -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ - -.. code-block:: bash - - wasp2-analyze compare-imbalance \ - allele_counts.h5ad \ - barcode_sex_map.tsv \ - --sample SAMPLE_ID \ - --groups "male,female" \ - --phased \ - --min 20 \ - --out_file ai_results_sex_comparison.tsv - -Step 3: Identify Sex-Biased Regions -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ - -.. 
code-block:: bash - - # Extract significant sex differences - awk -F'\t' 'NR==1 || $9 < 0.01' ai_results_sex_comparison.tsv > sex_biased_regions.tsv - - # Count by chromosome (expect enrichment on X) - cut -f1 sex_biased_regions.tsv | grep -E "^chr" | cut -d: -f1 | sort | uniq -c - -Tutorial 3: Treatment vs Control --------------------------------- - -Compare allelic imbalance before and after drug treatment. - -Step 1: Prepare Condition Labels -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ - -.. code-block:: python - - import pandas as pd - - # Load metadata with treatment status - metadata = pd.read_csv("sample_metadata.csv") - - # Create barcode-to-condition mapping - barcode_df = pd.DataFrame({ - 'barcode': metadata['cell_barcode'], - 'condition': metadata['treatment_status'] # 'treated' or 'control' - }) - - barcode_df.to_csv('barcode_treatment_map.tsv', sep='\t', header=False, index=False) - -Step 2: Run Analysis -~~~~~~~~~~~~~~~~~~~~ - -.. code-block:: bash - - wasp2-analyze compare-imbalance \ - allele_counts.h5ad \ - barcode_treatment_map.tsv \ - --sample SAMPLE_ID \ - --groups "treated,control" \ - --min 15 \ - --out_file ai_results_treatment.tsv - -Step 3: Identify Treatment-Responsive Regions -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ - -.. code-block:: python - - import pandas as pd - - # Load results - results = pd.read_csv('ai_results_treated_control.tsv', sep='\t') - - # Significant treatment effects - significant = results[results['fdr_pval'] < 0.05] - print(f"Found {len(significant)} treatment-responsive regions") - - # Direction of change - treatment_gain = significant[significant['mu1'] > significant['mu2'] + 0.1] - treatment_loss = significant[significant['mu2'] > significant['mu1'] + 0.1] - - print(f"Regions with increased ref allele in treatment: {len(treatment_gain)}") - print(f"Regions with decreased ref allele in treatment: {len(treatment_loss)}") - -Visualization Examples ----------------------- +Visualization +------------- -Volcano Plot -~~~~~~~~~~~~ +A minimal volcano plot: .. 
code-block:: python import pandas as pd - import matplotlib.pyplot as plt import numpy as np + import matplotlib.pyplot as plt - # Load results - results = pd.read_csv('ai_results_excitatory_neurons_inhibitory_neurons.tsv', sep='\t') - - # Calculate effect size (difference in mu) - results['effect_size'] = results['mu1'] - results['mu2'] - results['-log10_pval'] = -np.log10(results['pval'] + 1e-300) - - # Create volcano plot - fig, ax = plt.subplots(figsize=(10, 8)) - - # Non-significant points - ns = results['fdr_pval'] >= 0.05 - ax.scatter(results.loc[ns, 'effect_size'], - results.loc[ns, '-log10_pval'], - c='gray', alpha=0.5, s=10, label='Not significant') + results = pd.read_csv('ai_results_excitatory_inhibitory.tsv', sep='\t') + results['effect'] = results['mu1'] - results['mu2'] + results['neglog10p'] = -np.log10(results['pval'].clip(lower=1e-300)) - # Significant points + fig, ax = plt.subplots(figsize=(8, 6)) sig = results['fdr_pval'] < 0.05 - ax.scatter(results.loc[sig, 'effect_size'], - results.loc[sig, '-log10_pval'], - c='red', alpha=0.7, s=20, label='FDR < 0.05') - - ax.axhline(-np.log10(0.05), color='black', linestyle='--', alpha=0.5) - ax.axvline(0, color='black', linestyle='-', alpha=0.3) - - ax.set_xlabel('Effect Size (μ₁ - μ₂)', fontsize=12) - ax.set_ylabel('-log₁₀(p-value)', fontsize=12) - ax.set_title('Differential Allelic Imbalance:\nExcitatory vs Inhibitory Neurons', fontsize=14) + ax.scatter(results.loc[~sig, 'effect'], results.loc[~sig, 'neglog10p'], + c='lightgrey', s=8, label='FDR >= 0.05') + ax.scatter(results.loc[sig, 'effect'], results.loc[sig, 'neglog10p'], + c='crimson', s=14, label='FDR < 0.05') + ax.axhline(-np.log10(0.05), ls='--', c='black', alpha=0.4) + ax.axvline(0, ls='-', c='black', alpha=0.3) + ax.set(xlabel='mu1 - mu2', ylabel='-log10(p)', + title='Excitatory vs. inhibitory neurons') ax.legend() - plt.tight_layout() - plt.savefig('differential_AI_volcano.png', dpi=150) - -Heatmap of Top Hits -~~~~~~~~~~~~~~~~~~~ - -.. code-block:: python - import pandas as pd - import seaborn as sns - import matplotlib.pyplot as plt - - # Load results from multiple comparisons - comparisons = [ - ('excitatory', 'inhibitory'), - ('excitatory', 'astrocyte'), - ('inhibitory', 'astrocyte'), - ] - - # Collect mu values for top regions - all_results = {} - for g1, g2 in comparisons: - df = pd.read_csv(f'ai_results_{g1}_{g2}.tsv', sep='\t') - all_results[(g1, g2)] = df.set_index('region') - - # Find regions significant in any comparison - sig_regions = set() - for df in all_results.values(): - sig_regions.update(df[df['fdr_pval'] < 0.05].index[:20]) # Top 20 each - - # Build heatmap matrix (mu values per cell type) - celltypes = ['excitatory', 'inhibitory', 'astrocyte'] - heatmap_data = pd.DataFrame(index=list(sig_regions), columns=celltypes) - - for region in sig_regions: - for g1, g2 in comparisons: - if region in all_results[(g1, g2)].index: - row = all_results[(g1, g2)].loc[region] - heatmap_data.loc[region, g1] = row['mu1'] - heatmap_data.loc[region, g2] = row['mu2'] - - # Plot heatmap - fig, ax = plt.subplots(figsize=(8, 12)) - sns.heatmap(heatmap_data.astype(float), cmap='RdBu_r', center=0.5, - vmin=0, vmax=1, ax=ax, cbar_kws={'label': 'Ref Allele Frequency (μ)'}) - ax.set_title('Cell-Type-Specific Allelic Imbalance', fontsize=14) - plt.tight_layout() - plt.savefig('differential_AI_heatmap.png', dpi=150) +For heatmaps across many cell types or overlap with eQTL/cis-QTL tables, see +the :doc:`/user_guide/analysis` guide. 
-Command-Line Reference +Command-line reference ---------------------- -Full Parameter List -~~~~~~~~~~~~~~~~~~~ - -.. code-block:: bash - - wasp2-analyze compare-imbalance --help +.. code-block:: text Usage: wasp2-analyze compare-imbalance [OPTIONS] ADATA BARCODE_MAP - Arguments: ADATA AnnData file with allele counts (.h5ad) BARCODE_MAP TSV file mapping barcodes to groups - Options: - --groups TEXT Comma-separated groups to compare (default: all) - --min INTEGER Minimum allele count per region per group (default: 10) - --pseudocount INT Pseudocount for zero counts (default: 1) + --groups TEXT Comma-separated groups (default: all pairwise) + --min INT Minimum allele count per region per group [10] + --pseudocount INT Pseudocount for zero counts [1] --sample TEXT Sample name for genotype filtering --phased Use phased genotype information - -z, --z_cutoff INT Remove outlier SNPs above z-score threshold - --out_file TEXT Output file path + -z, --z_cutoff INT Z-score outlier filter + --out_file TEXT Output path -Best Practices +Good practices -------------- -Data Quality -~~~~~~~~~~~~ - -* **Use WASP-filtered BAMs** to remove mapping bias artifacts -* **Require sufficient counts** (``--min 15`` or higher for robust estimates) -* **Apply z-score filtering** (``-z 3``) to remove outliers from CNVs or mapping artifacts - -Statistical Power -~~~~~~~~~~~~~~~~~ - -* **Merge similar groups** if individual populations have low cell counts -* **Use phased genotypes** when available for improved power -* **Focus on regions with multiple SNPs** for more reliable estimates - -Interpretation -~~~~~~~~~~~~~~ +- Run on WASP-filtered BAMs to remove mapping-bias artifacts + (:doc:`/methods/mapping_filter`). +- Use ``--min 15`` or higher for robust estimates when per-group coverage + is moderate. +- Apply ``-z 3`` to drop SNP outliers from CNVs or mapping artifacts. +- Check that groups have comparable sequencing depth; very uneven depth can + produce false-positive differences. +- Validate top hits in an independent cohort or with an orthogonal assay. -* **Biological replication** - Validate across independent samples -* **Effect size matters** - Consider the absolute difference between μ₁ and μ₂ alongside p-values -* **Integrate with eQTL data** - Connect to known regulatory variants -* **Orthogonal validation** - Confirm top hits with targeted methods - -Common Issues +Common issues ------------- -Low Power / Few Significant Results -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ - -* Increase sequencing depth -* Merge similar cell types to increase counts per group -* Lower ``--min`` threshold (with caution) -* Use phased genotypes if available - -Too Many Significant Results -~~~~~~~~~~~~~~~~~~~~~~~~~~~~ - -* Check for batch effects between groups -* Verify WASP filtering was applied -* Use stricter FDR threshold (e.g., 0.01) -* Check that groups have similar sequencing depth - -Memory Issues -~~~~~~~~~~~~~ - -Process chromosomes separately: - -.. code-block:: bash +**Few significant results.** Increase coverage, merge similar populations to +increase per-group counts, or use phased genotypes if available. - for chr in chr{1..22}; do - wasp2-count count-variants-sc \ - sample.bam variants.vcf.gz barcodes.tsv \ - --region peaks_${chr}.bed \ - --out_file counts_${chr}.h5ad +**Too many significant results.** Check for batch effects between groups, +confirm WASP filtering was applied, and use a stricter FDR threshold. 
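+
+The stricter threshold is a one-line filter on the result table. A minimal
+sketch, assuming the output columns listed above (file name illustrative):
+
+.. code-block:: python
+
+   import pandas as pd
+
+   res = pd.read_csv('ai_results_excitatory_neurons_inhibitory_neurons.tsv',
+                     sep='\t')
+   strict = res[res['fdr_pval'] < 0.01]  # tighter than the usual 0.05
+   print(f"{len(strict)} regions pass FDR < 0.01")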
- wasp2-analyze compare-imbalance \ - counts_${chr}.h5ad \ - barcode_celltype_map.tsv \ - --out_file results_${chr}.tsv - done +**Memory.** For large cohorts, process chromosomes separately and concatenate +results. See Also -------- -* :doc:`/user_guide/analysis` - Statistical methods and parameters -* :doc:`/user_guide/single_cell` - Single-cell data formats -* :doc:`/tutorials/scrna_seq` - Basic scRNA-seq tutorial +- :doc:`/user_guide/analysis` — analysis-CLI reference and parameters +- :doc:`/user_guide/single_cell` — input data formats, barcode exports +- :doc:`/tutorials/scrna_seq` — basic single-cell workflow +- :doc:`/methods/statistical_models` — the LRT underlying this test diff --git a/docs/source/tutorials/quickstart_mapping.rst b/docs/source/tutorials/quickstart_mapping.rst deleted file mode 100644 index 41317a03..00000000 --- a/docs/source/tutorials/quickstart_mapping.rst +++ /dev/null @@ -1,257 +0,0 @@ -Quickstart: WASP Mapping Filter -================================ - -Learn WASP2's mapping bias correction in 5 minutes. - -.. contents:: Contents - :local: - :depth: 2 - -Overview --------- - -**Goal:** Understand and apply the WASP mapping filter to remove reference bias from your alignment data. - -**Time:** ~5 minutes to read, ~30 minutes to run on typical data - -**Prerequisites:** - -* WASP2 installed (``pip install wasp2``) -* Aligned BAM file (coordinate-sorted) -* VCF file with heterozygous variants - -The Problem: Reference Mapping Bias ------------------------------------ - -When reads are aligned to a reference genome, there's an inherent asymmetry: - -.. code-block:: text - - Reference: ...ACGT[A]CGTA... (reference allele: A) - Read (ref): ...ACGT[A]CGTA... → Perfect match (0 mismatches) - Read (alt): ...ACGT[G]CGTA... → 1 mismatch penalty - -**Result**: Reads carrying the alternate allele are more likely to: - -- Fail to map entirely -- Map with lower quality scores -- Map to incorrect locations - -This causes **inflated reference allele counts**, leading to false positive ASE signals. - -The Solution: WASP Remap-and-Filter ------------------------------------ - -WASP corrects this by testing whether each read would map identically -regardless of which allele it carries: - -1. **Identify**: Find reads overlapping heterozygous SNPs -2. **Swap**: Create versions with alleles swapped (ref→alt, alt→ref) -3. **Remap**: Align swapped reads with the same aligner -4. **Filter**: Keep only reads that map to the **same location** after swapping - -After filtering, the probability of mapping is equal for both alleles: - -.. math:: - - P(\text{map} | \text{ref allele}) = P(\text{map} | \text{alt allele}) - -Quick Workflow --------------- - -Step 1: Create Swapped Reads -~~~~~~~~~~~~~~~~~~~~~~~~~~~~ - -Identify reads overlapping heterozygous SNPs and generate allele-swapped versions: - -.. code-block:: bash - - wasp2-map make-reads sample.bam variants.vcf.gz \ - --samples SAMPLE1 \ - --out_dir wasp_output - -This produces (where ``sample`` is your BAM file prefix): - -* ``wasp_output/sample_to_remap.bam``: Original reads needing remapping -* ``wasp_output/sample_keep.bam``: Reads not overlapping variants (kept as-is) -* ``wasp_output/sample_swapped_alleles_r1.fq``: Allele-swapped read 1 -* ``wasp_output/sample_swapped_alleles_r2.fq``: Allele-swapped read 2 -* ``wasp_output/sample_wasp_data_files.json``: Metadata for filter step - -Step 2: Remap Swapped Reads -~~~~~~~~~~~~~~~~~~~~~~~~~~~ - -**Critical**: Use the **same aligner and parameters** as your original mapping! - -.. 
code-block:: bash - - # Example with BWA (replace 'sample' with your BAM file prefix) - bwa mem -M -t 8 genome.fa \ - wasp_output/sample_swapped_alleles_r1.fq \ - wasp_output/sample_swapped_alleles_r2.fq | \ - samtools sort -o wasp_output/sample_remapped.bam - - - samtools index wasp_output/sample_remapped.bam - -Using different alignment parameters will invalidate the WASP correction. - -Step 3: Filter Remapped Reads -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ - -The WASP filter compares original and remapped positions: - -.. code-block:: bash - - wasp2-map filter-remapped \ - wasp_output/sample_remapped.bam \ - --wasp_data_json wasp_output/sample_wasp_data_files.json \ - --out_bam wasp_output/sample_wasp_filtered.bam - -Understanding Filter Statistics -------------------------------- - -The WASP filter reports three key metrics: - -.. table:: WASP Filter Metrics - :widths: 20 50 30 - - ================= ============================================ ============== - Metric Description Typical Value - ================= ============================================ ============== - **Kept reads** Reads that passed the filter 90-99% - **Removed (moved)** Reads that mapped to different locations 1-8% - **Removed (missing)** Reads that failed to remap <1% - ================= ============================================ ============== - -Interpreting Filter Rates -~~~~~~~~~~~~~~~~~~~~~~~~~ - -* **95-99% kept**: Good - typical for most data types -* **90-95% kept**: Acceptable - may indicate difficult regions -* **<90% kept**: Investigate - check data quality or variant calls - -Before/After Example --------------------- - -At a site with mapping bias: - -.. table:: Example: Before and After WASP - :widths: 30 35 35 - - =============== =============== =============== - Metric Before WASP After WASP - =============== =============== =============== - Reference reads 150 95 - Alternate reads 80 85 - Ref fraction 0.65 0.53 - =============== =============== =============== - -The biased site (0.65 ref fraction) is corrected to near-balanced (0.53). - -Complete Workflow Script ------------------------- - -.. code-block:: bash - - #!/bin/bash - set -e - - # Input files - BAM="sample.bam" - VCF="variants.vcf.gz" - SAMPLE="SAMPLE1" - GENOME="genome.fa" - OUTDIR="wasp_output" - - # Extract BAM prefix (filename without .bam extension) - PREFIX=$(basename $BAM .bam) - - mkdir -p $OUTDIR - - # Step 1: Create allele-swapped reads - echo "Step 1: Creating swapped reads..." - wasp2-map make-reads $BAM $VCF \ - --samples $SAMPLE \ - --out_dir $OUTDIR - - # Step 2: Remap with same aligner - echo "Step 2: Remapping swapped reads..." - bwa mem -M -t 8 $GENOME \ - $OUTDIR/${PREFIX}_swapped_alleles_r1.fq \ - $OUTDIR/${PREFIX}_swapped_alleles_r2.fq | \ - samtools sort -o $OUTDIR/${PREFIX}_remapped.bam - - samtools index $OUTDIR/${PREFIX}_remapped.bam - - # Step 3: Filter biased reads - echo "Step 3: Filtering biased reads..." - wasp2-map filter-remapped \ - $OUTDIR/${PREFIX}_remapped.bam \ - --wasp_data_json $OUTDIR/${PREFIX}_wasp_data_files.json \ - --out_bam $OUTDIR/${PREFIX}_wasp_filtered.bam - - # Step 4: Merge with non-overlapping reads - echo "Step 4: Merging final BAM..." - samtools merge -f $OUTDIR/${PREFIX}_final.bam \ - $OUTDIR/${PREFIX}_wasp_filtered.bam \ - $OUTDIR/${PREFIX}_keep.bam - samtools index $OUTDIR/${PREFIX}_final.bam - - echo "Done! WASP-filtered BAM: $OUTDIR/${PREFIX}_final.bam" - -Rust Acceleration ------------------ - -WASP2 includes a high-performance Rust backend that accelerates the filter step: - -.. 
table:: Performance Comparison - :widths: 30 35 35 - - ============= =============== =============== - Dataset Size Python Rust - ============= =============== =============== - 1M reads ~5 minutes ~30 seconds - 10M reads ~50 minutes ~5 minutes - 100M reads ~8 hours ~50 minutes - ============= =============== =============== - -The Rust backend is used automatically when available. - -Next Steps ----------- - -After WASP filtering: - -1. **Count alleles** on the filtered BAM: - - .. code-block:: bash - - wasp2-count count-variants wasp_filtered.bam variants.vcf - -2. **Analyze allelic imbalance**: - - .. code-block:: bash - - wasp2-analyze find-imbalance counts.tsv - -See Also --------- - -* :doc:`/user_guide/mapping` - Detailed mapping module documentation -* :doc:`/methods/mapping_filter` - Algorithm details and mathematics -* :doc:`/tutorials/scrna_seq` - Single-cell RNA-seq workflow - -Summary -------- - -.. table:: Key Takeaways - :widths: 25 75 - - ============ =============================================== - Concept Key Point - ============ =============================================== - **Problem** Reference bias inflates ref allele counts - **Solution** WASP remap-and-filter removes biased reads - **Workflow** make-reads → remap → filter-remapped - **Expected** 90-99% reads pass filter - **Result** Unbiased allele counts for ASE analysis - ============ =============================================== diff --git a/docs/source/tutorials/rna_seq.rst b/docs/source/tutorials/rna_seq.rst deleted file mode 100644 index c5fd0876..00000000 --- a/docs/source/tutorials/rna_seq.rst +++ /dev/null @@ -1,203 +0,0 @@ -RNA-seq Allelic Imbalance Tutorial -=================================== - -This tutorial demonstrates a complete workflow for detecting allele-specific expression (ASE) -in bulk RNA-seq data using WASP2. - -**Estimated time:** ~30 minutes - -Overview --------- - -The tutorial covers the complete RNA-seq allelic imbalance analysis pipeline: - -1. **Data Loading** - BAM, VCF, and gene annotations (GTF) -2. **Allele Counting** - Count reads at heterozygous SNPs within genes/exons -3. **Statistical Testing** - Beta-binomial model for allelic imbalance detection -4. **ASE Visualization** - Volcano plots and allele ratio distributions -5. **Imprinting Detection** - Identify monoallelic expression patterns -6. **eQTL Integration** - Connect ASE signals to regulatory variants - -Prerequisites -------------- - -**Software:** - -* WASP2 (``pip install wasp2``) -* Python packages: pandas, numpy, matplotlib, seaborn, scipy - -**Data:** - -* Aligned BAM file (coordinate-sorted, indexed) -* Phased VCF file with heterozygous variants -* Gene annotation file (GTF format, e.g., GENCODE) - -Workflow Summary ----------------- - -Step 1: Count Alleles at Genes -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ - -Use ``wasp2-count count-variants`` to count allele-specific reads at heterozygous SNPs: - -.. code-block:: bash - - wasp2-count count-variants \ - sample.bam \ - variants.vcf.gz \ - --samples SAMPLE_ID \ - --region genes.gtf \ - --out_file allele_counts.tsv - -This produces a TSV file with columns: - -* ``chr``, ``pos``: SNP location -* ``ref``, ``alt``: Alleles -* ``ref_count``, ``alt_count``: Read counts per allele -* ``other_count``: Reads supporting other alleles (non-ref, non-alt) -* ``gene_id``, ``gene_name``: Overlapping gene annotation -* ``feature``: Feature type (exon, intron, etc.) 
when using GTF - -Step 2: Test for Allelic Imbalance -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ - -Use ``wasp2-analyze find-imbalance`` to detect significant ASE: - -.. code-block:: bash - - wasp2-analyze find-imbalance \ - allele_counts.tsv \ - --min 10 \ - --pseudocount 1 \ - --phased \ - --out_file ai_results.tsv - -The beta-binomial model tests for deviation from 50:50 allele ratios, accounting for -biological overdispersion. - -**Output columns:** - -* ``region``: Gene identifier -* ``ref_count``, ``alt_count``: Aggregated counts -* ``p_value``: Likelihood ratio test p-value -* ``fdr_pval``: FDR-corrected p-value -* ``effect_size``: Log2 fold change (ref/alt) - -Step 3: Identify Significant ASE -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ - -Filter results by significance and effect size: - -.. code-block:: bash - - # Significant ASE (FDR < 0.05, |log2FC| > 1) - # Column indices: $5 = fdr_pval, $6 = effect_size - awk -F'\t' 'NR==1 || ($5 < 0.05 && ($6 > 1 || $6 < -1))' ai_results.tsv > significant_ase.tsv - -Step 4: Detect Imprinting Patterns -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ - -Monoallelic expression (>90:10 allele ratio) may indicate genomic imprinting: - -.. code-block:: python - - import pandas as pd - - results = pd.read_csv('ai_results.tsv', sep='\t') - total = results['ref_count'] + results['alt_count'] - results['ref_ratio'] = results['ref_count'] / total - - # Monoallelic genes - monoallelic = results[ - (results['ref_ratio'] > 0.9) | (results['ref_ratio'] < 0.1) - ] - -Step 5: Integrate with eQTL Data -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ - -Cross-reference ASE results with eQTL databases (e.g., GTEx): - -.. code-block:: python - - import pandas as pd - - ase = pd.read_csv('ai_results.tsv', sep='\t') - eqtl = pd.read_csv('gtex_eqtl.tsv', sep='\t') - - # Merge on gene ID - integrated = ase.merge(eqtl, left_on='region', right_on='gene_id') - - # Check direction concordance between ASE and eQTL - # ASE effect_size > 0 means REFERENCE allele is more expressed - # eQTL slope > 0 means ALTERNATE allele INCREASES expression - # Therefore, concordance = OPPOSITE signs (ref high in ASE = alt low in eQTL) - integrated['concordant'] = ( - (integrated['effect_size'] > 0) != (integrated['slope'] > 0) - ) - -Key Concepts ------------- - -Beta-Binomial Model -~~~~~~~~~~~~~~~~~~~ - -WASP2 uses a beta-binomial distribution to model allele counts: - -* Accounts for **overdispersion** (biological variation beyond binomial sampling) -* Models **technical noise** from PCR amplification and sequencing -* Aggregates information across **multiple SNPs** per gene - -The null hypothesis is equal expression from both alleles (p = 0.5). 
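-
-To make this concrete, the following minimal sketch runs the same style of
-test for a single gene, assuming a fixed, illustrative dispersion ``rho``
-and using ``scipy.stats.betabinom`` (the raw allele ratio stands in for the
-MLE of ``mu``; WASP2 estimates both parameters from the data):
-
-.. code-block:: python
-
-    from scipy.stats import betabinom, chi2
-
-    ref, total = 60, 100   # aggregated counts for one gene (illustrative)
-    rho = 0.05             # illustrative overdispersion value
-
-    def loglik(mu: float) -> float:
-        # (mu, rho) parameterization of the beta-binomial
-        alpha = mu * (1 - rho) / rho
-        beta = (1 - mu) * (1 - rho) / rho
-        return betabinom.logpmf(ref, total, alpha, beta)
-
-    # Likelihood ratio test against the balanced null (mu = 0.5)
-    lrt = -2 * (loglik(0.5) - loglik(ref / total))
-    p_value = chi2.sf(lrt, df=1)
-    print(f"LRT = {lrt:.2f}, p = {p_value:.3g}")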
- -Effect Size Interpretation -~~~~~~~~~~~~~~~~~~~~~~~~~~ - -* **log2FC > 1**: Reference allele 2x more expressed (strong ASE) -* **log2FC > 2**: Reference allele 4x more expressed (very strong ASE) -* **log2FC near 0**: Balanced expression (no ASE) - -Significance Thresholds -~~~~~~~~~~~~~~~~~~~~~~~ - -* **FDR < 0.05**: Standard significance threshold -* **FDR < 0.01**: Stringent threshold for high-confidence hits -* Combine with effect size filters to focus on biologically meaningful results - -Troubleshooting ---------------- - -Low SNP Counts -~~~~~~~~~~~~~~ - -If few heterozygous SNPs are detected: - -* Verify VCF contains heterozygous genotypes: - - - **Phased format**: GT = ``0|1`` or ``1|0`` (pipe separator) - - **Unphased format**: GT = ``0/1`` or ``1/0`` (slash separator) - - Use ``--phased`` flag only with phased genotypes - -* Check sample ID matches VCF sample column -* Ensure BAM and VCF use the same reference genome - -No Significant Results -~~~~~~~~~~~~~~~~~~~~~~ - -* Increase sequencing depth (more reads = more power) -* Lower ``--min`` threshold (but interpret with caution) -* Check for batch effects or technical artifacts - -Too Many Significant Results -~~~~~~~~~~~~~~~~~~~~~~~~~~~~ - -* Verify WASP mapping bias correction was applied -* Check for copy number variation (CNV) artifacts -* Use stricter FDR threshold (e.g., 0.01) - -See Also --------- - -* :doc:`/user_guide/counting` - Detailed counting options -* :doc:`/user_guide/analysis` - Statistical methods and parameters -* :doc:`/tutorials/scrna_seq` - Single-cell RNA-seq tutorial -* :doc:`/tutorials/comparative_imbalance` - Comparing ASE between groups diff --git a/docs/source/tutorials/scrna_seq.rst b/docs/source/tutorials/scrna_seq.rst index 1ee8d54b..35bdb071 100644 --- a/docs/source/tutorials/scrna_seq.rst +++ b/docs/source/tutorials/scrna_seq.rst @@ -1,160 +1,22 @@ 10X scRNA-seq Tutorial ====================== -This tutorial walks through a complete WASP2 workflow for detecting allele-specific expression in 10X Genomics single-cell RNA-seq data. +End-to-end allele-specific expression (ASE) workflow for 10X Chromium +scRNA-seq. Assumes a Cell Ranger ``possorted_genome_bam.bam`` with cell +barcodes in the ``CB`` tag, a phased VCF for the donor, and a cell-type +annotation from Seurat or Scanpy. -Overview --------- - -**Goal:** Identify genes with allele-specific expression (ASE) in different cell types from 10X Chromium scRNA-seq data. - -**Input Data:** - -* Cell Ranger output (BAM + filtered barcodes) -* Phased VCF file with heterozygous variants -* Cell type annotations from Seurat/Scanpy - -Prerequisites -------------- - -**Software:** - -* WASP2 (``pip install wasp2``) -* Cell Ranger output (v3+) -* R with Seurat or Python with Scanpy - -**Data:** - -* Aligned BAM file with cell barcodes (CB tag) -* Phased VCF for your sample -* Completed cell type annotation +Inputs +------ -Step 1: Prepare Input Data --------------------------- +- Cell Ranger BAM + index +- Phased VCF/BCF/PGEN for the sample +- A barcode-to-group TSV (see :doc:`/user_guide/single_cell` for Seurat / + Scanpy export code and exact format) +- GTF gene annotation -Cell Ranger Output Structure -~~~~~~~~~~~~~~~~~~~~~~~~~~~~ - -After running Cell Ranger, your output directory contains: - -.. 
code-block:: text - - cellranger_output/ - └── outs/ - ├── possorted_genome_bam.bam - ├── possorted_genome_bam.bam.bai - └── filtered_feature_bc_matrix/ - ├── barcodes.tsv.gz - ├── features.tsv.gz - └── matrix.mtx.gz - -The BAM file contains cell barcodes in the ``CB`` tag: - -.. code-block:: bash - - # View CB tags in BAM - samtools view possorted_genome_bam.bam | head -1 | tr '\t' '\n' | grep CB - # Output: CB:Z:AAACCCAAGAAACACT-1 - -Step 2: Generate Barcode File ------------------------------ - -From Seurat Analysis -~~~~~~~~~~~~~~~~~~~~ - -After running Seurat clustering and annotation: - -.. code-block:: r - - library(Seurat) - - # Load your analyzed Seurat object - seurat_obj <- readRDS("seurat_analyzed.rds") - - # Check available metadata columns - head(seurat_obj@meta.data) - - # Extract barcodes and cell types - barcode_df <- data.frame( - barcode = colnames(seurat_obj), - cell_type = seurat_obj$celltype_annotation # Your annotation column - ) - - # Preview the data - head(barcode_df) - #> barcode cell_type - #> 1 AAACCCAAGAAACACT-1 B_cell - #> 2 AAACCCAAGAAACTGT-1 B_cell - #> 3 AAACCCAAGAAAGCGA-1 CD4_T_cell - - # Write barcode file (no header, tab-separated) - write.table( - barcode_df, - file = "barcodes_celltype.tsv", - sep = "\t", - quote = FALSE, - row.names = FALSE, - col.names = FALSE - ) - -From Scanpy Analysis -~~~~~~~~~~~~~~~~~~~~ - -After running Scanpy clustering and annotation: - -.. code-block:: python - - import scanpy as sc - import pandas as pd - - # Load your analyzed AnnData object - adata = sc.read_h5ad("scanpy_analyzed.h5ad") - - # Check available annotations - print(adata.obs.columns) - - # Extract barcodes and cell types - barcode_df = pd.DataFrame({ - 'barcode': adata.obs_names, - 'cell_type': adata.obs['leiden_annotation'] # Your annotation column - }) - - # Preview the data - print(barcode_df.head()) - # barcode cell_type - # 0 AAACCCAAGAAACACT-1 B_cell - # 1 AAACCCAAGAAACTGT-1 B_cell - # 2 AAACCCAAGAAAGCGA-1 CD4_T_cell - - # Write barcode file (no header, tab-separated) - barcode_df.to_csv( - 'barcodes_celltype.tsv', - sep='\t', - header=False, - index=False - ) - -Verify Barcode Format -~~~~~~~~~~~~~~~~~~~~~ - -.. code-block:: bash - - # Check barcode file format - head barcodes_celltype.tsv - # AAACCCAAGAAACACT-1 B_cell - # AAACCCAAGAAACTGT-1 B_cell - - # Count cells per type - cut -f2 barcodes_celltype.tsv | sort | uniq -c | sort -rn - # 1500 CD4_T_cell - # 1200 B_cell - # 800 Monocyte - # ... - -Step 3: Count Allele-Specific Reads ------------------------------------ - -Run the single-cell allele counting: +Step 1 — Count alleles per cell per gene +----------------------------------------- .. code-block:: bash @@ -166,20 +28,12 @@ Run the single-cell allele counting: --samples SAMPLE_ID \ --out_file allele_counts.h5ad -**Parameters:** +Output: an AnnData ``.h5ad`` with ``ref`` / ``alt`` / ``other`` layers, +genotype columns in ``.obs``, and cell-type assignments in ``.var``. See +:doc:`/user_guide/single_cell` for the full schema. -* ``barcodes_celltype.tsv``: Your barcode file with cell type annotations (positional) -* ``--region``: Gene annotation file (GTF/GFF) or peak file (BED) -* ``--samples``: Sample ID matching VCF sample column -* ``--out_file``: Output AnnData file - -Step 4: Analyze Allelic Imbalance ---------------------------------- - -Cell-Type-Specific Analysis -~~~~~~~~~~~~~~~~~~~~~~~~~~~ - -Analyze imbalance within each cell type: +Step 2 — Per-cell-type imbalance +-------------------------------- .. 
code-block:: bash @@ -189,20 +43,11 @@ Analyze imbalance within each cell type: --sample SAMPLE_ID \ --out_file imbalance_by_celltype.tsv -**Output columns:** - -* ``region``: Gene or genomic region -* ``cell_type``: Cell type from barcode file -* ``ref_count``: Total reference allele counts -* ``alt_count``: Total alternate allele counts -* ``p_value``: Statistical significance -* ``fdr_pval``: FDR-corrected p-value -* ``effect_size``: Log2 fold change +Output columns: ``region``, ``cell_type``, aggregated ``ref_count`` / +``alt_count``, ``pval``, ``fdr_pval``, ``effect_size`` (log₂ ref/alt). -Compare Between Cell Types -~~~~~~~~~~~~~~~~~~~~~~~~~~ - -Find differential allelic imbalance between cell types: +Step 3 — Compare cell types (optional) +-------------------------------------- .. code-block:: bash @@ -212,122 +57,56 @@ Find differential allelic imbalance between cell types: --groups "CD4_T_cell,CD8_T_cell" \ --out_file differential_imbalance.tsv -Step 5: Interpret Results -------------------------- +For more on comparative analysis (multiple groups, all-pairs, volcano +plots), see :doc:`comparative_imbalance`. -Load and explore results in Python: +Interpreting results +-------------------- .. code-block:: python import pandas as pd - import matplotlib.pyplot as plt - # Load results results = pd.read_csv('imbalance_by_celltype.tsv', sep='\t') + sig = results[results['fdr_pval'] < 0.05] - # Filter significant results (FDR < 0.05) - significant = results[results['fdr_pval'] < 0.05] - print(f"Found {len(significant)} significant ASE events") - - # Top genes per cell type - top_genes = (significant - .groupby('cell_type') - .apply(lambda x: x.nsmallest(10, 'fdr_pval')) - .reset_index(drop=True)) - - print(top_genes[['region', 'cell_type', 'effect_size', 'fdr_pval']]) - - # Visualize effect sizes - fig, ax = plt.subplots(figsize=(10, 6)) - significant.boxplot(column='effect_size', by='cell_type', ax=ax) - plt.title('Allelic Imbalance by Cell Type') - plt.ylabel('Log2 Fold Change (Ref/Alt)') - plt.savefig('ase_by_celltype.png') + top = (sig.groupby('cell_type') + .apply(lambda x: x.nsmallest(10, 'fdr_pval')) + .reset_index(drop=True)) -Example Output --------------- - -.. code-block:: text - - region cell_type ref_count alt_count fdr_pval effect_size - ENSG00000123456 B_cell 245 89 0.001 1.46 - ENSG00000234567 CD4_T_cell 156 312 0.003 -1.00 - ENSG00000345678 Monocyte 423 198 0.012 1.09 + print(top[['region', 'cell_type', 'effect_size', 'fdr_pval']]) Troubleshooting --------------- -No Cells Matched -~~~~~~~~~~~~~~~~ - -If you see "0 barcodes matched": - -.. code-block:: bash - - # Check BAM barcode format - samtools view your.bam | head -1000 | grep -o 'CB:Z:[^\t]*' | head - - # Compare with your barcode file - head barcodes.tsv - - # Common issues: - # - Missing -1 suffix in barcode file - # - Barcode file has header (should not) - # - Different barcode versions (v2 vs v3) - -**Quick Diagnostic:** +**Zero barcodes matched.** Confirm barcode format in the BAM vs. the TSV — +the ``CB:Z:...`` tag often has a ``-1`` suffix that your export must match: .. 
code-block:: bash - # Compare BAM barcodes with file - samtools view your.bam | head -10000 | grep -o 'CB:Z:[^\t]*' | cut -d: -f3 | sort -u > bam_bc.txt + samtools view your.bam | head -10000 | grep -o 'CB:Z:[^[:space:]]*' \ + | cut -d: -f3 | sort -u > bam_bc.txt cut -f1 barcodes.tsv | sort -u > file_bc.txt - comm -12 bam_bc.txt file_bc.txt | wc -l # Should be > 0 + comm -12 bam_bc.txt file_bc.txt | wc -l # should be > 0 - # Fix suffix mismatch if needed - awk -F'\t' '{print $1"-1\t"$2}' barcodes_no_suffix.tsv > barcodes.tsv - -Low Read Counts -~~~~~~~~~~~~~~~ - -Single-cell data is sparse. Consider: - -* Using pseudo-bulk aggregation by cell type -* Lowering ``--min`` / ``--min_count`` threshold -* Focusing on highly expressed genes - -Memory Issues -~~~~~~~~~~~~~ - -For large datasets, process chromosomes separately by filtering your region file: +Fix a missing suffix with: .. code-block:: bash - # Process autosomes separately (add chrX, chrY if needed) - for chr in chr{1..22}; do - # Extract regions for this chromosome - grep "^${chr}\t" genes.bed > genes_${chr}.bed - - wasp2-count count-variants-sc \ - sample.bam \ - variants.vcf.gz \ - barcodes.tsv \ - --region genes_${chr}.bed \ - --out_file counts_${chr}.h5ad - done + awk -F'\t' '{print $1"-1\t"$2}' barcodes_no_suffix.tsv > barcodes.tsv -Next Steps ----------- +**Sparse counts.** Single-cell data is sparse. Consider pseudobulk +aggregation by cell type, lower ``--min`` / ``--min_count``, or focus on +highly expressed genes. -* Integrate with eQTL databases (GTEx, eQTLGen) -* Correlate ASE with gene expression levels -* Validate top hits with allele-specific primers -* Compare across conditions or timepoints +**Memory.** For large cohorts, split the region file by chromosome and +process chunks, then concatenate results. See Also -------- -* :doc:`/user_guide/single_cell` - Barcode file format reference -* :doc:`/user_guide/analysis` - Statistical methods -* `Seurat `_ - R toolkit for scRNA-seq -* `Scanpy `_ - Python toolkit for scRNA-seq +- :doc:`/user_guide/single_cell` — barcode file format, Seurat/Scanpy export +- :doc:`/user_guide/analysis` — analysis CLI reference +- :doc:`/methods/statistical_models` — beta-binomial LRT +- :doc:`scatac_workflow` — sibling tutorial for scATAC-seq +- :doc:`comparative_imbalance` — comparing groups diff --git a/docs/source/tutorials/statistical_methods_tutorial.ipynb b/docs/source/tutorials/statistical_methods_tutorial.ipynb deleted file mode 100644 index b7e6c8f0..00000000 --- a/docs/source/tutorials/statistical_methods_tutorial.ipynb +++ /dev/null @@ -1,1516 +0,0 @@ -{ - "cells": [ - { - "cell_type": "markdown", - "id": "intro-header", - "metadata": {}, - "source": [ - "# Statistical Methods Tutorial: Understanding WASP2's Beta-Binomial Framework\n", - "\n", - "**Estimated time:** ~45 minutes\n", - "\n", - "This interactive tutorial provides a deep dive into the statistical methods used by WASP2 for detecting allelic imbalance. You will learn:\n", - "\n", - "1. **Why overdispersion matters** - Why the simple binomial model fails for sequencing data\n", - "2. **Beta-binomial distributions** - The statistical foundation of WASP2\n", - "3. **Dispersion estimation** - MLE vs Method of Moments approaches\n", - "4. **QQ plots** - Visualizing model fit and calibration\n", - "5. 
**FDR correction** - Benjamini-Hochberg and alternatives\n", - "\n", - "## Prerequisites\n", - "\n", - "This tutorial assumes familiarity with:\n", - "- Basic probability distributions (binomial)\n", - "- Hypothesis testing concepts (p-values)\n", - "- Python data analysis (numpy, pandas, matplotlib)\n", - "\n", - "No prior knowledge of beta-binomial distributions or overdispersion is required.\n", - "\n", - "## Relationship to WASP2 Source Code\n", - "\n", - "The functions in this tutorial mirror the implementations in:\n", - "- `src/analysis/as_analysis.py` - Core statistical functions (`clamp_rho`, `opt_prob`, `opt_linear`)\n", - "- `src/analysis/compare_ai.py` - Comparative analysis between groups\n", - "- `src/analysis/as_analysis_sc.py` - Single-cell extensions" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "setup-imports", - "metadata": {}, - "outputs": [], - "source": [ - "import warnings\n", - "\n", - "import matplotlib.pyplot as plt\n", - "import numpy as np\n", - "import pandas as pd\n", - "from numpy.typing import NDArray\n", - "from scipy import stats\n", - "from scipy.optimize import minimize, minimize_scalar\n", - "from scipy.special import expit\n", - "from scipy.stats import betabinom, binom, chi2, false_discovery_control\n", - "\n", - "# Configure plotting\n", - "plt.style.use(\"seaborn-v0_8-whitegrid\")\n", - "plt.rcParams[\"figure.figsize\"] = (10, 6)\n", - "plt.rcParams[\"font.size\"] = 11\n", - "np.random.seed(42)\n", - "\n", - "print(\"Statistical Methods Tutorial\")\n", - "print(\"=\" * 40)\n", - "print(f\"NumPy version: {np.__version__}\")\n", - "print(f\"SciPy version: {stats.scipy.__version__ if hasattr(stats, 'scipy') else 'N/A'}\")" - ] - }, - { - "cell_type": "markdown", - "id": "constants-header", - "metadata": {}, - "source": [ - "## Critical Constants and Helper Functions\n", - "\n", - "WASP2 defines critical numerical constants to prevent division by zero and numerical overflow. These are defined in `src/analysis/as_analysis.py` (Issue #228).\n", - "\n", - "**Why this matters:** The beta-binomial parameterization uses $\\alpha = \\mu \\cdot \\frac{1-\\rho}{\\rho}$, which:\n", - "- Causes **division by zero** when $\\rho = 0$\n", - "- Produces **zero alpha/beta** when $\\rho = 1$\n", - "\n", - "WASP2 clamps $\\rho$ to $(\\epsilon, 1-\\epsilon)$ where $\\epsilon = 10^{-10}$." 
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "constants-definition", - "metadata": {}, - "outputs": [], - "source": [ - "# =============================================================================\n", - "# BETA-BINOMIAL RHO PARAMETER BOUNDS (Issue #228)\n", - "# =============================================================================\n", - "# Matches WASP2's src/analysis/as_analysis.py:33\n", - "# =============================================================================\n", - "\n", - "RHO_EPSILON: float = 1e-10\n", - "\n", - "\n", - "def clamp_rho(rho: float | NDArray[np.float64]) -> float | NDArray[np.float64]:\n", - " \"\"\"\n", - " Clamp dispersion parameter rho to safe range (epsilon, 1-epsilon).\n", - "\n", - " This function mirrors WASP2's src/analysis/as_analysis.py:36-50.\n", - "\n", - " The beta-binomial parameterization uses alpha = mu * (1-rho) / rho, which\n", - " causes division by zero when rho=0 and produces zero alpha/beta when rho=1.\n", - " This function prevents these boundary issues.\n", - "\n", - " Args:\n", - " rho: Dispersion parameter (scalar or array), expected in [0, 1]\n", - "\n", - " Returns:\n", - " Clamped rho in range (RHO_EPSILON, 1 - RHO_EPSILON)\n", - "\n", - " Example:\n", - " >>> clamp_rho(0.0) # Would cause division by zero\n", - " 1e-10\n", - " >>> clamp_rho(1.0) # Would produce zero alpha/beta\n", - " 0.9999999999\n", - " \"\"\"\n", - " return np.clip(rho, RHO_EPSILON, 1.0 - RHO_EPSILON)\n", - "\n", - "\n", - "# Demonstrate the importance of clamping\n", - "print(\"Demonstrating rho clamping (Issue #228):\")\n", - "print(f\" RHO_EPSILON = {RHO_EPSILON}\")\n", - "print(f\" clamp_rho(0.0) = {clamp_rho(0.0)}\")\n", - "print(f\" clamp_rho(1.0) = {clamp_rho(1.0)}\")\n", - "print(f\" clamp_rho(0.05) = {clamp_rho(0.05)} # No change needed\")" - ] - }, - { - "cell_type": "markdown", - "id": "section1-header", - "metadata": {}, - "source": [ - "---\n", - "\n", - "## Section 1: Understanding Overdispersion\n", - "\n", - "### 1.1 The Naive Binomial Model\n", - "\n", - "When counting alleles at a heterozygous SNP, the simplest model assumes each read is an independent coin flip with probability 0.5 of coming from each allele:\n", - "\n", - "$$X \\sim \\text{Binomial}(N, p=0.5)$$\n", - "\n", - "where $X$ is the reference allele count and $N$ is the total read count.\n", - "\n", - "**Expected variance:** $\\text{Var}(X) = N \\cdot p \\cdot (1-p) = N/4$\n", - "\n", - "Let's simulate what this looks like:" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "binomial-simulation", - "metadata": {}, - "outputs": [], - "source": [ - "# Simulate ideal binomial data (balanced, no overdispersion)\n", - "n_snps = 1000\n", - "read_depth = 50 # Total reads per SNP\n", - "\n", - "# Input validation\n", - "assert n_snps > 0, \"n_snps must be positive\"\n", - "assert read_depth > 0, \"read_depth must be positive\"\n", - "\n", - "# Perfect binomial sampling\n", - "ideal_ref_counts = np.random.binomial(n=read_depth, p=0.5, size=n_snps)\n", - "ideal_ratios = ideal_ref_counts / read_depth\n", - "\n", - "# Calculate observed vs expected variance\n", - "expected_var = read_depth * 0.5 * 0.5\n", - "observed_var = np.var(ideal_ref_counts)\n", - "\n", - "print(f\"Binomial Model (N={read_depth}, p=0.5)\")\n", - "print(f\" Expected variance: {expected_var:.2f}\")\n", - "print(f\" Observed variance: {observed_var:.2f}\")\n", - "print(f\" Ratio (observed/expected): {observed_var / expected_var:.3f}\")" - ] - }, - { - "cell_type": 
"markdown", - "id": "overdispersion-intro", - "metadata": {}, - "source": [ - "### 1.2 The Problem: Real Data Shows Overdispersion\n", - "\n", - "In real sequencing data, the observed variance is typically **larger** than expected from a binomial model. This is called **overdispersion**.\n", - "\n", - "**Sources of overdispersion in allele counting:**\n", - "1. **PCR amplification bias** - Some fragments amplify better than others\n", - "2. **Library preparation effects** - Batch effects during sample prep\n", - "3. **Technical variability** - Across lanes, flowcells, and sequencing runs\n", - "4. **Mapping bias** - Reference allele may map slightly better (even after WASP correction)\n", - "\n", - "Let's simulate data with overdispersion:" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "overdispersed-simulation", - "metadata": {}, - "outputs": [], - "source": [ - "def simulate_overdispersed(n_snps: int, read_depth: int, rho: float = 0.05) -> NDArray[np.int64]:\n", - " \"\"\"\n", - " Simulate overdispersed allele counts using beta-binomial.\n", - "\n", - " Args:\n", - " n_snps: Number of SNPs to simulate\n", - " read_depth: Total read depth per SNP\n", - " rho: Dispersion parameter (0 = binomial, higher = more overdispersed)\n", - "\n", - " Returns:\n", - " Array of reference allele counts\n", - "\n", - " Raises:\n", - " ValueError: If parameters are out of valid range\n", - " \"\"\"\n", - " # Input validation\n", - " if n_snps <= 0:\n", - " raise ValueError(f\"n_snps must be positive, got {n_snps}\")\n", - " if read_depth <= 0:\n", - " raise ValueError(f\"read_depth must be positive, got {read_depth}\")\n", - " if not 0 < rho < 1:\n", - " raise ValueError(f\"rho must be in (0, 1), got {rho}\")\n", - "\n", - " mu = 0.5 # Balanced (no true imbalance)\n", - " rho = clamp_rho(rho) # Apply WASP2's clamping\n", - "\n", - " alpha = mu * (1 - rho) / rho\n", - " beta = (1 - mu) * (1 - rho) / rho\n", - "\n", - " return betabinom.rvs(n=read_depth, a=alpha, b=beta, size=n_snps)\n", - "\n", - "\n", - "# Simulate with different dispersion levels\n", - "rho_values = [0.001, 0.02, 0.05, 0.10]\n", - "overdispersed_data = {rho: simulate_overdispersed(n_snps, read_depth, rho) for rho in rho_values}\n", - "\n", - "# Compare variances\n", - "print(f\"Expected binomial variance: {expected_var:.2f}\")\n", - "print(\"\\nObserved variances by dispersion (rho):\")\n", - "for rho, data in overdispersed_data.items():\n", - " obs_var = np.var(data)\n", - " print(f\" rho={rho:.3f}: variance={obs_var:.2f} (ratio={obs_var / expected_var:.2f}x)\")" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "overdispersion-visual", - "metadata": {}, - "outputs": [], - "source": [ - "# Visualize the effect of overdispersion\n", - "fig, axes = plt.subplots(2, 2, figsize=(12, 10))\n", - "\n", - "for idx, (rho, data) in enumerate(overdispersed_data.items()):\n", - " ax = axes.flat[idx]\n", - " ratios = data / read_depth\n", - "\n", - " # Histogram of observed ratios\n", - " ax.hist(\n", - " ratios,\n", - " bins=30,\n", - " density=True,\n", - " alpha=0.7,\n", - " color=\"steelblue\",\n", - " edgecolor=\"black\",\n", - " label=\"Observed\",\n", - " )\n", - "\n", - " # Overlay expected binomial distribution\n", - " x = np.arange(0, read_depth + 1)\n", - " binomial_pmf = binom.pmf(x, read_depth, 0.5)\n", - " ax.plot(x / read_depth, binomial_pmf * read_depth, \"r-\", lw=2, label=\"Expected (Binomial)\")\n", - "\n", - " obs_var = np.var(data)\n", - " ax.set_title(f\"rho = {rho:.3f}\\nVariance 
ratio: {obs_var / expected_var:.2f}x\", fontsize=11)\n", - " ax.set_xlabel(\"Reference Allele Frequency\")\n", - " ax.set_ylabel(\"Density\")\n", - " ax.legend(fontsize=9)\n", - " ax.set_xlim(0, 1)\n", - "\n", - "plt.suptitle(\"Effect of Overdispersion on Allele Ratio Distributions\", fontsize=14, y=1.02)\n", - "plt.tight_layout()\n", - "plt.show()\n", - "\n", - "print(\"\\nKey observation: As rho increases, the distribution becomes wider\")\n", - "print(\"than expected from binomial sampling alone.\")" - ] - }, - { - "cell_type": "markdown", - "id": "consequences-header", - "metadata": {}, - "source": [ - "### 1.3 Consequences of Ignoring Overdispersion\n", - "\n", - "If we use a binomial model on overdispersed data, we will:\n", - "\n", - "1. **Underestimate variance** → p-values too small\n", - "2. **Inflate false positive rate** → Many spurious \"significant\" results\n", - "3. **Poor calibration** → QQ plots show massive inflation\n", - "\n", - "Let's demonstrate this problem:" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "false-positive-demo", - "metadata": {}, - "outputs": [], - "source": [ - "def binomial_test_pvalue(ref_count: int, total_count: int) -> float:\n", - " \"\"\"\n", - " Two-sided binomial test for deviation from 0.5.\n", - "\n", - " Args:\n", - " ref_count: Number of reference allele reads\n", - " total_count: Total number of reads\n", - "\n", - " Returns:\n", - " Two-sided p-value\n", - " \"\"\"\n", - " # Input validation\n", - " if ref_count < 0 or ref_count > total_count:\n", - " raise ValueError(f\"Invalid counts: ref={ref_count}, total={total_count}\")\n", - " if total_count <= 0:\n", - " raise ValueError(f\"total_count must be positive, got {total_count}\")\n", - "\n", - " result = stats.binomtest(ref_count, total_count, p=0.5, alternative=\"two-sided\")\n", - " return result.pvalue\n", - "\n", - "\n", - "# Test on overdispersed data (rho=0.05) - no TRUE imbalance!\n", - "test_data = overdispersed_data[0.05]\n", - "pvalues_binomial = [binomial_test_pvalue(int(k), read_depth) for k in test_data]\n", - "\n", - "# Count \"significant\" results at different thresholds\n", - "pvalues_binomial = np.array(pvalues_binomial)\n", - "\n", - "print(\"False positive rates using binomial test on overdispersed data:\")\n", - "print(\"(Remember: there is NO true imbalance in this data!)\\n\")\n", - "for alpha in [0.05, 0.01, 0.001]:\n", - " fp_rate = (pvalues_binomial < alpha).mean()\n", - " print(f\" Alpha = {alpha:.3f}: {fp_rate * 100:.1f}% significant (expected: {alpha * 100:.1f}%)\")\n", - "\n", - "print(f\"\\nThis represents a {(pvalues_binomial < 0.05).mean() / 0.05:.1f}x inflation!\")" - ] - }, - { - "cell_type": "markdown", - "id": "section2-header", - "metadata": {}, - "source": [ - "---\n", - "\n", - "## Section 2: The Beta-Binomial Distribution\n", - "\n", - "### 2.1 Mathematical Foundation\n", - "\n", - "The **beta-binomial distribution** extends the binomial by allowing the success probability $p$ to vary according to a Beta distribution:\n", - "\n", - "$$p \\sim \\text{Beta}(\\alpha, \\beta)$$\n", - "$$X | p \\sim \\text{Binomial}(N, p)$$\n", - "\n", - "The marginal distribution of $X$ is the beta-binomial.\n", - "\n", - "**WASP2's parameterization** uses mean ($\\mu$) and dispersion ($\\rho$):\n", - "\n", - "$$\\alpha = \\mu \\cdot \\frac{1 - \\rho}{\\rho}, \\quad \\beta = (1 - \\mu) \\cdot \\frac{1 - \\rho}{\\rho}$$\n", - "\n", - "**Key properties:**\n", - "- Mean: $E[X] = N \\cdot \\mu$ (same as binomial)\n", - "- Variance: 
$\\text{Var}(X) = N \\cdot \\mu \\cdot (1-\\mu) \\cdot [1 + (N-1) \\cdot \\rho]$ (inflated!)\n", - "\n", - "When $\\rho \\to 0$, the beta-binomial converges to the binomial." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "betabinom-params", - "metadata": {}, - "outputs": [], - "source": [ - "def mu_rho_to_alpha_beta(mu: float, rho: float) -> tuple[float, float]:\n", - " \"\"\"\n", - " Convert WASP2's (mu, rho) parameterization to (alpha, beta).\n", - "\n", - " This mirrors WASP2's parameterization in src/analysis/as_analysis.py:104-105.\n", - "\n", - " Args:\n", - " mu: Mean parameter (allele frequency), in (0, 1)\n", - " rho: Dispersion parameter, in (0, 1)\n", - "\n", - " Returns:\n", - " Tuple of (alpha, beta) parameters for scipy.stats.betabinom\n", - "\n", - " Warning:\n", - " When rho is near 0 or 1, numerical instability can occur.\n", - " Use clamp_rho() to ensure safe values.\n", - " \"\"\"\n", - " # Apply WASP2's clamping to prevent numerical issues\n", - " rho = clamp_rho(rho)\n", - "\n", - " # Validate mu\n", - " if not 0 < mu < 1:\n", - " raise ValueError(f\"mu must be in (0, 1), got {mu}\")\n", - "\n", - " alpha = mu * (1 - rho) / rho\n", - " beta = (1 - mu) * (1 - rho) / rho\n", - " return alpha, beta\n", - "\n", - "\n", - "def betabinom_variance(n: int, mu: float, rho: float) -> float:\n", - " \"\"\"\n", - " Compute beta-binomial variance.\n", - "\n", - " Args:\n", - " n: Number of trials (total read count)\n", - " mu: Mean parameter (allele frequency)\n", - " rho: Dispersion parameter\n", - "\n", - " Returns:\n", - " Variance of the beta-binomial distribution\n", - " \"\"\"\n", - " return n * mu * (1 - mu) * (1 + (n - 1) * rho)\n", - "\n", - "\n", - "# Demonstrate variance inflation\n", - "N = 50\n", - "mu = 0.5\n", - "binomial_var = N * mu * (1 - mu)\n", - "\n", - "print(f\"Variance comparison (N={N}, mu={mu}):\\n\")\n", - "print(f\"Binomial variance: {binomial_var:.2f}\")\n", - "print(\"\\nBeta-binomial variance by rho:\")\n", - "for rho in [0.001, 0.01, 0.05, 0.10, 0.20]:\n", - " bb_var = betabinom_variance(N, mu, rho)\n", - " inflation = bb_var / binomial_var\n", - " print(f\" rho={rho:.3f}: {bb_var:.2f} ({inflation:.2f}x inflation)\")" - ] - }, - { - "cell_type": "markdown", - "id": "edge-cases-header", - "metadata": {}, - "source": [ - "### 2.1.1 Edge Cases and Numerical Stability\n", - "\n", - "**Warning:** The beta-binomial parameterization has dangerous edge cases that can cause numerical errors. This section demonstrates why WASP2's `clamp_rho()` function is essential." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "edge-cases-demo", - "metadata": {}, - "outputs": [], - "source": [ - "# Demonstrate edge cases that clamp_rho() prevents\n", - "print(\"Edge Cases in Beta-Binomial Parameterization\")\n", - "print(\"=\" * 50)\n", - "\n", - "\n", - "def unsafe_mu_rho_to_alpha_beta(mu, rho):\n", - " \"\"\"Unsafe version WITHOUT clamping - for demonstration only.\"\"\"\n", - " alpha = mu * (1 - rho) / rho\n", - " beta = (1 - mu) * (1 - rho) / rho\n", - " return alpha, beta\n", - "\n", - "\n", - "# Case 1: rho = 0 causes division by zero\n", - "print(\"\\n1. 
rho = 0 (division by zero):\")\n", - "try:\n", - " with warnings.catch_warnings():\n", - " warnings.simplefilter(\"ignore\")\n", - " alpha, beta = unsafe_mu_rho_to_alpha_beta(0.5, 0.0)\n", - " print(f\" alpha = {alpha}, beta = {beta}\")\n", - " print(\" Result: inf values - causes downstream errors!\")\n", - "except Exception as e:\n", - " print(f\" Error: {e}\")\n", - "\n", - "# Case 2: rho = 1 produces zero alpha/beta\n", - "print(\"\\n2. rho = 1 (zero parameters):\")\n", - "alpha, beta = unsafe_mu_rho_to_alpha_beta(0.5, 1.0)\n", - "print(f\" alpha = {alpha}, beta = {beta}\")\n", - "print(\" Result: zero values - invalid for Beta distribution!\")\n", - "\n", - "# Case 3: Very small rho produces huge alpha/beta\n", - "print(\"\\n3. rho = 1e-15 (numerical overflow risk):\")\n", - "alpha, beta = unsafe_mu_rho_to_alpha_beta(0.5, 1e-15)\n", - "print(f\" alpha = {alpha:.2e}, beta = {beta:.2e}\")\n", - "print(\" Result: huge values - may overflow in gamma functions!\")\n", - "\n", - "# Safe version with clamping\n", - "print(\"\\n\" + \"=\" * 50)\n", - "print(\"With WASP2's clamp_rho():\")\n", - "print(\"=\" * 50)\n", - "\n", - "for rho_input in [0.0, 1.0, 1e-15]:\n", - " rho_safe = clamp_rho(rho_input)\n", - " alpha, beta = mu_rho_to_alpha_beta(0.5, rho_safe)\n", - " print(f\"\\nInput rho={rho_input} -> clamped to {rho_safe}\")\n", - " print(f\" alpha = {alpha:.4f}, beta = {beta:.4f}\")" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "betabinom-visual", - "metadata": {}, - "outputs": [], - "source": [ - "# Visualize beta-binomial vs binomial PMFs\n", - "fig, axes = plt.subplots(1, 3, figsize=(14, 4))\n", - "\n", - "N = 50\n", - "x = np.arange(0, N + 1)\n", - "\n", - "# Panel 1: Different rho values (mu=0.5)\n", - "ax = axes[0]\n", - "ax.plot(x, binom.pmf(x, N, 0.5), \"k-\", lw=2, label=\"Binomial\")\n", - "for rho, color in [(0.02, \"blue\"), (0.05, \"orange\"), (0.10, \"red\")]:\n", - " alpha, beta = mu_rho_to_alpha_beta(0.5, rho)\n", - " ax.plot(x, betabinom.pmf(x, N, alpha, beta), \"--\", lw=2, color=color, label=f\"rho={rho}\")\n", - "ax.set_xlabel(\"Reference Count\")\n", - "ax.set_ylabel(\"Probability\")\n", - "ax.set_title(\"Effect of Dispersion (mu=0.5)\")\n", - "ax.legend()\n", - "\n", - "# Panel 2: Different mu values (rho=0.05)\n", - "ax = axes[1]\n", - "rho = 0.05\n", - "for mu, color in [(0.5, \"gray\"), (0.6, \"blue\"), (0.7, \"orange\"), (0.8, \"red\")]:\n", - " alpha, beta = mu_rho_to_alpha_beta(mu, rho)\n", - " ax.plot(x, betabinom.pmf(x, N, alpha, beta), \"-\", lw=2, color=color, label=f\"mu={mu}\")\n", - "ax.set_xlabel(\"Reference Count\")\n", - "ax.set_ylabel(\"Probability\")\n", - "ax.set_title(f\"Effect of Imbalance (rho={rho})\")\n", - "ax.legend()\n", - "\n", - "# Panel 3: Log-scale tails comparison\n", - "ax = axes[2]\n", - "ax.semilogy(x, binom.pmf(x, N, 0.5), \"k-\", lw=2, label=\"Binomial\")\n", - "for rho, color in [(0.02, \"blue\"), (0.10, \"red\")]:\n", - " alpha, beta = mu_rho_to_alpha_beta(0.5, rho)\n", - " ax.semilogy(x, betabinom.pmf(x, N, alpha, beta), \"--\", lw=2, color=color, label=f\"rho={rho}\")\n", - "ax.set_xlabel(\"Reference Count\")\n", - "ax.set_ylabel(\"Probability (log scale)\")\n", - "ax.set_title(\"Tail Behavior (log scale)\")\n", - "ax.legend()\n", - "ax.set_ylim(1e-15, 1)\n", - "\n", - "plt.tight_layout()\n", - "plt.show()\n", - "\n", - "print(\"Key insight: Beta-binomial has heavier tails than binomial,\")\n", - "print(\"making extreme counts more likely under the null hypothesis.\")" - ] - }, - { - "cell_type": 
"markdown", - "id": "lrt-header", - "metadata": {}, - "source": [ - "### 2.2 Likelihood Ratio Test for Imbalance\n", - "\n", - "WASP2 uses a **likelihood ratio test (LRT)** to detect allelic imbalance:\n", - "\n", - "- **Null hypothesis ($H_0$):** $\\mu = 0.5$ (balanced)\n", - "- **Alternative ($H_1$):** $\\mu \\neq 0.5$ (imbalanced)\n", - "\n", - "The test statistic is:\n", - "\n", - "$$\\Lambda = -2 \\left[ \\log L(H_0) - \\log L(H_1) \\right] \\sim \\chi^2_1$$\n", - "\n", - "This follows a chi-squared distribution with 1 degree of freedom under $H_0$.\n", - "\n", - "**WASP2 Implementation:** See `src/analysis/as_analysis.py:322-323`" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "lrt-implementation", - "metadata": {}, - "outputs": [], - "source": [ - "def betabinom_loglik(ref_count: int, total_count: int, mu: float, rho: float) -> float:\n", - " \"\"\"\n", - " Compute beta-binomial log-likelihood.\n", - "\n", - " Mirrors WASP2's src/analysis/as_analysis.py:81-112 (opt_prob function).\n", - "\n", - " Args:\n", - " ref_count: Reference allele count\n", - " total_count: Total read count\n", - " mu: Mean parameter (allele frequency)\n", - " rho: Dispersion parameter\n", - "\n", - " Returns:\n", - " Log-likelihood value\n", - " \"\"\"\n", - " alpha, beta = mu_rho_to_alpha_beta(mu, rho)\n", - " return betabinom.logpmf(ref_count, total_count, alpha, beta)\n", - "\n", - "\n", - "def find_mle_mu(ref_count: int, total_count: int, rho: float) -> float:\n", - " \"\"\"\n", - " Find MLE of mu given fixed rho.\n", - "\n", - " Mirrors WASP2's src/analysis/as_analysis.py:184-248 (parse_opt function).\n", - "\n", - " Args:\n", - " ref_count: Reference allele count\n", - " total_count: Total read count\n", - " rho: Fixed dispersion parameter\n", - "\n", - " Returns:\n", - " Maximum likelihood estimate of mu\n", - " \"\"\"\n", - "\n", - " def neg_loglik(mu):\n", - " return -betabinom_loglik(ref_count, total_count, mu, rho)\n", - "\n", - " # Use bounded optimization with safe mu range\n", - " result = minimize_scalar(neg_loglik, bounds=(0.001, 0.999), method=\"bounded\")\n", - " return result.x\n", - "\n", - "\n", - "def likelihood_ratio_test(\n", - " ref_count: int, total_count: int, rho: float\n", - ") -> tuple[float, float, float]:\n", - " \"\"\"\n", - " Perform LRT for allelic imbalance.\n", - "\n", - " Mirrors WASP2's calculation in src/analysis/as_analysis.py:322-323.\n", - "\n", - " Args:\n", - " ref_count: Reference allele count\n", - " total_count: Total read count\n", - " rho: Dispersion parameter\n", - "\n", - " Returns:\n", - " Tuple of (lrt_statistic, p_value, mle_mu)\n", - " \"\"\"\n", - " # Input validation\n", - " if ref_count < 0 or ref_count > total_count:\n", - " raise ValueError(f\"Invalid: ref_count={ref_count}, total_count={total_count}\")\n", - " if total_count <= 0:\n", - " raise ValueError(f\"total_count must be positive, got {total_count}\")\n", - "\n", - " # Null likelihood (mu = 0.5)\n", - " null_ll = betabinom_loglik(ref_count, total_count, 0.5, rho)\n", - "\n", - " # Alternative likelihood (mu = MLE)\n", - " mle_mu = find_mle_mu(ref_count, total_count, rho)\n", - " alt_ll = betabinom_loglik(ref_count, total_count, mle_mu, rho)\n", - "\n", - " # LRT statistic (matches WASP2: -2 * (null_ll - alt_ll))\n", - " lrt = -2 * (null_ll - alt_ll)\n", - " lrt = max(0, lrt) # Ensure non-negative due to numerical precision\n", - "\n", - " # P-value from chi-squared distribution (df=1)\n", - " pvalue = chi2.sf(lrt, df=1)\n", - "\n", - " return lrt, pvalue, mle_mu\n", 
- "\n", - "\n", - "# Example calculation\n", - "ref, total = 35, 50 # Observed: 35 ref out of 50 total\n", - "rho = 0.05\n", - "\n", - "lrt, pval, mu_hat = likelihood_ratio_test(ref, total, rho)\n", - "\n", - "print(\"Example LRT calculation:\")\n", - "print(f\" Observed: {ref} ref / {total} total (ratio = {ref / total:.2f})\")\n", - "print(f\" Dispersion (rho): {rho}\")\n", - "print(f\" MLE of mu: {mu_hat:.3f}\")\n", - "print(f\" LRT statistic: {lrt:.3f}\")\n", - "print(f\" P-value: {pval:.4f}\")" - ] - }, - { - "cell_type": "markdown", - "id": "section3-header", - "metadata": {}, - "source": [ - "---\n", - "\n", - "## Section 3: Dispersion Estimation\n", - "\n", - "A critical step is estimating the dispersion parameter $\\rho$ from the data. WASP2 offers two approaches:\n", - "\n", - "1. **Single dispersion model** - One $\\rho$ for the entire dataset\n", - "2. **Linear dispersion model** - $\\rho$ varies with read depth: $\\text{logit}(\\rho) = \\beta_0 + \\beta_1 \\cdot N$\n", - "\n", - "### 3.1 Maximum Likelihood Estimation (MLE)\n", - "\n", - "MLE finds the $\\rho$ that maximizes the likelihood under $H_0$ (balanced).\n", - "\n", - "**WASP2 Implementation:** See `src/analysis/as_analysis.py:251-325` (single_model function)" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "mle-dispersion", - "metadata": {}, - "outputs": [], - "source": [ - "def estimate_dispersion_mle(\n", - " ref_counts: NDArray[np.int64], total_counts: NDArray[np.int64]\n", - ") -> float:\n", - " \"\"\"\n", - " Estimate single dispersion parameter via MLE.\n", - "\n", - " Mirrors WASP2's single_model in src/analysis/as_analysis.py:251-325.\n", - " Assumes mu=0.5 (null hypothesis) for all observations.\n", - "\n", - " Args:\n", - " ref_counts: Array of reference allele counts\n", - " total_counts: Array of total read counts\n", - "\n", - " Returns:\n", - " MLE estimate of dispersion parameter rho\n", - "\n", - " Raises:\n", - " ValueError: If arrays have different lengths or invalid values\n", - " \"\"\"\n", - " # Input validation\n", - " ref_counts = np.asarray(ref_counts)\n", - " total_counts = np.asarray(total_counts)\n", - "\n", - " if len(ref_counts) != len(total_counts):\n", - " raise ValueError(\"ref_counts and total_counts must have same length\")\n", - " if len(ref_counts) == 0:\n", - " raise ValueError(\"Arrays must not be empty\")\n", - " if np.any(ref_counts < 0) or np.any(ref_counts > total_counts):\n", - " raise ValueError(\"Invalid counts: ref_count must be in [0, total_count]\")\n", - "\n", - " def neg_loglik(rho):\n", - " rho = clamp_rho(rho) # Apply WASP2's clamping\n", - " alpha, beta = mu_rho_to_alpha_beta(0.5, rho)\n", - " ll = np.sum(betabinom.logpmf(ref_counts, total_counts, alpha, beta))\n", - " return -ll\n", - "\n", - " result = minimize_scalar(\n", - " neg_loglik,\n", - " bounds=(RHO_EPSILON, 0.5), # Use RHO_EPSILON as lower bound\n", - " method=\"bounded\",\n", - " )\n", - " return result.x\n", - "\n", - "\n", - "# Estimate dispersion from our simulated data\n", - "true_rho = 0.05\n", - "test_data = overdispersed_data[true_rho]\n", - "total_counts = np.full(len(test_data), read_depth)\n", - "\n", - "estimated_rho = estimate_dispersion_mle(test_data, total_counts)\n", - "\n", - "print(\"Dispersion estimation (MLE):\")\n", - "print(f\" True rho: {true_rho:.4f}\")\n", - "print(f\" Estimated rho: {estimated_rho:.4f}\")\n", - "print(f\" Relative error: {abs(estimated_rho - true_rho) / true_rho * 100:.1f}%\")" - ] - }, - { - "cell_type": "markdown", - "id": "mom-header", - 
"metadata": {}, - "source": [ - "### 3.2 Method of Moments (MoM)\n", - "\n", - "An alternative is the **Method of Moments**, which matches the observed variance to the expected beta-binomial variance:\n", - "\n", - "$$\\hat{\\rho}_{\\text{MoM}} = \\frac{\\text{Var}(X/N) - \\mu(1-\\mu)/\\bar{N}}{\\mu(1-\\mu)(1 - 1/\\bar{N})}$$\n", - "\n", - "MoM is faster but may be less efficient than MLE." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "mom-dispersion", - "metadata": {}, - "outputs": [], - "source": [ - "def estimate_dispersion_mom(\n", - " ref_counts: NDArray[np.int64], total_counts: NDArray[np.int64]\n", - ") -> float:\n", - " \"\"\"\n", - " Estimate dispersion using Method of Moments.\n", - "\n", - " Args:\n", - " ref_counts: Array of reference allele counts\n", - " total_counts: Array of total read counts\n", - "\n", - " Returns:\n", - " MoM estimate of dispersion parameter rho\n", - " \"\"\"\n", - " # Input validation\n", - " ref_counts = np.asarray(ref_counts)\n", - " total_counts = np.asarray(total_counts)\n", - "\n", - " if len(ref_counts) != len(total_counts):\n", - " raise ValueError(\"Arrays must have same length\")\n", - "\n", - " ratios = ref_counts / total_counts\n", - " mu = 0.5 # Assume balanced under null\n", - "\n", - " # Observed variance of ratios\n", - " obs_var = np.var(ratios)\n", - "\n", - " # Expected binomial variance (if rho=0)\n", - " mean_n = np.mean(total_counts)\n", - " binom_var = mu * (1 - mu) / mean_n\n", - "\n", - " # Solve for rho with safeguards\n", - " numerator = obs_var - binom_var\n", - " denominator = mu * (1 - mu) * (1 - 1 / mean_n)\n", - "\n", - " # Handle edge case where denominator is near zero\n", - " if abs(denominator) < 1e-10:\n", - " return RHO_EPSILON\n", - "\n", - " rho_mom = numerator / denominator\n", - "\n", - " # Clamp to valid range using WASP2's constant\n", - " return float(clamp_rho(rho_mom))\n", - "\n", - "\n", - "# Compare MLE vs MoM\n", - "rho_mom = estimate_dispersion_mom(test_data, total_counts)\n", - "\n", - "print(\"Comparison of estimation methods:\")\n", - "print(f\" True rho: {true_rho:.4f}\")\n", - "print(f\" MLE estimate: {estimated_rho:.4f}\")\n", - "print(f\" MoM estimate: {rho_mom:.4f}\")" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "mle-mom-comparison", - "metadata": {}, - "outputs": [], - "source": [ - "# Systematic comparison across different true rho values\n", - "comparison_results = []\n", - "\n", - "for true_rho in [0.01, 0.02, 0.05, 0.10, 0.15]:\n", - " # Simulate data\n", - " alpha, beta = mu_rho_to_alpha_beta(0.5, true_rho)\n", - " sim_data = betabinom.rvs(n=50, a=alpha, b=beta, size=1000)\n", - " sim_totals = np.full(1000, 50)\n", - "\n", - " # Estimate with both methods\n", - " mle_est = estimate_dispersion_mle(sim_data, sim_totals)\n", - " mom_est = estimate_dispersion_mom(sim_data, sim_totals)\n", - "\n", - " comparison_results.append(\n", - " {\n", - " \"true_rho\": true_rho,\n", - " \"mle_estimate\": mle_est,\n", - " \"mom_estimate\": mom_est,\n", - " \"mle_error\": abs(mle_est - true_rho) / true_rho * 100,\n", - " \"mom_error\": abs(mom_est - true_rho) / true_rho * 100,\n", - " }\n", - " )\n", - "\n", - "comparison_df = pd.DataFrame(comparison_results)\n", - "print(\"MLE vs MoM Comparison:\")\n", - "print(comparison_df.round(4).to_string(index=False))" - ] - }, - { - "cell_type": "markdown", - "id": "linear-model-header", - "metadata": {}, - "source": [ - "### 3.3 Linear Dispersion Model\n", - "\n", - "In some datasets, dispersion varies with read 
depth. The **linear model** parameterizes:\n", - "\n", - "$$\\text{logit}(\\rho_i) = \\beta_0 + \\beta_1 \\cdot N_i$$\n", - "\n", - "This allows higher-depth regions to have different dispersion than lower-depth regions.\n", - "\n", - "**WASP2 Implementation:** See `src/analysis/as_analysis.py:53-78` (opt_linear function)" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "linear-model-demo", - "metadata": {}, - "outputs": [], - "source": [ - "def estimate_linear_dispersion(\n", - " ref_counts: NDArray[np.int64], total_counts: NDArray[np.int64]\n", - ") -> tuple[float, float]:\n", - " \"\"\"\n", - " Estimate depth-dependent dispersion: logit(rho) = beta0 + beta1 * N\n", - "\n", - " Mirrors WASP2's opt_linear in src/analysis/as_analysis.py:53-78.\n", - "\n", - " Args:\n", - " ref_counts: Array of reference allele counts\n", - " total_counts: Array of total read counts\n", - "\n", - " Returns:\n", - " Tuple of (beta0, beta1) parameters\n", - " \"\"\"\n", - "\n", - " def neg_loglik(params):\n", - " beta0, beta1 = params\n", - "\n", - " # Compute rho for each observation\n", - " linear_pred = beta0 + beta1 * total_counts\n", - " # Clip to prevent overflow (matches WASP2's approach)\n", - " linear_pred = np.clip(linear_pred, -10, 10)\n", - " rho = expit(linear_pred)\n", - " rho = clamp_rho(rho) # Apply WASP2's clamping\n", - "\n", - " # Compute likelihood\n", - " alpha = 0.5 * (1 - rho) / rho\n", - " beta = 0.5 * (1 - rho) / rho\n", - "\n", - " ll = np.sum(betabinom.logpmf(ref_counts, total_counts, alpha, beta))\n", - " return -ll\n", - "\n", - " result = minimize(neg_loglik, x0=[-3, 0], method=\"Nelder-Mead\")\n", - " return tuple(result.x)\n", - "\n", - "\n", - "# Simulate data with depth-dependent dispersion\n", - "np.random.seed(42)\n", - "n_snps = 2000\n", - "depths = np.random.choice([20, 50, 100, 200], size=n_snps)\n", - "\n", - "# True model: logit(rho) = -3 + 0.01 * N\n", - "true_beta0, true_beta1 = -3, 0.01\n", - "true_rhos = expit(true_beta0 + true_beta1 * depths)\n", - "\n", - "# Generate counts with proper clamping\n", - "ref_counts_linear = np.array(\n", - " [\n", - " betabinom.rvs(\n", - " n=n,\n", - " a=0.5 * (1 - clamp_rho(rho)) / clamp_rho(rho),\n", - " b=0.5 * (1 - clamp_rho(rho)) / clamp_rho(rho),\n", - " )\n", - " for n, rho in zip(depths, true_rhos)\n", - " ]\n", - ")\n", - "\n", - "# Estimate\n", - "est_beta0, est_beta1 = estimate_linear_dispersion(ref_counts_linear, depths)\n", - "\n", - "print(\"Linear dispersion model estimation:\")\n", - "print(f\" True: logit(rho) = {true_beta0:.2f} + {true_beta1:.4f} * N\")\n", - "print(f\" Est: logit(rho) = {est_beta0:.2f} + {est_beta1:.4f} * N\")" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "linear-model-visual", - "metadata": {}, - "outputs": [], - "source": [ - "# Visualize the linear dispersion model\n", - "fig, axes = plt.subplots(1, 2, figsize=(12, 5))\n", - "\n", - "# Panel 1: Rho vs depth\n", - "ax = axes[0]\n", - "depth_range = np.linspace(10, 250, 100)\n", - "true_curve = expit(true_beta0 + true_beta1 * depth_range)\n", - "est_curve = expit(est_beta0 + est_beta1 * depth_range)\n", - "\n", - "ax.plot(depth_range, true_curve, \"b-\", lw=2, label=\"True\")\n", - "ax.plot(depth_range, est_curve, \"r--\", lw=2, label=\"Estimated\")\n", - "ax.set_xlabel(\"Read Depth (N)\")\n", - "ax.set_ylabel(\"Dispersion (rho)\")\n", - "ax.set_title(\"Linear Dispersion Model\")\n", - "ax.legend()\n", - "\n", - "# Panel 2: Show effect on variance\n", - "ax = axes[1]\n", - "binom_var = depth_range 
* 0.5 * 0.5\n", - "bb_var_true = depth_range * 0.5 * 0.5 * (1 + (depth_range - 1) * true_curve)\n", - "bb_var_const = depth_range * 0.5 * 0.5 * (1 + (depth_range - 1) * 0.05)\n", - "\n", - "ax.plot(depth_range, binom_var, \"k-\", lw=2, label=\"Binomial\")\n", - "ax.plot(depth_range, bb_var_const, \"g--\", lw=2, label=\"Constant rho=0.05\")\n", - "ax.plot(depth_range, bb_var_true, \"b-\", lw=2, label=\"Linear model\")\n", - "ax.set_xlabel(\"Read Depth (N)\")\n", - "ax.set_ylabel(\"Variance of Ref Count\")\n", - "ax.set_title(\"Variance Scaling with Depth\")\n", - "ax.legend()\n", - "\n", - "plt.tight_layout()\n", - "plt.show()" - ] - }, - { - "cell_type": "markdown", - "id": "section4-header", - "metadata": {}, - "source": [ - "---\n", - "\n", - "## Section 4: QQ Plots for Model Calibration\n", - "\n", - "**Quantile-Quantile (QQ) plots** compare observed p-values to their expected distribution under the null. If the model is well-calibrated:\n", - "\n", - "- P-values should be uniformly distributed under $H_0$\n", - "- QQ plot should follow the diagonal line\n", - "\n", - "### 4.1 Constructing QQ Plots" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "qq-function", - "metadata": {}, - "outputs": [], - "source": [ - "def qq_plot(\n", - " pvalues: NDArray[np.float64], ax=None, label: str = \"Observed\", color: str = \"steelblue\"\n", - "):\n", - " \"\"\"\n", - " Create a QQ plot of -log10(p-values).\n", - "\n", - " Args:\n", - " pvalues: Array of p-values\n", - " ax: Matplotlib axes (created if None)\n", - " label: Label for the scatter plot\n", - " color: Color for the points\n", - "\n", - " Returns:\n", - " Matplotlib axes object\n", - " \"\"\"\n", - " if ax is None:\n", - " fig, ax = plt.subplots()\n", - "\n", - " # Input validation and cleaning\n", - " pvalues = np.asarray(pvalues, dtype=np.float64)\n", - " pvalues = pvalues[~np.isnan(pvalues)] # Remove NaN\n", - " pvalues = pvalues[(pvalues > 0) & (pvalues <= 1)] # Valid p-values only\n", - "\n", - " if len(pvalues) == 0:\n", - " raise ValueError(\"No valid p-values after filtering\")\n", - "\n", - " pvalues = np.sort(pvalues)\n", - "\n", - " # Expected p-values under uniform distribution\n", - " n = len(pvalues)\n", - " expected = (np.arange(1, n + 1) - 0.5) / n\n", - "\n", - " # Transform to -log10 scale with safe clipping\n", - " obs_log = -np.log10(np.clip(pvalues, 1e-300, 1))\n", - " exp_log = -np.log10(expected)\n", - "\n", - " # Plot\n", - " ax.scatter(exp_log, obs_log, s=10, alpha=0.6, color=color, label=label)\n", - "\n", - " # Diagonal line\n", - " max_val = max(exp_log.max(), obs_log.max())\n", - " ax.plot([0, max_val], [0, max_val], \"r--\", lw=1, label=\"Expected\")\n", - "\n", - " ax.set_xlabel(\"Expected -log10(p)\")\n", - " ax.set_ylabel(\"Observed -log10(p)\")\n", - "\n", - " return ax\n", - "\n", - "\n", - "def genomic_inflation_factor(pvalues: NDArray[np.float64]) -> float:\n", - " \"\"\"\n", - " Calculate genomic inflation factor (lambda).\n", - "\n", - " Lambda = 1 indicates perfect calibration.\n", - " Lambda > 1 indicates inflation (too many false positives).\n", - " Lambda < 1 indicates deflation (too conservative).\n", - "\n", - " Args:\n", - " pvalues: Array of p-values\n", - "\n", - " Returns:\n", - " Genomic inflation factor\n", - " \"\"\"\n", - " pvalues = np.asarray(pvalues, dtype=np.float64)\n", - " pvalues = pvalues[~np.isnan(pvalues)]\n", - " pvalues = pvalues[(pvalues > 0) & (pvalues <= 1)]\n", - "\n", - " if len(pvalues) == 0:\n", - " return np.nan\n", - "\n", - " chi2_stats = 
chi2.ppf(1 - pvalues, df=1)\n", - " return np.median(chi2_stats) / chi2.ppf(0.5, df=1)" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "qq-comparison", - "metadata": {}, - "outputs": [], - "source": [ - "# Generate p-values using binomial vs beta-binomial tests on overdispersed data\n", - "true_rho = 0.05\n", - "test_data = overdispersed_data[true_rho]\n", - "n_total = read_depth\n", - "\n", - "# Binomial test (wrong model)\n", - "pvals_binomial = [stats.binomtest(int(k), n_total, 0.5).pvalue for k in test_data]\n", - "\n", - "# Beta-binomial test (correct model)\n", - "est_rho = estimate_dispersion_mle(test_data, np.full(len(test_data), n_total))\n", - "pvals_betabinom = [likelihood_ratio_test(int(k), n_total, est_rho)[1] for k in test_data]\n", - "\n", - "# Create QQ plots\n", - "fig, axes = plt.subplots(1, 2, figsize=(12, 5))\n", - "\n", - "ax = axes[0]\n", - "qq_plot(np.array(pvals_binomial), ax=ax, label=\"Binomial\", color=\"firebrick\")\n", - "lambda_binom = genomic_inflation_factor(np.array(pvals_binomial))\n", - "ax.set_title(f\"Binomial Test\\n(lambda = {lambda_binom:.2f})\")\n", - "ax.legend()\n", - "\n", - "ax = axes[1]\n", - "qq_plot(np.array(pvals_betabinom), ax=ax, label=\"Beta-binomial\", color=\"steelblue\")\n", - "lambda_bb = genomic_inflation_factor(np.array(pvals_betabinom))\n", - "ax.set_title(f\"Beta-Binomial Test\\n(lambda = {lambda_bb:.2f})\")\n", - "ax.legend()\n", - "\n", - "plt.suptitle(f\"QQ Plots on Null Data (true rho = {true_rho})\", fontsize=14, y=1.02)\n", - "plt.tight_layout()\n", - "plt.show()\n", - "\n", - "print(\"Genomic inflation factor (lambda):\")\n", - "print(f\" Binomial test: {lambda_binom:.3f} (expected: 1.0)\")\n", - "print(f\" Beta-binomial: {lambda_bb:.3f} (expected: 1.0)\")\n", - "print(\"\\nA lambda >> 1 indicates systematic inflation (too many false positives).\")" - ] - }, - { - "cell_type": "markdown", - "id": "qq-interpretation", - "metadata": {}, - "source": [ - "### 4.2 Interpreting QQ Plots\n", - "\n", - "| Pattern | Interpretation |\n", - "|---------|----------------|\n", - "| Points on diagonal | Well-calibrated (good!) 
|\n", - "| Points above diagonal | Inflation (too many small p-values) |\n", - "| Points below diagonal | Deflation (test too conservative) |\n", - "| Lift at the tail only | True signal mixed with null |\n", - "\n", - "The **genomic inflation factor (lambda)** quantifies deviation:\n", - "- $\\lambda = 1$: Perfect calibration\n", - "- $\\lambda > 1$: Inflation\n", - "- $\\lambda < 1$: Deflation" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "qq-with-signal", - "metadata": {}, - "outputs": [], - "source": [ - "# Demonstrate QQ plot with true signal\n", - "np.random.seed(42)\n", - "\n", - "# Generate mixed data: 90% null + 10% truly imbalanced\n", - "n_snps = 1000\n", - "n_signal = 100\n", - "rho = 0.05\n", - "\n", - "# Null data (mu = 0.5)\n", - "alpha_null, beta_null = mu_rho_to_alpha_beta(0.5, rho)\n", - "null_counts = betabinom.rvs(n=50, a=alpha_null, b=beta_null, size=n_snps - n_signal)\n", - "\n", - "# Signal data (mu = 0.7, strong imbalance)\n", - "alpha_sig, beta_sig = mu_rho_to_alpha_beta(0.7, rho)\n", - "signal_counts = betabinom.rvs(n=50, a=alpha_sig, b=beta_sig, size=n_signal)\n", - "\n", - "# Combine\n", - "mixed_counts = np.concatenate([null_counts, signal_counts])\n", - "is_signal = np.concatenate([np.zeros(n_snps - n_signal), np.ones(n_signal)]).astype(bool)\n", - "\n", - "# Test all SNPs\n", - "est_rho = estimate_dispersion_mle(mixed_counts, np.full(len(mixed_counts), 50))\n", - "mixed_pvals = np.array([likelihood_ratio_test(int(k), 50, est_rho)[1] for k in mixed_counts])\n", - "\n", - "# QQ plot\n", - "fig, ax = plt.subplots(figsize=(8, 8))\n", - "qq_plot(mixed_pvals, ax=ax)\n", - "ax.set_title(f\"QQ Plot with {n_signal} True Signals\\n(10% imbalanced, mu=0.7)\")\n", - "ax.legend()\n", - "plt.show()\n", - "\n", - "# Check detection power\n", - "detected = mixed_pvals < 0.05\n", - "print(\"Detection results (alpha = 0.05):\")\n", - "print(f\" True positives: {detected[is_signal].sum()} / {n_signal}\")\n", - "print(f\" False positives: {detected[~is_signal].sum()} / {n_snps - n_signal}\")" - ] - }, - { - "cell_type": "markdown", - "id": "section5-header", - "metadata": {}, - "source": [ - "---\n", - "\n", - "## Section 5: False Discovery Rate (FDR) Correction\n", - "\n", - "When testing thousands of SNPs, multiple testing correction is essential. WASP2 uses **Benjamini-Hochberg (BH)** FDR correction via `scipy.stats.false_discovery_control()`.\n", - "\n", - "**WASP2 Implementation:** See `src/analysis/as_analysis.py:492` which uses `false_discovery_control(pvals, method=\"bh\")`\n", - "\n", - "### 5.1 The Multiple Testing Problem\n", - "\n", - "If we test 10,000 SNPs at $\\alpha = 0.05$, we expect 500 false positives even if there's no true signal!" 
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "id": "fdr-motivation",
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "# Demonstrate multiple testing problem\n",
-    "n_tests = 10000\n",
-    "alpha = 0.05\n",
-    "\n",
-    "# All null (no true signal)\n",
-    "null_pvalues = np.random.uniform(0, 1, n_tests)\n",
-    "\n",
-    "# Without correction\n",
-    "false_positives_raw = (null_pvalues < alpha).sum()\n",
-    "\n",
-    "print(f\"Multiple testing with {n_tests} tests (no true signal):\")\n",
-    "print(f\"  Expected false positives at alpha={alpha}: {n_tests * alpha:.0f}\")\n",
-    "print(f\"  Observed false positives: {false_positives_raw}\")\n",
-    "print(\"\\nThis is why we need FDR correction!\")"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "id": "bh-header",
-   "metadata": {},
-   "source": [
-    "### 5.2 Benjamini-Hochberg Procedure\n",
-    "\n",
-    "The BH procedure controls the **false discovery rate** - the expected proportion of false positives among all discoveries.\n",
-    "\n",
-    "**Algorithm:**\n",
-    "1. Sort p-values: $p_{(1)} \\leq p_{(2)} \\leq \\ldots \\leq p_{(m)}$\n",
-    "2. Find the largest $k$ such that $p_{(k)} \\leq \\frac{k}{m} \\cdot q$, where $q$ is the target FDR\n",
-    "3. Reject all hypotheses with $p_{(i)} \\leq p_{(k)}$\n",
-    "\n",
-    "**Note:** WASP2 uses `scipy.stats.false_discovery_control(pvals, method=\"bh\")` for this."
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "id": "bh-implementation",
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "def benjamini_hochberg(pvalues: NDArray[np.float64]) -> NDArray[np.float64]:\n",
-    "    \"\"\"\n",
-    "    Apply Benjamini-Hochberg FDR correction.\n",
-    "\n",
-    "    Note: WASP2 uses scipy.stats.false_discovery_control(pvals, method=\"bh\"),\n",
-    "    which is equivalent. This implementation is for educational purposes.\n",
-    "\n",
-    "    Args:\n",
-    "        pvalues: Array of p-values\n",
-    "\n",
-    "    Returns:\n",
-    "        Array of adjusted p-values (q-values)\n",
-    "    \"\"\"\n",
-    "    pvalues = np.asarray(pvalues, dtype=np.float64)\n",
-    "    n = len(pvalues)\n",
-    "\n",
-    "    if n == 0:\n",
-    "        return np.array([])\n",
-    "\n",
-    "    # Sort indices\n",
-    "    sorted_idx = np.argsort(pvalues)\n",
-    "    sorted_pvals = pvalues[sorted_idx]\n",
-    "\n",
-    "    # BH adjustment: p_adj[i] = p[i] * n / rank[i]\n",
-    "    adjusted = sorted_pvals * n / np.arange(1, n + 1)\n",
-    "\n",
-    "    # Ensure monotonicity (cumulative minimum from the end)\n",
-    "    adjusted = np.minimum.accumulate(adjusted[::-1])[::-1]\n",
-    "\n",
-    "    # Cap at 1\n",
-    "    adjusted = np.minimum(adjusted, 1)\n",
-    "\n",
-    "    # Restore original order\n",
-    "    result = np.empty(n)\n",
-    "    result[sorted_idx] = adjusted\n",
-    "\n",
-    "    return result\n",
-    "\n",
-    "\n",
-    "# Compare our implementation with scipy's\n",
-    "print(\"Comparing BH implementations:\")\n",
-    "test_pvals = np.array([0.001, 0.01, 0.02, 0.05, 0.1, 0.5])\n",
-    "our_adjusted = benjamini_hochberg(test_pvals)\n",
-    "scipy_adjusted = false_discovery_control(test_pvals, method=\"bh\")\n",
-    "\n",
-    "print(f\"  Raw p-values:      {test_pvals}\")\n",
-    "print(f\"  Our BH adjusted:   {our_adjusted.round(4)}\")\n",
-    "print(f\"  SciPy BH adjusted: {scipy_adjusted.round(4)}\")\n",
-    "print(f\"  Match: {np.allclose(our_adjusted, scipy_adjusted)}\")"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "id": "fdr-application",
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "# Apply FDR correction to mixed data\n",
-    "# Using scipy's implementation (same as WASP2)\n",
-    "fdr_pvals = false_discovery_control(mixed_pvals, method=\"bh\")\n",
-    "\n",
-    "# Compare results\n",
-    "print(\"FDR correction comparison (alpha/FDR = 0.05):\")\n",
-    "print(\"\\nRaw p-values:\")\n",
-    "print(f\"  Significant: {(mixed_pvals < 0.05).sum()}\")\n",
-    "print(f\"  True positives: {((mixed_pvals < 0.05) & is_signal).sum()}\")\n",
-    "print(f\"  False positives: {((mixed_pvals < 0.05) & ~is_signal).sum()}\")\n",
-    "\n",
-    "print(\"\\nBH-adjusted p-values (scipy.stats.false_discovery_control):\")\n",
-    "print(f\"  Significant: {(fdr_pvals < 0.05).sum()}\")\n",
-    "print(f\"  True positives: {((fdr_pvals < 0.05) & is_signal).sum()}\")\n",
-    "print(f\"  False positives: {((fdr_pvals < 0.05) & ~is_signal).sum()}\")"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "id": "fdr-alternatives",
-   "metadata": {},
-   "source": [
-    "### 5.3 Alternative FDR Methods\n",
-    "\n",
-    "WASP2 primarily uses BH, but other options exist:\n",
-    "\n",
-    "| Method | Description | Use Case |\n",
-    "|--------|-------------|----------|\n",
-    "| **Benjamini-Hochberg (BH)** | Standard FDR control | Default choice |\n",
-    "| **Benjamini-Yekutieli (BY)** | Controls FDR under arbitrary dependence | Correlated tests |\n",
-    "| **Storey's q-value** | Estimates proportion of true nulls | Large-scale testing |\n",
-    "| **mid-p correction** | Less conservative binomial test | Discrete distributions |"
-   ]
-  },
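-  {
-   "cell_type": "markdown",
-   "id": "bh-vs-by-md",
-   "metadata": {},
-   "source": [
-    "As a quick illustration of the BH/BY trade-off (a sketch; both methods are available through `scipy.stats.false_discovery_control`), the cell below reruns the mixed-data correction with `method=\"by\"`. BY pays an extra $\\sum_{i=1}^{m} 1/i$ penalty to remain valid under arbitrary dependence, so it never reports more discoveries than BH."
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "id": "bh-vs-by",
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "# BH vs the more conservative BY correction on the same p-values\n",
-    "# (sketch; reuses mixed_pvals and fdr_pvals from the cells above)\n",
-    "by_pvals = false_discovery_control(mixed_pvals, method=\"by\")\n",
-    "\n",
-    "print(\"Discoveries at FDR = 0.05:\")\n",
-    "print(f\"  BH: {(fdr_pvals < 0.05).sum()}\")\n",
-    "print(f\"  BY: {(by_pvals < 0.05).sum()} (never more than BH, by construction)\")"
-   ]
-  },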
results\n", - "print(\"FDR correction comparison (alpha/FDR = 0.05):\")\n", - "print(\"\\nRaw p-values:\")\n", - "print(f\" Significant: {(mixed_pvals < 0.05).sum()}\")\n", - "print(f\" True positives: {((mixed_pvals < 0.05) & is_signal).sum()}\")\n", - "print(f\" False positives: {((mixed_pvals < 0.05) & ~is_signal).sum()}\")\n", - "\n", - "print(\"\\nBH-adjusted p-values (scipy.stats.false_discovery_control):\")\n", - "print(f\" Significant: {(fdr_pvals < 0.05).sum()}\")\n", - "print(f\" True positives: {((fdr_pvals < 0.05) & is_signal).sum()}\")\n", - "print(f\" False positives: {((fdr_pvals < 0.05) & ~is_signal).sum()}\")" - ] - }, - { - "cell_type": "markdown", - "id": "fdr-alternatives", - "metadata": {}, - "source": [ - "### 5.3 Alternative FDR Methods\n", - "\n", - "WASP2 primarily uses BH, but other options exist:\n", - "\n", - "| Method | Description | Use Case |\n", - "|--------|-------------|----------|\n", - "| **Benjamini-Hochberg (BH)** | Standard FDR control | Default choice |\n", - "| **Benjamini-Yekutieli (BY)** | Controls FDR under dependency | Correlated tests |\n", - "| **Storey's q-value** | Estimates proportion of true nulls | Large-scale testing |\n", - "| **mid-p correction** | Less conservative binomial test | Discrete distributions |" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "fdr-visual", - "metadata": {}, - "outputs": [], - "source": [ - "# Visualize FDR correction effect\n", - "fig, axes = plt.subplots(1, 2, figsize=(12, 5))\n", - "\n", - "# Panel 1: Raw vs adjusted p-values\n", - "ax = axes[0]\n", - "sorted_idx = np.argsort(mixed_pvals)\n", - "ax.plot(mixed_pvals[sorted_idx], label=\"Raw p-value\", lw=2)\n", - "ax.plot(fdr_pvals[sorted_idx], label=\"BH-adjusted\", lw=2)\n", - "ax.axhline(0.05, color=\"red\", linestyle=\"--\", label=\"Threshold (0.05)\")\n", - "ax.set_xlabel(\"Rank\")\n", - "ax.set_ylabel(\"P-value\")\n", - "ax.set_title(\"Raw vs FDR-Adjusted P-values\")\n", - "ax.set_xlim(0, 200) # Focus on top hits\n", - "ax.legend()\n", - "\n", - "# Panel 2: Number of discoveries at different thresholds\n", - "ax = axes[1]\n", - "thresholds = np.linspace(0.001, 0.2, 50)\n", - "raw_discoveries = [(mixed_pvals < t).sum() for t in thresholds]\n", - "fdr_discoveries = [(fdr_pvals < t).sum() for t in thresholds]\n", - "\n", - "ax.plot(thresholds, raw_discoveries, label=\"Raw p-value\", lw=2)\n", - "ax.plot(thresholds, fdr_discoveries, label=\"FDR-adjusted\", lw=2)\n", - "ax.axhline(n_signal, color=\"green\", linestyle=\":\", label=f\"True signals ({n_signal})\")\n", - "ax.axvline(0.05, color=\"red\", linestyle=\"--\", alpha=0.5)\n", - "ax.set_xlabel(\"Threshold\")\n", - "ax.set_ylabel(\"Number of Discoveries\")\n", - "ax.set_title(\"Discoveries at Different Thresholds\")\n", - "ax.legend()\n", - "\n", - "plt.tight_layout()\n", - "plt.show()" - ] - }, - { - "cell_type": "markdown", - "id": "summary-header", - "metadata": {}, - "source": [ - "---\n", - "\n", - "## Summary\n", - "\n", - "In this tutorial, you learned:\n", - "\n", - "### Key Concepts\n", - "\n", - "1. **Overdispersion** - Sequencing data shows more variance than the binomial model predicts\n", - " - Sources: PCR bias, library prep effects, technical variability\n", - " - Consequence: Binomial test has inflated false positive rate\n", - "\n", - "2. 
-    "2. **Beta-binomial distribution** - Naturally handles overdispersion\n",
-    "   - Parameters: mean ($\\mu$) and dispersion ($\\rho$)\n",
-    "   - Variance: $N\\mu(1-\\mu)[1 + (N-1)\\rho]$\n",
-    "   - Reduces to binomial when $\\rho \\to 0$\n",
-    "   - **Critical:** Use `clamp_rho()` to prevent numerical instability at boundaries\n",
-    "\n",
-    "3. **Dispersion estimation**\n",
-    "   - **MLE**: More efficient, used by WASP2\n",
-    "   - **MoM**: Faster, closed-form solution\n",
-    "   - **Linear model**: Allows depth-dependent dispersion\n",
-    "\n",
-    "4. **QQ plots** - Diagnostic tool for model calibration\n",
-    "   - Points on diagonal = well-calibrated\n",
-    "   - Inflation factor ($\\lambda$) quantifies deviation\n",
-    "\n",
-    "5. **FDR correction** - Essential for multiple testing\n",
-    "   - Benjamini-Hochberg controls the false discovery rate\n",
-    "   - WASP2 uses `scipy.stats.false_discovery_control(pvals, method=\"bh\")`\n",
-    "\n",
-    "### WASP2 Implementation Reference\n",
-    "\n",
-    "| Tutorial Function | WASP2 Source |\n",
-    "|-------------------|--------------|\n",
-    "| `clamp_rho()` | `src/analysis/as_analysis.py:36-50` |\n",
-    "| `mu_rho_to_alpha_beta()` | `src/analysis/as_analysis.py:104-105` |\n",
-    "| `betabinom_loglik()` | `src/analysis/as_analysis.py:81-112` (opt_prob) |\n",
-    "| `likelihood_ratio_test()` | `src/analysis/as_analysis.py:322-323` |\n",
-    "| `estimate_dispersion_mle()` | `src/analysis/as_analysis.py:251-325` (single_model) |\n",
-    "| `estimate_linear_dispersion()` | `src/analysis/as_analysis.py:53-78` (opt_linear) |\n",
-    "| FDR correction | `src/analysis/as_analysis.py:492` |\n",
-    "\n",
-    "### Further Reading\n",
-    "\n",
-    "- [Statistical Models Documentation](../methods/statistical_models.rst)\n",
-    "- [Dispersion Estimation Methods](../methods/dispersion_estimation.rst)\n",
-    "- Skelly et al. (2011) - Beta-binomial framework for allelic imbalance\n",
-    "- Benjamini & Hochberg (1995) - Controlling the false discovery rate"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "id": "final-cell",
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "print(\"Tutorial complete!\")\n",
-    "print(\"\\nYou now understand the statistical foundations of WASP2's\")\n",
-    "print(\"allelic imbalance detection framework.\")\n",
-    "print(\"\\nKey takeaways:\")\n",
-    "print(f\"  - Always use RHO_EPSILON ({RHO_EPSILON}) to prevent numerical instability\")\n",
-    "print(\"  - Beta-binomial handles overdispersion that binomial cannot\")\n",
-    "print(\"  - QQ plots diagnose model calibration\")\n",
-    "print(\"  - FDR correction is essential for genome-wide testing\")"
-   ]
-  }
- ],
- "metadata": {
-  "kernelspec": {
-   "display_name": "Python 3 (ipykernel)",
-   "language": "python",
-   "name": "python3"
-  },
-  "language_info": {
-   "codemirror_mode": {
-    "name": "ipython",
-    "version": 3
-   },
-   "file_extension": ".py",
-   "mimetype": "text/x-python",
-   "name": "python",
-   "nbconvert_exporter": "python",
-   "pygments_lexer": "ipython3",
-   "version": "3.10.0"
-  }
- },
- "nbformat": 4,
- "nbformat_minor": 5
-}