diff --git a/docs/source/faq.rst b/docs/source/faq.rst index 9ff4048..2e517d5 100644 --- a/docs/source/faq.rst +++ b/docs/source/faq.rst @@ -117,7 +117,7 @@ calibrated for the overdispersion typical of RNA-seq count data. The standard threshold is FDR < 0.05. For discovery analyses you may want FDR < 0.1. For validation or follow-up experiments, consider FDR < 0.01. -See :doc:`methods/fdr_correction` for details on the BH procedure used. +See :doc:`methods/statistical_models` for the BH procedure and the NaN-propagation warning. **My output has very few significant sites. What's wrong?** diff --git a/docs/source/index.rst b/docs/source/index.rst index 67b8c8a..ffee6e7 100644 --- a/docs/source/index.rst +++ b/docs/source/index.rst @@ -77,12 +77,8 @@ Documentation :maxdepth: 2 :caption: Statistical Methods - methods/index - methods/counting_algorithm methods/mapping_filter methods/statistical_models - methods/dispersion_estimation - methods/fdr_correction .. toctree:: :maxdepth: 2 diff --git a/docs/source/methods/counting_algorithm.rst b/docs/source/methods/counting_algorithm.rst deleted file mode 100644 index 3839acc..0000000 --- a/docs/source/methods/counting_algorithm.rst +++ /dev/null @@ -1,193 +0,0 @@ -Allele Counting Algorithm -========================= - -This document describes how WASP2 assigns reads to reference and alternate -alleles at heterozygous variant sites. - -.. contents:: Contents - :local: - :depth: 2 - -Overview --------- - -The allele counting algorithm forms the foundation of allele-specific analysis. -For each heterozygous SNP, WASP2 examines aligned reads and counts how many -support the reference allele, alternate allele, or neither. - -Biological Rationale --------------------- - -In a diploid organism with a heterozygous site (genotype A/G), reads originating -from each chromosome should carry the corresponding allele. Under the null -hypothesis of no allelic imbalance, we expect equal representation of both -alleles: - -.. math:: - - E[\text{ref\_count}] = E[\text{alt\_count}] = \frac{N}{2} - -where :math:`N` is the total number of reads covering the variant. - -Deviations from this expectation may indicate: - -- **Allele-specific expression (ASE)**: Differential transcription between alleles -- **Allele-specific chromatin accessibility**: Differential regulatory activity -- **Allele-specific binding**: Differential protein-DNA interactions -- **Technical artifacts**: Mapping bias, amplification bias - -Algorithm Details ------------------ - -Position-Based Alignment -^^^^^^^^^^^^^^^^^^^^^^^^ - -For each variant position, WASP2 queries the BAM file to retrieve all reads -overlapping that genomic coordinate: - -1. **Coordinate lookup**: Use BAM index to efficiently retrieve reads at position -2. **CIGAR parsing**: Walk through the CIGAR string to find the query position - corresponding to the reference position -3. **Base extraction**: Extract the nucleotide at the query position - -.. code-block:: python - - # Simplified pseudocode - for read in bam.fetch(chrom, pos, pos + 1): - if read.is_unmapped: - continue # Only BAM flag filter applied; see "Canonical Filter Contract" - query_pos = find_aligned_position(read, pos) - if query_pos is not None: - base = read.query_sequence[query_pos] - if base == ref_allele: - ref_count += 1 - elif base == alt_allele: - alt_count += 1 - else: - other_count += 1 - -.. note:: - - As of v1.4.1 WASP2 applies **only** the unmapped filter during counting. - Secondary, supplementary, duplicate, QC-fail, and not-proper-pair reads - are **not** dropped. Callers that want those filters should pre-filter - BAMs (``samtools view -F 0x904``). See - :doc:`mapping_filter` for the full canonical-contract rationale. - -CIGAR-Aware Position Finding -^^^^^^^^^^^^^^^^^^^^^^^^^^^^ - -The alignment between query (read) and reference positions is not always 1:1 -due to insertions, deletions, and soft clipping. WASP2 uses the aligned pairs -from the CIGAR string: - -.. table:: CIGAR Operations and Position Mapping - :widths: 20 40 40 - - ==================== =========================== ======================= - CIGAR Operation Consumes Reference Consumes Query - ==================== =========================== ======================= - M (match/mismatch) Yes Yes - I (insertion) No Yes - D (deletion) Yes No - S (soft clip) No Yes - H (hard clip) No No - N (skip) Yes No - ==================== =========================== ======================= - -For a read with CIGAR ``50M2D30M``: - -- Positions 0-49 in the reference align to positions 0-49 in the query -- Positions 50-51 in the reference have no query alignment (deletion) -- Positions 52-81 in the reference align to positions 50-79 in the query - -Handling Edge Cases -^^^^^^^^^^^^^^^^^^^ - -**Deletions spanning the variant** - If the variant position falls within a deletion, the read cannot be assigned - to either allele and is counted as "other". - -**Insertions adjacent to the variant** - Insertions can shift the query position. The algorithm correctly handles - this by using aligned pairs rather than simple arithmetic. - -**Soft-clipped reads** - Soft-clipped bases at read ends do not consume reference positions but - are present in the query sequence. The algorithm accounts for this. - -Rust Acceleration ------------------ - -WASP2 uses a Rust-accelerated BAM counter for performance: - -.. code-block:: python - - from wasp2_rust import BamCounter - - counter = BamCounter("sample.bam") - regions = [("chr1", 12345, "A", "G"), ("chr1", 12400, "C", "T")] - counts = counter.count_alleles(regions, min_qual=0, threads=4) - -The Rust implementation provides: - -- **Parallel processing**: Count multiple regions simultaneously -- **Memory efficiency**: Stream through BAM without loading all reads -- **htslib integration**: Direct access to BAM index for efficient queries - -The Rust counter is substantially faster than the pure-Python pysam path -(roughly an order of magnitude on typical ATAC-seq data in internal -testing). Exact throughput depends on read length, BAM index density, -storage type, and thread count — measure on your own hardware before -planning large batches. - -Quality Filtering ------------------ - -Optional base quality filtering can exclude low-confidence base calls: - -.. math:: - - Q = -10 \log_{10}(P_{\text{error}}) - -By default, WASP2 uses ``min_qual=0`` (no filtering) to match legacy WASP -behavior. For stringent analysis, ``min_qual=20`` (1% error rate) is recommended. - -Output Format -------------- - -The counting algorithm produces a table with the following columns: - -.. table:: Count Output Columns - :widths: 20 20 60 - - ============= ======== ============================================ - Column Type Description - ============= ======== ============================================ - chrom string Chromosome name - pos int 1-based genomic position - ref string Reference allele - alt string Alternate allele - ref_count int Reads supporting reference allele - alt_count int Reads supporting alternate allele - other_count int Reads with neither allele (errors, indels) - region string Associated region (peak, gene) if provided - ============= ======== ============================================ - -Region Assignment ------------------ - -When a BED file of regions (peaks, genes) is provided, variants are associated -with overlapping regions: - -1. **Intersection**: Use bedtools or coitrees to find variant-region overlaps -2. **Aggregation**: Sum counts across all variants within each region -3. **Statistical testing**: Perform imbalance analysis at the region level - -This aggregation increases statistical power for detecting allelic imbalance, -as individual SNPs often have low coverage. - -See Also --------- - -- :doc:`statistical_models` - Statistical testing of allele counts diff --git a/docs/source/methods/dispersion_estimation.rst b/docs/source/methods/dispersion_estimation.rst deleted file mode 100644 index 9b1e4e9..0000000 --- a/docs/source/methods/dispersion_estimation.rst +++ /dev/null @@ -1,126 +0,0 @@ -Dispersion Parameter Estimation -================================ - -The beta-binomial overdispersion parameter :math:`\rho \in (0, 1)` quantifies -variance inflation beyond the binomial expectation: - -.. math:: - - \text{Var}(X) = N\mu(1-\mu)\left[1 + (N-1)\rho\right] - -Dispersion arises from biological variation, PCR amplification, and -aggregation of counts across SNPs within a region. Estimation approaches -follow the beta-binomial ASE literature [Kumasaka2016]_; analogous methods -for negative-binomial dispersion in RNA-seq counts are reviewed in -[Robinson2010]_ and [Yu2013]_. - -Single Dispersion Model (default) ---------------------------------- - -WASP2's default ``single`` model estimates one pooled :math:`\rho` under the -null hypothesis :math:`\mu = 0.5` by maximum likelihood: - -.. math:: - - \hat{\rho}_{\text{MLE}} = \arg\max_{\rho} \sum_{i=1}^{n} - \log \text{BetaBinom}(X_i; N_i, \alpha(\rho), \beta(\rho)) - -with :math:`\alpha = \beta = 0.5(1-\rho)/\rho`. - -.. code-block:: python - - from scipy.optimize import minimize_scalar - from scipy.stats import betabinom - import numpy as np - - def neg_log_likelihood(rho, ref_counts, n_counts): - alpha = 0.5 * (1 - rho) / rho - beta = 0.5 * (1 - rho) / rho - return -np.sum(betabinom.logpmf(ref_counts, n_counts, alpha, beta)) - - rho_mle = minimize_scalar( - neg_log_likelihood, - args=(ref_array, n_array), - method='bounded', - bounds=(1e-6, 1 - 1e-6), - ).x - -:math:`\rho` is held at this null-model MLE when computing the likelihood-ratio -test statistic (see :doc:`statistical_models`). - -Linear Dispersion Model ------------------------ - -For large datasets with wide coverage ranges, WASP2 provides a ``linear`` model -in which dispersion varies with total count on the logit scale: - -.. math:: - - \text{logit}(\rho_i) = \beta_0 + \beta_1 \cdot N_i - -Low-coverage regions typically show greater apparent dispersion from sampling -noise; high-coverage regions reveal more of the true biological variance. The -linear model captures this trend with one additional parameter. - -.. code-block:: python - - from scipy.optimize import minimize - from scipy.special import expit - - def neg_ll_linear(params, ref_counts, n_counts): - beta0, beta1 = params - rho = expit(np.clip(beta0 + beta1 * n_counts, -10, 10)) - alpha = 0.5 * (1 - rho) / rho - beta = 0.5 * (1 - rho) / rho - return -np.sum(betabinom.logpmf(ref_counts, n_counts, alpha, beta)) - - beta0, beta1 = minimize(neg_ll_linear, x0=(0, 0), - method='Nelder-Mead', - args=(ref_array, n_array)).x - -Choosing a model ----------------- - -.. table:: - :widths: 30 70 - - ================================ ============================================================= - Context Recommended model - ================================ ============================================================= - Small dataset (< 1,000 regions) ``single`` — linear model is underpowered - Uniform coverage ``single`` - Large dataset, variable coverage ``linear`` - Model-selection check Compare AIC/BIC; linear has one extra parameter - ================================ ============================================================= - -Model comparison via AIC (:math:`\text{AIC} = 2k - 2\ell(\hat\theta)`) can be -used to check whether the extra parameter is justified on a given dataset. - -Convergence ------------ - -The MLE optimizer may have trouble when :math:`\rho` is very close to 0 or 1, -or when extreme outliers dominate the pooled likelihood. WASP2 bounds -:math:`\rho \in (10^{-6}, 1-10^{-6})` and clips logit values in the linear -model. Persistent convergence failures usually indicate data issues (CNVs, -mapping artifacts) rather than optimizer pathology. - -See Also --------- - -- :doc:`statistical_models` — how :math:`\rho` enters the LRT -- :doc:`fdr_correction` — multiple-testing correction downstream - -References ----------- - -.. [Kumasaka2016] Kumasaka N, Knights AJ, Gaffney DJ (2016). Fine-mapping - cellular QTLs with RASQUAL and ATAC-seq. *Nature Genetics* 48:206-213. - -.. [Robinson2010] Robinson MD, Smyth GK (2010). Small-sample estimation of - negative binomial dispersion, with applications to SAGE data. - *Biostatistics* 9:321-332. *Analogous NB-dispersion literature.* - -.. [Yu2013] Yu D, Huber W, Vitek O (2013). Shrinkage estimation of dispersion - in Negative Binomial models for RNA-seq experiments with small sample size. - *Bioinformatics* 29:1275-1282. *Analogous NB-dispersion literature.* diff --git a/docs/source/methods/fdr_correction.rst b/docs/source/methods/fdr_correction.rst deleted file mode 100644 index 50532fa..0000000 --- a/docs/source/methods/fdr_correction.rst +++ /dev/null @@ -1,66 +0,0 @@ -False Discovery Rate Correction -================================ - -WASP2 corrects for multiple testing using the Benjamini-Hochberg (BH) procedure -[Benjamini1995]_, applied via :func:`scipy.stats.false_discovery_control`. - -.. code-block:: python - - from scipy.stats import false_discovery_control - - fdr_pvals = false_discovery_control(pvals, method='bh') - -Each region receives both a raw p-value (``pval``) from the likelihood ratio test -and a BH-adjusted p-value (``fdr_pval``). Significance is typically declared at -``fdr_pval < 0.05``. - -BH controls FDR under independence or positive regression dependence (PRDS). -The chi-squared p-values produced by WASP2's LRT satisfy these assumptions when -regions are analyzed independently. If regions share SNPs or otherwise induce -negative dependence, the more conservative Benjamini-Yekutieli procedure -(``method='by'``) can be substituted. - -.. warning:: - - ``scipy.stats.false_discovery_control`` raises on NaN p-values, but a manual - BH implementation based on ``np.minimum.accumulate`` silently propagates NaN - through the cumulative minimum. Always drop or impute NaN p-values before BH - correction. - -For studies requiring higher-power FDR control with an estimated proportion of -true nulls, :math:`\pi_0`, see Storey's q-value [Storey2003]_. WASP2 does not -apply this by default because BH is sufficient for typical ASE workloads and -requires no tuning. - -Reporting ---------- - -When reporting results, state the correction method, threshold, and number -of tests: - - "BH correction was applied to N = ... chi-squared p-values; regions with - ``fdr_pval < 0.05`` were declared significant." - -Output columns --------------- - -.. table:: - :widths: 20 20 60 - - ========= ======== ==================================================== - Column Type Description - ========= ======== ==================================================== - pval float Raw p-value from likelihood ratio test - fdr_pval float BH-adjusted p-value - ========= ======== ==================================================== - -References ----------- - -.. [Benjamini1995] Benjamini Y, Hochberg Y (1995). Controlling the false - discovery rate: A practical and powerful approach to multiple testing. - *Journal of the Royal Statistical Society B* 57:289-300. - -.. [Storey2003] Storey JD, Tibshirani R (2003). Statistical significance for - genomewide studies. *Proceedings of the National Academy of Sciences* - 100:9440-9445. diff --git a/docs/source/methods/index.rst b/docs/source/methods/index.rst deleted file mode 100644 index 149a60a..0000000 --- a/docs/source/methods/index.rst +++ /dev/null @@ -1,48 +0,0 @@ -Statistical Methods -=================== - -This section provides detailed documentation of the statistical methods, -algorithms, and biological rationale underlying WASP2's allele-specific -analysis pipeline. - -.. toctree:: - :maxdepth: 2 - :caption: Contents - - counting_algorithm - mapping_filter - statistical_models - dispersion_estimation - fdr_correction - -Overview --------- - -WASP2 implements a complete pipeline for allele-specific analysis: - -1. **Allele Counting**: Reads are assigned to reference or alternate alleles - at heterozygous variant sites using base-level alignment information. - -2. **Mapping Bias Correction**: The WASP algorithm removes reads that exhibit - mapping bias by testing whether allele-swapped reads map to the same location. - -3. **Statistical Testing**: Beta-binomial models account for overdispersion - in allele count data when testing for allelic imbalance. - -4. **Multiple Testing Correction**: False discovery rate control ensures - reliable detection of true imbalanced regions. - -References ----------- - -.. [vandeGeijn2015] van de Geijn B, McVicker G, Gilad Y, Pritchard JK (2015). - WASP: allele-specific software for robust molecular quantitative trait - locus discovery. *Nature Methods* 12:1061-1063. - -.. [Castel2015] Castel SE, Levy-Moonshine A, Mohammadi P, Banks E, Lappalainen T (2015). - Tools and best practices for data processing in allelic expression analysis. - *Genome Biology* 16:195. - -.. [Skelly2011] Skelly DA, Johansson M, Madeoy J, Wakefield J, Akey JM (2011). - A powerful and flexible statistical framework for testing hypotheses of - allele-specific gene expression from RNA-seq data. *Genome Research* 21:1728-1737. diff --git a/docs/source/methods/mapping_filter.rst b/docs/source/methods/mapping_filter.rst index c196d0b..81d5ae1 100644 --- a/docs/source/methods/mapping_filter.rst +++ b/docs/source/methods/mapping_filter.rst @@ -174,6 +174,12 @@ If your pipeline needs those defensive filters, apply them upstream: Earlier WASP2 releases (v1.2.0–v1.4.0) applied six SAM flag filters internally. See ``CHANGELOG.md`` for details. +**Counting step.** The same canonical contract applies to the allele +counting step (``wasp2-count count-variants``): only the unmapped filter +(``0x4``) is applied. All other SAM-flag decisions are the caller's +responsibility — pre-filter the BAM upstream if defensive filtering is +needed. + **Same-Locus Test** For SNPs, exact position matching is required: @@ -314,7 +320,7 @@ The typical WASP2 workflow: See Also -------- -- :doc:`counting_algorithm` — Allele counting after WASP filtering +- :doc:`statistical_models` — beta-binomial LRT and FDR applied to the counts References ---------- diff --git a/docs/source/methods/statistical_models.rst b/docs/source/methods/statistical_models.rst index b3037f8..69a09f6 100644 --- a/docs/source/methods/statistical_models.rst +++ b/docs/source/methods/statistical_models.rst @@ -1,57 +1,56 @@ -Beta-Binomial Model for Allelic Imbalance -========================================== +Statistical Model +================= -WASP2 detects allelic imbalance via a beta-binomial likelihood-ratio test. -The beta-binomial accommodates overdispersion from biological variation, -PCR amplification, and aggregation of counts across SNPs within a region — -sources of variance that a simple binomial model cannot absorb. +WASP2 detects allelic imbalance with a **beta-binomial likelihood-ratio +test** applied per region (gene or peak). This page documents the model, +the dispersion-estimation choices, the multiple-testing correction, and +the WASP2-specific defaults and contracts that are not covered by the +papers cited below. + +For the WASP mapping filter upstream of counting, see +:doc:`mapping_filter`. Model ----- -With :math:`X` the reference count out of :math:`N` total at a site: +With :math:`X` the reference count out of :math:`N` total reads at a +site: .. math:: - p &\sim \text{Beta}(\alpha, \beta) \\ - X \mid p &\sim \text{Binomial}(N, p) + p \sim \text{Beta}(\alpha, \beta), \qquad X \mid p \sim \text{Binomial}(N, p) -WASP2 uses the mean-dispersion parameterization, :math:`\mu = \alpha/(\alpha+\beta)` -and :math:`\rho = 1/(\alpha+\beta+1)`: +WASP2 uses the mean-dispersion parameterization +:math:`\mu = \alpha/(\alpha+\beta)` and :math:`\rho = 1/(\alpha+\beta+1)`: .. math:: - \alpha = \mu \cdot \frac{1 - \rho}{\rho}, \qquad - \beta = (1 - \mu) \cdot \frac{1 - \rho}{\rho} - -The variance, - -.. math:: + \alpha = \mu\cdot\frac{1-\rho}{\rho}, \qquad \beta = (1-\mu)\cdot\frac{1-\rho}{\rho}, \qquad \text{Var}(X) = N\mu(1-\mu)[1 + (N-1)\rho] - \text{Var}(X) = N\mu(1-\mu)\left[1 + (N-1)\rho\right], +Binomial variance is recovered as :math:`\rho \to 0`. The framework +follows the RASQUAL / MBASED lineage ([Kumasaka2016]_, [Mayba2014]_). -recovers the binomial as :math:`\rho \to 0`. Estimation of :math:`\rho` is -described in :doc:`dispersion_estimation`. - -Hypothesis Test +Hypothesis test --------------- For each region, WASP2 tests .. math:: - H_0 : \mu = 0.5 \qquad \text{vs.} \qquad H_1 : \mu \neq 0.5 + H_0: \mu = 0.5 \qquad \text{vs.} \qquad H_1: \mu \neq 0.5 -with the likelihood ratio statistic +with the likelihood-ratio statistic .. math:: - \Lambda = -2\left[\log \mathcal{L}_0 - \log \mathcal{L}_1\right], \qquad - \Lambda \sim \chi^2_1 \text{ under } H_0, + \Lambda = -2[\log\mathcal{L}_0 - \log\mathcal{L}_1] \sim \chi^2_1 \text{ under } H_0. + +.. important:: -where :math:`\mathcal{L}_1` is maximized over :math:`\mu` with :math:`\rho` -held at its null-model MLE (profile likelihood in :math:`\mu`). The p-value -is :math:`P(\chi^2_1 > \Lambda)`. + **Profile-likelihood convention.** :math:`\rho` is held at its + null-model MLE when maximizing :math:`\mathcal{L}_1` over :math:`\mu` + (profile likelihood in :math:`\mu`, df = 1). WASP2 does **not** jointly + re-estimate :math:`\rho` under :math:`H_1`. MLE for :math:`\mu` under :math:`H_1`: @@ -61,113 +60,141 @@ MLE for :math:`\mu` under :math:`H_1`: from scipy.stats import betabinom import numpy as np - def neg_log_likelihood(mu, ref_counts, n_counts, rho): + def neg_ll_mu(mu, ref, n, rho): alpha = mu * (1 - rho) / rho beta = (1 - mu) * (1 - rho) / rho - return -np.sum(betabinom.logpmf(ref_counts, n_counts, alpha, beta)) - - mu_mle = minimize_scalar( - neg_log_likelihood, - args=(ref_counts, n_counts, rho), - method='bounded', - bounds=(0, 1), - ).x + return -np.sum(betabinom.logpmf(ref, n, alpha, beta)) -Dispersion models ------------------ + mu_mle = minimize_scalar(neg_ll_mu, args=(ref, n, rho), + method='bounded', bounds=(0, 1)).x -WASP2 provides a single pooled :math:`\rho` (default) and a coverage-dependent -``linear`` model (logit-linear in :math:`N`). See :doc:`dispersion_estimation` -for the choice criteria and code. +Dispersion +---------- -.. code-block:: python +Estimation of :math:`\rho` is by MLE under :math:`H_0` (:math:`\mu=0.5`). +WASP2 ships two models: - from analysis.as_analysis import single_model, linear_model +- ``single`` (default) — one pooled :math:`\rho` across all regions. + Appropriate for small-to-moderate datasets or uniform coverage. +- ``linear`` — :math:`\text{logit}(\rho_i) = \beta_0 + \beta_1 N_i`. + Appropriate for large cohorts with wide coverage ranges, where + low-coverage regions show different apparent dispersion from + high-coverage ones. Compared to ``single`` via AIC/BIC. - results = single_model(df, region_col="region", phased=False) - results = linear_model(df, region_col="region", phased=False) +**Implementation contract.** :math:`\rho` is bounded to the open interval +:math:`(10^{-6},\, 1-10^{-6})` during MLE to avoid the boundary +singularities of the beta-binomial. In the linear model, the logit is +clipped to :math:`[-10,\, 10]` before the inverse-logit, preserving +numerical stability on extreme :math:`N`. Phased vs. unphased data ------------------------- +------------------------- **Phased** genotypes reduce to a direct beta-binomial per SNP, with :math:`\mu` or :math:`1-\mu` depending on which haplotype carries the -reference allele: +reference allele. + +**Unphased** analysis marginalizes over the :math:`2^{n-1}` phase +configurations for a region of :math:`n` SNPs using a uniform prior +(dynamic programming over SNPs; approach of [Mayba2014]_): .. math:: - P(X_i \mid \text{hap}_i) = \text{BetaBinom}(X_i; N_i, \mu_{\text{hap}_i}, \rho) + P(\mathbf{X}) = \sum_{\phi} P(\mathbf{X}\mid\phi)\,P(\phi), + \qquad P(\phi) = 0.5^{n-1}. -**Unphased** analysis marginalizes over phase configurations with a uniform -prior across the :math:`2^{n-1}` phase assignments for a region of :math:`n` -SNPs, using dynamic programming over SNPs (approach introduced in -[Mayba2014]_): +Multiple-testing correction +--------------------------- -.. math:: +WASP2 applies the Benjamini–Hochberg procedure [Benjamini1995]_ via +:func:`scipy.stats.false_discovery_control`: - P(\mathbf{X}) = \sum_{\phi} P(\mathbf{X} \mid \phi) \, P(\phi), - \qquad P(\phi) = 0.5^{n-1}. +.. code-block:: python -Output ------- + from scipy.stats import false_discovery_control + fdr_pvals = false_discovery_control(pvals, method='bh') -.. table:: Columns emitted by the analysis pipeline - :widths: 20 15 65 +BH controls FDR under independence or positive regression dependence +(PRDS); :math:`\chi^2_1` p-values from the LRT satisfy these assumptions +when regions are analyzed independently. For dependent tests across +overlapping regions, Benjamini–Yekutieli (``method='by'``) is the +conservative fallback. - ============ ======== ====================================================== - Column Type Interpretation - ============ ======== ====================================================== - null_ll float Log-likelihood under :math:`H_0` (:math:`\mu=0.5`) - alt_ll float Log-likelihood under :math:`H_1` (:math:`\mu=\hat\mu_{\text{MLE}}`) - mu float MLE of the reference-allele proportion - lrt float Likelihood ratio statistic - pval float :math:`\chi^2_1` p-value - fdr_pval float BH-adjusted p-value (see :doc:`fdr_correction`) - ============ ======== ====================================================== - -:math:`\mu > 0.5` indicates a reference-allele preference; :math:`\mu < 0.5` -indicates an alternate preference; :math:`|\mu - 0.5|` is the effect size. - -Practical notes ---------------- +.. warning:: + + ``scipy.stats.false_discovery_control`` raises on NaN p-values, but a + hand-written BH loop based on ``np.minimum.accumulate`` **silently + propagates NaN** through the cumulative minimum — every corrected + p-value at or above the first NaN becomes NaN. Drop or impute NaN + p-values before BH correction. + +Defaults and output +------------------- -**Pseudocounts.** WASP2 adds :math:`c = 1` to both allele counts by default, -shrinking estimates slightly toward 0.5. This gives conservative inference and -avoids log-zero issues at extreme counts. +WASP2 default parameter values that affect the results: -**Minimum counts.** Regions with :math:`N < \text{min\_count}` (default 10) are -dropped — low coverage yields poor power and unstable MLEs. +.. list-table:: + :header-rows: 1 + :widths: 25 20 55 -**Region aggregation.** Analyzing at the region level (peaks, genes) rather -than individual SNPs increases power, reduces the multiple-testing burden, and -better captures regulatory effects that act across a region. + * - Parameter + - Default + - Meaning + * - ``pseudocount`` + - 1 + - Added to both ``ref_count`` and ``alt_count``. Laplace-style + shrinkage toward :math:`\mu = 0.5` — conservative under :math:`H_0`. + * - ``min_count`` + - 10 + - Regions with :math:`N < 10` are dropped; below this, power is too + low and the MLE is unstable. + +Output columns per region: + +.. list-table:: + :header-rows: 1 + :widths: 20 15 65 -**Power.** Power depends jointly on :math:`N`, :math:`\mu`, and :math:`\rho`. -Because WASP2 does not compute analytic power estimates at runtime, we do not -tabulate values here — power simulations for a given dataset should be run at -the dataset's own estimated :math:`\hat\rho` rather than relying on generic -tables. + * - Column + - Type + - Interpretation + * - ``null_ll`` + - float + - Log-likelihood under :math:`H_0` (:math:`\mu=0.5`). + * - ``alt_ll`` + - float + - Log-likelihood under :math:`H_1` (:math:`\mu=\hat\mu_{\text{MLE}}`). + * - ``mu`` + - float + - MLE of the reference-allele proportion. + * - ``lrt`` + - float + - Likelihood-ratio statistic. + * - ``pval`` + - float + - :math:`\chi^2_1` p-value. + * - ``fdr_pval`` + - float + - BH-adjusted p-value. + +:math:`\mu > 0.5` indicates a reference-allele preference; +:math:`|\mu - 0.5|` is the effect size. See Also -------- -- :doc:`dispersion_estimation` — estimating :math:`\rho` -- :doc:`fdr_correction` — multiple-testing correction -- :doc:`counting_algorithm` — how counts are produced -- :doc:`mapping_filter` — WASP remap-and-filter producing the BAM +- :doc:`mapping_filter` — canonical WASP filter contract, upstream of counts References ---------- -.. [Mayba2014] Mayba O, Gilbert HN, Liu J, et al. (2014). MBASED: - allele-specific expression detection in cancer tissues and cell lines. - *Genome Biology* 15:405. *Introduces the phase-marginalized mixture model.* - .. [Kumasaka2016] Kumasaka N, Knights AJ, Gaffney DJ (2016). Fine-mapping cellular QTLs with RASQUAL and ATAC-seq. *Nature Genetics* 48:206-213. - *Beta-binomial LRT framework in which WASP2's analysis sits.* -.. [vandeGeijn2015] van de Geijn B, McVicker G, Gilad Y, Pritchard JK (2015). - WASP: allele-specific software for robust molecular quantitative trait - locus discovery. *Nature Methods* 12:1061-1063. *The mapping-bias - correction upstream of the count step.* +.. [Mayba2014] Mayba O, Gilbert HN, Liu J, et al. (2014). MBASED: + allele-specific expression detection in cancer tissues and cell lines. + *Genome Biology* 15:405. + +.. [Benjamini1995] Benjamini Y, Hochberg Y (1995). Controlling the false + discovery rate: A practical and powerful approach to multiple testing. + *Journal of the Royal Statistical Society B* 57:289-300. diff --git a/docs/source/tutorials/bulk_workflow.rst b/docs/source/tutorials/bulk_workflow.rst index 692f3ad..ad721be 100644 --- a/docs/source/tutorials/bulk_workflow.rst +++ b/docs/source/tutorials/bulk_workflow.rst @@ -110,8 +110,7 @@ Step 3 — Test for allelic imbalance The beta-binomial LRT tests :math:`\mu = 0.5` against :math:`\mu \neq 0.5` per region, correcting for overdispersion and multiple testing via -Benjamini–Hochberg (:doc:`/methods/statistical_models`, -:doc:`/methods/fdr_correction`). +Benjamini–Hochberg (:doc:`/methods/statistical_models`). ``--phased`` applies only when the VCF GT field uses ``0|1`` / ``1|0``; pass it off for unphased data.