Skip to content

rust: restore canonical WASP filter contract (van de Geijn 2015)#96

Merged
Jaureguy760 merged 4 commits intomasterfrom
feat/canonical-wasp-filter-contract
Apr 19, 2026
Merged

rust: restore canonical WASP filter contract (van de Geijn 2015)#96
Jaureguy760 merged 4 commits intomasterfrom
feat/canonical-wasp-filter-contract

Conversation

@Jaureguy760
Copy link
Copy Markdown
Collaborator

Summary

Restore canonical WASP filter contract in the Rust counting path. The v1.2.0 migration (commit a72ffba) added 6 SAM flag filters that silently drop valid WASP-pass alignments when the input BAM has been pre-filtered. The impact is small on BWA output but substantial on STAR output: on RNA, v1.3.0+ disagrees with the pre-v1.3.0 Python counter at ~37% of gene-level ref/alt rows on the same BAM.

Fix

Three edits in rust/src/, keeping is_unmapped for crash safety:

File Dropped
bam_counter.rs is_secondary | is_supplementary | is_duplicate
mapping_filter.rs !is_proper_pair | is_secondary | is_supplementary
bam_filter.rs bitmask 0x100 | 0x800 | 0x200 | 0x400

Net: +3 insertions, -13 deletions across 3 files.

Validation

On one donor (BAM already WASP-remapped), canonical-parity Rust vs pre-v1.3.0 Python:

  • RNA gene counts: 0 / 17,728 rows differ (byte-identical)
  • ATAC peak counts: 8 / 7,986 rows differ (0.10%; residual is an unrelated pre-dedup ordering bug in the Python counter, not a filter-policy issue)

tests/test_rust_python_counting_parity.py Python reference updated to match the new contract.

Backwards compatibility

Users who relied on the defensive filtering should pre-filter BAMs upstream:

samtools view -F 0x904 in.bam -o filtered.bam

This matches the canonical WASP contract documented in bmvdgeijn/WASP CHT/bam2h5.py:28-30: "This program does not perform filtering of reads based on mappability. It is assumed that the input BAM files are filtered appropriately prior to calling this script."

Reference

van de Geijn et al. 2015, Nat Methods, 10.1038/nmeth.3582

Remove 6 SAM flag filters added in 1.2.0 migration (commit a72ffba) so
WASP2 counting respects pre-filtered BAMs. This realigns behavior with
the canonical WASP documented contract that callers are responsible for
upstream BAM filtering (bmvdgeijn/WASP CHT/bam2h5.py: "This program does
not perform filtering of reads based on mappability. It is assumed that
the input BAM files are filtered appropriately prior to calling this
script.").

Three edits, keeping is_unmapped for crash safety:

  - rust/src/bam_counter.rs
      drop is_secondary | is_supplementary | is_duplicate
  - rust/src/mapping_filter.rs
      drop !is_proper_pair | is_secondary | is_supplementary
  - rust/src/bam_filter.rs
      drop bitmask 0x100 | 0x800 | 0x200 | 0x400

Rationale: on WASP-remapped input (e.g., *_wasp_filt_rmdup.bam), the
v1.2.0 filters re-filter already-cleaned reads, silently dropping valid
WASP-pass alignments. The impact is small on BWA output (~0.15%) but
substantial on STAR output (~37% of gene-level ref/alt rows differ)
where secondary/supplementary alignments are routine. Removing the
filters restores byte-level parity with the pre-v1.3.0 Python counter
on RNA (0/17,728 rows differ) and within 0.10% on ATAC, with the ATAC
residual tracing to a pre-existing pre-dedup ordering bug in
count_alleles.py (read marked seen before the aligned-pairs check).

Callers relying on the defensive filters should pre-filter BAMs
upstream (e.g., samtools view -F 0x904).

Refs: van de Geijn et al. 2015, Nat Methods 10.1038/nmeth.3582
test_rust_python_counting_parity reimplements the Rust counter in pure
Python to check numerical parity. Update the Python reference to drop
only unmapped reads, matching the new canonical WASP behavior in
bam_counter.rs. Without this, the parity test would fail because the
Python reference would continue to filter secondary/supplementary/
duplicate while Rust no longer does.
Remove quoted return type annotation on make_intersect_df; polars is
imported at module top so the forward reference is unnecessary. This
un-blocks ruff in CI.
@Jaureguy760 Jaureguy760 merged commit 40ef4b6 into master Apr 19, 2026
12 checks passed
@Jaureguy760 Jaureguy760 deleted the feat/canonical-wasp-filter-contract branch April 19, 2026 00:07
Jaureguy760 added a commit that referenced this pull request Apr 19, 2026
Bump Cargo.toml/Dockerfile/bioconda-recipe/Singularity.def from 1.4.0
to 1.4.1. Move CHANGELOG [Unreleased] to [1.4.1] with 2026-04-18 date.

Release notes:
- rust: restore canonical WASP filter contract (#96)
- tests: align Python reference with canonical filter contract (#96)
- chore: fix pre-existing ruff UP037 in intersect_variant_data.py (#96)
Jaureguy760 added a commit that referenced this pull request Apr 19, 2026
* feat(security): add CodeQL analysis and improve vulnerability scanning

- Add CodeQL workflow for advanced Python SAST with security-extended queries
- Improve security.yml to fail builds on vulnerabilities (removed || true)
- Add Gitleaks CI job for automated secret detection
- Add weekly scheduled security scans
- Create SECURITY.md with vulnerability disclosure policy

Implements GitHub issue #27.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* fix(ci): use self-hosted runner for gitleaks, make CodeQL optional

- Modified gitleaks to run on self-hosted runner with direct CLI
- Added continue-on-error to CodeQL (requires GitHub-hosted runners)
- This allows CI to pass with only self-hosted infrastructure

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* fix(ci): correct pip-audit flag and handle cargo audit lock file

- Change --require-hashes=false to --no-require-hashes (correct flag syntax)
- Remove stale advisory-db lock file before cargo audit
- Add || true to make audit failures non-blocking (advisory only)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* docs: add comprehensive 10X scRNA-seq barcode file format examples (#88)

Add detailed documentation for 10X barcode file formats including:
- Chemistry version table (v2/v3/Multiome) with whitelist sizes
- PBMC and multi-sample aggregated examples
- Format validation utilities (bash and Python)
- Common format variations and suffix handling
- Quick diagnostic commands for troubleshooting

Add example barcode test files:
- barcodes_10x_multi_sample.tsv (multi-sample with -1/-2/-3 suffixes)
- barcodes_10x_hierarchical.tsv (hierarchical cell type naming)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
Jaureguy760 added a commit that referenced this pull request Apr 19, 2026
* rust: restore canonical WASP filter contract

Remove 6 SAM flag filters added in 1.2.0 migration (commit ed5b007) so
WASP2 counting respects pre-filtered BAMs. This realigns behavior with
the canonical WASP documented contract that callers are responsible for
upstream BAM filtering (bmvdgeijn/WASP CHT/bam2h5.py: "This program does
not perform filtering of reads based on mappability. It is assumed that
the input BAM files are filtered appropriately prior to calling this
script.").

Three edits, keeping is_unmapped for crash safety:

  - rust/src/bam_counter.rs
      drop is_secondary | is_supplementary | is_duplicate
  - rust/src/mapping_filter.rs
      drop !is_proper_pair | is_secondary | is_supplementary
  - rust/src/bam_filter.rs
      drop bitmask 0x100 | 0x800 | 0x200 | 0x400

Rationale: on WASP-remapped input (e.g., *_wasp_filt_rmdup.bam), the
v1.2.0 filters re-filter already-cleaned reads, silently dropping valid
WASP-pass alignments. The impact is small on BWA output (~0.15%) but
substantial on STAR output (~37% of gene-level ref/alt rows differ)
where secondary/supplementary alignments are routine. Removing the
filters restores byte-level parity with the pre-v1.3.0 Python counter
on RNA (0/17,728 rows differ) and within 0.10% on ATAC, with the ATAC
residual tracing to a pre-existing pre-dedup ordering bug in
count_alleles.py (read marked seen before the aligned-pairs check).

Callers relying on the defensive filters should pre-filter BAMs
upstream (e.g., samtools view -F 0x904).

Refs: van de Geijn et al. 2015, Nat Methods 10.1038/nmeth.3582

* tests: align Python reference with canonical filter contract

test_rust_python_counting_parity reimplements the Rust counter in pure
Python to check numerical parity. Update the Python reference to drop
only unmapped reads, matching the new canonical WASP behavior in
bam_counter.rs. Without this, the parity test would fail because the
Python reference would continue to filter secondary/supplementary/
duplicate while Rust no longer does.

* docs: CHANGELOG entry for canonical WASP filter restoration

* chore: fix ruff UP037 in intersect_variant_data

Remove quoted return type annotation on make_intersect_df; polars is
imported at module top so the forward reference is unnecessary. This
un-blocks ruff in CI.
Jaureguy760 added a commit that referenced this pull request Apr 19, 2026
Bump Cargo.toml/Dockerfile/bioconda-recipe/Singularity.def from 1.4.0
to 1.4.1. Move CHANGELOG [Unreleased] to [1.4.1] with 2026-04-18 date.

Release notes:
- rust: restore canonical WASP filter contract (#96)
- tests: align Python reference with canonical filter contract (#96)
- chore: fix pre-existing ruff UP037 in intersect_variant_data.py (#96)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant