Conversation
N.B. some tests (for `segments.py` and `repertoire.py`) are failing - will fix them later!
Pull request overview
This PR refactors and accelerates core sequence operations (translation, tokenization, distances, alignment) by introducing new pybind11 C extensions, and adds new k‑mer indexing/summarisation utilities (object-based + Polars) plus accompanying docs/tests/benchmarks.
Changes:
- Added `mirseq` (translation/tokenization) and `seqdist_c` (distances + alignment scoring) C extensions, and rewired the Python wrappers to use them.
- Introduced new k‑mer utilities (`alphabets.py`, `tokens.py`, `token_tables.py`, `token_tables_pl.py`) and replaced the legacy `mir/basic/kmers.py` implementation with `mir/biomarkers/kmer_stats.py`.
- Added extensive tests/benchmarks, updated docs/README, and adjusted build/config files (CMake, pyproject, requirements, VS Code settings).
Reviewed changes
Copilot reviewed 36 out of 43 changed files in this pull request and generated 13 comments.
| File | Description |
|---|---|
| `tests/test_tokens.py` | New unit tests for Python tokenization wrappers around `mirseq`. |
| `tests/test_token_tables.py` | New tests + large benchmarks for object-based k‑mer indexing/summaries. |
| `tests/test_token_tables_impl.py` | New tests + benchmarks comparing Polars vs object-based implementations. |
| `tests/test_seqdist.py` | New unit tests for `seqdist_c` and thin Python wrappers. |
| `tests/test_segments.py` | New tests covering `SegmentLibrary` bootstrap/download logic. |
| `tests/test_repertoire.py` | Updates fixture paths for repertoire dataset tests. |
| `tests/test_mirseq.py` | New comprehensive tests for `mirseq` translation/tokenization/reduction. |
| `tests/test_mirseq_benchmark.py` | New microbenchmarks comparing Python vs C implementations. |
| `tests/test_kmer_stats.py` | New benchmark/integration-style validation for k‑mer differential stats. |
| `tests/test_alphabets.py` | New tests for alphabets, masking, and matching helpers. |
| `tests/test_aligner.py` | New correctness + benchmark tests for the aligner, incl. C vs Python fallback checks. |
| `tests/benchmakrs/memory_benchmark.py` | Removes non-English banner comments from the benchmark script. |
| `tests/assets/repertoire_1.csv` | Adds repertoire fixture CSV. |
| `tests/assets/repertoire_2.csv` | Adds repertoire fixture CSV. |
| `tests/assets/repertoire_3.csv` | Adds repertoire fixture CSV. |
| `tests/assets/repertoire_4.csv` | Adds repertoire fixture CSV. |
| `tests/assets/meta.csv` | Adds metadata fixture CSV. |
| `tests/assets/fetch_vdjdb_gilgfvftl.sh` | Adds helper script to fetch a VDJdb-derived validation dataset. |
| `test/test_repertoires/test_meta.csv` | Removes old metadata fixture. |
| `requirements.txt` | Adds new pinned dependencies (e.g., polars, pybind11, stringzilla). |
| `README.md` | Adds documentation/examples for masking and matching helpers. |
| `quick_setup.sh` | Adds a quick setup script. |
| `pyproject.toml` | Updates project metadata, dependencies, and license fields. |
| `notebooks/kmer_generator.ipynb` | Updates notebook imports to the new k‑mer stats API and clears execution counts. |
| `mir/distances/seqdist.py` | Adds thin Python wrappers for `seqdist_c`. |
| `mir/distances/seqdist.cpp` | Adds new combined C extension for distances + CDR3 scoring + visualization alignment. |
| `mir/distances/cdrscore.cpp` | Removes old CDR scoring module (merged into `seqdist_c`). |
| `mir/distances/aligner.py` | Refactors the aligner to use `seqdist_c`, adds an `align()` visualization API and richer docs. |
| `mir/biomarkers/kmer_stats.py` | Adds new k‑mer counting + chi-squared comparison utilities (replacement for legacy). |
| `mir/biomarkers/__init__.py` | Updates module docstring. |
| `mir/basic/tokens.py` | Adds C-backed tokenization wrapper functions. |
| `mir/basic/token_tables.py` | Adds object-based rearrangement k‑mer indexing + summaries. |
| `mir/basic/token_tables_pl.py` | Adds Polars-based expansion/summarisation + fetch helpers. |
| `mir/basic/mirseq.cpp` | Adds C extension for translation, tokenization, and reduced alphabet conversion. |
| `mir/basic/kmers.py` | Removes legacy k‑mer comparison implementation. |
| `mir/basic/alphabets.py` | Adds alphabets, validation, masking, and wildcard matching helpers. |
| `docs/mir.distances.rst` | Documents new `mir.distances.seqdist` and `seqdist_c`. |
| `docs/mir.biomarkers.rst` | Documents new `mir.biomarkers.kmer_stats`. |
| `docs/mir.basic.rst` | Documents new `mir.basic.alphabets` and `mir.basic.mirseq`; removes legacy kmers docs. |
| `CMakeLists.txt` | Updates native build to produce `seqdist_c` and `mirseq` modules. |
| `.vscode/settings.json` | Adjusts pytest discovery target to `tests/`. |
| `.vscode/launch.json` | Switches debugger type to debugpy. |
Comments suppressed due to low confidence (1)
tests/test_repertoire.py:17
- The test points at `assets/test_meta.csv` and `assets/{file}`, but the committed fixtures live under `tests/assets/` (`meta.csv` and `repertoire_*.csv`). As written, this will raise `FileNotFoundError` when running pytest from the repo root. Consider resolving paths relative to `__file__` (e.g., `Path(__file__).parent / "assets"`) and using the actual filename (`meta.csv`).
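A sketch of the suggested fix (the `fixture_path` helper name is hypothetical, not the PR's actual code):

```python
from pathlib import Path

# Resolve fixtures relative to the test module, not the working directory,
# so pytest can be launched from any location.
ASSETS = Path(__file__).parent / "assets"

def fixture_path(name: str) -> Path:
    """Hypothetical helper returning the path of a committed test asset."""
    return ASSETS / name
```

For example, `fixture_path("meta.csv")` would replace the hard-coded `assets/test_meta.csv`.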
```python
if id_set is None:
    ids[key] = {rid}
    dups[key] = dc
else:
```
`summarize_rearrangements` claims `duplicate_count` is summed across unique rearrangements, but `dups[key] += dc` is executed for every occurrence of the k-mer (including repeated occurrences within the same rearrangement). This over-counts duplicates when a rearrangement contains the same k-mer multiple times. Only add `dc` when `rid` is newly added to `id_set` (or track per-`(key, rid)` seen state).
```diff
-else:
+elif rid not in id_set:
```
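The fix in context, as a sketch (the surrounding names `ids`/`dups` and the `add_occurrence` wrapper are reconstructed for illustration, not the PR's actual code):

```python
ids: dict = {}   # key -> set of rearrangement ids containing the k-mer
dups: dict = {}  # key -> duplicate_count summed over unique rearrangements

def add_occurrence(key, rid, dc):
    # Add dc only the first time rid is seen for this key, so repeated
    # occurrences of one k-mer within a rearrangement don't over-count.
    id_set = ids.get(key)
    if id_set is None:
        ids[key] = {rid}
        dups[key] = dc
    elif rid not in id_set:
        id_set.add(rid)
        dups[key] += dc
```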
```cpp
static py::list c_tokenize_gapped_bytes(const py::object& obj, int k, int mask_byte) {
    auto sv = to_view(obj);
    int n = (int)sv.len;
    if (k < 1 || k > n)
        throw std::invalid_argument("k must be between 1 and sequence length");
    int n_windows = n - k + 1;
    int total = n_windows * k;
    py::list result(total);
    // Temporary buffer for each gapped k-mer
    char* buf = (char*)alloca(k);
    int idx = 0;
    for (int i = 0; i < n_windows; ++i) {
        for (int j = 0; j < k; ++j) {
            std::memcpy(buf, sv.data + i, k);
            buf[j] = (char)mask_byte;
            result[idx++] = py::bytes(buf, k);
        }
    }
    return result;
}

static py::list c_tokenize_gapped_str(const py::object& obj, int k, int mask_byte) {
    auto sv = to_view(obj);
    int n = (int)sv.len;
    if (k < 1 || k > n)
        throw std::invalid_argument("k must be between 1 and sequence length");
    int n_windows = n - k + 1;
    int total = n_windows * k;
    py::list result(total);
    char* buf = (char*)alloca(k);
    int idx = 0;
```
`alloca(k)` is used but the file doesn't include the appropriate header (platform-dependent), and `alloca` isn't standard C++. This can break builds on some toolchains (notably MSVC) or with stricter compiler settings. Prefer a reusable buffer via `std::string buf(k, '\0')` / `std::vector<char>` (allocated once per call), or add the correct platform-conditional include + `_alloca` handling.
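For illustration, a portable variant using a `std::string` buffer; this is a simplified stand-alone sketch without the pybind11 plumbing (it returns owned strings rather than `py::bytes`, and returns empty instead of throwing):

```cpp
#include <cstring>
#include <string>
#include <vector>

// Gapped tokenizer sketch: one buffer allocated per call (portable),
// reused for every masked k-mer, instead of non-standard alloca().
std::vector<std::string> tokenize_gapped(const std::string& seq, int k, char mask) {
    std::vector<std::string> result;
    const int n = (int)seq.size();
    if (k < 1 || k > n)
        return result;  // the reviewed code throws std::invalid_argument here
    std::string buf(k, '\0');
    for (int i = 0; i + k <= n; ++i) {
        for (int j = 0; j < k; ++j) {
            std::memcpy(&buf[0], seq.data() + i, k);  // copy the window
            buf[j] = mask;                            // mask one position
            result.push_back(buf);
        }
    }
    return result;
}
```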
```cpp
auto mbuf = mat256.request();
const double* mat = static_cast<double*>(mbuf.ptr);
```
`selfscore` uses `mat256.request()` but does not validate that `mat256` is actually 256×256, unlike `score_max`/`best_alignment`. Since `seqdist_c.selfscore` is part of the public module API, passing a wrong-shaped array could lead to out-of-bounds reads. Reuse `extract_mat()` (or replicate its shape checks) inside `c_selfscore`.
```diff
-auto mbuf = mat256.request();
-const double* mat = static_cast<double*>(mbuf.ptr);
+const double* mat = extract_mat(mat256, use_mat);
```
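The intent of that check, as a stand-alone sketch (`check_mat256` is a hypothetical name; the PR's `extract_mat` signature may differ):

```cpp
#include <cstddef>
#include <stdexcept>

// Validate that a 2-D buffer is exactly 256x256 before treating it as a
// substitution matrix, mirroring the shape checks extract_mat performs.
const double* check_mat256(const double* data, std::size_t rows, std::size_t cols) {
    if (rows != 256 || cols != 256)
        throw std::invalid_argument("substitution matrix must be 256x256");
    return data;
}
```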
```toml
classifiers = [
    "Programming Language :: Python :: 3",
    "License :: OSI Approved :: MIT License",
    "License :: OSI Approved :: GPL-3.0 license",
```
The Trove classifier string `"License :: OSI Approved :: GPL-3.0 license"` is not a valid PyPI classifier and will cause packaging metadata validation errors. Use the canonical classifier (e.g. `License :: OSI Approved :: GNU General Public License v3 (GPLv3)`) and keep the license text separately if desired.
```diff
-    "License :: OSI Approved :: GPL-3.0 license",
+    "License :: OSI Approved :: GNU General Public License v3 (GPLv3)",
```
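For reference, a fragment using only canonical Trove strings plus an SPDX-style `license` field (PEP 639). The MIT-vs-GPL choice is the authors' to make; the `[project]` table placement and the SPDX id are assumptions, not the PR's actual metadata:

```toml
[project]
license = "GPL-3.0-only"  # assumed SPDX id; use "MIT" if MIT is intended
classifiers = [
    "Programming Language :: Python :: 3",
    "License :: OSI Approved :: GNU General Public License v3 (GPLv3)",
]
```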
```sh
python3 -m venv venv
. venv/bin/activate.fish
```
`quick_setup.sh` is missing a shebang and sources `venv/bin/activate.fish`, which will fail when executed with bash/sh (and on systems without fish). If this is meant to be a portable setup script, add a shebang and either use `venv/bin/activate` or detect the user's shell and source the appropriate activation script.
```diff
-python3 -m venv venv
-. venv/bin/activate.fish
+#!/usr/bin/env sh
+python3 -m venv venv
+. venv/bin/activate
```
```python
"""Benchmarks: C extensions vs pure-Python for key operations.

Run with ``python -m pytest tests/test_mirseq_benchmark.py -v -s``.
"""
```
This file is entirely performance benchmarks (large inner loops) but is named `test_...` and will be collected and executed by pytest by default. Consider renaming/moving it to a benchmark-only location, or skipping it unless an explicit flag/marker is provided.
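One lightweight guard, sketched with a hypothetical `RUN_BENCHMARKS` opt-in environment variable (a module-level `pytestmark` skips every test in the file):

```python
import os

import pytest

# Skip the whole benchmark module unless explicitly requested.
# RUN_BENCHMARKS is a hypothetical opt-in variable, not part of the PR.
pytestmark = pytest.mark.skipif(
    os.environ.get("RUN_BENCHMARKS") != "1",
    reason="benchmark; set RUN_BENCHMARKS=1 to run",
)
```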
```python
@unittest.skipUnless(GILG_FILE.exists(),
                     "VDJdb asset missing — run tests/assets/fetch_vdjdb_gilgfvftl.sh")
class TestKmerStatsBenchmark(unittest.TestCase):
    """K-mer differential analysis: GILGFVFTL-specific vs OLGA background."""

    @classmethod
    def setUpClass(cls):
        cls.gilg_cdr3s = _load_gilgfvftl_cdr3s()
        cls.bg_cdr3s = _generate_olga_background(1000)
        cls.gilg_rep = _make_repertoire(cls.gilg_cdr3s)
        cls.bg_rep = _make_repertoire(cls.bg_cdr3s)
```
`tests/test_kmer_stats.py` will run by default because the required VDJdb asset file is committed, and it generates an OLGA background and runs multiple chi-squared comparisons with printed output. This looks more like an integration/benchmark test than a unit test and may be slow or flaky across OLGA/model changes. Consider marking/guarding it (env var / pytest marker) so CI can run a fast unit-test subset reliably.
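A marker-based alternative sketch (`integration` is a hypothetical marker name that would need registering under `[tool.pytest.ini_options]` `markers`):

```python
import pytest

# Mark every test in this module; CI's fast path can then run
#   pytest -m "not integration"
# while the full suite runs without the filter.
pytestmark = pytest.mark.integration
```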
```python
class TestSummarizeRearrangementsBenchmark:
    N = 100_000
    K = 5
    JUNCTION = "CASSLAPGATNEKLFF"

    @pytest.fixture(scope="class")
    def rearrangements(self):
        return [
            Rearrangement("TRB", i, "TRBV5-1", "TRBC1",
                          self.JUNCTION, 10)
            for i in range(self.N)
        ]

    def test_benchmark_summarize_plain(self, rearrangements):
        summarize_rearrangements(rearrangements[:1000], self.K)
        t0 = time.perf_counter()
        stats = summarize_rearrangements(rearrangements, self.K)
        elapsed = time.perf_counter() - t0
        print(
            f"\nsummarize (plain): {self.N:,} rearrangements, k={self.K} → "
            f"{len(stats):,} unique keys, {elapsed:.3f}s "
            f"({self.N / elapsed:,.0f} rearrangements/s)"
        )

    def test_benchmark_summarize_gapped(self, rearrangements):
        summarize_rearrangements(rearrangements[:1000], self.K, mask_byte=MASK)
        t0 = time.perf_counter()
        stats = summarize_rearrangements(rearrangements, self.K, mask_byte=MASK)
        elapsed = time.perf_counter() - t0
```
This benchmark class (`N = 100_000`) is collected as part of the normal pytest suite and will run in CI by default. It allocates large in-memory structures and can easily dominate test runtime. Consider skipping it behind an env var/marker or relocating benchmarks outside the default `tests/` collection.
```python
class TestSummarizeAnnotationsBenchmark:
    N = 100_000
    K = 5
    JUNCTION = "CASSLAPGATNEKLFF"

    @pytest.fixture(scope="class")
    def rearrangements(self):
        return [
            Rearrangement("TRB", i, "TRBV5-1", "TRBC1",
                          self.JUNCTION, 10)
            for i in range(self.N)
        ]

    def test_benchmark_annotations_plain(self, rearrangements):
        summarize_annotations(rearrangements[:1000], self.K)
        t0 = time.perf_counter()
        ann = summarize_annotations(rearrangements, self.K)
        elapsed = time.perf_counter() - t0
        total_annotations = sum(len(v) for v in ann.values())
        print(
            f"\nsummarize_annotations (plain): {self.N:,} rearrangements, "
            f"k={self.K} → {len(ann):,} KmerSeq keys, "
            f"{total_annotations:,} annotations, "
            f"{elapsed:.3f}s ({self.N / elapsed:,.0f} rearrangements/s)"
        )
```
This benchmark (`N = 100_000`) runs as a standard `test_...` file and will be executed by pytest by default, which is likely too slow/heavy for CI. Consider guarding it with a benchmark marker/env var or moving it to a dedicated benchmark suite.
```python
class TestOlgaKmerSummary:
    """Generate 10,000 human TCR-beta rearrangements via OLGA and validate
    biological expectations on k-mer incidence."""

    N = 10_000
    K = 3

    @pytest.fixture(scope="class")
    def olga_rearrangements(self):
        from mir.basic.pgen import OlgaModel

        model = OlgaModel(chain="TRB")
        seqs = model.generate_sequences_with_meta(self.N, pgens=False)
        return [
            Rearrangement(
                locus="TRB",
                id=i,
                v_gene=rec["v_gene"].split("*")[0],  # strip allele
                c_gene="",
                junction_aa=rec["cdr3"],
                duplicate_count=1,
            )
            for i, rec in enumerate(seqs)
        ]

    @pytest.fixture(scope="class")
    def annotations(self, olga_rearrangements):
        return summarize_annotations(olga_rearrangements, self.K)
```
This OLGA-based suite generates 10,000 sequences and asserts statistical properties of k-mer incidence. Because it lives under `tests/` and uses `test_...` names, it will run by default and may be slow and potentially flaky across OLGA/model/library changes. Consider marking it as integration/slow and skipping it unless explicitly enabled.
Refactor sequence translation, tokenisation, and alignment, and provide optimised C code for these features. I extensively tested wrappers based on `bytes` & stringzilla, but they were 3-4x slower than naive Python string slicing. Also implement k-mer summary tables and refactor the original `kmers.py` code for the contingency-table chi-squared test. Add unit tests, benchmarks & docs.
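For readers unfamiliar with the statistic: for a 2x2 contingency table (k-mer present/absent in case vs background repertoires) the chi-squared test reduces to a closed form. A minimal sketch, without continuity correction; this is not the actual `kmer_stats` API, which may differ:

```python
def chi2_2x2(a: int, b: int, c: int, d: int) -> float:
    """Chi-squared statistic for a 2x2 contingency table.

    a/b: rearrangements with/without the k-mer in the case repertoire;
    c/d: with/without the k-mer in the background repertoire.
    """
    n = a + b + c + d
    den = (a + b) * (c + d) * (a + c) * (b + d)
    if den == 0:
        return 0.0
    return n * (a * d - b * c) ** 2 / den
```

Significance would then come from the chi-squared distribution with one degree of freedom.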