Conversation
N.B. some tests (for `segments.py` and `repertoire.py`) are failing - will fix them later!
Pull request overview
This PR refactors and accelerates core sequence operations (translation, tokenization, distances, alignment) by introducing new pybind11 C extensions, and adds new k‑mer indexing/summarisation utilities (object-based + Polars) plus accompanying docs/tests/benchmarks.
Changes:
- Added `mirseq` (translation/tokenization) and `seqdist_c` (distances + alignment scoring) C extensions, and rewired the Python wrappers to use them.
- Introduced new k‑mer utilities (`alphabets.py`, `tokens.py`, `token_tables.py`, `token_tables_pl.py`) and replaced the legacy `mir/basic/kmers.py` implementation with `mir/biomarkers/kmer_stats.py`.
- Added extensive tests/benchmarks, updated docs/README, and adjusted build/config files (CMake, pyproject, requirements, VS Code settings).
Reviewed changes
Copilot reviewed 36 out of 43 changed files in this pull request and generated 13 comments.
| File | Description |
|---|---|
| `tests/test_tokens.py` | New unit tests for Python tokenization wrappers around `mirseq`. |
| `tests/test_token_tables.py` | New tests + large benchmarks for object-based k‑mer indexing/summaries. |
| `tests/test_token_tables_impl.py` | New tests + benchmarks comparing Polars vs object-based implementations. |
| `tests/test_seqdist.py` | New unit tests for `seqdist_c` and thin Python wrappers. |
| `tests/test_segments.py` | New tests covering `SegmentLibrary` bootstrap/download logic. |
| `tests/test_repertoire.py` | Updates fixture paths for repertoire dataset tests. |
| `tests/test_mirseq.py` | New comprehensive tests for `mirseq` translation/tokenization/reduction. |
| `tests/test_mirseq_benchmark.py` | New microbenchmarks comparing Python vs C implementations. |
| `tests/test_kmer_stats.py` | New benchmark/integration-style validation for k‑mer differential stats. |
| `tests/test_alphabets.py` | New tests for alphabets, masking, and matching helpers. |
| `tests/test_aligner.py` | New correctness + benchmark tests for the aligner, incl. C vs Python fallback checks. |
| `tests/benchmakrs/memory_benchmark.py` | Removes non-English banner comments from the benchmark script. |
| `tests/assets/repertoire_1.csv` | Adds repertoire fixture CSV. |
| `tests/assets/repertoire_2.csv` | Adds repertoire fixture CSV. |
| `tests/assets/repertoire_3.csv` | Adds repertoire fixture CSV. |
| `tests/assets/repertoire_4.csv` | Adds repertoire fixture CSV. |
| `tests/assets/meta.csv` | Adds metadata fixture CSV. |
| `tests/assets/fetch_vdjdb_gilgfvftl.sh` | Adds helper script to fetch a VDJdb-derived validation dataset. |
| `test/test_repertoires/test_meta.csv` | Removes old metadata fixture. |
| `requirements.txt` | Adds new pinned dependencies (e.g., polars, pybind11, stringzilla). |
| `README.md` | Adds documentation/examples for masking and matching helpers. |
| `quick_setup.sh` | Adds a quick setup script. |
| `pyproject.toml` | Updates project metadata, dependencies, and license fields. |
| `notebooks/kmer_generator.ipynb` | Updates notebook imports to the new k‑mer stats API and clears execution counts. |
| `mir/distances/seqdist.py` | Adds thin Python wrappers for `seqdist_c`. |
| `mir/distances/seqdist.cpp` | Adds new combined C extension for distances + CDR3 scoring + visualization alignment. |
| `mir/distances/cdrscore.cpp` | Removes old CDR scoring module (merged into `seqdist_c`). |
| `mir/distances/aligner.py` | Refactors the aligner to use `seqdist_c`, adds an `align()` visualization API and richer docs. |
| `mir/biomarkers/kmer_stats.py` | Adds new k‑mer counting + chi-squared comparison utilities (replacement for legacy). |
| `mir/biomarkers/__init__.py` | Updates module docstring. |
| `mir/basic/tokens.py` | Adds C-backed tokenization wrapper functions. |
| `mir/basic/token_tables.py` | Adds object-based rearrangement k‑mer indexing + summaries. |
| `mir/basic/token_tables_pl.py` | Adds Polars-based expansion/summarisation + fetch helpers. |
| `mir/basic/mirseq.cpp` | Adds C extension for translation, tokenization, and reduced alphabet conversion. |
| `mir/basic/kmers.py` | Removes legacy k‑mer comparison implementation. |
| `mir/basic/alphabets.py` | Adds alphabets, validation, masking, and wildcard matching helpers. |
| `docs/mir.distances.rst` | Documents new `mir.distances.seqdist` and `seqdist_c`. |
| `docs/mir.biomarkers.rst` | Documents new `mir.biomarkers.kmer_stats`. |
| `docs/mir.basic.rst` | Documents new `mir.basic.alphabets` and `mir.basic.mirseq`; removes legacy kmers docs. |
| `CMakeLists.txt` | Updates native build to produce `seqdist_c` and `mirseq` modules. |
| `.vscode/settings.json` | Adjusts pytest discovery target to `tests/`. |
| `.vscode/launch.json` | Switches debugger type to debugpy. |
Comments suppressed due to low confidence (1)
tests/test_repertoire.py:17
- The test points at `assets/test_meta.csv` and `assets/{file}`, but the committed fixtures live under `tests/assets/` (`meta.csv` and `repertoire_*.csv`). As written, this will raise `FileNotFoundError` when running pytest from the repo root. Consider resolving paths relative to `__file__` (e.g., `Path(__file__).parent / "assets"`) and using the actual filename (`meta.csv`).
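A sketch of the suggested fix (the `fixture_path` helper name is hypothetical, not the PR's actual code):

```python
from pathlib import Path

# Resolve fixtures relative to the test module, not the working directory,
# so pytest can be launched from any location.
ASSETS = Path(__file__).parent / "assets"

def fixture_path(name: str) -> Path:
    """Hypothetical helper returning the path of a committed test asset."""
    return ASSETS / name
```

For example, `fixture_path("meta.csv")` would replace the hard-coded `assets/test_meta.csv`.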
```python
if id_set is None:
    ids[key] = {rid}
    dups[key] = dc
else:
```
`summarize_rearrangements` claims `duplicate_count` is summed across unique rearrangements, but `dups[key] += dc` is executed for every occurrence of the k-mer (including repeated occurrences within the same rearrangement). This over-counts duplicates when a rearrangement contains the same k-mer multiple times. Only add `dc` when `rid` is newly added to `id_set` (or track per-`(key, rid)` seen state).
```diff
-else:
+elif rid not in id_set:
```
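The fix in context, as a sketch (the surrounding names `ids`/`dups` and the `add_occurrence` wrapper are reconstructed for illustration, not the PR's actual code):

```python
ids: dict = {}   # key -> set of rearrangement ids containing the k-mer
dups: dict = {}  # key -> duplicate_count summed over unique rearrangements

def add_occurrence(key, rid, dc):
    # Add dc only the first time rid is seen for this key, so repeated
    # occurrences of one k-mer within a rearrangement don't over-count.
    id_set = ids.get(key)
    if id_set is None:
        ids[key] = {rid}
        dups[key] = dc
    elif rid not in id_set:
        id_set.add(rid)
        dups[key] += dc
```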
```cpp
static py::list c_tokenize_gapped_bytes(const py::object& obj, int k, int mask_byte) {
    auto sv = to_view(obj);
    int n = (int)sv.len;
    if (k < 1 || k > n)
        throw std::invalid_argument("k must be between 1 and sequence length");
    int n_windows = n - k + 1;
    int total = n_windows * k;
    py::list result(total);
    // Temporary buffer for each gapped k-mer
    char* buf = (char*)alloca(k);
    int idx = 0;
    for (int i = 0; i < n_windows; ++i) {
        for (int j = 0; j < k; ++j) {
            std::memcpy(buf, sv.data + i, k);
            buf[j] = (char)mask_byte;
            result[idx++] = py::bytes(buf, k);
        }
    }
    return result;
}

static py::list c_tokenize_gapped_str(const py::object& obj, int k, int mask_byte) {
    auto sv = to_view(obj);
    int n = (int)sv.len;
    if (k < 1 || k > n)
        throw std::invalid_argument("k must be between 1 and sequence length");
    int n_windows = n - k + 1;
    int total = n_windows * k;
    py::list result(total);
    char* buf = (char*)alloca(k);
    int idx = 0;
```
`alloca(k)` is used but the file doesn't include the appropriate header (platform-dependent), and `alloca` isn't standard C++. This can break builds on some toolchains (notably MSVC) or with stricter compiler settings. Prefer a reusable buffer via `std::string buf(k, '\0')` / `std::vector<char>` (allocated once per call), or add the correct platform-conditional include + `_alloca` handling.
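For illustration, a portable variant using a `std::string` buffer; this is a simplified stand-alone sketch without the pybind11 plumbing (it returns owned strings rather than `py::bytes`, and returns empty instead of throwing):

```cpp
#include <cstring>
#include <string>
#include <vector>

// Gapped tokenizer sketch: one buffer allocated per call (portable),
// reused for every masked k-mer, instead of non-standard alloca().
std::vector<std::string> tokenize_gapped(const std::string& seq, int k, char mask) {
    std::vector<std::string> result;
    const int n = (int)seq.size();
    if (k < 1 || k > n)
        return result;  // the reviewed code throws std::invalid_argument here
    std::string buf(k, '\0');
    for (int i = 0; i + k <= n; ++i) {
        for (int j = 0; j < k; ++j) {
            std::memcpy(&buf[0], seq.data() + i, k);  // copy the window
            buf[j] = mask;                            // mask one position
            result.push_back(buf);
        }
    }
    return result;
}
```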
```cpp
auto mbuf = mat256.request();
const double* mat = static_cast<double*>(mbuf.ptr);
```
`selfscore` uses `mat256.request()` but does not validate that `mat256` is actually 256×256, unlike `score_max`/`best_alignment`. Since `seqdist_c.selfscore` is part of the public module API, passing a wrong-shaped array could lead to out-of-bounds reads. Reuse `extract_mat()` (or replicate its shape checks) inside `c_selfscore`.
```diff
-auto mbuf = mat256.request();
-const double* mat = static_cast<double*>(mbuf.ptr);
+const double* mat = extract_mat(mat256, use_mat);
```
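The intent of that check, as a stand-alone sketch (`check_mat256` is a hypothetical name; the PR's `extract_mat` signature may differ):

```cpp
#include <cstddef>
#include <stdexcept>

// Validate that a 2-D buffer is exactly 256x256 before treating it as a
// substitution matrix, mirroring the shape checks extract_mat performs.
const double* check_mat256(const double* data, std::size_t rows, std::size_t cols) {
    if (rows != 256 || cols != 256)
        throw std::invalid_argument("substitution matrix must be 256x256");
    return data;
}
```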
```toml
classifiers = [
    "Programming Language :: Python :: 3",
    "License :: OSI Approved :: MIT License",
    "License :: OSI Approved :: GPL-3.0 license",
```
The Trove classifier string `"License :: OSI Approved :: GPL-3.0 license"` is not a valid PyPI classifier and will cause packaging metadata validation errors. Use the canonical classifier (e.g. `License :: OSI Approved :: GNU General Public License v3 (GPLv3)`) and keep the license text separately if desired.
```diff
-    "License :: OSI Approved :: GPL-3.0 license",
+    "License :: OSI Approved :: GNU General Public License v3 (GPLv3)",
```
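For reference, a fragment using only canonical Trove strings plus an SPDX-style `license` field (PEP 639). The MIT-vs-GPL choice is the authors' to make; the `[project]` table placement and the SPDX id are assumptions, not the PR's actual metadata:

```toml
[project]
license = "GPL-3.0-only"  # assumed SPDX id; use "MIT" if MIT is intended
classifiers = [
    "Programming Language :: Python :: 3",
    "License :: OSI Approved :: GNU General Public License v3 (GPLv3)",
]
```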
```sh
python3 -m venv venv
. venv/bin/activate.fish
```
`quick_setup.sh` is missing a shebang and sources `venv/bin/activate.fish`, which will fail when executed with bash/sh (and on systems without fish). If this is meant to be a portable setup script, add a shebang and either use `venv/bin/activate` or detect the user's shell and source the appropriate activation script.
```diff
-python3 -m venv venv
-. venv/bin/activate.fish
+#!/usr/bin/env sh
+python3 -m venv venv
+. venv/bin/activate
```
```python
"""Benchmarks: C extensions vs pure-Python for key operations.

Run with ``python -m pytest tests/test_mirseq_benchmark.py -v -s``.
"""
```
This file is entirely performance benchmarks (large inner loops) but is named `test_...` and will be collected and executed by pytest by default. Consider renaming/moving it to a benchmark-only location, or skipping it unless an explicit flag/marker is provided.
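One lightweight guard, sketched with a hypothetical `RUN_BENCHMARKS` opt-in environment variable (a module-level `pytestmark` skips every test in the file):

```python
import os

import pytest

# Skip the whole benchmark module unless explicitly requested.
# RUN_BENCHMARKS is a hypothetical opt-in variable, not part of the PR.
pytestmark = pytest.mark.skipif(
    os.environ.get("RUN_BENCHMARKS") != "1",
    reason="benchmark; set RUN_BENCHMARKS=1 to run",
)
```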
```python
@unittest.skipUnless(GILG_FILE.exists(),
                     "VDJdb asset missing — run tests/assets/fetch_vdjdb_gilgfvftl.sh")
class TestKmerStatsBenchmark(unittest.TestCase):
    """K-mer differential analysis: GILGFVFTL-specific vs OLGA background."""

    @classmethod
    def setUpClass(cls):
        cls.gilg_cdr3s = _load_gilgfvftl_cdr3s()
        cls.bg_cdr3s = _generate_olga_background(1000)
        cls.gilg_rep = _make_repertoire(cls.gilg_cdr3s)
        cls.bg_rep = _make_repertoire(cls.bg_cdr3s)
```
`tests/test_kmer_stats.py` will run by default because the required VDJdb asset file is committed, and it generates an OLGA background and runs multiple chi-squared comparisons with printed output. This looks more like an integration/benchmark test than a unit test and may be slow or flaky across OLGA/model changes. Consider marking/guarding it (env var / pytest marker) so CI can run a fast unit-test subset reliably.
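A marker-based alternative sketch (`integration` is a hypothetical marker name that would need registering under `[tool.pytest.ini_options]` `markers`):

```python
import pytest

# Mark every test in this module; CI's fast path can then run
#   pytest -m "not integration"
# while the full suite runs without the filter.
pytestmark = pytest.mark.integration
```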
```python
class TestSummarizeRearrangementsBenchmark:
    N = 100_000
    K = 5
    JUNCTION = "CASSLAPGATNEKLFF"

    @pytest.fixture(scope="class")
    def rearrangements(self):
        return [
            Rearrangement("TRB", i, "TRBV5-1", "TRBC1",
                          self.JUNCTION, 10)
            for i in range(self.N)
        ]

    def test_benchmark_summarize_plain(self, rearrangements):
        summarize_rearrangements(rearrangements[:1000], self.K)
        t0 = time.perf_counter()
        stats = summarize_rearrangements(rearrangements, self.K)
        elapsed = time.perf_counter() - t0
        print(
            f"\nsummarize (plain): {self.N:,} rearrangements, k={self.K} → "
            f"{len(stats):,} unique keys, {elapsed:.3f}s "
            f"({self.N / elapsed:,.0f} rearrangements/s)"
        )

    def test_benchmark_summarize_gapped(self, rearrangements):
        summarize_rearrangements(rearrangements[:1000], self.K, mask_byte=MASK)
        t0 = time.perf_counter()
        stats = summarize_rearrangements(rearrangements, self.K, mask_byte=MASK)
        elapsed = time.perf_counter() - t0
```
This benchmark class (`N = 100_000`) is collected as part of the normal pytest suite and will run in CI by default. It allocates large in-memory structures and can easily dominate test runtime. Consider skipping it behind an env var/marker or relocating benchmarks outside the default `tests/` collection.
```python
class TestSummarizeAnnotationsBenchmark:
    N = 100_000
    K = 5
    JUNCTION = "CASSLAPGATNEKLFF"

    @pytest.fixture(scope="class")
    def rearrangements(self):
        return [
            Rearrangement("TRB", i, "TRBV5-1", "TRBC1",
                          self.JUNCTION, 10)
            for i in range(self.N)
        ]

    def test_benchmark_annotations_plain(self, rearrangements):
        summarize_annotations(rearrangements[:1000], self.K)
        t0 = time.perf_counter()
        ann = summarize_annotations(rearrangements, self.K)
        elapsed = time.perf_counter() - t0
        total_annotations = sum(len(v) for v in ann.values())
        print(
            f"\nsummarize_annotations (plain): {self.N:,} rearrangements, "
            f"k={self.K} → {len(ann):,} KmerSeq keys, "
            f"{total_annotations:,} annotations, "
            f"{elapsed:.3f}s ({self.N / elapsed:,.0f} rearrangements/s)"
        )
```
This benchmark (`N = 100_000`) runs as a standard `test_...` file and will be executed by pytest by default, which is likely too slow/heavy for CI. Consider guarding it with a benchmark marker/env var or moving it to a dedicated benchmark suite.
```python
class TestOlgaKmerSummary:
    """Generate 10,000 human TCR-beta rearrangements via OLGA and validate
    biological expectations on k-mer incidence."""

    N = 10_000
    K = 3

    @pytest.fixture(scope="class")
    def olga_rearrangements(self):
        from mir.basic.pgen import OlgaModel

        model = OlgaModel(chain="TRB")
        seqs = model.generate_sequences_with_meta(self.N, pgens=False)
        return [
            Rearrangement(
                locus="TRB",
                id=i,
                v_gene=rec["v_gene"].split("*")[0],  # strip allele
                c_gene="",
                junction_aa=rec["cdr3"],
                duplicate_count=1,
            )
            for i, rec in enumerate(seqs)
        ]

    @pytest.fixture(scope="class")
    def annotations(self, olga_rearrangements):
        return summarize_annotations(olga_rearrangements, self.K)
```
This OLGA-based suite generates 10,000 sequences and asserts statistical properties of k-mer incidence. Because it lives under `tests/` and uses `test_...` names, it will run by default and may be slow and potentially flaky across OLGA/model/library changes. Consider marking it as integration/slow and skipping it unless explicitly enabled.
Refactor sequence translation, tokenisation, and alignment, and provide optimised C code for these features. I extensively tested wrappers based on `bytes` & stringzilla, but they were 3-4x slower than naive Python string slicing. Also implement k-mer summary tables and refactor the original `kmers.py` code for the contingency-table chi-squared test. Add unit tests, benchmarks & docs.
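For readers unfamiliar with the statistic: for a 2x2 contingency table (k-mer present/absent in case vs background repertoires) the chi-squared test reduces to a closed form. A minimal sketch, without continuity correction; this is not the actual `kmer_stats` API, which may differ:

```python
def chi2_2x2(a: int, b: int, c: int, d: int) -> float:
    """Chi-squared statistic for a 2x2 contingency table.

    a/b: rearrangements with/without the k-mer in the case repertoire;
    c/d: with/without the k-mer in the background repertoire.
    """
    n = a + b + c + d
    den = (a + b) * (c + d) * (a + c) * (b + d)
    if den == 0:
        return 0.0
    return n * (a * d - b * c) ** 2 / den
```

Significance would then come from the chi-squared distribution with one degree of freedom.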