feat(qc): add combined recombination detection strategies B+C+D+F #1742
Open
ivan-aksamentov wants to merge 7 commits into
Closes #1699

Combines four recombination detection strategies:

- B: Spatial uniformity (PR #1737)
- C: Cluster gaps (PR #1738)
- D: Reversion clustering (PR #1739)
- F: Label switching (PR #1741)

Test dataset in this PR: `./data/recomb/enpen/enterovirus/ev-d68/`

Preview: https://nextstrain--nextclade--pr-1742.previews.neherlab.click

Preview with test dataset: https://nextstrain--nextclade--pr-1742.previews.neherlab.click?dataset-url=gh:nextstrain/nextclade@feat/qc-recomb-strategy-combined@/data/recomb/enpen/enterovirus/ev-d68/&input-fasta=example

CLI test:

```
nextclade run \
  --input-dataset data/recomb/enpen/enterovirus/ev-d68/ \
  --output-all output/ \
  data/recomb/enpen/enterovirus/ev-d68/sequences.fasta
```

Note: The current weighted score aggregation (simple sum of strategy scores) is a temporary solution. The scoring mechanism needs further discussion to determine the optimal combination approach.
All four recombinant detection strategies now apply their weight consistently: base_score × weight. Previously clusterGaps was missing this final weight multiplication.

- Add `weight` field to `QcRecombConfigClusterGaps` (default: 50.0)
- Remove unused `QcRecombConfigWeightedThreshold` strategy
- Remove `mutations_threshold` and `weighted_threshold` config fields
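For illustration, the scoring rule described above (each strategy contributes base_score × weight, and the contributions are then simply summed, as noted in the PR description) could be sketched as follows. This is a minimal sketch, not the code in this PR: `StrategyResult`, its fields, and `combined_score` are hypothetical names chosen for the example.

```rust
/// One strategy's contribution to the recombination QC score.
/// Hypothetical type: the name and fields are assumptions made for this
/// sketch and are not taken from the Nextclade code base.
#[derive(Debug, Clone, Copy)]
pub struct StrategyResult {
    /// Unweighted score produced by the strategy.
    pub base_score: f64,
    /// Configured weight for the strategy.
    pub weight: f64,
}

impl StrategyResult {
    /// Weighted score: base_score × weight, applied the same way for every strategy.
    pub fn score(&self) -> f64 {
        self.base_score * self.weight
    }
}

/// Temporary aggregation as described in the PR: a plain sum of the
/// weighted strategy scores.
pub fn combined_score(results: &[StrategyResult]) -> f64 {
    results.iter().map(StrategyResult::score).sum()
}
```

With four results (one each for B, C, D and F), `combined_score` returns the sum of the four weighted scores. Any other combination rule would only need to replace the body of `combined_score`.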
- Use sorted_by_key() instead of sort_by() for float sorting
- Make strategy_label_switching consistent with other strategies
- Simplify score summation with array-based iteration
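As background on the float sorting point: `f64` does not implement `Ord`, so key-based sorting needs a key type with a total order. The sketch below is a standard-library-only analogue using `slice::sort_by_key` and a small wrapper built on `f64::total_cmp`. It is an assumption-laden illustration: `TotalF64` and `sorted_by_score` are hypothetical names, and the `sorted_by_key()` call in the commit presumably comes from an iterator extension with its own key type rather than from this wrapper.

```rust
use std::cmp::Ordering;

/// Hypothetical wrapper that gives f64 a total order (via `f64::total_cmp`)
/// so that it can be used as a sort key.
#[derive(Debug, Clone, Copy)]
struct TotalF64(f64);

impl PartialEq for TotalF64 {
    fn eq(&self, other: &Self) -> bool {
        self.cmp(other) == Ordering::Equal
    }
}

impl Eq for TotalF64 {}

impl PartialOrd for TotalF64 {
    fn partial_cmp(&self, other: &Self) -> Option<Ordering> {
        Some(self.cmp(other))
    }
}

impl Ord for TotalF64 {
    fn cmp(&self, other: &Self) -> Ordering {
        self.0.total_cmp(&other.0)
    }
}

/// Return a copy of `items` sorted by ascending score.
/// The (name, score) item shape is an assumption made for this sketch.
pub fn sorted_by_score(items: &[(String, f64)]) -> Vec<(String, f64)> {
    let mut out = items.to_vec();
    out.sort_by_key(|(_, score)| TotalF64(*score));
    out
}
```

Compared with `sort_by` and a hand-written comparator closure, the key-based form keeps the ordering logic in one place and avoids `partial_cmp(...).unwrap()` at each call site.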
Remove weightGapSize and weightPerGap, leaving only weight (default 1250) to match other strategies. Gap size beyond threshold was negligible and two weights were mathematically redundant for linear scoring.
Show one decimal place for score display, matching CV precision.
The previous implementation counted distinct labels minus one based on centroids. Now counts actual position-based label transitions along the genome, correctly detecting recombination breakpoints.
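The difference between the two counting rules can be illustrated with a simplified sketch. It assumes the input is a list of (genome position, label) pairs, which is an assumption about the data shape and not the PR's actual types; the function names are hypothetical, and the "old" function ignores the centroid step and only shows the distinct-labels-minus-one idea.

```rust
use std::collections::BTreeSet;

/// Simplified version of the previous rule: number of distinct labels minus one.
/// (The centroid computation mentioned in the commit is omitted here.)
pub fn distinct_labels_minus_one(sites: &[(usize, &str)]) -> usize {
    let distinct: BTreeSet<&str> = sites.iter().map(|(_, label)| *label).collect();
    distinct.len().saturating_sub(1)
}

/// Position-based rule: order the labeled sites along the genome and count
/// how many times the label changes between neighbouring sites.
pub fn count_label_transitions(sites: &[(usize, &str)]) -> usize {
    let mut sorted = sites.to_vec();
    sorted.sort_by_key(|(pos, _)| *pos);
    sorted.windows(2).filter(|w| w[0].1 != w[1].1).count()
}
```

For sites labeled A, B, A, B in genome order, the distinct-label rule gives 1, while the transition count gives 3, one per candidate breakpoint. A mosaic that alternates between the same two labels is therefore no longer scored the same as a single switch.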
32bb543 to 530999d
Relevant piece of QC config in pathogen.json: nextclade/data/recomb/enpen/enterovirus/ev-d68/pathogen.json, lines 68 to 94 in 30d267e