feat(qc): add combined recombination detection strategies B+C+D+F #1742

Open
ivan-aksamentov wants to merge 7 commits into master from feat/qc-recomb-strategy-combined

@ivan-aksamentov
Member

Closes #1699

Combines four recombination detection strategies:

- B: Spatial uniformity (PR #1737)
- C: Cluster gaps (PR #1738)
- D: Reversion clustering (PR #1739)
- F: Label switching (PR #1741)

Test dataset in this PR: `./data/recomb/enpen/enterovirus/ev-d68/`

Preview: https://nextstrain--nextclade--pr-1742.previews.neherlab.click

Preview with test dataset: https://nextstrain--nextclade--pr-1742.previews.neherlab.click?dataset-url=gh:nextstrain/nextclade@feat/qc-recomb-strategy-combined@/data/recomb/enpen/enterovirus/ev-d68/&input-fasta=example

CLI test:

```
nextclade run \
  --input-dataset data/recomb/enpen/enterovirus/ev-d68/ \
  --output-all output/ \
  data/recomb/enpen/enterovirus/ev-d68/sequences.fasta
```

Note: The current weighted score aggregation (a simple sum of strategy scores) is a temporary solution. The scoring mechanism needs further discussion to determine the optimal combination approach.

@github-actions

github-actions Bot commented Jan 20, 2026

All four recombinant detection strategies now apply their weight
consistently: `base_score × weight`. Previously `clusterGaps` was missing
this final weight multiplication.
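
The weighting rule described here, combined with the "simple sum" aggregation from the PR note, can be sketched as follows. This is an illustrative Python sketch, not the Nextclade implementation (which is written in Rust): the `StrategyResult` type, the `combined_score` function, and the base scores are hypothetical. Only the weights are taken from the config posted in this PR.

```python
from dataclasses import dataclass


@dataclass
class StrategyResult:
    """Outcome of one detection strategy (hypothetical type for illustration)."""
    name: str
    enabled: bool
    base_score: float  # unweighted score produced by the strategy
    weight: float      # per-strategy weight from the QC config


def combined_score(results: list[StrategyResult]) -> float:
    """Sum base_score * weight over enabled strategies (the temporary aggregation)."""
    return sum(r.base_score * r.weight for r in results if r.enabled)


# Weights follow the config in this PR; the base scores are made up.
results = [
    StrategyResult("spatialUniformity", True, 0.5, 50.0),
    StrategyResult("clusterGaps", True, 0.02, 1250.0),
    StrategyResult("reversionClustering", False, 1.0, 50.0),  # disabled, so ignored
    StrategyResult("labelSwitching", True, 2.0, 50.0),
]
total = combined_score(results)
print(f"{total:.1f}")  # one decimal place, as in the score display
```

Because every strategy contributes `base_score * weight`, changing one strategy's weight rescales only that strategy's contribution to the sum.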

- Add `weight` field to `QcRecombConfigClusterGaps` (default: 50.0)
- Remove unused `QcRecombConfigWeightedThreshold` strategy
- Remove `mutations_threshold` and `weighted_threshold` config fields
- Use `sorted_by_key()` instead of `sort_by()` for float sorting
- Make `strategy_label_switching` consistent with other strategies
- Simplify score summation with array-based iteration

Remove `weightGapSize` and `weightPerGap`, leaving only `weight` (default 1250)
to match other strategies. Gap size beyond threshold was negligible and
two weights were mathematically redundant for linear scoring.

Show one decimal place for score display, matching CV precision.

The previous implementation counted distinct labels minus one based on
centroids. Now counts actual position-based label transitions along the
genome, correctly detecting recombination breakpoints.
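
The difference between the two counting approaches can be illustrated with a small Python sketch. It is an assumption-laden illustration rather than the PR's Rust code: the function names and the `(position, label)` input shape are hypothetical, and the old centroid-based logic is simplified to "distinct labels minus one".

```python
def count_label_transitions(labeled_positions):
    """Count label changes between consecutive mutations ordered by genome position."""
    ordered = sorted(labeled_positions, key=lambda item: item[0])
    labels = [label for _, label in ordered]
    return sum(1 for prev, curr in zip(labels, labels[1:]) if prev != curr)


def distinct_labels_minus_one(labeled_positions):
    """Simplified stand-in for the previous approach: number of distinct labels minus one."""
    distinct = {label for _, label in labeled_positions}
    return max(len(distinct) - 1, 0)


# Hypothetical labeled mutations: label A, then B, then A again along the genome.
mutations = [
    (120, "A"), (450, "A"),
    (3100, "B"), (3600, "B"),
    (6200, "A"), (6900, "A"),
]
transitions = count_label_transitions(mutations)
old_count = distinct_labels_minus_one(mutations)
print(transitions, old_count)
```

In this A-B-A pattern there are two label changes along the genome but only two distinct labels, so counting distinct labels sees a single switch while counting position-based transitions sees both candidate breakpoints.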

@ivan-aksamentov force-pushed the feat/qc-recomb-strategy-combined branch from 32bb543 to 530999d on January 20, 2026 15:14
@ivan-aksamentov
Member Author

Relevant piece of QC config in `pathogen.json`:

```json
"recombinants": {
  "enabled": true,
  "scoreWeight": 100.0,
  "spatialUniformity": {
    "enabled": true,
    "numSegments": 10,
    "cvThreshold": 1.5,
    "weight": 50.0
  },
  "reversionClustering": {
    "enabled": true,
    "ratioThreshold": 0.3,
    "clusterWindowSize": 500,
    "minClusterSize": 3,
    "weight": 50.0
  },
  "labelSwitching": {
    "enabled": true,
    "minLabels": 2,
    "weight": 50.0
  },
  "clusterGaps": {
    "enabled": true,
    "minGapSize": 1000,
    "weight": 1250.0
  }
},
```
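
To give a feel for what `numSegments` and `cvThreshold` might control, here is a rough Python sketch of a coefficient-of-variation (CV) check over per-segment mutation counts. This is a guess at the general idea, not the PR's actual algorithm: the function names, the 0-based positions, the genome length of 7000, and the use of the population standard deviation are all assumptions made for illustration.

```python
from statistics import mean, pstdev


def segment_counts(positions, genome_length, num_segments):
    """Count mutations falling into each of num_segments equal-width genome bins."""
    counts = [0] * num_segments
    for pos in positions:
        idx = min(pos * num_segments // genome_length, num_segments - 1)
        counts[idx] += 1
    return counts


def coefficient_of_variation(counts):
    """CV = population standard deviation / mean; defined as 0.0 when there are no mutations."""
    avg = mean(counts)
    if avg == 0:
        return 0.0
    return pstdev(counts) / avg


GENOME_LENGTH = 7000  # hypothetical length, for illustration only
NUM_SEGMENTS = 10     # "numSegments" from the config above
CV_THRESHOLD = 1.5    # "cvThreshold" from the config above

# Six mutations packed into the first segment vs. one mutation per segment.
clustered = [100, 150, 200, 250, 300, 350]
uniform = [350 + 700 * i for i in range(NUM_SEGMENTS)]

cv_clustered = coefficient_of_variation(segment_counts(clustered, GENOME_LENGTH, NUM_SEGMENTS))
cv_uniform = coefficient_of_variation(segment_counts(uniform, GENOME_LENGTH, NUM_SEGMENTS))
print(cv_clustered > CV_THRESHOLD, cv_uniform > CV_THRESHOLD)
```

Under these assumptions, mutations concentrated in one region produce a high CV that exceeds the threshold, while evenly spread mutations produce a CV near zero.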

Development

Successfully merging this pull request may close these issues.

QC label for recombinant sequences
