fix: treat deletions as undetermined in nearest node distance#1765
Merged
ivan-aksamentov merged 2 commits intomasterfrom Apr 28, 2026
Merged
fix: treat deletions as undetermined in nearest node distance#1765ivan-aksamentov merged 2 commits intomasterfrom
ivan-aksamentov merged 2 commits intomasterfrom
Conversation
The nearest node search distance metric did not account for query deletions. Node mutations at positions within a query deletion were counted as mismatches, biasing placement toward the root for sequences with large internal deletions. Pass deletion ranges as a separate parameter and exclude deleted positions from the distance calculation.
Add test verifying that query deletions covering node mutation positions reduce the distance (undetermined sites excluded). Update existing tests for the new `qry_deletions` parameter.
ivan-aksamentov
commented
Apr 28, 2026
Comment on lines
117
to
127
|
|
||
| // determine the number of sites that are mutated in the node but missing in seq. | ||
| // determine the number of sites that are mutated in the node but missing or deleted in seq. | ||
| // for these we can't tell whether the node agrees with seq | ||
| let mut undetermined_sites = 0_i64; | ||
| for pos in node.tmp.substitutions.keys() { | ||
| if !is_nuc_sequenced(*pos, &masked_qry_missing, aln_range) { | ||
| if !is_nuc_sequenced(*pos, &masked_qry_missing, aln_range) | ||
| || qry_deletions.iter().any(|del| del.range().contains(*pos)) | ||
| { | ||
| undetermined_sites += 1; | ||
| } | ||
| } |
Member
Author
There was a problem hiding this comment.
Dels are new passed to distance metric calculation and counted same as unknowns or unsequenced. But I imagine it can also be treat differently if needed.
ivan-aksamentov
commented
Apr 28, 2026
Comment on lines
+451
to
484
|
|
||
| #[rstest] | ||
| fn deletion_covering_node_mutations_reduces_distance() -> Result<(), Report> { | ||
| let node = node_with_simple_nuc_subs(); | ||
| let qry_nuc_subs: Vec<NucSub> = vec![]; | ||
| let qry_missing: Vec<NucRange> = vec![]; | ||
| let aln_range = NucRefGlobalRange::from_usize(0, 100); | ||
| let masked_ranges = vec![]; | ||
|
|
||
| let no_deletions: Vec<NucDelRange> = vec![]; | ||
| let distance_without_del = tree_calculate_node_distance( | ||
| &node, | ||
| &qry_nuc_subs, | ||
| &qry_missing, | ||
| &no_deletions, | ||
| &aln_range, | ||
| &masked_ranges, | ||
| ); | ||
| assert_eq!(distance_without_del, 5); | ||
|
|
||
| let qry_deletions = vec![NucDelRange::from_usize(10, 25)]; | ||
| let distance_with_del = tree_calculate_node_distance( | ||
| &node, | ||
| &qry_nuc_subs, | ||
| &qry_missing, | ||
| &qry_deletions, | ||
| &aln_range, | ||
| &masked_ranges, | ||
| ); | ||
| assert_eq!(distance_with_del, 2); | ||
|
|
||
| Ok(()) | ||
| } | ||
| } |
Member
Author
There was a problem hiding this comment.
Tests that adding a del reduces distance metric
Member
|
Thanks! this places these sequences as expected and is I think a universal improvement. One could add |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Sequences with large internal deletions are placed near the root of the reference tree instead of near their true closest relatives. The nearest node distance metric counts node mutations at deleted positions as mismatches (the query "has reference" there), making the root -- which has zero mutations -- appear closest. Replacing gaps with N produces correct placement because N positions are already excluded from the distance as undetermined.
This PR adds query deletion ranges as a separate parameter to the distance calculation and counts node mutations at deleted positions as undetermined. The fix is limited to placement; private mutation calling already handles deletions correctly via a separate code path.
Work items
qry_deletions: &[NucDelRange]parameter tograph_find_nearest_nodesandtree_calculate_node_distance&deletionsfromnextclade_run_one