Skip to content

fix: treat deletions as undetermined in nearest node distance#1765

Merged
ivan-aksamentov merged 2 commits intomasterfrom
fix/nearest-node-deletion-distance
Apr 28, 2026
Merged

fix: treat deletions as undetermined in nearest node distance#1765
ivan-aksamentov merged 2 commits intomasterfrom
fix/nearest-node-deletion-distance

Conversation

@ivan-aksamentov
Copy link
Copy Markdown
Member

Sequences with large internal deletions are placed near the root of the reference tree instead of near their true closest relatives. The nearest node distance metric counts node mutations at deleted positions as mismatches (the query "has reference" there), making the root -- which has zero mutations -- appear closest. Replacing gaps with N produces correct placement because N positions are already excluded from the distance as undetermined.

This PR adds query deletion ranges as a separate parameter to the distance calculation and counts node mutations at deleted positions as undetermined. The fix is limited to placement; private mutation calling already handles deletions correctly via a separate code path.

Work items

The nearest node search distance metric did not account for query deletions. Node mutations at positions within a query deletion were counted as mismatches, biasing placement toward the root for sequences with large internal deletions. Pass deletion ranges as a separate parameter and exclude deleted positions from the distance calculation.
Add test verifying that query deletions covering node mutation positions reduce the distance (undetermined sites excluded). Update existing tests for the new `qry_deletions` parameter.
Comment on lines 117 to 127

// determine the number of sites that are mutated in the node but missing in seq.
// determine the number of sites that are mutated in the node but missing or deleted in seq.
// for these we can't tell whether the node agrees with seq
let mut undetermined_sites = 0_i64;
for pos in node.tmp.substitutions.keys() {
if !is_nuc_sequenced(*pos, &masked_qry_missing, aln_range) {
if !is_nuc_sequenced(*pos, &masked_qry_missing, aln_range)
|| qry_deletions.iter().any(|del| del.range().contains(*pos))
{
undetermined_sites += 1;
}
}
Copy link
Copy Markdown
Member Author

@ivan-aksamentov ivan-aksamentov Apr 28, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Dels are new passed to distance metric calculation and counted same as unknowns or unsequenced. But I imagine it can also be treat differently if needed.

Comment on lines +451 to 484

#[rstest]
fn deletion_covering_node_mutations_reduces_distance() -> Result<(), Report> {
let node = node_with_simple_nuc_subs();
let qry_nuc_subs: Vec<NucSub> = vec![];
let qry_missing: Vec<NucRange> = vec![];
let aln_range = NucRefGlobalRange::from_usize(0, 100);
let masked_ranges = vec![];

let no_deletions: Vec<NucDelRange> = vec![];
let distance_without_del = tree_calculate_node_distance(
&node,
&qry_nuc_subs,
&qry_missing,
&no_deletions,
&aln_range,
&masked_ranges,
);
assert_eq!(distance_without_del, 5);

let qry_deletions = vec![NucDelRange::from_usize(10, 25)];
let distance_with_del = tree_calculate_node_distance(
&node,
&qry_nuc_subs,
&qry_missing,
&qry_deletions,
&aln_range,
&masked_ranges,
);
assert_eq!(distance_with_del, 2);

Ok(())
}
}
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tests that adding a del reduces distance metric

@github-actions
Copy link
Copy Markdown

@rneher
Copy link
Copy Markdown
Member

rneher commented Apr 28, 2026

Thanks! this places these sequences as expected and is I think a universal improvement. One could add +1 to the distance for every deletion that is not in the attachment node, but deletions are represented differently on the auspice tree nodes and ignoring distance by deletion is in line with what phylo tools generally do... So it is probably not worth the trouble.

@ivan-aksamentov ivan-aksamentov merged commit 3004cda into master Apr 28, 2026
20 checks passed
@ivan-aksamentov ivan-aksamentov deleted the fix/nearest-node-deletion-distance branch April 28, 2026 09:25
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants