Conversation
---
Will review in more detail later. Need to brush up on the science first; I already forgot most of it. For now, pasting the AI review below in case it's useful. If not, please ignore.
### Problem Statement

### Proposed Changes
### Identified Issues

**H1: The new heuristic measures Ns on the wrong sequence.** The code flow reveals a mismatch between what's measured and what's used.

**Concrete counterexample:**

**With equal depth:**

**Why this matters:**
### Possible Solutions

**S1. Count Ns on aligned intervals (recommended)**

In `assign_anchor_block`:

```rust
fn assign_anchor_block(mergers: &mut [Alignment], graph: &Pangraph) {
  for m in mergers.iter_mut() {
    let ref_block = &graph.blocks[&m.reff.name];
    let qry_block = &graph.blocks[&m.qry.name];
    let anchor = if ref_block.depth() != qry_block.depth() {
      // ... existing depth logic
    } else {
      // Count Ns in aligned intervals only
      let ref_interval_ns = ref_block.consensus()[m.reff.interval.to_range()]
        .iter().filter(|c| c.0 == b'N').count();
      let qry_interval_ns = qry_block.consensus()[m.qry.interval.to_range()]
        .iter().filter(|c| c.0 == b'N').count();
      if ref_interval_ns <= qry_interval_ns {
        AnchorBlock::Ref
      } else {
        AnchorBlock::Qry
      }
    };
    m.anchor_block = Some(anchor);
  }
}
```

**S2. Add a helper method to `PangraphBlock`**

**S3. Regression test**

Test where block A has more total Ns but fewer Ns in the aligned interval. It fails with the current implementation and passes once fixed.

```rust
#[test]
fn test_assign_anchor_block_prefers_fewer_interval_ns() {
  fn new_hit(block_id: usize, start: usize, end: usize) -> Hit {
    Hit { name: BlockId(block_id), length: 15, interval: Interval::new(start, end) }
  }
  fn new_aln(q: usize, q_interval: (usize, usize), r: usize, r_interval: (usize, usize)) -> Alignment {
    Alignment {
      qry: new_hit(q, q_interval.0, q_interval.1),
      reff: new_hit(r, r_interval.0, r_interval.1),
      matches: 0, length: 0, quality: 0, orientation: Forward,
      new_block_id: None, anchor_block: None, cigar: Cigar::default(),
      divergence: None, align: None,
    }
  }
  fn e(nids: &[usize]) -> BTreeMap<NodeId, Edit> {
    nids.iter().map(|nid| (NodeId(*nid), Edit::empty())).collect()
  }
  // Block 1: 10 total Ns, interval [5..9] "ACGT" has 0 Ns
  // Block 2: 1 total N, interval [4..8] "ACNT" has 1 N
  let b1 = PangraphBlock::new(BlockId(1), "NNNNNACGTNNNNN", e(&[1, 2]));
  let b2 = PangraphBlock::new(BlockId(2), "ACGTACNTACGT", e(&[3, 4]));
  let pangraph = Pangraph {
    blocks: btreemap! { BlockId(1) => b1, BlockId(2) => b2 },
    paths: btreemap! {}, nodes: btreemap! {},
  };
  // With interval-based counting: Ref wins (0 < 1)
  // With whole-block counting (bug): Qry wins (1 < 10)
  let mut mergers = vec![new_aln(2, (4, 8), 1, (5, 9))];
  assign_anchor_block(&mut mergers, &pangraph);
  assert_eq!(mergers[0].anchor_block, Some(AnchorBlock::Ref));
}
```

### Summary
---
AI was right, this was indeed an issue, thanks for the review! Should now be addressed!
Extends #176 with a test case that would fail if N counting used whole-block consensus instead of aligned intervals.

Scenario:
- Block 1: 10 total Ns, but the aligned interval has 0 Ns
- Block 2: 1 total N, and the aligned interval has 1 N

With whole-block counting: Block 2 wins (1 < 10).
With interval counting: Block 1 wins (0 < 1).
Extends test/pr-176-interval-regression by refactoring the three separate anchor block test functions into a single `rstest` parameterized test.

Benefits:
- Reduces test code from 148 to 45 lines
- The table format makes test cases easier to compare
- Adding new cases requires only one line

All 4 original test cases are preserved with identical coverage.
Extends refactor/pr-176-parameterized-tests with additional edge cases.

New cases (7 added, 11 total):
- `equal_depth_equal_ns_ref_wins`: tie-breaker when both have the same N count
- `equal_depth_zero_ns_ref_wins`: tie-breaker when neither has Ns
- `equal_depth_many_ns_qry_wins`: qry wins when ref has more Ns
- `ref_deeper_wins`: ref with greater depth wins despite more Ns
- `depth_large_difference`: large depth gap (10 vs 2)
- `interval_at_end`: N at the block's end position
- `single_base_interval`: single-nucleotide alignment
---
Our AI overlords are happy now! My human take:
My current understanding of how it affects production, from a user perspective: the problem was that when input genomes contain stretches of Ns, these could get "stuck" in block consensus sequences, preventing related sequences from being merged and effectively hiding good data. With the fix: less ambiguous consensus sequences, better merging of related regions, fewer blocks trapped by ambiguous data, nicer pangraphs.
---
While going through the code, I thought of some improvements to the tests. I made optional stacked PRs extending on top of this PR, for consideration:
All optional; they can be merged independently, along with or after this PR. To incorporate all 3 optional changes into this PR while preserving a nice linear history:

```shell
git checkout fix/block_consensus
git merge --ff-only test/pr-176-more-cases  # fast-forward the PR branch pointer to the last stacked branch pointer
git push
```

Otherwise, #176 is good to go as is. Feel free to merge. Let me know if I should release on Tuesday-Wednesday or if you want to do it.
- test: add regression test for interval-based N counting
- refactor: consolidate anchor block tests into parameterized test
- test: add more anchor block selection test cases
😆 Thank you! I checked the other PRs and they all make a lot of sense! Merged everything, and I can take care of the release tomorrow/Wednesday. I think your instructions for this were very complete!
### issue with `NNNNN` sequences

For sequences that contain large stretches of `N`s, it can rarely happen that these end up in the consensus of a block and prevent the block from being subsequently aligned and merged.

This can happen for example when merging two blocks of depth one (e.g. two single-genome graphs, which happens at all tips of the guide tree). In this case one of the two sequences is randomly chosen as the consensus sequence, and it could be the one with many `N`s.

This could be corrected later in the `reconsensus` step if the block is aligned and increases in depth, and the `N` stretch is found in only a minority of genomes. However, if the stretch is very long (or the block is split such that the `N` stretch now makes up the majority of the consensus sequence), the block cannot be aligned any further, and the other, non-`N` sequence in the block is "trapped" and masked, even if it could find homology somewhere else.

### the fix

A possible first-order fix, implemented here, is to simply break ties when picking consensus sequences of blocks. Currently the consensus sequence is chosen based on the block depth (this minimizes re-alignments). On top of this, for blocks with the same depth, the sequence with fewer `N`s is now favored.

### a possible further improvement

This fix does not solve cases where `N`s are found in multiple sequences. This could in principle be taken care of in the `reconsensus` step, by accepting observed substitutions `N -> X` to the consensus even if they are present in only a minority of sequences, as long as they remove `N`s from the consensus. I did not implement this for now, since it requires deeper changes and might break other things if not done properly (one needs to be careful that these `N`s are not re-introduced later because they are still majority substitutions).

I'm not sure these edge cases are frequent enough to merit more complicated changes to the code, so for now I only implemented this minimal change.
Thanks to @plaquette for running into this and reporting!