Conversation
---
Will review in more detail later. Need to brush up on the science first; I already forgot most of it. For now, pasting the AI review below in case it's useful. If not, please ignore.
### Problem Statement

### Proposed Changes
### Identified Issues

**H1: The new heuristic measures Ns on the wrong sequence.** The code flow reveals a mismatch between what's measured and what's used.

**Concrete counterexample:**

**With equal depth:**

**Why this matters:**
### Possible Solutions

**S1. Count Ns on aligned intervals (recommended)**

In `assign_anchor_block`:

```rust
fn assign_anchor_block(mergers: &mut [Alignment], graph: &Pangraph) {
  for m in mergers.iter_mut() {
    let ref_block = &graph.blocks[&m.reff.name];
    let qry_block = &graph.blocks[&m.qry.name];
    let anchor = if ref_block.depth() != qry_block.depth() {
      // ... existing depth logic
    } else {
      // Count Ns in aligned intervals only
      let ref_interval_ns = ref_block.consensus()[m.reff.interval.to_range()]
        .iter().filter(|c| c.0 == b'N').count();
      let qry_interval_ns = qry_block.consensus()[m.qry.interval.to_range()]
        .iter().filter(|c| c.0 == b'N').count();
      if ref_interval_ns <= qry_interval_ns {
        AnchorBlock::Ref
      } else {
        AnchorBlock::Qry
      }
    };
    m.anchor_block = Some(anchor);
  }
}
```

**S2. Add a helper method to `PangraphBlock`**

**S3. Regression test**

Test where block A has more total Ns but fewer Ns in the aligned interval. It fails with the current implementation and passes once fixed.

```rust
#[test]
fn test_assign_anchor_block_prefers_fewer_interval_ns() {
  fn new_hit(block_id: usize, start: usize, end: usize) -> Hit {
    Hit { name: BlockId(block_id), length: 15, interval: Interval::new(start, end) }
  }
  fn new_aln(q: usize, q_interval: (usize, usize), r: usize, r_interval: (usize, usize)) -> Alignment {
    Alignment {
      qry: new_hit(q, q_interval.0, q_interval.1),
      reff: new_hit(r, r_interval.0, r_interval.1),
      matches: 0, length: 0, quality: 0, orientation: Forward,
      new_block_id: None, anchor_block: None, cigar: Cigar::default(),
      divergence: None, align: None,
    }
  }
  fn e(nids: &[usize]) -> BTreeMap<NodeId, Edit> {
    nids.iter().map(|nid| (NodeId(*nid), Edit::empty())).collect()
  }
  // Block 1: 10 total Ns, interval [5..9] "ACGT" has 0 Ns
  // Block 2: 1 total N, interval [4..8] "ACNT" has 1 N
  let b1 = PangraphBlock::new(BlockId(1), "NNNNNACGTNNNNN", e(&[1, 2]));
  let b2 = PangraphBlock::new(BlockId(2), "ACGTACNTACGT", e(&[3, 4]));
  let pangraph = Pangraph {
    blocks: btreemap! { BlockId(1) => b1, BlockId(2) => b2 },
    paths: btreemap! {}, nodes: btreemap! {},
  };
  // With interval-based counting: Ref wins (0 < 1)
  // With whole-block counting (bug): Qry wins (1 < 10)
  let mut mergers = vec![new_aln(2, (4, 8), 1, (5, 9))];
  assign_anchor_block(&mut mergers, &pangraph);
  assert_eq!(mergers[0].anchor_block, Some(AnchorBlock::Ref));
}
```

### Summary
---
AI was right, this was indeed an issue, thanks for the review! Should now be addressed!
Extends #176 with a test case that would fail if N counting used whole-block consensus instead of aligned intervals.

Scenario:
- Block 1: 10 total Ns, but the aligned interval has 0 Ns
- Block 2: 1 total N, and the aligned interval has 1 N

With whole-block counting: Block 2 wins (1 < 10).
With interval counting: Block 1 wins (0 < 1).
Extends test/pr-176-interval-regression by refactoring the three separate anchor block test functions into a single `rstest` parameterized test.

Benefits:
- Reduces test code from 148 to 45 lines
- The table format makes test cases easier to compare
- Adding new cases requires only one line

All 4 original test cases are preserved with identical coverage.
Extends refactor/pr-176-parameterized-tests with additional edge cases.

New cases (7 added, 11 total):
- `equal_depth_equal_ns_ref_wins`: tie-breaker when both have the same N count
- `equal_depth_zero_ns_ref_wins`: tie-breaker when neither has Ns
- `equal_depth_many_ns_qry_wins`: qry wins when ref has more Ns
- `ref_deeper_wins`: ref with greater depth wins despite more Ns
- `depth_large_difference`: large depth gap (10 vs 2)
- `interval_at_end`: N at the block's end position
- `single_base_interval`: single-nucleotide alignment
---
Our AI overlords are happy now! My human take:
My current understanding of how it affects production, from a user perspective: the problem was that when input genomes contain stretches of Ns, these could get "stuck" in block consensus sequences, preventing related sequences from being merged and effectively hiding good data. With the fix: less ambiguous consensus sequences, better merging of related regions, fewer blocks trapped by ambiguous data, nicer pangraphs.
---
While going through the code, I thought of some improvements to the tests. I made optional stacked PRs extending on top of this PR, for consideration:
All optional; they can be merged independently, along with or after this PR. To incorporate all 3 optional changes into this PR while preserving a nice linear history:

```shell
git checkout fix/block_consensus
git merge --ff-only test/pr-176-more-cases  # fast-forward the PR branch pointer to the last stacked branch pointer
git push
```

Otherwise, #176 is good to go as is. Feel free to merge. Let me know if I should release on Tuesday-Wednesday or if you want to do it.
- test: add regression test for interval-based N counting
- refactor: consolidate anchor block tests into parameterized test
- test: add more anchor block selection test cases
😆 Thank you! I checked the other PRs and they all make a lot of sense! Merged everything, and I can take care of the release tomorrow/Wednesday. I think your instructions for this were very complete!
### issue with `NNNNN` sequences

For sequences that contain large stretches of `N`s, it can rarely happen that these end up in the consensus of a block and prevent the block from being subsequently aligned and merged.

This can happen for example when merging two blocks of depth one (e.g. two single-genome graphs, which happens at all tips of the guide tree). In this case one of the two sequences is randomly chosen as the consensus sequence, and it could be the one with many `N`s.

This could be corrected later in the `reconsensus` step if the block is aligned and increases in depth, and the `N` stretch is found in only a minority of genomes. However, if the stretch is very long (or the block is split such that the `N` stretch now makes up the majority of the consensus sequence), the block cannot be aligned any further, and the other, non-`N` sequence in the block is "trapped" and masked, even if it could find homology somewhere else.

### the fix

A possible first-order fix, implemented here, is to simply break ties when picking consensus sequences of blocks. Currently the consensus sequence is chosen based on the block depth (this minimizes re-alignments). On top of this, for blocks with the same depth, the sequence with fewer `N`s is now favored.

### a possible further improvement

This fix does not solve cases where `N`s are found in multiple sequences. This could in principle be taken care of in the `reconsensus` step, by accepting observed substitutions `N -> X` to the consensus even if they are present in only a minority of sequences, as long as they remove `N`s from the consensus. I did not implement this for now, since it requires deeper changes and might break other things if not done properly (one needs to be careful that these `N`s are not re-introduced later because they are still majority substitutions).

I'm not sure these edge cases are frequent enough to merit more complicated changes to the code, so for now I only implemented this minimal change.
Thanks to @plaquette for running into this and reporting!