Skip to content

Conversation

@bharathv
Copy link
Contributor

@bharathv bharathv commented Jan 2, 2026

This list is supposed to contain the list of dead nodes that are relevant to a particular partition, instead it just returns the original list.

For example test/0 has replicas [1, 3, 5, 7] and [5, 7, 8] are dead nodes, this method now returns [5, 7] instead of [5, 7, 8] since only the former is the relevant set of replicas for this partition.

Additionally sorts the replica/node lists for easy visual comparison.

Fixes: https://redpandadata.atlassian.net/browse/INC-1044

Backports Required

  • none - not a bug fix
  • none - this is a backport
  • none - issue does not exist in previous branches
  • none - papercut/not impactful enough to backport
  • v25.3.x
  • v25.2.x
  • v25.1.x

Release Notes

  • none

This list is supposed to contain the list of dead nodes that are
relevant to a particular partition, instead it just returns the original
list.

For example test/0 has replicas [1, 3, 5, 7] and [5, 7, 8] are dead
nodes, this method now returns [5, 7] instead of [5, 7, 8] since only
the former is the relevant set of replicas for this partition.

Additionally sorts the replica/node lists for easy visual comparison.
Copilot AI review requested due to automatic review settings January 2, 2026 20:37
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR fixes a bug where the per-partition dead node list was incorrectly returning all dead nodes in the cluster instead of only the dead nodes that are replicas of the specific partition. For example, if partition test/0 has replicas [1, 3, 5, 7] and cluster-wide dead nodes are [5, 7, 8], the method now correctly returns [5, 7] instead of [5, 7, 8].

Key changes:

  • Extract only dead nodes that are actual replicas of the partition being evaluated
  • Sort replica lists for consistent output ordering

@vbotbuildovich
Copy link
Collaborator

Retry command for Build#78463

please wait until all jobs are finished before running the slash command

/ci-repeat 1
skip-redpanda-build
skip-units
skip-rebase
tests/rptest/tests/partition_force_reconfiguration_test.py::NodeWiseRecoveryTest.test_node_wise_recovery@{"dead_node_count":2}

@vbotbuildovich
Copy link
Collaborator

CI test results

test results on build#78463
test_class test_method test_arguments test_kind job_url test_status passed reason test_history
NodeWiseRecoveryTest test_node_wise_recovery {"dead_node_count": 2} integration https://buildkite.com/redpanda/redpanda/builds/78463#019b8083-26a4-4ccc-becb-13b2c6bb182b FAIL 0/11 Test FAILS after retries.Significant increase in flaky rate(baseline=0.0000, p0=0.0000, reject_threshold=0.0100) https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=NodeWiseRecoveryTest&test_method=test_node_wise_recovery
TopicRecoveryTest test_many_partitions {"check_mode": "check_manifest_and_segment_metadata", "cloud_storage_type": 2} integration https://buildkite.com/redpanda/redpanda/builds/78463#019b8083-26a3-4e77-a46a-d1fe571aef7c FLAKY 10/11 Test PASSES after retries.No significant increase in flaky rate(baseline=0.0000, p0=1.0000, reject_threshold=0.0100. adj_baseline=0.1000, p1=0.3487, trust_threshold=0.5000) https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=TopicRecoveryTest&test_method=test_many_partitions
src/v/wasm/tests/wasm_transform_test src/v/wasm/tests/wasm_transform_test unit https://buildkite.com/redpanda/redpanda/builds/78463#019b806e-eaa5-4df1-b0e1-3841d1c9828b FAIL 0/1

@bharathv
Copy link
Contributor Author

bharathv commented Jan 3, 2026

The failures seem related. I think I know why, but I need a bit more time to think it through...please hold off on reviewing until I fix it.

Comment on lines +1397 to +1398
std::ranges::find(dead_nodes, replica.node_id)
!= dead_nodes.end()) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: use std::ranges::contains

const auto& current = assignment.replicas;
auto current = assignment.replicas;
// sort for a consistent final output ordering.
std::ranges::sort(current, [](const auto& a, const auto& b) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit:
can we kick the sort to after the quorum loss determination?

We only need a sorted copy of assignment.replicas if theres quorum loss.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants