
coll: extending circulant graph algorithm#7710

Open
hzhou wants to merge 12 commits into pmodels:main from hzhou:2601_coll_circ


hzhou (Contributor) commented Jan 29, 2026

Pull Request Description

Extend the circulant graph algorithm to reduce and allgather.

  • Reduce is the reverse of the bcast
  • Allgather is an "all-bcast": concurrent bcasts, one rooted at each process.

In this PR:

  • Refactor the bcast_circ_graph algorithm into 3 pieces

    1. The generation of the circulant graph schedules
    2. The queuing and dependency tracking for non-blocking requests from running the schedule
    3. The bcast algorithm itself
  • Piece 2 is the most interesting part of this PR. The goal is to evolve it into a semi-general collective schedule framework that can perform

    1. multi-stage async local staging/packing/unpacking for each send/recv
    2. dependency tracking
    3. concurrency limit control
    4. generalized request abstraction
  • Bcast is the simplest. The recvs have no dependencies. A send may depend on a previous recv of the same block.

  • Allgather multiplies the number of buffers (blocks) by the number of processes, but otherwise it is the same as bcast.

  • Reduce:

    1. The recv has two parts: recv into tmp_buf and reduce into recvbuf. The recv part needs the previous recv to be cleared, while the reduce part needs the previous sends to be cleared.
    2. The send needs the previous recv, including its reduction, to be cleared.
  • Reduce_scatter is an "all-" version of Reduce, just as Allgather is an "all-" version of Bcast. However, since we cannot reduce into sendbuf, we need to create a local temporary copy of sendbuf, which makes the algorithm unappealing memory-wise.
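
The dependency rules listed above for bcast and reduce can be sketched as per-block predicates. This is only an illustration of the stated rules, not the code in this PR: the `block_state` struct, its fields, and the function names are all hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-block bookkeeping; names are illustrative only. */
struct block_state {
    bool recv_pending;    /* a recv of this block is still in flight */
    bool reduce_pending;  /* data sits in tmp_buf, not yet reduced into recvbuf */
    int sends_pending;    /* number of sends of this block still in flight */
};

/* Bcast: a send may depend on the previous recv of the same block. */
static bool bcast_can_send(const struct block_state *b)
{
    assert(b != NULL);
    return !b->recv_pending;
}

/* Reduce, recv part: the previous recv must be cleared. */
static bool reduce_can_recv(const struct block_state *b)
{
    assert(b != NULL);
    return !b->recv_pending;
}

/* Reduce, reduce part: writes recvbuf, so previous sends must be cleared. */
static bool reduce_can_reduce(const struct block_state *b)
{
    assert(b != NULL);
    return b->sends_pending == 0;
}

/* Reduce, send: the previous recv, including its reduction, must be cleared. */
static bool reduce_can_send(const struct block_state *b)
{
    assert(b != NULL);
    return !b->recv_pending && !b->reduce_pending;
}
```

Bcast recvs need no predicate at all, which is why bcast is the simplest case.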

Reference:

NOTES

  • The algorithm performs reductions out of order. This is problematic for floating-point reductions: the results may be nondeterministic from the user's point of view.
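
To see why the order matters, here is a small standalone sketch (not part of the PR) that sums the same three doubles in two different orders. The helper `reduce_in_order` is a hypothetical name used only for this illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Sum vals[] in the order given by order[]; n must be at least 1. */
static double reduce_in_order(const double *vals, const int *order, int n)
{
    assert(vals != NULL && order != NULL && n > 0);
    double acc = vals[order[0]];
    for (int i = 1; i < n; i++)
        acc += vals[order[i]];
    return acc;
}
```

With `vals = {1e20, -1e20, 1.0}`, the order `{0, 1, 2}` gives `1.0`, while the order `{1, 2, 0}` gives `0.0`, because `1.0` is lost when it is added to `-1e20` first. Floating-point addition is not associative, so a schedule that changes the reduction order can change the result.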

[skip warnings]

Author Checklist

  • Provide Description
    Particularly focus on why, not what. Reference background, issues, test failures, xfail entries, etc.
  • Commits Follow Good Practice
    Commits are self-contained and do not do two things at once.
    Commit message is of the form: module: short description
    Commit message explains what's in the commit.
  • Passes All Tests
    Whitespace checker. Warnings test. Additional tests via comments.
  • Contribution Agreement
    For non-Argonne authors, check contribution agreement.
    If necessary, request an explicit comment from your company's PR approval manager.

hzhou force-pushed the 2601_coll_circ branch 8 times, most recently from 35f6086 to a65f7b7 (February 3, 2026 19:40)
hzhou force-pushed the 2601_coll_circ branch 2 times, most recently from 57ced10 to d9e1b20 (February 6, 2026 16:57)
hzhou marked this pull request as ready for review (February 6, 2026 16:57)
hzhou requested review from mjwilkins18 and raffenet (February 9, 2026 22:44)
hzhou force-pushed the 2601_coll_circ branch 12 times, most recently from 79855f1 to 6b1dcc0 (February 27, 2026 15:38)
hzhou force-pushed the 2601_coll_circ branch 3 times, most recently from 24e5b0e to 14d81e1 (March 6, 2026 19:05)
The circulant graph algorithm can be extended to reduce, allgather, and
allreduce. Refactor so we can share the algorithm code.
hzhou added 10 commits March 10, 2026 12:44
Before we extend the circ_graph algorithm to more collectives, e.g.
reduce and allgather, refactor to prepare for the new code.
Remove the extra parameters chunk_size and q_len from the bcast
circ_graph algorithm. Instead, use the global cvars
MPIR_CVAR_CIRC_GRAPH_CHUNK_SIZE and MPIR_CVAR_CIRC_GRAPH_Q_LEN to tune
all circ_graph algorithms. Both chunk_size and q_len have more to do
with the communication latency and bandwidth curves than with
specific collective operations. Removing the extra parameters for now
simplifies the effort to extend the circ_graph algorithm to more
collectives such as reduce and allgather. We can add the parameters back
in the future if they prove necessary.
Instead of just a true/false flag, we can store the actual pending request
index in pending_blocks[] (which replaces can_send[]) to avoid a linear
search every time a send block is pending.
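
A minimal sketch of the idea in this commit message, assuming one in-flight request per block. Only the array name `pending_blocks` comes from the commit message; the helper functions and the `NO_PENDING` sentinel are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define NO_PENDING (-1)   /* hypothetical sentinel: no request in flight */

/* Mark every block as having no pending request. */
static void pending_init(int *pending_blocks, int n_blocks)
{
    assert(pending_blocks != NULL && n_blocks >= 0);
    for (int i = 0; i < n_blocks; i++)
        pending_blocks[i] = NO_PENDING;
}

/* Record which request slot is receiving the given block. */
static void pending_set(int *pending_blocks, int block, int req_idx)
{
    assert(req_idx >= 0);
    pending_blocks[block] = req_idx;
}

/* O(1) lookup: the request index to wait on, or NO_PENDING if the
 * block can be sent right away. A plain boolean flag would only say
 * "pending" and force a search for the matching request. */
static int pending_lookup(const int *pending_blocks, int block)
{
    return pending_blocks[block];
}

/* Clear the entry once the request has completed. */
static void pending_clear(int *pending_blocks, int block)
{
    pending_blocks[block] = NO_PENDING;
}
```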
Handle the non-contig datatype packing and unpacking in
cga_request_queue. This paves the way for later extending
cga_request_queue to be nonblocking and to handle asynchronous GPU
packing/unpacking.

Also move the q_len and chunk_size handling into cga_request_queue.c.
Bcast of zero-sized messages works with the circ_graph algorithm.
If we reverse the circulant graph bcast schedule, we get the reduce
algorithm. We extend the cga_request_queue facility to perform reduction
at the completion of receive requests.

Unlike bcast, which receives each block only once, reduce receives the same
block from multiple processes (and performs a reduction each time), so we need
to check for pending previous receives before issuing new ones.
Allgather is the same as an all-bcast, with every rank acting as a root.
Compared to bcast, the buffers are aggregate buffers for comm_size
processes.
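
One way to picture the aggregate buffer is as comm_size per-rank regions, each split into chunks. The sketch below assumes that layout and is not taken from the PR: the function names and parameters are hypothetical, and the real code may index blocks differently.

```c
#include <assert.h>
#include <stddef.h>

/* Number of chunks needed for one rank's contribution (ceiling division). */
static size_t chunks_per_rank(size_t bytes_per_rank, size_t chunk_size)
{
    assert(chunk_size > 0);
    return (bytes_per_rank + chunk_size - 1) / chunk_size;
}

/* Total number of blocks in the aggregate buffer: one set per root. */
static size_t total_blocks(int comm_size, size_t n_chunks)
{
    assert(comm_size > 0);
    return (size_t) comm_size * n_chunks;
}

/* Byte offset of chunk `chunk` of the bcast rooted at `root`,
 * assuming each rank's data is stored contiguously in rank order. */
static size_t allgather_block_offset(int root, size_t chunk,
                                     size_t bytes_per_rank, size_t chunk_size)
{
    assert(root >= 0 && chunk_size > 0);
    return (size_t) root * bytes_per_rank + chunk * chunk_size;
}
```

For example, with 1000 bytes per rank and a 256-byte chunk size, each rank contributes 4 chunks, and chunk 1 of root 2 starts at byte 2256.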
Different collective types have very different dependency conditions for
issuing sends and recvs. Split them into separate functions rather than
having a big switch within a single function.
In bcast and allgather the dependency tracking is simple, since a recv has
no dependencies and a send depends on at most a single recv.
For reduction, we may have multiple pending sends and a single pending
recv.
It makes more sense to wait-if-full before issuing the send or recv. In
addition, we don't need q_tail since we can directly test the q_head
slot to see if we need to wait.
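
The wait-if-full idea can be sketched with a small ring of request slots. This is an illustration under simplifying assumptions, not the cga_request_queue code: `Q_LEN` stands in for the value of MPIR_CVAR_CIRC_GRAPH_Q_LEN, a boolean stands in for a real request, and `wait_slot` stands in for an actual wait call.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define Q_LEN 4   /* hypothetical stand-in for MPIR_CVAR_CIRC_GRAPH_Q_LEN */

struct req_queue {
    bool busy[Q_LEN];  /* true while the slot holds an in-flight request */
    int q_head;        /* next slot to use; also the oldest slot when full */
    int n_waits;       /* number of forced waits, kept for illustration */
};

static void queue_init(struct req_queue *q)
{
    assert(q != NULL);
    for (int i = 0; i < Q_LEN; i++)
        q->busy[i] = false;
    q->q_head = 0;
    q->n_waits = 0;
}

/* Stand-in for waiting on a real request to complete. */
static void wait_slot(struct req_queue *q, int slot)
{
    q->busy[slot] = false;
    q->n_waits++;
}

/* Issue a new send or recv: wait only if the q_head slot is still busy,
 * then occupy it and advance. No q_tail is needed, because slots are
 * reused round-robin and the q_head slot is always the oldest one. */
static int queue_issue(struct req_queue *q)
{
    assert(q != NULL);
    int slot = q->q_head;
    if (q->busy[slot])
        wait_slot(q, slot);
    q->busy[slot] = true;
    q->q_head = (slot + 1) % Q_LEN;
    return slot;
}
```

Issuing `Q_LEN` requests fills the ring without waiting; the next issue finds the q_head slot busy and waits on it before reusing it.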
Add placeholder code for MPIR_cga_init and MPIR_cga_finalize in
preparation for pre-allocating genq pools for pipelining chunks.
hzhou (Contributor, Author) commented Mar 10, 2026

test:mpich/ch3/most
test:mpich/ch4/most

hzhou (Contributor, Author) commented Mar 10, 2026

test:mpich/ch3/most
test:mpich/ch4/most
