Conversation

hjelmn commented Nov 21, 2025

This commit introduces the multi-eager protocol to ob1. This protocol works by fragmenting a message into multiple eager-sized messages and sending them in parallel to the destination. On the receiver, the first fragment is matched against a posted receive if one exists. If a receive is matched, then each incoming multi-eager packet is copied directly into the user buffer without additional buffering in ob1. Once all fragments have arrived, the receive request is marked complete. If the message is unexpected, it is buffered until all fragments have arrived and then processed as a large eager message.
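
For the matched (expected) case, the receive side amounts to copying each fragment at its offset into the user buffer and counting bytes until the whole message has arrived. A minimal sketch of that bookkeeping, using hypothetical names (multi_eager_recv_t, multi_eager_fragment_arrived) rather than the actual ob1 structures:

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical receiver-side state; illustrative only, not the actual ob1 code. */
typedef struct {
    char  *user_buffer;      /* destination taken from the matched posted receive */
    size_t total_bytes;      /* total payload size announced by the first fragment */
    size_t bytes_received;   /* running count of fragment payload copied so far */
} multi_eager_recv_t;

/* Called for each arriving multi-eager fragment once the first one has matched.
 * Returns nonzero when the last fragment has arrived and the receive request
 * can be marked complete. */
static int multi_eager_fragment_arrived(multi_eager_recv_t *recv, const char *payload,
                                        size_t offset, size_t len)
{
    memcpy(recv->user_buffer + offset, payload, len);  /* copy straight into the user buffer */
    recv->bytes_received += len;
    return recv->bytes_received == recv->total_bytes;
}
```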

Usage of this protocol is disabled by default; it is enabled by setting a BTL's multi_eager_limit larger than its eager_limit. When enabled, ob1 uses the new protocol for messages that are larger than the eager limit but smaller than the multi_eager_limit; above the multi_eager_limit, ob1 switches to a full rendezvous.
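
In other words, the two limits partition the message-size range into three paths. A rough sketch of that selection, with illustrative names (select_protocol, PROTO_*) rather than the real ob1 code:

```c
#include <stddef.h>

typedef enum { PROTO_EAGER, PROTO_MULTI_EAGER, PROTO_RENDEZVOUS } ob1_proto_t;

/* Size-based protocol selection as described above. multi_eager_limit only has
 * an effect when it is larger than eager_limit; otherwise the new protocol
 * stays disabled. */
static ob1_proto_t select_protocol(size_t msg_size, size_t eager_limit,
                                   size_t multi_eager_limit)
{
    if (msg_size <= eager_limit) {
        return PROTO_EAGER;             /* single eager send */
    }
    if (multi_eager_limit > eager_limit && msg_size < multi_eager_limit) {
        return PROTO_MULTI_EAGER;       /* fragment into eager-sized sends */
    }
    return PROTO_RENDEZVOUS;            /* fall back to a full rendezvous */
}
```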

This protocol is inspired by the multiple-send eager protocol used by OpenUCX. Compared with the various rendezvous protocols it can provide lower-latency communication, at the cost of additional resources for in-flight messages, because it does not wait for an ack from the receiver and does not use either RDMA read or RDMA write. The cost is highest for non-contiguous data on the sender and unexpected receives on the receiver.

This commit also re-organizes the match code so that multi-eager can make use of the same code.

Signed-off-by: Nathan Hjelm <[email protected]>
hjelmn force-pushed the pml_ob1_multi_eager_protocol branch from 8061924 to 0c74ef3 on November 21, 2025 20:41
hjelmn requested a review from bosilca on November 24, 2025 17:06

bosilca left a comment

I have multiple comments regarding this issue:

  1. It makes a lot of unwarranted and unnecessary changes to critical and functioning code that are not justified by the addition of a new protocol.
  2. It impacts the matching logic, leading to deprioritizing incoming traffic.
  3. It adds a protocol in which I see little value, because the same outcome can be achieved with just a larger eager message, which would not require all the extra fragmentation on the send path while potentially using the same amount of memory on the receiver.

What exactly is the benefit of this new protocol? Under what conditions exactly? Do you have performance evaluations to show it?

match = match_one(btl, hdr, segments, num_segments, comm_ptr, proc, NULL);
if ((!OMPI_COMM_CHECK_ASSERT_ALLOW_OVERTAKE(comm_ptr) || 0 > hdr->hdr_tag) &&
    (MCA_PML_OB1_HDR_TYPE_MULTI_EAGER != hdr->hdr_common.hdr_type || match)) {
    /* Only increment the expected sequence if this is an internal message or
Member

if it is not an internal message. Typo on grags.

I don't understand the logic of handling the sequence number here. Are you waiting for all multi-eager fragments to arrive before handling them? That sounds so wasteful!

Member Author

Yeah, I'm not happy with the code myself. The logic is: if this is a multi-eager fragment AND it did not match, then do not increment the expected sequence, as that will happen when the whole fragment is available. I can certainly improve the logic to always increment here, but it will take some more refactoring.
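
For what it's worth, the guard in the quoted snippet can be restated as a small predicate; this is a sketch only, with parameter names mapping onto the quoted macros and header fields:

```c
#include <stdbool.h>

/* Increment the expected sequence only when overtaking is not allowed (or the
 * message is internal, i.e. has a negative tag) AND the fragment is not an
 * unmatched multi-eager fragment. */
static inline bool should_increment_expected_seq(bool allow_overtake, bool internal_msg,
                                                 bool is_multi_eager, bool matched)
{
    return (!allow_overtake || internal_msg) && (!is_multi_eager || matched);
}
```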


/* release matching lock before processing fragment */
OB1_MATCHING_UNLOCK(&comm->matching_lock);
/* We matched the frag, Now see if we already have the next sequence in
Member

Here we are in the match callback of an incoming message, potentially blocking the communication engine. Instead of completing the match for the incoming packet, which would guarantee priority for incoming traffic, the new logic stops after the match but before handling the matched fragment and goes on to match and handle an out-of-sequence fragment. This has two issues for me: it allows a period of time in which two fragments are matched but not handled, and it deprioritizes the incoming traffic.

Member Author

Deprioritization should only happen in the unexpected case. I can improve that but it requires more changes that are out of scope until there is an agreement that this proposal should go in (modified of course).

As for multiple matched-but-not-handled frags: that shouldn't be a problem, correct? If we have multiple receives waiting on additional eager fragments, it doesn't violate the standard, nor should it cause issues, because the requests are separate. MPI only requires that we match them in order, not that we complete them in order. This is no different than having multiple matched rendezvous sends.

static inline int mca_pml_ob1_send_helper(mca_pml_ob1_send_request_t *sendreq, mca_bml_base_btl_t *bml_btl,
                                          void *hdr, size_t hdr_size, size_t *size,
                                          mca_btl_base_completion_fn_t comp_fn)
{
    int rc = mca_bml_base_sendi (bml_btl, &sendreq->req_send.req_base.req_convertor, hdr, hdr_size,
Member

We already have a very proper way of handling this. What exactly warrants the addition of yet another intermediary function, when it clearly cannot be used for the multi-eager (because they should not be taking the sendi path)?

Member Author

sendi is available for all fragments, as the size of each falls within the eager limit, and sendi may help with latency because it may avoid the overhead of allocating and tracking a fragment (depending on the btl, of course). This helper was added to make the code easier to follow, since how the fragment gets sent is not relevant to the protocol itself.

The overview of the method:

  1. Attempt sendi for the current fragment of data.
  2. If that fails, attempt to allocate an in-place fragment.
  3. If all else fails, fall back on a fully buffered fragment.

I had originally intended the method to be more generally usable for any send fragment (rendezvous, eager, etc.), but it may not end up being reusable. I still think it is worthwhile for keeping mca_pml_ob1_send_multi_eager_fragment simple and easy to reason about.
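
A minimal sketch of that fallback order, using placeholder names (try_sendi, try_inplace_frag, try_buffered_frag) that stand in for the real BML/BTL calls rather than reproducing the helper itself:

```c
#include <stddef.h>

/* Placeholder return codes and stubs standing in for the real BML/BTL entry points. */
#define SUCCESS          0
#define ERR_NO_RESOURCE (-1)

static int try_sendi(void *frag, size_t len)         { (void)frag; (void)len; return ERR_NO_RESOURCE; }
static int try_inplace_frag(void *frag, size_t len)  { (void)frag; (void)len; return ERR_NO_RESOURCE; }
static int try_buffered_frag(void *frag, size_t len) { (void)frag; (void)len; return SUCCESS; }

/* Tiered send path: try the cheapest option first and fall back step by step. */
static int send_fragment(void *frag, size_t len)
{
    int rc = try_sendi(frag, len);        /* 1. immediate send, no descriptor to allocate or track */
    if (SUCCESS == rc) {
        return rc;
    }
    rc = try_inplace_frag(frag, len);     /* 2. send in place from the user buffer */
    if (SUCCESS == rc) {
        return rc;
    }
    return try_buffered_frag(frag, len);  /* 3. last resort: copy into a send buffer */
}
```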

hjelmn commented Dec 1, 2025

I have multiple comments regarding this issue:

  1. It makes a lot of unwarranted and unnecessary changes to critical and functioning code that are not justified by the addition of a new protocol.
  2. It impacts the matching logic, leading to deprioritizing incoming traffic.
  3. It adds a protocol in which I see little value, because the same outcome can be achieved with just a larger eager message, which would not require all the extra fragmentation on the send path while potentially using the same amount of memory on the receiver.

What exactly is the benefit of this new protocol? Under what conditions exactly? Do you have performance evaluations to show it?

  1. I can certainly undo some of the changes. The changes to the primary send method are intended to make the protocol selection clearer. That can be undone if it is not wanted or broken into a separate change if it makes this PR harder to review.

  2. Yes and no. If the sender does not encounter an out-of-resource issue, then we should expect (due to BTL ordering) all the fragments of the multi-send to arrive before any higher sequence number. If the multi-send arrives and is matched immediately (the non-buffered case), incoming traffic is unblocked and we process unexpected messages just as we would have had it been an eager send. If the message was unexpected, it will indeed block processing of other sends until all of the multi-send fragments have arrived. I can improve this, but it will take some refactoring; I wanted to get this proposal up before investing in that optimization. There is no point in doing it if multi-eager is not something we want to support.

  3. The primary use case for this protocol is btls with limited maximum eager limits. The uct btl limits the eager limit to the maximum send size for the transport. This can be tweaked by changing the ucx configuration (UCX_RC_VERBS_SEG_SIZE, etc.), so there is an argument that we should be doing that instead. UCX chose to keep these sizes small by default and implement a multi-eager protocol in UCP for intermediate messages, so this proposal is about adding that as an option to match UCP's out-of-the-box performance in ob1. I still very much prefer using ob1 rather than having UCX handle most of the MPI semantics.

As for using the same amount of memory on the receiver: I would have expected it to be a wash, but multi-eager seems to help considerably:

PingPong with multi-eager disabled and UCX_RC_VERBS_SEG_SIZE=272144 (using time for rough memory estimation): https://gist.github.com/hjelmn/02a5ea00c2f30081818cddfb47dbac23

Also bumping the btl default eager limit to 256kiB: https://gist.github.com/hjelmn/fa68815d5361a0b32c24fb249b9c8300

Maximum resident set size (kbytes) is about 1.2GB

PingPong with multi-eager used for (8192,272144) and UCX_RC_VERBS_SEG_SIZE=8256 (default): https://gist.github.com/hjelmn/91e22a0e03c65a20eb3082d368ec112c

Maximum resident set size (kbytes) is about 74MB.

I had not measured the memory before but it shows that multi-eager does help with memory usage. This benchmark usually has a preposted receive so it shouldn't be totally surprising. The buffers in UCT (except the short ones) will all be 256kiB in size so there is a lot of waste there. You can see that multi-eager gives similar or better performance than the higher eager limit and matches UCP (https://gist.github.com/hjelmn/f7b931e75b5ac5cf9d06552c0444b319).

Now, I could fragment in the btl itself, but then it could not eagerly put the incoming data into the posted user buffer as the fragments come in. It would always have to wait for all the fragments before calling the match.
