pml/ob1: introduce a new protocol for intermediate-sized messages #13527
Conversation
This commit introduces the multi-eager protocol to ob1. This protocol works by fragmenting into multiple eager-sized messages and sending them in parallel to the destination. On the receiver the first fragment is matched against a posted receive if one exists. If a receive is matched then each incoming multi-eager packet is copied directly into the user buffer without additional buffering in ob1. Once all fragments have arrived the receive request is marked complete. If the message is unexpected it is buffered until all fragments have arrived, then processed as a large eager message.

Usage of this protocol is disabled by default and is enabled by setting a BTL's multi_eager_limit larger than its eager_limit. When enabled, ob1 will use the new protocol for messages that are larger than the eager limit but smaller than the multi_eager_limit. Above that limit ob1 will switch to doing a full rendezvous.

This protocol is inspired by the multiple-send eager protocol used by OpenUCX. It can provide lower-latency communication, at the cost of additional resources for in-flight messages vs. the various rendezvous protocols, because it doesn't wait for an ack from the receiver and does not make use of either RDMA read or RDMA write. The cost is highest for non-contiguous data on the sender and unexpected receives on the receiver.

This commit also re-organizes the match code so that multi-eager can make use of the same code.

Signed-off-by: Nathan Hjelm <[email protected]>
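As a rough illustration of the protocol selection and the sender-side fragmentation described above, here is a minimal C sketch. All names here (choose_protocol, send_multi_eager, send_fragment) are invented for the example and do not correspond to the actual ob1 code.

```c
/* Illustrative sketch only -- names are hypothetical, not the real ob1 code. */
#include <stddef.h>

enum proto { PROTO_EAGER, PROTO_MULTI_EAGER, PROTO_RNDV };

/* Pick a protocol for a send of `size` bytes given the BTL limits. */
static enum proto choose_protocol(size_t size, size_t eager_limit,
                                  size_t multi_eager_limit)
{
    if (size <= eager_limit) {
        return PROTO_EAGER;          /* single eager message */
    }
    if (size <= multi_eager_limit) {
        return PROTO_MULTI_EAGER;    /* new protocol: several eager-sized frags */
    }
    return PROTO_RNDV;               /* fall back to a full rendezvous */
}

/* Sender side of multi-eager: split the payload into eager-sized pieces and
 * push them all without waiting for an ack from the receiver. */
static int send_multi_eager(const char *buf, size_t size, size_t eager_limit,
                            int (*send_fragment)(const char *data, size_t len,
                                                 size_t offset))
{
    size_t offset = 0;
    while (offset < size) {
        size_t frag = size - offset;
        if (frag > eager_limit) {
            frag = eager_limit;
        }
        int rc = send_fragment(buf + offset, frag, offset);
        if (rc != 0) {
            return rc;               /* caller decides how to retry or queue */
        }
        offset += frag;
    }
    return 0;
}
```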
8061924 to 0c74ef3
bosilca left a comment
I have multiple comments regarding this issue:
- It makes a lot of unwarranted and unnecessary changes to critical, functioning code that are not justified by the addition of a new protocol.
- It impacts the matching logic, leading it to deprioritize incoming traffic.
- It adds a protocol in which I see little value, because the same outcome can be achieved with just a larger eager message, which would not require all the extra fragmentation on the send path while potentially using the same amount of memory on the receiver.
What exactly is the benefit of this new protocol? Under what conditions? Do you have performance evaluations to show it?
match = match_one(btl, hdr, segments, num_segments, comm_ptr, proc, NULL);
if ((!OMPI_COMM_CHECK_ASSERT_ALLOW_OVERTAKE(comm_ptr) || 0 > hdr->hdr_tag) &&
    (MCA_PML_OB1_HDR_TYPE_MULTI_EAGER != hdr->hdr_common.hdr_type || match)) {
    /* Only increment the expected sequence if this is an internal message or
if it is not an internal message. Typo on "grags".
I don't understand the logic of handling the sequence number here. Are you waiting for all multi-eager fragments to arrive before handling them? That sounds so wasteful!
Yeah, not happy with the code myself. The logic is: if this is a multi-eager fragment AND it did not match, then do not increment the expected sequence, as that will happen once the whole message is available. I can certainly improve the logic to always increment here, but it will take some more refactoring.
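A minimal restatement of that condition in C, with hypothetical names (should_increment_expected_seq is not a real ob1 function), may make the intent clearer:

```c
/* Hypothetical restatement of the condition described above; the real code
 * lives in the ob1 match path and uses different names. */
#include <stdbool.h>

static bool should_increment_expected_seq(bool is_multi_eager, bool matched)
{
    /* A multi-eager fragment that found no posted receive is parked as
     * unexpected; the expected sequence is bumped later, once every fragment
     * of that message has arrived and it is processed as one large eager. */
    if (is_multi_eager && !matched) {
        return false;
    }
    return true;
}
```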
    /* release matching lock before processing fragment */
    OB1_MATCHING_UNLOCK(&comm->matching_lock);
    /* We matched the frag, Now see if we already have the next sequence in
Here we are in the match callback of an incoming message, potentially blocking the communication engine. Instead of completing the match for the incoming packet, which would guarantee priority for incoming traffic, the new logic stops after the match but before handling the matched fragment and goes on to match and handle an out-of-sequence fragment. This has two issues for me: it allows a window of time in which two fragments are matched but not handled, and it deprioritizes the incoming traffic.
Deprioritization should only happen in the unexpected case. I can improve that, but it requires more changes that are out of scope until there is agreement that this proposal should go in (modified, of course).
As for multiple matched but not handled frags: that shouldn't be a problem, correct? If we have multiple receives waiting on additional eager fragments, it doesn't violate the standard, nor should it cause issues, because the requests are separate. MPI only requires that we match them in order, not that we complete them in order. This is no different than having multiple matched rendezvous sends.
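To illustrate the "matched in order, completed in any order" point, here is a small standalone example (not part of the patch; run with at least two ranks). The receiver's first-posted receive matches the larger message, but the smaller second message may well finish first:

```c
/* Two receives from the same source and tag must be matched in posting
 * order, but completion order is not fixed by the standard. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    enum { BIG = 1 << 20, SMALL = 16 };
    static char big[BIG], little[SMALL];
    MPI_Request req[2];

    if (rank == 0) {
        MPI_Isend(big, BIG, MPI_CHAR, 1, 42, MPI_COMM_WORLD, &req[0]);
        MPI_Isend(little, SMALL, MPI_CHAR, 1, 42, MPI_COMM_WORLD, &req[1]);
        MPI_Waitall(2, req, MPI_STATUSES_IGNORE);
    } else if (rank == 1) {
        /* Matching order follows posting order: req[0] gets the big message,
         * req[1] the small one.  Completion order is not constrained. */
        MPI_Irecv(big, BIG, MPI_CHAR, 0, 42, MPI_COMM_WORLD, &req[0]);
        MPI_Irecv(little, SMALL, MPI_CHAR, 0, 42, MPI_COMM_WORLD, &req[1]);
        int idx;
        MPI_Waitany(2, req, &idx, MPI_STATUS_IGNORE);
        printf("request %d completed first\n", idx);  /* may well be 1 */
        MPI_Waitall(2, req, MPI_STATUSES_IGNORE);
    }

    MPI_Finalize();
    return 0;
}
```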
static inline int mca_pml_ob1_send_helper(mca_pml_ob1_send_request_t *sendreq, mca_bml_base_btl_t *bml_btl, void *hdr, size_t hdr_size, size_t *size,
                                          mca_btl_base_completion_fn_t comp_fn)
{
    int rc = mca_bml_base_sendi (bml_btl, &sendreq->req_send.req_base.req_convertor, hdr, hdr_size,
We already have a very proper way of handling this. What exactly warrants the addition of yet another intermediary function, since clearly it cannot be used for the multi-eager case (because those fragments should not be taking the sendi path)?
sendi is available for all fragments, since the size of each falls within the eager limit, and sendi may help with latency because it may avoid the overhead of allocating and tracking a fragment (depends on the BTL, of course). This helper was added to make it easier to follow the code, since how the fragment gets sent is not relevant to the protocol itself.
An overview of the method (a rough sketch follows the list):
- Attempt sendi for the current fragment of data.
- If that fails, attempt to allocate an in-place fragment.
- If all else fails, fall back on a fully buffered fragment.
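A rough sketch of that fallback cascade, assuming hypothetical per-BTL send operations (the real helper is mca_pml_ob1_send_helper() and operates on BML/BTL structures, not on the struct invented below):

```c
/* Illustrative only -- the function-pointer struct and names are invented. */
#include <stddef.h>

typedef enum { SEND_OK, SEND_RETRY, SEND_ERROR } send_status_t;

/* Hypothetical per-BTL send operations. */
typedef struct {
    send_status_t (*sendi)(const void *hdr, size_t hdr_size,
                           const void *data, size_t size);
    send_status_t (*send_in_place)(const void *hdr, size_t hdr_size,
                                   const void *data, size_t size);
    send_status_t (*send_buffered)(const void *hdr, size_t hdr_size,
                                   const void *data, size_t size);
} btl_ops_t;

static send_status_t send_one_fragment(const btl_ops_t *btl,
                                       const void *hdr, size_t hdr_size,
                                       const void *data, size_t size)
{
    /* 1. Try the immediate-send (sendi) path: each multi-eager fragment is
     *    within the eager limit, so this may avoid allocating and tracking a
     *    fragment at all. */
    send_status_t rc = btl->sendi(hdr, hdr_size, data, size);
    if (rc == SEND_OK) {
        return rc;
    }

    /* 2. Otherwise try an in-place fragment that sends directly from the
     *    user buffer. */
    rc = btl->send_in_place(hdr, hdr_size, data, size);
    if (rc == SEND_OK) {
        return rc;
    }

    /* 3. Last resort: a fully buffered fragment (copy the data into a
     *    BTL-owned buffer). */
    return btl->send_buffered(hdr, hdr_size, data, size);
}
```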
I had originally intended the method to be more generally useful for any send fragment (rendezvous, eager, etc.), but it may not end up being reusable. I still think it is worthwhile for keeping mca_pml_ob1_send_multi_eager_fragment simple and easy to reason about.
As for using the same amount of memory on the receiver: I would have expected it to be a wash, but multi-eager seems to help considerably.

PingPong with multi-eager disabled and UCX_RC_VERBS_SEG_SIZE=272144 (using time for a rough memory estimation): https://gist.github.com/hjelmn/02a5ea00c2f30081818cddfb47dbac23. Also bumping the btl default eager limit to 256kiB: https://gist.github.com/hjelmn/fa68815d5361a0b32c24fb249b9c8300. Maximum resident set size (kbytes) is about 1.2GB.

PingPong with multi-eager used for (8192,272144) and UCX_RC_VERBS_SEG_SIZE=8256 (default): https://gist.github.com/hjelmn/91e22a0e03c65a20eb3082d368ec112c. Maximum resident set size (kbytes) is about 74MB.

I had not measured the memory before, but this shows that multi-eager does help with memory usage. This benchmark usually has a preposted receive, so it shouldn't be totally surprising. The buffers in UCT (except the short ones) will all be 256kiB in size, so there is a lot of waste there. You can see that multi-eager gives similar or better performance than the higher eager limit and matches UCP (https://gist.github.com/hjelmn/f7b931e75b5ac5cf9d06552c0444b319).

Now, I could fragment in the btl itself, but then it can't eagerly put the incoming data into the posted user buffer as the fragments come in. It would always have to wait for all the fragments before calling the match.