Unlocking asynchronicity in continuous batching #3372
Conversation
pcuenca left a comment
Very nice, did a first pass!
There's a lot of ground to cover and you do a good job at introducing the problem and diving progressively into the details. I got a bit lost during the carry-over discussion (it's easy conceptually but I lost track of the positions lol), but it's looking good!
Happy to take another look later if you want.
| - hub | ||
| - local: continuous_async | ||
| date: April 28, 2026 |
reminder to update before release
| Everyone likes owning their own tools. And with AI becoming the largest productivity multiplier around, everyone wants to own their model. But owning a model is not enough: you also need to run it. | ||
| Most likely, on a modern GPU, which does not run cheap. So when you are renting a GPU to run your own models, you want to use it at the fullest of its capacities. |
Not fully sold on this, perhaps a more direct way to refer to maximizing capacity would work. Will come back to it later.
Probably showcasing hosting big models using Inference Endpoints might be a good motivation to think about using GPUs to the fullest capability.
| The figure above shows how this unfolds. The CPU prepares the batch, then quickly enqueues all the GPU work: the H2D transfer, the forward pass, the D2H transfer, with `record` and `wait` calls inserted between each stage. After that, the CPU is free. The GPU takes over, executing each stream in order as its dependency event is set. Notice the green annotation on the right: once the D2H transfer completes, the CPU comes back and reads the results. This final synchronization is the only point where the CPU blocks in the whole step. To implement it, we record a third event on the D2H stream after the output transfer, then call `d2h_done_event.synchronize()` on the CPU side. `synchronize` blocks the CPU until the D2H stream reaches that marker. | ||
| This is the key difference from synchronous batching. Before, the CPU blocked after every operation. Now it blocks once, at the very end, only to read results it genuinely needs at that moment. Everything else runs in the background. |
Yes, but so far we are effectively using parallel computation to serialize tasks lol. Perhaps we could call this out a bit more explicitly so the reader is prepared to see the magic in a moment.
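For readers who want to see the pattern from the quoted paragraphs in code, here is a minimal PyTorch sketch of one step. `model`, the pinned `cpu_inputs`, and the preallocated pinned `cpu_outputs` buffer are hypothetical stand-ins, not the actual implementation:

```python
import torch

# Minimal sketch of one asynchronous step. `model`, `cpu_inputs` (pinned)
# and `cpu_outputs` (preallocated, pinned) are hypothetical stand-ins.
h2d_stream = torch.cuda.Stream()
compute_stream = torch.cuda.Stream()
d2h_stream = torch.cuda.Stream()

h2d_done = torch.cuda.Event()
compute_done = torch.cuda.Event()
d2h_done = torch.cuda.Event()

# H2D: enqueue the input transfer on its own stream
with torch.cuda.stream(h2d_stream):
    gpu_inputs = cpu_inputs.to("cuda", non_blocking=True)
    h2d_done.record(h2d_stream)

# Compute: wait for the inputs, then enqueue the forward pass
with torch.cuda.stream(compute_stream):
    compute_stream.wait_event(h2d_done)
    gpu_outputs = model(gpu_inputs)
    compute_done.record(compute_stream)

# D2H: wait for the outputs, then enqueue the transfer back to the host
with torch.cuda.stream(d2h_stream):
    d2h_stream.wait_event(compute_done)
    cpu_outputs.copy_(gpu_outputs, non_blocking=True)
    d2h_done.record(d2h_stream)

# ... the CPU is free here, e.g. to prepare the next batch ...

# The only blocking call in the whole step: wait for the D2H marker,
# then read the results on the host.
d2h_done.synchronize()
```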
| The timeline is almost entirely dark green: CPU and GPU running at the same time. The occasional light green slivers are moments where the GPU is active but the CPU has already finished its prep and is waiting. The near-invisible red marks the sync point between batches, where the CPU blocks to sample batch N's outputs. The GPU is active for 99.4% of total runtime, up from 76.0%. Total generation time drops from 300.6s to 234.5s, a 22% speedup. We predicted 24% if CPU overhead were fully eliminated. The small remaining gap is that unavoidable sync point. No new kernels, no model changes: letting the CPU and GPU work at the same time. |
Suggested change:
| The timeline is almost entirely dark green: CPU and GPU running at the same time. The occasional light green slivers are moments where the GPU is active but the CPU has already finished its prep and is waiting. The near-invisible red marks the sync point between batches, where the CPU blocks to sample batch N's outputs. The GPU is active for 99.4% of total runtime, up from 76.0%. Total generation time drops from 300.6s to 234.5s, a 22% speedup. We predicted 24% if CPU overhead were fully eliminated. The small remaining gap is that unavoidable sync point. No new kernels, no model changes: letting the CPU and GPU work at the same time. | |
| The timeline is almost entirely dark green: CPU and GPU running at the same time. The occasional light green slivers are moments where the GPU is active but the CPU has already finished its prep and is waiting. The near-invisible red marks are the sync points between batches, where the CPU blocks to sample batch N's outputs. The GPU is active for 99.4% of total runtime, up from 76.0%. Total generation time drops from 300.6s to 234.5s, a 22% speedup. We predicted 24% if CPU overhead were fully eliminated. The small remaining gap is that unavoidable sync point. No new kernels, no model changes: letting the CPU and GPU work at the same time. |
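(If I am reading the quoted numbers right, the two figures are consistent: at 76.0% GPU utilization over the 300.6 s synchronous run, pure GPU time is roughly 0.76 × 300.6 ≈ 228.5 s, i.e. about a 24% reduction if CPU overhead were fully eliminated; the measured 234.5 s is a 22% reduction, so the per-batch sync point accounts for the remaining ~6 s.)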
| ## Conclusion | ||
| We started with three questions: |
Perhaps we could do a higher-level recap rather than the lower-level problems we solved.
ariG23498 left a comment
I am done reviewing the first half of the blog post. I will be able to get the second half done by early tomorrow.
Initial verdict: I really like how the blog post is paced; while going really deep into technicalities, it never feels overwhelming. The ideas introduced are very advanced, but the illustrations make them reasonable and intuitive. This is a really good job! I got to intuitively understand events and streams for the very first time.
| # Unlocking asynchronicity in continuous batching | ||
| [banner image] |
The thumbnail and the banner images are different. Was that intentional?
Yes! I wanted the thumbnail to be a little snapshot of the banner, so it does not have too much detail. WDYT?
I am absolutely fine with that. I wanted to point it out so that we are sure they are meant to be different! Looks really good.
| This means we need three streams: one for compute, one for CPU-to-GPU transfers, and one for GPU-to-CPU transfers. The transfers are independent, so there is no reason to serialize them, and each gets its own stream. | ||
| A note on nomenclature: when talking about CPUs and GPUs, the convention (used throughout the CUDA documentation) is to call the CPU the **host** and the GPU the **device**. We will use that convention from now on. CPU-to-GPU transfers are called **host-to-device** (H2D) transfers, and GPU-to-CPU transfers are called **device-to-host** (D2H) transfers. Hence, the three streams are the H2D stream, the compute stream, and the D2H stream. |
I think this should be an alert, something that starts with `> [!NOTE]`. What do you think?
| The figure above shows a single event synchronizing two streams. The CPU issues three operations in rapid succession (the three small blocks): launch input preparation on stream 1, record the event on stream 1, then tell stream 2 to wait for it. Then the CPU continues immediately. Stream 1 runs its operation, and when it completes, the event is set. Stream 2 is held at the wait marker the whole time, and only starts compute once the event is marked complete. The CPU was not involved in any of this: the ordering was enforced entirely on the GPU side. | ||
| ### Using events in CB |
I would like this to be fully spelled out in the titles and subtitles; this helps with navigation and also in the TOC.
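To make the single-event handshake from the quoted figure description concrete, here is a minimal sketch; `prepare_inputs_on_gpu` and `run_compute` are hypothetical stand-ins for the real work:

```python
import torch

# Minimal sketch of a single event synchronizing two streams;
# `prepare_inputs_on_gpu` and `run_compute` are hypothetical stand-ins.
stream1 = torch.cuda.Stream()   # input preparation
stream2 = torch.cuda.Stream()   # compute
ready = torch.cuda.Event()

with torch.cuda.stream(stream1):
    inputs = prepare_inputs_on_gpu()   # 1) launch input preparation on stream 1
    ready.record(stream1)              # 2) record the event on stream 1

stream2.wait_event(ready)              # 3) tell stream 2 to wait for the event
with torch.cuda.stream(stream2):
    outputs = run_compute(inputs)      # runs only once the event is set

# The CPU returned immediately after issuing the calls above; the ordering
# between the two streams is enforced entirely on the GPU side.
```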
Co-authored-by: Pedro Cuenca <pedro@huggingface.co>
Co-authored-by: Aritra Roy Gosthipaty <aritra.born2fly@gmail.com>
ariG23498 left a comment
I really like the second half. I should also mention that there were moments where I had to read the paragraphs quite a number of times to understand them completely. This was due to my first-time exposure to a lot of new terms. I predict that this will also be the case for a lot of readers. To counter this, I think we should describe technical terms (buffers, slots, graphs, etc.) better, and also be consistent with our wording.
This is a really well made blog post!
| *TL;DR: we explain how to separate CPU and GPU workloads to get a massive performance boost for inference.* | ||
|
|
||
| *This is the second post in a series on efficient LLM inference. The [first post](https://huggingface.co/blog/continuous_batching) covered continuous batching from first principles. It introduces some concepts we build upon: KV cache, FlashAttention, attention masks, etc.* |
Should we also add a link to this (current) blog post in the previous one (https://huggingface.co/blog/continuous_batching)?
This would help readers who have come across Part 1 find this blog post as a natural succession.
| ## Filling the vacuum | ||
| To prepare batch N+1, we can reuse the same CPU-side objects that prepared batch N. However, we need to pay attention to two things: |
What do we mean by CPU-side objects? Reading a little further I think we mean the Tensors, but is that right? If so, do you think we should specify it here?
| - the device-side input buffers for batch N+1 cannot be the same as batch N's: we would corrupt data the GPU is still reading | ||
| - if a request is in both batch N and N+1, and it produces a new token in the outputs of batch N, that token is needed in the inputs of batch N+1 |
Suggested change:
| - the device-side input buffers for batch N+1 cannot be the same as batch N's: we would corrupt data the GPU is still reading | |
| - if a request is in both batch N and N+1, and it produces a new token in the outputs of batch N, that token is needed in the inputs of batch N+1 | |
| - data corruption: the device-side input buffers for batch N+1 cannot be the same as batch N's: we would corrupt data the GPU is still reading | |
| - data transmission: if a request is in both batch N and N+1, and it produces a new token in the outputs of batch N, that token is needed in the inputs of batch N+1 |
This sets the tone for the problems that we should solve, and also names the problems beforehand.
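To make the first bullet (avoiding data corruption via separate buffers) concrete, here is a hypothetical double-buffering sketch; the names, shapes, and `MAX_BATCH_TOKENS` are illustrative, not the actual implementation:

```python
import torch

# Hypothetical double-buffering sketch: two device-side input buffers,
# alternated between consecutive batches so batch N+1 never overwrites
# data the GPU may still be reading for batch N.
MAX_BATCH_TOKENS = 256  # illustrative fixed capacity

input_buffers = [
    torch.empty(MAX_BATCH_TOKENS, dtype=torch.long, device="cuda"),
    torch.empty(MAX_BATCH_TOKENS, dtype=torch.long, device="cuda"),
]

def stage_batch(step: int, cpu_token_ids: torch.Tensor) -> torch.Tensor:
    """Copy batch `step` into the buffer the GPU is not currently reading.

    `cpu_token_ids` is assumed to be a 1-D pinned-memory tensor so the
    H2D copy can actually run asynchronously.
    """
    buf = input_buffers[step % 2]                 # alternate buffers each batch
    n = cpu_token_ids.numel()
    buf[:n].copy_(cpu_token_ids, non_blocking=True)
    # Carrying over tokens sampled from batch N's outputs into batch N+1's
    # inputs (the second bullet) is handled separately, once the D2H copy
    # of batch N has completed.
    return buf[:n]
```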
This PR adds the "Unlocking asynchronicity in continuous batching" blog post.