Unlocking asynchronicity in continuous batching #3372
Conversation
pcuenca left a comment
Very nice, did a first pass!
There's a lot of ground to cover and you do a good job at introducing the problem and diving progressively into the details. I got a bit lost during the carry-over discussion (it's easy conceptually but I lost track of the positions lol), but it's looking good!
Happy to take another look later if you want.
| - hub | ||
| - local: continuous_async | ||
| date: April 28, 2026 |
reminder to update before release
| Everyone likes owning their own tools. And with AI becoming the largest productivity multiplier around, everyone wants to own their model. But owning a model is not enough: you also need to run it. | ||
| Most likely, on a modern GPU, which does not run cheap. So when you are renting a GPU to run your own models, you want to use it at the fullest of its capacities. |
Not fully sold on this, perhaps a more direct way to refer to maximizing capacity would work. Will come back to it later.
Probably showcasing hosting big models using Inference Endpoints might be a good motivation to think about using GPUs to the fullest capability.
| The figure above shows how this unfolds. The CPU prepares the batch, then quickly enqueues all the GPU work: the H2D transfer, the forward pass, the D2H transfer, with `record` and `wait` calls inserted between each stage. After that, the CPU is free. The GPU takes over, executing each stream in order as its dependency event is set. Notice the green annotation on the right: once the D2H transfer completes, the CPU comes back and reads the results. This final synchronization is the only point where the CPU blocks in the whole step. To implement it, we record a third event on the D2H stream after the output transfer, then call `d2h_done_event.synchronize()` on the CPU side. `synchronize` blocks the CPU until the D2H stream reaches that marker. | ||
| This is the key difference from synchronous batching. Before, the CPU blocked after every operation. Now it blocks once, at the very end, only to read results it genuinely needs at that moment. Everything else runs in the background. |
Yes, but so far we are effectively using parallel computation to serialize tasks lol. Perhaps we could call this out a bit more explicitly so the reader is prepared to see the magic in a moment.
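For readers who want to see the pattern from the quoted paragraphs in code, here is a minimal PyTorch sketch of one step. `model`, the pinned `cpu_inputs`, and the preallocated pinned `cpu_outputs` buffer are hypothetical stand-ins, not the actual implementation:

```python
import torch

# Minimal sketch of one asynchronous step. `model`, `cpu_inputs` (pinned)
# and `cpu_outputs` (preallocated, pinned) are hypothetical stand-ins.
h2d_stream = torch.cuda.Stream()
compute_stream = torch.cuda.Stream()
d2h_stream = torch.cuda.Stream()

h2d_done = torch.cuda.Event()
compute_done = torch.cuda.Event()
d2h_done = torch.cuda.Event()

# H2D: enqueue the input transfer on its own stream
with torch.cuda.stream(h2d_stream):
    gpu_inputs = cpu_inputs.to("cuda", non_blocking=True)
    h2d_done.record(h2d_stream)

# Compute: wait for the inputs, then enqueue the forward pass
with torch.cuda.stream(compute_stream):
    compute_stream.wait_event(h2d_done)
    gpu_outputs = model(gpu_inputs)
    compute_done.record(compute_stream)

# D2H: wait for the outputs, then enqueue the transfer back to the host
with torch.cuda.stream(d2h_stream):
    d2h_stream.wait_event(compute_done)
    cpu_outputs.copy_(gpu_outputs, non_blocking=True)
    d2h_done.record(d2h_stream)

# ... the CPU is free here, e.g. to prepare the next batch ...

# The only blocking call in the whole step: wait for the D2H marker,
# then read the results on the host.
d2h_done.synchronize()
```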
| The timeline is almost entirely dark green: CPU and GPU running at the same time. The occasional light green slivers are moments where the GPU is active but the CPU has already finished its prep and is waiting. The near-invisible red marks the sync point between batches, where the CPU blocks to sample batch N's outputs. The GPU is active for 99.4% of total runtime, up from 76.0%. Total generation time drops from 300.6s to 234.5s, a 22% speedup. We predicted 24% if CPU overhead were fully eliminated. The small remaining gap is that unavoidable sync point. No new kernels, no model changes: letting the CPU and GPU work at the same time. |
Suggested change:
| The timeline is almost entirely dark green: CPU and GPU running at the same time. The occasional light green slivers are moments where the GPU is active but the CPU has already finished its prep and is waiting. The near-invisible red marks the sync point between batches, where the CPU blocks to sample batch N's outputs. The GPU is active for 99.4% of total runtime, up from 76.0%. Total generation time drops from 300.6s to 234.5s, a 22% speedup. We predicted 24% if CPU overhead were fully eliminated. The small remaining gap is that unavoidable sync point. No new kernels, no model changes: letting the CPU and GPU work at the same time. | |
| The timeline is almost entirely dark green: CPU and GPU running at the same time. The occasional light green slivers are moments where the GPU is active but the CPU has already finished its prep and is waiting. The near-invisible red marks are the sync points between batches, where the CPU blocks to sample batch N's outputs. The GPU is active for 99.4% of total runtime, up from 76.0%. Total generation time drops from 300.6s to 234.5s, a 22% speedup. We predicted 24% if CPU overhead were fully eliminated. The small remaining gap is that unavoidable sync point. No new kernels, no model changes: letting the CPU and GPU work at the same time. |
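(If I am reading the quoted numbers right, the two figures are consistent: at 76.0% GPU utilization over the 300.6 s synchronous run, pure GPU time is roughly 0.76 × 300.6 ≈ 228.5 s, i.e. about a 24% reduction if CPU overhead were fully eliminated; the measured 234.5 s is a 22% reduction, so the per-batch sync point accounts for the remaining ~6 s.)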
| ## Conclusion | ||
| We started with three questions: |
Perhaps we could do a higher-level recap rather than the lower-level problems we solved.
ariG23498 left a comment
I am done reviewing the first half of the blog post. I will be able to get the second half done by early tomorrow.
Initial verdict: I really like how the blog post is paced; while going really deep into technicalities, it never feels overwhelming. The ideas introduced are very advanced, but the illustrations make them reasonable and intuitive. This is a really good job! I got to intuitively understand events and streams for the very first time.
| # Unlocking asynchronicity in continuous batching | ||
| [banner image] |
The thumbnail and the banner images are different. Was that intentional?
Yes! I wanted the thumbnail to be a little snapshot of the banner, so it does not have too much detail. WDYT?
I am absolutely fine with that. I wanted to point it out so that we are sure they are meant to be different! Looks really good.
| This means we need three streams: one for compute, one for CPU-to-GPU transfers, and one for GPU-to-CPU transfers. The transfers are independent, so there is no reason to serialize them, and each gets its own stream. | ||
| A note on nomenclature: when talking about CPUs and GPUs, the convention (used throughout the CUDA documentation) is to call the CPU the **host** and the GPU the **device**. We will use that convention from now on. CPU-to-GPU transfers are called **host-to-device** (H2D) transfers, and GPU-to-CPU transfers are called **device-to-host** (D2H) transfers. Hence, the three streams are the H2D stream, the compute stream, and the D2H stream. |
I think this should be an alert, something that starts with `> [!NOTE]`. What do you think?
| The figure above shows a single event synchronizing two streams. The CPU issues three operations in rapid succession (the three small blocks): launch input preparation on stream 1, record the event on stream 1, then tell stream 2 to wait for it. Then the CPU continues immediately. Stream 1 runs its operation, and when it completes, the event is set. Stream 2 is held at the wait marker the whole time, and only starts compute once the event is marked complete. The CPU was not involved in any of this: the ordering was enforced entirely on the GPU side. | ||
| ### Using events in CB |
I would like this to be fully spelled out in the titles and subtitles; this helps with navigation and also in the TOC.
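To make the single-event handshake from the quoted figure description concrete, here is a minimal sketch; `prepare_inputs_on_gpu` and `run_compute` are hypothetical stand-ins for the real work:

```python
import torch

# Minimal sketch of a single event synchronizing two streams;
# `prepare_inputs_on_gpu` and `run_compute` are hypothetical stand-ins.
stream1 = torch.cuda.Stream()   # input preparation
stream2 = torch.cuda.Stream()   # compute
ready = torch.cuda.Event()

with torch.cuda.stream(stream1):
    inputs = prepare_inputs_on_gpu()   # 1) launch input preparation on stream 1
    ready.record(stream1)              # 2) record the event on stream 1

stream2.wait_event(ready)              # 3) tell stream 2 to wait for the event
with torch.cuda.stream(stream2):
    outputs = run_compute(inputs)      # runs only once the event is set

# The CPU returned immediately after issuing the calls above; the ordering
# between the two streams is enforced entirely on the GPU side.
```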
Co-authored-by: Pedro Cuenca <pedro@huggingface.co>
Co-authored-by: Aritra Roy Gosthipaty <aritra.born2fly@gmail.com>
ariG23498 left a comment
I really like the second half. I should also mention that there were moments where I had to read the paragraphs quite a number of times to understand them completely. This was due to my first-time exposure to a lot of new terms. I predict that this will also be the case for a lot of readers. To counter this, I think we should describe technical terms (buffers, slots, graphs, etc.) better, and also be consistent with our wording.
This is a really well made blog post!
| *TL;DR: we explain how to separate CPU and GPU workloads to get a massive performance boost for inference.* | ||
|
|
||
| *This is the second post in a series on efficient LLM inference. The [first post](https://huggingface.co/blog/continuous_batching) covered continuous batching from first principles. It introduces some concepts we build upon: KV cache, FlashAttention, attention masks, etc.* |
Should we also add a link to this (current) blog post in the previous one (https://huggingface.co/blog/continuous_batching)?
This would help readers who have come across Part 1 find this blog post as a natural succession.
| ## Filling the vacuum | ||
| To prepare batch N+1, we can reuse the same CPU-side objects that prepared batch N. However, we need to pay attention to two things: |
What do we mean by CPU-side objects? Reading a little further I think we mean the Tensors, but is that right? If so, do you think we should specify it here?
| - the device-side input buffers for batch N+1 cannot be the same as batch N's: we would corrupt data the GPU is still reading | ||
| - if a request is in both batch N and N+1, and it produces a new token in the outputs of batch N, that token is needed in the inputs of batch N+1 |
Suggested change:
| - the device-side input buffers for batch N+1 cannot be the same as batch N's: we would corrupt data the GPU is still reading | |
| - if a request is in both batch N and N+1, and it produces a new token in the outputs of batch N, that token is needed in the inputs of batch N+1 | |
| - data corruption: the device-side input buffers for batch N+1 cannot be the same as batch N's: we would corrupt data the GPU is still reading | |
| - data transmission: if a request is in both batch N and N+1, and it produces a new token in the outputs of batch N, that token is needed in the inputs of batch N+1 |
This sets the tone for the problems that we should solve, and also names the problems beforehand.
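To make the first bullet (avoiding data corruption via separate buffers) concrete, here is a hypothetical double-buffering sketch; the names, shapes, and `MAX_BATCH_TOKENS` are illustrative, not the actual implementation:

```python
import torch

# Hypothetical double-buffering sketch: two device-side input buffers,
# alternated between consecutive batches so batch N+1 never overwrites
# data the GPU may still be reading for batch N.
MAX_BATCH_TOKENS = 256  # illustrative fixed capacity

input_buffers = [
    torch.empty(MAX_BATCH_TOKENS, dtype=torch.long, device="cuda"),
    torch.empty(MAX_BATCH_TOKENS, dtype=torch.long, device="cuda"),
]

def stage_batch(step: int, cpu_token_ids: torch.Tensor) -> torch.Tensor:
    """Copy batch `step` into the buffer the GPU is not currently reading.

    `cpu_token_ids` is assumed to be a 1-D pinned-memory tensor so the
    H2D copy can actually run asynchronously.
    """
    buf = input_buffers[step % 2]                 # alternate buffers each batch
    n = cpu_token_ids.numel()
    buf[:n].copy_(cpu_token_ids, non_blocking=True)
    # Carrying over tokens sampled from batch N's outputs into batch N+1's
    # inputs (the second bullet) is handled separately, once the D2H copy
    # of batch N has completed.
    return buf[:n]
```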
This PR adds the "Unlocking asynchronicity in continuous batching" blog post.