[examples] Benchmark plain and graph compilation #3791

Open: kddubey wants to merge 26 commits into huggingface:main from kddubey:kddubey/examples/compilation

@kddubey kddubey commented Jun 2, 2026

Hello!

This PR:

  1. Adds an example of bucket-based CUDA graph compilation and benchmarks it against no compilation and "plain" compilation: model[0].compile(dynamic=True). Graph compilation achieves a modest speedup by eliminating Python overhead b/t torch and CUDA. Sharing this code as a dependency-free speedup for latency-sensitive services running standard sentence transformer models on CUDA for shorter sequences.
  2. Adds a tip to use model[0].compile(dynamic=True) w/ a link to the benchmark. The current documented method, model.compile(), is a no-op for inference.
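To illustrate the bucket idea in item 1: a captured CUDA graph replays with fixed tensor shapes, so a bucket-based scheme pads each input up to one of a small set of sequence lengths and keeps one graph per length. The sketch below shows only the bucket selection and padding step in plain Python. It is not the code from this PR; `BUCKETS`, `pick_bucket`, `pad_to_bucket`, and the `pad_id` default are hypothetical names and values chosen for illustration.

```python
from bisect import bisect_left

# Hypothetical bucket lengths; a real setup would tune these to its traffic.
BUCKETS = (16, 32, 64, 128, 256)


def pick_bucket(seq_len, buckets=BUCKETS):
    """Return the smallest bucket >= seq_len, or None if the input is too long."""
    i = bisect_left(buckets, seq_len)
    return buckets[i] if i < len(buckets) else None


def pad_to_bucket(token_ids, pad_id=0, buckets=BUCKETS):
    """Pad token_ids to its bucket length.

    Returns (padded_ids, bucket). When no bucket fits, returns the input
    unchanged with bucket=None so the caller can fall back to the
    uncompiled model.
    """
    bucket = pick_bucket(len(token_ids), buckets)
    if bucket is None:
        return list(token_ids), None
    padding = [pad_id] * (bucket - len(token_ids))
    return list(token_ids) + padding, bucket
```

Fewer buckets mean fewer graphs to capture at startup but more wasted padding per call, which is one reason the approach suits shorter sequences.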

I'm contributing graph compilation as an example rather than library code. Reasons to be cautious:

  • gte-modernbert-base had numerical drift when graph-compiled
  • The code assumes a pretty standard tokenizer. compiled.SentenceTransformer overrides the tokenizer, whose implementation can change a bit, e.g., v5.3 changed tokenize -> preprocess
  • Compilation can add a lot of startup time.

Another thing I'm less sure about: deploying the graph-compiled model to prod caused an error where concurrent GPU usage triggered an opaque device-side assert. Our deployment has many replicas and health checks to self-heal, so this error wasn't a big deal. But it's not obvious how to root-cause an error from a graph-compiled call, even w/ CUDA_LAUNCH_BLOCKING=1. I needed to put a lock on GPU model calls to fix it. Compilation has now been running in prod for a few weeks w/o error and w/ 3x lower latency.
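A minimal sketch of the "lock on GPU model calls" mentioned above, assuming the service calls the model from several threads in one process. `LockedModel` is a hypothetical wrapper, not part of this PR or of sentence-transformers; it only assumes the wrapped object exposes an `encode` method.

```python
import threading


class LockedModel:
    """Serialize calls to a model so only one thread runs it at a time.

    Hypothetical helper for illustration: `model` is any object with an
    `encode` method, e.g. a graph-compiled model instance.
    """

    def __init__(self, model):
        self._model = model
        self._lock = threading.Lock()

    def encode(self, *args, **kwargs):
        # Threads queue here, so the underlying model never sees
        # overlapping calls from this process.
        with self._lock:
            return self._model.encode(*args, **kwargs)
```

The lock trades throughput for safety within a single process, so scaling out would come from additional replicas rather than additional threads.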

Disclaimer: a handful of trivial lines in compiled.py and benchmark.py were written by AI, and some of the tests were implemented by AI.

@kddubey marked this pull request as ready for review June 2, 2026 05:39
@kddubey changed the title from "[examples] Bucket-based compilation" to "[examples] Benchmark plain and bucket-based compilation" Jun 2, 2026
@kddubey changed the title from "[examples] Benchmark plain and bucket-based compilation" to "[examples] Benchmark plain and graph compilation" Jun 2, 2026

See :func:`torch.compile` for details on the arguments for this function.

.. tip::

I noticed these .. tip:: sections don't show up in my IDE b/c it tries to render Markdown on hover. Can change to something like:

Note
----

to make it friendly to both RST and Markdown.

"Either update the model configuration or call `model.set_pooling_include_prompt(False)` after loading the model."
)

def compile(self, *args, **kwargs):

This change keeps the existing compile doc and behavior, but adds a tip about how to compile a SentenceTransformer. By itself, the method is a no-op for inference. In a future change we can consider adding a method which auto-finds the Transformer module and compiles it
