[`examples`] Benchmark plain and graph compilation by kddubey · Pull Request #3791 · huggingface/sentence-transformers

kddubey · 2026-06-02T05:18:25Z

Hello!

This PR:

- Adds an example of bucket-based CUDA graph compilation and benchmarks it against no compilation and "plain" compilation, i.e. `model[0].compile(dynamic=True)`. Graph compilation achieves a modest speedup by eliminating Python overhead between torch and CUDA. I'm sharing this code as a dependency-free speedup for latency-sensitive services that run standard sentence transformer models on CUDA with shorter sequences.
- Adds a tip to use `model[0].compile(dynamic=True)`, with a link to the benchmark. The currently documented method, `model.compile()`, is a no-op for inference.
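As a rough illustration of the bucketing idea, here is a minimal sketch, not the code in this PR. It assumes one CUDA graph is captured per padded sequence length (a "bucket") and that inputs longer than the largest bucket take some fallback path. The bucket sizes and the `pick_bucket` name are hypothetical.

```python
from bisect import bisect_left
from typing import Optional, Sequence

# Hypothetical bucket sizes (padded sequence lengths), sorted ascending.
# The defaults used in this PR may differ.
BUCKETS = (16, 32, 64, 128, 256)


def pick_bucket(seq_len: int, buckets: Sequence[int] = BUCKETS) -> Optional[int]:
    """Return the smallest bucket >= seq_len, or None if no bucket is large enough.

    A None result signals that the caller should use a fallback path
    (for example plain compilation or eager mode).
    """
    if seq_len < 1:
        raise ValueError("seq_len must be positive")
    i = bisect_left(buckets, seq_len)  # index of the first bucket >= seq_len
    return buckets[i] if i < len(buckets) else None
```

Padding each batch up to a fixed bucket length keeps tensor shapes static, which is what lets a captured graph be replayed. The cost is some wasted compute on padding tokens.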

I'm contributing graph compilation as an example rather than library code. Reasons to be cautious:

- gte-modernbert-base showed numerical drift when graph-compiled.
- The code assumes a fairly standard tokenizer. `compiled.SentenceTransformer` overrides the tokenizer, whose implementation can change between versions; e.g., v5.3 changed `tokenize` -> `preprocess`.
- Compilation can add a lot of startup time.
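One way to catch numerical drift like the gte-modernbert-base case before deploying is to compare embeddings from the uncompiled and compiled models on the same inputs. Below is a minimal sketch that uses plain Python lists so it runs without torch. The function names and the `min_cosine` threshold are assumptions and would need tuning per model.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def has_drift(
    reference: Sequence[Sequence[float]],
    candidate: Sequence[Sequence[float]],
    min_cosine: float = 0.999,  # hypothetical threshold, not from the PR
) -> bool:
    """Return True if any embedding pair falls below min_cosine similarity."""
    if len(reference) != len(candidate):
        raise ValueError("reference and candidate must have the same length")
    return any(
        cosine_similarity(ref, cand) < min_cosine
        for ref, cand in zip(reference, candidate)
    )
```

In practice `reference` would come from the uncompiled model and `candidate` from the compiled one, both converted to lists.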

Another thing I'm less sure about:

Deploying the graph-compiled model to prod caused an error where concurrent GPU usage triggered an opaque device-side assert. Our deployment has many replicas and health checks to self-heal, so this error wasn't a big deal. But it's not obvious how to root-cause an error from a graph-compiled call, even with `CUDA_LAUNCH_BLOCKING=1`. I had to put a lock around GPU model calls to fix it. Since then, compilation has been running in prod for a few weeks without error and with 3x lower latency.
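The lock-based workaround can be sketched roughly as follows. This is an illustration, not the code running in prod: `SerializedModel` is a hypothetical wrapper, and it assumes all GPU work goes through a single callable such as a model's `encode` method.

```python
import threading
from typing import Any, Callable


class SerializedModel:
    """Wrap a callable so that only one thread at a time can run it."""

    def __init__(self, encode_fn: Callable[..., Any]) -> None:
        self._encode_fn = encode_fn
        self._lock = threading.Lock()

    def encode(self, *args: Any, **kwargs: Any) -> Any:
        # Concurrent callers block here until the current call finishes,
        # so graph-compiled GPU calls never overlap.
        with self._lock:
            return self._encode_fn(*args, **kwargs)
```

The trade-off is that requests on one replica are processed one at a time, so throughput depends on each call being fast and on having enough replicas.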

Disclaimer: a handful of trivial lines in compiled.py and benchmark.py were written by AI, and some of the tests were implemented by AI.

kddubey · 2026-06-04T23:36:59Z

+
+        See :func:`torch.compile` for details on the arguments for this function.
+
+        .. tip::


I noticed these `.. tip::` sections don't show up in my IDE because it tries to render markdown on hover. I can change them to something like

    Note
    ----

to make the docstring friendly to both RST and markdown.

kddubey · 2026-06-04T23:43:34Z

                    "Either update the model configuration or call `model.set_pooling_include_prompt(False)` after loading the model."
                )

+    def compile(self, *args, **kwargs):


This change keeps the existing `compile` docstring and behavior, but adds a tip about how to compile a SentenceTransformer. By itself, the method is a no-op for inference. In a future change, we can consider adding a method that auto-finds the Transformer module and compiles it.
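To make the "auto-find" idea concrete, here is a minimal sketch of a hypothetical helper; nothing like it exists in the library today. It relies on a SentenceTransformer being iterable over its modules (it is an `nn.Sequential`) and on `torch.nn.Module.compile` accepting the same arguments as `torch.compile`. Matching on the class name is an assumption made to keep the sketch import-free; a real implementation would more likely use an `isinstance` check against the library's Transformer class.

```python
from typing import Any, Iterable


def compile_transformer_module(modules: Iterable[Any], **compile_kwargs: Any) -> Any:
    """Compile the first module whose class is named "Transformer" and return it.

    `modules` can be any iterable of modules, e.g. a SentenceTransformer.
    `compile_kwargs` are forwarded to that module's `compile` method.
    """
    for module in modules:
        if type(module).__name__ == "Transformer":
            module.compile(**compile_kwargs)
            return module
    raise ValueError("No Transformer module found")


# Intended usage (requires sentence-transformers and torch; not run here):
#   model = SentenceTransformer("...")
#   compile_transformer_module(model, dynamic=True)
```

For a standard model whose first module is the Transformer, this is equivalent to the `model[0].compile(dynamic=True)` call from the tip.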

kddubey and others added 17 commits June 1, 2026 16:41

[]: Compilation

e907a43

notebook for colab

0893234

bwd-fwd compat

9f8fd6d

module 0 compile

b7c6e28

don't drop last bucket if above max seq

7e8eae7

fall back to plain compile

17cd0f3

optional compile fallback

cfef9e8

readme

71bc7eb

docstring

a68116a

specify relative

4e0e78b

toggle

154c3b6

note measure warmup

734d1c6

tests

64f1dc6

parametrize

01e6f3a

applications readme

27df195

Merge branch 'huggingface:main' into kddubey/examples/compilation

1478e15

flash-attn warning

3895aa5

kddubey marked this pull request as ready for review June 2, 2026 05:39

kddubey added 3 commits June 1, 2026 22:42

doc all 3 versions

5d9f189

warn concurrency

7c14ef1

check if there's a gap

1e61739

kddubey changed the title ~~[examples] Bucket-based compilation~~ [examples] Benchmark plain and bucket-based compilation Jun 2, 2026

kddubey changed the title ~~[examples] Benchmark plain and bucket-based compilation~~ [examples] Benchmark plain and graph compilation Jun 2, 2026

kddubey added 3 commits June 2, 2026 21:48

y did i do dat

613c0e3

update old comment

e493151

include tip re plain compile

7a0a11a

kddubey commented Jun 4, 2026


kddubey added 2 commits June 4, 2026 16:53

use compiled.DEFAULT

f5318ab

no need for diag

840ecb4

rm superfluous note

81f2d68

kddubey wants to merge 26 commits into huggingface:main from kddubey:kddubey/examples/compilation
