[examples] Benchmark plain and graph compilation #3791
Open
kddubey wants to merge 26 commits into
Title changed from "[examples] Bucket-based compilation" to "[examples] Benchmark plain and bucket-based compilation"
Title changed from "[examples] Benchmark plain and bucket-based compilation" to "[examples] Benchmark plain and graph compilation"
kddubey
commented
Jun 4, 2026
    See :func:`torch.compile` for details on the arguments for this function.

    .. tip::
Contributor
Author
I noticed these `.. tip::` sections don't show up in my IDE because it tries to render Markdown on hover. I can change them to something like:

Note
----

to make them friendly to both RST and Markdown.
kddubey
commented
Jun 4, 2026
    "Either update the model configuration or call `model.set_pooling_include_prompt(False)` after loading the model."
    )

    def compile(self, *args, **kwargs):
Contributor
Author
This change keeps the existing `compile` docstring and behavior but adds a tip about how to compile a SentenceTransformer. By itself, the method is a no-op for inference. In a future change, we could consider adding a method that finds the Transformer module automatically and compiles it.
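To make the future helper floated in this comment concrete, here is a minimal sketch. It is not part of this PR or the library: the name `compile_transformer_module` is hypothetical, and matching on the class name `Transformer` is an assumed heuristic. It relies only on the model being iterable over its modules and on the target module exposing a `compile` method.

```python
def compile_transformer_module(model, **compile_kwargs):
    """Compile the first child module whose class is named "Transformer".

    Hypothetical helper, not a library API. `model` is assumed to be
    iterable over its child modules, and the match on the class name is
    an assumption made for illustration only.
    """
    for module in model:
        if type(module).__name__ == "Transformer":
            # Forward the arguments unchanged, e.g. dynamic=True.
            module.compile(**compile_kwargs)
            return module
    raise ValueError("no module named 'Transformer' found in the model")
```

For a standard model whose first module is the Transformer, `compile_transformer_module(model, dynamic=True)` would amount to `model[0].compile(dynamic=True)`.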
Hello!
This PR:
- `model[0].compile(dynamic=True)`. Graph compilation achieves a modest speedup by eliminating Python overhead between torch and CUDA. I'm sharing this code as a dependency-free speedup for latency-sensitive services running standard sentence transformer models on CUDA for shorter sequences.
- `model[0].compile(dynamic=True)` with a link to the benchmark. The current documented method, `model.compile()`, is a no-op for inference.

I'm contributing graph compilation as an example rather than library code. Reasons to be cautious:
- `gte-modernbert-base` had numerical drift when graph-compiled
- `compiled.SentenceTransformer` overrides the tokenizer, whose implementation can change a bit, e.g., v5.3 changed `tokenize` -> `preprocess`

Another thing I'm less sure about
Deploying the graph-compiled model to prod caused an error where concurrent GPU usage triggered an opaque device-side assert. Our deployment has many replicas and health checks to self-heal, so this error wasn't a big deal. But it's not obvious how to root-cause an error from a graph-compiled call, even with `CUDA_LAUNCH_BLOCKING=1`. I needed to put a lock on GPU model calls to fix it, and compilation has now been running in prod for a few weeks without error and with 3x lower latency.

Disclaimer: a handful of trivial lines in `compiled.py` and `benchmark.py` were written by AI, and some of the tests were implemented by AI.
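For readers wondering what the lock on GPU model calls mentioned above might look like, here is a minimal sketch. It is not the code running in the deployment described here: the class name `LockedEncoder` is hypothetical, and it assumes a single process where every request thread goes through the same wrapper.

```python
import threading


class LockedEncoder:
    """Serialize calls to a GPU-backed callable with one lock.

    Hypothetical wrapper for illustration. `encode_fn` stands in for
    whatever runs the graph-compiled model, e.g. a bound `encode` method.
    """

    def __init__(self, encode_fn):
        self._encode_fn = encode_fn
        self._lock = threading.Lock()

    def __call__(self, *args, **kwargs):
        # Only one thread at a time may enter the model call.
        with self._lock:
            return self._encode_fn(*args, **kwargs)
```

The trade-off is that requests within one replica now queue behind each other, so throughput per replica is bounded by the latency of a single model call.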