A benchmarking suite for analyzing the performance impact of torch.compile() on Large Language Model inference, with a focus on prefill vs decode phases and spot instance deployment scenarios. A more comprehensive analysis is available in the research report.
[📄 Research report](./report.pdf)
This repository contains the implementation and experimental framework used to study the overhead of torch.compile() in LLM inference. The research investigates how PyTorch's compilation toolkit can be applied to optimize Large Language Model inference, particularly in scenarios involving spot instances where frequent model initialization and compilation occur.
Given the widespread use of Large Language Models and the costs related to their deployment, optimizing inference performance is crucial. Unlike training, which focuses on maximizing throughput, inference optimization aims to minimize generation latency. This is particularly challenging due to the sequential nature of autoregressive decoding in transformers.
- When should compilation be applied? Investigating the trade-offs between prefill and decode phase compilation
- How do dynamic shapes affect performance? Understanding recompilation overhead with variable input lengths
- What are the implications for spot instances? Analyzing compilation costs in frequently restarted environments
TL;DR: To reduce latency in initial LLM generations on spot instances, avoid compilation for the prefill phase. Use `torch.compile(dynamic=True)` when expecting variable prefill shapes to prevent recompilations.
A brief introduction to torch.compile() and a comprehensive analysis are available in the research report.
- Decode-only compilation provides optimal performance for most scenarios
- Full model compilation introduces overhead during prefill with minimal benefit
- Compilation timing significantly affects total latency, especially for smaller batch sizes
- Batch size dependency: Compilation benefits increase substantially with larger batch sizes (32+)
- Token generation: Compiled decode shows 15-35% latency reduction per token
- First token penalty: Compilation adds 7-8 seconds of overhead to first-token generation
- Static compilation: Causes up to 6 recompilations for changing input shapes
- Dynamic compilation (`dynamic=True`): Reduces recompilations to 3 while maintaining performance
- Shape variability: Dynamic compilation prevents recompilation overhead with variable prefill lengths
- Memory efficiency: Vertical fusion reduces memory reads by combining operations (e.g., SiLU + multiplication)
- Function consolidation: The compiled version makes fewer function calls (fused kernels vs. separate ATen ops)
- Optimization patterns: 71% of eager-mode time is spent in add-multiply operations, which the compiled version optimizes significantly
- Initialization cost: Compilation overhead ranges from 7-17 seconds depending on strategy
- Break-even point: Benefits appear after 50-100 tokens for most batch sizes
- Recommendation: Use decode-only compilation with dynamic shapes for spot instance deployment
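
As a concrete illustration of this recommendation, the sketch below runs the prefill pass eagerly and routes only the per-token decode steps through a `torch.compile(dynamic=True)`-wrapped forward. It is a minimal example built on the Hugging Face API, not the exact strategy implemented in `benchmarking/compile_functions.py`; the model choice and the greedy loop are illustrative.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative model choice; the study used GPT-2 and Llama 3.2-1B.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

# Compile a separate handle so the prefill pass below stays eager.
compiled_decode_forward = torch.compile(model.forward, dynamic=True)

inputs = tokenizer("The quick brown fox", return_tensors="pt")

with torch.no_grad():
    # Prefill: process the whole prompt once, without compilation overhead.
    out = model(**inputs, use_cache=True)
    past, token = out.past_key_values, out.logits[:, -1:].argmax(dim=-1)

    # Decode: generate tokens one at a time through the compiled forward.
    generated = [token]
    for _ in range(32):
        out = compiled_decode_forward(
            input_ids=token, past_key_values=past, use_cache=True
        )
        past, token = out.past_key_values, out.logits[:, -1:].argmax(dim=-1)
        generated.append(token)

print(tokenizer.decode(torch.cat(generated, dim=-1)[0]))
```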
- Python 3.12 or higher
- PyTorch 2.8 or higher
- Transformers 4.55+ library
- CUDA-capable GPU
- Clone the repository:

```bash
git clone https://github.com/TheRootOf3/torch-compile-benchmarks.git
cd torch-compile-benchmarks
```

- Install dependencies:

```bash
pip install -r requirements.txt
```

- Download models (optional, for advanced benchmarks):

```bash
# Example for Llama models - adjust paths in config files
huggingface-cli download meta-llama/Llama-3.2-1B
```

Example usage of the command-line interface:

```bash
# Run performance comparison
python benchmark_cli.py compare --model gpt2 --batch-size 32

# Profile operations with debug info
python benchmark_cli.py profile --script compare_llm --debug

# Test dynamic shapes
python benchmark_cli.py dynamic-shapes --sizes 64,128,256
```
The repository is organized as follows:

```
torch-compile-benchmarks/
├── benchmarking/                             # Core benchmarking framework
│   ├── benchmark_compile_prefill_decode.py   # Main benchmarking script
│   ├── compile_functions.py                  # Compilation strategies
│   └── utils.py                              # Utility functions
├── scripts/                                  # Standalone analysis scripts
│   ├── compare_llm.py                        # Simple model comparison
│   ├── profile_ops.py                        # Operation-level profiling
│   ├── compile_multiple_inputs.py            # Dynamic input analysis
│   └── example_llm.py                        # Basic compilation example
├── models/                                   # Model implementations
├── results/                                  # Experimental results
├── resources/                                # Images and plots
├── notebooks/                                # Analysis notebooks
│   └── visualize_results.ipynb
├── data/                                     # Test data
├── config.py                                 # Configuration settings
├── benchmark_cli.py                          # Command-line interface
└── report.pdf                                # Original research report
```
Update model paths in config.py or via environment variables:
```python
# In config.py
LLAMA3_1B_PATH = "/path/to/Llama-3.2-1B"
GPT2_PATH = "/path/to/gpt2"
```

```bash
# Or via environment variables
export LLAMA3_1B_PATH="/path/to/Llama-3.2-1B"
export TORCH_DEVICE="cuda:0"  # or "cpu"
```

The framework supports multiple compilation strategies based on research findings:
- `compile_model_fn`: Full model compilation (not recommended for prefill)
- Decode-only compilation: Recommended approach
- Dynamic compilation: Use `torch.compile(model, dynamic=True)` for variable shapes
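
The snippet below is a minimal sketch of the dynamic-shapes point: compiling with `dynamic=True` asks the compiler to generalize over the sequence dimension instead of specializing on (and recompiling for) every new prefill length. The model and lengths are illustrative; run it with `TORCH_LOGS=recompiles` to observe the recompilation behaviour.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

# dynamic=True compiles shape-generic kernels up front, so varying the
# prefill length below should not trigger a recompilation for every size.
compiled = torch.compile(model, dynamic=True)

with torch.no_grad():
    for length in (64, 128, 256):  # variable prefill lengths
        input_ids = torch.randint(0, tokenizer.vocab_size, (1, length))
        compiled(input_ids=input_ids)
```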
For detailed compilation debugging:
```bash
export TORCHINDUCTOR_FORCE_DISABLE_CACHES=1
export TORCH_COMPILE_DEBUG=1
export TORCH_LOGS=recompiles,graph_breaks
```

The experiments are designed around real-world deployment scenarios:
- Spot Instance Simulation: Models frequently restarted with compilation overhead
- Prefill vs Decode Analysis: Separate measurement of both phases
- Dynamic Shape Testing: Variable input lengths to trigger recompilation
- Batch Size Scaling: Performance across different batch dimensions
- Prefill Latency: Time to process initial input sequence
- First Token Latency: Time from input to first generated token (includes compilation)
- Token Generation Rate: Sustained throughput during decode phase
- Compilation Overhead: Time spent in torch.compile()
- Recompilation Events: The number of graph recompilations
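
The sketch below shows one way the first two metrics can be measured for an eager-mode baseline: time a single forward pass over the prompt (prefill), then time each single-token step against the KV cache (decode). It is illustrative rather than the repository's exact harness; on GPU, `torch.cuda.synchronize()` calls should bracket each timed region.

```python
import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

inputs = tokenizer("A prompt long enough to make prefill measurable.", return_tensors="pt")

with torch.no_grad():
    # Prefill latency: one forward pass over the full prompt.
    t0 = time.perf_counter()
    out = model(**inputs, use_cache=True)
    prefill_s = time.perf_counter() - t0

    # Decode latency: single-token steps against the growing KV cache.
    past, token = out.past_key_values, out.logits[:, -1:].argmax(dim=-1)
    step_s = []
    for _ in range(32):
        t0 = time.perf_counter()
        out = model(input_ids=token, past_key_values=past, use_cache=True)
        step_s.append(time.perf_counter() - t0)
        past, token = out.past_key_values, out.logits[:, -1:].argmax(dim=-1)

print(f"prefill: {prefill_s * 1e3:.1f} ms, decode: {1e3 * sum(step_s) / len(step_s):.1f} ms/token")
```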
- Batch sizes: 1, 4, 32, 128, 256
- Sequence lengths: Variable (5-500 tokens)
- Models: Llama 3.2-1B, GPT-2
- Hardware: CPU and GPU configurations
```python
import time
import torch

# Proper benchmarking methodology: warm up first (this also triggers compilation
# for compiled models), then synchronize before and after the timed region.
# `model`, `input_data`, and the iteration counts are assumed to be defined.
for _ in range(warmup_iterations):
    model(input_data)  # Warmup phase

if torch.cuda.is_available():
    torch.cuda.synchronize()
start_time = time.perf_counter()
for _ in range(measurement_iterations):
    model(input_data)
if torch.cuda.is_available():
    torch.cuda.synchronize()
end_time = time.perf_counter()
```

| Configuration | Prefill Benefit | Decode Benefit | Recommendation |
|---|---|---|---|
| Eager Mode | Baseline | Baseline | Development/debugging |
| Compile Decode Only | Minimal overhead | 15-35% faster | Single sequences and small batches |
| Compile Full Model | High overhead | 15-35% faster | Large batch only (32+) |
| Dynamic Shapes | Prevents recompilation | Maintained performance | Variable input lengths |
- Compilation Timing Matters: When you compile affects performance more than what you compile
- Batch Size Threshold: Benefits become significant at larger batch sizes (≥32 in this study)
- Dynamic Shapes Essential: Static compilation causes excessive recompilation with variable inputs
- Operator Fusion Benefits: Vertical fusion (e.g., SiLU + multiplication) provides substantial memory efficiency gains
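
To make the operator-fusion point concrete, here is a standalone sketch (not taken from the repository's profiling scripts) of a SwiGLU-style MLP block: in eager mode the SiLU and the elementwise multiplication run as separate kernels, each reading and writing the full activation tensor, while `torch.compile` lets Inductor fuse them vertically into a single kernel.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLU(nn.Module):
    """Gated MLP block of the kind used in Llama-style models (dims illustrative)."""
    def __init__(self, dim: int = 1024, hidden: int = 4096):
        super().__init__()
        self.gate_proj = nn.Linear(dim, hidden, bias=False)
        self.up_proj = nn.Linear(dim, hidden, bias=False)
        self.down_proj = nn.Linear(hidden, dim, bias=False)

    def forward(self, x):
        # Eager mode: SiLU and the multiplication are separate elementwise ops,
        # each with its own round trip through memory.
        return self.down_proj(F.silu(self.gate_proj(x)) * self.up_proj(x))

mlp = SwiGLU().eval()
compiled_mlp = torch.compile(mlp)  # Inductor can fuse SiLU + mul into one kernel

x = torch.randn(4, 128, 1024)
with torch.no_grad():
    y = compiled_mlp(x)  # first call compiles; later calls reuse the fused kernels
```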
Access detailed results in the results/ directory and use the visualization notebook for custom analysis.
This project is licensed under the MIT License - see the LICENSE file for details.
If you use this work in your research, please cite:
```bibtex
@misc{szablewski2024torch-compile-benchmarks,
  title={Studying the Overhead of torch.compile() in Large Language Model Inference},
  author={Andrzej Szablewski},
  year={2024},
  url={https://github.com/TheRootOf3/torch-compile-benchmarks}
}
```

- nanoGPT: GPT-2 implementation based on Andrej Karpathy's nanoGPT
This research was conducted as part of coursework for R244: Large Scale Data Optimisation and Processing at the University of Cambridge.
