A benchmarking suite for analyzing the performance impact of torch.compile() on Large Language Model inference, with a focus on prefill vs decode phases and spot instance deployment scenarios. A more comprehensive analysis is available in the research report.
[📄 Research report](./report.pdf)
This repository contains the implementation and experimental framework used to study the overhead of torch.compile() in LLM inference. The research investigates how PyTorch's compilation toolkit can be applied to optimize Large Language Model inference, particularly in scenarios involving spot instances where frequent model initialization and compilation occur.
Given the widespread use of Large Language Models and the costs related to their deployment, optimizing inference performance is crucial. Unlike training, which focuses on maximizing throughput, inference optimization aims to minimize generation latency. This is particularly challenging due to the sequential nature of autoregressive decoding in transformers.
- When should compilation be applied? Investigating the trade-offs between prefill and decode phase compilation
- How do dynamic shapes affect performance? Understanding recompilation overhead with variable input lengths
- What are the implications for spot instances? Analyzing compilation costs in frequently restarted environments
TL;DR: To reduce latency in initial LLM generations on spot instances, avoid compilation for the prefill phase. Use `torch.compile(dynamic=True)` when expecting variable prefill shapes to prevent recompilations.
A brief introduction to torch.compile() and a comprehensive analysis are available in the research report.
- Decode-only compilation provides optimal performance for most scenarios
- Full model compilation introduces overhead during prefill with minimal benefit
- Compilation timing significantly affects total latency, especially for smaller batch sizes
- Batch size dependency: Compilation benefits increase substantially with larger batch sizes (32+)
- Token generation: Compiled decode shows 15-35% latency reduction per token
- First token penalty: Compilation adds 7-8 seconds of overhead to first-token generation
- Static compilation: Causes up to 6 recompilations for changing input shapes
- Dynamic compilation (`dynamic=True`): Reduces recompilations to 3 while maintaining performance
- Shape variability: Dynamic compilation prevents recompilation overhead with variable prefill lengths
- Memory efficiency: Vertical fusion reduces memory reads by combining operations (e.g., SiLU + multiplication)
- Function consolidation: The compiled version makes fewer function calls (fused kernels vs. separate ATen ops)
- Optimization patterns: 71% of eager-mode time is spent in add-multiply operations, which the compiled version optimizes significantly
- Initialization cost: Compilation overhead ranges from 7-17 seconds depending on strategy
- Break-even point: Benefits appear after 50-100 tokens for most batch sizes
- Recommendation: Use decode-only compilation with dynamic shapes for spot instance deployment
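
As a concrete illustration of this recommendation, the sketch below runs the prefill pass eagerly and routes only the per-token decode steps through a `torch.compile(dynamic=True)`-wrapped forward. It is a minimal example built on the Hugging Face API, not the exact strategy implemented in `benchmarking/compile_functions.py`; the model choice and the greedy loop are illustrative.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative model choice; the study used GPT-2 and Llama 3.2-1B.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

# Compile a separate handle so the prefill pass below stays eager.
compiled_decode_forward = torch.compile(model.forward, dynamic=True)

inputs = tokenizer("The quick brown fox", return_tensors="pt")

with torch.no_grad():
    # Prefill: process the whole prompt once, without compilation overhead.
    out = model(**inputs, use_cache=True)
    past, token = out.past_key_values, out.logits[:, -1:].argmax(dim=-1)

    # Decode: generate tokens one at a time through the compiled forward.
    generated = [token]
    for _ in range(32):
        out = compiled_decode_forward(
            input_ids=token, past_key_values=past, use_cache=True
        )
        past, token = out.past_key_values, out.logits[:, -1:].argmax(dim=-1)
        generated.append(token)

print(tokenizer.decode(torch.cat(generated, dim=-1)[0]))
```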
- Python 3.12 or higher
- PyTorch 2.8 or higher
- Transformers 4.55+ library
- CUDA-capable GPU
- Clone the repository:

```bash
git clone https://github.com/TheRootOf3/torch-compile-benchmarks.git
cd torch-compile-benchmarks
```

- Install dependencies:

```bash
pip install -r requirements.txt
```

- Download models (optional, for advanced benchmarks):

```bash
# Example for Llama models - adjust paths in config files
huggingface-cli download meta-llama/Llama-3.2-1B
```

Example usage of the command-line interface:

```bash
# Run performance comparison
python benchmark_cli.py compare --model gpt2 --batch-size 32

# Profile operations with debug info
python benchmark_cli.py profile --script compare_llm --debug

# Test dynamic shapes
python benchmark_cli.py dynamic-shapes --sizes 64,128,256
```
The repository is organized as follows:

```
torch-compile-benchmarks/
├── benchmarking/                             # Core benchmarking framework
│   ├── benchmark_compile_prefill_decode.py   # Main benchmarking script
│   ├── compile_functions.py                  # Compilation strategies
│   └── utils.py                              # Utility functions
├── scripts/                                  # Standalone analysis scripts
│   ├── compare_llm.py                        # Simple model comparison
│   ├── profile_ops.py                        # Operation-level profiling
│   ├── compile_multiple_inputs.py            # Dynamic input analysis
│   └── example_llm.py                        # Basic compilation example
├── models/                                   # Model implementations
├── results/                                  # Experimental results
├── resources/                                # Images and plots
├── notebooks/                                # Analysis notebooks
│   └── visualize_results.ipynb
├── data/                                     # Test data
├── config.py                                 # Configuration settings
├── benchmark_cli.py                          # Command-line interface
└── report.pdf                                # Original research report
```
Update model paths in config.py or via environment variables:
```python
# In config.py
LLAMA3_1B_PATH = "/path/to/Llama-3.2-1B"
GPT2_PATH = "/path/to/gpt2"
```

```bash
# Or via environment variables
export LLAMA3_1B_PATH="/path/to/Llama-3.2-1B"
export TORCH_DEVICE="cuda:0"  # or "cpu"
```

The framework supports multiple compilation strategies based on research findings:
- `compile_model_fn`: Full model compilation (not recommended for prefill)
- Decode-only compilation: Recommended approach
- Dynamic compilation: Use `torch.compile(model, dynamic=True)` for variable shapes
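
The snippet below is a minimal sketch of the dynamic-shapes point: compiling with `dynamic=True` asks the compiler to generalize over the sequence dimension instead of specializing on (and recompiling for) every new prefill length. The model and lengths are illustrative; run it with `TORCH_LOGS=recompiles` to observe the recompilation behaviour.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

# dynamic=True compiles shape-generic kernels up front, so varying the
# prefill length below should not trigger a recompilation for every size.
compiled = torch.compile(model, dynamic=True)

with torch.no_grad():
    for length in (64, 128, 256):  # variable prefill lengths
        input_ids = torch.randint(0, tokenizer.vocab_size, (1, length))
        compiled(input_ids=input_ids)
```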
For detailed compilation debugging:
```bash
export TORCHINDUCTOR_FORCE_DISABLE_CACHES=1
export TORCH_COMPILE_DEBUG=1
export TORCH_LOGS=recompiles,graph_breaks
```

The experiments are designed around real-world deployment scenarios:
- Spot Instance Simulation: Models frequently restarted with compilation overhead
- Prefill vs Decode Analysis: Separate measurement of both phases
- Dynamic Shape Testing: Variable input lengths to trigger recompilation
- Batch Size Scaling: Performance across different batch dimensions
- Prefill Latency: Time to process initial input sequence
- First Token Latency: Time from input to first generated token (includes compilation)
- Token Generation Rate: Sustained throughput during decode phase
- Compilation Overhead: Time spent in torch.compile()
- Recompilation Events: The number of graph recompilations
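
The sketch below shows one way the first two metrics can be measured for an eager-mode baseline: time a single forward pass over the prompt (prefill), then time each single-token step against the KV cache (decode). It is illustrative rather than the repository's exact harness; on GPU, `torch.cuda.synchronize()` calls should bracket each timed region.

```python
import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

inputs = tokenizer("A prompt long enough to make prefill measurable.", return_tensors="pt")

with torch.no_grad():
    # Prefill latency: one forward pass over the full prompt.
    t0 = time.perf_counter()
    out = model(**inputs, use_cache=True)
    prefill_s = time.perf_counter() - t0

    # Decode latency: single-token steps against the growing KV cache.
    past, token = out.past_key_values, out.logits[:, -1:].argmax(dim=-1)
    step_s = []
    for _ in range(32):
        t0 = time.perf_counter()
        out = model(input_ids=token, past_key_values=past, use_cache=True)
        step_s.append(time.perf_counter() - t0)
        past, token = out.past_key_values, out.logits[:, -1:].argmax(dim=-1)

print(f"prefill: {prefill_s * 1e3:.1f} ms, decode: {1e3 * sum(step_s) / len(step_s):.1f} ms/token")
```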
- Batch sizes: 1, 4, 32, 128, 256
- Sequence lengths: Variable (5-500 tokens)
- Models: Llama 3.2-1B, GPT-2
- Hardware: CPU and GPU configurations
```python
import time
import torch

# Proper benchmarking methodology: warm up first (this also triggers compilation
# for compiled models), then synchronize before and after the timed region.
# `model`, `input_data`, and the iteration counts are assumed to be defined.
for _ in range(warmup_iterations):
    model(input_data)  # Warmup phase

if torch.cuda.is_available():
    torch.cuda.synchronize()
start_time = time.perf_counter()
for _ in range(measurement_iterations):
    model(input_data)
if torch.cuda.is_available():
    torch.cuda.synchronize()
end_time = time.perf_counter()
```

| Configuration | Prefill Benefit | Decode Benefit | Recommendation |
|---|---|---|---|
| Eager Mode | Baseline | Baseline | Development/debugging |
| Compile Decode Only | Minimal overhead | 15-35% faster | Single sequences and small batches |
| Compile Full Model | High overhead | 15-35% faster | Large batch only (32+) |
| Dynamic Shapes | Prevents recompilation | Maintained performance | Variable input lengths |
- Compilation Timing Matters: When you compile affects performance more than what you compile
- Batch Size Threshold: Benefits become significant at larger batch sizes (≥32 in this study)
- Dynamic Shapes Essential: Static compilation causes excessive recompilation with variable inputs
- Operator Fusion Benefits: Vertical fusion (e.g., SiLU + multiplication) provides substantial memory efficiency gains
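
To make the operator-fusion point concrete, here is a standalone sketch (not taken from the repository's profiling scripts) of a SwiGLU-style MLP block: in eager mode the SiLU and the elementwise multiplication run as separate kernels, each reading and writing the full activation tensor, while `torch.compile` lets Inductor fuse them vertically into a single kernel.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLU(nn.Module):
    """Gated MLP block of the kind used in Llama-style models (dims illustrative)."""
    def __init__(self, dim: int = 1024, hidden: int = 4096):
        super().__init__()
        self.gate_proj = nn.Linear(dim, hidden, bias=False)
        self.up_proj = nn.Linear(dim, hidden, bias=False)
        self.down_proj = nn.Linear(hidden, dim, bias=False)

    def forward(self, x):
        # Eager mode: SiLU and the multiplication are separate elementwise ops,
        # each with its own round trip through memory.
        return self.down_proj(F.silu(self.gate_proj(x)) * self.up_proj(x))

mlp = SwiGLU().eval()
compiled_mlp = torch.compile(mlp)  # Inductor can fuse SiLU + mul into one kernel

x = torch.randn(4, 128, 1024)
with torch.no_grad():
    y = compiled_mlp(x)  # first call compiles; later calls reuse the fused kernels
```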
Access detailed results in the results/ directory and use the visualization notebook for custom analysis.
This project is licensed under the MIT License - see the LICENSE file for details.
If you use this work in your research, please cite:
```bibtex
@misc{szablewski2024torch-compile-benchmarks,
  title={Studying the Overhead of torch.compile() in Large Language Model Inference},
  author={Andrzej Szablewski},
  year={2024},
  url={https://github.com/TheRootOf3/torch-compile-benchmarks}
}
```

- nanoGPT: GPT-2 implementation based on Andrej Karpathy's nanoGPT
This research was conducted as part of coursework for R244: Large Scale Data Optimisation and Processing at the University of Cambridge.
