Studying the Overhead of torch.compile() in Large Language Model Inference

Python 3.12+ · PyTorch · Transformers

A benchmarking suite for analyzing the performance impact of torch.compile() on Large Language Model inference, with a focus on prefill vs decode phases and spot instance deployment scenarios. A more comprehensive analysis is available in the research report.

[📄 research report]

🔬 Research Overview

This repository contains the implementation and experimental framework used to study the overhead of torch.compile() in LLM inference. The research investigates how PyTorch's compilation toolkit can be applied to optimize Large Language Model inference, particularly in scenarios involving spot instances where frequent model initialization and compilation occur.

Study Motivation

Given the widespread use of Large Language Models and the costs related to their deployment, optimizing inference performance is crucial. Unlike training, which focuses on maximizing throughput, inference optimization aims to minimize generation latency. This is particularly challenging due to the sequential nature of autoregressive decoding in transformers.

Key Research Questions

  1. When should compilation be applied? Investigating the trade-offs between prefill and decode phase compilation
  2. How do dynamic shapes affect performance? Understanding recompilation overhead with variable input lengths
  3. What are the implications for spot instances? Analyzing compilation costs in frequently restarted environments

📊 Key Findings

Core Insights

TL;DR: To reduce latency in initial LLM generations on spot instances, avoid compilation for the prefill phase. Use torch.compile(dynamic=True) when expecting variable prefill shapes to prevent recompilations.

Performance Comparison

Detailed Results

A brief introduction to torch.compile() and a comprehensive analysis are available in the research report.

1. Compilation Strategy Impact

  • Decode-only compilation provides the best performance in most scenarios (a minimal sketch follows this list)
  • Full model compilation introduces overhead during prefill with minimal benefit
  • Compilation timing significantly affects total latency, especially for smaller batch sizes
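
The decode-only strategy can be approximated with a Hugging Face causal LM by running the prefill pass eagerly and wrapping only the cached single-token decode step in torch.compile(). The snippet below is a minimal sketch of this idea, not the repository's compile_functions.py; the model name and helper function are placeholders.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Minimal sketch: eager prefill, compiled decode step (placeholder model name).
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

# Prefill stays eager: process the prompt once and keep the KV cache.
inputs = tokenizer("The quick brown fox", return_tensors="pt")
with torch.no_grad():
    prefill_out = model(**inputs, use_cache=True)

# Compile only the single-token decode step.
@torch.compile
def decode_step(token_ids, past_key_values):
    return model(input_ids=token_ids, past_key_values=past_key_values, use_cache=True)

next_token = prefill_out.logits[:, -1:].argmax(dim=-1)
with torch.no_grad():
    decode_out = decode_step(next_token, prefill_out.past_key_values)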

2. Performance Characteristics

  • Batch size dependency: Compilation benefits increase substantially at larger batch sizes (32+)
  • Token generation: Compiled decode shows a 15-35% per-token latency reduction
  • First token penalty: Compilation adds 7-8 seconds of overhead to first-token generation

3. Dynamic Shape Handling

  • Static compilation: Triggers up to 6 recompilations as input shapes change
  • Dynamic compilation (dynamic=True): Reduces recompilations to 3 while maintaining performance
  • Shape variability: Dynamic compilation avoids recompilation overhead with variable prefill lengths (see the sketch after this list)
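
As an illustration (placeholder model name, not the benchmark script itself), compiling with dynamic=True lets several prompt lengths share one dynamic-shape graph; without it, each new length compiles a fresh static graph:

import torch
from transformers import AutoModelForCausalLM

# Sketch: dynamic=True lets varying prompt lengths reuse one graph instead of
# recompiling for every new shape (placeholder model name).
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()
compiled_model = torch.compile(model, dynamic=True)  # drop dynamic=True to observe recompilations

with torch.no_grad():
    for seq_len in (64, 128, 256):  # variable prefill lengths
        ids = torch.randint(0, model.config.vocab_size, (1, seq_len))
        compiled_model(input_ids=ids)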

4. Operator-Level Analysis

  • Memory efficiency: Vertical fusion combines operations (e.g., SiLU + multiplication) into a single kernel, reducing memory reads (illustrated below)
  • Function consolidation: The compiled version issues fewer function calls (fused operations instead of separate ATen ops)
  • Optimization patterns: 71% of eager-mode time is spent in add-multiply operations, which are significantly optimized in the compiled version
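
The SiLU-times-gate pattern from a gated MLP can be reproduced in isolation to see this kind of vertical fusion. The snippet below is a standalone sketch, not the repository's profiling script:

import torch

# Sketch: a SwiGLU-style activation (SiLU followed by an elementwise multiply)
# that Inductor can fuse vertically into a single kernel, saving memory reads.
def gated_activation(x, gate):
    return torch.nn.functional.silu(gate) * x

compiled_gated_activation = torch.compile(gated_activation)

x = torch.randn(4, 2048)
gate = torch.randn(4, 2048)
out = compiled_gated_activation(x, gate)  # one fused pass instead of two separate ops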

5. Spot Instance Implications

  • Initialization cost: Compilation overhead ranges from 7 to 17 seconds, depending on strategy
  • Break-even point: Benefits appear after 50-100 tokens for most batch sizes (see the arithmetic sketch after this list)
  • Recommendation: Use decode-only compilation with dynamic shapes for spot instance deployment
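
The break-even point follows from dividing the one-time compilation overhead by the per-token latency saved. The numbers below are illustrative placeholders chosen to be consistent with the ranges above, not measured results:

# Illustrative break-even arithmetic (hypothetical latencies, not measured values).
compile_overhead_s = 8.0         # one-time cost paid before the first token
eager_ms_per_token = 300.0       # hypothetical eager decode latency
compiled_ms_per_token = 200.0    # hypothetical compiled decode latency (~33% faster)

saving_ms = eager_ms_per_token - compiled_ms_per_token
break_even_tokens = compile_overhead_s * 1000 / saving_ms
print(f"Compilation pays off after ~{break_even_tokens:.0f} tokens")  # ~80 tokens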

🚀 Quick Start

Prerequisites

  • Python 3.12 or higher
  • PyTorch 2.8 or higher
  • Transformers 4.55+ library
  • CUDA-capable GPU

Installation

  1. Clone the repository:
git clone https://github.com/TheRootOf3/torch-compile-benchmarks.git
cd torch-compile-benchmarks
  2. Install dependencies:
pip install -r requirements.txt
  3. Download models (optional, for advanced benchmarks):
# Example for Llama models - adjust paths in config files
huggingface-cli download meta-llama/Llama-3.2-1B

CLI Interface

# Run performance comparison
python benchmark_cli.py compare --model gpt2 --batch-size 32

# Profile operations with debug info
python benchmark_cli.py profile --script compare_llm --debug

# Test dynamic shapes
python benchmark_cli.py dynamic-shapes --sizes 64,128,256

📁 Project Structure

torch-compile-benchmarks/
├── benchmarking/              # Core benchmarking framework
│   ├── benchmark_compile_prefill_decode.py  # Main benchmarking script
│   ├── compile_functions.py   # Compilation strategies
│   └── utils.py              # Utility functions
├── scripts/                  # Standalone analysis scripts
│   ├── compare_llm.py        # Simple model comparison
│   ├── profile_ops.py        # Operation-level profiling
│   ├── compile_multiple_inputs.py  # Dynamic input analysis
│   └── example_llm.py        # Basic compilation example
├── models/                   # Model implementations
├── results/                  # Experimental results
├── resources/               # Images and plots
├── notebooks/               # Analysis notebooks
│   └── visualize_results.ipynb
├── data/                    # Test data
├── config.py                # Configuration settings
├── benchmark_cli.py         # Command-line interface
└── report.pdf              # Original research report

🔧 Configuration

Model Paths

Update model paths in config.py or via environment variables:

# In config.py
LLAMA3_1B_PATH = "/path/to/Llama-3.2-1B"
GPT2_PATH = "/path/to/gpt2"

# Or via environment variables
export LLAMA3_1B_PATH="/path/to/Llama-3.2-1B"
export TORCH_DEVICE="cuda:0"  # or "cpu"
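
A config.py entry of this form would typically fall back to the environment variable when it is set; the sketch below shows that pattern and is an assumption, not a copy of the repository's config.py:

# Sketch of environment-variable-backed settings (defaults are placeholders).
import os

LLAMA3_1B_PATH = os.environ.get("LLAMA3_1B_PATH", "/path/to/Llama-3.2-1B")
GPT2_PATH = os.environ.get("GPT2_PATH", "/path/to/gpt2")
TORCH_DEVICE = os.environ.get("TORCH_DEVICE", "cuda:0")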

Compilation Strategies

The framework supports multiple compilation strategies based on research findings:

  • compile_model_fn: Full model compilation (not recommended for prefill)
  • Decode-only compilation: Recommended approach
  • Dynamic compilation: Use torch.compile(model, dynamic=True) for variable shapes

Environment Variables

For detailed compilation debugging:

export TORCHINDUCTOR_FORCE_DISABLE_CACHES=1
export TORCH_COMPILE_DEBUG=1
export TORCH_LOGS=recompiles,graph_breaks

📈 Experimental Methodology

Benchmark Design

The experiments are designed around real-world deployment scenarios:

  1. Spot Instance Simulation: Models are frequently restarted, paying the compilation overhead each time
  2. Prefill vs Decode Analysis: Separate measurement of both phases (a timing sketch follows this list)
  3. Dynamic Shape Testing: Variable input lengths to trigger recompilation
  4. Batch Size Scaling: Performance across different batch dimensions
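
To illustrate the prefill vs decode split, the two phases can be timed around the prompt forward pass and the cached decode loop. The helper below is a sketch with placeholder arguments, not a script from the repository:

import time
import torch

def time_prefill_and_decode(model, input_ids, new_tokens=32):
    """Sketch: return prefill latency and mean per-token decode latency."""
    with torch.no_grad():
        t0 = time.perf_counter()
        out = model(input_ids=input_ids, use_cache=True)  # prefill phase
        if torch.cuda.is_available():
            torch.cuda.synchronize()
        t1 = time.perf_counter()

        token = out.logits[:, -1:].argmax(dim=-1)
        for _ in range(new_tokens):  # decode phase, one token at a time
            out = model(input_ids=token, past_key_values=out.past_key_values, use_cache=True)
            token = out.logits[:, -1:].argmax(dim=-1)
        if torch.cuda.is_available():
            torch.cuda.synchronize()
        t2 = time.perf_counter()

    return t1 - t0, (t2 - t1) / new_tokens  # seconds, seconds per token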

Key Metrics

  • Prefill Latency: Time to process initial input sequence
  • First Token Latency: Time from input to first generated token (includes compilation)
  • Token Generation Rate: Sustained throughput during decode phase
  • Compilation Overhead: Time spent in torch.compile()
  • Recompilation Events: The number of graph recompilations

Tested Configurations

  • Batch sizes: 1, 4, 32, 128, 256
  • Sequence lengths: Variable (5-500 tokens)
  • Models: Llama 3.2-1B, GPT-2
  • Hardware: CPU and GPU configurations

Measurement Approach

# Benchmarking methodology: warm up first (this also absorbs compilation cost),
# then time repeated forward passes with explicit CUDA synchronization.
import time
import torch

for _ in range(warmup_iterations):
    model(input_data)  # Warmup phase

if torch.cuda.is_available():
    torch.cuda.synchronize()
start_time = time.perf_counter()
for _ in range(measurement_iterations):
    model(input_data)
if torch.cuda.is_available():
    torch.cuda.synchronize()
end_time = time.perf_counter()

📊 Results and Analysis

Performance Summary

| Configuration | Prefill Benefit | Decode Benefit | Recommendation |
| --- | --- | --- | --- |
| Eager Mode | Baseline | Baseline | Development/debugging |
| Compile Decode Only | Minimal overhead | 15-35% faster | Single sequences and small batches |
| Compile Full Model | High overhead | 15-35% faster | Large batches only (32+) |
| Dynamic Shapes | Prevents recompilation | Maintained performance | Variable input lengths |

Critical Insights

  1. Compilation Timing Matters: When you compile affects performance more than what you compile
  2. Batch Size Threshold: Benefits become significant at larger batch sizes (≥32 in this study)
  3. Dynamic Shapes Essential: Static compilation causes excessive recompilation with variable inputs
  4. Operator Fusion Benefits: Vertical fusion (e.g., SiLU + multiplication) provides substantial memory efficiency gains

Access detailed results in the results/ directory and use the visualization notebook for custom analysis.

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

📚 Citation

If you use this work in your research, please cite:

@misc{szablewski2024torch-compile-benchmarks,
  title={Studying the Overhead of torch.compile() in Large Language Model Inference},
  author={Andrzej Szablewski},
  year={2024},
  url={https://github.com/TheRootOf3/torch-compile-benchmarks}
}

🙏 Acknowledgments


This research was conducted as part of coursework for R244: Large Scale Data Optimisation and Processing at the University of Cambridge.
