# Torch Inductor Backend Guide

The Torch Inductor backend uses PyTorch's built-in compiler (`torch.compile` with `backend="inductor"`) for model tuning. It provides automatic kernel fusion and optimization without external dependencies.
## Overview
- Pure PyTorch: No external dependencies
- Automatic Optimization: Kernel fusion and code generation
- Multiple Modes: Default, reduce-overhead, max-autotune
- Dynamic Shapes: Configurable dynamic shape support
- Cross-Platform: Works on CPU and CUDA
## Quick Start

```python
import torch

import aitune.torch as ait
from aitune.torch.backend import TorchInductorBackend, TorchInductorBackendConfig
from aitune.torch.tune_strategy import OneBackendStrategy

# Configure backend
config = TorchInductorBackendConfig(mode="max-autotune")
backend = TorchInductorBackend(config)

# Use in tuning
strategy = OneBackendStrategy(backend=backend)
wrapped_model = ait.Module(model, "my-model", strategy=strategy)
ait.tune(wrapped_model, input_data)
```
## Configuration Options

### TorchInductorBackendConfig

```python
@dataclass
class TorchInductorBackendConfig(BackendConfig):
    fullgraph: bool = False
    dynamic: bool | None = None
    mode: str | None = None
    options: dict | None = None
    autocast_enabled: bool = False
    autocast_dtype: torch.dtype | None = None
```
### mode

Predefined optimization modes:

```python
# Default mode (balanced)
config = TorchInductorBackendConfig(mode="default")

# Reduce Python overhead with CUDA graphs
config = TorchInductorBackendConfig(mode="reduce-overhead")

# Maximum auto-tuning
config = TorchInductorBackendConfig(mode="max-autotune")

# Max autotune without CUDA graphs
config = TorchInductorBackendConfig(mode="max-autotune-no-cudagraphs")
```
Mode details:

- `default`: Good balance, general purpose
- `reduce-overhead`: Uses CUDA graphs for small batches, reduces Python overhead
- `max-autotune`: Leverages Triton for matmul/conv, enables CUDA graphs
- `max-autotune-no-cudagraphs`: Like `max-autotune` but without CUDA graphs
### fullgraph

Require complete graph capture:

```python
config = TorchInductorBackendConfig(
    fullgraph=True,  # Error if graph breaks occur
    mode="max-autotune",
)
```
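With `fullgraph=True`, compilation fails loudly instead of silently splitting the model into subgraphs. A sketch of the behavior using plain `torch.compile` (the branching function is illustrative):

```python
import torch

def branchy(x):
    # Data-dependent Python branching forces a graph break
    if x.sum() > 0:
        return x + 1
    return x - 1

strict = torch.compile(branchy, fullgraph=True)

raised = False
try:
    strict(torch.randn(8))
except Exception:
    raised = True  # fullgraph=True surfaces the break as an error
print("graph break raised:", raised)
```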
### dynamic

Control dynamic shape behavior:

```python
# Always generate dynamic kernels
config = TorchInductorBackendConfig(dynamic=True)

# Never generate dynamic kernels (always specialize)
config = TorchInductorBackendConfig(dynamic=False)

# Auto-detect (default)
config = TorchInductorBackendConfig(dynamic=None)
```
### options

Custom inductor options:

```python
# See all options: torch._inductor.list_options()
config = TorchInductorBackendConfig(
    options={
        "triton.cudagraphs": True,
        "max_autotune": True,
        "coordinate_descent_tuning": True,
    }
)
```

Note: `mode` and `options` are mutually exclusive; specify at most one.
### autocast

Enable automatic mixed precision:

```python
config = TorchInductorBackendConfig(
    mode="max-autotune",
    autocast_enabled=True,
    autocast_dtype=torch.float16,
)
```
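The two autocast fields mirror PyTorch's `torch.autocast` context manager. The equivalent in plain PyTorch (CPU and bfloat16 chosen here only so the sketch runs without a GPU):

```python
import torch

x, w = torch.randn(8, 8), torch.randn(8, 8)
with torch.autocast(device_type="cpu", dtype=torch.bfloat16):
    out = x @ w  # matmul runs in reduced precision under autocast
print(out.dtype)  # torch.bfloat16
```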
## Debugging

### Enable Logging

```python
# Set environment variables before running
import os
os.environ["TORCH_LOGS"] = "dynamic,perf_hints,graph_breaks"

# Then run tuning
ait.tune(wrapped_model, input_data)
```
### Check Optimizations

```python
import torch

# See what each mode does
print(torch._inductor.list_mode_options())

# See all available options
print(torch._inductor.list_options())
```
## Best Practices

- Start with `max-autotune`: Best performance for most models
- Use `reduce-overhead`: For latency-critical applications
- Enable autocast: FP16 often gives a sizeable speedup at a small precision cost
- Dynamic shapes: Only when necessary (adds overhead)
- Warmup: Run a few iterations before benchmarking
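The warmup advice can be sketched as a small timing helper (the helper name and workload are illustrative):

```python
import time
import torch

def benchmark(fn, x, iters=50, warmup=5):
    for _ in range(warmup):  # absorb compilation / autotuning cost
        fn(x)
    start = time.perf_counter()
    for _ in range(iters):
        fn(x)
    return (time.perf_counter() - start) / iters  # mean seconds per call

mean_s = benchmark(lambda t: torch.relu(t), torch.randn(1024))
print(f"{mean_s * 1e6:.1f} us/iter")
```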
## Troubleshooting

### Issue: Graph breaks

Solution: Use `fullgraph=False` (the default) to allow partial compilation, and check the logs (e.g. `TORCH_LOGS=graph_breaks`) to see where breaks occur.
### Issue: Slow compilation

Solution: Reduce auto-tuning, e.g. switch from `max-autotune` to `default` or `max-autotune-no-cudagraphs`.
### Issue: Not using CUDA graphs

Check the logs (e.g. `TORCH_LOGS=perf_hints`) for the reason CUDA graphs were skipped. Common causes: input mutations, unsupported operations.
### Issue: Variable shape recompilations

Solution: Enable dynamic shapes with `TorchInductorBackendConfig(dynamic=True)`.
## Comparison with Other Backends
| Feature | Inductor | TensorRT | TorchAO |
|---|---|---|---|
| Dependencies | None | TensorRT | torchao |
| Setup | Easy | Moderate | Easy |
| Performance | Good | Excellent | Good |
| Quantization | Limited | Advanced | Extensive |
| Portability | Excellent | NVIDIA only | Good |
## Next Steps
- Compare with TensorRT Backend for maximum performance
- Explore TorchAO Backend for quantization
- Learn about Tune Strategies
- Review Deployment Guide