# Torch Inductor Backend Guide

The Torch Inductor backend uses PyTorch's built-in compiler (`torch.compile` with `backend="inductor"`) for model tuning. It provides automatic kernel fusion and optimization without external dependencies.
## Overview
- Pure PyTorch: No external dependencies
- Automatic Optimization: Kernel fusion and code generation
- Multiple Modes: Default, reduce-overhead, max-autotune
- Dynamic Shapes: Configurable dynamic shape support
- Cross-Platform: Works on CPU and CUDA
## Quick Start

```python
import torch

import aitune.torch as ait
from aitune.torch.backend import TorchInductorBackend, TorchInductorBackendConfig
from aitune.torch.tune_strategy import OneBackendStrategy

# Configure backend
config = TorchInductorBackendConfig(mode="max-autotune")
backend = TorchInductorBackend(config)

# Use in tuning
strategy = OneBackendStrategy(backend=backend)
wrapped_model = ait.Module(model, "my-model", strategy=strategy)
ait.tune(wrapped_model, input_data)
```
## Configuration Options

### TorchInductorBackendConfig

```python
@dataclass
class TorchInductorBackendConfig(BackendConfig):
    fullgraph: bool = False
    dynamic: bool | None = None
    mode: str | None = None
    options: dict | None = None
    autocast_enabled: bool = False
    autocast_dtype: torch.dtype | None = None
```
### mode

Predefined optimization modes:

```python
# Default mode (balanced)
config = TorchInductorBackendConfig(mode="default")

# Reduce Python overhead with CUDA graphs
config = TorchInductorBackendConfig(mode="reduce-overhead")

# Maximum auto-tuning
config = TorchInductorBackendConfig(mode="max-autotune")

# Max autotune without CUDA graphs
config = TorchInductorBackendConfig(mode="max-autotune-no-cudagraphs")
```
Mode details:

- `default`: Good balance, general purpose
- `reduce-overhead`: Uses CUDA graphs for small batches, reduces Python overhead
- `max-autotune`: Leverages Triton for matmul/conv, enables CUDA graphs
- `max-autotune-no-cudagraphs`: Like `max-autotune` but without CUDA graphs
### fullgraph

Require complete graph capture:

```python
config = TorchInductorBackendConfig(
    fullgraph=True,  # Error if graph breaks occur
    mode="max-autotune",
)
```
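With `fullgraph=True`, compilation fails loudly instead of silently splitting the model into subgraphs. A sketch of the behavior using plain `torch.compile` (the branching function is illustrative):

```python
import torch

def branchy(x):
    # Data-dependent Python branching forces a graph break
    if x.sum() > 0:
        return x + 1
    return x - 1

strict = torch.compile(branchy, fullgraph=True)

raised = False
try:
    strict(torch.randn(8))
except Exception:
    raised = True  # fullgraph=True surfaces the break as an error
print("graph break raised:", raised)
```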
### dynamic

Control dynamic shape behavior:

```python
# Always generate dynamic kernels
config = TorchInductorBackendConfig(dynamic=True)

# Never generate dynamic kernels (always specialize)
config = TorchInductorBackendConfig(dynamic=False)

# Auto-detect (default)
config = TorchInductorBackendConfig(dynamic=None)
```
### options

Custom inductor options:

```python
# See all options: torch._inductor.list_options()
config = TorchInductorBackendConfig(
    options={
        "triton.cudagraphs": True,
        "max_autotune": True,
        "coordinate_descent_tuning": True,
    }
)
```

Note: `mode` and `options` are mutually exclusive; specify at most one.
### autocast

Enable automatic mixed precision:

```python
config = TorchInductorBackendConfig(
    mode="max-autotune",
    autocast_enabled=True,
    autocast_dtype=torch.float16,
)
```
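The two autocast fields mirror PyTorch's `torch.autocast` context manager. The equivalent in plain PyTorch (CPU and bfloat16 chosen here only so the sketch runs without a GPU):

```python
import torch

x, w = torch.randn(8, 8), torch.randn(8, 8)
with torch.autocast(device_type="cpu", dtype=torch.bfloat16):
    out = x @ w  # matmul runs in reduced precision under autocast
print(out.dtype)  # torch.bfloat16
```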
## Debugging

### Enable Logging

```python
# Set environment variables before running
import os
os.environ["TORCH_LOGS"] = "dynamic,perf_hints,graph_breaks"

# Then run tuning
ait.tune(wrapped_model, input_data)
```
### Check Optimizations

```python
import torch

# See what each mode does
print(torch._inductor.list_mode_options())

# See all available options
print(torch._inductor.list_options())
```
## Best Practices

- Start with `max-autotune`: Best performance for most models
- Use `reduce-overhead`: For latency-critical applications
- Enable autocast: FP16 often gives a sizeable speedup at a small precision cost
- Dynamic shapes: Only when necessary (adds overhead)
- Warmup: Run a few iterations before benchmarking
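The warmup advice can be sketched as a small timing helper (the helper name and workload are illustrative):

```python
import time
import torch

def benchmark(fn, x, iters=50, warmup=5):
    for _ in range(warmup):  # absorb compilation / autotuning cost
        fn(x)
    start = time.perf_counter()
    for _ in range(iters):
        fn(x)
    return (time.perf_counter() - start) / iters  # mean seconds per call

mean_s = benchmark(lambda t: torch.relu(t), torch.randn(1024))
print(f"{mean_s * 1e6:.1f} us/iter")
```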
## Troubleshooting

### Issue: Graph breaks

Solution: Use `fullgraph=False` (the default) to allow partial compilation, and check the logs (e.g. `TORCH_LOGS=graph_breaks`) to see where breaks occur.
### Issue: Slow compilation

Solution: Reduce auto-tuning, e.g. switch from `max-autotune` to `default` or `max-autotune-no-cudagraphs`.
### Issue: Not using CUDA graphs

Check the logs (e.g. `TORCH_LOGS=perf_hints`) for the reason CUDA graphs were skipped. Common causes: input mutations, unsupported operations.
### Issue: Variable shape recompilations

Solution: Enable dynamic shapes with `TorchInductorBackendConfig(dynamic=True)`.
## Comparison with Other Backends
| Feature | Inductor | TensorRT | TorchAO |
|---|---|---|---|
| Dependencies | None | TensorRT | torchao |
| Setup | Easy | Moderate | Easy |
| Performance | Good | Excellent | Good |
| Quantization | Limited | Advanced | Extensive |
| Portability | Excellent | NVIDIA only | Good |
## Next Steps
- Compare with TensorRT Backend for maximum performance
- Explore TorchAO Backend for quantization
- Learn about Tune Strategies
- Review Deployment Guide