TensorRT Backend Guide
The TensorRT backend provides highly optimized inference using NVIDIA's TensorRT engine. It offers the best performance for production deployments on NVIDIA GPUs and integrates with TensorRT Model Optimizer for advanced quantization workflows.
Overview
The TensorRT backend offers:
- High Performance: Maximum inference speed on NVIDIA GPUs
- Dynamic Shapes: Supports optimization profiles for variable input sizes
- Quantization: INT8, FP16, and mixed precision support
- CUDA Graphs: Optional CUDA graph capture for reduced CPU overhead
- Model Optimizer Integration: Advanced quantization via TensorRT Model Optimizer
- Flexible Export: Supports both Dynamo and script-based ONNX export
Quick Start
Basic Usage
from aitune.torch.backend import TensorRTBackend, TensorRTBackendConfig, ONNXAutoCastConfig, ONNXQuantizationConfig, TorchQuantizationConfig
import aitune.torch as ait
# Configure TensorRT backend
config = TensorRTBackendConfig(use_dynamo=True)
backend = TensorRTBackend(config)
# Use with tuning
from aitune.torch.tune_strategy import OneBackendStrategy
strategy = OneBackendStrategy(backend=backend)
model = ait.Module(model, "my-model", strategy=strategy)
ait.tune(model, input_data)
With FP16 Precision
config = TensorRTBackendConfig(
quantization_config=ONNXAutoCastConfig(precision="fp16"),
workspace_size=1 << 30, # 1GB workspace
)
backend = TensorRTBackend(config)
With CUDA Graphs
config = TensorRTBackendConfig(
use_cuda_graphs=True, # Enable CUDA graphs
)
backend = TensorRTBackend(config)
Configuration Options
TensorRTBackendConfig
@dataclass
class TensorRTBackendConfig(BackendConfig):
use_dynamo: bool = True
workspace_size: int | None = None
opset_version: int | None = None
optimization_level: int | None = None
compatibility_level: int | None = None
timing_cache: Path | None = None
profiles: ProfileMode | list[TensorRTProfile] = ProfileMode.SINGLE
device: str = "cuda"
quantization_config: ONNXAutoCastConfig | ONNXQuantizationConfig | TorchQuantizationConfig | None = None
enable_tf32: bool = True
use_cuda_graphs: bool = False
use_dynamo
Use torch.dynamo for ONNX export (recommended).
# Use Dynamo export (recommended)
config = TensorRTBackendConfig(use_dynamo=True)
# Use script-based export (fallback)
config = TensorRTBackendConfig(use_dynamo=False)
When to use:
- True (default): Better compatibility with modern PyTorch models
- False: Legacy models or when Dynamo export fails
workspace_size
Maximum memory workspace for TensorRT engine building.
config = TensorRTBackendConfig(
workspace_size=1 << 30, # 1GB
)
# Or larger for complex models
config = TensorRTBackendConfig(
workspace_size=4 << 30, # 4GB
)
Guidelines:
- Default: TensorRT chooses automatically
- Larger workspace → more optimization opportunities, but longer build time
- Recommended: 1-4GB for most models
opset_version
ONNX opset version for export.
Guidelines:
- Default: Latest stable opset
- Specify only if you need a particular opset for compatibility
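For example, to pin the export to a specific opset (the version shown here is illustrative; use the one your ONNX/TensorRT toolchain is validated against):

```python
# Pin the ONNX opset used during export. 17 is an example value —
# choose the opset your deployment toolchain supports.
config = TensorRTBackendConfig(
    opset_version=17,
)
```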
optimization_level
TensorRT builder optimization level (0-5).
Levels:
- 0: No optimization
- 3: Default (balanced)
- 5: Maximum optimization (longer build time)
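A sketch of raising the builder effort for a final production build, at the cost of a longer engine build:

```python
# Trade longer engine-build time for a more aggressively optimized engine.
config = TensorRTBackendConfig(
    optimization_level=5,  # maximum optimization
)
```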
compatibility_level
Hardware compatibility level for the engine.
import tensorrt as trt
config = TensorRTBackendConfig(
compatibility_level=trt.HardwareCompatibilityLevel.AMPERE_PLUS,
)
Options:
- None: Optimized for the current GPU only
- Specific level: Portable across compatible GPUs
timing_cache
Path to timing cache for faster subsequent builds.
from pathlib import Path
config = TensorRTBackendConfig(
timing_cache=Path("/path/to/timing_cache.bin"),
)
Benefits:
- Faster engine rebuilds
- Reuse timing information across builds
- Especially useful during development
profiles
Optimization profiles for dynamic shapes.
from aitune.torch.backend.tensorrt import ProfileMode, TensorRTProfile
# Single profile (default)
config = TensorRTBackendConfig(
profiles=ProfileMode.SINGLE,
)
# Multiple profiles from samples
config = TensorRTBackendConfig(
profiles=ProfileMode.SAMPLES_USED,
)
# Custom profiles
config = TensorRTBackendConfig(
profiles=[
TensorRTProfile()
.add_input_shape("input", (1, 3, 224, 224), (4, 3, 224, 224), (8, 3, 224, 224)),
]
)
See Optimization Profiles section for details.
device
Device for TensorRT engine.
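For example, to target a specific GPU (assuming standard PyTorch-style device strings are accepted):

```python
# Build and run the TensorRT engine on the second GPU.
config = TensorRTBackendConfig(
    device="cuda:1",
)
```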
quantization_config
TensorRT backend supports multiple quantization methods through TensorRT Model Optimizer integration.
config = TensorRTBackendConfig(
quantization_config=ONNXAutoCastConfig(precision="fp16"),
)
# or
config = TensorRTBackendConfig(
quantization_config=ONNXQuantizationConfig(precision="fp16"),
)
# or
config = TensorRTBackendConfig(
quantization_config=TorchQuantizationConfig(quantization_config="FP8_DEFAULT_CFG"),
)
For detailed information, see the TensorRT Model Optimizer documentation.
enable_tf32
Enable TF32 tensor cores on Ampere+ GPUs.
Benefits:
- Faster FP32 operations on Ampere and newer GPUs
- No accuracy loss for most models
- Recommended to keep enabled
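If you need strict IEEE FP32 arithmetic (for example, when comparing outputs numerically against a CPU reference), TF32 can be turned off:

```python
# Disable TF32 tensor cores to keep bit-accurate FP32 arithmetic.
config = TensorRTBackendConfig(
    enable_tf32=False,
)
```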
use_cuda_graphs
Enable CUDA graph capture for inference.
Benefits:
- Reduced CPU overhead
- Better performance for small models
- Automatic re-capture on shape changes
Limitations:
- First inference is slower (graph capture)
- Shape changes trigger re-capture
- Not beneficial for very large models
Optimization Profiles
Optimization profiles define the range of input shapes TensorRT will optimize for. They are essential for models with dynamic input sizes.
Profile Modes
SINGLE (Default)
Automatically generates a single profile from recorded samples:
- Min shape: Minimum observed across all samples
- Opt shape: Most common shape
- Max shape: Maximum observed across all samples
SAMPLES_USED
Generates one profile per unique input shape:
Important: Increase max_num_samples_stored:
from aitune.torch.config import config as global_config
global_config.max_num_samples_stored = 100 # Or float("inf")
Use case: When you have distinct input shape categories that need separate optimization.
Custom Profiles
Define exact optimization profiles:
from aitune.torch.backend.tensorrt import TensorRTProfile
profiles = [
# Profile for small inputs
TensorRTProfile()
.add_input_shape(
"args_0",
min_shape=(1, 3, 224, 224),
opt_shape=(4, 3, 224, 224),
max_shape=(8, 3, 224, 224),
),
# Profile for large inputs
TensorRTProfile()
.add_input_shape(
"args_0",
min_shape=(1, 3, 512, 512),
opt_shape=(4, 3, 512, 512),
max_shape=(8, 3, 512, 512),
),
]
config = TensorRTBackendConfig(profiles=profiles)
Finding Input Names
Input tensor names are shown in tuning logs:
INFO - Tuning graph `0` for module `my-model`:
INFO - graph_spec:
INFO - input_spec:
Tensors:
╒═══════════╤════════╤═══════════════════════════════╤══════════════════╤══════════════════╤═══════════════╕
│ Locator   │ Name   │ Shape                         │ Min Shape        │ Max Shape        │ Dtype         │
╞═══════════╪════════╪═══════════════════════════════╪══════════════════╪══════════════════╪═══════════════╡
│ [0]       │ args_0 │ ['batch0', 3, 'dim2', 'dim3'] │ [2, 3, 224, 224] │ [8, 3, 448, 448] │ torch.float32 │
╘═══════════╧════════╧═══════════════════════════════╧══════════════════╧══════════════════╧═══════════════╛
Use the Name column value (e.g., args_0) in your profiles.
Best Practices for Profiles
- Min ≤ Opt ≤ Max: Ensure min ≤ opt ≤ max for all dimensions
- Opt = Typical: Set opt to your most common input size
- Range Coverage: Ensure your runtime inputs fall within [min, max]
- Multiple Profiles: Use for distinct size categories, not slight variations
- Test Runtime Shapes: Verify your production shapes are covered
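The first rule above is easy to get wrong with long shape tuples. A hypothetical helper (not part of aitune) can sanity-check a profile before you build with it:

```python
# Hypothetical helper (not part of the aitune API): verify that each
# dimension of a profile satisfies min <= opt <= max.
def validate_profile_shapes(min_shape, opt_shape, max_shape):
    """Return True if min <= opt <= max holds for every dimension."""
    if not (len(min_shape) == len(opt_shape) == len(max_shape)):
        return False  # mismatched ranks can never form a valid profile
    return all(
        lo <= opt <= hi
        for lo, opt, hi in zip(min_shape, opt_shape, max_shape)
    )

# A valid profile range...
print(validate_profile_shapes((1, 3, 224, 224), (4, 3, 224, 224), (8, 3, 224, 224)))  # True
# ...and an invalid one (opt batch exceeds max batch)
print(validate_profile_shapes((1, 3, 224, 224), (16, 3, 224, 224), (8, 3, 224, 224)))  # False
```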
Troubleshooting
Issue: ONNX export fails
Solution: Try disabling Dynamo export:
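For example:

```python
# Fall back to script-based ONNX export when Dynamo export fails.
config = TensorRTBackendConfig(use_dynamo=False)
```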
Issue: Engine build fails due to memory
Solution: Reduce workspace size:
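For example:

```python
# Cap the builder workspace so it fits alongside other GPU allocations.
config = TensorRTBackendConfig(
    workspace_size=512 << 20,  # 512MB
)
```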
Issue: Runtime shape not supported
Error: Input shape X exceeds max profile shape Y
Solution: Update profiles to cover your runtime shapes:
profiles = [
TensorRTProfile()
.add_input_shape("args_0", min_shape=(1, 3, 224, 224), opt_shape=(4, 3, 224, 224), max_shape=(16, 3, 224, 224))
]
config = TensorRTBackendConfig(profiles=profiles)
Issue: Slow first inference
Cause: This is expected when using CUDA graphs (graph capture overhead).
Solution: Warmup with a few inference calls before measuring performance.
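A minimal warmup sketch; here `model` and `input_data` stand in for your tuned module and a representative input:

```python
import torch

# Run a few untimed forward passes so CUDA graph capture and other
# one-time initialization happen before measurement.
with torch.no_grad():
    for _ in range(5):  # a handful of calls is typically enough
        model(input_data)
torch.cuda.synchronize()  # wait for warmup work to finish before timing
```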
Issue: INT8 accuracy drop
Solution: Try different quantization algorithms:
# Try 'minmax' or 'entropy' instead of 'max'
quantization_config = QuantizationConfig(
algorithm="entropy",
quant_format="int8",
)
Best Practices
- Use FP16: Enable FP16 precision for near-peak performance with minimal accuracy loss on most models
- Enable TF32: Keep `enable_tf32=True` on Ampere+ GPUs
- Profile Carefully: Ensure optimization profiles cover all runtime shapes
- Timing Cache: Use timing cache during development for faster iteration
- CUDA Graphs: Enable for latency-sensitive small models
- Workspace Size: Start with 1-2 GB and increase if the build fails
- Quantization: Validate accuracy with a representative test set
Next Steps
- Learn about Torch-TensorRT JIT Backend
- Learn about Torch-TensorRT AOT Backend
- Explore Tune Strategies
- Review Deployment Guide