TensorRT Backend Guide

The TensorRT backend provides highly optimized inference using NVIDIA's TensorRT engine. It offers the best performance for production deployments on NVIDIA GPUs and seamlessly integrates TensorRT Model Optimizer for advanced quantization workflows.

Overview

The TensorRT backend offers:

  • High Performance: Maximum inference speed on NVIDIA GPUs
  • Dynamic Shapes: Supports optimization profiles for variable input sizes
  • Quantization: INT8, FP16, and mixed precision support
  • CUDA Graphs: Optional CUDA graph capture for reduced CPU overhead
  • Model Optimizer Integration: Advanced quantization via TensorRT Model Optimizer
  • Flexible Export: Supports both Dynamo and script-based ONNX export

Quick Start

Basic Usage

from aitune.torch.backend import (
    TensorRTBackend,
    TensorRTBackendConfig,
    ONNXAutoCastConfig,
    ONNXQuantizationConfig,
    TorchQuantizationConfig,
)
import aitune.torch as ait

# Configure TensorRT backend
config = TensorRTBackendConfig(use_dynamo=True)
backend = TensorRTBackend(config)

# Use with tuning
from aitune.torch.tune_strategy import OneBackendStrategy
strategy = OneBackendStrategy(backend=backend)

model = ait.Module(model, "my-model", strategy=strategy)
ait.tune(model, input_data)

With FP16 Precision

config = TensorRTBackendConfig(
    quantization_config=ONNXAutoCastConfig(precision="fp16"),
    workspace_size=1 << 30,  # 1GB workspace
)
backend = TensorRTBackend(config)

With CUDA Graphs

config = TensorRTBackendConfig(
    use_cuda_graphs=True,  # Enable CUDA graphs
)
backend = TensorRTBackend(config)

Configuration Options

TensorRTBackendConfig

@dataclass
class TensorRTBackendConfig(BackendConfig):
    use_dynamo: bool = True
    workspace_size: int | None = None
    opset_version: int | None = None
    optimization_level: int | None = None
    compatibility_level: int | None = None
    timing_cache: Path | None = None
    profiles: ProfileMode | list[TensorRTProfile] = ProfileMode.SINGLE
    device: str = "cuda"
    quantization_config: ONNXAutoCastConfig | ONNXQuantizationConfig | TorchQuantizationConfig | None = None
    enable_tf32: bool = True
    use_cuda_graphs: bool = False

use_dynamo

Use torch.dynamo for ONNX export (recommended).

# Use Dynamo export (recommended)
config = TensorRTBackendConfig(use_dynamo=True)

# Use script-based export (fallback)
config = TensorRTBackendConfig(use_dynamo=False)

When to use:

  • True (default): Better compatibility with modern PyTorch models
  • False: Legacy models or when Dynamo export fails

workspace_size

Maximum memory workspace for TensorRT engine building.

config = TensorRTBackendConfig(
    workspace_size=1 << 30,  # 1GB
)

# Or larger for complex models
config = TensorRTBackendConfig(
    workspace_size=4 << 30,  # 4GB
)

Guidelines:

  • Default: TensorRT chooses automatically
  • Larger workspace → more optimization opportunities → longer build time
  • Recommended: 1-4GB for most models
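The bit-shift expressions used in these examples are plain byte counts. A quick sanity check of the arithmetic (pure Python, independent of TensorRT):

```python
# Byte sizes expressed as bit shifts, as used for workspace_size above.
KIB = 1 << 10  # 1,024 bytes
MIB = 1 << 20  # 1,048,576 bytes
GIB = 1 << 30  # 1,073,741,824 bytes

print(1 << 30)    # 1073741824  (the "1GB" example)
print(4 << 30)    # 4294967296  (the "4GB" example)
print(512 << 20)  # 536870912   (512MB)
```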

opset_version

ONNX opset version for export.

config = TensorRTBackendConfig(
    opset_version=17,  # Use ONNX opset 17
)

Guidelines:

  • Default: Latest stable opset
  • Specify only if you need a particular opset for compatibility

optimization_level

TensorRT builder optimization level (0-5).

config = TensorRTBackendConfig(
    optimization_level=5,  # Maximum optimization
)

Levels:

  • 0: No optimization
  • 3: Default (balanced)
  • 5: Maximum optimization (longer build time)

compatibility_level

Hardware compatibility level for the engine.

import tensorrt as trt

config = TensorRTBackendConfig(
    compatibility_level=trt.HardwareCompatibilityLevel.AMPERE_PLUS,
)

Options:

  • None: Optimized for current GPU
  • Specific level: Portable across compatible GPUs

timing_cache

Path to timing cache for faster subsequent builds.

from pathlib import Path

config = TensorRTBackendConfig(
    timing_cache=Path("/path/to/timing_cache.bin"),
)

Benefits:

  • Faster engine rebuilds
  • Reuse timing information across builds
  • Especially useful during development

profiles

Optimization profiles for dynamic shapes.

from aitune.torch.backend.tensorrt import ProfileMode, TensorRTProfile

# Single profile (default)
config = TensorRTBackendConfig(
    profiles=ProfileMode.SINGLE,
)

# Multiple profiles from samples
config = TensorRTBackendConfig(
    profiles=ProfileMode.SAMPLES_USED,
)

# Custom profiles
config = TensorRTBackendConfig(
    profiles=[
        TensorRTProfile()
            .add_input_shape("input", (1, 3, 224, 224), (4, 3, 224, 224), (8, 3, 224, 224)),
    ]
)

See Optimization Profiles section for details.

device

Device for TensorRT engine.

config = TensorRTBackendConfig(
    device="cuda",  # Default
)

quantization_config

TensorRT backend supports multiple quantization methods through TensorRT Model Optimizer integration.

config = TensorRTBackendConfig(
    quantization_config=ONNXAutoCastConfig(precision="fp16"),
)

# or

config = TensorRTBackendConfig(
    quantization_config=ONNXQuantizationConfig(precision="fp16"),
)

# or

config = TensorRTBackendConfig(
    quantization_config=TorchQuantizationConfig(quantization_config="FP8_DEFAULT_CFG"),
)

For detailed information, see the Model Optimizer documentation.

enable_tf32

Enable TF32 tensor cores on Ampere+ GPUs.

config = TensorRTBackendConfig(
    enable_tf32=True,  # Default
)

Benefits:

  • Faster FP32 operations on Ampere and newer GPUs
  • No accuracy loss for most models
  • Recommended to keep enabled

use_cuda_graphs

Enable CUDA graph capture for inference.

config = TensorRTBackendConfig(
    use_cuda_graphs=True,
)

Benefits:

  • Reduced CPU overhead
  • Better performance for small models
  • Automatic re-capture on shape changes

Limitations:

  • First inference is slower (graph capture)
  • Shape changes trigger re-capture
  • Not beneficial for very large models

Optimization Profiles

Optimization profiles define the range of input shapes TensorRT will optimize for. They are essential for models with dynamic input sizes.

Profile Modes

SINGLE (Default)

Automatically generates a single profile from recorded samples:

config = TensorRTBackendConfig(
    profiles=ProfileMode.SINGLE,
)
  • Min shape: Minimum observed across all samples
  • Opt shape: Most common shape
  • Max shape: Maximum observed across all samples
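The min/opt/max rule above can be sketched in plain Python. This is an illustration of the rule only, not the library's actual implementation; the helper name `derive_single_profile` is hypothetical:

```python
from collections import Counter

def derive_single_profile(sample_shapes):
    """Illustrative sketch: derive (min, opt, max) shapes from observed samples.

    min/max are per-dimension extremes across all samples; opt is the most
    frequently observed full shape.
    """
    mins = tuple(min(dims) for dims in zip(*sample_shapes))
    maxs = tuple(max(dims) for dims in zip(*sample_shapes))
    opt = Counter(sample_shapes).most_common(1)[0][0]
    return mins, opt, maxs

samples = [(1, 3, 224, 224), (4, 3, 224, 224), (4, 3, 224, 224), (8, 3, 224, 224)]
print(derive_single_profile(samples))
# ((1, 3, 224, 224), (4, 3, 224, 224), (8, 3, 224, 224))
```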

SAMPLES_USED

Generates one profile per unique input shape:

config = TensorRTBackendConfig(
    profiles=ProfileMode.SAMPLES_USED,
)

Important: Increase max_num_samples_stored:

from aitune.torch.config import config as global_config

global_config.max_num_samples_stored = 100  # Or float("inf")

Use case: When you have distinct input shape categories that need separate optimization.

Custom Profiles

Define exact optimization profiles:

from aitune.torch.backend.tensorrt import TensorRTProfile

profiles = [
    # Profile for small inputs
    TensorRTProfile()
        .add_input_shape(
            "args_0",
            min_shape=(1, 3, 224, 224),
            opt_shape=(4, 3, 224, 224),
            max_shape=(8, 3, 224, 224),
        ),
    # Profile for large inputs
    TensorRTProfile()
        .add_input_shape(
            "args_0",
            min_shape=(1, 3, 512, 512),
            opt_shape=(4, 3, 512, 512),
            max_shape=(8, 3, 512, 512),
        ),
]

config = TensorRTBackendConfig(profiles=profiles)

Finding Input Names

Input tensor names are shown in tuning logs:

INFO - 🚀 Tuning graph `0` for module `my-model`:
INFO -   graph_spec:
INFO -     input_spec:
 Tensors:
╒═══════════╤════════╤═══════════════════════════════╤══════════════════╤══════════════════╤═══════════════╕
│ Locator   │ Name   │ Shape                         │ Min Shape        │ Max Shape        │ Dtype         │
╞═══════════╪════════╪═══════════════════════════════╪══════════════════╪══════════════════╪═══════════════╡
│ [0]       │ args_0 │ ['batch0', 3, 'dim2', 'dim3'] │ [2, 3, 224, 224] │ [8, 3, 448, 448] │ torch.float32 │
╘═══════════╧════════╧═══════════════════════════════╧══════════════════╧══════════════════╧═══════════════╛

Use the Name column value (e.g., args_0) in your profiles.

Best Practices for Profiles

  1. Min ≤ Opt ≤ Max: Ensure min ≤ opt ≤ max for all dimensions
  2. Opt = Typical: Set opt to your most common input size
  3. Range Coverage: Ensure your runtime inputs fall within [min, max]
  4. Multiple Profiles: Use for distinct size categories, not slight variations
  5. Test Runtime Shapes: Verify your production shapes are covered
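Range coverage (practices 3 and 5 above) is easy to check by hand. A minimal sketch, not part of the aitune API, of validating a runtime shape against a profile's [min, max] range:

```python
def shape_in_profile(shape, min_shape, max_shape):
    """Illustrative check (not an aitune API): does a runtime shape fall
    within a profile's [min, max] range for every dimension?"""
    return (
        len(shape) == len(min_shape) == len(max_shape)
        and all(lo <= d <= hi for d, lo, hi in zip(shape, min_shape, max_shape))
    )

# Profile from the custom-profiles example above
print(shape_in_profile((6, 3, 224, 224), (1, 3, 224, 224), (8, 3, 224, 224)))   # True
print(shape_in_profile((16, 3, 224, 224), (1, 3, 224, 224), (8, 3, 224, 224)))  # False
```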

Troubleshooting

Issue: ONNX export fails

Solution: Try disabling Dynamo export:

config = TensorRTBackendConfig(use_dynamo=False)

Issue: Engine build fails due to memory

Solution: Reduce workspace size:

config = TensorRTBackendConfig(workspace_size=512 << 20)  # 512MB

Issue: Runtime shape not supported

Error: Input shape X exceeds max profile shape Y

Solution: Update profiles to cover your runtime shapes:

profiles = [
    TensorRTProfile()
        .add_input_shape("args_0", min_shape=(1, 3, 224, 224), opt_shape=(4, 3, 224, 224), max_shape=(16, 3, 224, 224))
]
config = TensorRTBackendConfig(profiles=profiles)

Issue: Slow first inference

Cause: This is expected when using CUDA graphs (graph capture overhead).

Solution: Warmup with a few inference calls before measuring performance.
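A minimal warmup-then-measure pattern (a sketch: `infer` stands in for your tuned module's forward call and `sample` for a representative input; the helper name is hypothetical):

```python
import time

def warmup_then_time(infer, sample, warmup_iters=5, timed_iters=20):
    """Run a few untimed warmup calls (graph capture, lazy initialization)
    before measuring average per-call latency."""
    for _ in range(warmup_iters):
        infer(sample)
    start = time.perf_counter()
    for _ in range(timed_iters):
        infer(sample)
    return (time.perf_counter() - start) / timed_iters

# Dummy callable standing in for a tuned module
avg = warmup_then_time(lambda x: x * 2, 3.0)
# avg is the mean per-call latency of the dummy callable, in seconds
```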

Issue: INT8 accuracy drop

Solution: Try different quantization algorithms:

# Try 'minmax' or 'entropy' instead of 'max'
quantization_config = ONNXQuantizationConfig(
    algorithm="entropy",
    quant_format="int8",
)

Best Practices

  1. Use FP16: Enable FP16 precision for best performance without accuracy loss
  2. Enable TF32: Keep enable_tf32=True on Ampere+ GPUs
  3. Profile Carefully: Ensure optimization profiles cover all runtime shapes
  4. Timing Cache: Use timing cache during development for faster iteration
  5. CUDA Graphs: Enable for latency-sensitive small models
  6. Workspace Size: Start with 1-2 GB and increase if the build fails
  7. Quantization: Validate accuracy with a representative test set

Next Steps