# TorchAO Backend Guide
The TorchAO backend leverages PyTorch's torchao library for quantization-based model tuning. It supports both weight-only and dynamic quantization in INT8 and FP8 precisions.
## Overview
- Weight-Only Quantization: INT8, FP8
- Dynamic Quantization: INT8 and FP8 with dynamic activations
- Easy Configuration: Predefined quantization types
- Pure PyTorch: No external dependencies beyond torchao
## Quick Start
```python
import aitune.torch as ait
from aitune.torch.backend import TorchAOBackend, TorchAOBackendConfig

# Configure with FP8 weight-only quantization
config = TorchAOBackendConfig(quantization="fp8wo")
backend = TorchAOBackend(config)

# Use in tuning
strategy = ait.OneBackendStrategy(backend=backend)
model = ait.Module(model, "my-model", strategy=strategy)
ait.tune(model, input_data)
```
## Quantization Types
### Weight-Only Quantization
```python
# INT8 weight-only
config = TorchAOBackendConfig(quantization="int8wo")

# FP8 weight-only (default)
config = TorchAOBackendConfig(quantization="fp8wo")
```
### Dynamic Quantization
```python
# INT8 dynamic (activations + weights)
config = TorchAOBackendConfig(quantization="int8dq")

# FP8 dynamic (activations + weights)
config = TorchAOBackendConfig(quantization="fp8dq")
```
## Configuration Options
### Using Predefined Types
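Each predefined type is selected by passing its name string to the config; a minimal sketch using the `quantization` names listed in this guide:

```python
from aitune.torch.backend import TorchAOBackendConfig

# Any of the predefined names: "int8wo", "fp8wo" (default), "int8dq", "fp8dq"
config = TorchAOBackendConfig(quantization="int8dq")
```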
### Custom Configuration
```python
from torchao.quantization import Int8WeightOnlyConfig

from aitune.torch.backend import TorchAOBackendConfig

# Pass a torchao config object directly instead of a predefined name
custom_config = Int8WeightOnlyConfig()
config = TorchAOBackendConfig(
    quantization_config=custom_config,
)
```
## Quantization Comparison
| Type | Weights | Activations | Memory Reduction | Speed | Accuracy |
|---|---|---|---|---|---|
| int8wo | INT8 | FP16/FP32 | ~2x | High | Better |
| int8dq | INT8 | INT8 | ~2x | Very High | Good |
| fp8wo | FP8 | FP16/FP32 | ~2x | Very High | Excellent |
| fp8dq | FP8 | FP8 | ~2x | Very High | Excellent |
## Best Practices
- Start with FP8: Best accuracy/performance trade-off
- Use INT8 for Memory: When memory is critical
- Dynamic Quantization: Better accuracy, slightly higher overhead
- Validate Accuracy: Always test quantized model accuracy
- Calibration Data: Use representative samples
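To validate accuracy as recommended above, one option is to compare the quantized model's outputs against a full-precision reference on representative inputs; `max_relative_error` below is a hypothetical helper, not part of aitune or torchao:

```python
import torch


def max_relative_error(ref: torch.Tensor, quant: torch.Tensor) -> float:
    """Largest element-wise relative error between reference and quantized outputs."""
    denom = ref.abs().clamp_min(1e-6)  # avoid division by zero
    return ((ref - quant).abs() / denom).max().item()


# Hypothetical usage with real models:
#   ref_out = model_fp32(x)
#   quant_out = model_quantized(x)
ref = torch.tensor([1.0, 2.0, 4.0])
quant = torch.tensor([1.01, 1.98, 4.0])
err = max_relative_error(ref, quant)
assert err < 0.05  # tune the tolerance to your accuracy budget
```

The tolerance threshold depends on the task; classification models often tolerate larger element-wise error than regression models.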
## Troubleshooting
**Issue:** Accuracy loss too high

**Solution:** Try less aggressive quantization:
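For example, weight-only quantization keeps activations in full precision, which typically preserves more accuracy than dynamic quantization (a sketch using the config API shown above):

```python
from aitune.torch.backend import TorchAOBackendConfig

# Weight-only FP8 leaves activations in FP16/FP32,
# usually the most accurate option in the comparison table
config = TorchAOBackendConfig(quantization="fp8wo")
```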
**Issue:** Not enough speed improvement

**Solution:** Try dynamic quantization:
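Quantizing activations as well as weights can unlock faster low-precision kernels, at some cost in accuracy (a sketch using the config API shown above):

```python
from aitune.torch.backend import TorchAOBackendConfig

# Dynamic quantization: both activations and weights in INT8
config = TorchAOBackendConfig(quantization="int8dq")
```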
## Next Steps
- Learn about TensorRT Backend for maximum performance
- Compare with Torch Inductor Backend
- Review Deployment Guide