TensorRT Backend for `torch.compile`¶

This guide presents the Torch-TensorRT torch.compile backend: a deep learning compiler that uses TensorRT to accelerate JIT-style workflows across a wide variety of models.

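Key Features¶

The primary goal of the Torch-TensorRT torch.compile backend is to enable Just-In-Time compilation workflows by combining the simplicity of the torch.compile API with the performance of TensorRT. Invoking the backend only requires importing the torch_tensorrt package and specifying the backend: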

import torch
import torch_tensorrt
...
optimized_model = torch.compile(model, backend="torch_tensorrt", dynamic=False)

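Note

Many additional customization options are available to the user. They are discussed in more depth later in this guide.

The backend can handle a variety of challenging model structures and offers a simple-to-use interface for accelerating models effectively. It also exposes many customization options so that the compilation process can be fitted to a specific use case.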

Customizable Settings¶

class torch_tensorrt.dynamo.CompilationSettings(
    enabled_precisions: typing.Set[torch_tensorrt._enums.dtype] = <factory>,
    debug: bool = False,
    workspace_size: int = 0,
    min_block_size: int = 5,
    torch_executed_ops: typing.Collection[typing.Union[typing.Callable[[...], typing.Any], str]] = <factory>,
    pass_through_build_failures: bool = False,
    max_aux_streams: typing.Optional[int] = None,
    version_compatible: bool = False,
    optimization_level: typing.Optional[int] = None,
    use_python_runtime: typing.Optional[bool] = False,
    truncate_double: bool = False,
    use_fast_partitioner: bool = True,
    enable_experimental_decompositions: bool = False,
    device: torch_tensorrt._Device.Device = <factory>,
    require_full_compilation: bool = False,
    disable_tf32: bool = False,
    assume_dynamic_shape_support: bool = False,
    sparse_weights: bool = False,
    engine_capability: torch_tensorrt._enums.EngineCapability = <factory>,
    num_avg_timing_iters: int = 1,
    dla_sram_size: int = 1048576,
    dla_local_dram_size: int = 1073741824,
    dla_global_dram_size: int = 536870912,
    dryrun: typing.Union[bool, str] = False,
    hardware_compatible: bool = False,
    timing_cache_path: str = '/tmp/torch_tensorrt_engine_cache/timing_cache.bin',
    lazy_engine_init: bool = False,
    cache_built_engines: bool = False,
    reuse_cached_engines: bool = False,
    use_explicit_typing: bool = False,
    use_fp32_acc: bool = False,
    refit_identical_engine_weights: bool = False,
    strip_engine_weights: bool = False,
    immutable_weights: bool = True,
    enable_weight_streaming: bool = False,
    enable_cross_compile_for_windows: bool = False,
    tiling_optimization_level: str = 'none',
    l2_limit_for_tiling: int = -1,
    use_distributed_mode_trace: bool = False,
)

Compilation settings for the Torch-TensorRT Dynamo paths

Parameters

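enabled_precisions (Set[dtype]) – Available kernel dtype precisions
debug (bool) – Whether to print verbose debugging information
workspace_size (int) – Workspace size TRT is allowed to use for the module (0 is the default)
min_block_size (int) – Minimum number of operators per TRT engine block
torch_executed_ops (Collection[Target]) – Collection of operations to run in Torch, regardless of converter coverage
pass_through_build_failures (bool) – Whether to fail on TRT engine build errors (True) or not (False)
max_aux_streams (Optional[int]) – Maximum number of auxiliary TRT streams allowed for each engine
version_compatible (bool) – Provide version forward-compatibility for engine plan files
optimization_level (Optional[int]) – Builder optimization level, 0-5. A higher level means a longer build time spent searching more optimization options. TRT defaults to 3
use_python_runtime (Optional[bool]) – Whether to strictly use the Python runtime or the C++ runtime. To select a runtime automatically based on whether the C++ dependency is present (preferring the C++ runtime if available), leave this argument as None
truncate_double (bool) – Whether to truncate float64 TRT engine inputs or weights to float32
use_fast_partitioner (bool) – Whether to use the fast or the global graph partitioning system
enable_experimental_decompositions (bool) – Whether to enable all core aten decompositions or only a selected subset of them
device (Device) – GPU to compile the model on
require_full_compilation (bool) – Whether to require the graph to be fully compiled in TensorRT. Only applicable for ir="dynamo"; has no effect on the torch.compile path
assume_dynamic_shape_support (bool) – Setting this to True makes the converters work for both dynamic and static shapes. Default: False
disable_tf32 (bool) – Whether to disable TF32 computation for TRT layers
sparse_weights (bool) – Whether to allow the builder to use sparse weights
engine_capability (trt.EngineCapability) – Restrict kernel selection to safe GPU kernels or safe DLA kernels
num_avg_timing_iters (int) – Number of averaging timing iterations used to select kernels
dla_sram_size (int) – Fast software-managed RAM used by DLA to communicate within a layer
dla_local_dram_size (int) – Host RAM used by DLA to share intermediate tensor data across operations
dla_global_dram_size (int) – Host RAM used by DLA to store weights and metadata for execution
dryrun (Union[bool, str]) – Toggles "Dryrun" mode, which runs everything through partitioning, short of conversion to TRT engines. Prints detailed logs of the graph structure and the nature of the partitioning. Optionally saves the output to a file if a string path is specified
hardware_compatible (bool) – Build TensorRT engines compatible with GPU architectures other than that of the GPU on which the engine was built (currently works for NVIDIA Ampere and newer)
timing_cache_path (str) – Path to the timing cache if it exists, or the location where it will be saved after compilation
cache_built_engines (bool) – Whether to save compiled TRT engines to storage
reuse_cached_engines (bool) – Whether to load cached TRT engines from storage
use_explicit_typing (bool) – Enables strong typing in TensorRT compilation, which respects the precisions set in the PyTorch model. This is useful when the graph uses mixed precision
use_fp32_acc (bool) – Inserts cast-to-FP32 nodes around matmul layers so that TensorRT performs matmul accumulation in FP32. Use this only when FP16 precision is configured in enabled_precisions
refit_identical_engine_weights (bool) – Whether to refit the engine with identical weights
strip_engine_weights (bool) – Whether to strip the engine weights
immutable_weights (bool) – Build non-refittable engines. This is useful for layers that are not refittable. If this argument is set to True, strip_engine_weights and refit_identical_engine_weights are ignored
enable_weight_streaming (bool) – Enable weight streaming
enable_cross_compile_for_windows (bool) – False by default, meaning TensorRT engines can only be executed on the same platform where they were built. True enables cross-platform compatibility, allowing the engine to be built on Linux and run on Windows
tiling_optimization_level (str) – Optimization level of the tiling strategies. A higher level lets TensorRT spend more time searching for a better tiling strategy. Supported values are "none", "fast", "moderate", and "full"
l2_limit_for_tiling (int) – Target L2 cache usage limit, in bytes, for tiling optimization (default is -1, meaning no limit)
use_distributed_mode_trace (bool) – Use aot_autograd to trace the graph. This is enabled when DTensors or distributed tensors are present in a distributed model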

Custom Setting Usage¶

import torch
import torch_tensorrt
...
optimized_model = torch.compile(model, backend="torch_tensorrt", dynamic=False,
                                options={"truncate_double": True,
                                         "enabled_precisions": {torch.float, torch.half},
                                         "debug": True,
                                         "min_block_size": 2,
                                         "torch_executed_ops": {"torch.ops.aten.sub.Tensor"},
                                         "optimization_level": 4,
                                         "use_python_runtime": False,})

Note

Quantization/INT8 support is slated for a future release; currently, layers in FP16 and FP32 precision are supported.
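
Option keys are plain strings, so a misspelled key is easy to miss. The sketch below is one way to check an options dict against the field names in the CompilationSettings signature before passing it to torch.compile. The helper check_option_names and the KNOWN_SETTINGS set are illustrative and not part of torch_tensorrt; the set is copied by hand from the signature in this guide and would need updating for other releases.

```python
import difflib

# Field names copied by hand from the CompilationSettings signature in this
# guide (an assumption: other releases may add or rename fields).
KNOWN_SETTINGS = frozenset({
    "enabled_precisions", "debug", "workspace_size", "min_block_size",
    "torch_executed_ops", "pass_through_build_failures", "max_aux_streams",
    "version_compatible", "optimization_level", "use_python_runtime",
    "truncate_double", "use_fast_partitioner",
    "enable_experimental_decompositions", "device",
    "require_full_compilation", "disable_tf32",
    "assume_dynamic_shape_support", "sparse_weights", "engine_capability",
    "num_avg_timing_iters", "dla_sram_size", "dla_local_dram_size",
    "dla_global_dram_size", "dryrun", "hardware_compatible",
    "timing_cache_path", "lazy_engine_init", "cache_built_engines",
    "reuse_cached_engines", "use_explicit_typing", "use_fp32_acc",
    "refit_identical_engine_weights", "strip_engine_weights",
    "immutable_weights", "enable_weight_streaming",
    "enable_cross_compile_for_windows", "tiling_optimization_level",
    "l2_limit_for_tiling", "use_distributed_mode_trace",
})


def check_option_names(options):
    """Map each unrecognized key to its closest known field name (or None)."""
    candidates = sorted(KNOWN_SETTINGS)
    report = {}
    for key in options:
        if key not in KNOWN_SETTINGS:
            close = difflib.get_close_matches(key, candidates, n=1)
            report[key] = close[0] if close else None
    return report


options = {
    "truncate_double": True,
    "min_block_size": 2,
    "optimisation_level": 4,  # deliberate typo for the demonstration
}
unknown = check_option_names(options)
print(unknown)
```

An empty result means every key matches a listed field; a non-empty result names the keys to review before compiling.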

Compilation¶

Compilation is triggered by passing inputs to the model:

import torch_tensorrt
...
# Causes model compilation to occur
first_outputs = optimized_model(*inputs)

# Subsequent inference runs with the same, or similar inputs will not cause recompilation
# For a full discussion of this, see "Recompilation Conditions" below
second_outputs = optimized_model(*inputs)
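
Because the first call includes compilation, it can be useful to time it separately from later calls. The following is a minimal sketch using only the standard library. The names time_calls and _FakeCompiledModel are hypothetical and exist only for illustration; the fake model stands in for optimized_model so that the example runs without a GPU. For a real GPU model, pass sync=torch.cuda.synchronize so that queued GPU work finishes before the clock is read.

```python
import time


def time_calls(fn, *inputs, repeats=3, sync=None):
    """Return (first_call_seconds, mean_later_call_seconds) for fn(*inputs)."""

    def timed_call():
        start = time.perf_counter()
        fn(*inputs)
        if sync is not None:
            sync()  # wait for asynchronous work before stopping the clock
        return time.perf_counter() - start

    first = timed_call()
    later = [timed_call() for _ in range(repeats)]
    return first, sum(later) / len(later)


class _FakeCompiledModel:
    """Stand-in for optimized_model: slow on the first call, fast afterwards."""

    def __init__(self):
        self._compiled = False

    def __call__(self, x):
        if not self._compiled:
            time.sleep(0.05)  # pretend to compile
            self._compiled = True
        return x


first, later = time_calls(_FakeCompiledModel(), 1.0)
print(f"first call: {first:.4f}s, later calls: {later:.6f}s")
```

With a real model, the call would look like time_calls(optimized_model, *inputs, sync=torch.cuda.synchronize).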

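After Compilation¶

The compiled object can be used for inference within the Python session, and it will recompile according to the recompilation conditions detailed below. In addition to general inference, the compilation process can help determine model performance, current operator coverage, and the feasibility of serialization. Each of these points is covered in detail below.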

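Model Performance¶

The optimized model returned from torch.compile is useful for model benchmarking, since it automatically handles changes in the compilation context or differing inputs that could require recompilation. This saves time when benchmarking inputs of varying distributions, batch sizes, or other criteria.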

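Operator Coverage¶

Compilation is also a useful tool for determining operator coverage for a particular model. For instance, the following compilation command displays the operator coverage for each graph but does not compile the model, effectively providing a "dryrun" mechanism: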

import torch_tensorrt
...
optimized_model = torch.compile(model, backend="torch_tensorrt", dynamic=False,
                                options={"debug": True,
                                         "min_block_size": float("inf"),})

If key operators for your model are unsupported, see dynamo_conversion to contribute your own converters, or file an issue here: https://github.com/pytorch/TensorRT/issues.

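Feasibility of Serialization¶

Compilation can also help demonstrate graph breaks and the feasibility of serializing a particular model. For instance, if a model has no graph breaks and compiles successfully with the Torch-TensorRT backend, then that model should be compilable and serializable via the torch_tensorrt Dynamo IR, as discussed in Dynamic shapes with Torch-TensorRT. To determine the number of graph breaks in a model, the torch._dynamo.explain function is very useful: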

import torch
import torch_tensorrt
...
explanation = torch._dynamo.explain(model)(*inputs)
print(f"Graph breaks: {explanation.graph_break_count}")
optimized_model = torch.compile(model, backend="torch_tensorrt", dynamic=False, options={"truncate_double": True})

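Dynamic Shape Support¶

The Torch-TensorRT torch.compile backend currently recompiles for each new batch size it encounters, and using the dynamic=False argument is preferred when compiling with this backend. Full dynamic shape support is planned for a future release.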

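Recompilation Conditions¶

Once the model has been compiled, subsequent inference inputs with the same shape and data type that traverse the graph in the same way do not require recompilation. Furthermore, each new recompilation is cached for the duration of the Python session. For instance, if inputs of batch size 4 and 8 are provided to the model, causing two recompilations, no further recompilation is needed for future inputs with those batch sizes during inference within the same session. Support for engine cache serialization is planned for a future release.

Recompilation is generally triggered by one of two events: encountering inputs of different sizes, or encountering inputs that traverse the model code differently. The latter can happen when the model code includes conditional logic, complex loops, or data-dependent shapes. torch.compile handles the guarding in both scenarios and determines when recompilation is necessary.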
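
The session-level caching described above can be pictured with a toy model. The sketch below is not how torch.compile implements guards; it is a hypothetical ShapeKeyedCache class, written in plain Python, that only illustrates the rule "one compilation per distinct input shape and data type, reused for the rest of the session". The compile function and the input shapes are made-up placeholders.

```python
class ShapeKeyedCache:
    """Toy cache: 'compiles' once per distinct (shape, dtype) pair."""

    def __init__(self, compile_fn):
        self._compile_fn = compile_fn
        self._compiled = {}
        self.compile_count = 0

    def get(self, shape, dtype):
        key = (tuple(shape), dtype)
        if key not in self._compiled:
            # First time this shape/dtype is seen: pay the compilation cost.
            self.compile_count += 1
            self._compiled[key] = self._compile_fn(shape, dtype)
        return self._compiled[key]


# Placeholder "compiler" that just returns a label for the engine.
cache = ShapeKeyedCache(lambda shape, dtype: f"engine{shape}-{dtype}")

# Batch sizes 4 and 8 alternate; only the first occurrence of each compiles.
for batch_size in (4, 8, 4, 8, 4):
    cache.get((batch_size, 3, 224, 224), "float32")

print(cache.compile_count)  # prints 2
```

A new batch size, or the same shape with a different data type, would add one more entry to the cache.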
