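Torch-TensorRT (FX Frontend) User Guide

Torch-TensorRT (FX Frontend) is a tool that converts a PyTorch model, through torch.fx, into a TensorRT engine optimized for running on NVIDIA GPUs. TensorRT is the inference engine developed by NVIDIA; it combines various optimizations, including kernel fusion, graph optimization, and low precision. The tool is developed in a Python environment, which makes this workflow very accessible to researchers and engineers. Using the tool involves a few stages, which are introduced here.

> Torch-TensorRT (FX Frontend) is currently in Beta, and using it with the PyTorch nightly build is recommended.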

# Test an example by
$ python py/torch_tensorrt/fx/example/lower_example.py

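Converting a PyTorch Model to a TensorRT Engine

In general, users can call compile() to convert a model to a TensorRT engine. It is a wrapper API that bundles the major steps needed for the conversion. See lower_example.py under examples/fx for example usage.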

def compile(
    module: nn.Module,
    input,
    max_batch_size=2048,
    max_workspace_size=33554432,
    explicit_batch_dimension=False,
    lower_precision=LowerPrecision.FP16,
    verbose_log=False,
    timing_cache_prefix="",
    save_timing_cache=False,
    cuda_graph_batch_size=-1,
    dynamic_batch=True,
) -> nn.Module:

    """
    Takes in original module, input and lowering setting, run lowering workflow to turn module
    into lowered module, or so called TRTModule.

    Args:
        module: Original module for lowering.
        input: Input for module.
        max_batch_size: Maximum batch size (must be >= 1 to be set, 0 means not set)
        max_workspace_size: Maximum size of workspace given to TensorRT.
        explicit_batch_dimension: Use explicit batch dimension in TensorRT if set True, otherwise use implicit batch dimension.
        lower_precision: lower_precision config given to TRTModule.
        verbose_log: Enable verbose log for TensorRT if set True.
        timing_cache_prefix: Timing cache file name for timing cache used by fx2trt.
        save_timing_cache: Update timing cache with current timing cache data if set to True.
        cuda_graph_batch_size: Cuda graph batch size, default to be -1.
        dynamic_batch: batch dimension (dim=0) is dynamic.
    Returns:
        A torch.nn.Module lowered by TensorRT.
    """

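This section walks through an example to illustrate the major steps the FX path uses. Users can refer to fx2trt_example.py under examples/fx.

Step 1: Trace the model with acc_tracer

Acc_tracer is a tracer that inherits from the FX tracer. It comes with an args normalizer that converts all args to kwargs and passes them to the TRT converters.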

import torch_tensorrt.fx.tracer.acc_tracer.acc_tracer as acc_tracer

# Build the model which needs to be a PyTorch nn.Module.
my_pytorch_model = build_model()

# Prepare inputs to the model. Inputs have to be a List of Tensors
inputs = [Tensor, Tensor, ...]

# Trace the model with acc_tracer.
acc_mod = acc_tracer.trace(my_pytorch_model, inputs)

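Common Errors

symbolically traced variables cannot be used as inputs to control flow: this error means the model contains dynamic control flow. See the “Dynamic Control Flow” section of the FX guide.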
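To see why this error arises, the toy sketch below mimics symbolic tracing with a placeholder object. `ToyProxy`, `toy_trace`, `static_fn`, and `dynamic_fn` are hypothetical names used only for illustration, and this is not how torch.fx is implemented internally. The point is simply that a placeholder carries no concrete value, so a Python `if` cannot pick a branch during tracing.

```python
class ToyProxy:
    """Stand-in for a symbolically traced value: records ops, holds no data."""

    def __init__(self, name, log):
        self.name = name
        self.log = log

    def __add__(self, other):
        self.log.append(("add", self.name, other))
        return ToyProxy(f"add({self.name})", self.log)

    def __gt__(self, other):
        self.log.append(("gt", self.name, other))
        return ToyProxy(f"gt({self.name})", self.log)

    def __bool__(self):
        # An `if` statement needs a concrete True/False, which tracing cannot supply.
        raise TypeError(
            "symbolically traced variables cannot be used as inputs to control flow"
        )


def toy_trace(fn):
    """Run fn once on a placeholder and return the list of recorded ops."""
    log = []
    fn(ToyProxy("x", log))
    return log


def static_fn(x):
    return x + 1  # no data-dependent branching: traceable


def dynamic_fn(x):
    if x > 0:  # the branch depends on the runtime value of x: not traceable
        return x + 1
    return x + 2


print(toy_trace(static_fn))  # [('add', 'x', 1)]

try:
    toy_trace(dynamic_fn)
except TypeError as err:
    print("trace failed:", err)
```

A model that hits this error usually needs the data-dependent branch rewritten or moved outside the traced region.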

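Step 2: Build the TensorRT engine

TensorRT handles the batch dimension in one of two modes: explicit batch dimension and implicit batch dimension. Implicit batch dimension mode was used by early versions of TensorRT and is now deprecated, though still supported for backward compatibility. In explicit batch dimension mode, all dimensions are explicit and can be dynamic, that is, their length can change at execution time. Many new features, such as dynamic shapes and loops, are available only in this mode. Users can still select implicit batch dimension mode by setting explicit_batch_dimension=False in compile(). We do not recommend it, because future TensorRT versions will lack support for it.

Explicit batch dimension is the default mode, and it must be set for dynamic shapes. For most vision tasks, users who want an effect similar to implicit mode, where only the batch dimension changes, can enable dynamic_batch in compile(). It has some requirements: 1. Shapes of inputs, outputs, and activations are fixed except for the batch dimension. 2. Inputs, outputs, and activations have the batch dimension as the major dimension. 3. No operator in the model modifies the batch dimension (permute, transpose, split, etc.) or computes over the batch dimension (sum, softmax, etc.).

As an example of the last requirement, given a 3D tensor t of shape (batch, sequence, dimension), an operation such as torch.transpose(t, 0, 2) violates it. If any of the three requirements is not met, InputTensorSpec must be specified as inputs with a dynamic range.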

import torch
# Assumption: InputTensorSpec and TRTInterpreter are importable from
# torch_tensorrt.fx, the same package TRTModule is imported from in Step 3.
from torch_tensorrt.fx import InputTensorSpec, TRTInterpreter

# InputTensorSpec is a dataclass we use to store input information.
# There're two ways we can build input_specs.
# Option 1, build it manually.
input_specs = [
  InputTensorSpec(shape=(1, 2, 3), dtype=torch.float32),
  InputTensorSpec(shape=(1, 4, 5), dtype=torch.float32),
]
# Option 2, build it from sample inputs provided by the user.
inputs = [
    torch.rand((1, 2, 3), dtype=torch.float32),
    torch.rand((1, 4, 5), dtype=torch.float32),
]
input_specs = InputTensorSpec.from_tensors(inputs)

# IMPORTANT: If dynamic shape is needed, we need to build it slightly differently.
input_specs = [
    InputTensorSpec(
        shape=(-1, 2, 3),
        dtype=torch.float32,
        # Currently we only support one set of dynamic range. User may set other dimensions but it is not promised to work for any models
        # (min_shape, optimize_target_shape, max_shape)
        # For more information refer to fx/input_tensor_spec.py
        shape_ranges = [
            ((1, 2, 3), (4, 2, 3), (100, 2, 3)),
        ],
    ),
    InputTensorSpec(shape=(1, 4, 5), dtype=torch.float32),
]

# Build a TRT interpreter. Set explicit_batch_dimension to True or False to
# match the batch mode chosen above; dynamic shapes require True.
interpreter = TRTInterpreter(
    acc_mod, input_specs, explicit_batch_dimension=True
)

# The output of TRTInterpreter run() is wrapped as TRTInterpreterResult.
# The TRTInterpreterResult contains required parameter to build TRTModule,
# and other informational output from TRTInterpreter run.
class TRTInterpreterResult(NamedTuple):
    engine: Any
    input_names: Sequence[str]
    output_names: Sequence[str]
    serialized_cache: bytearray

# max_batch_size: set to the maximum batch size you will use.
# max_workspace_size: set to the maximum size we can afford for temporary buffers.
# lower_precision: the precision the model layers run in (TensorRT will choose the best-performing precision).
# sparse_weights: allow the builder to examine weights and use optimized functions when weights have suitable sparsity.
# force_fp32_output: force the output to be fp32.
# strict_type_constraints: usually set to False, unless the precision of certain layers must be controlled for numeric reasons.
# algorithm_selector: set up algorithm selection for certain layers.
# timing_cache: enable the timing cache for TensorRT.
# profiling_verbosity: TensorRT logging level.
trt_interpreter_result = interpreter.run(
    max_batch_size=64,
    max_workspace_size=1 << 25,
    sparse_weights=False,
    force_fp32_output=False,
    strict_type_constraints=False,
    algorithm_selector=None,
    timing_cache=None,
    profiling_verbosity=None,
)
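The (min_shape, optimize_target_shape, max_shape) triple used in shape_ranges can be sanity-checked before building an engine. The sketch below is plain Python under stated assumptions: `validate_shape_range` and `fits` are hypothetical helpers, not part of torch_tensorrt, and the rules they encode (min <= opt <= max on every dimension, static dimensions pinned to one value) reflect the usual reading of such a range rather than the library's exact checks.

```python
from typing import Tuple

Shape = Tuple[int, ...]
ShapeRange = Tuple[Shape, Shape, Shape]  # (min_shape, optimize_target_shape, max_shape)


def validate_shape_range(shape: Shape, shape_range: ShapeRange) -> None:
    """Check that a range is consistent with a shape that uses -1 for dynamic dims."""
    min_shape, opt_shape, max_shape = shape_range
    for candidate in shape_range:
        if len(candidate) != len(shape):
            raise ValueError(f"rank mismatch: {candidate} vs {shape}")
    for dim, (lo, opt, hi) in enumerate(zip(min_shape, opt_shape, max_shape)):
        if not lo <= opt <= hi:
            raise ValueError(f"dim {dim}: need min <= opt <= max, got {lo}, {opt}, {hi}")
        if shape[dim] != -1 and not (lo == opt == hi == shape[dim]):
            raise ValueError(f"dim {dim} is static ({shape[dim]}) but the range varies")


def fits(runtime_shape: Shape, shape_range: ShapeRange) -> bool:
    """Return True if a concrete runtime shape lies inside the range on every dim."""
    min_shape, _, max_shape = shape_range
    return len(runtime_shape) == len(min_shape) and all(
        lo <= size <= hi for lo, size, hi in zip(min_shape, runtime_shape, max_shape)
    )


shape = (-1, 2, 3)
shape_range = ((1, 2, 3), (4, 2, 3), (100, 2, 3))
validate_shape_range(shape, shape_range)

print(fits((32, 2, 3), shape_range))   # True: batch 32 is within [1, 100]
print(fits((128, 2, 3), shape_range))  # False: batch 128 exceeds the max of 100
print(fits((32, 3, 3), shape_range))   # False: dim 1 is static and must stay 2
```

With the range from the example above, any batch size from 1 to 100 is accepted, while 4 is the batch size the engine is tuned for.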

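Common Errors

RuntimeError: Conversion of function xxx not currently supported! This error means the xxx operator is not supported yet. See the “How to Add a Missing Op” section below for further instructions.

Step 3: Run the model

One way is to use TRTModule, which is essentially a PyTorch nn.Module.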

from torch_tensorrt.fx import TRTModule
mod = TRTModule(
    trt_interpreter_result.engine,
    trt_interpreter_result.input_names,
    trt_interpreter_result.output_names)
# Just like all other PyTorch modules
outputs = mod(*inputs)
torch.save(mod, "trt.pt")
reload_trt_mod = torch.load("trt.pt")
reload_model_output = reload_trt_mod(*inputs)

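So far, we have explained in detail the major steps in converting a PyTorch model to a TensorRT engine. Users can refer to the source code for explanations of some of the parameters. The conversion scheme has two important pieces. One is the acc tracer, which converts a PyTorch model to an acc graph. The other is the FX path converter, which converts the operations in the acc graph to the corresponding TensorRT operations and builds the TensorRT engine.

Acc Tracer

Acc tracer is a custom FX symbolic tracer. It does a few more things than the vanilla FX symbolic tracer. We rely on it mainly to convert PyTorch ops or builtin ops to acc ops. fx2trt uses acc ops for two main reasons:

Many ops among PyTorch ops and builtin ops do similar things, for example torch.add, builtin.add and torch.Tensor.add. With acc tracer, these three ops are normalized into a single acc_ops.add, which reduces the number of converters we need to write.
acc ops take only kwargs, which makes writing converters easier because no extra logic is needed to look for arguments in both args and kwargs.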
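Both points can be illustrated with a small, self-contained sketch. Everything here is a hypothetical stand-in: `acc_add` plays the role of acc_ops.add, `toy_torch_add` and `ToyTensor.add` play the roles of torch.add and torch.Tensor.add, and `NORMALIZATION_TABLE` is an assumed simplification of the tracer's registry, not its actual data structure.

```python
import operator


def acc_add(*, input, other):
    """Toy stand-in for acc_ops.add: keyword-only, like all acc ops."""
    return input + other


class ToyTensor:
    """Minimal value type with an .add method, standing in for torch.Tensor."""

    def __init__(self, value):
        self.value = value

    def __add__(self, other):
        return ToyTensor(self.value + other.value)

    def add(self, other):
        return self + other


def toy_torch_add(input, other):
    """Stand-in for torch.add."""
    return input + other


# Several call styles that all mean "add" map to one normalized op.
# Each entry lists the positional parameter names of the original target.
NORMALIZATION_TABLE = {
    operator.add: (acc_add, ("input", "other")),
    toy_torch_add: (acc_add, ("input", "other")),
    ToyTensor.add: (acc_add, ("input", "other")),
}


def normalize(target, args, kwargs=None):
    """Rewrite (target, args, kwargs) into (acc_op, kwargs-only)."""
    acc_op, param_names = NORMALIZATION_TABLE[target]
    new_kwargs = dict(zip(param_names, args))
    new_kwargs.update(kwargs or {})
    return acc_op, new_kwargs


a, b = ToyTensor(1), ToyTensor(2)
for target in (operator.add, toy_torch_add, ToyTensor.add):
    op, kw = normalize(target, (a, b))
    assert op is acc_add
    print(op.__name__, sorted(kw), op(**kw).value)  # acc_add ['input', 'other'] 3
```

Because all three targets collapse to one keyword-only op, a single converter that reads kwargs["input"] and kwargs["other"] would cover them all.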

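FX2TRT

After symbolic tracing, we have the graph representation of a PyTorch model. fx2trt leverages fx.Interpreter, which walks the whole graph node by node and calls the function each node represents. fx2trt overrides that original behavior by invoking the corresponding converter for each node instead. Each converter function adds the corresponding TensorRT layer(s).

Below is an example of a converter function. The decorator registers the converter function with the corresponding node. In this example, the converter is registered to an FX node whose target is acc_ops.sigmoid.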

@tensorrt_converter(acc_ops.sigmoid)
def acc_ops_sigmoid(network, target, args, kwargs, name):
    """
    network: TensorRT network. We'll be adding layers to it.

    The rest arguments are attributes of fx node.
    """
    input_val = kwargs['input']

    if not isinstance(input_val, trt.tensorrt.ITensor):
        raise RuntimeError(f'Sigmoid received input {input_val} that is not part '
                        'of the TensorRT region!')

    layer = network.add_activation(input=input_val, type=trt.ActivationType.SIGMOID)
    layer.name = name
    return layer.get_output(0)
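The registration-and-dispatch pattern described in this section can be sketched without TensorRT. The names below (`CONVERTERS`, `toy_converter`, `ToyNetwork`, `run_conversion`) are hypothetical and only imitate the roles of the converter registry, the tensorrt_converter decorator, the TensorRT network, and the interpreter loop; node targets are plain strings here for brevity.

```python
CONVERTERS = {}


def toy_converter(target):
    """Decorator that registers a converter for nodes whose target is `target`."""
    def register(fn):
        CONVERTERS[target] = fn
        return fn
    return register


class ToyNetwork:
    """Stand-in for a TensorRT network: just records the layers that were added."""

    def __init__(self):
        self.layers = []

    def add_layer(self, kind, name, inputs):
        self.layers.append((kind, name, tuple(inputs)))
        return f"{name}:out"  # handle for the layer's output


@toy_converter("sigmoid")
def convert_sigmoid(network, target, args, kwargs, name):
    return network.add_layer("SIGMOID", name, [kwargs["input"]])


@toy_converter("add")
def convert_add(network, target, args, kwargs, name):
    return network.add_layer("SUM", name, [kwargs["input"], kwargs["other"]])


def run_conversion(nodes, network):
    """Visit nodes in order; call the registered converter instead of the op itself."""
    env = {"x": "x"}  # maps node name -> output handle; "x" is the graph input
    for name, target, kwargs in nodes:
        if target not in CONVERTERS:
            raise RuntimeError(f"Conversion of function {target} not currently supported!")
        resolved = {key: env[value] for key, value in kwargs.items()}
        env[name] = CONVERTERS[target](network, target, (), resolved, name)
    return env


# Graph for: y = sigmoid(x); z = y + x
graph = [
    ("y", "sigmoid", {"input": "x"}),
    ("z", "add", {"input": "y", "other": "x"}),
]
net = ToyNetwork()
run_conversion(graph, net)
print(net.layers)
# [('SIGMOID', 'y', ('x',)), ('SUM', 'z', ('y:out', 'x'))]
```

Each converter's return value is stored under the node's name, which is how a later node receives the output of an earlier one.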

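How to Add a Missing Op

You can add it anywhere; just remember to import the file so that all acc ops and mappers are registered before tracing with acc_tracer.

Step 1: Add a new acc op

TODO: Explain the logistics of acc ops in more detail, such as when to break down an op and when to reuse other ops.

In acc tracer, nodes in the graph are converted to acc ops if a mapping to an acc op is registered for them.

Two things are needed to make the conversion to acc ops happen: an acc op function must be defined, and a mapping must be registered.

Defining an acc op is simple: write a function, then register it as an acc op with the decorator from acc_normalizer.py (register_acc_op in the code below). For example, the following code adds an acc op named foo() that adds two given inputs.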

# NOTE: all acc ops should only take kwargs as inputs, therefore we need the "*"
# at the beginning.
@register_acc_op
def foo(*, input, other, alpha):
    return input + alpha * other

There are two ways to register a mapping. One is register_acc_op_mapping(). Here we map torch.add to the foo() created above by adding the register_acc_op_mapping decorator to it.

this_arg_is_optional = True

@register_acc_op_mapping(
    op_and_target=("call_function", torch.add),
    arg_replacement_tuples=[
        ("input", "input"),
        ("other", "other"),
        ("alpha", "alpha", this_arg_is_optional),
    ],
)
@register_acc_op
def foo(*, input, other, alpha=1.0):
    return input + alpha * other

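op_and_target determines which node triggers this mapping. op and target are attributes of an FX node. During acc normalization, the mapping is triggered when a node's op and target match those set in op_and_target. Since we want to map from torch.add, op is call_function and target is torch.add.

arg_replacement_tuples determines how the kwargs of the new acc op node are built from the args and kwargs of the original node. Each tuple in arg_replacement_tuples is one argument mapping rule with two or three elements. The first element is the argument's name in the original node, and the second is the name it takes in the acc op node. The optional third element is a boolean indicating whether this kwarg is optional in the original node; it needs to be specified only when it is True. The order of the tuples matters, because a tuple's position determines the argument's position in the original node's args. This information is used to map the original node's args to the acc op node's kwargs. arg_replacement_tuples does not need to be specified if neither of the following holds:

The kwargs names of the original node and the acc op node differ.
There are optional arguments.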
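The mapping rules above can be made concrete with a short sketch. `build_acc_kwargs` is a hypothetical helper that follows the description in this section; the actual logic in acc_normalizer.py may differ in its details, and strings stand in for tensor arguments.

```python
def build_acc_kwargs(args, kwargs, arg_replacement_tuples):
    """Build kwargs for an acc op node from the original node's args and kwargs.

    Each tuple is (original_name, acc_op_name[, is_optional]); its index is the
    argument's position in the original node's args.
    """
    new_kwargs = {}
    for position, rule in enumerate(arg_replacement_tuples):
        orig_name, new_name = rule[0], rule[1]
        is_optional = len(rule) == 3 and rule[2]
        if position < len(args):
            new_kwargs[new_name] = args[position]      # passed positionally
        elif orig_name in kwargs:
            new_kwargs[new_name] = kwargs[orig_name]   # passed by keyword
        elif not is_optional:
            raise TypeError(f"missing required argument {orig_name!r}")
        # Optional and absent: leave it out so the acc op's default applies.
    return new_kwargs


rules = [("input", "input"), ("other", "other"), ("alpha", "alpha", True)]

# Call in the style of torch.add(x, y, alpha=2)
print(build_acc_kwargs(("x", "y"), {"alpha": 2}, rules))
# {'input': 'x', 'other': 'y', 'alpha': 2}

# Call in the style of torch.add(x, other=y); alpha is omitted
print(build_acc_kwargs(("x",), {"other": "y"}, rules))
# {'input': 'x', 'other': 'y'}
```

The second call shows why the optional flag matters: without it, the missing alpha would be treated as an error instead of falling back to the acc op's default.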

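The other way to register a mapping is register_custom_acc_mapper_fn(). This approach is designed to reduce redundant op registration, because it lets a single function map to one or more existing acc ops through some combination. Inside the function you can do whatever is needed. An example shows how it works.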

@register_acc_op
def foo(*, input, other, alpha=1.0):
    return input + alpha * other

@register_custom_acc_mapper_fn(
    op_and_target=("call_function", torch.add),
    arg_replacement_tuples=[
        ("input", "input"),
        ("other", "other"),
        ("alpha", "alpha", this_arg_is_optional),
    ],
)
def custom_mapper(node: torch.fx.Node, _: nn.Module) -> torch.fx.Node:
    """
    `node` is original node, which is a call_function node with target
    being torch.add.
    """
    alpha = 1
    if "alpha" in node.kwargs:
        alpha = node.kwargs["alpha"]
    # fx.Node is not subscriptable; read the normalized arguments from node.kwargs.
    foo_kwargs = {"input": node.kwargs["input"], "other": node.kwargs["other"], "alpha": alpha}
    with node.graph.inserting_before(node):
        foo_node = node.graph.call_function(foo, kwargs=foo_kwargs)
        foo_node.meta = node.meta.copy()
        return foo_node

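In the custom mapper function, we construct an acc op node and return it. The returned node takes over all the children of the original node (see acc_normalizer.py).

The last step is to add unit tests for the new acc op or mapper function. Unit tests go in test_acc_tracer.py.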
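As a starting point, the numeric behavior of the new op can be tested in isolation. The sketch below is an assumption-laden example, not the layout used in test_acc_tracer.py: it redefines foo() without the registration decorators so it runs with the standard library alone, and `FooAccOpTest` is a hypothetical test class. A fuller test would presumably also trace a module and check that the resulting node targets the new acc op.

```python
import unittest


def foo(*, input, other, alpha=1.0):
    """Same body as the foo acc op above, without the registration decorators."""
    return input + alpha * other


class FooAccOpTest(unittest.TestCase):
    def test_default_alpha(self):
        self.assertEqual(foo(input=1.0, other=2.0), 3.0)

    def test_explicit_alpha(self):
        self.assertEqual(foo(input=1.0, other=2.0, alpha=3.0), 7.0)

    def test_rejects_positional_arguments(self):
        # acc ops take keyword arguments only
        with self.assertRaises(TypeError):
            foo(1.0, 2.0)


suite = unittest.defaultTestLoader.loadTestsFromTestCase(FooAccOpTest)
result = unittest.TextTestRunner(verbosity=0).run(suite)
print(result.wasSuccessful())
```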

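Step 2: Add a new converter

All the converters developed for acc ops are in acc_op_converter.py, which offers good examples of how to add a converter.

Essentially, a converter is the mechanism that maps acc ops to TensorRT layers. Once all the required TensorRT layers have been identified, we can start adding a converter for the node using the TensorRT APIs.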

@tensorrt_converter(acc_ops.sigmoid)
def acc_ops_sigmoid(network, target, args, kwargs, name):
    """
    network: TensorRT network. We'll be adding layers to it.

    The rest arguments are attributes of fx node.
    """
    input_val = kwargs['input']

    if not isinstance(input_val, trt.tensorrt.ITensor):
        raise RuntimeError(f'Sigmoid received input {input_val} that is not part '
                        'of the TensorRT region!')

    layer = network.add_activation(input=input_val, type=trt.ActivationType.SIGMOID)
    layer.name = name
    return layer.get_output(0)

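The converter must be registered with the tensorrt_converter decorator, whose argument is the target of the FX node to convert. Inside the converter, the inputs of the FX node are found in kwargs. As the example shows, the original node is acc_ops.sigmoid, which has a single argument, “input”, in acc_ops.py. We fetch the input and check that it is a TensorRT tensor. We then add a sigmoid layer to the TensorRT network and return the layer's output. fx.Interpreter passes the returned output on to the child nodes of acc_ops.sigmoid.

What if no TensorRT layer offers the same functionality as the node?

In that case more work is needed. TensorRT provides plugins that serve as custom layers. We have not implemented this feature yet and will update this guide once it is enabled.

The last step is to add unit tests for the new converter. Users can add the corresponding unit tests in this folder.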
