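Running an ExecuTorch Model in C++ Tutorial

Author: Jacob Szwejbka

In this tutorial, we will cover the APIs to load an ExecuTorch model, prepare the MemoryManager, set inputs, execute the model, and retrieve outputs.

For a high-level overview of the ExecuTorch Runtime, see the Runtime Overview; for more in-depth documentation on each API, see the Runtime API Reference. A fully functional C++ model runner is also available, and the Setting Up ExecuTorch documentation shows how to build and run it.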

Prerequisites

You will need an ExecuTorch model to follow along. We will use the model SimpleConv generated in the Exporting to ExecuTorch tutorial.

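Model Loading

The first step in running a model is loading it. ExecuTorch uses an abstraction called a DataLoader to handle the specifics of retrieving the .pte file data, and a Program then represents the loaded state.

Users can define their own DataLoaders to fit the needs of their particular system. In this tutorial we will use the FileDataLoader, but you can look under Example Data Loader Implementations to see other options provided by the ExecuTorch project.

For the FileDataLoader, all we need to do is pass a file path to its static from() factory function.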

using namespace torch::executor;

Result<util::FileDataLoader> loader =
        util::FileDataLoader::from("/tmp/model.pte");
assert(loader.ok());

Result<Program> program =
      torch::executor::Program::load(&loader.get());
assert(program.ok());
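To make the data loader idea more concrete, here is a minimal standard-library sketch of a loader that serves byte ranges from an in-memory copy of a model file. InMemoryLoader and ByteRange are hypothetical names invented for illustration; they are not the ExecuTorch DataLoader interface, whose exact virtual methods should be taken from the Runtime API Reference.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <optional>
#include <utility>
#include <vector>

// Hypothetical view type: a pointer plus a length. Not an ExecuTorch type.
struct ByteRange {
  const uint8_t* data;
  size_t size;
};

// Hypothetical loader that owns the model bytes and hands out views into
// them. It only illustrates the "retrieve bytes at an offset" idea.
class InMemoryLoader {
 public:
  explicit InMemoryLoader(std::vector<uint8_t> bytes)
      : bytes_(std::move(bytes)) {}

  // Total number of bytes available to load.
  size_t size() const { return bytes_.size(); }

  // Returns a view of `size` bytes starting at `offset`, or std::nullopt
  // when the requested range falls outside the stored data.
  std::optional<ByteRange> load(size_t offset, size_t size) const {
    if (offset > bytes_.size() || size > bytes_.size() - offset) {
      return std::nullopt;
    }
    return ByteRange{bytes_.data() + offset, size};
  }

 private:
  std::vector<uint8_t> bytes_;
};
```

The bounds check is written as `size > bytes_.size() - offset` rather than `offset + size > bytes_.size()` so that a very large request cannot overflow and slip past the check.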

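Setting Up the MemoryManager

Next we will set up the MemoryManager.

One of the principles of ExecuTorch is giving users control over where the memory used by the runtime comes from. Today (late 2023), users need to provide two different allocators: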

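Method Allocator: A MemoryAllocator used to allocate runtime structures at Method load time. Tensor metadata, the internal chain of instructions, and other runtime state all come from this allocator.
Planned Memory: A HierarchicalAllocator containing one or more memory arenas where internal mutable tensor data buffers are placed. At Method load time, internal tensors have their data pointers assigned to various offsets within these arenas. The positions of those offsets and the sizes of the arenas are determined by memory planning ahead of time.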
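The planned-memory idea can be sketched with the standard library alone: each internal tensor is described by an arena index, an offset, and a byte count, and its data pointer is the arena base plus the offset. Arena, PlannedSlot, and resolve below are hypothetical names for illustration only; they are not ExecuTorch classes, and the real runtime performs this assignment internally.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical description of one memory arena (for example SRAM or DRAM).
struct Arena {
  uint8_t* base;
  size_t size;
};

// Hypothetical record produced by ahead-of-time memory planning: which
// arena a tensor lives in, where it starts, and how many bytes it needs.
struct PlannedSlot {
  size_t arena_id;
  size_t offset;
  size_t nbytes;
};

// Returns the data pointer for `slot`, or nullptr if the slot names an
// unknown arena or does not fit inside it.
inline uint8_t* resolve(const std::vector<Arena>& arenas,
                        const PlannedSlot& slot) {
  if (slot.arena_id >= arenas.size()) {
    return nullptr;
  }
  const Arena& arena = arenas[slot.arena_id];
  if (slot.offset > arena.size || slot.nbytes > arena.size - slot.offset) {
    return nullptr;
  }
  return arena.base + slot.offset;
}
```

Because all offsets are fixed by the plan, no allocation decisions are made at inference time; the caller only has to supply arenas that are at least as large as the plan requires.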

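For this example we will retrieve the sizes of the planned memory arenas dynamically from the Program, but for heapless environments users can retrieve this information from the Program ahead of time and allocate the arenas statically. We will also use a malloc-based allocator for the method allocator.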

// Method names map back to Python nn.Module method names. Most users will only have the singular method "forward".
const char* method_name = "forward";

// MethodMeta is a lightweight structure that lets us gather metadata
// information about a specific method. In this case we are looking to
// get the required size of the memory planned buffers for the method
// "forward".
Result<MethodMeta> method_meta = program->method_meta(method_name);
assert(method_meta.ok());

std::vector<std::unique_ptr<uint8_t[]>> planned_buffers; // Owns the Memory
std::vector<Span<uint8_t>> planned_arenas; // Passed to the allocator

size_t num_memory_planned_buffers = method_meta->num_memory_planned_buffers();

// It is possible to have multiple layers in our memory hierarchy; for example, SRAM and DRAM.
for (size_t id = 0; id < num_memory_planned_buffers; ++id) {
  // .get() will always succeed because id < num_memory_planned_buffers.
  size_t buffer_size =
      static_cast<size_t>(method_meta->memory_planned_buffer_size(id).get());
  planned_buffers.push_back(std::make_unique<uint8_t[]>(buffer_size));
  planned_arenas.push_back({planned_buffers.back().get(), buffer_size});
}
HierarchicalAllocator planned_memory(
    {planned_arenas.data(), planned_arenas.size()});

// Version of MemoryAllocator that uses malloc to handle allocations
// rather than a fixed buffer.
util::MallocMemoryAllocator method_allocator;

// Assemble all of the allocators into the MemoryManager that the Executor
// will use.
MemoryManager memory_manager(&method_allocator, &planned_memory);
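For heapless environments, the malloc-based method allocator would be replaced by one that carves allocations out of a fixed, statically allocated buffer. The sketch below shows that general technique with the standard library only. BumpAllocator and method_pool are hypothetical names, the 4096-byte pool size is an arbitrary assumption, and this is not the ExecuTorch MemoryAllocator class.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical bump allocator over a caller-provided fixed buffer.
// Allocations are never freed individually; the whole pool is reused
// or discarded at once.
class BumpAllocator {
 public:
  BumpAllocator(uint8_t* base, size_t size)
      : base_(base), size_(size), used_(0) {}

  // Returns `nbytes` bytes aligned to `alignment` (must be a power of
  // two), or nullptr when the pool cannot satisfy the request.
  void* allocate(size_t nbytes,
                 size_t alignment = alignof(std::max_align_t)) {
    uintptr_t start = reinterpret_cast<uintptr_t>(base_) + used_;
    uintptr_t aligned = (start + alignment - 1) &
        ~(static_cast<uintptr_t>(alignment) - 1);
    size_t padding = static_cast<size_t>(aligned - start);
    if (padding > size_ - used_ || nbytes > size_ - used_ - padding) {
      return nullptr;
    }
    used_ += padding + nbytes;
    return reinterpret_cast<void*>(aligned);
  }

  // Bytes consumed so far, including alignment padding.
  size_t used() const { return used_; }

 private:
  uint8_t* base_;
  size_t size_;
  size_t used_;
};

// Statically allocated pool; the size is an illustrative assumption and
// would normally be chosen from measurements of the real model.
alignas(16) static uint8_t method_pool[4096];
```

A failed allocation leaves the allocator unchanged, so the caller can report the error and size the pool more generously on the next build.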

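Loading a Method

In ExecuTorch we load and initialize from the Program at method granularity. Many programs will only have one method, "forward". load_method is where initialization is done, from setting up tensor metadata to initializing delegates, and so on.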

// Pass the MemoryManager assembled above so the Method can allocate from it.
Result<Method> method = program->load_method(method_name, &memory_manager);
assert(method.ok());

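Setting Inputs

Now that we have our method, we need to set up its inputs before we can perform inference. In this case we know our model takes a single float tensor of size (1, 3, 256, 256).

Depending on how your model was memory planned, the planned memory may or may not contain buffer space for your inputs and outputs.

If the outputs were not memory planned, users will need to set the output data pointer with set_output_data_ptr. Here we assume that our model was exported with inputs and outputs handled by the memory plan.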

// Create our input tensor.
float data[1 * 3 * 256 * 256];
Tensor::SizesType sizes[] = {1, 3, 256, 256};
Tensor::DimOrderType dim_order[] = {0, 1, 2, 3};
TensorImpl impl(
    ScalarType::Float, // dtype
    4, // number of dimensions
    sizes,
    data,
    dim_order);
Tensor t(&impl);

// Implicitly casts t to EValue
Error set_input_error = method->set_input(t, 0);
assert(set_input_error == Error::Ok);
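The snippet above leaves the input buffer uninitialized. As a minimal sketch of how such a buffer might be filled, the code below computes row-major strides for the sizes (1, 3, 256, 256) and writes values through a flat index. It assumes the dim order {0, 1, 2, 3} corresponds to a standard contiguous layout. The helper names contiguous_strides, flat_index, and make_input are hypothetical and use only the standard library, and the buffer lives on the heap in a std::vector rather than on the stack.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

// Row-major strides, in elements, for a contiguous 4-D tensor.
inline std::array<size_t, 4> contiguous_strides(
    const std::array<size_t, 4>& sizes) {
  std::array<size_t, 4> strides{};
  size_t running = 1;
  for (size_t i = 4; i-- > 0;) {
    strides[i] = running;
    running *= sizes[i];
  }
  return strides;
}

// Maps an (n, c, h, w) coordinate to its position in the flat buffer.
inline size_t flat_index(const std::array<size_t, 4>& strides,
                         size_t n, size_t c, size_t h, size_t w) {
  return n * strides[0] + c * strides[1] + h * strides[2] + w * strides[3];
}

// Builds an input buffer in which every element holds its channel index,
// a placeholder pattern standing in for real image data.
inline std::vector<float> make_input(const std::array<size_t, 4>& sizes) {
  const std::array<size_t, 4> strides = contiguous_strides(sizes);
  std::vector<float> data(sizes[0] * sizes[1] * sizes[2] * sizes[3]);
  for (size_t n = 0; n < sizes[0]; ++n) {
    for (size_t c = 0; c < sizes[1]; ++c) {
      for (size_t h = 0; h < sizes[2]; ++h) {
        for (size_t w = 0; w < sizes[3]; ++w) {
          data[flat_index(strides, n, c, h, w)] = static_cast<float>(c);
        }
      }
    }
  }
  return data;
}
```

The vector's data() pointer could then stand in for the data array passed to TensorImpl, provided the vector outlives the tensor that refers to it.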

Performing Inference

Now that our method is loaded and our inputs are set, we can perform inference. We do this by calling execute.

Error execute_error = method->execute();
assert(execute_error == Error::Ok);

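Retrieving Outputs

Once inference completes, we can retrieve the output. We know that our model returns only a single output tensor. One potential pitfall here is that the output we get back is owned by the Method. Users should take care to clone the output before mutating it, or if they need it to have a lifetime separate from the Method's.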

EValue output = method->get_output(0);
assert(output.isTensor());
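One simple way to decouple the output from the Method's lifetime is to copy its elements into a caller-owned container. The sketch below shows only the copy step, using the standard library. copy_output is a hypothetical helper, and the pointer and element count are assumed to come from the output tensor (for example via output.toTensor().const_data_ptr<float>() and numel(), which should be checked against the Runtime API Reference).

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Copies `numel` floats out of a runtime-owned buffer into a vector that
// the caller owns. Returns an empty vector when `data` is null.
inline std::vector<float> copy_output(const float* data, size_t numel) {
  if (data == nullptr) {
    return {};
  }
  return std::vector<float>(data, data + numel);
}
```

After the copy, later calls to execute or the destruction of the Method no longer affect the saved values, and the copy can be modified freely.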

Conclusion

In this tutorial we covered the APIs and steps required to load and execute an ExecuTorch model in C++.