Profiling PyTorch workloads with the Instrumentation and Tracing Technology (ITT) API

Created On: Oct 27, 2022 | Last Updated: Oct 24, 2024 | Last Verified: Not Verified

In this recipe, you will learn:

What Intel® VTune™ Profiler is
What the Instrumentation and Tracing Technology (ITT) API is
How to visualize the PyTorch model hierarchy in Intel® VTune™ Profiler
A short sample code showcasing how to use the PyTorch ITT APIs

Requirements

PyTorch 1.13 or later
Intel® VTune™ Profiler

Instructions for installing PyTorch are available at pytorch.org.

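What is Intel® VTune™ Profiler

Intel® VTune™ Profiler is a performance analysis tool for serial and multithreaded applications. For users familiar with Intel architecture, Intel® VTune™ Profiler provides a rich set of metrics that show how an application executes on Intel platforms and where its performance bottlenecks are.

More detailed information, including a Getting Started guide, is available on the Intel website.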

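What is the Instrumentation and Tracing Technology (ITT) API

The Instrumentation and Tracing Technology API (ITT API) provided by Intel® VTune™ Profiler enables a target application to generate and control the collection of trace data during its execution.

The advantage of the ITT feature is that it labels the time span of individual PyTorch operators, as well as customized regions, in the Intel® VTune™ Profiler GUI. When users find something abnormal, this helps locate which operator behaved unexpectedly.

Note

The ITT API has been integrated into PyTorch since version 1.13. Users don't need to invoke the original ITT C/C++ APIs; they only need to invoke the Python APIs in PyTorch. More detailed information can be found in the PyTorch documentation.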

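How to visualize the PyTorch model hierarchy in Intel® VTune™ Profiler

Two types of usage are provided in PyTorch:

Implicit invocation: By default, all operators registered through the PyTorch operator registration mechanism are labeled automatically when the ITT feature is enabled.
Explicit invocation: If customized labeling is needed, users can call the APIs mentioned in the PyTorch documentation to label a desired range explicitly.

To enable explicit invocation, the code to be labeled should be invoked under a torch.autograd.profiler.emit_itt() scope. For example: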

with torch.autograd.profiler.emit_itt():
  <code-to-be-profiled...>
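
As a minimal sketch of the scope above, the snippet below wraps a small computation in emit_itt(). The helper name itt_scope is hypothetical. Falling back to contextlib.nullcontext() when torch.profiler.itt.is_available() returns False is an assumption made here so the script still runs on PyTorch builds without ITT support; the recipe itself does not prescribe it.

```python
import contextlib

import torch


def itt_scope():
    """Return an emit_itt() scope when this PyTorch build supports ITT.

    Hypothetical helper (not part of PyTorch): on builds without ITT it
    returns a no-op context manager so the same script still runs.
    """
    if torch.profiler.itt.is_available():
        return torch.autograd.profiler.emit_itt()
    return contextlib.nullcontext()


with itt_scope():
    a = torch.rand(64, 64)
    b = torch.rand(64, 64)
    # Operators executed inside the scope are the ones eligible for
    # implicit aten:: labels when the run is collected by VTune.
    c = torch.matmul(a, b)

print(c.shape)
```

Outside of a VTune collection the scope should have no visible effect; the labels only become visible once the script is launched under the profiler.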

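Launch Intel® VTune™ Profiler

To verify the functionality, you need to start an Intel® VTune™ Profiler instance. Please check the Intel® VTune™ Profiler User Guide for the steps to launch Intel® VTune™ Profiler.

Note

Users can also use the web-server UI by following the Intel® VTune™ Profiler Web Server UI Guide, for example: vtune-backend --web-port=8080 --allow-remote-access --enable-server-profiling

Once the Intel® VTune™ Profiler GUI is launched, you should see a user interface like the one below:

Three sample results are available under the sample (matrix) project in the left-side navigation bar. If you do not want profiling results to appear in this default sample project, you can create a new project via the New Project… button under the blue Configure Analysis… button. To start a new profiling run, click the blue Configure Analysis… button to begin configuring it.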

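Configure Profiling for CPU

After you click the Configure Analysis… button, you should see the screen below:

The right side of the window is split into 3 parts: WHERE (top left), WHAT (bottom left), and HOW (right). With WHERE, you can specify the machine on which to run the profiling. With WHAT, you can set the path of the application to profile. To profile a PyTorch script, it is recommended to wrap all manual steps, including activating a Python environment and setting required environment variables, into a bash script, and then profile that bash script. In the screenshot above, all steps are wrapped in the launch.sh bash script, and bash is profiled with <path_of_launch.sh> as its parameter. With HOW, on the right, you can choose whichever profiling type you would like. Intel® VTune™ Profiler provides several profiling types to choose from. Details can be found in the Intel® VTune™ Profiler User Guide.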

Configure Profiling for XPU

Pick the GPU Offload profiling type instead of Hotspots, then follow the same instructions as for CPU to launch the application.
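
The same WHAT and HOW choices can also be written down as a command line for the vtune command-line tool. The sketch below only builds and prints the argument lists and does not launch anything. It assumes that vtune is on PATH when the command is eventually run, and that hotspots and gpu-offload are the command-line analysis names matching the Hotspots and GPU Offload types in the GUI. The function name build_vtune_command and the result directory names are hypothetical.

```python
import shlex


def build_vtune_command(launch_script, analysis="hotspots", result_dir="r000"):
    """Build a vtune command line that profiles `bash <launch_script>`.

    Hypothetical helper: it returns the argument list only, so it can be
    inspected or passed to subprocess.run() by the caller.
    """
    return [
        "vtune",
        "-collect", analysis,        # HOW: the profiling type
        "-result-dir", result_dir,   # where the collected result is stored
        "--", "bash", launch_script, # WHAT: bash with the launch script as its parameter
    ]


cpu_cmd = build_vtune_command("launch.sh", analysis="hotspots", result_dir="r_cpu")
xpu_cmd = build_vtune_command("launch.sh", analysis="gpu-offload", result_dir="r_xpu")

print(shlex.join(cpu_cmd))
print(shlex.join(xpu_cmd))
```

Keeping the command as a list rather than a single string avoids quoting problems if the path to the launch script contains spaces.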

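Read Profiling Result

After a successful profiling run with ITT, you can open the Platform tab of the profiling result to see the labels in the Intel® VTune™ Profiler timeline.

The timeline shows the main thread as a python thread at the top, with the individual OpenMP threads below it. Labeled PyTorch operators and customized regions are shown in the main thread row. All operators starting with aten:: are labeled implicitly by the ITT feature in PyTorch. The iteration_N labels are set explicitly, either with the torch.profiler.itt.range_push() and torch.profiler.itt.range_pop() APIs or with a torch.profiler.itt.range() scope. Please check the sample code in the next section for details.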
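
For the explicit labels described above, one detail worth sketching is keeping range_push and range_pop balanced even when the labeled code raises. The helper below, labeled_call (a hypothetical name), uses try/finally for that purpose. Skipping the ITT calls when torch.profiler.itt.is_available() returns False is an assumption added here for portability.

```python
import torch


def labeled_call(label, fn, *args):
    """Run fn(*args) inside an ITT range named `label` (hypothetical helper)."""
    use_itt = torch.profiler.itt.is_available()
    if use_itt:
        torch.profiler.itt.range_push(label)
    try:
        return fn(*args)
    finally:
        # Always close the range, even if fn raised, so that ranges
        # opened by later calls are not nested inside a stale one.
        if use_itt:
            torch.profiler.itt.range_pop()


x = torch.ones(4, 4)
results = [labeled_call(f"iteration_{i}", torch.sum, x) for i in range(3)]
print([r.item() for r in results])
```

The torch.profiler.itt.range() scope is usually the simpler choice; the explicit pair is mainly useful when the start and end of a region do not fit in a single with block.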

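Note

The red boxes marked with convolution and reorder are labeled by the Intel® oneAPI Deep Neural Network Library (oneDNN).

As shown in the right-side navigation bar, the brown portions of the timeline rows show the CPU usage of individual threads. At a given timestamp, the fraction of a thread row's height that the brown portion occupies corresponds to that thread's CPU usage at that timestamp. From this timeline, you can therefore get an intuitive sense of the following:

How well CPU cores are utilized on each thread.
How balanced CPU core utilization is across all threads. Do all threads have good CPU usage?
How well OpenMP threads are synchronized. Are there jitters when OpenMP threads start or finish?

Of course, Intel® VTune™ Profiler provides a much richer set of profiling features to help you understand a performance issue. Once you understand its root cause, you can fix it. More detailed usage instructions are available in the Intel® VTune™ Profiler User Guide.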

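Read XPU Profiling Result

After a successful profiling run with ITT, you can open the Platform tab of the profiling result to see the labels in the Intel® VTune™ Profiler timeline.

The timeline shows the main thread as a python thread at the top. Labeled PyTorch operators and customized regions are shown in the main thread row. All operators starting with aten:: are labeled implicitly by the ITT feature in PyTorch. The timeline also shows the GPU Computing Queue at the top, where users can see the different XPU kernels dispatched into the GPU queue.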

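A short sample code showcasing how to use the PyTorch ITT APIs

The sample code below is the script that was used for profiling in the screenshots above.

The topology is formed by two operators, Conv2d and Linear. Three iterations of inference are performed. Each iteration is labeled by the PyTorch ITT APIs with the text string iteration_N. Either a pair of torch.profiler.itt.range_push and torch.profiler.itt.range_pop calls or a torch.profiler.itt.range scope provides the customized labeling.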

# sample.py

import torch
import torch.nn as nn

class ITTSample(nn.Module):
  def __init__(self):
    super(ITTSample, self).__init__()
    self.conv = nn.Conv2d(3, 5, 3)
    self.linear = nn.Linear(292820, 1000)

  def forward(self, x):
    x = self.conv(x)
    x = x.view(x.shape[0], -1)
    x = self.linear(x)
    return x

def main():
  m = ITTSample()
  # uncomment the line below for XPU
  # m = m.to("xpu")
  x = torch.rand(10, 3, 244, 244)
  # uncomment the line below for XPU
  # x = x.to("xpu")
  with torch.autograd.profiler.emit_itt():
    for i in range(3):
      # Labeling a region with a pair of range_push and range_pop calls
      #torch.profiler.itt.range_push(f'iteration_{i}')
      #m(x)
      #torch.profiler.itt.range_pop()

      # Labeling a region with range scope
      with torch.profiler.itt.range(f'iteration_{i}'):
        m(x)

if __name__ == '__main__':
  main()
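
The in_features value 292820 passed to nn.Linear follows from the input shape and the convolution settings in the script. The sketch below reproduces that arithmetic in plain Python, assuming the nn.Conv2d defaults used above (stride 1, no padding, dilation 1). The function name conv2d_output_size is hypothetical.

```python
def conv2d_output_size(size, kernel_size, stride=1, padding=0, dilation=1):
    """Output length of one spatial dimension of a 2D convolution."""
    return (size + 2 * padding - dilation * (kernel_size - 1) - 1) // stride + 1


# Input in sample.py: torch.rand(10, 3, 244, 244); layer: nn.Conv2d(3, 5, 3)
out_h = conv2d_output_size(244, 3)
out_w = conv2d_output_size(244, 3)
out_channels = 5

# x.view(x.shape[0], -1) flattens channels * height * width per sample
flattened = out_channels * out_h * out_w
print(out_h, out_w, flattened)  # 242 242 292820
```

If the input resolution or the convolution parameters change, the first argument of nn.Linear has to change with them.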

The launch.sh bash script mentioned with the Intel® VTune™ Profiler GUI screenshot, which wraps all manual steps, is shown below.

#!/bin/bash
# launch.sh

# Resolve the directory that contains both sample.py and launch.sh, so that this bash script can be invoked from any directory
BASEFOLDER=$( cd -- "$( dirname -- "${BASH_SOURCE[0]}" )" &> /dev/null && pwd )
# <Activate a Python environment> (replace this line with the activation command for your environment)
cd "${BASEFOLDER}"
python sample.py