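Profiling to understand torch.compile performance

What to use torch.profiler for

torch.profiler is helpful for understanding the performance of your program at kernel-level granularity. For example, it can show graph breaks and GPU utilization at the level of the whole program. The data provided by the profiler can often help users understand where to investigate further to understand model performance.

To analyze what happens inside an individual kernel, other tools exist: NVIDIA's ncu tool or Inductor's profiling tools can be used.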

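Basics of using torch.profiler and viewing traces

Example program: we'll use this example of profiling resnet18. Notice the following parts of this example program:

It includes a warm-up run, so that compilation has completed before profiling starts (this also warms up systems such as the CUDA caching allocator).
It uses a torch.profiler.profile() context to profile the section we are interested in.
It uses prof.export_chrome_trace("trace.json") to export the profiling artifact.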

import torch
from torchvision.models import resnet18

model = resnet18().cuda()
inputs = [torch.randn((5, 3, 224, 224), device='cuda') for _ in range(10)]

model_c = torch.compile(model)

def fwd_bwd(inp):
    out = model_c(inp)
    out.sum().backward()

# warm up
fwd_bwd(inputs[0])

with torch.profiler.profile() as prof:
    for i in range(1, 4):
        fwd_bwd(inputs[i])
        prof.step()

prof.export_chrome_trace("trace.json")

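Viewing Chrome traces: in the Chrome browser, open chrome://tracing and load the json file. Use the "w" and "s" keys to zoom in and out, and use "a" and "d" to scroll left and right. "?" will show a "help" screen with a list of shortcuts.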

Example of a basic chrome trace, visualized in the chrome://tracing viewer

Here, we observe:

CompiledFunction and CompiledFunctionBackward events, which correspond to the dynamo-compiled regions.
CPU events at the top, and GPU events at the bottom.
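For a quick text summary of an exported trace without opening the viewer, the JSON file can also be inspected directly. The following is a minimal sketch using only the standard library. The helper names load_trace_events and summarize_by_category are hypothetical, and the sketch assumes the file follows the Chrome Trace Event format, in which complete events carry "ph": "X", a "cat" category, and a "dur" duration in microseconds.

```python
import json
from collections import Counter


def load_trace_events(path):
    """Read a Chrome trace file and return its list of event dicts."""
    with open(path) as f:
        data = json.load(f)
    # A Chrome trace is either {"traceEvents": [...]} or a bare list of events.
    return data["traceEvents"] if isinstance(data, dict) else data


def summarize_by_category(events):
    """Count complete ("ph" == "X") events and sum their durations per category."""
    counts, durations = Counter(), Counter()
    for ev in events:
        if ev.get("ph") != "X":
            continue  # skip flow, metadata, and instant events
        cat = ev.get("cat", "<none>")
        counts[cat] += 1
        durations[cat] += ev.get("dur", 0)
    return counts, durations


# Usage (assumes trace.json was written by the example program):
# events = load_trace_events("trace.json")
# counts, durations = summarize_by_category(events)
# for cat, n in counts.most_common():
#     print(f"{cat:20s} events={n:6d} total_us={durations[cat]}")
```

The category names that appear in the output depend on the profiler version, so it is worth printing them once before filtering on any particular one.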

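Flows between CPU and GPU events

Every kernel on the GPU runs after being launched by code running on the CPU. The profiler can draw connections (i.e. "flows") between the GPU and CPU events to show which CPU event launched a GPU kernel. This is particularly helpful because, with a few exceptions, GPU kernels are launched asynchronously.

To view a flow connection, click on a GPU kernel and then click "ac2g":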

Visualization in the chrome://tracing viewer, showing an async flow between a kernel and its launching location.

Alternatively, turn on all flows with the "Flow events" dropdown at the top.
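The same CPU-to-GPU links can be recovered from the exported JSON. The sketch below is an illustration under assumptions, not a description of the profiler's internals: it assumes flow events use the Chrome Trace Event phases "s" (start) and "f" (finish) with a shared "id", and that their category matches the "ac2g" label shown in the viewer. The function name link_flows is hypothetical.

```python
def link_flows(events, flow_cat="ac2g"):
    """Pair flow-start and flow-finish events that share an id.

    Returns a list of (start_ts, finish_ts) tuples, one per matched flow.
    The difference finish_ts - start_ts approximates the delay between the
    CPU-side launch and the start of the GPU kernel.
    """
    # First pass: remember the timestamp of every flow start, keyed by id.
    starts = {
        ev["id"]: ev["ts"]
        for ev in events
        if ev.get("cat") == flow_cat and ev.get("ph") == "s"
    }
    # Second pass: match each flow finish to its start.
    links = []
    for ev in events:
        if ev.get("cat") == flow_cat and ev.get("ph") == "f" and ev["id"] in starts:
            links.append((starts[ev["id"]], ev["ts"]))
    return links


# Usage with a list of event dicts loaded from the exported trace:
# for start_ts, finish_ts in link_flows(events):
#     print(f"launched at {start_ts}, ran at {finish_ts}, delay {finish_ts - start_ts} us")
```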

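Working around CUDA Graph profiling issues

When CUDA graphs are enabled, some CUDA configurations (driver version < 525.85.12 or CUDA < 12) can run into issues between the profiling tools and CUDA graphs. To work around these issues, add an empty profiling context at the top of your program: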

import torch

torch.profiler._utils._init_for_cuda_graphs()

# ... rest of program

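Understanding compilation time

To understand why compilation is taking a long time, you can profile the first invocation of a torch.compile-ed program. Keep in mind that profile traces of compilations can be more distorted than typical profiling traces, because compilation workloads can be quite different from typical PyTorch workloads. In some cases, trace files may also be quite large. Traces larger than 1GB can be difficult to open with the Chrome tracing tool.

Note: roughly the same information can also be obtained in non-graphical format with torch._dynamo.utils.compile_times(). This utility won't show when the compilation steps occur, but it will show the amount of time spent on each step, and the times will not be affected by any profiling overhead.

See the example below: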

import torch
from torchvision.models import resnet18

model = resnet18().cuda()
inputs = [torch.randn((5, 3, 224, 224), device='cuda') for _ in range(10)]

model_c = torch.compile(model)

def fwd_bwd(inp):
    out = model_c(inp)
    out.sum().backward()

def warmup_compile():
    def fn(x):
        return x.sin().relu()

    x = torch.rand((2, 2), device='cuda', requires_grad=True)
    fn_c = torch.compile(fn)
    out = fn_c(x)
    out.sum().backward()

with torch.profiler.profile() as prof:
    with torch.profiler.record_function("warmup compile"):
        warmup_compile()

    with torch.profiler.record_function("resnet18 compile"):
        fwd_bwd(inputs[0])

prof.export_chrome_trace("trace_compile.json")

A visualization in the chrome://tracing viewer, showing dynamo and inductor compilation steps

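Note a few things:

The first invocation should occur during profiling in order to capture compilation.
Add a warm-up compilation in order to initialize any systems that need to be lazily initialized.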
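As a rough cross-check on a compilation trace, the cost of the first call can also be estimated from plain wall-clock timing, which carries no profiler overhead. The sketch below is illustrative only: time_calls and estimate_first_call_overhead are hypothetical helper names, and the optional sync argument is an assumption introduced here so that asynchronous GPU work can be waited on (for CUDA, torch.cuda.synchronize could be passed).

```python
import statistics
import time


def time_calls(fn, args_list, sync=None):
    """Return the wall-clock seconds taken by each call fn(arg), in order.

    sync is an optional zero-argument callable that blocks until outstanding
    asynchronous work has finished; it runs before each clock read.
    """
    timings = []
    for arg in args_list:
        if sync is not None:
            sync()
        start = time.perf_counter()
        fn(arg)
        if sync is not None:
            sync()
        timings.append(time.perf_counter() - start)
    return timings


def estimate_first_call_overhead(timings):
    """First-call time minus the median of the later calls (needs >= 2 timings)."""
    return timings[0] - statistics.median(timings[1:])


# Usage with the example above (fwd_bwd and inputs come from that program):
# timings = time_calls(fwd_bwd, inputs[:5], sync=torch.cuda.synchronize)
# print(f"approx. compile overhead: {estimate_first_call_overhead(timings):.2f} s")
```

This only gives a single number for the whole first call; the trace and compile_times() are still needed to see which compilation step is responsible.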

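Finding graph breaks: "Torch-Compiled Region" and "CompiledFunction"

Although there are logging tools for identifying graph breaks, the profiler provides a quick visual method of identifying them. There are two profiler events to look for: Torch-Compiled Region and CompiledFunction.

Torch-Compiled Region, which was introduced in PyTorch 2.2, is a profiler event that covers the entire compiled region. Graph breaks almost always look the same: nested "Torch-Compiled Region" events.

If you run two separate functions with torch.compile() applied independently to each of them, you should generally expect to see two adjacent (i.e. not stacked or nested) Torch-Compiled Regions. Meanwhile, if you encounter graph breaks (or disable()'ed/skipped regions), expect nested "Torch-Compiled Region" events.

CompiledFunction, which was introduced in PyTorch 2.0, is a profiler event that appears only when Autograd is involved, i.e. when some of the input tensors to the graph have requires_grad=True. Each graph break interrupts a CompiledFunction block, splitting it in two.

When a CompiledFunction appears in a trace, it is typically paired with a CompiledFunctionBackward event in the backward pass. If the backward function is called, a "fwdbwd" link connecting the two events should appear in the trace.

If your use case includes a graph that doesn't require grad and doesn't include "Torch-Compiled Region" events, it can be more difficult to identify whether torch.compile is being applied correctly. One clue can be the existence of Inductor-generated Triton kernels.

See the synthetic example below for a demonstration: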

import torch
import torch._dynamo

class ModelWithBreaks(torch.nn.Module):
    def __init__(self):
        super().__init__()
        def create_sequential():
            return torch.nn.Sequential(
                torch.nn.Linear(128, 128),
                torch.nn.ReLU(),
                torch.nn.Linear(128, 128),
                torch.nn.ReLU(),
            )
        self.mod1 = create_sequential()
        self.mod2 = create_sequential()
        self.mod3 = create_sequential()
        self.mod4 = create_sequential()

    def forward(self, inp):
        mod1 = self.mod1(inp)
        torch._dynamo.graph_break()
        mod2 = self.mod2(mod1)
        torch._dynamo.graph_break()
        mod3 = self.mod3(mod2)
        torch._dynamo.graph_break()
        mod4 = self.mod4(mod3)
        return mod4


model = ModelWithBreaks().cuda()
inputs = [torch.randn((128, 128), device='cuda') for _ in range(10)]

model_c = torch.compile(model)

def fwd_bwd(inp):
    out = model_c(inp)
    out.sum().backward()

# warm up
fwd_bwd(inputs[0])

with torch.profiler.profile() as prof:
    for i in range(1, 4):
        fwd_bwd(inputs[i])
        prof.step()

prof.export_chrome_trace("trace_break.json")

Visualization in the chrome://tracing viewer, showing nested Torch-Compiled Region events and multiple CompiledFunction events, indicating graph breaks.
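The nesting pattern described in this section can also be checked programmatically on an exported trace. The following sketch is a heuristic under stated assumptions: it treats any complete event whose name starts with "Torch-Compiled Region" as a compiled region (a prefix match, since the exact event name may carry a suffix in some PyTorch versions) and reports pairs where one region lies inside another on the same thread. The function name find_nested_compiled_regions is hypothetical.

```python
def find_nested_compiled_regions(events, prefix="Torch-Compiled Region"):
    """Return (outer_ts, inner_ts) pairs where one compiled region contains another."""
    regions = [
        ev for ev in events
        if ev.get("ph") == "X" and ev.get("name", "").startswith(prefix)
    ]
    nested = []
    for outer in regions:
        o_start, o_end = outer["ts"], outer["ts"] + outer["dur"]
        for inner in regions:
            if inner is outer:
                continue
            i_start, i_end = inner["ts"], inner["ts"] + inner["dur"]
            same_thread = (
                (outer.get("pid"), outer.get("tid"))
                == (inner.get("pid"), inner.get("tid"))
            )
            contained = o_start <= i_start and i_end <= o_end
            # Identical intervals are ambiguous, so they are not counted as nesting.
            if same_thread and contained and (o_start, o_end) != (i_start, i_end):
                nested.append((outer["ts"], inner["ts"]))
    return nested


# Usage with event dicts loaded from trace_break.json:
# pairs = find_nested_compiled_regions(events)
# print(f"{len(pairs)} nested region pair(s); non-zero suggests graph breaks")
```

A non-empty result is a hint rather than proof, since disable()'ed or skipped regions produce the same nesting.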

Operator kernels

When an operator is launched, we expect to see a few events:

CPU-side event
Kernel launch (if dealing with a GPU kernel)
GPU-side event

Visualization in the chrome://tracing viewer, showing the three types of events: CPU-side event, kernel launch, and GPU-side event

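Inductor-generated Triton kernels

The CPU-side event should appear as an event prefixed with "triton_". These events currently carry minimal information: the kernel name and a launch, but less than typical aten kernel launches (which contain input shapes, types, etc.).
The kernel launch should appear as cuLaunchKernel instead of cudaLaunchKernel (cudaLaunchKernel is typical for aten ops).
The GPU-side event should appear; how descriptive its name is depends on the Inductor config for unique_kernel_names.

Non-Inductor-generated Triton kernels

The CPU-side event may not appear in traces; the machinery for automatically inserting a profiler event is currently implemented at the Inductor level, so Triton kernels that bypass Inductor may not appear in traces unless users have annotated them manually.
The kernel launch should appear as cuLaunchKernel instead of cudaLaunchKernel (cudaLaunchKernel is typical for aten ops).
The GPU-side event should appear, named similarly to the Triton kernel that was authored.

Inductor-generated CPU kernels

The CPU-side event will not appear in traces; profiling has not been added for these kernels yet.
There are no kernel launch or GPU-side events, since these kernels run on the CPU.

Non-Triton kernels (i.e. aten kernels or custom ops) should also be expected to appear in traces sometimes. Sometimes Inductor will fall back to the original op implementation, in which case you will see a call to the aten op.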
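The naming conventions above make it possible to get a rough split of Triton versus aten activity from an exported trace. The sketch below counts launches by runtime API name and collects CPU-side "triton_" events. It is an illustration under assumptions: the helper names are hypothetical, and it assumes GPU-side kernel events carry the category "kernel", which is used here only to exclude them from the CPU-side list.

```python
from collections import Counter

LAUNCH_APIS = ("cuLaunchKernel", "cudaLaunchKernel")


def count_launch_apis(events):
    """Count kernel-launch events by the launch API that appears as the event name."""
    launches = Counter()
    for ev in events:
        if ev.get("ph") == "X" and ev.get("name") in LAUNCH_APIS:
            launches[ev["name"]] += 1
    return launches


def triton_cpu_event_names(events):
    """Names of complete events prefixed with "triton_" that are not GPU kernels."""
    return [
        ev["name"]
        for ev in events
        if ev.get("ph") == "X"
        and ev.get("name", "").startswith("triton_")
        and ev.get("cat") != "kernel"  # assumption: GPU-side events use cat "kernel"
    ]


# Usage with event dicts loaded from an exported trace:
# launches = count_launch_apis(events)
# print("Triton-style launches:", launches["cuLaunchKernel"])
# print("aten-style launches:  ", launches["cudaLaunchKernel"])
```

A high share of cudaLaunchKernel events in a compiled region may indicate that Inductor is falling back to aten implementations for some ops.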

Launch overhead

One common issue is low GPU utilization. A quick way to identify this is to look for large gaps between kernels on the GPU:

Visualization in the chrome://tracing viewer, showing large gaps between GPU kernels. This indicates that the model is CPU bound, likely due to overhead during kernel launches.

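This is often the result of CPU overhead, e.g. when the time spent on the CPU between kernel launches is larger than the time the GPU spends processing the kernels. The issue is more common for small batch sizes.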
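The gaps can be quantified from an exported trace instead of being judged by eye. The following is a minimal sketch, assuming GPU kernel events are complete events with category "kernel" and that "ts" and "dur" are in microseconds. It merges overlapping kernels, so all streams are treated as a single GPU timeline, which is a simplification. The function name gpu_idle_gaps is hypothetical.

```python
def gpu_idle_gaps(events, min_gap_us=0.0):
    """Return (gaps, utilization) for the GPU kernels in a trace.

    gaps is a list of (gap_start_ts, gap_length_us) for idle periods of at
    least min_gap_us between kernels; utilization is busy time divided by the
    span from the first kernel start to the last kernel end.
    """
    kernels = sorted(
        (ev["ts"], ev["ts"] + ev["dur"])
        for ev in events
        if ev.get("ph") == "X" and ev.get("cat") == "kernel"
    )
    if not kernels:
        return [], 0.0

    gaps, busy = [], 0.0
    cur_start, cur_end = kernels[0]
    for start, end in kernels[1:]:
        if start > cur_end:
            # The GPU was idle between cur_end and start.
            if start - cur_end >= min_gap_us:
                gaps.append((cur_end, start - cur_end))
            busy += cur_end - cur_start
            cur_start, cur_end = start, end
        else:
            # Overlapping or back-to-back kernels extend the current busy interval.
            cur_end = max(cur_end, end)
    busy += cur_end - cur_start

    span = cur_end - kernels[0][0]
    return gaps, (busy / span if span > 0 else 0.0)


# Usage with event dicts loaded from an exported trace:
# gaps, util = gpu_idle_gaps(events, min_gap_us=50)
# print(f"GPU utilization ~{util:.0%}, {len(gaps)} gaps of at least 50 us")
```

Low utilization together with many short kernels is the pattern that points towards launch overhead rather than slow kernels.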

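When using Inductor, enabling CUDA graphs (for example with torch.compile(model, mode="reduce-overhead")) can often help improve performance when launch overhead is a concern.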