Compile Time Caching in `torch.compile`

Created On: Jun 20, 2024 | Last Updated: Feb 27, 2025 | Last Verified: Nov 05, 2024

Author: Oguz Ulgen

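Introduction

The PyTorch compiler provides several caching offerings to reduce compilation latency. This tutorial describes each of them so that you can pick the best option for your use case.

For information on how to configure these caches, see Compile Time Caching Configuration.

Also see our cache performance benchmark: PT CacheBench Benchmarks.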

Prerequisites

Before starting this tutorial, make sure you have the following:

A basic understanding of torch.compile.
PyTorch 2.4 or later

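Caching Offerings

torch.compile provides the following caching offerings:

End-to-end caching (also known as Mega-Cache)
Modular caching of TorchDynamo, TorchInductor, and Triton

Note that caching validates that the cache artifacts are used with the same PyTorch and Triton versions, and, when the device is set to cuda, with the same GPU.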
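Because artifacts are only valid for a matching PyTorch version, Triton version, and GPU, a store shared by several environments may want to keep one entry per combination. The following is a minimal sketch, assuming a hypothetical helper `artifact_key` that is not part of PyTorch. The three inputs could be read from `torch.__version__`, `triton.__version__`, and `torch.cuda.get_device_name()`. The key only organizes storage; it does not replace the compiler's own validation.

```python
import hashlib


def artifact_key(torch_version: str, triton_version: str, device_name: str) -> str:
    """Hypothetical helper: derive a stable storage key for one environment."""
    # Join with a NUL separator so that different triples cannot
    # produce the same string by accident of concatenation.
    raw = "\x00".join((torch_version, triton_version, device_name))
    digest = hashlib.sha256(raw.encode("utf-8")).hexdigest()
    # A 16-character prefix keeps the key short enough for a database column.
    return "megacache-" + digest[:16]
```

Two machines with the same versions and GPU model produce the same key, so they would share one stored entry, while a different GPU model maps to a separate entry.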

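`torch.compile` end-to-end caching (`Mega-Cache`)

End-to-end caching, from here on referred to as Mega-Cache, is the ideal solution for users who want a portable cache that can be stored in a database and fetched later, possibly on a different machine.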

Mega-Cache provides two compiler APIs:

torch.compiler.save_cache_artifacts()
torch.compiler.load_cache_artifacts()

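The intended use case is as follows. After compiling and executing a model, the user calls torch.compiler.save_cache_artifacts(), which returns the compiler artifacts in a portable form. Later, possibly on a different machine, the user calls torch.compiler.load_cache_artifacts() with these artifacts to pre-populate the torch.compile caches and jump-start them.

Consider the following example. First, compile and save the cache artifacts.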

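import torch

# Assumed setup: the original snippet leaves device and dtype undefined
device = "cuda" if torch.cuda.is_available() else "cpu"
dtype = torch.float32

@torch.compile
def fn(x, y):
    return x.sin() @ y

a = torch.rand(100, 100, dtype=dtype, device=device)
b = torch.rand(100, 100, dtype=dtype, device=device)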

result = fn(a, b)

artifacts = torch.compiler.save_cache_artifacts()

assert artifacts is not None
artifact_bytes, cache_info = artifacts

# Now, potentially store artifact_bytes in a database
# You can use cache_info for logging

Later, you can jump-start the cache as follows:

# Potentially download/fetch the artifacts from the database
torch.compiler.load_cache_artifacts(artifact_bytes)

This operation populates all the modular caches discussed in the next section, including PGO, AOTAutograd, Inductor, Triton, and Autotuning.
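The example leaves open how `artifact_bytes` travels between the save step and the load step. The following is a minimal sketch that uses a plain file as the store. The helpers `store_artifacts`, `fetch_artifacts`, and `warm_start` are hypothetical names introduced here for illustration. A real deployment would likely replace the file with a database or object store.

```python
import os
import tempfile
from typing import Optional


def store_artifacts(path: str, artifact_bytes: bytes) -> None:
    """Write the bytes atomically so a reader never sees a partial file."""
    directory = os.path.dirname(os.path.abspath(path))
    os.makedirs(directory, exist_ok=True)
    # Write to a temporary file in the same directory, then rename it.
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(artifact_bytes)
        os.replace(tmp_path, path)  # atomic within one filesystem
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise


def fetch_artifacts(path: str) -> Optional[bytes]:
    """Return the stored bytes, or None if nothing has been stored yet."""
    try:
        with open(path, "rb") as f:
            return f.read()
    except FileNotFoundError:
        return None


def warm_start(path: str) -> bool:
    """Load stored artifacts into the torch.compile caches if they exist."""
    artifact_bytes = fetch_artifacts(path)
    if artifact_bytes is None:
        return False  # cold start: the first call compiles as usual
    import torch  # imported lazily so the storage helpers work without PyTorch

    torch.compiler.load_cache_artifacts(artifact_bytes)
    return True
```

In this sketch, the process that compiled the model would call `store_artifacts(path, artifact_bytes)` after `torch.compiler.save_cache_artifacts()`, and a later process would call `warm_start(path)` before its first compiled call.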

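Modular caching of `TorchDynamo`, `TorchInductor`, and `Triton`

The Mega-Cache described above is composed of individual components that work without any user intervention. By default, the PyTorch compiler comes with local on-disk caches for TorchDynamo, TorchInductor, and Triton. These caches include:

FXGraphCache: A cache of graph-based IR components used in compilation.
TritonCache: A cache of Triton compilation results, including the cubin files generated by Triton and other caching artifacts.
InductorCache: A bundle of FXGraphCache and the Triton cache.
AOTAutogradCache: A cache of joint graph artifacts.
PGO-cache: A cache of dynamic shape decisions that reduces the number of recompilations.

All of these cache artifacts are written to TORCHINDUCTOR_CACHE_DIR, which by default looks like /tmp/torchinductor_myusername.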
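Since the location is controlled by an environment variable, it can be redirected, for example to a disk that survives reboots. The following is a minimal sketch with two hypothetical helpers, `use_cache_dir` and `cache_size_bytes`. It assumes the variable is set before any compilation happens; setting it before importing torch is the safest ordering.

```python
import os


def use_cache_dir(path: str) -> str:
    """Point TORCHINDUCTOR_CACHE_DIR at `path`, creating the directory if needed."""
    path = os.path.abspath(path)
    os.makedirs(path, exist_ok=True)
    os.environ["TORCHINDUCTOR_CACHE_DIR"] = path
    return path


def cache_size_bytes(path: str) -> int:
    """Sum the sizes of all regular files under the cache directory."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            file_path = os.path.join(root, name)
            if os.path.isfile(file_path):
                total += os.path.getsize(file_path)
    return total
```

Calling `cache_size_bytes` before and after a run gives a rough indication of whether new artifacts were written to the directory.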

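Remote Caching

We also provide a remote caching option for users who would like to take advantage of a Redis-based cache. See Compile Time Caching Configuration to learn how to enable Redis-based caching.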
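As a rough illustration, remote caching is typically switched on through environment variables. The sketch below uses a hypothetical helper, `enable_redis_cache`, and assumes the variable names `TORCHINDUCTOR_FX_GRAPH_REMOTE_CACHE`, `TORCHINDUCTOR_AUTOTUNE_REMOTE_CACHE`, `TORCHINDUCTOR_REDIS_HOST`, and `TORCHINDUCTOR_REDIS_PORT`. Check these names against the configuration guide for your PyTorch version. The host and port defaults are placeholders.

```python
import os


def enable_redis_cache(host: str = "localhost", port: int = 6379) -> dict:
    """Hypothetical helper: set the (assumed) environment variables for Redis caching."""
    settings = {
        # Assumed variable names; confirm them for your PyTorch version.
        "TORCHINDUCTOR_FX_GRAPH_REMOTE_CACHE": "1",
        "TORCHINDUCTOR_AUTOTUNE_REMOTE_CACHE": "1",
        "TORCHINDUCTOR_REDIS_HOST": host,
        "TORCHINDUCTOR_REDIS_PORT": str(port),
    }
    os.environ.update(settings)
    return settings
```

As with the cache directory, these variables would need to be set before the first compilation, and a Redis server has to be reachable at the given address.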

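Conclusion

In this tutorial, we have learned that PyTorch Inductor's caching mechanisms significantly reduce compilation latency by using both local and remote caches, which operate in the background without requiring user intervention.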
