(beta) Running the compiled optimizer with an LR Scheduler
Created On: May 21, 2024 | Last Updated: May 21, 2024 | Last Verified: Nov 05, 2024
Author: Michael Lazos
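The optimizer is a key algorithm for training any deep learning model. In this example, we show how to pair an optimizer compiled with torch.compile with an LR scheduler to accelerate training convergence.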
Note
This tutorial requires PyTorch 2.3.0 or later.
Model Setup
For this example, we'll use a simple sequence of linear layers.
import torch
# Create simple model
model = torch.nn.Sequential(
    *[torch.nn.Linear(1024, 1024, False, device="cuda") for _ in range(10)]
)
input = torch.rand(1024, device="cuda")
# run forward pass
output = model(input)
# run backward to populate the grads for our optimizer below
output.sum().backward()
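Setting up and running the compiled optimizer with LR Scheduler
In this section, we'll use the Adam optimizer with the LinearLR scheduler and create a helper function that wraps the step() call for each of them in torch.compile().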
Note
torch.compile is only supported on CUDA devices that have a compute capability of 7.0 or higher.
# exit cleanly if we are on a device that doesn't support ``torch.compile``
if torch.cuda.get_device_capability() < (7, 0):
    print("Exiting because torch.compile is not supported on this device.")
    import sys
    sys.exit(0)
# !!! IMPORTANT !!! Wrap the lr in a Tensor if we are pairing
# the optimizer with an LR Scheduler.
# Without this, torch.compile will recompile as the value of the LR
# changes.
opt = torch.optim.Adam(model.parameters(), lr=torch.tensor(0.01))
sched = torch.optim.lr_scheduler.LinearLR(opt, total_iters=5)
@torch.compile(fullgraph=False)
def fn():
    opt.step()
    sched.step()
# Warmup runs to compile the function
for _ in range(5):
    fn()
    print(opt.param_groups[0]["lr"])
('Grad tensors ["L['self'].param_groups[0]['params'][0].grad", "L['self'].param_groups[0]['params'][1].grad", "L['self'].param_groups[0]['params'][2].grad", "L['self'].param_groups[0]['params'][3].grad", "L['self'].param_groups[0]['params'][4].grad", "L['self'].param_groups[0]['params'][5].grad", "L['self'].param_groups[0]['params'][6].grad", "L['self'].param_groups[0]['params'][7].grad", "L['self'].param_groups[0]['params'][8].grad", "L['self'].param_groups[0]['params'][9].grad"] will be copied during cudagraphs execution.If using cudagraphs and the grad tensor addresses will be the same across runs, use torch._dynamo.decorators.mark_static_address to elide this copy.',)
tensor(0.0047)
tensor(0.0060)
tensor(0.0073)
tensor(0.0087)
tensor(0.0100)
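The printed values can be reproduced by hand. The sketch below is plain Python, not PyTorch code, and assumes the documented LinearLR defaults of start_factor=1/3 and end_factor=1.0 together with the total_iters=5 used above. The helper name linear_lr is hypothetical and exists only for this illustration; it evaluates the closed-form linear ramp between the two factors.

```python
def linear_lr(base_lr, step, start_factor=1.0 / 3, end_factor=1.0, total_iters=5):
    """Closed-form LR after `step` scheduler steps of a linear ramp.

    Assumption: the defaults mirror the documented LinearLR defaults.
    """
    # Fraction of the ramp completed, clamped once total_iters is reached.
    progress = min(step, total_iters) / total_iters
    return base_lr * (start_factor + (end_factor - start_factor) * progress)


# The loop above prints the LR after each of the five sched.step() calls.
lrs = [linear_lr(0.01, step) for step in range(1, 6)]
print([f"{lr:.4f}" for lr in lrs])
# ['0.0047', '0.0060', '0.0073', '0.0087', '0.0100']
```

The LR starts at one third of the base value (about 0.0033 before the first scheduler step) and reaches the full 0.01 after five steps, which matches the tensors printed by the compiled run.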
Extension: What happens with a non-tensor LR?
For the curious, we will show what happens with torch.compile when we don't wrap the LR in a tensor.
# No longer wrap the LR in a tensor here
opt = torch.optim.Adam(model.parameters(), lr=0.01)
sched = torch.optim.lr_scheduler.LinearLR(opt, total_iters=5)
@torch.compile(fullgraph=False)
def fn():
    opt.step()
    sched.step()
# Setup logging to view recompiles
torch._logging.set_logs(recompiles=True)
# Warmup runs to compile the function
# We will now recompile on each iteration
# as the value of the lr is mutated.
for _ in range(5):
    fn()
[rank0]:V0203 17:14:48.091000 634 torch/_dynamo/guards.py:2791] [34/1] [__recompiles] Recompiling function wrapper in /usr/local/lib/python3.10/dist-packages/torch/optim/optimizer.py:473
[rank0]:V0203 17:14:48.091000 634 torch/_dynamo/guards.py:2791] [34/1] [__recompiles] triggered by the following guard failure(s):
[rank0]:V0203 17:14:48.091000 634 torch/_dynamo/guards.py:2791] [34/1] [__recompiles] - 34/0: Cache line invalidated because L['args'][0] got deallocated
[rank0]:V0203 17:14:48.109000 634 torch/_dynamo/guards.py:2791] [35/1] [__recompiles] Recompiling function step in /usr/local/lib/python3.10/dist-packages/torch/optim/adam.py:210
[rank0]:V0203 17:14:48.109000 634 torch/_dynamo/guards.py:2791] [35/1] [__recompiles] triggered by the following guard failure(s):
[rank0]:V0203 17:14:48.109000 634 torch/_dynamo/guards.py:2791] [35/1] [__recompiles] - 35/0: Cache line invalidated because L['self'] got deallocated
('Grad tensors ["L['self'].param_groups[0]['params'][0].grad", "L['self'].param_groups[0]['params'][1].grad", "L['self'].param_groups[0]['params'][2].grad", "L['self'].param_groups[0]['params'][3].grad", "L['self'].param_groups[0]['params'][4].grad", "L['self'].param_groups[0]['params'][5].grad", "L['self'].param_groups[0]['params'][6].grad", "L['self'].param_groups[0]['params'][7].grad", "L['self'].param_groups[0]['params'][8].grad", "L['self'].param_groups[0]['params'][9].grad"] will be copied during cudagraphs execution.If using cudagraphs and the grad tensor addresses will be the same across runs, use torch._dynamo.decorators.mark_static_address to elide this copy.',)
('Grad tensors ["L['self'].param_groups[0]['params'][0].grad", "L['self'].param_groups[0]['params'][1].grad", "L['self'].param_groups[0]['params'][2].grad", "L['self'].param_groups[0]['params'][3].grad", "L['self'].param_groups[0]['params'][4].grad", "L['self'].param_groups[0]['params'][5].grad", "L['self'].param_groups[0]['params'][6].grad", "L['self'].param_groups[0]['params'][7].grad", "L['self'].param_groups[0]['params'][8].grad", "L['self'].param_groups[0]['params'][9].grad"] will be copied during cudagraphs execution.If using cudagraphs and the grad tensor addresses will be the same across runs, use torch._dynamo.decorators.mark_static_address to elide this copy.',)
[rank0]:V0203 17:14:51.280000 634 torch/_dynamo/guards.py:2791] [35/2] [__recompiles] Recompiling function step in /usr/local/lib/python3.10/dist-packages/torch/optim/adam.py:210
[rank0]:V0203 17:14:51.280000 634 torch/_dynamo/guards.py:2791] [35/2] [__recompiles] triggered by the following guard failure(s):
[rank0]:V0203 17:14:51.280000 634 torch/_dynamo/guards.py:2791] [35/2] [__recompiles] - 35/1: L['self'].param_groups[0]['lr'] == 0.003333333333333333
[rank0]:V0203 17:14:51.280000 634 torch/_dynamo/guards.py:2791] [35/2] [__recompiles] - 35/0: Cache line invalidated because L['self'] got deallocated
('Grad tensors ["L['self'].param_groups[0]['params'][0].grad", "L['self'].param_groups[0]['params'][1].grad", "L['self'].param_groups[0]['params'][2].grad", "L['self'].param_groups[0]['params'][3].grad", "L['self'].param_groups[0]['params'][4].grad", "L['self'].param_groups[0]['params'][5].grad", "L['self'].param_groups[0]['params'][6].grad", "L['self'].param_groups[0]['params'][7].grad", "L['self'].param_groups[0]['params'][8].grad", "L['self'].param_groups[0]['params'][9].grad"] will be copied during cudagraphs execution.If using cudagraphs and the grad tensor addresses will be the same across runs, use torch._dynamo.decorators.mark_static_address to elide this copy.',)
[rank0]:V0203 17:14:53.642000 634 torch/_dynamo/guards.py:2791] [35/3] [__recompiles] Recompiling function step in /usr/local/lib/python3.10/dist-packages/torch/optim/adam.py:210
[rank0]:V0203 17:14:53.642000 634 torch/_dynamo/guards.py:2791] [35/3] [__recompiles] triggered by the following guard failure(s):
[rank0]:V0203 17:14:53.642000 634 torch/_dynamo/guards.py:2791] [35/3] [__recompiles] - 35/2: L['self'].param_groups[0]['lr'] == 0.004666666666666667
[rank0]:V0203 17:14:53.642000 634 torch/_dynamo/guards.py:2791] [35/3] [__recompiles] - 35/1: L['self'].param_groups[0]['lr'] == 0.003333333333333333
[rank0]:V0203 17:14:53.642000 634 torch/_dynamo/guards.py:2791] [35/3] [__recompiles] - 35/0: Cache line invalidated because L['self'] got deallocated
('Grad tensors ["L['self'].param_groups[0]['params'][0].grad", "L['self'].param_groups[0]['params'][1].grad", "L['self'].param_groups[0]['params'][2].grad", "L['self'].param_groups[0]['params'][3].grad", "L['self'].param_groups[0]['params'][4].grad", "L['self'].param_groups[0]['params'][5].grad", "L['self'].param_groups[0]['params'][6].grad", "L['self'].param_groups[0]['params'][7].grad", "L['self'].param_groups[0]['params'][8].grad", "L['self'].param_groups[0]['params'][9].grad"] will be copied during cudagraphs execution.If using cudagraphs and the grad tensor addresses will be the same across runs, use torch._dynamo.decorators.mark_static_address to elide this copy.',)
[rank0]:V0203 17:14:55.994000 634 torch/_dynamo/guards.py:2791] [35/4] [__recompiles] Recompiling function step in /usr/local/lib/python3.10/dist-packages/torch/optim/adam.py:210
[rank0]:V0203 17:14:55.994000 634 torch/_dynamo/guards.py:2791] [35/4] [__recompiles] triggered by the following guard failure(s):
[rank0]:V0203 17:14:55.994000 634 torch/_dynamo/guards.py:2791] [35/4] [__recompiles] - 35/3: L['self'].param_groups[0]['lr'] == 0.006000000000000001
[rank0]:V0203 17:14:55.994000 634 torch/_dynamo/guards.py:2791] [35/4] [__recompiles] - 35/2: L['self'].param_groups[0]['lr'] == 0.004666666666666667
[rank0]:V0203 17:14:55.994000 634 torch/_dynamo/guards.py:2791] [35/4] [__recompiles] - 35/1: L['self'].param_groups[0]['lr'] == 0.003333333333333333
[rank0]:V0203 17:14:55.994000 634 torch/_dynamo/guards.py:2791] [35/4] [__recompiles] - 35/0: Cache line invalidated because L['self'] got deallocated
('Grad tensors ["L['self'].param_groups[0]['params'][0].grad", "L['self'].param_groups[0]['params'][1].grad", "L['self'].param_groups[0]['params'][2].grad", "L['self'].param_groups[0]['params'][3].grad", "L['self'].param_groups[0]['params'][4].grad", "L['self'].param_groups[0]['params'][5].grad", "L['self'].param_groups[0]['params'][6].grad", "L['self'].param_groups[0]['params'][7].grad", "L['self'].param_groups[0]['params'][8].grad", "L['self'].param_groups[0]['params'][9].grad"] will be copied during cudagraphs execution.If using cudagraphs and the grad tensor addresses will be the same across runs, use torch._dynamo.decorators.mark_static_address to elide this copy.',)
[rank0]:V0203 17:14:58.354000 634 torch/_dynamo/guards.py:2791] [35/5] [__recompiles] Recompiling function step in /usr/local/lib/python3.10/dist-packages/torch/optim/adam.py:210
[rank0]:V0203 17:14:58.354000 634 torch/_dynamo/guards.py:2791] [35/5] [__recompiles] triggered by the following guard failure(s):
[rank0]:V0203 17:14:58.354000 634 torch/_dynamo/guards.py:2791] [35/5] [__recompiles] - 35/4: L['self'].param_groups[0]['lr'] == 0.007333333333333335
[rank0]:V0203 17:14:58.354000 634 torch/_dynamo/guards.py:2791] [35/5] [__recompiles] - 35/3: L['self'].param_groups[0]['lr'] == 0.006000000000000001
[rank0]:V0203 17:14:58.354000 634 torch/_dynamo/guards.py:2791] [35/5] [__recompiles] - 35/2: L['self'].param_groups[0]['lr'] == 0.004666666666666667
[rank0]:V0203 17:14:58.354000 634 torch/_dynamo/guards.py:2791] [35/5] [__recompiles] - 35/1: L['self'].param_groups[0]['lr'] == 0.003333333333333333
[rank0]:V0203 17:14:58.354000 634 torch/_dynamo/guards.py:2791] [35/5] [__recompiles] - 35/0: Cache line invalidated because L['self'] got deallocated
('Grad tensors ["L['self'].param_groups[0]['params'][0].grad", "L['self'].param_groups[0]['params'][1].grad", "L['self'].param_groups[0]['params'][2].grad", "L['self'].param_groups[0]['params'][3].grad", "L['self'].param_groups[0]['params'][4].grad", "L['self'].param_groups[0]['params'][5].grad", "L['self'].param_groups[0]['params'][6].grad", "L['self'].param_groups[0]['params'][7].grad", "L['self'].param_groups[0]['params'][8].grad", "L['self'].param_groups[0]['params'][9].grad"] will be copied during cudagraphs execution.If using cudagraphs and the grad tensor addresses will be the same across runs, use torch._dynamo.decorators.mark_static_address to elide this copy.',)
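With this example, we can see that the optimizer is recompiled several times because the guard on the lr in param_groups[0] fails each time the scheduler changes its value.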
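To build intuition for the guard failures in the log, here is a toy model of value-specializing guards in plain Python. It is not how TorchDynamo is implemented: ToyCompiler, LrBox, and the guard-key scheme are hypothetical names invented for this sketch. The only assumption it encodes is the one stated in this tutorial, namely that a Python float LR is guarded by value, while an LR held in a tensor-like container can be updated in place without invalidating the compiled code.

```python
class LrBox:
    """Hypothetical stand-in for a 0-dim tensor: a mutable holder for the LR."""

    def __init__(self, value):
        self.value = value


class ToyCompiler:
    """Toy cache that 'compiles' one step function per guard key."""

    def __init__(self):
        self.cache = {}
        self.compile_count = 0

    def _guard_key(self, lr):
        # Floats are specialized on their value, so every new value misses.
        # Boxed LRs are guarded only by kind, so in-place updates still hit.
        if isinstance(lr, float):
            return ("float", lr)
        return ("box",)

    def step(self, param, grad, lr):
        key = self._guard_key(lr)
        if key not in self.cache:
            self.compile_count += 1
            if isinstance(lr, float):
                const = lr  # value baked into the "compiled" function
                self.cache[key] = lambda p, g, _lr: p - const * g
            else:
                # The value is read from the box at call time.
                self.cache[key] = lambda p, g, box: p - box.value * g
        return self.cache[key](param, grad, lr)


# Five distinct LR values from a linear ramp (illustrative numbers only).
schedule = [0.01 * (1 / 3 + (2 / 3) * t / 5) for t in range(5)]

float_compiler = ToyCompiler()
for lr in schedule:
    float_compiler.step(1.0, 0.5, lr)

box_compiler = ToyCompiler()
box = LrBox(schedule[0])
for lr in schedule:
    box.value = lr  # in-place update, like a scheduler writing into a tensor
    box_compiler.step(1.0, 0.5, box)

print(float_compiler.compile_count)  # 5: one "compile" per distinct float
print(box_compiler.compile_count)    # 1: a single "compile" is reused
```

In the toy model the float path compiles once per distinct LR value, mirroring the growing list of failed `lr == ...` guards in the log, while the boxed path compiles once and keeps reusing the same entry.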
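Conclusion
In this tutorial we showed how to pair an optimizer compiled with torch.compile with an LR scheduler to accelerate training convergence. We used a model consisting of a simple sequence of linear layers, with the Adam optimizer paired with a LinearLR scheduler, to demonstrate the LR changing across iterations.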
See also:
Compiled optimizer tutorial - Intro to the compiled optimizer.
Compiling the optimizer with PT2 - Deeper technical details on the compiled optimizer.
Total running time of the script: (0 minutes 15.574 seconds)