(Prototype) Efficiently writing "sparse" semantics for Adagrad with MaskedTensor

Created on: Oct 28, 2022 | Last updated: Oct 28, 2022 | Last verified: not verified

Before working through this tutorial, please review the MaskedTensor Overview and Sparsity tutorials.

Introduction and Motivation

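Issue 1369 discussed the additional lines of code that were introduced while writing "sparse" semantics for Adagrad. In reality, that code uses sparsity as a proxy for masked semantics rather than for the intended use case of sparsity: a compression and optimization technique. Previously, we worked around the lack of formal masked semantics by introducing one-off semantics and operators, while forcing users to be aware of storage details such as indices and values.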

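Now that we have masked semantics, we are better equipped to point out when sparsity is being used as a semantic extension. We will also compare and contrast this with equivalent code written using MaskedTensor. The code snippets are also shown once without the extra comments to highlight the difference in brevity.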

Preparation

import torch
import warnings

# Disable prototype warnings and such
warnings.filterwarnings(action='ignore', category=UserWarning)

# Some hyperparameters
eps = 1e-10
clr = 0.1

i = torch.tensor([[0, 1, 1], [2, 0, 2]])
v = torch.tensor([3, 4, 5], dtype=torch.float32)
grad = torch.sparse_coo_tensor(i, v, [2, 4])

Simplifying Code with MaskedTensor

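Before we get too far into the weeds, let's introduce the problem more concretely. We will look at the (functional) Adagrad implementation in PyTorch, with the ultimate goal of simplifying it and representing the masked approach more faithfully.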

For reference, this is the regular, dense code path without masked gradients or sparsity:

state_sum.addcmul_(grad, grad, value=1)
std = state_sum.sqrt().add_(eps)
param.addcdiv_(grad, std, value=-clr)
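
The three lines above assume that `state_sum`, `param`, and a dense `grad` already exist. As a minimal, self-contained sketch, the block below sets them up with the same initial values used later in this tutorial; the name `dense_grad` is introduced here purely for illustration and holds the dense equivalent of the sparse gradient from the preparation step.

```python
import torch

eps = 1e-10
clr = 0.1

# Same starting values as the sparse walkthrough later in the tutorial
param = torch.arange(8).reshape(2, 4).float()
state_sum = torch.full_like(param, 0.5)

# Hypothetical dense counterpart of the sparse `grad` from the preparation step
dense_grad = torch.tensor([[0.0, 0.0, 3.0, 0.0],
                           [4.0, 0.0, 5.0, 0.0]])

state_sum.addcmul_(dense_grad, dense_grad, value=1)  # state_sum += grad * grad
std = state_sum.sqrt().add_(eps)                     # eps is added to every element
param.addcdiv_(dense_grad, std, value=-clr)          # param -= clr * grad / std

print(param)
```

Because the gradient is zero wherever it is unspecified, the dense update leaves those parameters untouched, so the result should agree with the sparse and masked versions shown later.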

The vanilla tensor implementation for sparse gradients is:

def _make_sparse(grad, grad_indices, values):
    size = grad.size()
    if grad_indices.numel() == 0 or values.numel() == 0:
        return torch.empty_like(grad)
    return torch.sparse_coo_tensor(grad_indices, values, size)

grad = grad.coalesce()  # the update is non-linear so indices must be unique
grad_indices = grad._indices()
grad_values = grad._values()

state_sum.add_(_make_sparse(grad, grad_indices, grad_values.pow(2)))   # a different _make_sparse per layout
std = state_sum.sparse_mask(grad)
std_values = std._values().sqrt_().add_(eps)
param.add_(_make_sparse(grad, grad_indices, grad_values / std_values), alpha=-clr)

while MaskedTensor reduces the code to the following snippet:

state_sum2 = state_sum2 + masked_grad.pow(2).get_data()
std2 = masked_tensor(state_sum2.to_sparse(), mask)
std2 = std2.sqrt().add(eps)
param2 = param2.add((masked_grad / std2).get_data(), alpha=-clr)

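In this tutorial, we will go through each implementation line by line, but at first glance we can already notice (1) how much shorter the MaskedTensor implementation is, and (2) how it avoids conversions between dense and sparse tensors.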

Original Sparse Implementation

Now, let's break down the code with some inline comments:

def _make_sparse(grad, grad_indices, values):
    size = grad.size()
    if grad_indices.numel() == 0 or values.numel() == 0:
        return torch.empty_like(grad)
    return torch.sparse_coo_tensor(grad_indices, values, size)

# We don't support sparse gradients
param = torch.arange(8).reshape(2, 4).float()
state_sum = torch.full_like(param, 0.5)  # initial value for state sum

grad = grad.coalesce()  # the update is non-linear so indices must be unique
grad_indices = grad._indices()
grad_values = grad._values()
# pow(2) has the same semantics for both sparse and dense memory layouts since 0^2 is zero
state_sum.add_(_make_sparse(grad, grad_indices, grad_values.pow(2)))

# We take care to make std sparse, even though state_sum clearly is not.
# This means that we're only applying the gradient to parts of the state_sum
# for which it is specified. This further drives the point home that the passed gradient is not sparse, but masked.
# We currently dodge all these concerns using the private method `_values`.
std = state_sum.sparse_mask(grad)
std_values = std._values().sqrt_().add_(eps)

# Note here that we currently don't support div for sparse Tensors because zero / zero is not well defined,
# so we're forced to perform `grad_values / std_values` outside the sparse semantic and then convert back to a
# sparse tensor with `_make_sparse`.
# We'll later see that MaskedTensor will actually handle these operations for us as well as properly denote
# undefined / undefined = undefined!
param.add_(_make_sparse(grad, grad_indices, grad_values / std_values), alpha=-clr)

tensor([[0.0000, 1.0000, 1.9027, 3.0000],
        [3.9015, 5.0000, 5.9010, 7.0000]])

The third-to-last line, std = state_sum.sparse_mask(grad), is where we have a very important divergence.

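The addition of eps should technically be applied to all values, but instead it is only applied to the specified values. Here we are using sparsity as a semantic extension to enforce a certain pattern of defined and undefined values. If some of the gradient's values are zero, they are still included once materialized, even though other sparse storage layouts could compress them away. This is theoretically quite brittle! That said, one could argue that eps is always very small, so it might not matter much in practice.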

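Moreover, an implementation of add_ that treats sparsity purely as a storage layout and compression scheme should densify its result, but we force it not to for performance reasons. For this one-off case that is fine, until we want to introduce new compression schemes such as CSC, BSR, or BSC. We would then need to introduce a separate Tensor type for each format and write variations for gradients compressed with different storage formats, which is inconvenient, does not scale well, and is not clean.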
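
To make the point about eps concrete, the sketch below compares the dense computation of std with the `sparse_mask` route. It rebuilds the tutorial's `grad` and the already-updated `state_sum` so that it can run on its own; the names `dense_std` and `sparse_std_values` are illustrative and not part of the Adagrad implementation.

```python
import torch

eps = 1e-10

# Same sparse gradient as in the preparation step, coalesced so indices are unique
i = torch.tensor([[0, 1, 1], [2, 0, 2]])
v = torch.tensor([3, 4, 5], dtype=torch.float32)
grad = torch.sparse_coo_tensor(i, v, [2, 4]).coalesce()

# state_sum after the squared gradient has been accumulated
state_sum = torch.full((2, 4), 0.5) + grad.to_dense().pow(2)

# Dense semantics: sqrt and eps touch all eight elements
dense_std = state_sum.sqrt().add(eps)

# Sparse-as-mask semantics: only the entries specified in `grad` are kept
sparse_std = state_sum.sparse_mask(grad).coalesce()
sparse_std_values = sparse_std.values().sqrt().add(eps)

print("elements touched (dense): ", dense_std.numel())
print("elements touched (sparse):", sparse_std_values.numel())
print(sparse_std_values)
```

The sparse route never computes sqrt or adds eps for the five unspecified positions, which is exactly the behavior of a mask rather than of a compression scheme.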

MaskedTensor Sparse Implementation

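We have been conflating sparsity as an optimization with sparsity as a semantic extension to PyTorch. MaskedTensor proposes to disentangle the sparsity optimization from the semantic extension; for example, we currently cannot have dense semantics with sparse storage, or masked semantics with dense storage. MaskedTensor enables these ideas by purposefully separating the storage from the semantics.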
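
As a small sketch of "masked semantics with dense storage", the block below wraps an ordinary strided tensor and a boolean mask in a MaskedTensor. It assumes a PyTorch release that ships the prototype `torch.masked` module; because the API is a prototype, the exact behavior may differ between versions. The values of `data` and `mask` are chosen to mirror the tutorial's `state_sum` and gradient pattern.

```python
import warnings

import torch
from torch.masked import masked_tensor

# MaskedTensor is a prototype feature and emits UserWarnings
warnings.filterwarnings(action="ignore", category=UserWarning)

# Dense (strided) storage for both the data and the mask
data = torch.tensor([[0.5, 0.5, 9.5, 0.5],
                     [16.5, 0.5, 25.5, 0.5]])
mask = torch.tensor([[False, False, True, False],
                     [True, False, True, False]])

# sqrt and add are only meaningful where the mask is True
std = masked_tensor(data, mask).sqrt().add(1e-10)
print(std)

# Fill the masked-out positions with an explicit value to get a plain tensor back
dense_view = std.to_tensor(0.0)
print(dense_view)
```

Here the storage is fully dense while the semantics are masked, which the sparse-tensor workaround cannot express.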

Consider the Adagrad example above, now using a masked gradient:

# Let's now import MaskedTensor!
from torch.masked import masked_tensor

# Create an entirely new set of parameters to avoid errors
param2 = torch.arange(8).reshape(2, 4).float()
state_sum2 = torch.full_like(param2, 0.5)  # initial value for state sum

mask = (grad.to_dense() != 0).to_sparse()
masked_grad = masked_tensor(grad, mask)

state_sum2 = state_sum2 + masked_grad.pow(2).get_data()
std2 = masked_tensor(state_sum2.to_sparse(), mask)

# We can add support for in-place operations later. Notice how this doesn't
# need to access any storage internals and is in general a lot shorter
std2 = std2.sqrt().add(eps)

param2 = param2.add((masked_grad / std2).get_data(), alpha=-clr)

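Note that the two implementations look quite similar, but the MaskedTensor implementation is shorter and simpler. In particular, much of the boilerplate code around _make_sparse (and the need for a separate implementation per layout) is handled for the user by MaskedTensor.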

Now, let's print both this version and the original version for easier comparison:

print("state_sum:\n", state_sum)
print("state_sum2:\n", state_sum2)

state_sum:
 tensor([[ 0.5000,  0.5000,  9.5000,  0.5000],
        [16.5000,  0.5000, 25.5000,  0.5000]])
state_sum2:
 tensor([[ 0.5000,  0.5000,  9.5000,  0.5000],
        [16.5000,  0.5000, 25.5000,  0.5000]])

print("std:\n", std)
print("std2:\n", std2)

std:
 tensor(indices=tensor([[0, 1, 1],
                       [2, 0, 2]]),
       values=tensor([3.0822, 4.0620, 5.0498]),
       size=(2, 4), nnz=3, layout=torch.sparse_coo)
std2:
 MaskedTensor(
  [
    [      --,       --,   3.0822,       --],
    [  4.0620,       --,   5.0498,       --]
  ]
)

print("param:\n", param)
print("param2:\n", param2)

param:
 tensor([[0.0000, 1.0000, 1.9027, 3.0000],
        [3.9015, 5.0000, 5.9010, 7.0000]])
param2:
 tensor([[0.0000, 1.0000, 1.9027, 3.0000],
        [3.9015, 5.0000, 5.9010, 7.0000]])
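
The printed values can also be checked by hand. The following sketch redoes the Adagrad arithmetic for the three specified entries in plain Python, using the hyperparameters and initial values from this tutorial; the dictionary `entries` is just an illustrative way to organize the numbers.

```python
import math

eps = 1e-10
clr = 0.1

# (row, col) -> (initial parameter value, gradient value) for the specified entries
entries = {(0, 2): (2.0, 3.0), (1, 0): (4.0, 4.0), (1, 2): (6.0, 5.0)}

updated = {}
for index, (param_value, grad_value) in entries.items():
    state_sum_value = 0.5 + grad_value ** 2          # state_sum starts at 0.5
    std_value = math.sqrt(state_sum_value) + eps     # std = sqrt(state_sum) + eps
    updated[index] = param_value - clr * grad_value / std_value
    print(index,
          f"state_sum={state_sum_value:.4f}",
          f"std={std_value:.4f}",
          f"param={updated[index]:.4f}")
```

Each line should reproduce the corresponding state_sum, std, and param entries printed above, up to rounding.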

Conclusion

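In this tutorial, we discussed how native masked semantics can enable a cleaner developer experience for Adagrad's existing implementation in PyTorch, which used sparsity as a proxy for writing masked semantics. More importantly, making masked semantics a first-class citizen through MaskedTensor removes the reliance on sparsity or unreliable hacks to mimic masking. This allows masking and sparsity to be developed independently, while still enabling sparse semantics such as the one shown here.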

Further Reading

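To continue learning more, see our final review (for now) on MaskedTensor Advanced Semantics, which covers some of the differences in design decisions between MaskedTensor and NumPy's MaskedArray, as well as reduction semantics.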
