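Tensor Parallelism - torch.distributed.tensor.parallel

Tensor Parallelism (TP) is built on top of PyTorch DistributedTensor (DTensor) and provides different parallelism styles: Colwise, Rowwise, and Sequence Parallelism.

Warning

Tensor Parallelism APIs are experimental and subject to change.

The entrypoint to parallelize your nn.Module using Tensor Parallelism is: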

torch.distributed.tensor.parallel.parallelize_module(module, device_mesh, parallelize_plan)

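Apply Tensor Parallelism in PyTorch by parallelizing modules or sub-modules based on a user-specified plan.

The module or sub-modules are parallelized according to parallelize_plan. The parallelize_plan contains ParallelStyle objects, which indicate how the user wants each module or sub-module to be parallelized.

Users can also specify a different parallel style per module fully qualified name (FQN).

Note that parallelize_module only accepts a 1-D DeviceMesh. If you have a 2-D or N-D DeviceMesh, slice it to a 1-D sub-DeviceMesh first and then pass that to this API (e.g. device_mesh["tp"]).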
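To make the 1-D requirement concrete, here is a minimal single-process sketch of which ranks end up in the same "tp" sub-mesh. It does not create a real DeviceMesh; the 2 x 4 layout, the row-major rank order, and the helper tp_group_of are assumptions made for illustration.

```python
import torch

# Hypothetical layout: 8 ranks arranged as 2 data-parallel replicas
# by 4 tensor-parallel ranks, numbered in row-major order.
dp_size, tp_size = 2, 4
rank_grid = torch.arange(dp_size * tp_size).view(dp_size, tp_size)


def tp_group_of(rank: int) -> list:
    """Return the ranks that would share a 1-D "tp" sub-mesh with `rank`."""
    dp_index = rank // tp_size  # row of the grid that contains this rank
    return rank_grid[dp_index].tolist()


tp_groups = [tp_group_of(r) for r in range(dp_size * tp_size)]
print(tp_groups[0])  # [0, 1, 2, 3]
print(tp_groups[5])  # [4, 5, 6, 7]
```

In a real job the equivalent objects would come from init_device_mesh("cuda", (2, 4), mesh_dim_names=("dp", "tp")) followed by slicing with ["tp"], and each rank would pass its own 1-D slice to parallelize_module.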

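Parameters

module (nn.Module) – Module to be parallelized.
device_mesh (DeviceMesh) – Object which describes the mesh topology of devices for the DTensor.
parallelize_plan (Union[ParallelStyle, Dict[str, ParallelStyle]]) – The plan used to parallelize the module. It can be either a ParallelStyle object, which describes how inputs and outputs are prepared for Tensor Parallelism, or a dict mapping module FQNs to their corresponding ParallelStyle objects.

Returns

A parallelized nn.Module object.

Return type

Module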

Example:

>>> from torch.distributed.tensor.parallel import parallelize_module, ColwiseParallel, RowwiseParallel
>>> from torch.distributed.device_mesh import init_device_mesh
>>>
>>> # Define the module.
>>> m = Model(...)
>>> tp_mesh = init_device_mesh("cuda", (8,))
>>> m = parallelize_module(m, tp_mesh, {"w1": ColwiseParallel(), "w2": RowwiseParallel()})
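The keys of a dict-style plan are module FQNs. One way to discover them is to walk named_modules(), sketched below. The FeedForward and Block classes are hypothetical stand-ins for a user model, and nothing here touches the distributed APIs.

```python
import torch
import torch.nn as nn


class FeedForward(nn.Module):  # hypothetical module, for illustration only
    def __init__(self, dim: int = 16, hidden: int = 64):
        super().__init__()
        self.w1 = nn.Linear(dim, hidden)
        self.w2 = nn.Linear(hidden, dim)

    def forward(self, x):
        return self.w2(torch.relu(self.w1(x)))


class Block(nn.Module):  # hypothetical parent module
    def __init__(self):
        super().__init__()
        self.norm = nn.LayerNorm(16)
        self.mlp = FeedForward()

    def forward(self, x):
        return self.mlp(self.norm(x))


block = Block()
# FQNs are relative to the module that is passed to parallelize_module.
linear_fqns = [name for name, mod in block.named_modules() if isinstance(mod, nn.Linear)]
print(linear_fqns)  # ['mlp.w1', 'mlp.w2']
```

With these names, a plan for block would be keyed by "mlp.w1" and "mlp.w2", whereas a plan applied directly to block.mlp would use "w1" and "w2".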

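Note

For complex module architectures like Attention and MLP layers, we recommend composing different ParallelStyles together (e.g. ColwiseParallel and RowwiseParallel) and passing them as a parallelize_plan to achieve the desired sharding computation.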
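The reason the column-wise then row-wise pairing works can be checked with plain tensors. The sketch below emulates the arithmetic in a single process and does not use the TP API: the shapes, the tp_size value, and the bias-free two-layer MLP are assumptions, and the final sum stands in for the all-reduce that a real run would perform.

```python
import torch

torch.manual_seed(0)
tp_size = 4  # hypothetical number of tensor-parallel ranks
x = torch.randn(2, 8, dtype=torch.float64)    # replicated input
w1 = torch.randn(16, 8, dtype=torch.float64)  # nn.Linear layout: (out_features, in_features)
w2 = torch.randn(8, 16, dtype=torch.float64)

# Reference: the unsharded MLP.
expected = torch.relu(x @ w1.T) @ w2.T

# Column-wise: split w1 along out_features (dim 0 of the weight), so each
# rank produces one slice of the hidden activation's last dimension.
w1_shards = w1.chunk(tp_size, dim=0)
# Row-wise: split w2 along in_features (dim 1 of the weight), so each rank
# consumes exactly the hidden slice it produced.
w2_shards = w2.chunk(tp_size, dim=1)

partials = [torch.relu(x @ a.T) @ b.T for a, b in zip(w1_shards, w2_shards)]
# Summing the partial outputs plays the role of the all-reduce.
actual = torch.stack(partials).sum(dim=0)

print(torch.allclose(actual, expected))  # True
```

Because the element-wise activation acts independently on each hidden slice, no communication is needed between the two linear layers; only the final partial sums have to be combined.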

Tensor Parallelism supports the following parallel styles:

class torch.distributed.tensor.parallel.ColwiseParallel(*, input_layouts=None, output_layouts=None, use_local_output=True)

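Partition a compatible nn.Module in a column-wise fashion. Currently supports nn.Linear and nn.Embedding. Users can compose it with RowwiseParallel to achieve the sharding of more complicated modules (e.g. MLP, Attention).

Keyword Arguments

input_layouts (Placement, optional) – The DTensor layout of the input tensor for the nn.Module. This is used to annotate the input tensor to become a DTensor. If not specified, the input tensor is assumed to be replicated.
output_layouts (Placement, optional) – The DTensor layout of the output for the nn.Module. This is used to ensure the output of the nn.Module has the user-desired layout. If not specified, the output tensor is sharded on the last dimension.
use_local_output (bool, optional) – Whether to use a local torch.Tensor instead of a DTensor for the module output. Default: True.

Returns

A ParallelStyle object that represents column-wise sharding of the nn.Module.

Example: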

>>> from torch.distributed.tensor.parallel import parallelize_module, ColwiseParallel
>>> from torch.distributed.device_mesh import init_device_mesh
>>> ...
>>> m = Model(...)  # m is a nn.Module that contains a "w1" nn.Linear submodule
>>> tp_mesh = init_device_mesh("cuda", (8,))
>>>
>>> # By default, the input of the "w1" Linear will be converted to a replicated DTensor
>>> # and the output of "w1" will be a torch.Tensor that shards on the last dim.
>>>
>>> sharded_mod = parallelize_module(m, tp_mesh, {"w1": ColwiseParallel()})
>>> ...

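Note

By default, the ColwiseParallel output is sharded on the last dimension if output_layouts is not specified. If there are operators that require a specific tensor shape (e.g. before the paired RowwiseParallel), keep in mind that when the output is sharded, the operator might need to be adjusted to the sharded size.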
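A common instance of this adjustment is the head reshape in attention. The following single-process sketch shows only the shape arithmetic; the dimensions, tp_size, and the variable names are assumptions, and the local shard is produced with chunk rather than by the TP API.

```python
import torch

tp_size = 4  # hypothetical TP degree
bsz, seq_len, n_heads, head_dim = 2, 5, 8, 16
dim = n_heads * head_dim

x = torch.randn(bsz, seq_len, dim)
wq = torch.randn(dim, dim)  # (out_features, in_features)

# Local shard of a column-wise wq on one rank: out_features is divided by tp_size.
wq_local = wq.chunk(tp_size, dim=0)[0]
q_local = x @ wq_local.T
print(q_local.shape)  # torch.Size([2, 5, 32])

# A view that uses the global head count would not fit the local shard,
# so the reshape uses the per-rank head count instead.
n_local_heads = n_heads // tp_size
q_heads = q_local.view(bsz, seq_len, n_local_heads, head_dim)
print(q_heads.shape)  # torch.Size([2, 5, 2, 16])
```

This assumes n_heads is divisible by the TP degree, so that each rank owns whole heads.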

class torch.distributed.tensor.parallel.RowwiseParallel(*, input_layouts=None, output_layouts=None, use_local_output=True)

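Partition a compatible nn.Module in a row-wise fashion. Currently supports nn.Linear and nn.Embedding. Users can compose it with ColwiseParallel to achieve the sharding of more complicated modules (e.g. MLP, Attention).

Keyword Arguments

input_layouts (Placement, optional) – The DTensor layout of the input tensor for the nn.Module. This is used to annotate the input tensor to become a DTensor. If not specified, the input tensor is assumed to be sharded on the last dimension.
output_layouts (Placement, optional) – The DTensor layout of the output for the nn.Module. This is used to ensure the output of the nn.Module has the user-desired layout. If not specified, the output tensor is replicated.
use_local_output (bool, optional) – Whether to use a local torch.Tensor instead of a DTensor for the module output. Default: True.

Returns

A ParallelStyle object that represents row-wise sharding of the nn.Module.

Example: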

>>> from torch.distributed.tensor.parallel import parallelize_module, RowwiseParallel
>>> from torch.distributed.device_mesh import init_device_mesh
>>> ...
>>> m = Model(...)  # m is a nn.Module that contains a "w2" nn.Linear submodule
>>> tp_mesh = init_device_mesh("cuda", (8,))
>>>
>>> # By default, the input of the "w2" Linear will be converted to a DTensor that shards on the last dim
>>> # and the output of "w2" will be a replicated torch.Tensor.
>>>
>>> sharded_mod = parallelize_module(m, tp_mesh, {"w2": RowwiseParallel()})
>>> ...

class torch.distributed.tensor.parallel.SequenceParallel(*, sequence_dim=1, use_local_output=False)

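SequenceParallel replicates the parameters of a compatible nn.Module and runs the sharded computation on input that is sharded on the sequence dimension. This currently supports nn.LayerNorm, nn.Dropout, and the RMSNorm Python implementation.

This style implements the operation described in the paper Reducing Activation Recomputation in Large Transformer Models.

Both the input and the output of the nn.Module will be sharded on the sequence dimension.

Keyword Arguments

sequence_dim (int, optional) – The sequence dimension of the input tensor for the nn.Module. This is used to annotate the input tensor to become a DTensor that is sharded on the sequence dimension. Default: 1.
use_local_output (bool, optional) – Whether to use a local torch.Tensor instead of a DTensor for the module output. Default: False.

Returns

A ParallelStyle object that represents sequence-parallel execution of the nn.Module.

Example: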

>>> from torch.distributed.tensor.parallel import parallelize_module, SequenceParallel
>>> from torch.distributed.device_mesh import init_device_mesh
>>> ...
>>> m = Model(...)  # m is a nn.Module that contains a "norm" nn.LayerNorm submodule
>>> tp_mesh = init_device_mesh("cuda", (8,))
>>>
>>> # By default, the input of "norm" will be converted to a DTensor that shards on the sequence dim
>>> # and the output of "norm" will be a DTensor sharded on the sequence dim.
>>>
>>> sharded_mod = parallelize_module(m, tp_mesh, {"norm": SequenceParallel()})
>>> ...

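Note

The SequenceParallel style assumes ones initialization if there are weights in the nn.Module (e.g. nn.LayerNorm or RMSNorm, which have ones initialization by default). If you have custom initializations for the weights of those modules, you need to broadcast the weights before or after parallelizing to ensure that they are replicated.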
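The single-process sketch below illustrates why sharding on the sequence dimension is safe for a normalization layer, and why the weights must be identical on every rank. It emulates the shards with chunk and does not use the TP API; the shapes, tp_size, and the "drifted" layer are assumptions for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
tp_size = 4  # hypothetical TP degree
norm = nn.LayerNorm(16).double()
x = torch.randn(2, 8, 16, dtype=torch.float64)  # (batch, sequence, hidden)

expected = norm(x)

# Shard on sequence_dim=1; every "rank" applies the same (replicated) weights.
# LayerNorm normalizes over the hidden dim, so each token is independent.
seq_shards = x.chunk(tp_size, dim=1)
actual = torch.cat([norm(shard) for shard in seq_shards], dim=1)
print(torch.allclose(actual, expected))  # True

# If one "rank" held different weights (e.g. an unsynchronized custom init),
# the sharded result would no longer match the unsharded one.
drifted = nn.LayerNorm(16).double()
with torch.no_grad():
    drifted.weight.fill_(2.0)
mismatched = torch.cat(
    [drifted(seq_shards[0])] + [norm(s) for s in seq_shards[1:]], dim=1
)
print(torch.allclose(mismatched, expected))  # False
```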

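To simply configure the inputs and outputs of an nn.Module with DTensor layouts and perform the necessary layout redistributions, without distributing the module parameters to DTensors, the following ParallelStyles can be used in the parallelize_plan when calling parallelize_module: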

class torch.distributed.tensor.parallel.PrepareModuleInput(*, input_layouts, desired_input_layouts, use_local_output=False)

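Configure the inputs of the nn.Module so that, at runtime, the input tensors are converted to DTensors according to input_layouts and then redistributed according to desired_input_layouts.

Keyword Arguments

input_layouts (Union[Placement, Tuple[Placement]]) – The DTensor layouts of the input tensors for the nn.Module. This is used to convert the input tensors to DTensors. If some inputs are not torch.Tensor or do not need to be converted to DTensors, None needs to be specified as a placeholder.
desired_input_layouts (Union[Placement, Tuple[Placement]]) – The desired DTensor layouts of the input tensors for the nn.Module. This is used to ensure the inputs of the nn.Module have the desired DTensor layouts. This argument needs to have the same length as input_layouts.
use_local_output (bool, optional) – Whether to use a local torch.Tensor instead of a DTensor for the module inputs. Default: False.

Returns

A ParallelStyle object that prepares the sharding layouts of the nn.Module's inputs.

Example: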

>>> from torch.distributed.tensor import Replicate, Shard  # older releases: torch.distributed._tensor
>>> from torch.distributed.tensor.parallel import parallelize_module, PrepareModuleInput
>>> from torch.distributed.device_mesh import init_device_mesh
>>> ...
>>> block = TransformerBlock(...)  # block is a nn.Module that contains an "attn" Attention submodule
>>> tp_mesh = init_device_mesh("cuda", (8,))
>>>
>>> # According to the style specified below, the first input of attn will be annotated as a sharded DTensor
>>> # and then redistributed to a replicated DTensor.
>>> parallelize_module(
...     block,  # this can be a submodule or module
...     tp_mesh,
...     parallelize_plan={
...         "attn": PrepareModuleInput(
...             input_layouts=(Shard(0), None, None, ...),
...             desired_input_layouts=(Replicate(), None, None, ...)
...         ),
...     }
... )

class torch.distributed.tensor.parallel.PrepareModuleOutput(*, output_layouts, desired_output_layouts, use_local_output=True)

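Configure the outputs of the nn.Module so that, at runtime, the output tensors are converted to DTensors according to output_layouts and then redistributed according to desired_output_layouts.

Keyword Arguments

output_layouts (Union[Placement, Tuple[Placement]]) – The DTensor layouts of the output tensors for the nn.Module. This is used to convert the output tensors to DTensors if they are torch.Tensor. If some outputs are not torch.Tensor or do not need to be converted to DTensors, None needs to be specified as a placeholder.
desired_output_layouts (Union[Placement, Tuple[Placement]]) – The desired DTensor layouts of the output tensors for the nn.Module. This is used to ensure the outputs of the nn.Module have the desired DTensor layouts.
use_local_output (bool, optional) – Whether to use a local torch.Tensor instead of a DTensor for the module outputs. Default: True.

Returns

A ParallelStyle object that prepares the sharding layouts of the nn.Module's outputs.

Example: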

>>> from torch.distributed.tensor import Replicate, Shard  # older releases: torch.distributed._tensor
>>> from torch.distributed.tensor.parallel import parallelize_module, PrepareModuleOutput
>>> from torch.distributed.device_mesh import init_device_mesh
>>> ...
>>> block = TransformerBlock(...)  # block is a nn.Module that contains an "attn" Attention submodule
>>> tp_mesh = init_device_mesh("cuda", (8,))
>>>
>>> # According to the style specified below, the output of the TransformerBlock will be converted to a replicated DTensor
>>> # and then redistributed to a sharded DTensor.
>>> parallelize_module(
...     block,  # this can be a submodule or module
...     tp_mesh,
...     parallelize_plan=PrepareModuleOutput(
...         output_layouts=Replicate(),
...         desired_output_layouts=Shard(0)
...     )
... )
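As a rough mental model of the two redistributions used in the PrepareModuleInput and PrepareModuleOutput examples, the sketch below emulates them with plain tensors in one process. It is not how DTensor is implemented; tp_size, the tensor contents, and the chosen rank are assumptions.

```python
import torch

tp_size = 4  # hypothetical TP degree
full = torch.arange(8 * 3, dtype=torch.float32).view(8, 3)

# Shard(0): each rank holds a contiguous block of rows.
local_shards = list(full.chunk(tp_size, dim=0))

# Shard(0) -> Replicate(): every rank ends up with all rows (an all-gather).
replicated = torch.cat(local_shards, dim=0)
print(torch.equal(replicated, full))  # True

# Replicate() -> Shard(0): each rank keeps only its own block of rows.
rank = 1
resharded = replicated.chunk(tp_size, dim=0)[rank]
print(resharded.shape)  # torch.Size([2, 3])
```

The first direction requires communication between ranks, while the second can be done locally because every rank already holds the full tensor.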

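Note

When using Shard(dim) as the input/output layout for the above ParallelStyles, we assume the input/output activation tensors are evenly sharded on the tensor dimension dim across the DeviceMesh that TP operates on. For instance, since RowwiseParallel accepts input that is sharded on the last dimension, it assumes the input tensor has already been evenly sharded on the last dimension. For unevenly sharded activation tensors, one can pass a DTensor directly to the partitioned modules and use use_local_output=False to return a DTensor after each ParallelStyle, so that the DTensor can track the uneven sharding information.

For models like Transformer, we recommend using ColwiseParallel and RowwiseParallel together in the parallelize_plan to achieve the desired sharding for the entire model (i.e. Attention and MLP).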
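To see why the even-sharding assumption in the note above matters, consider a dimension whose size is not divisible by the TP degree. The sketch below assumes a torch.chunk-style split (an assumption about the shard layout, used here only for illustration) and shows that a bare local tensor cannot tell a rank what the global size is.

```python
import torch

tp_size = 4  # hypothetical TP degree
seq = torch.arange(10)  # a dimension of size 10 cannot be split evenly 4 ways

shard_sizes = [s.numel() for s in torch.chunk(seq, tp_size)]
print(shard_sizes)  # [3, 3, 3, 1]
is_even = len(set(shard_sizes)) == 1

# A rank that only sees its local tensor and assumes even sharding would
# infer a global size of local_size * tp_size, which is wrong on every rank here.
naive_global_sizes = [n * tp_size for n in shard_sizes]
print(naive_global_sizes)  # [12, 12, 12, 4]
```

Passing DTensors between modules avoids this guesswork because the global shape and placement travel with the tensor.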

Parallelized cross-entropy loss computation (loss parallelism) is supported via the following context manager:

torch.distributed.tensor.parallel.loss_parallel()

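A context manager that enables loss parallelism, where efficient parallelized loss computation can be performed when the input is sharded on the class dimension. Currently only the cross-entropy loss is supported.

Within this context manager, one can use cross_entropy() or CrossEntropyLoss as usual, with the following assumptions on the input parameters. The corresponding backward() call, if any, also needs to happen under this context manager.

Parameters

input (DTensor) – Input logits. Assumed to be sharded on the class dimension.
target (Union[torch.Tensor, DTensor]) – Must be ground-truth class indices (class probabilities are currently not supported). Assumed to be replicated across the DeviceMesh.
weight (Union[torch.Tensor, DTensor], optional) – If given, assumed to be replicated across the DeviceMesh.
label_smoothing – Currently not supported.

Returns

A replicated DTensor.

Example

A sharded DTensor is manually created here to showcase the usage. In practice, it is usually the output of a TP module.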

>>> import torch
>>> import torch.nn.functional as F
>>> from torch.distributed.tensor import Shard, distribute_tensor  # older releases: torch.distributed._tensor
>>> from torch.distributed.tensor.parallel import loss_parallel
>>> from torch.distributed.device_mesh import init_device_mesh
>>> ...
>>> device_mesh = init_device_mesh("cuda", (8,))
>>> input = torch.randn(4, 16, device="cuda", requires_grad=True)
>>> dist_input = distribute_tensor(input, device_mesh, placements=[Shard(1)])
>>> target = torch.randint(16, (4,), device="cuda")
>>> with loss_parallel():
...     loss = F.cross_entropy(dist_input, target, reduction="mean")
...     loss.backward()
>>> ...
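The idea behind computing cross-entropy on class-sharded logits can be sketched in a single process. The code below is an illustration of the general technique under stated assumptions (evenly sized shards, mean reduction, no class weights) and is not a description of how loss_parallel is implemented internally; the stack-and-reduce steps stand in for collective operations.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
tp_size = 4  # hypothetical TP degree
logits = torch.randn(4, 16, dtype=torch.float64)  # (batch, classes)
target = torch.randint(16, (4,))

expected = F.cross_entropy(logits, target, reduction="mean")

shards = logits.chunk(tp_size, dim=1)  # shard on the class dimension
shard_width = shards[0].shape[1]

# Step 1: global row-wise max (a MAX reduction over local maxima), for stability.
global_max = torch.stack([s.max(dim=1).values for s in shards]).max(dim=0).values
# Step 2: global sum of exponentials (a SUM reduction over local sums).
sum_exp = torch.stack(
    [(s - global_max[:, None]).exp().sum(dim=1) for s in shards]
).sum(dim=0)
log_z = global_max + sum_exp.log()

# Step 3: each target logit lives on exactly one shard; the others add nothing.
target_logit = torch.zeros(4, dtype=torch.float64)
for rank, s in enumerate(shards):
    local_idx = target - rank * shard_width
    owned = (local_idx >= 0) & (local_idx < shard_width)
    rows = torch.nonzero(owned).squeeze(1)
    target_logit[rows] += s[rows, local_idx[rows]]

loss = (log_z - target_logit).mean()
print(torch.allclose(loss, expected))  # True
```

Only per-row scalars are exchanged in this scheme, so the full logits never need to be gathered on a single rank.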

Warning

The loss_parallel API is experimental and subject to change.
