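Tensor Parallelism - torch.distributed.tensor.parallel

Tensor Parallelism (TP) is built on top of the PyTorch DistributedTensor (DTensor) and provides several parallelism styles: Colwise, Rowwise, and Sequence Parallelism.

Warning

Tensor Parallelism APIs are experimental and subject to change.

The entry point for parallelizing an `nn.Module` with Tensor Parallelism is:

torch.distributed.tensor.parallel.parallelize_module(module, device_mesh=None, parallelize_plan=None, *, src_data_rank=0)

Applies Tensor Parallelism in PyTorch by parallelizing a module or its submodules according to a user-specified plan.

The module or submodules are parallelized according to `parallelize_plan`. The plan contains `ParallelStyle` objects, each of which indicates how the user wants a module or submodule to be parallelized.

Users can also specify a different parallel style per module fully qualified name (FQN).

Note that `parallelize_module` only accepts a 1-D `DeviceMesh`. If you have a 2-D or N-D `DeviceMesh`, slice it into a 1-D sub-`DeviceMesh` first and pass that to this API (e.g. `device_mesh["tp"]`).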

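Parameters

module (nn.Module) – Module to be parallelized.
device_mesh (DeviceMesh, optional) – Object describing the mesh topology of devices for the DTensor. If not specified, the call must be made under a DeviceMesh context.
parallelize_plan (Union[ParallelStyle, Dict[str, ParallelStyle]], optional) – The plan used to parallelize the module. It can be either a `ParallelStyle` object, which describes how to prepare inputs and outputs for Tensor Parallelism, or a dict mapping module FQNs to their corresponding `ParallelStyle` objects. If not specified, the call is a no-op.

Keyword Arguments

src_data_rank (int, optional) – The rank of the source data for the logical/global tensor; `distribute_tensor()` uses it to scatter/broadcast shards/replicas to the other ranks. By default, `group_rank=0` on each DeviceMesh dimension is used as the source data to preserve single-device semantics. If `None` is passed explicitly, `parallelize_module()` uses its local data directly instead of trying to preserve single-device semantics via scatter/broadcast. Default: 0

Returns

A parallelized `nn.Module` object.

Return type

Module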

Example:

>>> from torch.distributed.tensor.parallel import parallelize_module, ColwiseParallel, RowwiseParallel
>>> from torch.distributed.device_mesh import init_device_mesh
>>>
>>> # Define the module.
>>> m = Model(...)
>>> tp_mesh = init_device_mesh("cuda", (8,))
>>> m = parallelize_module(m, tp_mesh, {"w1": ColwiseParallel(), "w2": RowwiseParallel()})

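Note

For complex module architectures such as Attention and MLP layers, we recommend composing different ParallelStyles (e.g. `ColwiseParallel` and `RowwiseParallel`) and passing them as the `parallelize_plan` to achieve the desired sharded computation.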
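To illustrate why a column-wise layer followed by a row-wise layer composes cleanly, here is a minimal pure-Python sketch of the underlying arithmetic. It does not use torch or any real communication: each "rank" is a loop iteration and the final sum stands in for an all-reduce. The matrices are hypothetical and use the math convention `y = x @ W` with `W` shaped (in_features, out_features), which differs from how `nn.Linear` stores its weight.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def relu(m):
    """Element-wise ReLU on a matrix given as a list of rows."""
    return [[max(v, 0.0) for v in row] for row in m]

def add(a, b):
    """Element-wise sum of two matrices of the same shape."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

# Hypothetical toy MLP: x is (batch=2, d_model=2), w1 is (2, 4), w2 is (4, 2).
x = [[1.0, -2.0], [0.5, 3.0]]
w1 = [[1.0, 0.0, -1.0, 2.0], [0.5, 1.0, 1.0, -1.0]]
w2 = [[1.0, 2.0], [0.0, 1.0], [-1.0, 0.5], [2.0, -1.0]]

# Single-device reference: relu(x @ w1) @ w2.
reference = matmul(relu(matmul(x, w1)), w2)

tp_size = 2
cols = len(w1[0]) // tp_size
partials = []
for rank in range(tp_size):
    lo, hi = rank * cols, (rank + 1) * cols
    w1_shard = [row[lo:hi] for row in w1]     # column-wise: split the output features of w1
    w2_shard = w2[lo:hi]                      # row-wise: split the input features of w2
    hidden_local = relu(matmul(x, w1_shard))  # element-wise op needs no communication
    partials.append(matmul(hidden_local, w2_shard))

# Summing the per-rank partial results plays the role of an all-reduce.
tp_output = partials[0]
for p in partials[1:]:
    tp_output = add(tp_output, p)

assert all(abs(a - b) < 1e-9 for ra, rb in zip(reference, tp_output) for a, b in zip(ra, rb))
```

The hidden activation stays sharded between the two layers, so only one reduction is needed at the end of the pair.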

Tensor Parallelism supports the following parallel styles:

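class torch.distributed.tensor.parallel.ColwiseParallel(*, input_layouts=None, output_layouts=None, use_local_output=True)

Partitions a compatible nn.Module in a column-wise fashion. Currently supports nn.Linear and nn.Embedding. Users can compose it with RowwiseParallel to shard more complicated modules (e.g. MLP, Attention).

Keyword Arguments

input_layouts (Placement, optional) – The DTensor layout of the input tensor for the nn.Module, used to annotate the input tensor as a DTensor. If not specified, the input tensor is assumed to be replicated.
output_layouts (Placement, optional) – The DTensor layout of the output for the nn.Module, used to ensure that the output of the nn.Module has the user-desired layout. If not specified, the output tensor is sharded on the last dimension.
use_local_output (bool, optional) – Whether to use a local torch.Tensor instead of a DTensor for the module output. Default: True.

Returns

A `ParallelStyle` object that represents column-wise sharding of the nn.Module.

Example: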

>>> from torch.distributed.tensor.parallel import parallelize_module, ColwiseParallel
>>> from torch.distributed.device_mesh import init_device_mesh
>>> ...
>>> m = Model(...)  # m is a nn.Module that contains a "w1" nn.Linear submodule
>>> tp_mesh = init_device_mesh("cuda", (8,))
>>>
>>> # By default, the input of the "w1" Linear will be converted to a replicated DTensor
>>> # and the output of "w1" will be a local torch.Tensor sharded on the last dim.
>>>
>>> sharded_mod = parallelize_module(m, tp_mesh, {"w1": ColwiseParallel()})
>>> ...

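Note

By default, the `ColwiseParallel` output is sharded on the last dimension if `output_layouts` is not specified. If there are operators that require a specific tensor shape (e.g. before the paired `RowwiseParallel`), keep in mind that when the output is sharded, those operators may need to be adjusted to the sharded size.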
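As a hedged illustration of the shape adjustment mentioned in the note, consider an attention block that reshapes a projection output into heads between a column-wise and a row-wise layer. The helper name `local_view_shape` and the sizes below are hypothetical; the sketch only computes the shape each rank would need to use and assumes the number of heads is divisible by the TP degree.

```python
def local_view_shape(batch, seq_len, n_heads, head_dim, tp_size):
    """Shape to use when viewing a column-wise-sharded projection output as heads.

    The last dimension of the projection output (n_heads * head_dim) is split
    across tp_size ranks, so each rank only holds n_heads // tp_size heads.
    """
    if n_heads % tp_size != 0:
        raise ValueError("n_heads must be divisible by tp_size in this sketch")
    return (batch, seq_len, n_heads // tp_size, head_dim)

# Unsharded model: 32 heads of size 64.
global_shape = local_view_shape(2, 128, 32, 64, tp_size=1)
# With an 8-way TP mesh, each rank reshapes to 4 local heads instead of 32.
local_shape = local_view_shape(2, 128, 32, 64, tp_size=8)
```

A reshape written with the global head count would fail on the sharded tensor, which is why such operators need to use the local size.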

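class torch.distributed.tensor.parallel.RowwiseParallel(*, input_layouts=None, output_layouts=None, use_local_output=True)

Partitions a compatible nn.Module in a row-wise fashion. Currently supports nn.Linear and nn.Embedding. Users can compose it with ColwiseParallel to shard more complicated modules (e.g. MLP, Attention).

Keyword Arguments

input_layouts (Placement, optional) – The DTensor layout of the input tensor for the nn.Module, used to annotate the input tensor as a DTensor. If not specified, the input tensor is assumed to be sharded on the last dimension.
output_layouts (Placement, optional) – The DTensor layout of the output for the nn.Module, used to ensure that the output of the nn.Module has the user-desired layout. If not specified, the output tensor is replicated.
use_local_output (bool, optional) – Whether to use a local torch.Tensor instead of a DTensor for the module output. Default: True.

Returns

A `ParallelStyle` object that represents row-wise sharding of the nn.Module.

Example: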

>>> from torch.distributed.tensor.parallel import parallelize_module, RowwiseParallel
>>> from torch.distributed.device_mesh import init_device_mesh
>>> ...
>>> m = Model(...)  # m is a nn.Module that contains a "w2" nn.Linear submodule
>>> tp_mesh = init_device_mesh("cuda", (8,))
>>>
>>> # By default, the input of the "w2" Linear will be converted to a DTensor sharded on the last dim
>>> # and the output of "w2" will be a replicated torch.Tensor.
>>>
>>> sharded_mod = parallelize_module(m, tp_mesh, {"w2": RowwiseParallel()})
>>> ...

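class torch.distributed.tensor.parallel.SequenceParallel(*, sequence_dim=1, use_local_output=False)

SequenceParallel replicates the parameters of a compatible `nn.Module` and runs the sharded computation with the input sharded on the sequence dimension. Currently supports `nn.LayerNorm`, `nn.Dropout`, and the Python implementation of RMSNorm.

This style implements the operation described in the paper "Reducing Activation Recomputation in Large Transformer Models".

If the input passed to this `nn.Module` is a `torch.Tensor`, it is assumed to be already sharded on the sequence dimension and is converted to a `DTensor` sharded on the sequence dimension. If the input is already a `DTensor` but is not sharded on the sequence dimension, it is redistributed so that it is sharded on the sequence dimension.

The output of the `nn.Module` is sharded on the sequence dimension.

Keyword Arguments

sequence_dim (int, optional) – The sequence dimension of the input tensor for the `nn.Module`, used to annotate the input tensor as a DTensor sharded on the sequence dimension. Default: 1.
use_local_output (bool, optional) – Whether to use a local torch.Tensor instead of a DTensor for the module output. Default: False.

Returns

A `ParallelStyle` object that represents Sequence Parallel of the `nn.Module`.

Example: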

>>> from torch.distributed.tensor.parallel import parallelize_module, SequenceParallel
>>> from torch.distributed.device_mesh import init_device_mesh
>>> ...
>>> m = Model(...)  # m is a nn.Module that contains a "norm" nn.LayerNorm submodule
>>> tp_mesh = init_device_mesh("cuda", (8,))
>>>
>>> # By default, the input of "norm" will be converted to a DTensor sharded on the sequence dim
>>> # and the output of "norm" will be a DTensor sharded on the sequence dim.
>>>
>>> sharded_mod = parallelize_module(m, tp_mesh, {"norm": SequenceParallel()})
>>> ...

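Note

The SequenceParallel style assumes ones initialization if there are weights in the nn.Module (i.e. `nn.LayerNorm` or `RMSNorm`, which use ones initialization by default). If you use custom initialization for the weights of those modules, you need to broadcast the weights before or after parallelizing to ensure that they are replicated.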
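The following pure-Python sketch illustrates why a layer such as LayerNorm can run on sequence shards with replicated parameters: it normalizes each token over the hidden dimension independently, so splitting the sequence dimension does not change any per-token result. This is only an illustration of the idea, not the library's implementation; the `layer_norm` helper and the toy data are hypothetical, and concatenation stands in for an all-gather.

```python
import math

def layer_norm(tokens, weight, bias, eps=1e-5):
    """Normalize each token (a list of hidden values) over its hidden dimension."""
    out = []
    for tok in tokens:
        mean = sum(tok) / len(tok)
        var = sum((v - mean) ** 2 for v in tok) / len(tok)
        inv_std = 1.0 / math.sqrt(var + eps)
        out.append([(v - mean) * inv_std * w + b for v, w, b in zip(tok, weight, bias)])
    return out

# Hypothetical input with seq_len=4 and hidden=3 (batch dimension omitted).
seq = [[1.0, 2.0, 3.0], [0.0, -1.0, 4.0], [2.5, 2.5, 0.5], [3.0, 1.0, -2.0]]
weight = [1.0, 1.0, 1.0]  # ones initialization, identical on every rank
bias = [0.0, 0.0, 0.0]

full = layer_norm(seq, weight, bias)

tp_size = 2
chunk = len(seq) // tp_size
# Each rank holds a contiguous slice of the sequence and a full copy of weight/bias.
shards = [seq[r * chunk:(r + 1) * chunk] for r in range(tp_size)]
local_outputs = [layer_norm(s, weight, bias) for s in shards]

# Concatenating the local outputs along the sequence dimension recovers the full result.
gathered = [tok for out in local_outputs for tok in out]
assert gathered == full
```

The equality only holds because every rank uses the same `weight` and `bias`, which is the reason custom-initialized weights must be made identical across ranks.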

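To simply configure the nn.Module's inputs and outputs with DTensor layouts and perform the necessary layout redistributions, without distributing the module parameters to DTensors, the following `ParallelStyle`s can be used in the `parallelize_plan` when calling `parallelize_module`: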

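class torch.distributed.tensor.parallel.PrepareModuleInput(*, input_layouts=None, desired_input_layouts=None, input_kwarg_layouts=None, desired_input_kwarg_layouts=None, use_local_output=False)

Configures the nn.Module's inputs so that, at runtime, the input tensors of the nn.Module are converted to DTensors according to `input_layouts` and then redistributed according to `desired_input_layouts`.

Keyword Arguments

input_layouts (Union[Placement, Tuple[Optional[Placement]]]) – The DTensor layouts of the input tensors for the nn.Module, used to convert the input tensors to DTensors. If some inputs are not torch.Tensor or do not need to be converted to DTensors, `None` must be specified as a placeholder. Default: None.
desired_input_layouts (Union[Placement, Tuple[Optional[Placement]]]) – The desired DTensor layouts of the input tensors for the nn.Module, used to ensure that the inputs of the nn.Module have the desired DTensor layouts. This argument must have the same length as `input_layouts`. Default: None.
input_kwarg_layouts (Dict[str, Placement]) – The DTensor layouts of the input kwargs for the nn.Module, used to convert the input kwarg tensors to DTensors. Default: None.
desired_input_kwarg_layouts (Dict[str, Placement]) – The desired DTensor layouts of the input kwargs for the nn.Module, used to ensure that the inputs of the nn.Module have the desired DTensor layouts. Default: None.
use_local_output (bool, optional) – Whether to use a local torch.Tensor instead of a DTensor for the module inputs. Default: False.

Returns

A `ParallelStyle` object that prepares the sharding layouts of the nn.Module's inputs.

Example: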

>>> from torch.distributed.tensor import Replicate, Shard
>>> from torch.distributed.tensor.parallel import parallelize_module, PrepareModuleInput
>>> from torch.distributed.device_mesh import init_device_mesh
>>> ...
>>> block = TransformerBlock(...)  # block is a nn.Module that contains an "attn" Attention submodule
>>> tp_mesh = init_device_mesh("cuda", (8,))
>>>
>>> # According to the style specified below, the first input of attn will be annotated as a sharded DTensor
>>> # and then redistributed to a replicated DTensor.
>>> parallelize_module(
...     block,  # this can be a submodule or module
...     tp_mesh,
...     parallelize_plan={
...         "attn": PrepareModuleInput(
...             input_layouts=(Shard(0), None, None, ...),
...             desired_input_layouts=(Replicate(), None, None, ...)
...         ),
...     }
... )

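class torch.distributed.tensor.parallel.PrepareModuleOutput(*, output_layouts, desired_output_layouts, use_local_output=True)

Configures the nn.Module's outputs so that, at runtime, the output tensors of the nn.Module are converted to DTensors according to `output_layouts` and then redistributed according to `desired_output_layouts`.

Keyword Arguments

output_layouts (Union[Placement, Tuple[Placement]]) – The DTensor layouts of the output tensors for the nn.Module, used to convert the output tensors to DTensors if they are torch.Tensor. If some outputs are not torch.Tensor or do not need to be converted to DTensors, `None` must be specified as a placeholder.
desired_output_layouts (Union[Placement, Tuple[Placement]]) – The desired DTensor layouts of the output tensors for the nn.Module, used to ensure that the outputs of the nn.Module have the desired DTensor layouts.
use_local_output (bool, optional) – Whether to use a local torch.Tensor instead of a DTensor for the module outputs. Default: True.

Returns

A `ParallelStyle` object that prepares the sharding layouts of the nn.Module's outputs.

Example: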

>>> from torch.distributed.tensor import Replicate, Shard
>>> from torch.distributed.tensor.parallel import parallelize_module, PrepareModuleOutput
>>> from torch.distributed.device_mesh import init_device_mesh
>>> ...
>>> block = TransformerBlock(...)  # block is a nn.Module that contains an "attn" Attention submodule
>>> tp_mesh = init_device_mesh("cuda", (8,))
>>>
>>> # According to the style specified below, the output of the TransformerBlock will be converted to a replicated DTensor
>>> # and then redistributed to a sharded DTensor.
>>> parallelize_module(
...     block,  # this can be a submodule or module
...     tp_mesh,
...     parallelize_plan=PrepareModuleOutput(
...         output_layouts=Replicate(),
...         desired_output_layouts=Shard(0)
...     )
... )

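Note

When using `Shard(dim)` as the input/output layout for the above `ParallelStyle`s, we assume that the input/output activation tensors are evenly sharded on tensor dimension `dim` across the `DeviceMesh` that TP operates on. For instance, since `RowwiseParallel` accepts input that is sharded on the last dimension, it assumes that the input tensor has already been evenly sharded on the last dimension. For unevenly sharded activation tensors, one can pass a DTensor directly to the partitioned modules and use `use_local_output=False` to return a DTensor after each `ParallelStyle`, since the DTensor can track the uneven sharding information.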
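To make the even versus uneven distinction concrete, the sketch below computes per-rank shard sizes under the assumption that sharding follows `torch.chunk`-style splitting (ceiling-sized chunks, with trailing ranks possibly receiving smaller or empty shards). The helper `chunk_sizes` is hypothetical and only models the sizes; it does not touch any tensors.

```python
def chunk_sizes(dim_size, num_shards):
    """Per-rank shard sizes, assuming torch.chunk-style splitting of one dimension."""
    chunk = -(-dim_size // num_shards)  # ceiling division
    sizes = []
    remaining = dim_size
    for _ in range(num_shards):
        take = min(chunk, remaining)  # later ranks may get fewer elements, or none
        sizes.append(take)
        remaining -= take
    return sizes

even = chunk_sizes(16, 8)     # every rank holds the same number of elements
uneven = chunk_sizes(10, 4)   # the last rank holds a smaller shard
sparse = chunk_sizes(5, 4)    # the last rank holds an empty shard
```

A plain local `torch.Tensor` carries no record of which of these cases applies, which is why the uneven case needs DTensors to be passed between styles.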

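For models like Transformer, we recommend using `ColwiseParallel` and `RowwiseParallel` together in the `parallelize_plan` to achieve the desired sharding for the entire model (i.e. Attention and MLP).

Parallelized cross-entropy loss computation (loss parallelism) is supported via the following context manager: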

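torch.distributed.tensor.parallel.loss_parallel()

A context manager that enables loss parallelism, where efficient parallelized loss computation can be performed when the input is sharded on the class dimension. Currently only the cross-entropy loss is supported.

Within this context manager, one can use cross_entropy() or CrossEntropyLoss as usual, with the following assumptions on the input parameters. The corresponding `backward()` call, if any, also needs to happen under this context manager.

Parameters

input (DTensor) – Input logits. Assumed to be sharded on the class dimension.
target (Union[torch.Tensor, DTensor]) – Must be ground-truth class indices (class probabilities are currently not supported). Assumed to be replicated across the DeviceMesh.
weight (Union[torch.Tensor, DTensor], optional) – If given, assumed to be replicated across the DeviceMesh.
label_smoothing – Currently not supported.

Returns

A replicated DTensor.

Example

A sharded DTensor is manually created here to showcase the usage. In practice, it is usually the output of a TP module.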

>>> import torch
>>> import torch.nn.functional as F
>>> from torch.distributed.tensor import Shard, distribute_tensor
>>> from torch.distributed.tensor.parallel import loss_parallel
>>> from torch.distributed.device_mesh import init_device_mesh
>>> ...
>>> device_mesh = init_device_mesh("cuda", (8,))
>>> input = torch.randn(4, 16, device="cuda", requires_grad=True)
>>> dist_input = distribute_tensor(input, device_mesh, placements=[Shard(1)])
>>> target = torch.randint(16, (4,), device="cuda")
>>> with loss_parallel():
...     loss = F.cross_entropy(dist_input, target, reduction="mean")
...     loss.backward()
>>> ...
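For intuition about why cross-entropy can be computed without gathering the full logits, the following pure-Python sketch shows one plausible decomposition for a single sample: a global max and a global sum of exponentials are combined from per-shard values, and only the shard that owns the target class contributes the target logit. This is a sketch of the idea under those assumptions, not PyTorch's actual implementation; the function names and data are hypothetical, and the outer `max`/`sum` stand in for all-reduce operations.

```python
import math

def cross_entropy(logits, target):
    """Reference cross-entropy for one sample, using a numerically stable log-sum-exp."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    return log_z - logits[target]

def sharded_cross_entropy(logit_shards, target):
    """Cross-entropy for one sample whose logits are split along the class dimension."""
    # Step 1: each rank computes a local max; the outer max models all-reduce(max).
    global_max = max(max(s) for s in logit_shards)
    # Step 2: each rank sums exp(logit - global_max); the outer sum models all-reduce(sum).
    global_sum = sum(sum(math.exp(v - global_max) for v in s) for s in logit_shards)
    # Step 3: only the rank that owns the target class contributes its logit.
    offset = 0
    target_logit = 0.0
    for s in logit_shards:
        if offset <= target < offset + len(s):
            target_logit += s[target - offset]
        offset += len(s)
    return global_max + math.log(global_sum) - target_logit

logits = [2.0, -1.0, 0.5, 3.0, 0.0, 1.5, -2.0, 0.25]  # 8 classes
target = 3
shards = [logits[i:i + 2] for i in range(0, 8, 2)]    # 4 ranks, 2 classes each

full_loss = cross_entropy(logits, target)
parallel_loss = sharded_cross_entropy(shards, target)
assert abs(full_loss - parallel_loss) < 1e-9
```

Only a few scalars per sample cross rank boundaries in this scheme, rather than the full row of logits.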

Warning

The loss_parallel API is experimental and subject to change.
