TransformerDecoder

class torchtune.modules.TransformerDecoder(*, tok_embeddings: Embedding, layers: Union[Module, List[Module], ModuleList], max_seq_len: int, num_heads: int, head_dim: int, norm: Module, output: Union[Linear, Callable], num_layers: Optional[int] = None, output_hidden_states: Optional[List[int]] = None)

Transformer Decoder derived from the Llama2 architecture.

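Parameters:

tok_embeddings (nn.Embedding) – PyTorch embedding layer used to map tokens into an embedding space.
layers (Union[nn.Module, List[nn.Module], nn.ModuleList]) – A single Transformer Decoder layer, an nn.ModuleList of layers, or a list of layers. An nn.ModuleList is recommended.
max_seq_len (int) – Maximum sequence length the model will be run with, as used by KVCache().
num_heads (int) – Number of query heads. For MHA this is also the number of key and value heads. Used to set up KVCache().
head_dim (int) – Embedding dimension of each head in self-attention. Used to set up KVCache().
norm (nn.Module) – Callable that applies normalization to the decoder output, before the final MLP.
output (Union[nn.Linear, Callable]) – Callable that applies a linear transformation to the decoder output.
num_layers (Optional[int]) – Number of Transformer Decoder layers. Define it only when layers is not a list.
output_hidden_states (Optional[List[int]]) – List of layer indices whose outputs should be included in the output.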

Raises:

AssertionError – If num_layers is set and layers is a list, or if num_layers is not set and layers is an nn.Module.
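The two rejected combinations can be sketched in plain Python. This is only an illustration of the documented rule, not the library's code: `Layer` and `check_layer_args` are hypothetical names, and a real call would pass nn.Module instances.

```python
from typing import Optional


class Layer:
    """Hypothetical stand-in for a single decoder layer (an nn.Module in practice)."""


def check_layer_args(layers, num_layers: Optional[int]) -> None:
    """Raise AssertionError for the two combinations the docs rule out."""
    is_list = isinstance(layers, (list, tuple))
    if is_list and num_layers is not None:
        raise AssertionError("num_layers is set, so layers must be a single module")
    if not is_list and num_layers is None:
        raise AssertionError("num_layers is not set, so layers must be a list")


# Accepted: a list without num_layers, or one layer plus a count.
check_layer_args([Layer(), Layer()], None)
check_layer_args(Layer(), 2)
```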

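Note

Argument values are checked for correctness (e.g. attn_dropout must lie in [0, 1]) in the modules that use them. This reduces the number of raise statements in the code and improves readability.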

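caches_are_enabled() → bool

Checks whether the key-value caches are enabled. Once KV caches have been set up, the relevant attention modules are "enabled" and every forward pass updates the caches. This behaviour can be turned off without altering the state of the KV caches by "disabling" them with torchtune.modules.common_utils.disable_kv_cache(), after which caches_are_enabled returns False.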

caches_are_setup() → bool

Checks whether the key-value caches are set up. This means setup_caches has been called and the relevant attention modules in the model have created their KVCache.
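The difference between "set up" and "enabled" can be pictured as two independent flags. The toy class below is a hypothetical model of that behaviour in plain Python; `ToyCacheState` and its `disable` method are illustrative names, not part of torchtune.

```python
class ToyCacheState:
    """Hypothetical two-flag model of the documented cache states."""

    def __init__(self) -> None:
        self.cache = None      # allocated by setup_caches
        self.enabled = False   # whether forward passes update the cache

    def setup_caches(self, size: int) -> None:
        self.cache = [0.0] * size
        self.enabled = True    # setting up a cache also enables it

    def caches_are_setup(self) -> bool:
        return self.cache is not None

    def caches_are_enabled(self) -> bool:
        return self.enabled

    def disable(self) -> None:
        # Mirrors the documented effect of disable_kv_cache(): the cache
        # contents are left untouched, only the flag changes.
        self.enabled = False


state = ToyCacheState()
state.setup_caches(4)
state.disable()
print(state.caches_are_setup(), state.caches_are_enabled())  # True False
```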

chunked_output(last_hidden_state: Tensor) → List[Tensor]

Applies the output projection in chunks. Use this together with CEWithChunkedOutputLoss, because the upcast to fp32 is done there.

To use this method, first call set_num_output_chunks().

Parameters:

last_hidden_state (torch.Tensor) – Last hidden state of the decoder, with shape [b, seq_len, embed_dim].

Returns:

A list of num_chunks output tensors, each with shape [b, seq_len/num_chunks, out_dim], where out_dim is usually the vocabulary size.

Return type:

List[torch.Tensor]
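To make the chunk shapes concrete, here is a minimal sketch using plain Python lists in place of tensors, for a single batch element. It assumes seq_len divides evenly by num_chunks; the `project` function and its weights are hypothetical.

```python
from typing import Callable, List

Vector = List[float]


def chunked_output(
    hidden: List[Vector], project: Callable[[Vector], Vector], num_chunks: int
) -> List[List[Vector]]:
    """Split one sequence of hidden vectors into chunks and project each chunk.

    Assumes len(hidden) is divisible by num_chunks, which matches the
    documented chunk shape [b, seq_len/num_chunks, out_dim] for b = 1.
    """
    seq_len = len(hidden)
    assert seq_len % num_chunks == 0, "sketch assumes an even split"
    size = seq_len // num_chunks
    return [
        [project(h) for h in hidden[start:start + size]]
        for start in range(0, seq_len, size)
    ]


# Toy projection from embed_dim = 2 to out_dim = 3 (hypothetical weights).
def project(h: Vector) -> Vector:
    return [h[0], h[1], h[0] + h[1]]


hidden = [[float(i), 1.0] for i in range(8)]  # seq_len = 8
chunks = chunked_output(hidden, project, num_chunks=4)
print(len(chunks), len(chunks[0]), len(chunks[0][0]))  # 4 2 3
```

Concatenating the chunks along the sequence axis gives the same result as projecting the whole sequence at once; only the peak memory of the projected output differs.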

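forward(tokens: Tensor, *, mask: Optional[Tensor] = None, encoder_input: Optional[Tensor] = None, encoder_mask: Optional[Tensor] = None, input_pos: Optional[Tensor] = None) → Union[Tensor, List[Tensor]]

Parameters:

tokens (torch.Tensor) – Input tensor with shape [b x s].
mask (Optional[_MaskType]) –
Used to mask the scores after the query-key multiplication and before the softmax. This parameter is required during inference if caches have been set up. It is either:

A boolean tensor with shape [b x s x s], or [b x s x self.encoder_max_cache_seq_len] when KV caching is used with encoder/decoder layers. A True value in row i, column j means token i attends to token j; a False value means it does not. If no mask is specified, a causal mask is used by default.

A BlockMask for document masking in a packed sequence, created via create_block_mask. Attention with block masks is computed with flex_attention().

Default is None.
encoder_input (Optional[torch.Tensor]) – Optional input embeddings from the encoder, with shape [b x s_e x d_e].
encoder_mask (Optional[torch.Tensor]) – Boolean tensor defining a relational matrix between tokens and encoder embeddings. A True value at position i, j means token i in the decoder can attend to embedding j. The mask has shape [b x s x s_e]. Default is None, but this parameter is required during inference if the model has any layers that use encoder embeddings and caches have been set up.
input_pos (Optional[torch.Tensor]) – Optional tensor containing the position ID of each token. During training, it indicates the position of each token relative to its sample when packed, with shape [b x s]. During inference, it indicates the position of the current token. This parameter is required during inference if caches have been set up. Default is None.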
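The mask convention (True at row i, column j means token i attends to token j) is easiest to see on the default causal case. The sketch below builds that pattern with nested Python lists rather than a boolean tensor; `causal_mask` is an illustrative helper, not a torchtune function.

```python
from typing import List


def causal_mask(seq_len: int) -> List[List[bool]]:
    """mask[i][j] is True when token i may attend to token j (j <= i)."""
    return [[j <= i for j in range(seq_len)] for i in range(seq_len)]


for row in causal_mask(4):
    print("".join("T" if allowed else "." for allowed in row))
# T...
# TT..
# TTT.
# TTTT
```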

Returns:

Output tensor with shape [b x s x v], or a list of the layer output tensors selected by output_hidden_states with the final output tensor appended to the list.

Return type:

Union[torch.Tensor, List[torch.Tensor]]

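Note

At the very first step of inference, when the model is given a prompt, input_pos should contain the positions of all tokens in the prompt. For a single prompt, or a batch of prompts of identical length, this is torch.arange(prompt_length). For a batch of prompts of varying length, shorter prompts are left-padded and their position IDs are shifted right accordingly, so the position IDs should have shape [b, padded_prompt_length]. This is needed because the positional embedding must be retrieved for every input ID. In subsequent steps, if the model has been set up with KV caches, input_pos contains the position of the current token, torch.tensor([padded_prompt_length]). Otherwise, input_pos contains all position IDs up to the current token.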
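The left-padding rule can be worked through on a small batch. The sketch below uses plain lists and makes two assumptions that the note does not spell out: padding slots are given position 0, and each decoding step after the first advances the position by one. Both helper names are hypothetical.

```python
from typing import List


def prefill_input_pos(prompt_lengths: List[int]) -> List[List[int]]:
    """Position ids of shape [b, padded_prompt_length] for left-padded prompts.

    Assumption: padding slots get position 0 and real tokens count up from 0,
    so each row is the unpadded arange shifted right by its padding.
    """
    padded = max(prompt_lengths)
    return [[0] * (padded - n) + list(range(n)) for n in prompt_lengths]


def decode_input_pos(prompt_lengths: List[int], step: int) -> List[int]:
    """Position passed at a later step when KV caches are set up.

    Per the note, the first generated token uses padded_prompt_length;
    the sketch assumes each further step adds one.
    """
    return [max(prompt_lengths) + step]


print(prefill_input_pos([3, 5]))    # [[0, 0, 0, 1, 2], [0, 1, 2, 3, 4]]
print(decode_input_pos([3, 5], 0))  # [5]
```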

Shape notation

b: batch size
s: token sequence length
s_e: encoder sequence length
v: vocab size
d: token embed dim
d_e: encoder embed dim
m_s: max seq len

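reset_caches()

Resets the KV cache buffers on the relevant attention modules to zero and resets the cache positions to zero, without deleting or reallocating the cache tensors.

Raises:

RuntimeError – If the KV caches are not set up. Call setup_caches() first to set them up.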
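"Reset without reallocating" means the existing buffer is zeroed in place. A toy version in plain Python shows the idea; `ToyKVCache` is a hypothetical stand-in for a per-layer cache, and the module-level `reset_caches` here only imitates the documented behaviour, including the error when nothing is set up.

```python
from typing import List, Optional


class ToyKVCache:
    """Hypothetical stand-in for a per-layer KV cache buffer."""

    def __init__(self, max_seq_len: int) -> None:
        self.k: List[float] = [0.0] * max_seq_len
        self.pos = 0  # next write position

    def update(self, value: float) -> None:
        self.k[self.pos] = value
        self.pos += 1

    def reset(self) -> None:
        # Zero in place: the list object (the "buffer") is reused.
        for i in range(len(self.k)):
            self.k[i] = 0.0
        self.pos = 0


def reset_caches(cache: Optional[ToyKVCache]) -> None:
    if cache is None:
        raise RuntimeError("caches are not set up; call setup_caches() first")
    cache.reset()


cache = ToyKVCache(max_seq_len=4)
cache.update(1.5)
buffer_before = cache.k
reset_caches(cache)
print(cache.k, cache.pos, cache.k is buffer_before)  # [0.0, 0.0, 0.0, 0.0] 0 True
```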

set_num_output_chunks(num_output_chunks: int) → None

Used together with CEWithChunkedOutputLoss to save memory. Call this method in the recipe before the first forward pass.

setup_caches(batch_size: int, dtype: dtype, *, encoder_max_seq_len: Optional[int] = None, decoder_max_seq_len: Optional[int] = None)

Sets up the key-value attention caches for inference. For each layer in self.layers:

TransformerSelfAttentionLayer uses decoder_max_seq_len.
TransformerCrossAttentionLayer uses encoder_max_seq_len.
FusionLayer uses both decoder_max_seq_len and encoder_max_seq_len.

Parameters:

batch_size (int) – Batch size for the caches.
dtype (torch.dtype) – dtype for the caches.
encoder_max_seq_len (Optional[int]) – Maximum sequence length of the encoder cache.
decoder_max_seq_len (Optional[int]) – Maximum sequence length of the decoder cache.
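The per-layer rule above amounts to a small dispatch on layer type. The following sketch expresses it with string tags instead of the real layer classes; `cache_lengths` and the tag names are hypothetical and only restate the mapping listed in this section.

```python
from typing import Dict, Optional


def cache_lengths(
    layer_kind: str,
    *,
    encoder_max_seq_len: Optional[int],
    decoder_max_seq_len: Optional[int],
) -> Dict[str, Optional[int]]:
    """Return the cache lengths a layer of the given kind would receive."""
    if layer_kind == "self_attention":
        return {"decoder": decoder_max_seq_len}
    if layer_kind == "cross_attention":
        return {"encoder": encoder_max_seq_len}
    if layer_kind == "fusion":
        return {"decoder": decoder_max_seq_len, "encoder": encoder_max_seq_len}
    raise ValueError(f"unknown layer kind: {layer_kind}")


for kind in ("self_attention", "cross_attention", "fusion"):
    print(kind, cache_lengths(kind, encoder_max_seq_len=128, decoder_max_seq_len=2048))
```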
