clip_vision_encoder

torchtune.models.clip.clip_vision_encoder(tile_size: int, patch_size: int, embed_dim: int, num_layers: int, num_heads: int, activation: Callable = torch.nn.SiLU, cls_output_dim: int = 512, attn_bias: bool = True, use_rope: bool = False, out_indices: Optional[List[int]] = None, output_cls_projection: bool = False, max_num_tiles: int = 4, in_channels: int = 3, append_cls_token: bool = False, use_tile_pos_embed: bool = True) → VisionTransformer

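Builds the vision encoder associated with the CLIP model. This includes:

TransformerEncoderLayer
Positional embeddings
CLS projection (optional)

For details, see the documentation of torchtune.modules.vision_transformer.VisionTransformer.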

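Parameters:

tile_size (int) – The size of your image tiles, if the image was tile-cropped in advance. Otherwise, the size of the input image, in which case the function treats your image as a single tile.
patch_size (int) – The size of each patch. Used to divide the tiles into patches. For example, with patch_size=40, a tile of shape (400, 400) has a 10x10 grid of patches, each of shape (40, 40).
embed_dim (int) – The dimensionality of each patch embedding (token).
num_layers (int) – The number of transformer layers.
num_heads (int) – The number of attention heads in each transformer layer.
activation (Callable) – The activation function to use in the MLP layer.
cls_output_dim (int) – The dimensionality of the output tensor from the CLS projection module.
attn_bias (bool) – Whether to use bias in the attention module. Default: True.
use_rope (bool) – If True, include 2D RoPE in the attention of each transformer layer. Default: False.
out_indices (Optional[List[int]]) – The indices of the hidden layers to return. If provided, the intermediate results of the transformer layers are returned as they are before entering the next layer. For example, out_indices=[0,3] returns the tokens before they enter the first and the fourth layers.
output_cls_projection (bool) – If True, only the projection of the CLS token is output, instead of all tokens. Default: False.
max_num_tiles (int) – The maximum number of tiles that can be processed. This is used to determine the size of the positional embeddings.
in_channels (int) – The number of image input channels.
append_cls_token (bool) – If True, adds the CLS token embedding to the end of the sequence in the vision transformer. Default: False, which adds the CLS token to the beginning of the sequence.
use_tile_pos_embed (bool) – If True, use pre-tile, post-tile, and tiled token positional embeddings when max_num_tiles > 1. If False, only use standard token positional embeddings.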
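
The patch_size arithmetic described above can be checked with a short sketch. The helper name patch_grid is hypothetical and is not part of torchtune; it only reproduces the division of a square tile into patches, and it assumes tile_size is an exact multiple of patch_size.

```python
from typing import Tuple


def patch_grid(tile_size: int, patch_size: int) -> Tuple[int, int]:
    """Return (patches per side, patches per tile) for a square tile.

    Hypothetical helper for illustration only; not a torchtune function.
    """
    if tile_size % patch_size != 0:
        # Assumption of this sketch: the tile divides evenly into patches.
        raise ValueError("tile_size must be a multiple of patch_size")
    per_side = tile_size // patch_size
    return per_side, per_side * per_side


per_side, per_tile = patch_grid(tile_size=400, patch_size=40)
print(per_side, per_tile)  # 10 100
```

With the values from the patch_size description, each (400, 400) tile yields a 10x10 grid, that is, 100 patches of shape (40, 40).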

Returns:

A VisionTransformer object.
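
To make the out_indices semantics concrete, here is a toy sketch in plain Python. It is not torchtune's implementation: run_layers and the integer "layers" are hypothetical stand-ins, and returning a (final, hidden) pair is an assumption of this sketch. It only shows what "the tokens before they enter layer i" means.

```python
from typing import Callable, List, Optional, Tuple


def run_layers(
    x: int,
    layers: List[Callable[[int], int]],
    out_indices: Optional[List[int]] = None,
) -> Tuple[int, List[int]]:
    """Apply layers in order, saving the state seen just before selected layers."""
    hidden = []
    for i, layer in enumerate(layers):
        if out_indices is not None and i in out_indices:
            hidden.append(x)  # state *before* it enters layer i
        x = layer(x)
    return x, hidden


# Four toy "layers" that add 1, 10, 100 and 1000 respectively.
layers = [lambda v, k=k: v + k for k in (1, 10, 100, 1000)]
final, hidden = run_layers(0, layers, out_indices=[0, 3])
print(final, hidden)  # 1111 [0, 111]
```

Index 0 captures the input to the first layer, and index 3 captures the input to the fourth layer, matching the out_indices=[0,3] example in the parameter description.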

Raises:

AssertionError – If embed_dim is not divisible by num_heads.
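
The divisibility requirement can be checked before calling the builder. The sketch below uses a hypothetical helper, head_dim, that mirrors the documented condition; the per-head size it returns follows from the usual multi-head attention split and is not taken from torchtune's source.

```python
def head_dim(embed_dim: int, num_heads: int) -> int:
    """Return the per-head dimension, or fail if the split is uneven.

    Hypothetical helper for illustration only; not a torchtune function.
    """
    assert embed_dim % num_heads == 0, (
        f"embed_dim ({embed_dim}) must be divisible by num_heads ({num_heads})"
    )
    return embed_dim // num_heads


print(head_dim(1024, 16))  # 64
```

A configuration such as embed_dim=1000 with num_heads=16 fails this check, because 1000 is not a multiple of 16.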
