torchaudio.models.hubert_pretrain_model

torchaudio.models.hubert_pretrain_model(extractor_mode: str, extractor_conv_layer_config: Optional[List[Tuple[int, int, int]]], extractor_conv_bias: bool, encoder_embed_dim: int, encoder_projection_dropout: float, encoder_pos_conv_kernel: int, encoder_pos_conv_groups: int, encoder_num_layers: int, encoder_num_heads: int, encoder_attention_dropout: float, encoder_ff_interm_features: int, encoder_ff_interm_dropout: float, encoder_dropout: float, encoder_layer_norm_first: bool, encoder_layer_drop: float, mask_prob: float, mask_selection: str, mask_other: float, mask_length: int, no_mask_overlap: bool, mask_min_space: int, mask_channel_prob: float, mask_channel_selection: str, mask_channel_other: float, mask_channel_length: int, no_mask_channel_overlap: bool, mask_channel_min_space: int, skip_masked: bool, skip_nomask: bool, num_classes: int, final_dim: int, feature_grad_mult: Optional[float]) → HuBERTPretrainModel

Build a custom HuBERTPretrainModel for training from scratch.

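Note

The "feature extractor" below corresponds to ConvFeatureExtractionModel in the original fairseq implementation. The wav2vec 2.0 paper [Baevski et al., 2020] refers to it as the "(convolutional) feature encoder".

The "encoder" below corresponds to TransformerEncoder, which the paper refers to as the "Transformer".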
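To make the feature extractor's role concrete, the following sketch works out how many frames the convolutional stack hands to the encoder. It uses the default configuration listed under extractor_conv_layer_config and assumes unpadded, undilated 1-D convolutions. The helper names and the 16,000-sample input (one second of audio at an assumed 16 kHz sample rate) are illustrative and not part of torchaudio.

```python
# Default extractor_conv_layer_config: (output_channel, kernel_size, stride)
DEFAULT_CONV_LAYERS = [
    (512, 10, 5),
    (512, 3, 2),
    (512, 3, 2),
    (512, 3, 2),
    (512, 3, 2),
    (512, 2, 2),
    (512, 2, 2),
]


def total_stride(layers):
    """Input samples between two consecutive output frames."""
    stride = 1
    for _, _, s in layers:
        stride *= s
    return stride


def receptive_field(layers):
    """Number of input samples that influence a single output frame."""
    field, jump = 1, 1
    for _, k, s in layers:
        field += (k - 1) * jump
        jump *= s
    return field


def num_frames(num_samples, layers):
    """Output length, assuming no padding and no dilation (an assumption)."""
    length = num_samples
    for _, k, s in layers:
        length = (length - k) // s + 1
    return length


print(total_stride(DEFAULT_CONV_LAYERS))       # 320
print(receptive_field(DEFAULT_CONV_LAYERS))    # 400
print(num_frames(16000, DEFAULT_CONV_LAYERS))  # 49
```

Under these assumptions, each frame covers 400 input samples and frames are spaced 320 samples apart, so the masking parameters described below operate on roughly 49 frames per 16,000 input samples.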

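Parameters:

extractor_mode (str) –
Operation mode of the feature extractor. Valid values are "group_norm" or "layer_norm". If "group_norm", a single normalization is applied in the first convolution block. Otherwise, every convolution block has layer normalization.

This option corresponds to extractor_mode from fairseq.
extractor_conv_layer_config (list of int tuples or None) –
Configuration of the convolution layers in the feature extractor, given as a list of convolution configurations, i.e. [(output_channel, kernel_size, stride), ...]

If None is provided, the following default value is used.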
```
[
  (512, 10, 5),
  (512, 3, 2),
  (512, 3, 2),
  (512, 3, 2),
  (512, 3, 2),
  (512, 2, 2),
  (512, 2, 2),
]
```
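This option corresponds to conv_feature_layers from fairseq.
extractor_conv_bias (bool) –
Whether to include a bias term in each convolution operation.

This option corresponds to conv_bias from fairseq.
encoder_embed_dim (int) –
The dimension of the embedding in the encoder.

This option corresponds to encoder_embed_dim from fairseq.
encoder_projection_dropout (float) –
The dropout probability applied after the input feature is projected to encoder_embed_dim.

This option corresponds to dropout_input from fairseq.
encoder_pos_conv_kernel (int) –
The kernel size of the convolutional positional embeddings.

This option corresponds to conv_pos from fairseq.
encoder_pos_conv_groups (int) –
The number of groups of the convolutional positional embeddings.

This option corresponds to conv_pos_groups from fairseq.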
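encoder_num_layers (int) –
The number of self-attention layers in the transformer block.

This option corresponds to encoder_layers from fairseq.
encoder_num_heads (int) –
The number of heads in the self-attention layers.

This option corresponds to encoder_attention_heads from fairseq.
encoder_attention_dropout (float) –
The dropout probability applied after the softmax in the self-attention layer.

This option corresponds to attention_dropout from fairseq.
encoder_ff_interm_features (int) –
The dimension of the hidden features in the feed-forward layer.

This option corresponds to encoder_ffn_embed_dim from fairseq.
encoder_ff_interm_dropout (float) –
The dropout probability applied in the feed-forward layer.

This option corresponds to activation_dropout from fairseq.
encoder_dropout (float) –
The dropout probability applied at the end of the feed-forward layer.

This option corresponds to dropout from fairseq.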
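encoder_layer_norm_first (bool) –
Controls the order of layer normalization in the transformer layer and in each encoder layer. If True, the transformer layer applies layer norm before the features are fed to the encoder layers, and each encoder layer applies two layer norms, one before and one after self-attention. If False, the transformer layer applies layer norm after the features are fed to the encoder layers, and each encoder layer applies two layer norms after self-attention, one before and one after the feed-forward layer.

This option corresponds to layer_norm_first from fairseq.
encoder_layer_drop (float) –
The probability of dropping each encoder layer during training.

This option corresponds to layerdrop from fairseq.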
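mask_prob (float) –
Probability for each token to be chosen as the start of a span to be masked. This value is multiplied by the number of time steps and divided by the mask span length, so that approximately this fraction of all elements is masked. Because spans can overlap, the actual fraction is usually smaller (unless no_mask_overlap is True).

This option corresponds to mask_prob from fairseq.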
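mask_selection (str) –
How to choose the mask length. Options: [static, uniform, normal, poisson].

This option corresponds to mask_selection from fairseq.
mask_other (float) –
Secondary mask argument (used for more complex distributions).

This option corresponds to mask_other from fairseq.
mask_length (int) –
The length of each mask span.

This option corresponds to mask_length from fairseq.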
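no_mask_overlap (bool) –
Whether to prevent masked spans from overlapping.

This option corresponds to no_mask_overlap from fairseq.
mask_min_space (int) –
Minimum space between spans (if no overlap is enabled).

This option corresponds to mask_min_space from fairseq.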
mask_channel_prob (float) –
The probability of replacing a feature with 0.

This option corresponds to mask_channel_prob from fairseq.
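mask_channel_selection (str) –
How to choose the mask length for channel masking. Options: [static, uniform, normal, poisson].

This option corresponds to mask_channel_selection from fairseq.
mask_channel_other (float) –
Secondary mask argument for channel masking (used for more complex distributions).

This option corresponds to mask_channel_other from fairseq.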
mask_channel_length (int) –
The length of each mask span for channel masking.

This option corresponds to mask_channel_length from fairseq.
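no_mask_channel_overlap (bool) –
Whether to prevent channel-mask spans from overlapping.

This option corresponds to no_mask_channel_overlap from fairseq.
mask_channel_min_space (int) –
Minimum space between spans for channel masking (if no overlap is enabled).

This option corresponds to mask_channel_min_space from fairseq.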
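skip_masked (bool) –
If True, skip computing the loss on masked frames.

This option corresponds to skip_masked from fairseq.
skip_nomask (bool) –
If True, skip computing the loss on unmasked frames.

This option corresponds to skip_nomask from fairseq.
num_classes (int) – The number of classes in the labels.
final_dim (int) –
Project the final representations and targets to final_dim.

This option corresponds to final_dim from fairseq.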
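feature_grad_mult (float or None) –
The factor by which gradients of the convolutional feature extraction layers are scaled. The scale factor does not affect the forward pass.

This option corresponds to feature_grad_mult from fairseq.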
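The interaction between mask_prob, mask_length, and overlapping spans can be illustrated with a small standalone sketch. It is a simplified model of "static" span masking in which overlap is allowed and mask_min_space is ignored. It is not the torchaudio or fairseq implementation, and the function name, the rounding rule, and the frame count are assumptions made for illustration.

```python
import random


def sketch_span_mask(num_frames, mask_prob, mask_length, seed=0):
    """Simplified static span masking (illustrative only, overlap allowed)."""
    rng = random.Random(seed)
    # Choose enough spans that, without overlap, about mask_prob of the
    # frames would be masked: mask_prob * num_frames / mask_length.
    num_spans = round(mask_prob * num_frames / mask_length)
    # Pick distinct start positions; the spans themselves may still overlap.
    starts = rng.sample(range(num_frames - mask_length + 1), num_spans)
    mask = [False] * num_frames
    for start in starts:
        for i in range(start, start + mask_length):
            mask[i] = True
    return mask, num_spans


mask, num_spans = sketch_span_mask(num_frames=500, mask_prob=0.8, mask_length=10)
masked_fraction = sum(mask) / len(mask)
print(num_spans)        # 40
print(masked_fraction)  # at most 0.8; lower when spans overlap
```

With 500 frames, mask_prob=0.8, and mask_length=10, the sketch draws 40 spans. The masked fraction reaches 0.8 only if no two spans overlap, which is why the description of mask_prob notes that the actual fraction is usually smaller.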

Returns:

The resulting model.

Return type:

HuBERTPretrainModel
