All About Configs¶

This deep-dive will guide you through writing configs for running recipes.

What this deep-dive will cover

How to write a YAML config and run a recipe with it
How to use the instantiate and parse APIs
How to effectively use configs and CLI overrides for running recipes

Prerequisites

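Where do parameters live?¶

There are two primary entry points for configuring parameters: configs and CLI overrides. Configs are YAML files that define, in one place, all the parameters needed to run a recipe. They are the single source of truth for reproducing a run. Config parameters can be overridden on the command line with the tune command, so you can make quick changes and experiment without modifying the config file itself.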

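Writing configs¶

Configs serve as the primary entry point for running recipes in torchtune. They are typically YAML files that simply list the values of the parameters you want to define for a particular run.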

seed: null
shuffle: True
device: cuda
dtype: fp32
enable_fsdp: True
...

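Configuring components using `instantiate`¶

Many fields require specifying a torchtune object, together with its keyword arguments, as the parameter. Models, datasets, optimizers, and loss functions are common examples. You can do this with the _component_ subfield. In _component_, specify the dotpath of the object you want to instantiate in the recipe. The dotpath is the exact path you would use to import the object normally in a Python file. For example, to specify alpaca_dataset in your config with custom arguments: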

dataset:
  _component_: torchtune.datasets.alpaca_dataset
  train_on_input: False

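Here, we are changing the default value of train_on_input from True to False.

Once you have specified _component_ in your config, you can create an instance of the specified object in your recipe's setup like so: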

from torchtune import config

# Access the dataset field and create the object instance
dataset = config.instantiate(cfg.dataset)

This will automatically use any keyword arguments specified under the dataset field.
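To make the mechanism concrete, here is a minimal sketch of how a _component_ dotpath could be turned into an object. It is not torchtune's implementation: resolve_dotpath and instantiate_sketch are hypothetical helpers, a plain dict stands in for the config, and fractions.Fraction from the standard library stands in for a torchtune component so that the snippet runs without torchtune installed.

```python
import importlib
from fractions import Fraction
from typing import Any, Callable, Dict


def resolve_dotpath(dotpath: str) -> Callable[..., Any]:
    """Import the module part of a dotpath and return the named attribute."""
    module_name, _, attr_name = dotpath.rpartition(".")
    module = importlib.import_module(module_name)
    return getattr(module, attr_name)


def instantiate_sketch(cfg: Dict[str, Any]) -> Any:
    """Call the object named by ``_component_`` with the remaining fields as kwargs."""
    kwargs = {k: v for k, v in cfg.items() if k != "_component_"}
    return resolve_dotpath(cfg["_component_"])(**kwargs)


# Hypothetical config field; Fraction plays the role of a torchtune component
cfg_field = {"_component_": "fractions.Fraction", "numerator": 3, "denominator": 4}
obj = instantiate_sketch(cfg_field)
print(obj)  # 3/4
```

The point of the sketch is only that the dotpath is imported exactly as a Python import would resolve it, and every other key under the field becomes a keyword argument.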

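As written, the config.instantiate(cfg.dataset) call above will actually throw an error. If you look at the signature of alpaca_dataset, you will notice that a required positional argument is missing: the tokenizer. Since the tokenizer is itself another configurable torchtune object, let's look at the instantiate() API to see how to handle this.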

def instantiate(
    config: DictConfig,
    *args: Any,
    **kwargs: Any,
)

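instantiate() also accepts positional and keyword arguments, and automatically uses them together with the config when creating the object. This means we can not only pass in the tokenizer, but also add extra keyword arguments that are not specified in the config if needed: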

# Tokenizer is needed for the dataset, configure it first
tokenizer:
  _component_: torchtune.models.llama2.llama2_tokenizer
  path: /tmp/tokenizer.model

dataset:
  _component_: torchtune.datasets.alpaca_dataset

# Note the API of the tokenizer we specified - we need to pass in a path
def llama2_tokenizer(path: str) -> Llama2Tokenizer:

# Note the API of the dataset we specified - we need to pass in a model tokenizer
# and any optional keyword arguments
def alpaca_dataset(
    tokenizer: ModelTokenizer,
    train_on_input: bool = True,
    max_seq_len: int = 512,
) -> SFTDataset:

from torchtune import config

# Since we've already specified the path in the config, we don't need to pass
# it in
tokenizer = config.instantiate(cfg.tokenizer)
# We pass in the instantiated tokenizer as the first required argument, then
# we change an optional keyword argument
dataset = config.instantiate(
    cfg.dataset,
    tokenizer,
    train_on_input=False,
)

Note that additional keyword arguments will overwrite any duplicated keys in the config.
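The precedence rule can be illustrated with a small, self-contained sketch. It shows the general pattern under stated assumptions rather than torchtune's code: instantiate_sketch is a hypothetical helper, the config is a plain dict, and fractions.Fraction stands in for a real component.

```python
import importlib
from fractions import Fraction
from typing import Any, Dict


def instantiate_sketch(cfg: Dict[str, Any], *args: Any, **kwargs: Any) -> Any:
    """Build the ``_component_`` object from config fields, extra args, and extra kwargs."""
    module_name, _, attr_name = cfg["_component_"].rpartition(".")
    component = getattr(importlib.import_module(module_name), attr_name)
    # Start from the config's keyword arguments ...
    merged = {k: v for k, v in cfg.items() if k != "_component_"}
    # ... then let call-site kwargs overwrite any duplicated keys
    merged.update(kwargs)
    return component(*args, **merged)


# Hypothetical config field that is missing its first positional argument
cfg_field = {"_component_": "fractions.Fraction", "denominator": 4}

# Positional argument supplied at the call site, denominator taken from the config
from_config = instantiate_sketch(cfg_field, 3)
# The call-site keyword argument replaces the config's denominator
overridden = instantiate_sketch(cfg_field, 3, denominator=8)
print(from_config, overridden)  # 3/4 3/8
```

In this sketch the config dict itself is never modified, so the same field can be instantiated again with different call-site arguments.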

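Referencing other config fields with interpolations¶

Sometimes you need to use the same value in more than one field. You can use interpolation to reference another field, and instantiate() will automatically resolve it for you.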

output_dir: /tmp/alpaca-llama2-finetune
metric_logger:
  _component_: torchtune.training.metric_logging.DiskLogger
  log_dir: ${output_dir}
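As a rough illustration of what resolving ${output_dir} means, the following toy resolver substitutes top-level values into string fields of a nested dict. It is a simplified stand-in, not the resolution logic the library uses: resolve_interpolations is a hypothetical helper, and it only handles references to top-level keys.

```python
import re
from typing import Any, Dict

# Matches ${name} and captures the name between the braces
_PATTERN = re.compile(r"\$\{([^}]+)\}")


def resolve_interpolations(node: Any, root: Dict[str, Any]) -> Any:
    """Recursively replace ``${key}`` in string values with the top-level value of ``key``."""
    if isinstance(node, dict):
        return {k: resolve_interpolations(v, root) for k, v in node.items()}
    if isinstance(node, str):
        return _PATTERN.sub(lambda m: str(root[m.group(1)]), node)
    return node


cfg = {
    "output_dir": "/tmp/alpaca-llama2-finetune",
    "metric_logger": {
        "_component_": "torchtune.training.metric_logging.DiskLogger",
        "log_dir": "${output_dir}",
    },
}
resolved = resolve_interpolations(cfg, cfg)
print(resolved["metric_logger"]["log_dir"])  # /tmp/alpaca-llama2-finetune
```

Changing output_dir in one place is then enough to move every field that references it.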

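Validating your config¶

We provide a convenient CLI utility, tune validate, to quickly verify that your config is well-formed and that all components can be instantiated properly. You can also pass in overrides if you want to test the exact command you will run your experiment with. If any parameters are not well-formed, tune validate will list all the locations where an error was found.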

tune cp llama2/7B_lora_single_device ./my_config.yaml
tune validate ./my_config.yaml
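One way such a check could work is to compare each component's keyword arguments against its call signature and collect every mismatch instead of stopping at the first. The sketch below is an assumption about the general approach, not the behavior of tune validate: validate_sketch is a hypothetical helper, the config is a plain dict, and fractions.Fraction stands in for a real component.

```python
import importlib
import inspect
from typing import Any, Dict, List


def validate_sketch(cfg: Dict[str, Any]) -> List[str]:
    """Collect one error message per component whose kwargs do not fit its signature."""
    errors = []
    for name, field in cfg.items():
        if not (isinstance(field, dict) and "_component_" in field):
            continue  # plain values such as seed or device are not checked here
        module_name, _, attr_name = field["_component_"].rpartition(".")
        try:
            component = getattr(importlib.import_module(module_name), attr_name)
            kwargs = {k: v for k, v in field.items() if k != "_component_"}
            # bind_partial checks argument names without calling the component
            inspect.signature(component).bind_partial(**kwargs)
        except (ImportError, AttributeError, TypeError, ValueError) as exc:
            errors.append(f"{name}: {exc}")
    return errors


# Hypothetical config: one valid field, one misspelled kwarg, one bad dotpath
cfg = {
    "good": {"_component_": "fractions.Fraction", "numerator": 1},
    "typo": {"_component_": "fractions.Fraction", "numerater": 1},
    "missing": {"_component_": "fractions.no_such_name"},
    "seed": None,
}
errors = validate_sketch(cfg)
for message in errors:
    print(message)
```

Using bind_partial rather than bind means a component that still needs a positional argument at instantiation time, such as a tokenizer, is not reported as an error in this sketch.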

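Best practices for writing configs¶

Let's discuss some guidelines for writing configs to get the most out of them.

Airtight configs¶

While it may be tempting to put as much as possible in the config to give you maximum flexibility in switching parameters for your experiments, we encourage you to only include fields in the config that will be used or instantiated in the recipe. This ensures full clarity on the options a recipe was run with and makes debugging significantly easier.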

# don't do this
alpaca_dataset:
  _component_: torchtune.datasets.alpaca_dataset
slimorca_dataset:
  ...

# do this
dataset:
  # change this in config or override when needed
  _component_: torchtune.datasets.alpaca_dataset

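Use public APIs only¶

If a component you want to specify in a config is located in a private file, use the public dotpath in your config. These components are typically exposed in their parent module's __init__.py file. This way, you can guarantee the stability of the API you are using in your config. No segment of your component dotpath should begin with an underscore.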

# don't do this
dataset:
  _component_: torchtune.datasets._alpaca.alpaca_dataset

# do this
dataset:
  _component_: torchtune.datasets.alpaca_dataset
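A quick way to apply this rule is to check every segment of a dotpath for a leading underscore. The helper below is hypothetical and not part of torchtune; it only inspects the string and does not import anything.

```python
def uses_private_module(dotpath: str) -> bool:
    """Return True if any segment of the dotpath starts with an underscore."""
    return any(part.startswith("_") for part in dotpath.split("."))


print(uses_private_module("torchtune.datasets._alpaca.alpaca_dataset"))  # True
print(uses_private_module("torchtune.datasets.alpaca_dataset"))  # False
```

Note that underscores inside a name, as in alpaca_dataset, are fine; only a leading underscore marks a segment as private.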

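Command-line overrides¶

Configs are the primary location to collect all the parameters needed to run a recipe, but sometimes you may want to quickly try different values without having to update the config itself. To enable quick experimentation, you can specify override values for parameters in your config via the tune command. These should be specified as key-value pairs: k1=v1 k2=v2 ...

For example, to run the LoRA single-device finetuning recipe with custom model and tokenizer directories, you can provide overrides: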

tune run lora_finetune_single_device \
--config llama2/7B_lora_single_device \
checkpointer.checkpoint_dir=/home/my_model_checkpoint \
checkpointer.checkpoint_files=['file_1','file_2'] \
tokenizer.path=/home/my_tokenizer_path
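Conceptually, each key=value pair addresses a location in the nested config by its dotted key. The following toy parser sketches that idea on a plain dict. It is not how the tune CLI is implemented: apply_overrides is a hypothetical helper, the starting values are made up for illustration, and values are kept as strings rather than being converted to lists, numbers, or booleans.

```python
from typing import Any, Dict, List


def apply_overrides(cfg: Dict[str, Any], overrides: List[str]) -> Dict[str, Any]:
    """Apply ``dotted.key=value`` overrides to a nested dict in place and return it."""
    for item in overrides:
        key, _, value = item.partition("=")
        *parents, leaf = key.split(".")
        node = cfg
        for part in parents:
            # Create intermediate dicts for keys the config does not have yet
            node = node.setdefault(part, {})
        node[leaf] = value  # values are kept as strings in this sketch
    return cfg


# Hypothetical starting values standing in for the contents of a config file
cfg = {
    "checkpointer": {"checkpoint_dir": "/tmp/old_checkpoint_dir"},
    "tokenizer": {"path": "/tmp/tokenizer.model"},
}
apply_overrides(
    cfg,
    [
        "checkpointer.checkpoint_dir=/home/my_model_checkpoint",
        "tokenizer.path=/home/my_tokenizer_path",
    ],
)
print(cfg["checkpointer"]["checkpoint_dir"])  # /home/my_model_checkpoint
print(cfg["tokenizer"]["path"])  # /home/my_tokenizer_path
```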

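Overriding components¶

If you would like to override a class or function in the config that is instantiated via the _component_ field, you can do so by assigning to the parameter name directly. Any nested fields in the component can be overridden with dot notation.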

dataset:
  _component_: torchtune.datasets.alpaca_dataset

# Change to slimorca_dataset and set train_on_input to True
tune run lora_finetune_single_device --config my_config.yaml \
dataset=torchtune.datasets.slimorca_dataset dataset.train_on_input=True
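The effect of assigning directly to a component field can be sketched as follows: when the target already holds a _component_, the new value replaces the dotpath and the field's other keys stay in place. This is an illustrative assumption about the merge behavior, written against a plain dict; apply_component_overrides is a hypothetical helper and values are kept as strings.

```python
from typing import Any, Dict, List


def apply_component_overrides(cfg: Dict[str, Any], overrides: List[str]) -> Dict[str, Any]:
    """Apply overrides, treating ``field=dotpath`` as ``field._component_=dotpath``."""
    for item in overrides:
        key, _, value = item.partition("=")
        *parents, leaf = key.split(".")
        node = cfg
        for part in parents:
            node = node.setdefault(part, {})
        target = node.get(leaf)
        if isinstance(target, dict) and "_component_" in target:
            # Assigning to a component field swaps the component and keeps its other keys
            target["_component_"] = value
        else:
            node[leaf] = value
    return cfg


cfg = {"dataset": {"_component_": "torchtune.datasets.alpaca_dataset"}}
apply_component_overrides(
    cfg,
    ["dataset=torchtune.datasets.slimorca_dataset", "dataset.train_on_input=True"],
)
print(cfg["dataset"])
```

Because the remaining keys are kept, an override that switches to a component with a different signature can leave behind arguments the new component does not accept.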

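Removing config fields¶

You may need to remove certain parameters from the config when changing components through overrides that require different keyword arguments. You can do so by using the ~ flag and specifying the dotpath of the config field you would like to remove. For example, if you want to override a built-in config and use the bitsandbytes.optim.PagedAdamW8bit optimizer, you may need to delete parameters like foreach, which are specific to PyTorch optimizers. Note that this example requires that you have bitsandbytes installed.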

# In configs/llama3/8B_full.yaml
optimizer:
  _component_: torch.optim.AdamW
  lr: 2e-5
  foreach: False

# Change to PagedAdamW8bit and remove foreach
tune run --nproc_per_node 4 full_finetune_distributed --config llama3/8B_full \
optimizer=bitsandbytes.optim.PagedAdamW8bit ~optimizer.foreach
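The removal step can be pictured as deleting a key from the nested config after the component has been swapped. The sketch below works on a plain dict and is not the CLI's implementation; remove_fields is a hypothetical helper.

```python
from typing import Any, Dict, List


def remove_fields(cfg: Dict[str, Any], removals: List[str]) -> Dict[str, Any]:
    """Delete each ``~dotted.key`` entry from a nested dict in place and return it."""
    for item in removals:
        if not item.startswith("~"):
            raise ValueError(f"expected a '~'-prefixed key, got {item!r}")
        *parents, leaf = item[1:].split(".")
        node = cfg
        for part in parents:
            node = node[part]
        del node[leaf]
    return cfg


cfg = {
    "optimizer": {
        "_component_": "torch.optim.AdamW",
        "lr": 2e-5,
        "foreach": False,
    }
}
# Swap the component first, then drop the keyword the new optimizer may not accept
cfg["optimizer"]["_component_"] = "bitsandbytes.optim.PagedAdamW8bit"
remove_fields(cfg, ["~optimizer.foreach"])
print(cfg["optimizer"])
```

In this sketch, removing a key that does not exist raises a KeyError, which surfaces typos in the dotpath early.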
