FullModelTorchTuneCheckpointer

class torchtune.training.FullModelTorchTuneCheckpointer(checkpoint_dir: str, checkpoint_files: List[str], model_type: str, output_dir: str, adapter_checkpoint: Optional[str] = None, recipe_checkpoint: Optional[str] = None, resume_from_checkpoint: bool = False, should_load_recipe_state: bool = False)

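Checkpointer which reads and writes checkpoints in a format compatible with torchtune. No conversion of weights is required.

Currently this supports reading a single checkpoint file only. This is likely to change as support for larger models is added.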

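Parameters:

checkpoint_dir (str) – Directory containing the checkpoint files
checkpoint_files (List[str]) – List of checkpoint files to load. Since the checkpointer takes care of sorting by file ID, the order in this list does not matter
model_type (str) – Model type of the model for which the checkpoint is being loaded, e.g. LLAMA3.
output_dir (str) – Directory in which to save the checkpoint files
adapter_checkpoint (Optional[str]) – Path to the adapter weights. If None and should_load_recipe_state=True, the checkpointer looks for adapter_model.pt in output_dir/epoch_{largest_epoch}. Default is None.
recipe_checkpoint (Optional[str]) – Path to the recipe state checkpoint file. If None and should_load_recipe_state=True, the checkpointer looks for recipe_state.pt in output_dir/RECIPE_STATE_DIRNAME. Default is None.
resume_from_checkpoint (bool) – If True, the checkpointer loads the additional checkpoint files corresponding to the recipe state from a previous run. Default is False. This flag is deprecated. Use the should_load_recipe_state flag instead.
should_load_recipe_state (bool) – If True, the checkpointer loads the additional checkpoint files corresponding to the recipe state from a previous run. Default is False

Raises:

ValueError – If more than one checkpoint file is provided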
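
As an illustration of the parameters above, the following is a minimal sketch of assembling the constructor arguments for a fresh (non-resumed) run. The helper names `checkpointer_kwargs` and `build_checkpointer`, the file name `model.pt`, and the `/tmp/...` paths are hypothetical placeholders, not part of torchtune. The length check only mirrors the documented `ValueError` early; it is not the library's own validation.

```python
from typing import Any, Dict, List


def checkpointer_kwargs(
    checkpoint_dir: str, checkpoint_files: List[str], output_dir: str
) -> Dict[str, Any]:
    """Hypothetical helper: collect constructor arguments for a fresh run."""
    # Mirror the documented constraint: more than one file is rejected.
    if len(checkpoint_files) > 1:
        raise ValueError(
            f"only a single checkpoint file is supported, got {len(checkpoint_files)}"
        )
    return {
        "checkpoint_dir": checkpoint_dir,
        "checkpoint_files": checkpoint_files,
        "model_type": "LLAMA3",
        "output_dir": output_dir,
        # No recipe state is loaded, so adapter_checkpoint and
        # recipe_checkpoint can stay at their default of None.
        "should_load_recipe_state": False,
    }


def build_checkpointer(**kwargs: Any):
    """Hypothetical helper: construct the checkpointer from keyword arguments."""
    # Imported lazily so the sketch can be read and run without torchtune installed.
    from torchtune.training import FullModelTorchTuneCheckpointer

    return FullModelTorchTuneCheckpointer(**kwargs)


# Placeholder paths; point these at real locations before calling build_checkpointer.
kwargs = checkpointer_kwargs("/tmp/llama3", ["model.pt"], "/tmp/llama3-out")
```

With real paths in place, `build_checkpointer(**kwargs)` would return the checkpointer instance.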

load_checkpoint(weights_only: bool = True) → Dict[str, Any]

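Load a torchtune checkpoint from file. Currently only loading from a single file is supported.

The output state_dict has the following format, with keys other than "model" present only if should_load_recipe_state is True: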

>>>     {
>>>         "model": {
>>>             "key_1": weight
>>>             ...
>>>         },
>>>         "optimizer": {...},
>>>         ...
>>>     }

Parameters: weights_only (bool) – Flag passed down to torch.load. This flag is exposed because quantized models cannot be loaded with weights_only=True
Returns: state_dict from the input checkpoint
Return type: Dict[str, Any]
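
The returned layout can be consumed as in the following minimal sketch. `StubCheckpointer` is a hypothetical stand-in that returns data in the documented shape so the example runs without torchtune or a checkpoint file. `unpack_checkpoint` is likewise an illustrative helper, and the `"optimizer"` and `"epoch"` values are made up.

```python
from typing import Any, Dict, Tuple


class StubCheckpointer:
    """Hypothetical stand-in that mimics the documented return format."""

    def __init__(self, should_load_recipe_state: bool) -> None:
        self._should_load_recipe_state = should_load_recipe_state

    def load_checkpoint(self, weights_only: bool = True) -> Dict[str, Any]:
        ckpt: Dict[str, Any] = {"model": {"key_1": 1.0}}
        if self._should_load_recipe_state:
            # Extra keys appear only when recipe state is requested.
            ckpt["optimizer"] = {"state": {}}
            ckpt["epoch"] = 3
        return ckpt


def unpack_checkpoint(checkpointer: Any) -> Tuple[Dict[str, Any], Dict[str, Any]]:
    """Split the loaded dict into model weights and everything else."""
    ckpt = checkpointer.load_checkpoint(weights_only=True)
    model_state = ckpt["model"]
    # Empty when should_load_recipe_state was False.
    recipe_state = {k: v for k, v in ckpt.items() if k != "model"}
    return model_state, recipe_state


fresh_model, fresh_recipe = unpack_checkpoint(StubCheckpointer(False))
resumed_model, resumed_recipe = unpack_checkpoint(StubCheckpointer(True))
```

Because `unpack_checkpoint` only relies on the `load_checkpoint(weights_only=...)` signature and the `"model"` key, the same function should work with a real checkpointer instance.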

save_checkpoint(state_dict: Dict[str, Any], epoch: int, intermediate_checkpoint: bool = False, adapter_only: bool = False) → None

Save a torchtune checkpoint to file. If intermediate_checkpoint is True, an additional checkpoint file recipe_state.pt, which contains the recipe state, is created in _output_dir/RECIPE_STATE_DIRNAME. The saved state dicts have the following formats:

>>> # Model
>>> {
>>>     "key_1": weight
>>>     ...
>>> }
>>>
>>> # Recipe state
>>> {
>>>     "optimizer": ...,
>>>     "epoch": ...,
>>>     ...
>>> }

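Parameters:

state_dict (Dict[str, Any]) – State dict containing the model and (optionally) the recipe state
epoch (int) – Current epoch number. This is added to the checkpoint file name so that intermediate checkpoint files are not overwritten
intermediate_checkpoint (bool) – If True, an additional checkpoint file containing the recipe state is saved
adapter_only (bool) – If True, only the adapter weights are saved. Default is False

Raises:

ValueError – If adapter_only is True and the adapter checkpoint is not found in state_dict.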
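
To make the two saved formats concrete, here is a minimal sketch of how a combined state_dict maps onto the model file and the recipe state file. `split_for_saving` is a hypothetical helper written for illustration, not the library's implementation. It assumes the model weights sit under the `"model"` key, as in the `load_checkpoint` output, and it ignores adapter weights entirely.

```python
from typing import Any, Dict, Optional, Tuple


def split_for_saving(
    state_dict: Dict[str, Any], intermediate_checkpoint: bool
) -> Tuple[Dict[str, Any], Optional[Dict[str, Any]]]:
    """Hypothetical helper: separate model weights from recipe state."""
    # Assumption: model weights are stored under the "model" key.
    model_weights = state_dict["model"]
    if not intermediate_checkpoint:
        # Final checkpoint: only the model weights are written.
        return model_weights, None
    # Intermediate checkpoint: the remaining keys form the recipe state.
    recipe_state = {k: v for k, v in state_dict.items() if k != "model"}
    return model_weights, recipe_state


# Made-up values in the documented shape.
state_dict = {
    "model": {"key_1": 0.5},
    "optimizer": {"lr": 1e-4},
    "epoch": 2,
}

model_weights, recipe_state = split_for_saving(state_dict, intermediate_checkpoint=True)
final_weights, final_recipe = split_for_saving(state_dict, intermediate_checkpoint=False)
```

With a real checkpointer, the corresponding call would be `checkpointer.save_checkpoint(state_dict, epoch=2, intermediate_checkpoint=True)`, which performs the file writes itself.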
