Llama3 in torchtune

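You will learn how to:

Download the Llama3-8B weights and tokenizer
Fine-tune Llama3-8B with LoRA and QLoRA
Evaluate your fine-tuned Llama3-8B model
Generate text with your fine-tuned model
Quantize your model to speed up generation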

Prerequisites

Be familiar with torchtune
Make sure torchtune is installed

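Llama3-8B

Llama3-8B is a new model released by Meta AI that improves on the performance of the Llama2 family of models across a range of different benchmarks. There are a few main changes between the Llama2-7B and Llama3-8B models:

Llama3-8B uses grouped-query attention instead of the standard multi-head attention used in Llama2-7B.
Llama3-8B has a larger vocabulary (128,256 tokens instead of the 32,000 used by the Llama2 models).
Llama3-8B uses a different tokenizer than the Llama2 models (tiktoken instead of sentencepiece).
Llama3-8B uses a larger intermediate dimension in its MLP layers than Llama2-7B.
Llama3-8B uses a higher base value to compute theta in its rotary positional embeddings.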
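
The architectural differences listed above can be summarized in a small side-by-side sketch. Apart from the two vocabulary sizes, the numbers below are not stated in this tutorial; they are the commonly published hyperparameters for the two models and should be treated as assumptions. `ModelShape` is a hypothetical container used only for illustration and is not a torchtune class.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelShape:
    """Hypothetical summary of a decoder's key hyperparameters."""
    vocab_size: int
    embed_dim: int
    num_heads: int
    num_kv_heads: int
    intermediate_dim: int
    rope_base: int

    @property
    def uses_gqa(self) -> bool:
        # Grouped-query attention shares each key/value head across several query heads.
        return self.num_kv_heads < self.num_heads

    @property
    def embedding_params(self) -> int:
        # Size of the token embedding table alone.
        return self.vocab_size * self.embed_dim


# Assumed values, taken from the commonly published model configurations.
LLAMA2_7B = ModelShape(32_000, 4096, 32, 32, 11_008, 10_000)
LLAMA3_8B = ModelShape(128_256, 4096, 32, 8, 14_336, 500_000)

for name, shape in (("Llama2-7B", LLAMA2_7B), ("Llama3-8B", LLAMA3_8B)):
    print(name, "| GQA:", shape.uses_gqa, "| embedding params:", shape.embedding_params)
```

Under these assumptions, the larger vocabulary alone roughly quadruples the size of the embedding table, which accounts for part of the jump from 7B to 8B parameters.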

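Getting access to Llama3-8B

First, let's download the model from Hugging Face. You will need to follow the instructions on the official Meta page to gain access to the model. Next, make sure you have your Hugging Face access token.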

tune download meta-llama/Meta-Llama-3-8B \
    --output-dir <checkpoint_dir> \
    --hf-token <ACCESS TOKEN>

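Fine-tuning Llama3-8B in torchtune

torchtune provides LoRA, QLoRA, and full fine-tuning recipes for fine-tuning Llama3-8B on one or more GPUs. For more on LoRA in torchtune, see our LoRA tutorial. For more on QLoRA in torchtune, see our QLoRA tutorial.

Let's look at how to fine-tune Llama3-8B with LoRA on a single device using torchtune. In this example, we will fine-tune for one epoch on a common instruction dataset for illustrative purposes. The basic command for a single-device LoRA fine-tune is: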

tune run lora_finetune_single_device --config llama3/8B_lora_single_device

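Note

To see the full list of recipes and their corresponding configs, simply run tune ls from the command line.

We can also add command-line overrides as needed, for example: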

tune run lora_finetune_single_device --config llama3/8B_lora_single_device \
    checkpointer.checkpoint_dir=<checkpoint_dir> \
    tokenizer.path=<checkpoint_dir>/tokenizer.model \
    checkpointer.output_dir=<checkpoint_dir>

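This loads the Llama3-8B checkpoint and tokenizer from <checkpoint_dir>, the directory used in the tune download command above, and then saves the final checkpoint to the same directory in the original format. For more details on the checkpoint formats supported by torchtune, see our checkpointing deep-dive.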
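
To make the override syntax more concrete, here is a minimal sketch of how dotted key=value pairs can be merged into a nested config. This is not torchtune's implementation; `apply_overrides` and the default paths are hypothetical, and the real CLI may treat types and special cases differently.

```python
from typing import Any


def apply_overrides(config: dict[str, Any], overrides: list[str]) -> dict[str, Any]:
    """Merge 'a.b.c=value' strings into a nested dict (illustrative only)."""
    for item in overrides:
        dotted_key, value = item.split("=", 1)
        *parents, leaf = dotted_key.split(".")
        node = config
        for part in parents:
            # Walk down the nesting, creating intermediate levels when missing.
            node = node.setdefault(part, {})
        node[leaf] = value
    return config


# Hypothetical defaults standing in for the values shipped in a recipe config.
config = {
    "checkpointer": {"checkpoint_dir": "/default/dir", "output_dir": "/default/dir"},
    "tokenizer": {"path": "/default/dir/tokenizer.model"},
}

apply_overrides(
    config,
    [
        "checkpointer.checkpoint_dir=/my/ckpt",
        "tokenizer.path=/my/ckpt/tokenizer.model",
        "checkpointer.output_dir=/my/ckpt",
    ],
)
print(config)
```

Each override touches only the key it names, so every other value in the config keeps its default.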

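Note

To see the full set of configurable parameters for this (and other) configs, we can use tune cp to copy (and modify) the default config. tune cp can also be used with recipe scripts, in case you want to make more custom changes that cannot be achieved by directly modifying the existing configurable parameters. For more on tune cp, see the section on modifying configs.

Once training is complete, the model checkpoints are saved and their locations are logged. For LoRA fine-tuning, the final checkpoint contains the merged weights, and a copy of just the (much smaller) LoRA weights is saved separately.

In our experiments, we observed a peak memory usage of 18.5 GB. The default config can be trained on a consumer GPU with 24 GB of VRAM.

If you have multiple GPUs available, you can run the distributed version of the recipe. torchtune uses the FSDP APIs from PyTorch Distributed to shard the model, optimizer states, and gradients. This should let you increase your batch size and speed up overall training. For example, on two devices: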

tune run --nproc_per_node 2 lora_finetune_distributed --config llama3/8B_lora

Finally, if we want to use even less memory, we can use torchtune's QLoRA recipe:

tune run lora_finetune_single_device --config llama3/8B_qlora_single_device

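Since our default configs enable full bfloat16 training, all of the above commands can be run on devices with at least 24 GB of VRAM, and in fact the QLoRA recipe should have a peak allocated memory below 10 GB. You can also experiment with different LoRA and QLoRA configurations, or even run a full fine-tune. Try it out!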
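
A rough back-of-the-envelope estimate helps explain the memory figures quoted above. The sketch below counts only the memory needed to hold the base model weights; it assumes roughly 8.03 billion parameters (a figure not given in this tutorial) and ignores activations, LoRA parameters, optimizer state, and quantization metadata, so real peak usage will be higher.

```python
def weight_memory_gb(num_params: float, bits_per_param: float) -> float:
    """Memory needed to store the weights alone, in GB (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


NUM_PARAMS = 8.03e9  # assumed approximate parameter count for Llama3-8B

bf16_gb = weight_memory_gb(NUM_PARAMS, 16)     # bfloat16 uses 16 bits per weight
four_bit_gb = weight_memory_gb(NUM_PARAMS, 4)  # 4-bit storage for the frozen base weights

print(f"bf16 weights:  {bf16_gb:.2f} GB")
print(f"4-bit weights: {four_bit_gb:.2f} GB")
```

Under these assumptions, the bfloat16 weights alone account for most of the 18.5 GB peak observed with LoRA, while storing the frozen base weights in 4 bits leaves enough headroom for the QLoRA peak to stay below 10 GB.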

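Evaluating the fine-tuned Llama3-8B model with EleutherAI's evaluation harness

Now that we have fine-tuned Llama3-8B, what's next? Let's take our LoRA fine-tuned model from the previous section and look at a couple of different ways to evaluate its performance on the tasks we care about.

First, torchtune provides an integration with EleutherAI's evaluation harness for evaluating models on common benchmark tasks.

Note

Make sure you have installed the evaluation harness via pip install "lm_eval==0.4.*".

In this tutorial we will use the truthfulqa_mc2 task from the harness. This task measures a model's propensity to be truthful when answering questions: it measures the model's zero-shot accuracy on a question followed by one or more true answers and one or more false answers. First, let's copy the config so that we can point the YAML file to our fine-tuned checkpoint files.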

tune cp eleuther_evaluation ./custom_eval_config.yaml

Next, we modify custom_eval_config.yaml to include the fine-tuned checkpoints.

model:
  _component_: torchtune.models.llama3.llama3_8b

checkpointer:
  _component_: torchtune.utils.FullModelMetaCheckpointer

  # directory with the checkpoint files
  # this should match the output_dir specified during
  # fine-tuning
  checkpoint_dir: <checkpoint_dir>

  # checkpoint files for the fine-tuned model. These will be logged
  # at the end of your fine-tune
  checkpoint_files: [
    consolidated.00.pth
  ]

  output_dir: <checkpoint_dir>
  model_type: LLAMA3

# Make sure to update the tokenizer path to the right
# checkpoint directory as well
tokenizer:
  _component_: torchtune.models.llama3.llama3_tokenizer
  path: <checkpoint_dir>/tokenizer.model

Finally, we can run the evaluation using our modified config.

tune run eleuther_eval --config ./custom_eval_config.yaml

Try it for yourself and see what accuracy your model gets!
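
To give some intuition for the number reported by the truthfulqa_mc2 task, here is a minimal sketch of how an MC2-style score is commonly defined: the share of probability mass that the model assigns to the true answers, out of the mass assigned to all candidate answers. This reflects our understanding of the metric rather than the harness's exact code, and the log-likelihood values below are made up for illustration.

```python
import math


def mc2_score(true_logprobs: list[float], false_logprobs: list[float]) -> float:
    """Normalized probability mass assigned to the true answers for one question."""
    p_true = [math.exp(lp) for lp in true_logprobs]
    p_false = [math.exp(lp) for lp in false_logprobs]
    return sum(p_true) / (sum(p_true) + sum(p_false))


# Hypothetical log-likelihoods for one question with two true and two false answers.
example = mc2_score(
    true_logprobs=[math.log(0.3), math.log(0.1)],
    false_logprobs=[math.log(0.4), math.log(0.2)],
)
print(f"MC2 for this question: {example:.2f}")
```

The per-question scores are then averaged over the dataset, so a model that spreads its probability evenly between true and false answers lands near 0.5.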

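Generating text with our fine-tuned Llama3-8B model

Next, let's look at another way to evaluate our model: generating text! torchtune provides a recipe for generation as well.

As before, let's copy and modify the default generation config.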

tune cp generation ./custom_generation_config.yaml

Now we modify custom_generation_config.yaml to point to our checkpoint and tokenizer.

model:
  _component_: torchtune.models.llama3.llama3_8b

checkpointer:
  _component_: torchtune.utils.FullModelMetaCheckpointer

  # directory with the checkpoint files
  # this should match the output_dir specified during
  # fine-tuning
  checkpoint_dir: <checkpoint_dir>

  # checkpoint files for the fine-tuned model. These will be logged
  # at the end of your fine-tune
  checkpoint_files: [
    consolidated.00.pth
  ]

  output_dir: <checkpoint_dir>
  model_type: LLAMA3

# Make sure to update the tokenizer path to the right
# checkpoint directory as well
tokenizer:
  _component_: torchtune.models.llama3.llama3_tokenizer
  path: <checkpoint_dir>/tokenizer.model

Running generation with our LoRA fine-tuned model, we see the following output:

tune run generate --config ./custom_generation_config.yaml \
prompt="Hello, my name is"

[generate.py:122] Hello, my name is Sarah and I am a busy working mum of two young children, living in the North East of England.
...
[generate.py:135] Time for inference: 10.88 sec total, 18.94 tokens/sec
[generate.py:138] Bandwidth achieved: 346.09 GB/s
[generate.py:139] Memory used: 18.31 GB

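Faster generation via quantization

We can see that the model took just under 11 seconds, generating almost 19 tokens per second. We can speed this up by quantizing our model. Here we will use the 4-bit weight-only quantization provided by torchao.

If you have been following along this far, you know the drill by now. Let's copy the quantization config and point it at our fine-tuned model.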

tune cp quantization ./custom_quantization_config.yaml

Then update custom_quantization_config.yaml with the following:

# Model arguments
model:
  _component_: torchtune.models.llama3.llama3_8b

checkpointer:
  _component_: torchtune.utils.FullModelMetaCheckpointer

  # directory with the checkpoint files
  # this should match the output_dir specified during
  # fine-tuning
  checkpoint_dir: <checkpoint_dir>

  # checkpoint files for the fine-tuned model. These will be logged
  # at the end of your fine-tune
  checkpoint_files: [
    consolidated.00.pth
  ]

  output_dir: <checkpoint_dir>
  model_type: LLAMA3

To quantize the model, we can now run:

tune run quantize --config ./custom_quantization_config.yaml

[quantize.py:90] Time for quantization: 2.93 sec
[quantize.py:91] Memory used: 23.13 GB
[quantize.py:104] Model checkpoint of size 4.92 GB saved to /tmp/Llama-3-8B-hf/consolidated-4w.pt

We can see that the model is now under 5 GB, or a little over four bits for each of its 8B parameters.
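
The "a little over four bits" figure can be checked from the logged checkpoint size. The sketch below assumes that 1 GB means 1e9 bytes and that the model has roughly 8.03 billion parameters; neither assumption is stated in the log, so the result is approximate.

```python
def bits_per_param(checkpoint_gb: float, num_params: float) -> float:
    """Average storage cost per parameter, in bits (1 GB = 1e9 bytes)."""
    return checkpoint_gb * 1e9 * 8 / num_params


# 4.92 GB comes from the quantization log above; 8.03e9 is an assumed parameter count.
bits = bits_per_param(checkpoint_gb=4.92, num_params=8.03e9)
print(f"{bits:.2f} bits per parameter")
```

The average comes out above 4 bits because a checkpoint also stores quantization metadata, and possibly some tensors at higher precision.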

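Note

Unlike the fine-tuned checkpoints, the quantization recipe outputs a single checkpoint file. This is because our quantization APIs currently don't support any conversion across formats. As a result, you won't be able to use these quantized models outside of torchtune. You should, however, be able to use them with the generation and evaluation recipes within torchtune. These results will help inform which quantization methods you should use with your favorite inference engine.

Let's take our quantized model and run the same generation again. First, we'll make one more change to our custom_generation_config.yaml.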

checkpointer:
  # we need to use the custom torchtune checkpointer
  # instead of the Meta checkpointer used above for loading
  # quantized models
  _component_: torchtune.utils.FullModelTorchTuneCheckpointer

  # directory with the checkpoint files
  # this should match the output_dir specified during
  # fine-tuning
  checkpoint_dir: <checkpoint_dir>

  # checkpoint files point to the quantized model
  checkpoint_files: [
    consolidated-4w.pt,
  ]

  output_dir: <checkpoint_dir>
  model_type: LLAMA3

# we also need to update the quantizer to what was used during
# quantization
quantizer:
  _component_: torchtune.utils.quantization.Int4WeightOnlyQuantizer
  groupsize: 256

Let's rerun generation!

tune run generate --config ./custom_generation_config.yaml \
prompt="Hello, my name is"

[generate.py:122] Hello, my name is Jake.
I am a multi-disciplined artist with a passion for creating, drawing and painting.
...
Time for inference: 1.62 sec total, 57.95 tokens/sec

By quantizing the model and running torch.compile, we get a speedup of over 3x!
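
The speedup can be read straight off the two log lines shown earlier. The sketch below parses them with a regular expression and compares throughput; the helper name is hypothetical, and the pattern assumes the log format stays exactly as printed above.

```python
import re

LOG_PATTERN = re.compile(r"Time for inference: ([\d.]+) sec total, ([\d.]+) tokens/sec")


def parse_inference_line(line: str) -> tuple[float, float]:
    """Return (total_seconds, tokens_per_second) from a generation log line."""
    match = LOG_PATTERN.search(line)
    if match is None:
        raise ValueError(f"unrecognized log line: {line!r}")
    return float(match.group(1)), float(match.group(2))


baseline = parse_inference_line(
    "[generate.py:135] Time for inference: 10.88 sec total, 18.94 tokens/sec"
)
quantized_run = parse_inference_line("Time for inference: 1.62 sec total, 57.95 tokens/sec")

throughput_speedup = quantized_run[1] / baseline[1]
print(f"throughput speedup: {throughput_speedup:.2f}x")
```

Tokens per second is the fairer comparison here, because the total time also depends on how many tokens each run happened to generate.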

This is just the beginning of what you can do with Llama3-8B using torchtune and the broader ecosystem. We look forward to seeing what you build!