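Meta Llama3 in torchtune

You will learn how to:

Download the Llama3-8B-Instruct weights and tokenizer
Fine-tune Llama3-8B-Instruct with LoRA and QLoRA
Evaluate your fine-tuned Llama3-8B-Instruct model
Generate text with your fine-tuned model
Quantize your model to speed up generation

Prerequisites

Be familiar with torchtune
Make sure you have installed torchtune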

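Llama3-8B

Meta Llama 3 is a new family of models released by Meta AI that outperforms the Llama2 family of models across a range of benchmarks. Meta Llama 3 currently comes in two sizes: 8B and 70B. In this tutorial we focus on the 8B model. There are a few main changes between the Llama2-7B and Llama3-8B models:

Llama3-8B uses grouped-query attention instead of the standard multi-head attention used in Llama2-7B
Llama3-8B has a larger vocabulary size (128,256 instead of 32,000 in the Llama2 models)
Llama3-8B uses a different tokenizer than the Llama2 models (tiktoken instead of sentencepiece)
Llama3-8B uses a larger intermediate dimension in its MLP layers than Llama2-7B
Llama3-8B uses a higher base value to calculate theta in its rotary positional embeddings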
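
To make two of these changes concrete, here is a minimal sketch of how a higher rotary base changes the rotation frequencies and how grouped-query attention shrinks the key/value cache. The specific numbers (bases of 10,000 and 500,000, a head dimension of 128, 32 layers, 32 query heads, 8 key/value heads) are assumptions taken from the published model configurations, not from this tutorial, and the helper names are purely illustrative.

```python
# Illustrative only: the constants below are assumptions from the published
# model configs, and these helpers are not part of torchtune.

def rope_inv_freq(base: float, head_dim: int) -> list[float]:
    """Rotation frequency for each dimension pair: base ** (-2i / head_dim)."""
    return [base ** (-(2 * i) / head_dim) for i in range(head_dim // 2)]


def kv_cache_values_per_token(num_kv_heads: int, head_dim: int, num_layers: int) -> int:
    """Number of cached key and value entries stored for one token."""
    return 2 * num_kv_heads * head_dim * num_layers


llama2_freqs = rope_inv_freq(10_000.0, 128)
llama3_freqs = rope_inv_freq(500_000.0, 128)
# A higher base lowers the slowest frequency, i.e. lengthens the longest wavelength.
print(llama2_freqs[-1], llama3_freqs[-1])

mha = kv_cache_values_per_token(num_kv_heads=32, head_dim=128, num_layers=32)
gqa = kv_cache_values_per_token(num_kv_heads=8, head_dim=128, num_layers=32)
print(mha // gqa)  # how many times smaller the grouped-query cache is
```

Under these assumptions, sharing each key/value head across four query heads cuts the cache by the same factor of four.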

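Getting access to Llama3-8B-Instruct

For this tutorial we will use the instruction-tuned version of Llama3-8B. First, let's download the model from Hugging Face. You will need to follow the instructions on the official Meta page to gain access to the model. Next, make sure you have your Hugging Face access token.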

tune download meta-llama/Meta-Llama-3-8B-Instruct \
    --output-dir <checkpoint_dir> \
    --hf-token <ACCESS TOKEN>

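Fine-tuning Llama3-8B-Instruct in torchtune

torchtune provides LoRA, QLoRA, and full fine-tuning recipes for fine-tuning Llama3-8B on one or more GPUs. For more on LoRA in torchtune, see our LoRA tutorial. For more on QLoRA in torchtune, see our QLoRA tutorial.

Let's look at how to fine-tune Llama3-8B-Instruct with LoRA on a single device using torchtune. In this example, for illustration, we fine-tune for one epoch on a common instruct dataset. The basic command for a single-device LoRA fine-tune is: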

tune run lora_finetune_single_device --config llama3/8B_lora_single_device

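Note

To see a full list of recipes and their corresponding configs, run tune ls from the command line.

We can also add command-line overrides as needed, for example: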

tune run lora_finetune_single_device --config llama3/8B_lora_single_device \
    checkpointer.checkpoint_dir=<checkpoint_dir> \
    tokenizer.path=<checkpoint_dir>/tokenizer.model \
    checkpointer.output_dir=<checkpoint_dir>

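This loads the Llama3-8B-Instruct checkpoint and tokenizer from the <checkpoint_dir> used in the tune download command above, then saves the final checkpoint in the same directory in the original format. For more details on the checkpoint formats supported in torchtune, see our checkpointing deep-dive.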
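
As a rough mental model of what dotted key=value overrides do, the sketch below merges them into a nested dictionary. This is not torchtune's actual implementation; the function name `apply_overrides` and the example paths are hypothetical and exist only for illustration.

```python
import copy


def apply_overrides(config: dict, overrides: list[str]) -> dict:
    """Return a copy of config with dotted key=value overrides applied."""
    result = copy.deepcopy(config)
    for item in overrides:
        dotted_key, value = item.split("=", 1)
        *parents, leaf = dotted_key.split(".")
        node = result
        for name in parents:
            # Walk down the nested mapping, creating levels that are missing.
            node = node.setdefault(name, {})
        node[leaf] = value
    return result


# Hypothetical default values standing in for a recipe config.
base_config = {
    "checkpointer": {"checkpoint_dir": "/tmp/default", "output_dir": "/tmp/default"},
    "tokenizer": {"path": "/tmp/default/tokenizer.model"},
}
merged = apply_overrides(
    base_config,
    [
        "checkpointer.checkpoint_dir=/data/llama3",
        "tokenizer.path=/data/llama3/tokenizer.model",
        "checkpointer.output_dir=/data/llama3",
    ],
)
print(merged["tokenizer"]["path"])
```

Each override touches only the key it names, so the rest of the default config is left as it was.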

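Note

To see the full set of configurable parameters for this (and other) configs, we can use tune cp to copy (and modify) the default config. tune cp can also be used with recipe scripts, in case you want to make more custom changes that cannot be achieved by directly modifying existing configurable parameters. For more on tune cp, see the section on modifying configs.

Once training is complete, the model checkpoint is saved and its location is logged. For LoRA fine-tuning, the final checkpoint contains the merged weights, and a copy of just the (much smaller) LoRA weights is saved separately.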
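
For readers unfamiliar with what "merged weights" means, here is a minimal sketch of the standard LoRA merge, W_merged = W + (alpha / rank) * B @ A, written with plain Python lists. It illustrates the general technique under the usual LoRA conventions and is not torchtune's own merging code; the tiny matrices are made up for the example.

```python
def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def merge_lora(weight, lora_a, lora_b, alpha, rank):
    """Fold a low-rank update into the base weight: W + (alpha / rank) * B @ A."""
    scale = alpha / rank
    delta = matmul(lora_b, lora_a)  # shape (out_dim, in_dim), same as weight
    return [
        [w + scale * d for w, d in zip(w_row, d_row)]
        for w_row, d_row in zip(weight, delta)
    ]


weight = [[1.0, 0.0], [0.0, 1.0]]  # base weight, shape (out_dim=2, in_dim=2)
lora_a = [[0.5, 0.0]]              # A, shape (rank=1, in_dim=2)
lora_b = [[1.0], [2.0]]            # B, shape (out_dim=2, rank=1)
merged = merge_lora(weight, lora_a, lora_b, alpha=2.0, rank=1)
print(merged)
```

The adapter stores only A and B, which is why the LoRA-only copy is much smaller than the merged checkpoint.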

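In our experiments, we observed a peak memory usage of 18.5 GB. The default config can be trained on a consumer GPU with 24 GB of VRAM.

If you have multiple GPUs available, you can run the distributed version of the recipe. torchtune uses the FSDP APIs from PyTorch Distributed to shard the model, optimizer states, and gradients. This should let you increase your batch size and so speed up overall training. For example, on two devices: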

tune run --nproc_per_node 2 lora_finetune_distributed --config llama3/8B_lora

Finally, if we want to use even less memory, we can use torchtune's QLoRA recipe:

tune run lora_finetune_single_device --config llama3/8B_qlora_single_device

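Since our default configs enable full bfloat16 training, all of the above commands can be run on devices with at least 24 GB of VRAM, and the QLoRA recipe should in fact have a peak allocated memory below 10 GB. You can also experiment with different LoRA and QLoRA configurations, or even run a full fine-tune. Try it out!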
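
A back-of-the-envelope estimate helps explain these memory figures. The sketch below counts weight storage only, assuming a rounded 8 billion parameters, 16 bits per parameter for bfloat16, and 4 bits per parameter for the quantized base weights in QLoRA. Activations, LoRA parameters, optimizer state, and quantization metadata are deliberately left out, so real peak usage will be higher.

```python
def weight_memory_gb(num_params: float, bits_per_param: float) -> float:
    """Storage for the weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


NUM_PARAMS = 8e9  # rounded parameter count; an assumption for illustration

bf16_gb = weight_memory_gb(NUM_PARAMS, 16)
four_bit_gb = weight_memory_gb(NUM_PARAMS, 4)
print(f"bfloat16 weights: {bf16_gb:.1f} GB, 4-bit weights: {four_bit_gb:.1f} GB")
```

Under these assumptions the bfloat16 weights alone take about 16 GB, which is consistent with an 18.5 GB peak for LoRA, while 4-bit base weights leave room for the QLoRA run to stay below 10 GB.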

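Evaluating fine-tuned Llama3-8B models with EleutherAI's Eval Harness

Now that we have fine-tuned our model, what's next? Let's take the LoRA-fine-tuned model from the previous section and look at a couple of ways to evaluate its performance on the tasks we care about.

First, torchtune provides an integration with EleutherAI's evaluation harness for evaluating models on common benchmark tasks.

Note

Make sure you have installed the evaluation harness via pip install "lm_eval==0.4.*".

In this tutorial we use the truthfulqa_mc2 task from the harness. This task measures a model's propensity to be truthful when answering questions: it reports the model's zero-shot accuracy on a question followed by one or more true responses and one or more false responses. First, let's copy the config so we can point the YAML file at our fine-tuned checkpoint files.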
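
To give a feel for the metric, here is a sketch of how an mc2-style score can be computed from per-answer log-likelihoods: the probability mass on the true answers divided by the total mass on all listed answers. This reflects my understanding of the scoring and is not copied from the harness; the function name and the example numbers are hypothetical.

```python
import math


def mc2_score(true_logprobs: list[float], false_logprobs: list[float]) -> float:
    """Share of probability mass that the model assigns to the true answers."""
    p_true = sum(math.exp(lp) for lp in true_logprobs)
    p_false = sum(math.exp(lp) for lp in false_logprobs)
    return p_true / (p_true + p_false)


# Hypothetical log-likelihoods for one question with two true answers
# and one false answer.
score = mc2_score(
    true_logprobs=[math.log(0.3), math.log(0.3)],
    false_logprobs=[math.log(0.2)],
)
print(round(score, 4))
```

A score near 1 means nearly all of the mass sits on true answers; the task-level number is an average of per-question scores.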

tune cp eleuther_evaluation ./custom_eval_config.yaml

Next, we modify custom_eval_config.yaml to include the fine-tuned checkpoints.

model:
  _component_: torchtune.models.llama3.llama3_8b

checkpointer:
  _component_: torchtune.utils.FullModelMetaCheckpointer

  # directory with the checkpoint files
  # this should match the output_dir specified during
  # fine-tuning
  checkpoint_dir: <checkpoint_dir>

  # checkpoint files for the fine-tuned model. These will be logged
  # at the end of your fine-tune
  checkpoint_files: [
    meta_model_0.pt
  ]

  output_dir: <checkpoint_dir>
  model_type: LLAMA3

# Make sure to update the tokenizer path to the right
# checkpoint directory as well
tokenizer:
  _component_: torchtune.models.llama3.llama3_tokenizer
  path: <checkpoint_dir>/tokenizer.model

Finally, we can run evaluation using our modified config.

tune run eleuther_eval --config ./custom_eval_config.yaml

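Try it for yourself and see what accuracy your model gets!

Generating text with our fine-tuned Llama3 model

Next, let's look at one other way to evaluate our model: generating text! torchtune provides a recipe for generation as well.

As before, let's copy and modify the default generation config.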

tune cp generation ./custom_generation_config.yaml

Now we modify custom_generation_config.yaml to point to our checkpoint and tokenizer.

model:
  _component_: torchtune.models.llama3.llama3_8b

checkpointer:
  _component_: torchtune.utils.FullModelMetaCheckpointer

  # directory with the checkpoint files
  # this should match the output_dir specified during
  # fine-tuning
  checkpoint_dir: <checkpoint_dir>

  # checkpoint files for the fine-tuned model. These will be logged
  # at the end of your fine-tune
  checkpoint_files: [
    meta_model_0.pt
  ]

  output_dir: <checkpoint_dir>
  model_type: LLAMA3

# Make sure to update the tokenizer path to the right
# checkpoint directory as well
tokenizer:
  _component_: torchtune.models.llama3.llama3_tokenizer
  path: <checkpoint_dir>/tokenizer.model

Running generation with our LoRA-fine-tuned model, we see the following output:

tune run generate --config ./custom_generation_config.yaml \
prompt="Hello, my name is"

[generate.py:122] Hello, my name is Sarah and I am a busy working mum of two young children, living in the North East of England.
...
[generate.py:135] Time for inference: 10.88 sec total, 18.94 tokens/sec
[generate.py:138] Bandwidth achieved: 346.09 GB/s
[generate.py:139] Memory used: 18.31 GB

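Faster generation via quantization

We can see that the model took just under 11 seconds, generating almost 19 tokens per second. We can speed this up by quantizing the model. Here we use 4-bit weight-only quantization provided by torchao.

If you have been following along this far, you know the drill by now. Let's copy the quantization config and point it at our fine-tuned model.

tune cp quantization ./custom_quantization_config.yaml

Then update custom_quantization_config.yaml with the following: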

# Model arguments
model:
  _component_: torchtune.models.llama3.llama3_8b

checkpointer:
  _component_: torchtune.utils.FullModelMetaCheckpointer

  # directory with the checkpoint files
  # this should match the output_dir specified during
  # fine-tuning
  checkpoint_dir: <checkpoint_dir>

  # checkpoint files for the fine-tuned model. These will be logged
  # at the end of your fine-tune
  checkpoint_files: [
    meta_model_0.pt
  ]

  output_dir: <checkpoint_dir>
  model_type: LLAMA3

To quantize the model, we can now run:

tune run quantize --config ./custom_quantization_config.yaml

[quantize.py:90] Time for quantization: 2.93 sec
[quantize.py:91] Memory used: 23.13 GB
[quantize.py:104] Model checkpoint of size 4.92 GB saved to /tmp/Llama-3-8B-Instruct-hf/consolidated-4w.pt

We can see that the model is now under 5 GB, or just under 5 bits for each of the 8B parameters (4.92 GB × 8 bits / 8B ≈ 4.9 bits).

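Note

Unlike the fine-tuned checkpoints, the quantization recipe outputs a single checkpoint file. This is because our quantization APIs currently do not support any conversion across formats. As a result, you won't be able to use these quantized models outside of torchtune. You should, however, be able to use them with the generation and evaluation recipes within torchtune. These results can help inform which quantization method to use with your favorite inference engine.

Let's take our quantized model and run the same generation again. First, we make one more change to custom_generation_config.yaml.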

checkpointer:
  # we need to use the custom torchtune checkpointer
  # instead of the Meta checkpointer for loading
  # quantized models
  _component_: torchtune.utils.FullModelTorchTuneCheckpointer

  # directory with the checkpoint files
  # this should match the output_dir specified during
  # fine-tuning
  checkpoint_dir: <checkpoint_dir>

  # checkpoint files point to the quantized model
  checkpoint_files: [
    consolidated-4w.pt,
  ]

  output_dir: <checkpoint_dir>
  model_type: LLAMA3

# we also need to update the quantizer to what was used during
# quantization
quantizer:
  _component_: torchtune.utils.quantization.Int4WeightOnlyQuantizer
  groupsize: 256
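
To illustrate what the groupsize setting controls, here is a simplified sketch of group-wise 4-bit weight quantization: each group of consecutive weights gets its own scale and offset, and every weight is stored as an integer code from 0 to 15. This is a generic asymmetric scheme written for illustration. It is not torchao's implementation, the function names are hypothetical, and a group size of 4 is used only to keep the example small.

```python
def quantize_group(values):
    """Map one group of floats to 4-bit codes (0..15) plus a scale and offset."""
    lo, hi = min(values), max(values)
    scale = (hi - lo) / 15  # 16 levels span the group's value range
    if scale == 0.0:
        scale = 1.0  # constant group: any scale reproduces it exactly
    codes = [round((v - lo) / scale) for v in values]
    return codes, scale, lo


def dequantize_group(codes, scale, lo):
    """Recover approximate floats from the stored codes."""
    return [lo + c * scale for c in codes]


def quantize_row(row, groupsize):
    """Quantize a row of weights in consecutive groups of `groupsize`."""
    return [quantize_group(row[i:i + groupsize]) for i in range(0, len(row), groupsize)]


row = [0.10, -0.20, 0.05, 0.30, 1.50, -1.00, 0.75, 0.00]
groups = quantize_row(row, groupsize=4)
restored = [v for codes, scale, lo in groups for v in dequantize_group(codes, scale, lo)]
max_error = max(abs(a - b) for a, b in zip(row, restored))
print(groups[0][0], max_error)
```

Smaller groups track the local weight range more closely but store more scales and offsets, which is one reason a 4-bit checkpoint can average more than 4 bits per parameter.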

Let's rerun generation!

tune run generate --config ./custom_generation_config.yaml \
prompt="Hello, my name is"

[generate.py:122] Hello, my name is Jake.
I am a multi-disciplined artist with a passion for creating, drawing and painting.
...
Time for inference: 1.62 sec total, 57.95 tokens/sec

By quantizing the model and running torch.compile, we get a speedup of more than 3x!
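
The speedup figure can be checked directly from the two generation logs. The short calculation below uses only the numbers reported above. Reading the bandwidth line as data moved per generated token is my own interpretation, offered as a rough sanity check and not as a description of how the recipe computes it.

```python
# Throughput figures copied from the two generation logs above.
baseline_tokens_per_sec = 18.94
quantized_tokens_per_sec = 57.95
baseline_bandwidth_gb_per_sec = 346.09

speedup = quantized_tokens_per_sec / baseline_tokens_per_sec
# If each generated token requires streaming roughly the whole model from
# memory, this is the implied volume of data read per token.
implied_gb_per_token = baseline_bandwidth_gb_per_sec / baseline_tokens_per_sec

print(f"speedup: {speedup:.2f}x, implied data per token: {implied_gb_per_token:.2f} GB")
```

The implied per-token volume lands close to the reported memory use of the unquantized model. That fits the idea that generation is limited by memory bandwidth, so shrinking the weights raises throughput.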

This is just the beginning of what you can do with Meta Llama3 using torchtune and the broader ecosystem. We look forward to seeing what you build!