Meta Llama3 in torchtune

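You will learn how to:

Download the Llama3-8B-Instruct weights and tokenizer
Fine-tune Llama3-8B-Instruct with LoRA and QLoRA
Evaluate your fine-tuned Llama3-8B-Instruct model
Generate text with your fine-tuned model
Quantize your model to speed up generation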

Prerequisites

Be familiar with torchtune
Make sure torchtune is installed

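Llama3-8B

Meta Llama 3 is a new family of models released by Meta AI that improves on the performance of the Llama2 family across a range of benchmarks. Meta Llama 3 currently comes in two sizes: 8B and 70B. This tutorial focuses on the 8B model. There are a few main differences between the Llama2-7B and Llama3-8B models:

Llama3-8B uses grouped-query attention instead of the standard multi-head attention used by Llama2-7B
Llama3-8B has a larger vocabulary (128,256 tokens, compared with 32,000 for the Llama2 models)
Llama3-8B uses a different tokenizer from the Llama2 models (tiktoken instead of sentencepiece)
Llama3-8B uses a larger intermediate dimension in its MLP layers than Llama2-7B
Llama3-8B uses a higher base value to compute theta in its rotary positional embeddings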
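To make these differences concrete, here is a minimal sketch that compares the two architectures numerically. Only the vocabulary sizes come from this tutorial; every other hyperparameter (layer count, head counts, intermediate dimensions, RoPE base) is an assumption taken from the published model configurations, and `DecoderShape`, `count_parameters`, and `kv_cache_bytes_per_token` are hypothetical helper names rather than torchtune APIs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DecoderShape:
    vocab_size: int
    embed_dim: int
    num_layers: int
    num_heads: int
    num_kv_heads: int
    intermediate_dim: int
    rope_base: int


# Assumed values from the published model configs (not stated in this tutorial,
# apart from the vocabulary sizes).
LLAMA2_7B = DecoderShape(32_000, 4096, 32, 32, 32, 11_008, 10_000)
LLAMA3_8B = DecoderShape(128_256, 4096, 32, 32, 8, 14_336, 500_000)


def count_parameters(s: DecoderShape) -> int:
    """Parameter count for a bias-free decoder with untied input/output embeddings."""
    head_dim = s.embed_dim // s.num_heads
    kv_dim = s.num_kv_heads * head_dim
    # Query and output projections are full-size; key/value shrink with fewer KV heads.
    attention = 2 * s.embed_dim * s.embed_dim + 2 * s.embed_dim * kv_dim
    mlp = 3 * s.embed_dim * s.intermediate_dim  # gate, up, and down projections
    norms = 2 * s.embed_dim  # two RMSNorm scales per layer
    embeddings = 2 * s.vocab_size * s.embed_dim
    return s.num_layers * (attention + mlp + norms) + embeddings + s.embed_dim


def kv_cache_bytes_per_token(s: DecoderShape, bytes_per_value: int = 2) -> int:
    """Bytes of key/value cache stored per generated token (2 bytes = bf16)."""
    head_dim = s.embed_dim // s.num_heads
    return 2 * s.num_layers * s.num_kv_heads * head_dim * bytes_per_value


if __name__ == "__main__":
    for name, shape in [("Llama2-7B", LLAMA2_7B), ("Llama3-8B", LLAMA3_8B)]:
        print(name, count_parameters(shape), kv_cache_bytes_per_token(shape))
```

Under these assumptions, the larger vocabulary and MLP account for the growth in parameter count, while grouped-query attention shrinks the per-token key/value cache by a factor of four.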

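Getting access to Llama3-8B-Instruct

For this tutorial, we will use the instruction-tuned version of Llama3-8B. First, let's download the model from Hugging Face. You will need to follow the instructions on the official Meta page to gain access to the model. Next, make sure you have your Hugging Face access token at hand.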

tune download meta-llama/Meta-Llama-3-8B-Instruct \
    --output-dir <checkpoint_dir> \
    --hf-token <ACCESS TOKEN>
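If you prefer to script the download, for example to keep the token out of your shell history, the same command can be assembled in Python. This is a minimal sketch: the `HF_TOKEN` environment variable name, the `/tmp/llama3` output directory, and the `build_download_command` helper are all illustrative assumptions; only the `tune download` flags shown above are used.

```python
import os
import shlex


def build_download_command(repo_id: str, output_dir: str, token: str) -> list[str]:
    """Return the argv list for the `tune download` invocation shown above."""
    return ["tune", "download", repo_id, "--output-dir", output_dir, "--hf-token", token]


if __name__ == "__main__":
    # HF_TOKEN is a hypothetical variable name chosen for this example.
    token = os.environ.get("HF_TOKEN", "<ACCESS TOKEN>")
    cmd = build_download_command(
        "meta-llama/Meta-Llama-3-8B-Instruct",
        "/tmp/llama3",  # hypothetical checkpoint directory
        token,
    )
    print(shlex.join(cmd))
    # To actually run the download: subprocess.run(cmd, check=True)
```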

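Fine-tuning Llama3-8B-Instruct in torchtune

torchtune provides LoRA, QLoRA, and full fine-tuning recipes for fine-tuning Llama3-8B on one or more GPUs. For more on LoRA in torchtune, see our LoRA tutorial. For more on QLoRA in torchtune, see our QLoRA tutorial.

Let's look at how to fine-tune Llama3-8B-Instruct with LoRA on a single device using torchtune. In this example, we will fine-tune for one epoch on a common instruction dataset for illustrative purposes. The basic command for a single-device LoRA fine-tune is: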

tune run lora_finetune_single_device --config llama3/8B_lora_single_device

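Note

To see a full list of recipes and their corresponding configs, run tune ls from the command line.

We can also add command-line overrides as needed, for example: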

tune run lora_finetune_single_device --config llama3/8B_lora_single_device \
    checkpointer.checkpoint_dir=<checkpoint_dir> \
    tokenizer.path=<checkpoint_dir>/tokenizer.model \
    checkpointer.output_dir=<checkpoint_dir>

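This loads the Llama3-8B-Instruct checkpoint and tokenizer from the <checkpoint_dir> used in the tune download command above, then saves the final checkpoint to the same directory in the original format. For more details on the checkpoint formats supported in torchtune, see our checkpointing deep-dive.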
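Conceptually, each `key.subkey=value` override replaces one entry in the nested YAML config and leaves the rest untouched. The sketch below illustrates that idea with plain dictionaries; `apply_overrides` is a hypothetical helper and is not how torchtune implements config parsing.

```python
import copy


def apply_overrides(config: dict, overrides: list[str]) -> dict:
    """Return a copy of `config` with dotted `key=value` overrides applied."""
    result = copy.deepcopy(config)
    for item in overrides:
        dotted_key, value = item.split("=", 1)
        *parents, leaf = dotted_key.split(".")
        node = result
        for name in parents:
            # Create intermediate sections on demand.
            node = node.setdefault(name, {})
        node[leaf] = value
    return result


if __name__ == "__main__":
    # Hypothetical default values standing in for the packaged config.
    defaults = {
        "checkpointer": {"checkpoint_dir": "/default/dir", "model_type": "LLAMA3"},
        "tokenizer": {"path": "/default/dir/tokenizer.model"},
    }
    merged = apply_overrides(
        defaults,
        [
            "checkpointer.checkpoint_dir=/my/ckpt",
            "tokenizer.path=/my/ckpt/tokenizer.model",
            "checkpointer.output_dir=/my/ckpt",
        ],
    )
    print(merged)
```

Keys that are not overridden, such as `model_type` here, keep their default values.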

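Note

To see the full set of configurable parameters for this (and other) configs, we can use tune cp to copy (and modify) the default config. tune cp also works with recipe scripts, for cases where you want more custom changes than can be achieved by directly modifying existing configurable parameters. For more on tune cp, see the section on modifying configs in our "Fine-Tune Your First LLM" tutorial.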

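Once training completes, the model checkpoints are saved and their locations are logged. For LoRA fine-tuning, the final checkpoint contains the merged weights, and a copy of just the (much smaller) LoRA weights is saved separately.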
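"Merged weights" here means the low-rank update has been folded back into the base weight matrix. As a rough sketch of the standard LoRA formulation, the merged matrix is W + (alpha / rank) * B @ A. The pure-Python helpers below (`matmul`, `merge_lora`) and the tiny example values are illustrative assumptions, not torchtune code.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def merge_lora(weight, lora_a, lora_b, alpha, rank):
    """Fold a LoRA update into the base weight.

    weight: (out, in), lora_a: (rank, in), lora_b: (out, rank).
    """
    scale = alpha / rank
    delta = matmul(lora_b, lora_a)  # (out, rank) @ (rank, in) -> (out, in)
    return [
        [w + scale * d for w, d in zip(w_row, d_row)]
        for w_row, d_row in zip(weight, delta)
    ]


if __name__ == "__main__":
    W = [[1.0, 0.0, 2.0], [0.0, 1.0, 0.0]]  # base weight, out=2, in=3
    A = [[1.0, 2.0, 3.0]]                   # rank=1
    B = [[0.5], [-1.0]]
    print(merge_lora(W, A, B, alpha=2, rank=1))
```

After merging, a forward pass through the single matrix gives the same result as running the base layer and the adapter side by side, which is why the merged checkpoint can be loaded like an ordinary model.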

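In our experiments, we observed a peak memory usage of 18.5 GB. The default config can be trained on a consumer GPU with 24 GB of VRAM.

If you have multiple GPUs available, you can run the distributed version of the recipe. torchtune uses the FSDP APIs from PyTorch Distributed to shard the model, optimizer states, and gradients. This should let you increase your batch size and so speed up overall training. For example, on two devices: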

tune run --nproc_per_node 2 lora_finetune_distributed --config llama3/8B_lora

Finally, if we want to use even less memory, we can use torchtune's QLoRA recipe:

tune run lora_finetune_single_device --config llama3/8B_qlora_single_device

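Since our default configs enable full bfloat16 training, all of the commands above can run on devices with at least 24 GB of VRAM; in fact, the QLoRA recipe should have a peak allocated memory below 10 GB. You can also experiment with different LoRA and QLoRA configurations, or even run a full fine-tune. Give it a try!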
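A back-of-the-envelope calculation helps explain these memory figures. The sketch below estimates the storage needed for the model weights alone at 16 bits and at 4 bits per parameter. The parameter count is an assumption (it is not stated in this tutorial), and the estimate ignores activations, LoRA parameters, optimizer state, and quantization metadata, and it treats every weight as quantized, which is a simplification.

```python
# Assumed Llama3-8B parameter count; not stated in this tutorial.
PARAM_COUNT = 8_030_261_248


def weight_gigabytes(num_params: int, bits_per_param: float) -> float:
    """Storage for the weights alone, in GB (10**9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


bf16_gb = weight_gigabytes(PARAM_COUNT, 16)
four_bit_gb = weight_gigabytes(PARAM_COUNT, 4)

if __name__ == "__main__":
    print(f"bf16 weights:  {bf16_gb:.2f} GB")
    print(f"4-bit weights: {four_bit_gb:.2f} GB")
```

Under these assumptions the bf16 weights take roughly 16 GB, which is consistent with the 18.5 GB peak observed for LoRA, while 4-bit weights take roughly 4 GB, leaving headroom under the 10 GB figure quoted for QLoRA.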

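Evaluating fine-tuned Llama3-8B models with EleutherAI's Eval Harness

Now that we have fine-tuned our model, what's next? Let's take the LoRA-fine-tuned model from the previous section and look at a couple of different ways to evaluate its performance on the tasks we care about.

First, torchtune provides an integration with EleutherAI's evaluation harness for evaluating models on common benchmark tasks.

Note

Make sure you have installed the evaluation harness via pip install "lm_eval==0.4.*".

For this tutorial, we will use the truthfulqa_mc2 task from the harness. This task measures a model's propensity to be truthful when answering questions: it measures the model's zero-shot accuracy on a question followed by one or more true responses and one or more false responses. First, let's copy the config so we can point the YAML file to our fine-tuned checkpoint files.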

tune cp eleuther_evaluation ./custom_eval_config.yaml

Next, we modify custom_eval_config.yaml to include the fine-tuned checkpoints.

model:
  _component_: torchtune.models.llama3.llama3_8b

checkpointer:
  _component_: torchtune.training.FullModelMetaCheckpointer

  # directory with the checkpoint files
  # this should match the output_dir specified during
  # fine-tuning
  checkpoint_dir: <checkpoint_dir>

  # checkpoint files for the fine-tuned model. These will be logged
  # at the end of your fine-tune
  checkpoint_files: [
    meta_model_0.pt
  ]

  output_dir: <checkpoint_dir>
  model_type: LLAMA3

# Make sure to update the tokenizer path to the right
# checkpoint directory as well
tokenizer:
  _component_: torchtune.models.llama3.llama3_tokenizer
  path: <checkpoint_dir>/tokenizer.model

Finally, we can run evaluation using our modified config.

tune run eleuther_eval --config ./custom_eval_config.yaml

Try it for yourself and see what accuracy your model gets!
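For context on what the reported accuracy means, the sketch below shows how a multiple-true-answer score of this kind is commonly computed for a single question: the model's log-likelihoods for each candidate answer are exponentiated, and the score is the share of probability mass that falls on the true answers. This reflects my understanding of the metric; `mc2_score` is a hypothetical helper, not a function from the evaluation harness.

```python
import math


def mc2_score(true_loglikelihoods, false_loglikelihoods):
    """Normalized probability mass assigned to the true answers for one question."""
    p_true = [math.exp(ll) for ll in true_loglikelihoods]
    p_false = [math.exp(ll) for ll in false_loglikelihoods]
    return sum(p_true) / (sum(p_true) + sum(p_false))


if __name__ == "__main__":
    # Illustrative log-likelihoods: two true answers and one false answer.
    score = mc2_score([math.log(0.3), math.log(0.2)], [math.log(0.5)])
    print(f"{score:.2f}")
```

The benchmark result is then the average of this per-question score over all questions, so a model that spreads probability evenly between true and false answers lands near 0.5.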

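Generating text with our fine-tuned Llama3 model

Next, let's look at another way to evaluate our model: generating text! torchtune also provides a recipe for generation.

As before, let's copy and modify the default generation config.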

tune cp generation ./custom_generation_config.yaml

Now we modify custom_generation_config.yaml to point to our checkpoint and tokenizer.

model:
  _component_: torchtune.models.llama3.llama3_8b

checkpointer:
  _component_: torchtune.training.FullModelMetaCheckpointer

  # directory with the checkpoint files
  # this should match the output_dir specified during
  # fine-tuning
  checkpoint_dir: <checkpoint_dir>

  # checkpoint files for the fine-tuned model. These will be logged
  # at the end of your fine-tune
  checkpoint_files: [
    meta_model_0.pt
  ]

  output_dir: <checkpoint_dir>
  model_type: LLAMA3

# Make sure to update the tokenizer path to the right
# checkpoint directory as well
tokenizer:
  _component_: torchtune.models.llama3.llama3_tokenizer
  path: <checkpoint_dir>/tokenizer.model

Running generation with our LoRA-fine-tuned model, we see the following output:

tune run generate --config ./custom_generation_config.yaml \
prompt.user="Hello, my name is"

[generate.py:122] Hello, my name is Sarah and I am a busy working mum of two young children, living in the North East of England.
...
[generate.py:135] Time for inference: 10.88 sec total, 18.94 tokens/sec
[generate.py:138] Bandwidth achieved: 346.09 GB/s
[generate.py:139] Memory used: 18.31 GB
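The timing lines in this log can be cross-checked with a little arithmetic. The sketch below parses the three summary lines and derives the approximate number of generated tokens and the amount of data moved per token. Treating bandwidth divided by tokens per second as "GB read per token" is my assumption about how the figure relates to throughput, and `parse_generation_log` is a hypothetical helper.

```python
import re

LOG = """\
[generate.py:135] Time for inference: 10.88 sec total, 18.94 tokens/sec
[generate.py:138] Bandwidth achieved: 346.09 GB/s
[generate.py:139] Memory used: 18.31 GB
"""


def parse_generation_log(text: str) -> dict:
    """Extract the numeric fields from the generation summary lines."""
    patterns = {
        "seconds": r"Time for inference: ([\d.]+) sec",
        "tokens_per_sec": r"([\d.]+) tokens/sec",
        "bandwidth_gb_s": r"Bandwidth achieved: ([\d.]+) GB/s",
        "memory_gb": r"Memory used: ([\d.]+) GB",
    }
    return {name: float(re.search(p, text).group(1)) for name, p in patterns.items()}


stats = parse_generation_log(LOG)
tokens_generated = stats["seconds"] * stats["tokens_per_sec"]
gb_per_token = stats["bandwidth_gb_s"] / stats["tokens_per_sec"]

if __name__ == "__main__":
    print(f"approx. tokens generated: {tokens_generated:.0f}")
    print(f"approx. GB moved per token: {gb_per_token:.2f}")
```

The per-token figure comes out close to the reported memory usage, which fits the usual picture of decoding as memory-bandwidth bound: each new token requires reading roughly the whole model once.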

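Faster generation via quantization

We rely on torchao for post-training quantization. Once torchao is installed, we can quantize the fine-tuned model with the following code: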

# we also support `int8_weight_only()` and `int8_dynamic_activation_int8_weight()`, see
# https://github.com/pytorch/ao/tree/main/torchao/quantization#other-available-quantization-techniques
# for a full list of techniques that we support
from torchao.quantization.quant_api import quantize_, int4_weight_only

# `model` is assumed to be the fine-tuned model, already loaded as a torch.nn.Module;
# quantize_ modifies it in place
quantize_(model, int4_weight_only())
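To build some intuition for what 4-bit weight-only quantization does, here is a toy sketch that quantizes one small group of weights to 16 levels and then restores it. It illustrates the general idea of affine quantization only; it is not torchao's implementation, and `quantize_group` and `dequantize_group` are hypothetical helpers.

```python
def quantize_group(values, bits=4):
    """Map a group of floats to unsigned integer codes plus a scale and offset."""
    qmax = 2**bits - 1
    lo, hi = min(values), max(values)
    scale = (hi - lo) / qmax or 1.0  # fall back to 1.0 for a constant group
    codes = [min(qmax, max(0, round((v - lo) / scale))) for v in values]
    return codes, scale, lo


def dequantize_group(codes, scale, lo):
    """Recover approximate floats from the integer codes."""
    return [lo + c * scale for c in codes]


if __name__ == "__main__":
    weights = [-1.0, -0.5, 0.0, 0.25, 1.0]
    codes, scale, lo = quantize_group(weights)
    restored = dequantize_group(codes, scale, lo)
    print(codes)
    print([round(r, 3) for r in restored])
```

Each weight is stored as a 4-bit code instead of a 16-bit float, at the cost of a rounding error of at most half the scale per weight. Real implementations quantize many such groups and pack the codes tightly in memory.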

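After quantization, we rely on torch.compile for speedups. For more details, see the example usage in the torchao documentation.

torchao also provides a table listing performance and accuracy results for llama2 and llama3.

For Llama models, you can run generation directly in torchao on the quantized model using its generate.py script, as described in the torchao README. This lets you compare your own results with those in the table mentioned above.

This is just the beginning of what you can do with Meta Llama3 using torchtune and the broader ecosystem. We look forward to seeing what you build!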