Configuring Datasets for Fine-Tuning¶

This tutorial walks you through setting up a dataset for fine-tuning.

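What you will learn

How to quickly get started with built-in datasets
How to use any dataset from the Hugging Face Hub
How to use instruct, chat, or text completion datasets
How to configure datasets from code, config, or the command line
How to fully customize your own dataset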

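Prerequisites

Know how to configure components from the config

Datasets are a core component of fine-tuning workflows. They act as a "steering wheel" that guides LLM generation toward a particular use case. Many publicly shared open-source datasets have become popular choices for fine-tuning LLMs and serve as a good starting point for training your model. torchtune gives you the tools to download external community datasets, load custom local datasets, or create your own datasets.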

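Built-in datasets¶

To use one of the built-in datasets in the library, import and call the dataset builder function. All supported datasets are listed in the torchtune.datasets API reference.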

from torchtune.datasets import alpaca_dataset

# Load in tokenizer
tokenizer = ...
dataset = alpaca_dataset(tokenizer)

# YAML config
dataset:
  _component_: torchtune.datasets.alpaca_dataset

# Command line
tune run full_finetune_single_device --config llama3/8B_full_single_device \
dataset=torchtune.datasets.alpaca_dataset

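Hugging Face datasets¶

We provide first-class support for datasets on the Hugging Face Hub. Under the hood, all of our built-in datasets and dataset builders use Hugging Face's load_dataset() to load your data, whether it is local or on the Hub.

You can pass a Hugging Face dataset path to the source parameter of any of our builders to specify which dataset to download from the Hub or to load from a local directory path (see Local and remote datasets). In addition, all builders accept any keyword argument that load_dataset() supports. You can find the full list in Hugging Face's documentation.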

from torchtune.datasets import text_completion_dataset

# Load in tokenizer
tokenizer = ...
dataset = text_completion_dataset(
    tokenizer,
    source="allenai/c4",
    # Keyword-arguments that are passed into load_dataset
    split="train",
    data_dir="realnewslike",
)

# YAML config
dataset:
  _component_: torchtune.datasets.text_completion_dataset
  source: allenai/c4
  split: train
  data_dir: realnewslike

# Command line
tune run full_finetune_single_device --config llama3/8B_full_single_device \
dataset=torchtune.datasets.text_completion_dataset dataset.source=allenai/c4 \
dataset.split=train dataset.data_dir=realnewslike

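Setting max sequence length¶

The default collator, padded_collate(), which is used in all of our training recipes, pads samples to the max sequence length within the batch, not globally. If you want to set a global upper limit on the sequence length, specify it in the dataset builder with max_seq_len. Any sample in the dataset that is longer than max_seq_len is truncated by truncate(). After truncation, the last token is guaranteed to be the tokenizer's EOS id, except in TextCompletionDataset.

Generally, you want the max sequence length returned in each data sample to match the context window size of your model. You can also lower this value to reduce memory usage, depending on your hardware constraints.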

from torchtune.datasets import alpaca_dataset

# Load in tokenizer
tokenizer = ...
dataset = alpaca_dataset(
    tokenizer=tokenizer,
    max_seq_len=4096,
)

# YAML config
dataset:
  _component_: torchtune.datasets.alpaca_dataset
  max_seq_len: 4096

# Command line
tune run full_finetune_single_device --config llama3/8B_full_single_device \
dataset.max_seq_len=4096
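
To make the difference between global truncation and per-batch padding concrete, here is a minimal pure-Python sketch. It is not torchtune's implementation: the helper names truncate_tokens and pad_batch, the EOS id of 2, and the padding id of 0 are all assumptions chosen for illustration.

```python
from typing import List, Optional

EOS_ID = 2  # hypothetical EOS token id, for illustration only
PAD_ID = 0  # hypothetical padding token id


def truncate_tokens(
    tokens: List[int], max_seq_len: Optional[int], eos_id: Optional[int] = None
) -> List[int]:
    """Cut a sample to max_seq_len; if eos_id is given, force it as the last token."""
    if max_seq_len is None or len(tokens) <= max_seq_len:
        return list(tokens)
    cut = list(tokens[:max_seq_len])
    if eos_id is not None and cut:
        cut[-1] = eos_id
    return cut


def pad_batch(batch: List[List[int]], pad_id: int = PAD_ID) -> List[List[int]]:
    """Pad every sample to the longest sample in this batch, not to a global length."""
    longest = max(len(seq) for seq in batch)
    return [seq + [pad_id] * (longest - len(seq)) for seq in batch]


samples = [[5, 6, 7, 8, 9, EOS_ID], [5, 6, EOS_ID], [5, EOS_ID]]

# Global cap: only the first sample exceeds max_seq_len=4 and is cut.
truncated = [truncate_tokens(s, max_seq_len=4, eos_id=EOS_ID) for s in samples]

# Per-batch padding: every sample is padded to the longest sample in the batch.
padded = pad_batch(truncated)
print(padded)
```

Because padding only reaches the longest sample in each batch, lowering max_seq_len reduces memory only for batches that contain samples longer than the new cap.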

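Sample packing¶

You can use sample packing with any of the single dataset builders by passing packed=True. This requires some pre-processing of the dataset, which may increase the time to the first batch, but it can speed up training significantly depending on the dataset.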

from torchtune.datasets import alpaca_dataset, PackedDataset

# Load in tokenizer
tokenizer = ...
dataset = alpaca_dataset(
    tokenizer=tokenizer,
    packed=True,
)
print(isinstance(dataset, PackedDataset))  # True

# YAML config
dataset:
  _component_: torchtune.datasets.alpaca_dataset
  packed: True

# Command line
tune run full_finetune_single_device --config llama3/8B_full_single_device \
dataset.packed=True
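
The idea behind packing can be sketched with a simple greedy strategy: consecutive tokenized samples are concatenated until the next one would overflow the sequence length. The function pack_greedy below is a hypothetical helper, not the PackedDataset implementation, and it leaves out everything beyond token concatenation (for example, any bookkeeping needed to keep packed samples from attending to one another).

```python
from typing import List


def pack_greedy(samples: List[List[int]], max_seq_len: int) -> List[List[int]]:
    """Greedily concatenate tokenized samples into packs of at most max_seq_len tokens."""
    packs: List[List[int]] = []
    current: List[int] = []
    for tokens in samples:
        tokens = tokens[:max_seq_len]  # a sample longer than one pack is truncated
        if current and len(current) + len(tokens) > max_seq_len:
            # The next sample does not fit, so close the current pack.
            packs.append(current)
            current = []
        current = current + tokens
    if current:
        packs.append(current)
    return packs


samples = [[1, 2, 3], [4, 5], [6, 7, 8, 9], [10]]
packs = pack_greedy(samples, max_seq_len=5)
print(packs)
```

Four short samples become two full-length sequences here, which is where the training speedup comes from: fewer padding tokens are processed per step.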

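Custom unstructured text corpus¶

Continued pre-training typically uses a data setup similar to pre-training, with a simple text completion task. This means no instruct templates, no chat formats, and minimal special tokens (only BOS and, optionally, EOS). To specify an unstructured text corpus, use the text_completion_dataset() builder with a Hugging Face dataset or a custom local corpus. Here is how to specify it for a local file: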

from torchtune.datasets import text_completion_dataset

# Load in tokenizer
tokenizer = ...
dataset = text_completion_dataset(
    tokenizer,
    source="text",
    data_files="path/to/my_data.txt",
    split="train",
)

# YAML config
dataset:
  _component_: torchtune.datasets.text_completion_dataset
  source: text
  data_files: path/to/my_data.txt
  split: train

# Command line
tune run --nproc_per_node 4 full_finetune_distributed --config llama3/8B_full \
dataset=torchtune.datasets.text_completion_dataset dataset.source=text \
dataset.data_files=path/to/my_data.txt dataset.split=train

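Custom instruct dataset and instruct templates¶

If you have a custom instruct dataset that is not already provided in the library, you can use the instruct_dataset() builder and specify the source path. Instruct datasets typically have multiple text columns that are formatted into a prompt template.

To fine-tune an LLM on a particular task, a common approach is to create a fixed instruct template that guides the model to generate output with a specific goal. An instruct template is simply flavor text that structures your inputs for the model. It is model-agnostic and is tokenized like any other text, but it can help condition the model to respond in the expected format. For example, AlpacaInstructTemplate structures the data in the following way: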

"Below is an instruction that describes a task, paired with an input that provides further context. "
"Write a response that appropriately completes the request.\n\n"
"### Instruction:\n{instruction}\n\n### Input:\n{input}\n\n### Response:\n"

Here is an example of a sample formatted with AlpacaInstructTemplate:

from torchtune.data import AlpacaInstructTemplate

sample = {
    "instruction": "Classify the following into animals, plants, and minerals",
    "input": "Oak tree, copper ore, elephant",
}
prompt = AlpacaInstructTemplate.format(sample)
print(prompt)
# Below is an instruction that describes a task, paired with an input that provides further context.
# Write a response that appropriately completes the request.
#
# ### Instruction:
# Classify the following into animals, plants, and minerals
#
# ### Input:
# Oak tree, copper ore, elephant
#
# ### Response:
#

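We provide other instruct templates for common tasks such as summarization and grammar correction. If you need to create your own instruct template for a custom task, you can inherit from InstructTemplate and create your own class.

from typing import Any, Dict, Mapping, Optional

from torchtune.datasets import instruct_dataset
from torchtune.data import InstructTemplate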

class CustomTemplate(InstructTemplate):
    # Define the template as string with {} as placeholders for data columns
    template = ...

    # Implement this method
    @classmethod
    def format(
        cls, sample: Mapping[str, Any], column_map: Optional[Dict[str, str]] = None
    ) -> str:
        ...

# Load in tokenizer
tokenizer = ...
dataset = instruct_dataset(
    tokenizer=tokenizer,
    source="my/dataset/path",
    template="import.path.to.CustomTemplate",
)

# YAML config
dataset:
  _component_: torchtune.datasets.instruct_dataset
  source: my/dataset/path
  template: import.path.to.CustomTemplate

# Command line
tune run full_finetune_single_device --config llama3/8B_full_single_device \
dataset=torchtune.datasets.instruct_dataset dataset.source=my/dataset/path \
dataset.template=import.path.to.CustomTemplate
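
For a concrete picture of what the template string and the format method might contain, here is a stand-alone sketch. It deliberately does not import torchtune so that it runs anywhere; a real template would subclass InstructTemplate as shown above. The class name QuestionAnswerTemplate, the "question" placeholder, and the "query" column are hypothetical, and column_map is assumed to map the placeholder name the template expects to the column name actually present in the dataset.

```python
from typing import Any, Dict, Mapping, Optional


class QuestionAnswerTemplate:
    """Hypothetical template; a real one would subclass torchtune.data.InstructTemplate."""

    # The placeholder is filled from a data column when the sample is formatted.
    template = "Question: {question}\n\nAnswer: "

    @classmethod
    def format(
        cls, sample: Mapping[str, Any], column_map: Optional[Dict[str, str]] = None
    ) -> str:
        # Assumption: column_map maps the expected name to the actual column name.
        column_map = column_map or {}
        key = column_map.get("question", "question")
        return cls.template.format(question=sample[key])


# The dataset row stores the question under "query" rather than "question".
row = {"query": "What is the capital of France?"}
prompt = QuestionAnswerTemplate.format(row, column_map={"question": "query"})
print(prompt)
```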

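torchtune uses importlib.import_module (see the importlib documentation for details) to locate components from their dotpaths. You can place your custom template class in any Python file, as long as that file is reachable by Python's import mechanism. This means the module must live in a directory that is on Python's search path (sys.path). This usually includes:

The current directory from which the Python interpreter or script is run.
Directories where Python packages are installed (such as site-packages).
Any directory added to sys.path at runtime with sys.path.append or through the PYTHONPATH environment variable.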
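
The lookup described above can be approximated in a few lines of standard-library Python. This is a sketch of the general technique, not torchtune's actual code: the helper name locate is an assumption, and the example resolves a standard-library class so that it runs without any extra packages.

```python
import importlib
from typing import Any


def locate(dotted_path: str) -> Any:
    """Resolve 'package.module.Name' to the object it names."""
    module_path, _, attr_name = dotted_path.rpartition(".")
    if not module_path:
        raise ValueError(f"Expected 'module.attribute', got {dotted_path!r}")
    # import_module only succeeds if the module is reachable through sys.path.
    module = importlib.import_module(module_path)
    return getattr(module, attr_name)


ordered_dict_cls = locate("collections.OrderedDict")
print(ordered_dict_cls)
```

If the module is not on sys.path, import_module raises ModuleNotFoundError, which is the error you would typically see when a custom template file is not importable.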

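Custom chat dataset and chat formats¶

If you have a custom chat or conversational dataset that is not already provided in the library, you can use the chat_dataset() builder and specify the source path. Chat datasets typically have a single column that contains multiple back-and-forth messages between the user and the assistant.

Chat formats are similar to instruct templates, except that they format the system, user, and assistant messages of a conversational dataset into a list of messages (see ChatFormat). They can be configured in much the same way as instruct datasets.

Here is how messages are formatted with Llama2ChatFormat: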

from torchtune.data import Llama2ChatFormat, Message

messages = [
    Message(
        role="system",
        content="You are a helpful, respectful, and honest assistant.",
    ),
    Message(
        role="user",
        content="I am going to Paris, what should I see?",
    ),
    Message(
        role="assistant",
        content="Paris, the capital of France, is known for its stunning architecture..."
    ),
]
formatted_messages = Llama2ChatFormat.format(messages)
print(formatted_messages)
# [
#     Message(
#         role="user",
#         content="[INST] <<SYS>>\nYou are a helpful, respectful, and honest assistant.\n<</SYS>>\n\n"
#         "I am going to Paris, what should I see? [/INST] ",
#     ),
#     Message(
#         role="assistant",
#         content="Paris, the capital of France, is known for its stunning architecture..."
#     ),
# ]

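Note that the system message is now incorporated into the user message. If you create a custom ChatFormat, you can also add more advanced behavior.

from typing import List

from torchtune.datasets import chat_dataset
from torchtune.data import ChatFormat, Message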

class CustomChatFormat(ChatFormat):
    # Define templates for system, user, assistant messages
    # as strings with {} as placeholders for message content
    system = ...
    user = ...
    assistant = ...

    # Implement this method
    @classmethod
    def format(
        cls,
        sample: List[Message],
    ) -> List[Message]:
        ...

# Load in tokenizer
tokenizer = ...
dataset = chat_dataset(
    tokenizer=tokenizer,
    source="my/dataset/path",
    split="train",
    conversation_style="openai",
    chat_format="import.path.to.CustomChatFormat",
)

# YAML config
dataset:
  _component_: torchtune.datasets.chat_dataset
  source: my/dataset/path
  conversation_style: openai
  chat_format: import.path.to.CustomChatFormat

# Command line
tune run full_finetune_single_device --config llama3/8B_full_single_device \
dataset=torchtune.datasets.chat_dataset dataset.source=my/dataset/path \
dataset.conversation_style=openai dataset.chat_format=import.path.to.CustomChatFormat
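
As an illustration of what a format method can do, the sketch below folds a system message into the following user message, mirroring the Llama2ChatFormat output shown earlier. It is a stand-alone approximation: Msg is a hypothetical stand-in for torchtune's Message, and merge_system_into_user is not a torchtune function.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Msg:
    """Hypothetical stand-in for torchtune.data.Message (role and content only)."""

    role: str
    content: str


def merge_system_into_user(messages: List[Msg]) -> List[Msg]:
    """Wrap user turns in [INST] tags and prepend any pending system prompt."""
    system_text = ""
    formatted: List[Msg] = []
    for message in messages:
        if message.role == "system":
            # Hold the system prompt until the next user message arrives.
            system_text = f"<<SYS>>\n{message.content}\n<</SYS>>\n\n"
        elif message.role == "user":
            formatted.append(
                Msg("user", f"[INST] {system_text}{message.content} [/INST] ")
            )
            system_text = ""  # the system prompt is prepended only once
        else:
            formatted.append(Msg(message.role, message.content))
    return formatted


chat = [Msg("system", "Be brief."), Msg("user", "Hi"), Msg("assistant", "Hello!")]
formatted_chat = merge_system_into_user(chat)
print(formatted_chat)
```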

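Multiple in-memory datasets¶

You can also train on multiple datasets and configure each one individually using our ConcatDataset interface. You can even mix instruct and chat datasets, or other custom datasets.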

# YAML config
dataset:
  - _component_: torchtune.datasets.instruct_dataset
    source: vicgalle/alpaca-gpt4
    template: torchtune.data.AlpacaInstructTemplate
    split: train
    train_on_input: True
  - _component_: torchtune.datasets.instruct_dataset
    source: samsum
    template: torchtune.data.SummarizeTemplate
    column_map:
      output: summary
    split: train
    train_on_input: False
  - _component_: torchtune.datasets.chat_dataset
    ...
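
Conceptually, concatenating datasets means presenting several indexable datasets as one, with a global index mapped to a dataset and a local index. The class below is a minimal sketch of that idea under the assumption that each dataset supports len() and integer indexing; SimpleConcat is a hypothetical name and this is not the ConcatDataset source.

```python
from bisect import bisect_right
from itertools import accumulate
from typing import Any, List, Sequence


class SimpleConcat:
    """Present several indexable datasets as one contiguous dataset."""

    def __init__(self, datasets: List[Sequence[Any]]) -> None:
        self._datasets = datasets
        # Cumulative end offsets, e.g. lengths [3, 2] give [3, 5].
        self._ends = list(accumulate(len(d) for d in datasets))

    def __len__(self) -> int:
        return self._ends[-1] if self._ends else 0

    def __getitem__(self, index: int) -> Any:
        if not 0 <= index < len(self):
            raise IndexError(index)
        # Find the first dataset whose end offset is greater than index.
        ds_idx = bisect_right(self._ends, index)
        start = self._ends[ds_idx - 1] if ds_idx > 0 else 0
        return self._datasets[ds_idx][index - start]


combined = SimpleConcat([["a0", "a1", "a2"], ["b0", "b1"]])
print(len(combined), combined[2], combined[3])
```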

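Local and remote datasets¶

To use a dataset saved on your local hard drive, specify the file type as source and pass the data_files argument to any of the dataset builder functions. We support every file type that Hugging Face's load_dataset supports, including csv, json, txt, and more.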

from torchtune.datasets import instruct_dataset

# Load in tokenizer
tokenizer = ...
# Local files
dataset = instruct_dataset(
    tokenizer=tokenizer,
    source="csv",
    split="train",
    template="import.path.to.CustomTemplate",
    data_files="path/to/my/data.csv",
)
# Remote files
dataset = instruct_dataset(
    tokenizer=tokenizer,
    source="json",
    split="train",
    template="import.path.to.CustomTemplate",
    data_files="https://rajpurkar.github.io/SQuAD-explorer/dataset/train-v2.0.json",
    # You can also pass in any kwarg that load_dataset accepts
    field="data",
)

# YAML config - local files
dataset:
  _component_: torchtune.datasets.instruct_dataset
  source: csv
  template: import.path.to.CustomTemplate
  data_files: path/to/my/data.csv

# YAML config - remote files
dataset:
  _component_: torchtune.datasets.instruct_dataset
  source: json
  template: import.path.to.CustomTemplate
  data_files: https://rajpurkar.github.io/SQuAD-explorer/dataset/train-v2.0.json
  field: data

# Command line - local files
tune run full_finetune_single_device --config llama3/8B_full_single_device \
dataset=torchtune.datasets.instruct_dataset dataset.source=csv \
dataset.template=import.path.to.CustomTemplate dataset.data_files=path/to/my/data.csv

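Fully customized datasets¶

More advanced tasks and dataset formats that do not fit the templating and processing provided by InstructDataset, ChatDataset, and TextCompletionDataset may require you to create your own dataset class for more flexibility. Let's walk through PreferenceDataset, which has custom functionality for RLHF preference data, as an example of what you need to do.

If you look at the code for the PreferenceDataset class, you will notice that it is quite similar to InstructDataset, with a few adjustments for the chosen and rejected samples in preference data.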

chosen_message = [
    Message(role="user", content=prompt, masked=True),
    Message(role="assistant", content=transformed_sample[key_chosen]),
]
rejected_message = [
    Message(role="user", content=prompt, masked=True),
    Message(role="assistant", content=transformed_sample[key_rejected]),
]

chosen_input_ids, c_masks = self._tokenizer.tokenize_messages(
    chosen_message, self.max_seq_len
)
chosen_labels = list(
    np.where(c_masks, CROSS_ENTROPY_IGNORE_IDX, chosen_input_ids)
)

rejected_input_ids, r_masks = self._tokenizer.tokenize_messages(
    rejected_message, self.max_seq_len
)
rejected_labels = list(
    np.where(r_masks, CROSS_ENTROPY_IGNORE_IDX, rejected_input_ids)
)
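
The np.where calls above build the training labels: wherever the mask is True (the prompt tokens), the label is replaced by an ignore index so that those positions do not contribute to the loss. The same logic can be sketched without NumPy. The value -100 is an assumption here (it is the default ignore_index of PyTorch's cross-entropy loss), and build_labels is a hypothetical helper, not part of torchtune.

```python
from typing import List

IGNORE_IDX = -100  # assumed value; matches PyTorch's default ignore_index


def build_labels(
    input_ids: List[int], mask: List[bool], ignore_idx: int = IGNORE_IDX
) -> List[int]:
    """Replace masked positions with ignore_idx; keep the token id elsewhere."""
    if len(input_ids) != len(mask):
        raise ValueError("input_ids and mask must have the same length")
    return [ignore_idx if masked else tok for tok, masked in zip(input_ids, mask)]


# Three prompt tokens (masked) followed by two response tokens (trained on).
input_ids = [11, 12, 13, 21, 22]
mask = [True, True, True, False, False]
labels = build_labels(input_ids, mask)
print(labels)
```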

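For a specific dataset that should be easy to customize from the config, you can create a builder function. This is the builder function for stack_exchanged_paired_dataset(), which creates a PreferenceDataset configured to use a paired dataset from Hugging Face. Note that we also had to add a custom instruct template.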

def stack_exchanged_paired_dataset(
    tokenizer: ModelTokenizer,
    max_seq_len: int = 1024,
) -> PreferenceDataset:
    return PreferenceDataset(
        tokenizer=tokenizer,
        source="lvwerra/stack-exchange-paired",
        template=StackExchangedPairedTemplate(),
        column_map={
            "prompt": "question",
            "chosen": "response_j",
            "rejected": "response_k",
        },
        max_seq_len=max_seq_len,
        split="train",
        data_dir="data/rl",
    )

Now we can easily specify our custom dataset from the config or from the command line.

# YAML config: the Stack Exchange paired dataset configured through its builder
dataset:
  _component_: torchtune.datasets.stack_exchanged_paired_dataset
  max_seq_len: 512

# Command line
tune run full_finetune_single_device --config llama3/8B_full_single_device \
dataset=torchtune.datasets.stack_exchanged_paired_dataset dataset.max_seq_len=512