Chat Datasets¶
Chat datasets involve multi-turn conversations (multiple back-and-forth exchanges) between a user and an assistant.
[
    {"role": "user", "content": "What is the answer to the ultimate question of life?"},
    {"role": "assistant", "content": "The answer is 42."},
    {"role": "user", "content": "That's ridiculous"},
    {"role": "assistant", "content": "Oh I know."},
]
This is more structured than the freeform text association that models are typically pre-trained on, where they learn to simply predict the next token instead of responding accurately to the user.
The primary entry point for fine-tuning with chat datasets in torchtune is the chat_dataset() builder. It lets you specify a local or Hugging Face dataset that follows the chat data format directly from the config and train your LLM on it.
Example chat dataset¶
# data/my_data.json
[
    {
        "conversations": [
            {
                "from": "human",
                "value": "What is the answer to life?"
            },
            {
                "from": "gpt",
                "value": "The answer is 42."
            },
            {
                "from": "human",
                "value": "That's ridiculous"
            },
            {
                "from": "gpt",
                "value": "Oh I know."
            }
        ]
    }
]
from torchtune.models.mistral import mistral_tokenizer
from torchtune.datasets import chat_dataset
m_tokenizer = mistral_tokenizer(
    path="/tmp/Mistral-7B-v0.1/tokenizer.model",
    prompt_template="torchtune.models.mistral.MistralChatTemplate",
    max_seq_len=8192,
)
ds = chat_dataset(
    tokenizer=m_tokenizer,
    source="json",
    data_files="data/my_data.json",
    split="train",
    conversation_column="conversations",
    conversation_style="sharegpt",
    # By default, user prompt is ignored in loss. Set to True to include it
    train_on_input=True,
    new_system_prompt=None,
)
tokenized_dict = ds[0]
tokens, labels = tokenized_dict["tokens"], tokenized_dict["labels"]
print(m_tokenizer.decode(tokens))
# [INST] What is the answer to life? [/INST] The answer is 42. [INST] That's ridiculous [/INST] Oh I know.
print(labels)
# [1, 733, 16289, 28793, 1824, 349, 272, 4372, ...]
# In config
tokenizer:
  _component_: torchtune.models.mistral.mistral_tokenizer
  path: /tmp/Mistral-7B-v0.1/tokenizer.model
  prompt_template: torchtune.models.mistral.MistralChatTemplate
  max_seq_len: 8192

dataset:
  _component_: torchtune.datasets.chat_dataset
  source: json
  data_files: data/my_data.json
  split: train
  conversation_column: conversations
  conversation_style: sharegpt
  train_on_input: True
  new_system_prompt: null
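Continuing the Python example above, you can check exactly which tokens contribute to the loss by filtering out the ignored label positions. This is a minimal sketch, assuming torchtune exports the CROSS_ENTROPY_IGNORE_IDX masking sentinel from torchtune.data; with train_on_input=True as above nothing is masked, so the decoded text matches the full conversation.

from torchtune.data import CROSS_ENTROPY_IGNORE_IDX  # assumed export: the label-masking sentinel

# Keep only positions whose labels are not ignored by the loss.
# With train_on_input=False, the user-turn tokens would drop out here.
trainable = [t for t, l in zip(tokens, labels) if l != CROSS_ENTROPY_IGNORE_IDX]
print(m_tokenizer.decode(trainable))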
Chat dataset format¶
Chat datasets typically have a single column named "conversations" or "messages" that contains a list of messages on a single topic per sample. The list of messages may include a system prompt, multiple turns between user and assistant, and tool calls/returns.
| conversations |
|---|
| [{"role": "user", "content": "What day is today?"}, {"role": "assistant", "content": "It is Tuesday."}] |
| [{"role": "user", "content": "What about tomorrow?"}, {"role": "assistant", "content": "Tomorrow is Wednesday."}] |
As an example, you can see the schema of the SlimOrca dataset, inspected in the sketch below.
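As a quick sanity check before wiring up chat_dataset, you can peek at the raw schema with the Hugging Face datasets library. This is only a sketch; the ShareGPT-style field names in the comment are what SlimOrca happens to use.

from datasets import load_dataset

# Load one split and inspect the conversation column of the first sample
raw = load_dataset("Open-Orca/SlimOrca-Dedup", split="train")
print(raw[0]["conversations"])
# ShareGPT-style entries: [{"from": "system", "value": ...}, {"from": "human", "value": ...}, ...]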
Loading chat datasets from Hugging Face¶
You will need to pass in the dataset repo name to source, select one of the conversation styles in conversation_style, and specify the conversation_column. For most HF datasets, you will also need to specify the split.
from torchtune.models.gemma import gemma_tokenizer
from torchtune.datasets import chat_dataset
g_tokenizer = gemma_tokenizer("/tmp/gemma-7b/tokenizer.model")
ds = chat_dataset(
    tokenizer=g_tokenizer,
    source="Open-Orca/SlimOrca-Dedup",
    conversation_column="conversations",
    conversation_style="sharegpt",
    split="train",
)
# Tokenizer is passed into the dataset in the recipe
dataset:
  _component_: torchtune.datasets.chat_dataset
  source: Open-Orca/SlimOrca-Dedup
  conversation_column: conversations
  conversation_style: sharegpt
  split: train
Loading local and remote chat datasets¶
To load a local dataset, or a remote dataset via https, with conversation data, you will need to additionally specify the data_files and split arguments. See Hugging Face's load_dataset documentation for more details on loading local or remote files.
from torchtune.models.gemma import gemma_tokenizer
from torchtune.datasets import chat_dataset
g_tokenizer = gemma_tokenizer("/tmp/gemma-7b/tokenizer.model")
ds = chat_dataset(
    tokenizer=g_tokenizer,
    source="json",
    conversation_column="conversations",
    conversation_style="sharegpt",
    data_files="data/my_data.json",
    split="train",
)
# Tokenizer is passed into the dataset in the recipe
dataset:
  _component_: torchtune.datasets.chat_dataset
  source: json
  conversation_column: conversations
  conversation_style: sharegpt
  data_files: data/my_data.json
  split: train
Specifying conversation style¶
The structure of conversations in raw datasets can vary widely, with different role names and different fields indicating the message content. Many datasets follow one of a few standardized formats, and we have built-in converters that transform these into a list of torchtune Messages following this format:
[
    {
        "role": "system" | "user" | "assistant" | "ipython",
        "content": <message>,
    },
    ...
]
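Under the hood, each of these dicts becomes a torchtune Message. A minimal sketch, assuming the Message class and its from_dict helper in torchtune.data behave as in recent torchtune releases:

from torchtune.data import Message

# Build a Message directly...
msg = Message(role="user", content="What is the answer to life?")
# ...or from a dict in the standardized format above
msg = Message.from_dict({"role": "assistant", "content": "The answer is 42."})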
"openai"
¶
关联的消息转换是 OpenAIToMessages
。预期格式为
{
    "messages": [
        {
            "role": "system" | "user" | "assistant",
            "content": <message>,
        },
        ...
    ]
}
You can specify conversation_style=openai in code or in config.
from torchtune.models.gemma import gemma_tokenizer
from torchtune.datasets import chat_dataset
g_tokenizer = gemma_tokenizer("/tmp/gemma-7b/tokenizer.model")
ds = chat_dataset(
    tokenizer=g_tokenizer,
    source="json",
    conversation_column="conversations",
    conversation_style="openai",
    data_files="data/my_data.json",
    split="train",
)
# Tokenizer is passed into the dataset in the recipe
dataset:
  _component_: torchtune.datasets.chat_dataset
  source: json
  conversation_column: conversations
  conversation_style: openai
  data_files: data/my_data.json
  split: train
If your dataset does not fit one of the above conversation styles, you will need to create a custom message transform, as sketched below.
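Here is a minimal sketch of what such a transform could look like, for a hypothetical raw format with "q"/"a" fields; it assumes the Transform protocol from torchtune.modules.transforms and would be used with SFTDataset rather than the chat_dataset builder.

from typing import Any, Mapping

from torchtune.data import Message
from torchtune.modules.transforms import Transform

class QAToMessages(Transform):
    """Hypothetical transform: maps {"q": ..., "a": ...} samples to Messages."""

    def __call__(self, sample: Mapping[str, Any]) -> Mapping[str, Any]:
        messages = [
            # Mask the user turn so it does not contribute to the loss
            Message(role="user", content=sample["q"], masked=True),
            Message(role="assistant", content=sample["a"]),
        ]
        return {"messages": messages}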
Renaming columns¶
To specify the column that contains your conversation data, use conversation_column.
# data/my_data.json
[
    {
        "dialogue": [
            {
                "from": "human",
                "value": "What is the answer to life?"
            },
            {
                "from": "gpt",
                "value": "The answer is 42."
            },
            {
                "from": "human",
                "value": "That's ridiculous"
            },
            {
                "from": "gpt",
                "value": "Oh I know."
            }
        ]
    }
]
from torchtune.models.gemma import gemma_tokenizer
from torchtune.datasets import chat_dataset
g_tokenizer = gemma_tokenizer("/tmp/gemma-7b/tokenizer.model")
ds = chat_dataset(
    tokenizer=g_tokenizer,
    source="json",
    conversation_column="dialogue",
    conversation_style="sharegpt",
    data_files="data/my_data.json",
    split="train",
)
# Tokenizer is passed into the dataset in the recipe
dataset:
  _component_: torchtune.datasets.chat_dataset
  source: json
  conversation_column: dialogue
  conversation_style: sharegpt
  data_files: data/my_data.json
  split: train
Chat templates¶
Chat templates are defined in the same way as the instruct templates in instruct_dataset(). See Instruct templates for more information.
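For reference, a custom template can be built with torchtune.data.PromptTemplate, which wraps each role's content in prepend/append tags. This is a sketch with made-up tags; consult the instruct templates page for the authoritative API.

from torchtune.data import PromptTemplate

# Hypothetical tags purely for illustration
my_template = PromptTemplate(
    template={
        "user": ("<|user|>\n", "\n"),
        "assistant": ("<|assistant|>\n", "\n"),
    },
)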