Multimodal Datasets¶

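Multimodal datasets contain more than one data modality, for example text + image, and can be used to train transformer-based models. torchtune currently supports only multimodal text+image chat datasets for vision-language models (VLMs).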

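The primary entry point for fine-tuning with multimodal datasets in torchtune is the multimodal_chat_dataset() builder. It lets you specify a local or Hugging Face dataset that follows the multimodal chat data format directly from the config and train your VLM on it.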

Example multimodal dataset¶

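Here is an example of a multimodal chat dataset for a visual question-answering task. Note the placeholder "<image>" in the text, which marks where the image tokens should be placed. In the example below, it is replaced by the image special token <|image|>.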

# data/my_data.json
[
    {
        "dialogue": [
            {
                "from": "human",
                "value": "<image>What time is it on the clock?"
            },
            {
                "from": "gpt",
                "value": "It is 10:00 AM."
            }
        ],
        "image_path": "images/clock.jpg"
    },
    ...,
]

from torchtune.models.llama3_2_vision import llama3_2_vision_transform
from torchtune.datasets.multimodal import multimodal_chat_dataset

model_transform = llama3_2_vision_transform(
    path="/tmp/Meta-Llama-3-8B-Instruct/original/tokenizer.model",
    prompt_template="torchtune.data.QuestionAnswerTemplate",
    max_seq_len=8192,
    image_size=560,
)
ds = multimodal_chat_dataset(
    model_transform=model_transform,
    source="json",
    data_files="data/my_data.json",
    # Keys are the column names torchtune expects; values are the names used in the file
    column_map={
        "conversations": "dialogue",
        "image": "image_path",
    },
    image_dir="/home/user/dataset/",  # /home/user/dataset/images/clock.jpg
    image_tag="<image>",
    split="train",
)
tokenized_dict = ds[0]
print(model_transform.decode(tokenized_dict["tokens"], skip_special_tokens=False))
# '<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\nQuestion:<|image|>What time is it on the clock?Answer:<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\nIt is 10:00 AM.<|eot_id|>'
print(tokenized_dict["encoder_input"]["images"][0].shape)  # (num_tiles, num_channels, tile_height, tile_width)
# torch.Size([4, 3, 224, 224])

tokenizer:
  _component_: torchtune.models.llama3_2_vision.llama3_2_vision_transform
  path: /tmp/Meta-Llama-3-8B-Instruct/original/tokenizer.model
  prompt_template: torchtune.data.QuestionAnswerTemplate
  max_seq_len: 8192
  image_size: 560

dataset:
  _component_: torchtune.datasets.multimodal.multimodal_chat_dataset
  source: json
  data_files: data/my_data.json
  column_map:
    conversations: dialogue
    image: image_path
  image_dir: /home/user/dataset/
  image_tag: "<image>"
  split: train
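
Before pointing a config at a file like data/my_data.json, it can help to confirm that the file is valid JSON and that every record carries both columns. The following is a minimal sketch using only the Python standard library; validate_records, REQUIRED_KEYS, and ALLOWED_SPEAKERS are hypothetical names introduced for illustration and are not part of torchtune.

```python
import json
import tempfile
from pathlib import Path

# Column names used by the example file above (assumption: your file uses the same names).
REQUIRED_KEYS = ("dialogue", "image_path")
# Speaker names assumed for this sketch of the "sharegpt" style.
ALLOWED_SPEAKERS = ("system", "human", "gpt")


def validate_records(records):
    """Return a list of human-readable problems found in the records."""
    problems = []
    for i, record in enumerate(records):
        for key in REQUIRED_KEYS:
            if key not in record:
                problems.append(f"record {i}: missing column '{key}'")
        for turn in record.get("dialogue", []):
            if turn.get("from") not in ALLOWED_SPEAKERS:
                problems.append(f"record {i}: unexpected speaker {turn.get('from')!r}")
    return problems


records = [
    {
        "dialogue": [
            {"from": "human", "value": "<image>What time is it on the clock?"},
            {"from": "gpt", "value": "It is 10:00 AM."},
        ],
        "image_path": "images/clock.jpg",
    }
]

# Write the records with json.dumps so the file is guaranteed to be valid JSON,
# then read it back and validate what was loaded.
with tempfile.TemporaryDirectory() as tmp_dir:
    data_file = Path(tmp_dir) / "my_data.json"
    data_file.write_text(json.dumps(records, indent=4), encoding="utf-8")
    reloaded = json.loads(data_file.read_text(encoding="utf-8"))

print(validate_records(reloaded))
# []
```

An empty list means no problems were found; each entry otherwise names the record index and the issue.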

Multimodal dataset format¶

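Multimodal datasets are currently expected to follow the "sharegpt" chat format, with the image paths in one column and the user-assistant conversations in another column.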

|  conversations                     | image        |
|------------------------------------|--------------|
| [{"from": "human", "value": "Q1"}, | images/1.jpg |
|  {"from": "gpt", "value": "A1"}]   |              |

As an example, see the schema of the ShareGPT4V dataset.

Currently, multimodal_chat_dataset() supports only a single image path per conversation sample.
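
To make the format concrete, here is a rough sketch of how one row in this layout could be read into role-tagged turns plus a single image path. It uses plain Python only; sharegpt_to_turns and ROLE_MAP are hypothetical names, and the speaker-to-role mapping is an assumption for illustration rather than torchtune's implementation.

```python
# Assumed mapping for this sketch: ShareGPT speaker name -> chat role.
ROLE_MAP = {"system": "system", "human": "user", "gpt": "assistant"}


def sharegpt_to_turns(sample, conversations_key="conversations", image_key="image"):
    """Convert one row into (list of role/text turns, single image path)."""
    image_path = sample[image_key]
    # Only one image path per conversation sample is supported, so reject lists.
    if not isinstance(image_path, str):
        raise ValueError("expected exactly one image path per sample")
    turns = [
        {"role": ROLE_MAP[turn["from"]], "text": turn["value"]}
        for turn in sample[conversations_key]
    ]
    return turns, image_path


sample = {
    "conversations": [
        {"from": "human", "value": "Q1"},
        {"from": "gpt", "value": "A1"},
    ],
    "image": "images/1.jpg",
}
turns, image_path = sharegpt_to_turns(sample)
print(turns)
# [{'role': 'user', 'text': 'Q1'}, {'role': 'assistant', 'text': 'A1'}]
print(image_path)
# images/1.jpg
```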

Loading multimodal datasets from Hugging Face¶

Pass the dataset repo name to source, which is then passed on to Hugging Face's load_dataset. For most datasets, you will also need to specify the split and/or the subset via name.

# In code
from torchtune.models.llama3_2_vision import llama3_2_vision_transform
from torchtune.datasets.multimodal import multimodal_chat_dataset

model_transform = llama3_2_vision_transform(
    path="/tmp/Meta-Llama-3-8B-Instruct/original/tokenizer.model",
    max_seq_len=8192,
    image_size=560,
)
ds = multimodal_chat_dataset(
    model_transform=model_transform,
    source="Lin-Chen/ShareGPT4V",
    split="train",
    name="ShareGPT4V",
    image_dir="/home/user/dataset/",
    image_tag="<image>",
)

# In config
tokenizer:
  _component_: torchtune.models.llama3_2_vision.llama3_2_vision_transform
  path: /tmp/Meta-Llama-3-8B-Instruct/original/tokenizer.model
  max_seq_len: 8192
  image_size: 560

# Tokenizer is passed into the dataset in the recipe
dataset:
  _component_: torchtune.datasets.multimodal.multimodal_chat_dataset
  source: Lin-Chen/ShareGPT4V
  split: train
  name: ShareGPT4V
  image_dir: /home/user/dataset/
  image_tag: "<image>"

This uses the default column names "conversations" and "image". To change the column names, use the column_map argument (see Renaming columns).

Loading local and remote multimodal datasets¶

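To load a local dataset, or a remote one over https, that follows the multimodal chat format, you need to specify the source, data_files, and split arguments. See Hugging Face's load_dataset documentation for more details on loading local or remote files. See the Example multimodal dataset section above.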

Loading images¶

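In many cases, your dataset contains paths to the images rather than the raw images themselves. multimodal_chat_dataset() handles this for you automatically, but if you are writing a custom message transform for a custom multimodal dataset (see Custom message transforms), you can use the load_image() utility directly.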

from torchtune.data import load_image
from pathlib import Path

sample = {
    "conversations": [
        {
            "from": "human",
            "value": "What time is it on the clock?",
        },
        {
            "from": "gpt",
            "value": "It is 10:00 AM.",
        },
    ],
    "image": "images/clock.jpg",
}
image_dir = "/home/user/dataset/"
pil_image = load_image(Path(image_dir) / Path(sample["image"]))
print(pil_image)
# <PIL.Image.Image>

You can then add the PIL image directly to the content of the relevant message. Message supports only PIL images as image content, not image paths or URLs.

from torchtune.data import Message

user_message = None
for msg in sample["conversations"]:
    if msg["from"] == "human":
        user_message = Message(
            role="user",
            content=[
                {"type": "image", "content": pil_image},
                {"type": "text", "content": msg["value"]},
            ]
        )
print(user_message.contains_media)
# True
print(user_message.get_media())
# [<PIL.Image.Image>]
print(user_message.text_content)
# What time is it on the clock?

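If the image paths in your dataset are relative, you can use the image_dir argument of multimodal_chat_dataset() to prepend the full path to where your images are downloaded locally.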
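
The path handling described above can be pictured with a short standard-library sketch. resolve_image_path is a hypothetical helper written for illustration, not a torchtune function; it uses PurePosixPath so the result is the same on any platform, and leaving absolute paths untouched is a choice made for this sketch.

```python
from pathlib import PurePosixPath


def resolve_image_path(image_dir, image_path):
    """Prepend image_dir to a relative path; return absolute paths unchanged."""
    path = PurePosixPath(image_path)
    if path.is_absolute():
        return path
    return PurePosixPath(image_dir) / path


resolved = resolve_image_path("/home/user/dataset/", "images/clock.jpg")
print(resolved)
# /home/user/dataset/images/clock.jpg
```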

Interleaving images in text¶

torchtune supports adding multiple images at any location in the text, as long as your model supports it.

import PIL.Image  # import the submodule explicitly so PIL.Image.new is available
from torchtune.data import Message

image_dog = PIL.Image.new(mode="RGB", size=(4, 4))
image_cat = PIL.Image.new(mode="RGB", size=(4, 4))
image_bird = PIL.Image.new(mode="RGB", size=(4, 4))

user_message = Message(
    role="user",
    content=[
        {"type": "image", "content": image_dog},
        {"type": "text", "content": "This is an image of a dog. "},
        {"type": "image", "content": image_cat},
        {"type": "text", "content": "This is an image of a cat. "},
        {"type": "image", "content": image_bird},
        {"type": "text", "content": "This is a bird, the best pet of the three."},
    ]
)
print(user_message.contains_media)
# True
print(user_message.get_media())
# [<PIL.Image.Image>, <PIL.Image.Image>, <PIL.Image.Image>]
print(user_message.text_content)
# This is an image of a dog. This is an image of a cat. This is a bird, the best pet of the three.

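Your dataset may contain image placeholder tags that indicate where in the text each image should be referenced. For an example, see ShareGPT4V <https://hugging-face.cn/datasets/Lin-Chen/ShareGPT4V>, which uses "<image>". You can use the format_content_with_images() utility to create multimodal message content like the above; it replaces the image placeholder tags with the images you pass in.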

import PIL.Image  # import the submodule explicitly so PIL.Image.new is available
from torchtune.data import Message, format_content_with_images

image_dog = PIL.Image.new(mode="RGB", size=(4, 4))
image_cat = PIL.Image.new(mode="RGB", size=(4, 4))
image_bird = PIL.Image.new(mode="RGB", size=(4, 4))

text = "[img]This is an image of a dog. [img]This is an image of a cat. [img]This is a bird, the best pet of the three."
user_message = Message(
    role="user",
    content=format_content_with_images(
        content=text,
        image_tag="[img]",
        images=[image_dog, image_cat, image_bird],
    ),
)
print(user_message.contains_media)
# True
print(user_message.get_media())
# [<PIL.Image.Image>, <PIL.Image.Image>, <PIL.Image.Image>]
print(user_message.text_content)
# This is an image of a dog. This is an image of a cat. This is a bird, the best pet of the three.

multimodal_chat_dataset() handles this for you automatically when you pass in image_tag.
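
For intuition, the tag replacement can be sketched in plain Python: split the text on the tag and place one image at each tag position. interleave_images is a hypothetical stand-in written for illustration and is not the torchtune implementation; the strings "dog.jpg" and "cat.jpg" stand in for PIL images so the sketch runs without extra dependencies.

```python
def interleave_images(text, image_tag, images):
    """Split text on image_tag and insert one image at each tag position."""
    pieces = text.split(image_tag)
    num_tags = len(pieces) - 1
    if num_tags != len(images):
        raise ValueError(f"found {num_tags} tags but got {len(images)} images")
    content = []
    for i, piece in enumerate(pieces):
        # Skip empty text pieces, e.g. when the text starts with a tag.
        if piece:
            content.append({"type": "text", "content": piece})
        # Every piece except the last one is followed by a tag, hence an image.
        if i < len(images):
            content.append({"type": "image", "content": images[i]})
    return content


content = interleave_images(
    "[img]This is a dog. [img]This is a cat.",
    image_tag="[img]",
    images=["dog.jpg", "cat.jpg"],
)
print([item["type"] for item in content])
# ['image', 'text', 'image', 'text']
```

Raising an error when the number of tags and images differ is a choice made for this sketch, so that a mismatched sample fails loudly instead of silently dropping an image.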