Quick Start Guide

In this quick start guide, we will explore how to perform basic quantization using torchao. First, install the latest stable torchao release:

pip install torchao

If you prefer the nightly release, install torchao with the following command instead:

pip install --pre torchao --index-url https://download.pytorch.org/whl/nightly/cu121

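torchao is compatible with the latest 3 major versions of PyTorch, which you will also need to install (see the detailed installation instructions):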

pip install torch
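To confirm which versions ended up in the environment, a check along the following lines can help. This is a minimal sketch that uses only the Python standard library; `installed_version` is a hypothetical helper name chosen for illustration, not part of torchao.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package: str):
    """Return the installed version string of `package`, or None if it is absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


# Report the two packages this guide depends on.
for name in ("torch", "torchao"):
    print(f"{name}: {installed_version(name) or 'not installed'}")
```

If either line reports "not installed", rerun the corresponding pip command above in the same environment.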

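First Quantization Example

The main entry point for quantization in torchao is the quantize_ API. This function mutates your model in place, inserting custom quantization logic based on the user configuration. All code in this guide can be found in the accompanying example script. First, let's set up a toy model: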

import copy
import torch

class ToyLinearModel(torch.nn.Module):
    def __init__(self, m: int, n: int, k: int):
        super().__init__()
        self.linear1 = torch.nn.Linear(m, n, bias=False)
        self.linear2 = torch.nn.Linear(n, k, bias=False)

    def forward(self, x):
        x = self.linear1(x)
        x = self.linear2(x)
        return x

model = ToyLinearModel(1024, 1024, 1024).eval().to(torch.bfloat16).to("cuda")

# Optional: compile model for faster inference and generation
model = torch.compile(model, mode="max-autotune", fullgraph=True)
model_bf16 = copy.deepcopy(model)

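Now we call the main quantization API to quantize the linear layer weights in the model to int4 in place. More specifically, this applies uint4 weight-only asymmetric per-group quantization, leveraging the tinygemm int4mm CUDA kernel for efficient mixed-dtype matrix multiplication: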

# torch 2.4+ only
from torchao.quantization import int4_weight_only, quantize_
quantize_(model, int4_weight_only(group_size=32))

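The quantized model is now ready to use! Note that the quantization logic is inserted through tensor subclasses, so the overall model structure is unchanged; only the weight tensors are updated, and each nn.Linear module remains an nn.Linear module: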

>>> model.linear1
Linear(in_features=1024, out_features=1024, weight=AffineQuantizedTensor(shape=torch.Size([1024, 1024]), block_size=(1, 32), device=cuda:0, _layout=TensorCoreTiledLayout(inner_k_tiles=8), tensor_impl_dtype=torch.int32, quant_min=0, quant_max=15))

>>> model.linear2
Linear(in_features=1024, out_features=1024, weight=AffineQuantizedTensor(shape=torch.Size([1024, 1024]), block_size=(1, 32), device=cuda:0, _layout=TensorCoreTiledLayout(inner_k_tiles=8), tensor_impl_dtype=torch.int32, quant_min=0, quant_max=15))
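The block_size=(1, 32), quant_min=0 and quant_max=15 fields in the output above describe groups of 32 consecutive weights, each mapped to integers in the range 0 to 15. The pure-Python sketch below illustrates the general idea of asymmetric per-group quantization under simple assumptions (min/max scaling, with the group minimum used as the offset). It is not torchao's implementation, the exact zero-point convention used by the tinygemm kernel may differ, and the function names are hypothetical.

```python
import random

GROUP_SIZE = 32      # mirrors group_size=32 / block_size=(1, 32)
QMIN, QMAX = 0, 15   # uint4 range, mirrors quant_min / quant_max


def quantize_group(values):
    """Map one group of floats to integer codes in [QMIN, QMAX]."""
    lo, hi = min(values), max(values)
    # Fall back to 1.0 for a constant group to avoid dividing by zero.
    scale = (hi - lo) / (QMAX - QMIN) or 1.0
    codes = [min(QMAX, max(QMIN, round((v - lo) / scale))) for v in values]
    return codes, scale, lo


def dequantize_group(codes, scale, lo):
    """Recover approximate floats from the codes of one group."""
    return [c * scale + lo for c in codes]


def quantize_row(row):
    """Quantize a row group by group; each group keeps its own scale and offset."""
    return [quantize_group(row[i:i + GROUP_SIZE]) for i in range(0, len(row), GROUP_SIZE)]


random.seed(0)
row = [random.gauss(0.0, 1.0) for _ in range(128)]
groups = quantize_row(row)
restored = [v for g in groups for v in dequantize_group(*g)]
max_err = max(abs(a - b) for a, b in zip(row, restored))
print(f"{len(groups)} groups, max abs error {max_err:.4f}")
```

Because every group carries its own scale and offset, the rounding error of each weight is bounded by half of that group's scale. Smaller groups therefore track the weights more closely, at the cost of storing more per-group metadata.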

First, verify that the int4 quantized model is roughly a quarter of the size of the original bfloat16 model:

>>> import os
>>> torch.save(model, "/tmp/int4_model.pt")
>>> torch.save(model_bf16, "/tmp/bfloat16_model.pt")
>>> int4_model_size_mb = os.path.getsize("/tmp/int4_model.pt") / 1024 / 1024
>>> bfloat16_model_size_mb = os.path.getsize("/tmp/bfloat16_model.pt") / 1024 / 1024

>>> print("int4 model size: %.2f MB" % int4_model_size_mb)
int4 model size: 1.25 MB

>>> print("bfloat16 model size: %.2f MB" % bfloat16_model_size_mb)
bfloat16 model size: 4.00 MB
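The sizes above can be roughly reproduced with back-of-the-envelope arithmetic. The sketch below assumes that each group of 32 weights stores one scale and one zero point at 2 bytes each, and it ignores serialization overhead; both are assumptions made for illustration rather than a description of the actual file layout.

```python
MIB = 1024 * 1024

num_layers = 2                   # linear1 and linear2
weights_per_layer = 1024 * 1024  # each layer is 1024 x 1024, no bias
group_size = 32

# bfloat16 stores 2 bytes per weight.
bf16_bytes = num_layers * weights_per_layer * 2

# int4 stores half a byte per weight ...
int4_weight_bytes = num_layers * weights_per_layer // 2
# ... plus one scale and one zero point per group (assumed 2 bytes each).
num_groups = num_layers * weights_per_layer // group_size
int4_meta_bytes = num_groups * 2 * 2

int4_bytes = int4_weight_bytes + int4_meta_bytes

print("bfloat16: %.2f MB" % (bf16_bytes / MIB))
print("int4:     %.2f MB" % (int4_bytes / MIB))
```

Under these assumptions the packed weights alone are exactly a quarter of the bfloat16 size, and the per-group metadata accounts for the remaining gap between 1.00 MB and 1.25 MB.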

Next, we demonstrate that the quantized model is not only smaller but also much faster!

from torchao.utils import (
    TORCH_VERSION_AT_LEAST_2_5,
    benchmark_model,
    unwrap_tensor_subclass,
)

# Temporary workaround for tensor subclass + torch.compile
# Only needed for torch version < 2.5
if not TORCH_VERSION_AT_LEAST_2_5:
    unwrap_tensor_subclass(model)

num_runs = 100
torch._dynamo.reset()
example_inputs = (torch.randn(1, 1024, dtype=torch.bfloat16, device="cuda"),)
bf16_time = benchmark_model(model_bf16, num_runs, example_inputs)
int4_time = benchmark_model(model, num_runs, example_inputs)

print("bf16 mean time: %0.3f ms" % bf16_time)
print("int4 mean time: %0.3f ms" % int4_time)
print("speedup: %0.1fx" % (bf16_time / int4_time))

On a single A100 GPU with 80GB of memory, this prints:

bf16 mean time: 30.393 ms
int4 mean time: 4.410 ms
speedup: 6.9x
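The reported speedup is simply the ratio of the two mean times (30.393 / 4.410 is about 6.9). For readers curious about the shape of such a measurement, here is a minimal CPU-only sketch of a mean-time benchmark. It is not the implementation of torchao's benchmark_model, and `mean_time_ms` is a hypothetical helper. When timing CUDA work, a helper like this would also need to synchronize the device (for example with torch.cuda.synchronize()) before reading the clock, because kernels are launched asynchronously.

```python
import time


def mean_time_ms(fn, num_runs, warmup=3):
    """Average wall-clock time of fn() in milliseconds over num_runs calls."""
    for _ in range(warmup):  # warm-up calls are excluded from the timing
        fn()
    start = time.perf_counter()
    for _ in range(num_runs):
        fn()
    return (time.perf_counter() - start) * 1000 / num_runs


# Two stand-in workloads of different cost, used only to exercise the helper.
slow = mean_time_ms(lambda: sum(range(20_000)), num_runs=50)
fast = mean_time_ms(lambda: sum(range(2_000)), num_runs=50)
print("speedup: %0.1fx" % (slow / fast))
```

The warm-up calls matter for compiled models in particular, since the first invocations of a torch.compile model include compilation time that should not count toward the steady-state mean.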

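Next Steps

In this quick start guide, we learned how to quantize a simple model with torchao. To learn more about the different workflows supported in torchao, see our main README. For a more detailed overview of quantization in torchao, see the quantization overview page.

Finally, if you would like to contribute to torchao, don't forget to check out our contributor guide and our list of good first issues on GitHub!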
