XLA 量化操作（实验性功能）¶

本文档概述了如何利用量化操作在 XLA 设备上启用量化。

XLA 量化操作为量化操作（例如，分块 int4 量化矩阵乘法）提供了高层抽象。这些操作类似于 CUDA 生态系统中的量化 CUDA 内核（示例），在 XLA 框架内提供类似的功能和性能优势。

注意： 目前此功能被归类为实验性功能。其 API 细节将在下一版本（2.5）中有所变化。

如何使用：¶

XLA 量化操作可以用作 torch op，或者用作封装了 torch.op 的 torch.nn.Module。这两种选项为模型开发者提供了灵活性，可以选择将 XLA 量化操作集成到其解决方案中的最佳方式。

无论是 torch op 还是 nn.Module 都与 torch.compile( backend='openxla') 兼容。

在模型代码中调用 XLA 量化操作¶

用户可以像调用其他常规 PyTorch 操作一样调用 XLA 量化操作。这为将 XLA 量化操作集成到其应用中提供了最大的灵活性。

量化操作在 eager 模式和 Dynamo 中均可工作，支持常规 PyTorch CPU 张量和 XLA 张量。

import torch
import torch_xla.core.xla_model as xm
import torch_xla.experimental.xla_quantized_matmul

N_INPUT_FEATURES=10
N_OUTPUT_FEATURES=20
x = torch.randn((3, N_INPUT_FEATURES), dtype=torch.bfloat16)
w_int = torch.randint(-128, 127, (N_OUTPUT_FEATURES, N_INPUT_FEATURES), dtype=torch.int8)
scaler = torch.randn((N_OUTPUT_FEATURES,), dtype=torch.bfloat16)

# Call with torch CPU tensor (For debugging purpose)
matmul_output = torch.ops.xla.quantized_matmul(x, w_int, scaler)

device = xm.xla_device()
x_xla = x.to(device)
w_int_xla = w_int.to(device)
scaler_xla = scaler.to(device)

# Call with XLA Tensor to run on XLA device
matmul_output_xla = torch.ops.xla.quantized_matmul(x_xla, w_int_xla, scaler_xla)

# Use with torch.compile(backend='openxla')
def f(x, w, s):
  return torch.ops.xla.quantized_matmul(x, w, s)

f_dynamo = torch.compile(f, backend="openxla")
dynamo_out_xla = f_dynamo(x_xla, w_int_xla, scaler_xla)

注意请检查量化操作的文档字符串以获取量化权重的布局信息。

class MyQLinearForXLABackend(torch.nn.Module):
  def __init__(self):
    self.weight = ...
    self.scaler = ...

  def load_weight(self, w, scaler):
    # Load quantized Linear weights
    # Customized way to preprocess the weights
    ...
    self.weight = processed_w
    self.scaler = processed_scaler


  def forward(self, x):
    # Do some random stuff with x
    ...
    matmul_output = torch.ops.xla.quantized_matmul(x, self.weight, self.scaler)
    # Do some random stuff with matmul_output
    ...

模型开发者通常会在模型代码中将量化操作封装到自定义的 `nn.Module` 中

模块替换¶

orig_model = MyModel()
# Quantize the model and get quantized weights
q_weights = quantize(orig_model)
# Process the quantized weight to the format that XLA quantized op expects.
q_weights_for_xla = process_for_xla(q_weights)

# Do module swap
q_linear = XlaQuantizedLinear(self.linear.in_features,
                              self.linear.out_features)
q_linear.load_quantized_weight(q_weights_for_xla)
orig_model.linear = q_linear

或者，用户也可以使用封装了 XLA 量化操作的 `nn.Module`，并在模型代码中进行模块替换。

支持的量化操作：¶

矩阵乘法¶	权重	激活	数据类型
支持	per-channel（对称/非对称）	W8A16
支持	是	不适用	W8A8
否	per-channel	不适用	W8A8
否	per-channel	per-token	W8A8
W4A8	是	per-channel（对称/非对称）	W8A16
W4A8	是	per-channel（对称/非对称）	W8A16
blockwise（对称/非对称）	per-channel	不适用	W8A8
blockwise（对称/非对称）	per-channel	per-token	W8A8

blockwise

XLA 量化操作（实验性功能）¶

如何使用：¶

在模型代码中调用 XLA 量化操作¶

模型开发者通常会在模型代码中将量化操作封装到自定义的 `nn.Module` 中

或者，用户也可以使用封装了 XLA 量化操作的 `nn.Module`，并在模型代码中进行模块替换。

支持的量化操作：¶

文档

查看文档

查看教程

XLA 量化操作（实验性功能）¶

如何使用：¶

在模型代码中调用 XLA 量化操作¶

模型开发者通常会在模型代码中将量化操作封装到自定义的 nn.Module 中

或者，用户也可以使用封装了 XLA 量化操作的 nn.Module，并在模型代码中进行模块替换。

支持的量化操作：¶

文档

查看文档

查看教程

模型开发者通常会在模型代码中将量化操作封装到自定义的 `nn.Module` 中

或者，用户也可以使用封装了 XLA 量化操作的 `nn.Module`，并在模型代码中进行模块替换。