(prototype) FX Graph Mode Post Training Static Quantization¶
Created On: Feb 08, 2021 | Last Updated: Jan 24, 2025 | Last Verified: Nov 05, 2024
Author: Jerry Zhang Edited by: Charles Hernandez
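This tutorial introduces the steps for post training static quantization in graph mode based on torch.fx. The advantage of FX graph mode quantization is that quantization can be performed fully automatically on the model. Some effort may be needed to make the model compatible with FX Graph Mode Quantization (that is, symbolically traceable with torch.fx); a separate tutorial shows how to make the parts of the model you want to quantize compatible. We also have a tutorial on FX Graph Mode Post Training Dynamic Quantization. tl;dr: the FX Graph Mode API looks like the following: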
import torch
from torch.ao.quantization import get_default_qconfig
from torch.ao.quantization.quantize_fx import prepare_fx, convert_fx
from torch.ao.quantization import QConfigMapping
float_model.eval()
# The old 'fbgemm' is still available but 'x86' is the recommended default.
qconfig = get_default_qconfig("x86")
qconfig_mapping = QConfigMapping().set_global(qconfig)
def calibrate(model, data_loader):
    model.eval()
    with torch.no_grad():
        for image, target in data_loader:
            model(image)
example_inputs = (next(iter(data_loader))[0],)  # a tuple of example positional inputs
prepared_model = prepare_fx(float_model, qconfig_mapping, example_inputs)  # fuse modules and insert observers
calibrate(prepared_model, data_loader_test)  # run calibration on sample data
quantized_model = convert_fx(prepared_model)  # convert the calibrated model to a quantized model
1. Motivation of FX Graph Mode Quantization¶
Currently, PyTorch only has eager mode quantization as an alternative: Static Quantization with Eager Mode in PyTorch.
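We can see that the eager mode quantization process involves multiple manual steps, including:
Explicitly quantizing and dequantizing activations. This is time consuming when floating point and quantized operations are mixed in a model.
Explicitly fusing modules. This requires manually identifying sequences of convolutions, batch norms, ReLUs and other fusion patterns.
Special handling is needed for PyTorch tensor operations (such as add, concat, etc.).
Functionals do not have first-class support (functional.conv2d and functional.linear would not get quantized).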
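Most of these required modifications come from the underlying limitations of eager mode quantization. Eager mode works at the module level because it cannot inspect the code that actually runs (in the forward function). Quantization is achieved by module swapping, and in eager mode we do not know how the modules are used in the forward function, so users must manually insert QuantStub and DeQuantStub to mark the points they want to quantize or dequantize. In graph mode, we can inspect the actual code that is executed in the forward function (e.g. aten function calls), and quantization is achieved by module and graph manipulations. Since graph mode has full visibility of the code that is run, our tool can automatically figure out things like which modules to fuse and where to insert observer calls and quantize/dequantize functions, so the whole quantization process can be automated.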
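To make the contrast concrete, here is a minimal sketch of what graph mode can "see". It runs torch.fx.symbolic_trace on a toy module (TinyNet is a hypothetical name, not part of this tutorial) and lists every node of the traced graph. The functional relu call and the tensor addition both show up as graph nodes, which is the information that module swapping alone does not have.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.fx import symbolic_trace


class TinyNet(nn.Module):
    """Hypothetical toy model mixing a module, a functional and a tensor op."""

    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(3, 4, kernel_size=3)

    def forward(self, x):
        x = self.conv(x)   # module call
        x = F.relu(x)      # functional call, invisible to module swapping
        return x + 1.0     # tensor operation


traced = symbolic_trace(TinyNet())

# Each node records what kind of operation ran and what it targeted.
node_summary = [(node.op, str(node.target)) for node in traced.graph.nodes]
for op, target in node_summary:
    print(f"{op:15s} {target}")
```

The graph contains one call_module node for the convolution and two call_function nodes for relu and add. These are the places where a graph-based tool can decide on fusion and observer placement without user annotations.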
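Advantages of FX Graph Mode Quantization are:
A simple quantization flow with minimal manual steps
It unlocks the possibility of higher level optimizations, such as automatic precision selection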
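2. Define Helper Functions and Prepare Dataset¶
We'll start by doing the necessary imports, defining some helper functions and preparing the data. These steps are identical to Static Quantization with Eager Mode in PyTorch.
To run the code in this tutorial using the entire ImageNet dataset, first download ImageNet by following the ImageNet Data instructions. Unzip the downloaded file into the "data_path" folder.
Download the torchvision resnet18 model and rename it to data/resnet18_pretrained_float.pth.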
import os
import sys
import time
import numpy as np
import torch
from torch.ao.quantization import get_default_qconfig, QConfigMapping
from torch.ao.quantization.quantize_fx import prepare_fx, convert_fx, fuse_fx
import torch.nn as nn
from torch.utils.data import DataLoader
import torchvision
from torchvision import datasets
from torchvision.models.resnet import resnet18
import torchvision.transforms as transforms
# Set up warnings
import warnings
warnings.filterwarnings(
    action='ignore',
    category=DeprecationWarning,
    module=r'.*'
)
warnings.filterwarnings(
    action='default',
    module=r'torch.ao.quantization'
)
# Specify random seed for repeatable results
_ = torch.manual_seed(191009)
class AverageMeter(object):
    """Computes and stores the average and current value"""
    def __init__(self, name, fmt=':f'):
        self.name = name
        self.fmt = fmt
        self.reset()
    def reset(self):
        self.val = 0
        self.avg = 0
        self.sum = 0
        self.count = 0
    def update(self, val, n=1):
        self.val = val
        self.sum += val * n
        self.count += n
        self.avg = self.sum / self.count
    def __str__(self):
        fmtstr = '{name} {val' + self.fmt + '} ({avg' + self.fmt + '})'
        return fmtstr.format(**self.__dict__)
def accuracy(output, target, topk=(1,)):
    """Computes the accuracy over the k top predictions for the specified values of k"""
    with torch.no_grad():
        maxk = max(topk)
        batch_size = target.size(0)
        _, pred = output.topk(maxk, 1, True, True)
        pred = pred.t()
        correct = pred.eq(target.view(1, -1).expand_as(pred))
        res = []
        for k in topk:
            correct_k = correct[:k].reshape(-1).float().sum(0, keepdim=True)
            res.append(correct_k.mul_(100.0 / batch_size))
        return res
def evaluate(model, criterion, data_loader):
    model.eval()
    top1 = AverageMeter('Acc@1', ':6.2f')
    top5 = AverageMeter('Acc@5', ':6.2f')
    cnt = 0
    with torch.no_grad():
        for image, target in data_loader:
            output = model(image)
            loss = criterion(output, target)
            cnt += 1
            acc1, acc5 = accuracy(output, target, topk=(1, 5))
            top1.update(acc1[0], image.size(0))
            top5.update(acc5[0], image.size(0))
    print('')
    return top1, top5
def load_model(model_file):
    # `weights=None` replaces the deprecated `pretrained=False` argument
    model = resnet18(weights=None)
    state_dict = torch.load(model_file, weights_only=True)
    model.load_state_dict(state_dict)
    model.to("cpu")
    return model
def print_size_of_model(model):
    if isinstance(model, torch.jit.RecursiveScriptModule):
        torch.jit.save(model, "temp.p")
    else:
        torch.jit.save(torch.jit.script(model), "temp.p")
    print("Size (MB):", os.path.getsize("temp.p")/1e6)
    os.remove("temp.p")
def prepare_data_loaders(data_path):
    normalize = transforms.Normalize(mean=[0.485, 0.456, 0.406],
                                     std=[0.229, 0.224, 0.225])
    dataset = torchvision.datasets.ImageNet(
        data_path, split="train", transform=transforms.Compose([
            transforms.RandomResizedCrop(224),
            transforms.RandomHorizontalFlip(),
            transforms.ToTensor(),
            normalize,
        ]))
    dataset_test = torchvision.datasets.ImageNet(
        data_path, split="val", transform=transforms.Compose([
            transforms.Resize(256),
            transforms.CenterCrop(224),
            transforms.ToTensor(),
            normalize,
        ]))
    train_sampler = torch.utils.data.RandomSampler(dataset)
    test_sampler = torch.utils.data.SequentialSampler(dataset_test)
    data_loader = torch.utils.data.DataLoader(
        dataset, batch_size=train_batch_size,
        sampler=train_sampler)
    data_loader_test = torch.utils.data.DataLoader(
        dataset_test, batch_size=eval_batch_size,
        sampler=test_sampler)
    return data_loader, data_loader_test
data_path = '~/.data/imagenet'
saved_model_dir = 'data/'
float_model_file = 'resnet18_pretrained_float.pth'
train_batch_size = 30
eval_batch_size = 50
data_loader, data_loader_test = prepare_data_loaders(data_path)
example_inputs = (next(iter(data_loader))[0],)  # tuple of example positional inputs for prepare_fx
criterion = nn.CrossEntropyLoss()
float_model = load_model(saved_model_dir + float_model_file).to("cpu")
float_model.eval()
# create another instance of the model since
# we need to keep the original model around
model_to_quantize = load_model(saved_model_dir + float_model_file).to("cpu")
# post training quantization expects the model to be in eval mode
model_to_quantize.eval()
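As a quick sanity check on the top-k metric used by the helpers above, the following sketch computes top-1 and top-2 accuracy on three hand-written predictions. topk_accuracy is a hypothetical, compact restatement written for this illustration; it is not the tutorial's accuracy function, although it is intended to compute the same quantity.

```python
import torch


def topk_accuracy(output, target, topk=(1,)):
    """Percentage of samples whose target is among the k highest-scoring classes."""
    maxk = max(topk)
    _, pred = output.topk(maxk, dim=1, largest=True, sorted=True)  # shape: (batch, maxk)
    correct = pred.t().eq(target.view(1, -1))                      # shape: (maxk, batch)
    return [correct[:k].any(dim=0).float().mean().item() * 100.0 for k in topk]


# Three samples, four classes (toy scores chosen by hand).
logits = torch.tensor([
    [0.1, 0.7, 0.2, 0.0],   # best class 1, target 1: top-1 hit
    [0.5, 0.1, 0.3, 0.1],   # best class 0, second best 2, target 2: top-2 hit only
    [0.2, 0.3, 0.1, 0.4],   # best class 3, target 3: top-1 hit
])
targets = torch.tensor([1, 2, 3])

top1, top2 = topk_accuracy(logits, targets, topk=(1, 2))
print(f"top-1: {top1:.2f}%  top-2: {top2:.2f}%")
```

Two of three samples are correct at k=1, and all three at k=2, which is why top-5 accuracy on ImageNet is always at least as high as top-1.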
4. Specify how to quantize the model with QConfigMapping¶
qconfig_mapping = QConfigMapping().set_global(default_qconfig)
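We use the same qconfig used in eager mode quantization. A qconfig is just a named tuple of the observers for activation and weight. QConfigMapping contains mapping information from ops to qconfigs: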
qconfig_mapping = (QConfigMapping()
    .set_global(qconfig_opt)  # qconfig_opt is an optional qconfig, either a valid qconfig or None
    .set_object_type(torch.nn.Conv2d, qconfig_opt)  # can be a callable...
    .set_object_type("reshape", qconfig_opt)  # ...or a string of the method
    .set_module_name_regex("foo.*bar.*conv[0-9]+", qconfig_opt)  # matched in order, first match takes precedence
    .set_module_name("foo.bar", qconfig_opt)
    # arguments: module name, object type, index of that object type within the module, qconfig
    # (the values below are illustrative placeholders)
    .set_module_name_object_type_order("foo.bar", torch.nn.functional.linear, 0, qconfig_opt)
)
# priority (in increasing order): global, object_type, module_name_regex, module_name,
# module_name_object_type_order
# qconfig == None means fusion and quantization should be skipped for anything
# matching the rule (unless a higher priority match is found)
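The following sketch shows the two ideas from this section on concrete objects: a qconfig is a named tuple with an activation field and a weight field, and a QConfigMapping rule set to None opts the matching ops out of quantization. The choice of torch.nn.Linear as the skipped type is only an example, and the attribute names read back from the mapping (global_qconfig, object_type_qconfigs) are the ones used by recent PyTorch releases, so treat them as an assumption if you are on an older version.

```python
import torch
from torch.ao.quantization import QConfigMapping, get_default_qconfig

qconfig = get_default_qconfig("x86")

# A QConfig is a named tuple holding two observer factories.
print(qconfig._fields)

# Calling a field creates a fresh observer instance.
activation_observer = qconfig.activation()
weight_observer = qconfig.weight()
print(type(activation_observer).__name__, type(weight_observer).__name__)

mapping = (
    QConfigMapping()
    .set_global(qconfig)                     # default rule for every op
    .set_object_type(torch.nn.Linear, None)  # None: keep Linear modules in floating point
)

print(mapping.global_qconfig is qconfig)
print(mapping.object_type_qconfigs[torch.nn.Linear])
```

Because object_type has higher priority than global, the None entry wins for Linear modules while everything else falls back to the global qconfig.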
Utility functions related to qconfig can be found in the qconfig file, while those for QConfigMapping can be found in qconfig_mapping (https://github.com/pytorch/pytorch/blob/main/torch/ao/quantization/fx/qconfig_mapping_utils.py).
# The old 'fbgemm' is still available but 'x86' is the recommended default.
qconfig = get_default_qconfig("x86")
qconfig_mapping = QConfigMapping().set_global(qconfig)
5. Prepare the Model for Post Training Static Quantization¶
prepared_model = prepare_fx(model_to_quantize, qconfig_mapping, example_inputs)
prepare_fx folds BatchNorm modules into the preceding Conv2d modules and inserts observers at the appropriate places in the model.
prepared_model = prepare_fx(model_to_quantize, qconfig_mapping, example_inputs)
print(prepared_model.graph)
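On ResNet-18 the printed graph is long, so here is a smaller sketch of the same step, assuming a hypothetical three-layer toy model (ConvBnRelu) with random example inputs. It checks the two effects described above: no BatchNorm2d module survives preparation, and observer modules (named activation_post_process_N by FX graph mode quantization) appear as call_module nodes in the graph.

```python
import torch
import torch.nn as nn
from torch.ao.quantization import QConfigMapping, get_default_qconfig
from torch.ao.quantization.quantize_fx import prepare_fx


class ConvBnRelu(nn.Module):
    """Hypothetical toy model with a fusable conv -> bn -> relu sequence."""

    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(3, 4, kernel_size=3)
        self.bn = nn.BatchNorm2d(4)
        self.relu = nn.ReLU()

    def forward(self, x):
        return self.relu(self.bn(self.conv(x)))


toy = ConvBnRelu().eval()  # fusion for post training quantization needs eval mode
qconfig_mapping = QConfigMapping().set_global(get_default_qconfig("x86"))
example_inputs = (torch.randn(1, 3, 8, 8),)

prepared = prepare_fx(toy, qconfig_mapping, example_inputs)

# BatchNorm has been folded into the convolution, so no BatchNorm2d should remain.
has_batchnorm = any(isinstance(m, nn.BatchNorm2d) for m in prepared.modules())

# Observers are inserted as call_module nodes in the graph.
observer_nodes = [
    str(node.target)
    for node in prepared.graph.nodes
    if node.op == "call_module" and str(node.target).startswith("activation_post_process")
]

print("BatchNorm still present:", has_batchnorm)
print("Observer nodes:", observer_nodes)
```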
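6. Calibration¶
The calibration function is run after the observers are inserted in the model. The purpose of calibration is to run through some sample examples that are representative of the workload (for example a sample of the training data set) so that the observers in the model can observe the statistics of the tensors; this information is used later to calculate the quantization parameters.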
def calibrate(model, data_loader):
    model.eval()
    with torch.no_grad():
        for image, target in data_loader:
            model(image)
calibrate(prepared_model, data_loader_test)  # run calibration on sample data
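To illustrate how observed statistics turn into quantization parameters, the sketch below derives a scale and zero point from a minimum and maximum value using plain affine (asymmetric) 8-bit quantization. This is a simplified stand-in written for illustration: the function names are hypothetical, and the observers in the default qconfig use more elaborate statistics (for example histograms) than a simple min/max.

```python
def affine_qparams(min_val, max_val, qmin=0, qmax=255):
    """Scale and zero point mapping [min_val, max_val] onto [qmin, qmax]."""
    # Extend the range to include zero so that 0.0 is exactly representable.
    min_val = min(min_val, 0.0)
    max_val = max(max_val, 0.0)
    scale = (max_val - min_val) / (qmax - qmin)
    if scale == 0.0:  # degenerate case: every observed value was zero
        return 1.0, 0
    zero_point = qmin - round(min_val / scale)
    zero_point = max(qmin, min(qmax, zero_point))
    return scale, zero_point


def quantize(x, scale, zero_point, qmin=0, qmax=255):
    """Map a float to its clamped integer representation."""
    return max(qmin, min(qmax, round(x / scale) + zero_point))


def dequantize(q, scale, zero_point):
    """Map an integer representation back to a float."""
    return (q - zero_point) * scale


# Pretend these activation values were seen during calibration.
observed = [-1.0, 0.5, 3.0]
scale, zero_point = affine_qparams(min(observed), max(observed))
print("scale:", scale, "zero_point:", zero_point)

for x in observed:
    q = quantize(x, scale, zero_point)
    print(x, "->", q, "->", dequantize(q, scale, zero_point))
```

With an observed range of [-1.0, 3.0] the scale is 4/255 and the zero point is 64, and each value is recovered to within half a quantization step. This is why calibration data should cover the range of values the model sees in production.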
7. Convert the Model to a Quantized Model¶
convert_fx takes a calibrated model and produces a quantized model.
quantized_model = convert_fx(prepared_model)
print(quantized_model)
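For a version of the full prepare, calibrate and convert flow that runs in a few seconds without ImageNet, consider the sketch below. It assumes a hypothetical toy model (ConvRelu), random tensors as calibration data, and it picks the qconfig for whichever quantized engine is active in the local PyTorch build; none of these choices come from the tutorial itself.

```python
import torch
import torch.nn as nn
from torch.ao.quantization import QConfigMapping, get_default_qconfig
from torch.ao.quantization.quantize_fx import prepare_fx, convert_fx

torch.manual_seed(0)


class ConvRelu(nn.Module):
    """Hypothetical toy model used only for this illustration."""

    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(3, 4, kernel_size=3)
        self.relu = nn.ReLU()

    def forward(self, x):
        return self.relu(self.conv(x))


toy = ConvRelu().eval()

# Assumption: reuse the quantized engine this build already selected.
backend = torch.backends.quantized.engine
qconfig_mapping = QConfigMapping().set_global(get_default_qconfig(backend))
example_inputs = (torch.randn(1, 3, 8, 8),)

prepared = prepare_fx(toy, qconfig_mapping, example_inputs)

# Calibrate on random data (a real workflow would use representative samples).
with torch.no_grad():
    for _ in range(4):
        prepared(torch.randn(1, 3, 8, 8))

quantized = convert_fx(prepared)

with torch.no_grad():
    out = quantized(example_inputs[0])

# The converted model still takes and returns float tensors;
# the weights of the convolution are stored in quantized form.
weight = quantized.conv.weight()
print(out.shape, out.dtype, weight.is_quantized)
```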
8. Evaluation¶
We can now print the size and accuracy of the quantized model.
print("Size of model before quantization")
print_size_of_model(float_model)
print("Size of model after quantization")
print_size_of_model(quantized_model)
top1, top5 = evaluate(quantized_model, criterion, data_loader_test)
print("[before serialization] Evaluation accuracy on test dataset: %2.2f, %2.2f"%(top1.avg, top5.avg))
fx_graph_mode_model_file_path = saved_model_dir + "resnet18_fx_graph_mode_quantized.pth"
# this does not run due to some errors loading convrelu module:
# ModuleAttributeError: 'ConvReLU2d' object has no attribute '_modules'
# save the whole model directly
# torch.save(quantized_model, fx_graph_mode_model_file_path)
# loaded_quantized_model = torch.load(fx_graph_mode_model_file_path, weights_only=False)
# save with state_dict
# torch.save(quantized_model.state_dict(), fx_graph_mode_model_file_path)
# import copy
# model_to_quantize = copy.deepcopy(float_model)
# prepared_model = prepare_fx(model_to_quantize, qconfig_mapping, example_inputs)
# loaded_quantized_model = convert_fx(prepared_model)
# loaded_quantized_model.load_state_dict(torch.load(fx_graph_mode_model_file_path, weights_only=True))
# save with script
torch.jit.save(torch.jit.script(quantized_model), fx_graph_mode_model_file_path)
loaded_quantized_model = torch.jit.load(fx_graph_mode_model_file_path)
top1, top5 = evaluate(loaded_quantized_model, criterion, data_loader_test)
print("[after serialization/deserialization] Evaluation accuracy on test dataset: %2.2f, %2.2f"%(top1.avg, top5.avg))
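The TorchScript save and load round trip above can be tried on a small scale with the following sketch. It assumes a hypothetical toy model (ConvRelu), random calibration data, the qconfig of the locally active quantized engine, and an in-memory buffer in place of a file on disk; it then checks that the reloaded model reproduces the original outputs.

```python
import io

import torch
import torch.nn as nn
from torch.ao.quantization import QConfigMapping, get_default_qconfig
from torch.ao.quantization.quantize_fx import prepare_fx, convert_fx

torch.manual_seed(0)


class ConvRelu(nn.Module):
    """Hypothetical toy model used only for this illustration."""

    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(3, 4, kernel_size=3)
        self.relu = nn.ReLU()

    def forward(self, x):
        return self.relu(self.conv(x))


# Assumption: use the qconfig matching the quantized engine of this build.
backend = torch.backends.quantized.engine
qconfig_mapping = QConfigMapping().set_global(get_default_qconfig(backend))
sample = torch.randn(1, 3, 8, 8)

prepared = prepare_fx(ConvRelu().eval(), qconfig_mapping, (sample,))
with torch.no_grad():
    prepared(sample)  # minimal calibration pass on random data
quantized = convert_fx(prepared)

# Script the quantized model and serialize it into an in-memory buffer.
buffer = io.BytesIO()
torch.jit.save(torch.jit.script(quantized), buffer)
buffer.seek(0)
loaded = torch.jit.load(buffer)

with torch.no_grad():
    out_before = quantized(sample)
    out_after = loaded(sample)

print("outputs identical:", torch.equal(out_before, out_after))
```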
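If you want to get better accuracy or performance, try changing the qconfig_mapping. We plan to add support for graph mode in the Numerical Suite so that you can easily determine how sensitive different modules in a model are to quantization. For more information, see the PyTorch Numerical Suite Tutorial.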
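9. Debugging Quantized Model¶
We can also print the weights of the quantized and non-quantized convolution ops to see the difference. We'll first call fuse explicitly to fuse the convolution and batch norm in the model. Note that fuse_fx only works in eval mode.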
fused = fuse_fx(float_model)
conv1_weight_after_fuse = fused.conv1[0].weight[0]
conv1_weight_after_quant = quantized_model.conv1.weight().dequantize()[0]
print(torch.max(abs(conv1_weight_after_fuse - conv1_weight_after_quant)))
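To get a feel for how large such a weight difference should be, the sketch below quantizes a random weight tensor to int8 and measures the round trip error. It is a simplified, assumption-laden illustration: it uses per-tensor quantization with a hand-picked scale, whereas the default qconfig in this tutorial uses per-channel weight observers, and the tensor is random rather than taken from ResNet-18.

```python
import torch

torch.manual_seed(0)

# Random stand-in for a convolution weight, with values in [-1, 1).
weight = torch.rand(4, 3, 3, 3) * 2 - 1

# Symmetric int8 quantization: zero point 0, scale chosen so that 1.0 maps to 127.
scale = 1.0 / 127
q_weight = torch.quantize_per_tensor(weight, scale=scale, zero_point=0, dtype=torch.qint8)

# Dequantize and compare with the original float weight.
max_error = (weight - q_weight.dequantize()).abs().max().item()
print("max abs error:", max_error)
print("half a quantization step:", scale / 2)
```

As long as no value is clipped, rounding to the nearest integer step keeps the error at or below half the scale. A difference much larger than that in a real model usually points at clipping or at a mismatch between the tensors being compared.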
10. Comparison with Baseline Float Model and Eager Mode Quantization¶
scripted_float_model_file = "resnet18_scripted.pth"
print("Size of baseline model")
print_size_of_model(float_model)
top1, top5 = evaluate(float_model, criterion, data_loader_test)
print("Baseline Float Model Evaluation accuracy: %2.2f, %2.2f"%(top1.avg, top5.avg))
torch.jit.save(torch.jit.script(float_model), saved_model_dir + scripted_float_model_file)
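In this section, we compare the model quantized with FX graph mode quantization against the model quantized in eager mode. FX graph mode and eager mode produce very similar quantized models, so the accuracy and speedup are expected to be similar as well.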
print("Size of Fx graph mode quantized model")
print_size_of_model(quantized_model)
top1, top5 = evaluate(quantized_model, criterion, data_loader_test)
print("FX graph mode quantized model Evaluation accuracy on test dataset: %2.2f, %2.2f"%(top1.avg, top5.avg))
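from torchvision.models.quantization import resnet18, ResNet18_QuantizedWeights
# `weights=` replaces the deprecated `pretrained=True` argument in recent torchvision releases
eager_quantized_model = resnet18(weights=ResNet18_QuantizedWeights.DEFAULT, quantize=True).eval()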
print("Size of eager mode quantized model")
eager_quantized_model = torch.jit.script(eager_quantized_model)
print_size_of_model(eager_quantized_model)
top1, top5 = evaluate(eager_quantized_model, criterion, data_loader_test)
print("eager mode quantized model Evaluation accuracy on test dataset: %2.2f, %2.2f"%(top1.avg, top5.avg))
eager_mode_model_file = "resnet18_eager_mode_quantized.pth"
torch.jit.save(eager_quantized_model, saved_model_dir + eager_mode_model_file)
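We can see that the model size and accuracy of the FX graph mode and eager mode quantized models are very similar.
Running the models in AIBench (with a single thread) gives the following results: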
Scripted Float Model:
Self CPU time total: 192.48ms
Scripted Eager Mode Quantized Model:
Self CPU time total: 50.76ms
Scripted FX Graph Mode Quantized Model:
Self CPU time total: 50.63ms
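As we can see, for resnet18 both the FX graph mode and eager mode quantized models get a similar speedup over the floating point model, roughly 2-4x in general and about 3.8x in the measurements above. However, the actual speedup over the floating point model may vary depending on the model, device, build, input batch size, threading, etc.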
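The speedup figures follow directly from the AIBench timings listed above, as this short calculation shows. The dictionary keys are labels chosen for this sketch; the numbers are the ones reported in this section.

```python
# Self CPU time totals reported above, in milliseconds.
timings_ms = {
    "float": 192.48,
    "eager_quantized": 50.76,
    "fx_quantized": 50.63,
}

# Speedup of each quantized model relative to the scripted float model.
speedups = {
    name: timings_ms["float"] / t
    for name, t in timings_ms.items()
    if name != "float"
}

for name, factor in speedups.items():
    print(f"{name}: {factor:.2f}x")
# eager_quantized: 3.79x
# fx_quantized: 3.80x
```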