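(prototype) FX Graph Mode Post Training Static Quantization
Author: Jerry Zhang Edited by: Charles Hernandez
This tutorial introduces the steps to do post training static quantization in graph mode based on torch.fx. The advantage of FX graph mode quantization is that quantization can be performed on the model fully automatically. Some effort may be needed to make the model compatible with FX Graph Mode Quantization (that is, symbolically traceable with torch.fx); a separate tutorial will show how to make the part of the model we want to quantize compatible with FX Graph Mode Quantization. We also have a tutorial for FX Graph Mode Post Training Dynamic Quantization. In short, the FX Graph Mode API looks like the following: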
import torch
from torch.ao.quantization import get_default_qconfig
from torch.ao.quantization.quantize_fx import prepare_fx, convert_fx
from torch.ao.quantization import QConfigMapping
float_model.eval()
# The old 'fbgemm' is still available but 'x86' is the recommended default.
qconfig = get_default_qconfig("x86")
qconfig_mapping = QConfigMapping().set_global(qconfig)
def calibrate(model, data_loader):
    model.eval()
    with torch.no_grad():
        for image, target in data_loader:
            model(image)
example_inputs = (next(iter(data_loader))[0],) # get an example input; prepare_fx expects a tuple
prepared_model = prepare_fx(float_model, qconfig_mapping, example_inputs) # fuse modules and insert observers
calibrate(prepared_model, data_loader_test) # run calibration on sample data
quantized_model = convert_fx(prepared_model) # convert the calibrated model to a quantized model
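The flow above depends on a float_model and an ImageNet data loader that are only defined later. As a rough sketch of the same prepare, calibrate, convert sequence on something that runs without any dataset, the block below uses a hypothetical TinyNet module and random tensors as calibration data. Both are assumptions made for illustration and are not part of the tutorial; the backend selection also assumes that the PyTorch build supports at least one of the listed quantized engines.

```python
import copy

import torch
import torch.nn as nn
from torch.ao.quantization import QConfigMapping, get_default_qconfig
from torch.ao.quantization.quantize_fx import convert_fx, prepare_fx


class TinyNet(nn.Module):
    """Hypothetical stand-in for float_model (not the tutorial's resnet18)."""

    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(3, 8, kernel_size=3, padding=1)
        self.bn = nn.BatchNorm2d(8)
        self.relu = nn.ReLU()
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.fc = nn.Linear(8, 4)

    def forward(self, x):
        x = self.relu(self.bn(self.conv(x)))
        x = torch.flatten(self.pool(x), 1)
        return self.fc(x)


torch.manual_seed(0)
float_model = TinyNet().eval()

# Random tensors stand in for calibration data (assumption; use real samples in practice).
calibration_batches = [torch.randn(4, 3, 16, 16) for _ in range(8)]
example_inputs = (calibration_batches[0],)

# Pick a quantized engine supported by this build (assumption: one of these exists).
supported = torch.backends.quantized.supported_engines
backend = next(name for name in ("x86", "fbgemm", "qnnpack") if name in supported)
torch.backends.quantized.engine = backend

qconfig_mapping = QConfigMapping().set_global(get_default_qconfig(backend))

# Work on a copy so the original float model stays available for comparison.
prepared = prepare_fx(copy.deepcopy(float_model), qconfig_mapping, example_inputs)

# Calibration: run sample data so the inserted observers can record statistics.
with torch.no_grad():
    for batch in calibration_batches:
        prepared(batch)

quantized = convert_fx(prepared)

with torch.no_grad():
    out_float = float_model(calibration_batches[0])
    out_quant = quantized(calibration_batches[0])

print("max abs difference:", (out_float - out_quant).abs().max().item())
```

The printed difference depends on the backend and the random data, so no particular value should be expected.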
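1. Motivation of FX Graph Mode Quantization
Currently, PyTorch only has eager mode quantization as an alternative: Static Quantization with Eager Mode in PyTorch.
We can see that the eager mode quantization process involves multiple manual steps, including:
- Explicitly quantizing and dequantizing activations, which is time consuming when floating point and quantized operations are mixed in a model.
- Explicitly fusing modules, which requires manually identifying sequences of convolutions, batch norms, relus and other fusion patterns.
- Special handling is needed for PyTorch tensor operations (such as add, concat, etc.).
- Functionals do not have first class support (functional.conv2d and functional.linear would not get quantized).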
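Most of these required modifications come from the underlying limitations of eager mode quantization. Eager mode works at the module level because it cannot inspect the code that is actually run (in the forward function). Quantization is achieved by module swapping, and since we do not know how the modules are used in the forward function in eager mode, users must insert QuantStub and DeQuantStub manually to mark the points they want to quantize or dequantize. In graph mode, we can inspect the code that is actually executed in the forward function (e.g. aten function calls), and quantization is achieved by module and graph manipulations. Since graph mode has full visibility of the code that is run, our tool can automatically figure out things such as which modules to fuse and where to insert observer calls and quantize/dequantize functions, so the whole quantization process can be automated.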
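Advantages of FX Graph Mode Quantization are:
- Simple quantization flow, minimal manual steps
- Unlocks the possibility of higher level optimizations such as automatic precision selection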
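To make the "full visibility" point concrete, here is a small sketch of what torch.fx records when it symbolically traces a forward function. The Block module is a hypothetical example, not part of the tutorial. The trace exposes the module call, the functional call and the tensor add as separate graph nodes, which is the kind of information eager mode quantization does not have.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.fx import symbolic_trace


class Block(nn.Module):
    """Hypothetical module mixing a module call, a functional and a tensor op."""

    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(3, 3, kernel_size=1)

    def forward(self, x):
        return F.relu(self.conv(x)) + x


traced = symbolic_trace(Block())

# Each node records what kind of operation it is and what it calls.
for node in traced.graph.nodes:
    print(node.op, node.target)

node_ops = [node.op for node in traced.graph.nodes]
```

The functional relu and the add show up as call_function nodes, so a graph-level tool can treat them like any other operation instead of requiring special handling from the user.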
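2. Define Helper Functions and Prepare Dataset
We'll start with the necessary imports, define some helper functions and prepare the data. These steps are identical to Static Quantization with Eager Mode in PyTorch.
To run the code in this tutorial using the entire ImageNet dataset, first download imagenet by following the instructions in ImageNet Data. Unzip the downloaded file into the "data_path" folder.
Download the torchvision resnet18 model and rename it to data/resnet18_pretrained_float.pth.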
import os
import sys
import time
import numpy as np
import torch
from torch.ao.quantization import get_default_qconfig, QConfigMapping
from torch.ao.quantization.quantize_fx import prepare_fx, convert_fx, fuse_fx
import torch.nn as nn
from torch.utils.data import DataLoader
import torchvision
from torchvision import datasets
from torchvision.models.resnet import resnet18
import torchvision.transforms as transforms
# Set up warnings
import warnings
warnings.filterwarnings(
    action='ignore',
    category=DeprecationWarning,
    module=r'.*'
)
warnings.filterwarnings(
    action='default',
    module=r'torch.ao.quantization'
)
# Specify random seed for repeatable results
_ = torch.manual_seed(191009)
class AverageMeter(object):
    """Computes and stores the average and current value"""
    def __init__(self, name, fmt=':f'):
        self.name = name
        self.fmt = fmt
        self.reset()

    def reset(self):
        self.val = 0
        self.avg = 0
        self.sum = 0
        self.count = 0

    def update(self, val, n=1):
        self.val = val
        self.sum += val * n
        self.count += n
        self.avg = self.sum / self.count

    def __str__(self):
        fmtstr = '{name} {val' + self.fmt + '} ({avg' + self.fmt + '})'
        return fmtstr.format(**self.__dict__)
def accuracy(output, target, topk=(1,)):
    """Computes the accuracy over the k top predictions for the specified values of k"""
    with torch.no_grad():
        maxk = max(topk)
        batch_size = target.size(0)
        _, pred = output.topk(maxk, 1, True, True)
        pred = pred.t()
        correct = pred.eq(target.view(1, -1).expand_as(pred))
        res = []
        for k in topk:
            correct_k = correct[:k].reshape(-1).float().sum(0, keepdim=True)
            res.append(correct_k.mul_(100.0 / batch_size))
        return res
def evaluate(model, criterion, data_loader):
    model.eval()
    top1 = AverageMeter('Acc@1', ':6.2f')
    top5 = AverageMeter('Acc@5', ':6.2f')
    cnt = 0
    with torch.no_grad():
        for image, target in data_loader:
            output = model(image)
            loss = criterion(output, target)
            cnt += 1
            acc1, acc5 = accuracy(output, target, topk=(1, 5))
            top1.update(acc1[0], image.size(0))
            top5.update(acc5[0], image.size(0))
    print('')
    return top1, top5
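def load_model(model_file):
    # weights=None builds the architecture without loading pretrained weights
    # (replaces the deprecated pretrained=False argument)
    model = resnet18(weights=None)
    state_dict = torch.load(model_file, weights_only=True)
    model.load_state_dict(state_dict)
    model.to("cpu")
    return model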
def print_size_of_model(model):
    if isinstance(model, torch.jit.RecursiveScriptModule):
        torch.jit.save(model, "temp.p")
    else:
        torch.jit.save(torch.jit.script(model), "temp.p")
    print("Size (MB):", os.path.getsize("temp.p")/1e6)
    os.remove("temp.p")
def prepare_data_loaders(data_path):
    normalize = transforms.Normalize(mean=[0.485, 0.456, 0.406],
                                     std=[0.229, 0.224, 0.225])
    dataset = torchvision.datasets.ImageNet(
        data_path, split="train", transform=transforms.Compose([
            transforms.RandomResizedCrop(224),
            transforms.RandomHorizontalFlip(),
            transforms.ToTensor(),
            normalize,
        ]))
    dataset_test = torchvision.datasets.ImageNet(
        data_path, split="val", transform=transforms.Compose([
            transforms.Resize(256),
            transforms.CenterCrop(224),
            transforms.ToTensor(),
            normalize,
        ]))
    train_sampler = torch.utils.data.RandomSampler(dataset)
    test_sampler = torch.utils.data.SequentialSampler(dataset_test)
    data_loader = torch.utils.data.DataLoader(
        dataset, batch_size=train_batch_size,
        sampler=train_sampler)
    data_loader_test = torch.utils.data.DataLoader(
        dataset_test, batch_size=eval_batch_size,
        sampler=test_sampler)
    return data_loader, data_loader_test
data_path = '~/.data/imagenet'
saved_model_dir = 'data/'
float_model_file = 'resnet18_pretrained_float.pth'
train_batch_size = 30
eval_batch_size = 50
data_loader, data_loader_test = prepare_data_loaders(data_path)
example_inputs = (next(iter(data_loader))[0],)  # prepare_fx expects a tuple of example inputs
criterion = nn.CrossEntropyLoss()
float_model = load_model(saved_model_dir + float_model_file).to("cpu")
float_model.eval()
# create another instance of the model since
# we need to keep the original model around
model_to_quantize = load_model(saved_model_dir + float_model_file).to("cpu")
# post training quantization requires the model to be in eval mode
model_to_quantize.eval()
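4. Specify how to quantize the model with QConfigMapping
qconfig_mapping = QConfigMapping().set_global(default_qconfig)
We use the same qconfig as in eager mode quantization; a qconfig is just a named tuple of the observers for activation and weight. QConfigMapping contains the mapping information from ops to qconfigs: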
qconfig_mapping = (QConfigMapping()
    .set_global(qconfig_opt)  # qconfig_opt is an optional qconfig, either a valid qconfig or None
    .set_object_type(torch.nn.Conv2d, qconfig_opt)  # can be a callable...
    .set_object_type("reshape", qconfig_opt)  # ...or a string of the method
    .set_module_name_regex("foo.*bar.*conv[0-9]+", qconfig_opt)  # matched in order, first match takes precedence
    .set_module_name("foo.bar", qconfig_opt)
    # arguments: module name, object type, index of that object type within the module, qconfig
    .set_module_name_object_type_order("foo.bar", torch.nn.functional.linear, 0, qconfig_opt)
)
# priority (in increasing order): global, object_type, module_name_regex, module_name
# qconfig == None means fusion and quantization should be skipped for anything
# matching the rule (unless a higher priority match is found)
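The priority rule in the comments above can be illustrated with a toy resolver. This is a minimal sketch of the lookup order only, not PyTorch's implementation: resolve_qconfig, the rules dictionary and the string stand-ins for qconfigs are all hypothetical names introduced for illustration.

```python
import re


def resolve_qconfig(module_name, object_type, rules):
    """Return the qconfig of the highest-priority matching rule (toy model)."""
    # Highest priority: exact module name.
    if module_name in rules["module_name"]:
        return rules["module_name"][module_name]
    # Next: regex rules, checked in order; the first match wins.
    for pattern, qconfig in rules["module_name_regex"]:
        if re.match(pattern, module_name):
            return qconfig
    # Next: object type.
    if object_type in rules["object_type"]:
        return rules["object_type"][object_type]
    # Lowest priority: the global default.
    return rules["global"]


# Strings stand in for real qconfig objects; None means "skip quantization".
rules = {
    "global": "int8_default",
    "object_type": {"Conv2d": "int8_per_channel"},
    "module_name_regex": [("foo.*bar.*conv[0-9]+", None)],
    "module_name": {"foo.bar": "int8_custom"},
}

by_name = resolve_qconfig("foo.bar", "Linear", rules)
by_regex = resolve_qconfig("foo.x.bar.conv1", "Conv2d", rules)
by_type = resolve_qconfig("layer1.conv", "Conv2d", rules)
by_global = resolve_qconfig("fc", "Linear", rules)

print(by_name, by_regex, by_type, by_global)
```

In this sketch the regex rule maps to None, so the matching convolution is skipped even though a Conv2d object-type rule exists, which mirrors the "unless a higher priority match is found" note.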
Utility functions related to qconfig can be found in the qconfig file, while those for QConfigMapping can be found in qconfig_mapping (https://github.com/pytorch/pytorch/blob/master/torch/ao/quantization/fx/qconfig_mapping.py).
# The old 'fbgemm' is still available but 'x86' is the recommended default.
qconfig = get_default_qconfig("x86")
qconfig_mapping = QConfigMapping().set_global(qconfig)
5. Prepare the Model for Post Training Static Quantization
prepare_fx folds BatchNorm modules into the preceding Conv2d modules and inserts observers at the appropriate places in the model.
prepared_model = prepare_fx(model_to_quantize, qconfig_mapping, example_inputs)
print(prepared_model.graph)
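The BatchNorm folding mentioned above rests on a simple identity: in eval mode BatchNorm is an affine transform with fixed statistics, so it can be absorbed into the weight and bias of the preceding convolution. The sketch below checks that arithmetic for a single scalar channel. The fold_bn helper and the numeric values are hypothetical and only illustrate the formula; the real folding applies one such scale per output channel.

```python
import math


def fold_bn(weight, bias, gamma, beta, mean, var, eps=1e-5):
    """Fold eval-mode BatchNorm parameters into a preceding weight and bias."""
    scale = gamma / math.sqrt(var + eps)
    return weight * scale, (bias - mean) * scale + beta


# Hypothetical single-channel parameters.
w, b = 0.5, 0.1
gamma, beta, mean, var, eps = 2.0, -1.0, 0.3, 4.0, 1e-5

w_folded, b_folded = fold_bn(w, b, gamma, beta, mean, var, eps)

x = 1.7
# Unfused: convolution (here a scalar multiply) followed by BatchNorm.
unfused = gamma * ((w * x + b) - mean) / math.sqrt(var + eps) + beta
# Fused: a single multiply-add with the folded parameters.
fused = w_folded * x + b_folded

print(unfused, fused)
```

Both paths give the same value up to floating point rounding, which is why the fusion does not change the model's output before quantization.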
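6. Calibration
The calibration function is run after the observers are inserted in the model. The purpose of calibration is to run some examples that are representative of the workload (for example, a sample of the training data set) so that the observers in the model can record the statistics of the Tensors. We later use this information to calculate quantization parameters.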
def calibrate(model, data_loader):
    model.eval()
    with torch.no_grad():
        for image, target in data_loader:
            model(image)
calibrate(prepared_model, data_loader_test) # run calibration on sample data
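To show how observed statistics can turn into quantization parameters, here is a toy min/max observer for an unsigned 8-bit affine scheme. It is a simplified sketch: ToyMinMaxObserver and the helper functions are hypothetical, and the observers selected by the default qconfig may use a different strategy (for example histogram based) than plain min/max tracking.

```python
class ToyMinMaxObserver:
    """Tracks the running min and max of every value it sees (toy model)."""

    def __init__(self):
        self.min_val = float("inf")
        self.max_val = float("-inf")

    def observe(self, values):
        self.min_val = min(self.min_val, min(values))
        self.max_val = max(self.max_val, max(values))

    def calculate_qparams(self, qmin=0, qmax=255):
        # Extend the range to include zero so that 0.0 is exactly representable.
        lo = min(self.min_val, 0.0)
        hi = max(self.max_val, 0.0)
        scale = (hi - lo) / (qmax - qmin)
        zero_point = qmin - round(lo / scale)
        return scale, max(qmin, min(qmax, zero_point))


def quantize(x, scale, zero_point, qmin=0, qmax=255):
    return max(qmin, min(qmax, round(x / scale) + zero_point))


def dequantize(q, scale, zero_point):
    return (q - zero_point) * scale


observer = ToyMinMaxObserver()
# "Calibration": feed a few batches of made-up activation values.
for batch in ([-1.0, 0.5, 2.0], [0.25, 3.0, -0.5]):
    observer.observe(batch)

scale, zero_point = observer.calculate_qparams()
restored = dequantize(quantize(0.5, scale, zero_point), scale, zero_point)
print(scale, zero_point, restored)
```

With the observed range [-1.0, 3.0], the scale is 4/255 and the zero point is 64; a value inside the range comes back with an error of at most half a quantization step.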
7. Convert the Model to a Quantized Model
convert_fx takes a calibrated model and produces a quantized model.
quantized_model = convert_fx(prepared_model)
print(quantized_model)
8. Evaluation
We can now print the size and accuracy of the quantized model.
print("Size of model before quantization")
print_size_of_model(float_model)
print("Size of model after quantization")
print_size_of_model(quantized_model)
top1, top5 = evaluate(quantized_model, criterion, data_loader_test)
print("[before serialization] Evaluation accuracy on test dataset: %2.2f, %2.2f"%(top1.avg, top5.avg))
fx_graph_mode_model_file_path = saved_model_dir + "resnet18_fx_graph_mode_quantized.pth"
# this does not run due to some errors loading convrelu module:
# ModuleAttributeError: 'ConvReLU2d' object has no attribute '_modules'
# save the whole model directly
# torch.save(quantized_model, fx_graph_mode_model_file_path)
# loaded_quantized_model = torch.load(fx_graph_mode_model_file_path, weights_only=False)
# save with state_dict
# torch.save(quantized_model.state_dict(), fx_graph_mode_model_file_path)
# import copy
# model_to_quantize = copy.deepcopy(float_model)
# prepared_model = prepare_fx(model_to_quantize, qconfig_mapping, example_inputs)
# loaded_quantized_model = convert_fx(prepared_model)
# loaded_quantized_model.load_state_dict(torch.load(fx_graph_mode_model_file_path, weights_only=True))
# save with script
torch.jit.save(torch.jit.script(quantized_model), fx_graph_mode_model_file_path)
loaded_quantized_model = torch.jit.load(fx_graph_mode_model_file_path)
top1, top5 = evaluate(loaded_quantized_model, criterion, data_loader_test)
print("[after serialization/deserialization] Evaluation accuracy on test dataset: %2.2f, %2.2f"%(top1.avg, top5.avg))
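If you want better accuracy or performance, try changing the qconfig_mapping. We plan to add support for graph mode in the Numerical Suite so that you can easily determine how sensitive different modules in a model are to quantization. For more information, see the PyTorch Numerical Suite tutorial.
9. Debugging Quantized Model
We can also print the weights of the quantized and non-quantized convolution ops to see the difference. We first call fuse explicitly to fuse the convolution and batch norm in the model. Note that fuse_fx only works in eval mode.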
fused = fuse_fx(float_model)
conv1_weight_after_fuse = fused.conv1[0].weight[0]
conv1_weight_after_quant = quantized_model.conv1.weight().dequantize()[0]
print(torch.max(abs(conv1_weight_after_fuse - conv1_weight_after_quant)))
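As a rough guide to what size of difference to expect from the comparison above, the sketch below quantizes and dequantizes two made-up weight channels with a symmetric signed 8-bit scheme, using one scale per channel. The helper name and the weight values are hypothetical, and the exact scheme used by a given backend may differ; the point is only that the round-trip error per weight is bounded by half a quantization step of its channel.

```python
def quantize_dequantize_channel(weights, qmax=127):
    """Symmetric int8 round trip for one channel; returns (restored, scale)."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    restored = [max(-qmax - 1, min(qmax, round(w / scale))) * scale for w in weights]
    return restored, scale


# Two hypothetical output channels with very different weight ranges.
channels = [[0.02, -0.01, 0.005], [1.5, -0.75, 0.3]]

scales, max_errors = [], []
for weights in channels:
    restored, scale = quantize_dequantize_channel(weights)
    scales.append(scale)
    max_errors.append(max(abs(a - b) for a, b in zip(weights, restored)))

for scale, err in zip(scales, max_errors):
    print(f"scale={scale:.6f} max_abs_error={err:.6f}")
```

A channel with small weights gets a small scale and therefore a small absolute error, which is one reason per-channel scales are attractive for convolution weights.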
10. Comparison with Baseline Float Model and Eager Mode Quantization
scripted_float_model_file = "resnet18_scripted.pth"
print("Size of baseline model")
print_size_of_model(float_model)
top1, top5 = evaluate(float_model, criterion, data_loader_test)
print("Baseline Float Model Evaluation accuracy: %2.2f, %2.2f"%(top1.avg, top5.avg))
torch.jit.save(torch.jit.script(float_model), saved_model_dir + scripted_float_model_file)
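In this section, we compare the model quantized with FX graph mode quantization to the model quantized in eager mode. FX graph mode and eager mode produce very similar quantized models, so the accuracy and speedup are expected to be similar as well.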
print("Size of Fx graph mode quantized model")
print_size_of_model(quantized_model)
top1, top5 = evaluate(quantized_model, criterion, data_loader_test)
print("FX graph mode quantized model Evaluation accuracy on test dataset: %2.2f, %2.2f"%(top1.avg, top5.avg))
from torchvision.models.quantization.resnet import resnet18
eager_quantized_model = resnet18(pretrained=True, quantize=True).eval()
print("Size of eager mode quantized model")
eager_quantized_model = torch.jit.script(eager_quantized_model)
print_size_of_model(eager_quantized_model)
top1, top5 = evaluate(eager_quantized_model, criterion, data_loader_test)
print("eager mode quantized model Evaluation accuracy on test dataset: %2.2f, %2.2f"%(top1.avg, top5.avg))
eager_mode_model_file = "resnet18_eager_mode_quantized.pth"
torch.jit.save(eager_quantized_model, saved_model_dir + eager_mode_model_file)
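We can see that the model size and accuracy of the FX graph mode and eager mode quantized models are very similar.
Running the models in AIBench (with a single thread) gives the following results: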
Scripted Float Model:
Self CPU time total: 192.48ms
Scripted Eager Mode Quantized Model:
Self CPU time total: 50.76ms
Scripted FX Graph Mode Quantized Model:
Self CPU time total: 50.63ms
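As we can see, for resnet18 both the FX graph mode and the eager mode quantized models get a similar speedup over the floating point model, in the range of 2-4x (about 3.8x in the measurements above). The actual speedup over the floating point model may vary depending on the model, device, build, input batch size, threading, etc.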
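The speedup figures follow directly from the reported timings. A short calculation, using only the numbers listed above (the dictionary keys are labels chosen here for illustration):

```python
# Self CPU time totals reported above, in milliseconds.
timings_ms = {
    "float": 192.48,
    "eager_quantized": 50.76,
    "fx_quantized": 50.63,
}

speedups = {
    name: timings_ms["float"] / t
    for name, t in timings_ms.items()
    if name != "float"
}

for name, speedup in speedups.items():
    print(f"{name}: {speedup:.2f}x")
# eager_quantized: 3.79x
# fx_quantized: 3.80x
```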