Optimizing Vision Transformer Model for Deployment

Created On: Mar 15, 2021 | Last Updated: Jan 19, 2024 | Last Verified: Nov 05, 2024

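Vision Transformer models apply attention-based transformer models, which were introduced in natural language processing and achieved a range of state-of-the-art (SOTA) results there, to computer vision tasks. Facebook's Data-efficient Image Transformer (DeiT) is a Vision Transformer model trained on ImageNet for image classification.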

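In this tutorial, we first cover what DeiT is and how to use it, then walk through the complete steps of scripting, quantizing, and optimizing the model and using it in iOS and Android apps. We also compare the performance of the quantized, optimized model with that of the non-quantized, non-optimized model, and show the benefits that quantization and optimization bring along the way.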

What is DeiT

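Convolutional neural networks (CNNs) have been the main models for image classification since deep learning took off in 2012, but CNNs typically require hundreds of millions of images for training to achieve SOTA results. DeiT is a Vision Transformer model that needs far less data and computing resources for training, yet competes with the leading CNNs at image classification. This is made possible by two key components of DeiT: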

Data augmentation that simulates training on a much larger dataset;
Native distillation that allows the transformer network to learn from a CNN's output.
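To make the second component more concrete, here is a minimal sketch of a hard-label distillation loss in plain Python. It assumes the formulation described in the DeiT paper: the class-token head is trained against the true label, the distillation-token head against the label predicted by the CNN teacher, and the two terms are weighted equally. The function names and the toy logits are made up for illustration; this is not the DeiT training code.

```python
import math

def cross_entropy(logits, target):
    """Cross-entropy of one sample: -log(softmax(logits)[target])."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[target]

def hard_distillation_loss(cls_logits, dist_logits, teacher_logits, true_label):
    """Average of CE(class head, true label) and CE(distillation head, teacher's argmax)."""
    teacher_label = max(range(len(teacher_logits)), key=teacher_logits.__getitem__)
    return (0.5 * cross_entropy(cls_logits, true_label)
            + 0.5 * cross_entropy(dist_logits, teacher_label))

# Toy 3-class example (all numbers are made up).
loss = hard_distillation_loss(
    cls_logits=[2.0, 0.5, -1.0],      # student output from the class token
    dist_logits=[0.1, 1.5, 0.3],      # student output from the distillation token
    teacher_logits=[0.2, 3.0, -0.5],  # CNN teacher predicts class 1
    true_label=0,
)
print(f"{loss:.4f}")
```

The teacher only contributes its predicted label, so the student can learn from a CNN without needing the CNN at inference time.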

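DeiT shows that Transformers can be applied successfully to computer vision tasks even with limited data and resources. For more details on DeiT, see its repository and paper.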

Classifying Images with DeiT

Follow the README.md in the DeiT repository for detailed information on how to classify images with DeiT. For a quick test, first install the required packages:

pip install torch torchvision timm pandas requests

To run in Google Colab, install the dependencies with the following command:

!pip install timm pandas requests

Then run the script below:

from PIL import Image
import torch
import timm
import requests
import torchvision.transforms as transforms
from timm.data.constants import IMAGENET_DEFAULT_MEAN, IMAGENET_DEFAULT_STD

print(torch.__version__)
# the tutorial was written for 1.8.0; the run captured below used 2.7.0+cu126


model = torch.hub.load('facebookresearch/deit:main', 'deit_base_patch16_224', pretrained=True)
model.eval()

transform = transforms.Compose([
    transforms.Resize(256, interpolation=3),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(IMAGENET_DEFAULT_MEAN, IMAGENET_DEFAULT_STD),
])

img = Image.open(requests.get("https://raw.githubusercontent.com/pytorch/ios-demo-app/master/HelloWorld/HelloWorld/HelloWorld/image.png", stream=True).raw)
img = transform(img)[None,]
out = model(img)
clsidx = torch.argmax(out)
print(clsidx.item())

2.7.0+cu126
Downloading: "https://github.com/facebookresearch/deit/zipball/main" to /var/lib/ci-user/.cache/torch/hub/main.zip
/usr/local/lib/python3.10/dist-packages/timm/models/registry.py:4: FutureWarning:

Importing from timm.models.registry is deprecated, please import via timm.models

/usr/local/lib/python3.10/dist-packages/timm/models/layers/__init__.py:48: FutureWarning:

Importing from timm.models.layers is deprecated, please import via timm.layers

/var/lib/ci-user/.cache/torch/hub/facebookresearch_deit_main/models.py:63: UserWarning:

Overwriting deit_tiny_patch16_224 in registry with models.deit_tiny_patch16_224. This is because the name being registered conflicts with an existing name. Please check if this is not expected.

/var/lib/ci-user/.cache/torch/hub/facebookresearch_deit_main/models.py:78: UserWarning:

Overwriting deit_small_patch16_224 in registry with models.deit_small_patch16_224. This is because the name being registered conflicts with an existing name. Please check if this is not expected.

/var/lib/ci-user/.cache/torch/hub/facebookresearch_deit_main/models.py:93: UserWarning:

Overwriting deit_base_patch16_224 in registry with models.deit_base_patch16_224. This is because the name being registered conflicts with an existing name. Please check if this is not expected.

/var/lib/ci-user/.cache/torch/hub/facebookresearch_deit_main/models.py:108: UserWarning:

Overwriting deit_tiny_distilled_patch16_224 in registry with models.deit_tiny_distilled_patch16_224. This is because the name being registered conflicts with an existing name. Please check if this is not expected.

/var/lib/ci-user/.cache/torch/hub/facebookresearch_deit_main/models.py:123: UserWarning:

Overwriting deit_small_distilled_patch16_224 in registry with models.deit_small_distilled_patch16_224. This is because the name being registered conflicts with an existing name. Please check if this is not expected.

/var/lib/ci-user/.cache/torch/hub/facebookresearch_deit_main/models.py:138: UserWarning:

Overwriting deit_base_distilled_patch16_224 in registry with models.deit_base_distilled_patch16_224. This is because the name being registered conflicts with an existing name. Please check if this is not expected.

/var/lib/ci-user/.cache/torch/hub/facebookresearch_deit_main/models.py:153: UserWarning:

Overwriting deit_base_patch16_384 in registry with models.deit_base_patch16_384. This is because the name being registered conflicts with an existing name. Please check if this is not expected.

/var/lib/ci-user/.cache/torch/hub/facebookresearch_deit_main/models.py:168: UserWarning:

Overwriting deit_base_distilled_patch16_384 in registry with models.deit_base_distilled_patch16_384. This is because the name being registered conflicts with an existing name. Please check if this is not expected.

Downloading: "https://dl.fbaipublicfiles.com/deit/deit_base_patch16_224-b5f2ef4d.pth" to /var/lib/ci-user/.cache/torch/hub/checkpoints/deit_base_patch16_224-b5f2ef4d.pth

269

The output should be 269, which, according to the ImageNet class index to label mapping file, corresponds to timber wolf, grey wolf, gray wolf, Canis lupus.
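The index-to-label lookup can be sketched as follows. The dictionary is a hypothetical one-entry excerpt: only the entry for 269 comes from this tutorial, and a real application would load the full 1000-class ImageNet mapping from a labels file.

```python
# Hypothetical excerpt of the ImageNet class-index-to-label mapping.
# Only index 269 is taken from the tutorial text.
IMAGENET_LABELS = {
    269: "timber wolf, grey wolf, gray wolf, Canis lupus",
}

def label_for(class_index: int) -> str:
    """Return the human-readable label, or a placeholder for unknown indices."""
    return IMAGENET_LABELS.get(class_index, f"<unknown class {class_index}>")

print(label_for(269))
```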

Now that we have verified that the DeiT model can classify images, let's see how to modify the model so that it can run in iOS and Android apps.

Scripting DeiT

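To use the model on mobile, we first need to script it. See the Script and Optimize recipe for a quick overview. Run the code below to convert the DeiT model used in the previous step to the TorchScript format that can run on mobile.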

model = torch.hub.load('facebookresearch/deit:main', 'deit_base_patch16_224', pretrained=True)
model.eval()
scripted_model = torch.jit.script(model)
scripted_model.save("fbdeit_scripted.pt")

Using cache found in /var/lib/ci-user/.cache/torch/hub/facebookresearch_deit_main

This generates the scripted model file fbdeit_scripted.pt, about 346MB in size.
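As a quick sanity check before moving on, a scripted model can be saved, reloaded with torch.jit.load, and compared against the eager model. The sketch below uses a tiny stand-in network (an assumption, chosen so the example runs in seconds) rather than DeiT itself; the same pattern should carry over to fbdeit_scripted.pt.

```python
import os
import tempfile

import torch

# Tiny stand-in for DeiT (assumption: chosen only to keep the example fast).
tiny_model = torch.nn.Sequential(
    torch.nn.Linear(8, 16),
    torch.nn.ReLU(),
    torch.nn.Linear(16, 4),
).eval()

# Script the eager model, exactly as done for DeiT above.
scripted = torch.jit.script(tiny_model)

# Save to a temporary file and load it back.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "tiny_scripted.pt")
    scripted.save(path)
    reloaded = torch.jit.load(path)

# The reloaded TorchScript module should reproduce the eager output.
x = torch.randn(1, 8)
with torch.no_grad():
    same = torch.allclose(tiny_model(x), reloaded(x))
print(same)
```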

Quantizing DeiT

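To reduce the trained model's size significantly while keeping inference accuracy about the same, quantization can be applied to the model. Because DeiT is built on the Transformer architecture, dynamic quantization is easy to apply: it works best for LSTM and Transformer models (see here for details).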
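To give an intuition for where the size saving comes from, here is a simplified, pure-Python illustration of affine int8 quantization of a weight tensor. It is a sketch under stated assumptions, not PyTorch's actual kernels: the helper names and the weights are made up, and real dynamic quantization additionally quantizes activations on the fly at inference time. Each weight is stored as an 8-bit integer instead of a 32-bit float, plus one scale and zero point per tensor.

```python
def quantize_int8(values):
    """Affine per-tensor quantization of floats to integers in [-128, 127]."""
    lo = min(min(values), 0.0)  # widen the range so 0.0 is exactly representable
    hi = max(max(values), 0.0)
    scale = (hi - lo) / 255 or 1.0  # fall back to 1.0 if every value is zero
    zero_point = round(-128 - lo / scale)
    q = [max(-128, min(127, round(v / scale) + zero_point)) for v in values]
    return q, scale, zero_point

def dequantize(q, scale, zero_point):
    """Map int8 codes back to approximate float values."""
    return [(qi - zero_point) * scale for qi in q]

weights = [-1.2, -0.4, 0.0, 0.35, 0.9, 2.1]  # made-up weights
q, scale, zero_point = quantize_int8(weights)
restored = dequantize(q, scale, zero_point)
max_error = max(abs(w - r) for w, r in zip(weights, restored))

print(q)
print(f"scale={scale:.5f} zero_point={zero_point} max_error={max_error:.5f}")
```

The round-trip error stays within one quantization step (scale), which is why accuracy is usually affected only slightly.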

Now run the code below:

# Use "x86" for server inference (the older "fbgemm" is still available, but "x86" is the recommended default) and "qnnpack" for mobile inference.
backend = "x86"  # switching to "qnnpack" made quantized inference much slower in this notebook
model.qconfig = torch.quantization.get_default_qconfig(backend)
torch.backends.quantized.engine = backend

quantized_model = torch.quantization.quantize_dynamic(model, qconfig_spec={torch.nn.Linear}, dtype=torch.qint8)
scripted_quantized_model = torch.jit.script(quantized_model)
scripted_quantized_model.save("fbdeit_scripted_quantized.pt")

/var/lib/ci-user/.local/lib/python3.10/site-packages/torch/ao/quantization/observer.py:244: UserWarning:

Please use quant_min and quant_max to specify the range for observers.                     reduce_range will be deprecated in a future release of PyTorch.

This generates the scripted and quantized version of the model, fbdeit_scripted_quantized.pt, at about 89MB, a 74% reduction from the 346MB non-quantized model.
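The size figures can be checked with a couple of small helpers. This is a minimal sketch: the helper names are hypothetical, and a throwaway dummy file stands in for the saved model files so that the snippet runs on its own.

```python
import os
import tempfile

def size_mb(path):
    """File size in megabytes (1 MB taken as 1024 * 1024 bytes here)."""
    return os.path.getsize(path) / (1024 * 1024)

def reduction_percent(before, after):
    """Percentage by which `after` is smaller than `before`."""
    return (before - after) / before * 100

# Check the quoted figure from the two reported sizes.
print(f"{reduction_percent(346, 89):.0f}%")  # prints "74%"

# Demonstrate size_mb on a throwaway 2 MB file (stand-in for a saved model).
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "dummy.bin")
    with open(path, "wb") as f:
        f.write(b"\0" * (2 * 1024 * 1024))
    dummy_mb = size_mb(path)
print(dummy_mb)
```

With the tutorial's files in the working directory, size_mb("fbdeit_scripted.pt") and size_mb("fbdeit_scripted_quantized.pt") would give the two numbers to compare.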

You can use scripted_quantized_model to produce the same inference result:

out = scripted_quantized_model(img)
clsidx = torch.argmax(out)
print(clsidx.item())
# The same output 269 should be printed

Optimizing DeiT

The final step before using the quantized and scripted model on mobile is to optimize it:

from torch.utils.mobile_optimizer import optimize_for_mobile
optimized_scripted_quantized_model = optimize_for_mobile(scripted_quantized_model)
optimized_scripted_quantized_model.save("fbdeit_optimized_scripted_quantized.pt")

The generated fbdeit_optimized_scripted_quantized.pt file is about the same size as the quantized, scripted, but non-optimized model. The inference result remains the same.

out = optimized_scripted_quantized_model(img)
clsidx = torch.argmax(out)
print(clsidx.item())
# Again, the same output 269 should be printed

Using Lite Interpreter

To see how much model size reduction and inference speed-up the Lite Interpreter can provide, let's create the lite version of the model.

optimized_scripted_quantized_model._save_for_lite_interpreter("fbdeit_optimized_scripted_quantized_lite.ptl")
ptl = torch.jit.load("fbdeit_optimized_scripted_quantized_lite.ptl")

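Although the lite model is comparable in size to the non-lite version, an inference speed-up is expected when running the lite version on mobile.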

Comparing Inference Speed

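To see how the inference speed differs across the five models (the original model, the scripted model, the quantized-and-scripted model, the optimized-quantized-and-scripted model, and the lite model), run the code below: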

with torch.autograd.profiler.profile(use_cuda=False) as prof1:
    out = model(img)
with torch.autograd.profiler.profile(use_cuda=False) as prof2:
    out = scripted_model(img)
with torch.autograd.profiler.profile(use_cuda=False) as prof3:
    out = scripted_quantized_model(img)
with torch.autograd.profiler.profile(use_cuda=False) as prof4:
    out = optimized_scripted_quantized_model(img)
with torch.autograd.profiler.profile(use_cuda=False) as prof5:
    out = ptl(img)

print("original model: {:.2f}ms".format(prof1.self_cpu_time_total/1000))
print("scripted model: {:.2f}ms".format(prof2.self_cpu_time_total/1000))
print("scripted & quantized model: {:.2f}ms".format(prof3.self_cpu_time_total/1000))
print("scripted & quantized & optimized model: {:.2f}ms".format(prof4.self_cpu_time_total/1000))
print("lite model: {:.2f}ms".format(prof5.self_cpu_time_total/1000))

original model: 121.17ms
scripted model: 128.88ms
scripted & quantized model: 94.96ms
scripted & quantized & optimized model: 136.20ms
lite model: 112.27ms

The results from a run on Google Colab are:

original model: 1236.69ms
scripted model: 1226.72ms
scripted & quantized model: 593.19ms
scripted & quantized & optimized model: 598.01ms
lite model: 600.72ms
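A single profiled forward pass can be noisy: in the run captured above, the optimized model (136.20ms) comes out slower than the original (121.17ms), while on Colab it is roughly twice as fast. One common way to get steadier numbers is to warm up first and take the median over several runs. The helper below is a hypothetical sketch, not part of the tutorial; it measures wall-clock time rather than the profiler's self_cpu_time_total, and it times a cheap stand-in function so that it runs without the models.

```python
import statistics
import time

def benchmark_ms(fn, *args, warmup=3, runs=10):
    """Median wall-clock time of fn(*args) in milliseconds, after a few warm-up calls."""
    for _ in range(warmup):
        fn(*args)
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        fn(*args)
        timings.append((time.perf_counter() - start) * 1000)
    return statistics.median(timings)

# Stand-in workload (assumption). With the tutorial's objects one would call
# benchmark_ms(model, img), benchmark_ms(scripted_model, img), and so on.
def fake_inference(n):
    return sum(i * i for i in range(n))

elapsed = benchmark_ms(fake_inference, 10_000)
print(f"{elapsed:.3f}ms")
```

When timing the real models, wrapping the calls in torch.no_grad() avoids measuring autograd bookkeeping.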

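The following code summarizes the inference time taken by each model and the percentage reduction of each model relative to the original model.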

import pandas as pd
import numpy as np

df = pd.DataFrame({'Model': ['original model','scripted model', 'scripted & quantized model', 'scripted & quantized & optimized model', 'lite model']})
df = pd.concat([df, pd.DataFrame([
    ["{:.2f}ms".format(prof1.self_cpu_time_total/1000), "0%"],
    ["{:.2f}ms".format(prof2.self_cpu_time_total/1000),
     "{:.2f}%".format((prof1.self_cpu_time_total-prof2.self_cpu_time_total)/prof1.self_cpu_time_total*100)],
    ["{:.2f}ms".format(prof3.self_cpu_time_total/1000),
     "{:.2f}%".format((prof1.self_cpu_time_total-prof3.self_cpu_time_total)/prof1.self_cpu_time_total*100)],
    ["{:.2f}ms".format(prof4.self_cpu_time_total/1000),
     "{:.2f}%".format((prof1.self_cpu_time_total-prof4.self_cpu_time_total)/prof1.self_cpu_time_total*100)],
    ["{:.2f}ms".format(prof5.self_cpu_time_total/1000),
     "{:.2f}%".format((prof1.self_cpu_time_total-prof5.self_cpu_time_total)/prof1.self_cpu_time_total*100)]],
    columns=['Inference Time', 'Reduction'])], axis=1)

print(df)

"""
        Model                             Inference Time    Reduction
0   original model                             1236.69ms           0%
1   scripted model                             1226.72ms        0.81%
2   scripted & quantized model                  593.19ms       52.03%
3   scripted & quantized & optimized model      598.01ms       51.64%
4   lite model                                  600.72ms       51.43%
"""

                                    Model  ... Reduction
0                          original model  ...        0%
1                          scripted model  ...    -6.36%
2              scripted & quantized model  ...    21.63%
3  scripted & quantized & optimized model  ...   -12.41%
4                              lite model  ...     7.34%

[5 rows x 3 columns]
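The Reduction column is computed as (t_original - t_model) / t_original * 100. As a cross-check, the sketch below recomputes it from the Colab timings quoted above and also reports the equivalent speed-up factor; the variable names are illustrative and no profiler is involved.

```python
# Colab timings in milliseconds, copied from the results quoted above.
colab_ms = {
    "original model": 1236.69,
    "scripted model": 1226.72,
    "scripted & quantized model": 593.19,
    "scripted & quantized & optimized model": 598.01,
    "lite model": 600.72,
}

baseline = colab_ms["original model"]
rows = []
for name, t in colab_ms.items():
    reduction = (baseline - t) / baseline * 100  # percent of inference time saved
    speedup = baseline / t                       # how many times faster than the original
    rows.append((name, f"{reduction:.2f}%", f"{speedup:.2f}x"))
    print(f"{name:<40}{reduction:>8.2f}%{speedup:>8.2f}x")
```

A reduction of about 52% corresponds to a speed-up of roughly 2x, and a negative reduction (as in the captured run for the scripted model) means that model was slower than the original in that run.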


Learn More
