Efficient PyTorch: Tensor Memory Format Matters

By Dhruv Matani, Suraj Subramanian

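Ensuring the right memory format for your inputs can significantly impact the running time of your PyTorch vision models. When in doubt, choose a Channels Last memory format.

When dealing with vision models in PyTorch that accept multimedia (for example, image tensors) as input, the tensor's memory format can significantly impact the inference execution speed of your model on mobile platforms when using the CPU backend along with XNNPACK. This holds true for training and inference on server platforms as well, but latency is particularly critical for mobile devices and their users.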

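Outline of this article

Deep dive into matrix storage/memory representation in C++. Introduction to row-major and column-major order.
Impact of looping over a matrix in the same or a different order as the storage representation, along with an example.
Introduction to Cachegrind, a tool to inspect the cache friendliness of your code.
Memory formats supported by PyTorch operators.
Best-practices example to ensure efficient model execution with XNNPACK optimizations.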

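Matrix Storage Representation in C++

Images are fed into PyTorch ML models as multi-dimensional tensors. These tensors have specific memory formats. To understand this concept better, let's look at how a 2-d matrix may be stored in memory.

Broadly speaking, there are 2 main ways of efficiently storing multi-dimensional data in memory.

Row Major Order: In this format, the matrix is stored in row order, with each row stored before the next row in memory. That is, row N comes before row N+1.
Column Major Order: In this format, the matrix is stored in column order, with each column stored before the next column in memory. That is, column N comes before column N+1.

You can see the difference between them in the figure below.

Figure: C++ stores multi-dimensional data in row-major format.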
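
To make the two layouts concrete, here is a minimal sketch in plain Python (no PyTorch needed) that flattens a small 2x3 matrix both ways. The helper names `row_major_offset` and `col_major_offset` are made up for this illustration and are not part of any library.

```python
# Hypothetical helpers: map a 2-d index (i, j) to a position in flat memory.
def row_major_offset(i, j, rows, cols):
    # Row i starts after i complete rows of `cols` elements each.
    return i * cols + j

def col_major_offset(i, j, rows, cols):
    # Column j starts after j complete columns of `rows` elements each.
    return j * rows + i

matrix = [[1, 2, 3],
          [4, 5, 6]]
rows, cols = 2, 3

row_major = [0] * (rows * cols)
col_major = [0] * (rows * cols)
for i in range(rows):
    for j in range(cols):
        row_major[row_major_offset(i, j, rows, cols)] = matrix[i][j]
        col_major[col_major_offset(i, j, rows, cols)] = matrix[i][j]

print(row_major)  # [1, 2, 3, 4, 5, 6]
print(col_major)  # [1, 4, 2, 5, 3, 6]
```

The same six values end up in memory in a different order, which is all that a "memory format" changes.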

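Efficiently accessing elements of a 2d matrix

Similar to the storage format, there are 2 ways to access data in a 2d matrix.

Loop Over Rows first: All elements of a row are processed before any element of the next row.
Loop Over Columns first: All elements of a column are processed before any element of the next column.

For maximum efficiency, always access data in the same order in which it is stored. That is, if the data is stored in row-major order, try to access it in that order.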
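
As a rough illustration of why the access order matters, the sketch below computes how far apart in memory two consecutive accesses are for each loop order over a row-major 4000x4000 matrix. It assumes 4-byte `int` elements, which is typical but not guaranteed by C++; `byte_offset` is a hypothetical helper for this example.

```python
N = 4000        # matrix is N x N, stored in row-major order
INT_BYTES = 4   # assumption: sizeof(int) == 4

def byte_offset(i, j):
    # Distance in bytes of element (i, j) from the start of the matrix.
    return (i * N + j) * INT_BYTES

# Looping over rows first: the inner loop advances j.
row_first_step = byte_offset(0, 1) - byte_offset(0, 0)
# Looping over columns first: the inner loop advances i.
col_first_step = byte_offset(1, 0) - byte_offset(0, 0)

print(row_first_step)  # 4
print(col_first_step)  # 16000
```

Under this assumption, the row-first loop touches adjacent memory, while the column-first loop jumps a full row (16,000 bytes) on every access.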

The code below (main.cpp) shows 2 ways of accessing all the elements of a 4000x4000 2d matrix.

#include <iostream>
#include <chrono>
#include <cstdlib>  // for rand()

// loop1 accesses data in matrix 'a' in row major order,
// since i is the outer loop variable, and j is the
// inner loop variable.
int loop1(int a[4000][4000]) {
 int s = 0;
 for (int i = 0; i < 4000; ++i) {
   for (int j = 0; j < 4000; ++j) {
     s += a[i][j];
   }
 }
 return s;
}

// loop2 accesses data in matrix 'a' in column major order
// since j is the outer loop variable, and i is the
// inner loop variable.
int loop2(int a[4000][4000]) {
 int s = 0;
 for (int j = 0; j < 4000; ++j) {
   for (int i = 0; i < 4000; ++i) {
     s += a[i][j];
   }
 }
 return s;
}

int main() {
 static int a[4000][4000] = {0};
 for (int i = 0; i < 100; ++i) {
   int x = rand() % 4000;
   int y = rand() % 4000;
   a[x][y] = rand() % 1000;
 }

 auto start = std::chrono::high_resolution_clock::now();
 auto end = start;
 int s = 0;

#if defined RUN_LOOP1
 start = std::chrono::high_resolution_clock::now();

 s = 0;
 for (int i = 0; i < 10; ++i) {
   s += loop1(a);
   s = s % 100;
 }
 end = std::chrono::high_resolution_clock::now();

 std::cout << "s = " << s << std::endl;
 std::cout << "Time for loop1: "
   << std::chrono::duration<double, std::milli>(end - start).count()
   << "ms" << std::endl;
#endif

#if defined RUN_LOOP2
 start = std::chrono::high_resolution_clock::now();
 s = 0;
 for (int i = 0; i < 10; ++i) {
   s += loop2(a);
   s = s % 100;
 }
 end = std::chrono::high_resolution_clock::now();

 std::cout << "s = " << s << std::endl;
 std::cout << "Time for loop2: "
   << std::chrono::duration<double, std::milli>(end - start).count()
   << "ms" << std::endl;
#endif
}


Let’s build and run this program and see what it prints.

g++ -O2 main.cpp -DRUN_LOOP1 -DRUN_LOOP2
./a.out


Prints the following:

s = 70
Time for loop1: 77.0687ms
s = 70
Time for loop2: 1219.49ms

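loop1() is nearly 16 times faster than loop2() (about 77 ms versus 1,219 ms). Why is that? Let's find out below!

Measure cache misses using Cachegrind

Cachegrind is a cache-profiling tool used to see how many I1 (first-level instruction), D1 (first-level data), and LL (last-level) cache misses your program caused.

Let's build our program with just loop1() and with just loop2() to see how cache friendly each of these functions is.

Build and run/profile just `loop1()`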

g++ -O2 main.cpp -DRUN_LOOP1
valgrind --tool=cachegrind ./a.out

Output:

==3299700==
==3299700== I   refs:      643,156,721
==3299700== I1  misses:          2,077
==3299700== LLi misses:          2,021
==3299700== I1  miss rate:        0.00%
==3299700== LLi miss rate:        0.00%
==3299700==
==3299700== D   refs:      160,952,192  (160,695,444 rd   + 256,748 wr)
==3299700== D1  misses:     10,021,300  ( 10,018,723 rd   +   2,577 wr)
==3299700== LLd misses:     10,010,916  ( 10,009,147 rd   +   1,769 wr)
==3299700== D1  miss rate:         6.2% (        6.2%     +     1.0%  )
==3299700== LLd miss rate:         6.2% (        6.2%     +     0.7%  )
==3299700==
==3299700== LL refs:        10,023,377  ( 10,020,800 rd   +   2,577 wr)
==3299700== LL misses:      10,012,937  ( 10,011,168 rd   +   1,769 wr)
==3299700== LL miss rate:          1.2% (        1.2%     +     0.7%  )

Build and run/profile just `loop2()`

g++ -O2 main.cpp -DRUN_LOOP2
valgrind --tool=cachegrind ./a.out

Output:

==3300389==
==3300389== I   refs:      643,156,726
==3300389== I1  misses:          2,075
==3300389== LLi misses:          2,018
==3300389== I1  miss rate:        0.00%
==3300389== LLi miss rate:        0.00%
==3300389==
==3300389== D   refs:      160,952,196  (160,695,447 rd   + 256,749 wr)
==3300389== D1  misses:    160,021,290  (160,018,713 rd   +   2,577 wr)
==3300389== LLd misses:     10,014,907  ( 10,013,138 rd   +   1,769 wr)
==3300389== D1  miss rate:        99.4% (       99.6%     +     1.0%  )
==3300389== LLd miss rate:         6.2% (        6.2%     +     0.7%  )
==3300389==
==3300389== LL refs:       160,023,365  (160,020,788 rd   +   2,577 wr)
==3300389== LL misses:      10,016,925  ( 10,015,156 rd   +   1,769 wr)
==3300389== LL miss rate:          1.2% (        1.2%     +     0.7%  )

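The main differences between the 2 runs are:

D1 misses: 10M vs. 160M
D1 miss rate: 6.2% vs. 99.4%

As you can see, loop2() causes many more L1 data cache misses than loop1() (about 16x more). This is why loop1() is roughly 16x faster than loop2().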
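
A back-of-the-envelope estimate lands close to these Cachegrind numbers. The sketch below assumes 4-byte `int` elements and 64-byte L1 data cache lines. Both are common, but neither is stated in the profile output above, so treat the result as an approximation rather than an exact model of the hardware.

```python
INT_BYTES = 4           # assumption: sizeof(int) == 4
CACHE_LINE_BYTES = 64   # assumption: 64-byte L1 data cache lines
N = 4000                # matrix is N x N
PASSES = 10             # main() calls each loop function 10 times

accesses = N * N * PASSES                       # element reads in total
ints_per_line = CACHE_LINE_BYTES // INT_BYTES   # 16 ints share one cache line

# Row-major traversal: one miss loads a line, the next 15 reads hit it.
row_major_misses = accesses // ints_per_line
# Column-major traversal: every read lands on a different line, and the
# line is evicted before the loop comes back to it, so nearly every read misses.
col_major_misses = accesses

print(f"{row_major_misses:,}")               # 10,000,000
print(f"{col_major_misses:,}")               # 160,000,000
print(f"{row_major_misses / accesses:.2%}")  # 6.25%
```

These estimates are in line with the measured D1 misses (about 10M and 160M) and with the 6.2% miss rate reported for loop1().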

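Memory formats supported by PyTorch Operators

While PyTorch operators expect all tensors to be in Channels First (NCHW) dimension format, PyTorch operators support 3 output memory formats.

Contiguous: Tensor memory is in the same order as the tensor's dimensions.
ChannelsLast: Irrespective of the dimension order, the 2d (image) tensor is laid out as an HWC or NHWC (N: batch, H: height, W: width, C: channels) tensor in memory. The dimensions could be permuted in any order.
ChannelsLast3d: For 3d tensors (video tensors), the memory is laid out in THWC (Time, Height, Width, Channels) or NTHWC (N: batch, T: time, H: height, W: width, C: channels) format. The dimensions could be permuted in any order.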
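
One way to see the difference between Contiguous and ChannelsLast is through strides, that is, how many elements you skip in memory to move one step along each dimension. The following is a plain-Python sketch, not PyTorch's implementation; `contiguous_strides` and `channels_last_strides` are hypothetical helpers written for this illustration.

```python
def contiguous_strides(shape):
    # Memory order matches dimension order: the last dimension is densest.
    strides = [1] * len(shape)
    for d in range(len(shape) - 2, -1, -1):
        strides[d] = strides[d + 1] * shape[d + 1]
    return tuple(strides)

def channels_last_strides(shape):
    # Dimensions are still reported as (N, C, H, W), but memory is laid
    # out as N, H, W, C, so the channel dimension has stride 1.
    n, c, h, w = shape
    return (h * w * c, 1, w * c, c)

shape = (1, 3, 200, 200)  # N, C, H, W
print(contiguous_strides(shape))     # (120000, 40000, 200, 1)
print(channels_last_strides(shape))  # (120000, 1, 600, 3)
```

The shape is identical in both cases; only the mapping from index to memory position changes.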

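The reason that ChannelsLast is preferred for vision models is that XNNPACK (the kernel acceleration library) used by PyTorch expects all inputs to be in Channels Last format. If the input to the model isn't Channels Last, it must first be converted to Channels Last, which is an additional operation.

Additionally, most PyTorch operators preserve the input tensor's memory format, so if the input is Channels First, the operator needs to first convert to Channels Last, then perform the operation, and then convert back to Channels First.

Combine that with the fact that accelerated operators work better with the Channels Last memory format, and you'll notice that having an operator return Channels Last is better for subsequent operator calls. Otherwise, every operator would have to convert to Channels Last (should that be more efficient for that specific operator).

From the XNNPACK home page:

“All operators in XNNPACK support NHWC layout, but additionally allow custom stride along the Channel dimension.”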

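PyTorch Best Practice

The best way to get the most performance from your PyTorch vision models is to ensure that your input tensor is in Channels Last memory format before it is fed into the model.

You can get even more speedup by optimizing your model to use the XNNPACK backend (simply call optimize_for_mobile() on your torchscripted model). Note that XNNPACK models run slower if the inputs are in Contiguous format, so make sure they are in Channels-Last format.

Working example showing speedup

Run this example on Google Colab. Note that runtimes on Colab CPUs might not reflect accurate performance; it's recommended to run this code on your local machine.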

import torch
from torch.utils.mobile_optimizer import optimize_for_mobile
import torch.backends.xnnpack
import time

print("XNNPACK is enabled: ", torch.backends.xnnpack.enabled, "\n")

N, C, H, W = 1, 3, 200, 200
x = torch.rand(N, C, H, W)
print("Contiguous shape: ", x.shape)
print("Contiguous stride: ", x.stride())
print()

xcl = x.to(memory_format=torch.channels_last)
print("Channels-Last shape: ", xcl.shape)
print("Channels-Last stride: ", xcl.stride())

## Outputs:
 
# XNNPACK is enabled:  True
 
# Contiguous shape:  torch.Size([1, 3, 200, 200])
# Contiguous stride:  (120000, 40000, 200, 1)
 
# Channels-Last shape:  torch.Size([1, 3, 200, 200])
# Channels-Last stride:  (120000, 1, 600, 3)

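The input shape stays the same for the Contiguous and Channels-Last formats. Internally, however, the tensor's layout has changed, as you can see in the strides. Now, the number of jumps required to go across channels is only 1 (instead of 40000 in the Contiguous tensor). This better data locality means convolution layers can access all the channels for a given pixel much faster. Let's now see how the memory format affects runtime: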

from torchvision.models import resnet34, resnet50, resnet101

m = resnet34(pretrained=False)
# m = resnet50(pretrained=False)
# m = resnet101(pretrained=False)

def get_optimized_model(mm):
  mm = mm.eval()
  scripted = torch.jit.script(mm)
  optimized = optimize_for_mobile(scripted)  # explicitly call the xnnpack rewrite 
  return scripted, optimized


def compare_contiguous_CL(mm):
  # inference on contiguous
  start = time.perf_counter()
  for i in range(20):
    mm(x)
  end = time.perf_counter()
  print("Contiguous: ", end-start)

  # inference on channels-last
  start = time.perf_counter()
  for i in range(20):
    mm(xcl)
  end = time.perf_counter()
  print("Channels-Last: ", end-start)

with torch.inference_mode():
  scripted, optimized = get_optimized_model(m)

  print("Runtimes for torchscripted model: ")
  compare_contiguous_CL(scripted.eval())
  print()
  print("Runtimes for mobile-optimized model: ")
  compare_contiguous_CL(optimized.eval())

   
## Outputs (on an Intel Core i9 CPU):
 
# Runtimes for torchscripted model:
# Contiguous:  1.6711160129999598
# Channels-Last:  1.6678222839999535
 
# Runtimes for mobile-optimized model:
# Contiguous:  0.5712863490000473
# Channels-Last:  0.46113000699995155
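
To put the timings above in relative terms, the short sketch below computes speedup ratios from the numbers reported in this run. It only restates those measurements; results on other machines will differ.

```python
# Timings (seconds for 20 forward passes) copied from the output above.
scripted = {"contiguous": 1.6711160129999598,
            "channels_last": 1.6678222839999535}
optimized = {"contiguous": 0.5712863490000473,
             "channels_last": 0.46113000699995155}

# Effect of the memory format alone, within each model variant.
scripted_gain = scripted["contiguous"] / scripted["channels_last"]
optimized_gain = optimized["contiguous"] / optimized["channels_last"]
# Combined effect: mobile-optimized + Channels-Last vs. scripted + Contiguous.
combined_gain = scripted["contiguous"] / optimized["channels_last"]

print(f"torchscripted, Channels-Last vs. Contiguous:    {scripted_gain:.2f}x")
print(f"mobile-optimized, Channels-Last vs. Contiguous: {optimized_gain:.2f}x")
print(f"combined vs. torchscripted Contiguous:          {combined_gain:.2f}x")
```

In this run, the memory format makes almost no difference for the plain torchscripted model, while the mobile-optimized model is about 1.24x faster with Channels-Last input.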

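Conclusion

The memory layout of an input tensor can significantly impact the running time of a model. For vision models, prefer the Channels Last memory format to get the most out of your PyTorch models.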
