Knowledge Distillation Tutorial

**Author**: Alexandros Chariton

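Knowledge distillation is a technique for transferring knowledge from large, computationally expensive models to smaller ones, ideally without losing effectiveness. This allows deployment on less powerful hardware and makes evaluation faster and more efficient.

In this tutorial, we run a series of experiments focused on improving the accuracy of a lightweight neural network by using a more powerful network as a teacher. The computational cost and speed of the lightweight network remain unaffected: our intervention targets only its weights, not its forward pass. Applications of this technique can be found in devices such as drones or mobile phones. We do not use any external packages, because everything we need is available in torch and torchvision.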

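In this tutorial, you will learn:

  • How to modify model classes to extract hidden representations and use them for further calculations

  • How to modify regular training loops in PyTorch to include additional losses on top of, for example, cross-entropy for classification

  • How to improve the performance of lightweight models by using more complex models as teachers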

Prerequisites

  • 1 GPU, 4GB of memory

  • PyTorch v2.0 or later

  • CIFAR-10 dataset (downloaded by the script and saved in a directory called ./data)

import torch
import torch.nn as nn
import torch.optim as optim
import torchvision.transforms as transforms
import torchvision.datasets as datasets

# Check if GPU is available, and if not, use the CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

Loading CIFAR-10

CIFAR-10 is a popular image dataset with ten classes. Our objective is to predict one of these classes for each input image.

Figure: Example of CIFAR-10 images

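The input images are RGB, so they have 3 channels and are 32x32 pixels. Each image is therefore described by 3 x 32 x 32 = 3072 numbers ranging from 0 to 255. A common practice in neural networks is to normalize the input, which is done for multiple reasons, including avoiding saturation in commonly used activation functions and increasing numerical stability. Our normalization process consists of subtracting the mean and dividing by the standard deviation along each channel. The tensors "mean=[0.485, 0.456, 0.406]" and "std=[0.229, 0.224, 0.225]" were already computed, and they represent the mean and standard deviation of each channel in the predefined subset of CIFAR-10 intended to be the training set. Notice that we use these values for the test set as well, without recomputing the mean and standard deviation from scratch. This is because the network was trained on features produced by subtracting and dividing by the numbers above, and we want to maintain consistency. Furthermore, in real life we would not be able to compute the mean and standard deviation of the test set since, under our assumptions, this data would not be accessible at that point.

As a closing point, we often refer to this held-out set as the validation set, and we use a separate set, called the test set, after optimizing a model's performance on the validation set. This is done to avoid selecting a model based on the greedy and biased optimization of a single metric.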
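
The per-channel normalization described above can be sketched as follows. This is a minimal illustration, not the tutorial's own preprocessing: the helper names `per_channel_stats` and `normalize` are hypothetical, and random tensors stand in for images that `ToTensor()` has already scaled to the [0, 1] range. Statistics computed this way on the actual CIFAR-10 training images may differ somewhat from the hard-coded values used in this tutorial.

```python
import torch

def per_channel_stats(images: torch.Tensor):
    """Return per-channel mean and std for a batch shaped (N, C, H, W)."""
    # Reduce over the batch and spatial dimensions, keeping the channel axis.
    mean = images.mean(dim=(0, 2, 3))
    std = images.std(dim=(0, 2, 3))
    return mean, std

def normalize(images: torch.Tensor, mean: torch.Tensor, std: torch.Tensor):
    # Broadcast the (C,) statistics over (N, C, H, W).
    return (images - mean[None, :, None, None]) / std[None, :, None, None]

# Hypothetical stand-ins for training and held-out images in [0, 1].
torch.manual_seed(0)
train_images = torch.rand(64, 3, 32, 32)
test_images = torch.rand(16, 3, 32, 32)

train_mean, train_std = per_channel_stats(train_images)

# The training statistics are reused for the held-out images.
train_normalized = normalize(train_images, train_mean, train_std)
test_normalized = normalize(test_images, train_mean, train_std)

print(train_normalized.mean(dim=(0, 2, 3)))  # close to zero per channel
print(train_normalized.std(dim=(0, 2, 3)))   # close to one per channel
```

Only the training images contribute to the statistics; the held-out images are transformed with the same numbers, which mirrors the consistency argument made above.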

# Below we are preprocessing data for CIFAR-10. We use an arbitrary batch size of 128.
transforms_cifar = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

# Loading the CIFAR-10 dataset:
train_dataset = datasets.CIFAR10(root='./data', train=True, download=True, transform=transforms_cifar)
test_dataset = datasets.CIFAR10(root='./data', train=False, download=True, transform=transforms_cifar)
Downloading https://www.cs.toronto.edu/~kriz/cifar-10-python.tar.gz to ./data/cifar-10-python.tar.gz

Extracting ./data/cifar-10-python.tar.gz to ./data
Files already downloaded and verified

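Note

This section is only for CPU users who are interested in quick results. Use this option only if you are interested in a small-scale experiment. Keep in mind that the code should run fairly quickly on any GPU. Select only the first num_images_to_keep images from the train/test dataset.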

#from torch.utils.data import Subset
#num_images_to_keep = 2000
#train_dataset = Subset(train_dataset, range(min(num_images_to_keep, 50_000)))
#test_dataset = Subset(test_dataset, range(min(num_images_to_keep, 10_000)))
#Dataloaders
train_loader = torch.utils.data.DataLoader(train_dataset, batch_size=128, shuffle=True, num_workers=2)
test_loader = torch.utils.data.DataLoader(test_dataset, batch_size=128, shuffle=False, num_workers=2)

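Defining model classes and utility functions

Next, we need to define our model classes. Several user-defined parameters need to be set here. We use two different architectures, keeping the number of filters fixed across our experiments to ensure fair comparisons. Both architectures are Convolutional Neural Networks (CNNs) with a different number of convolutional layers that serve as feature extractors, followed by a classifier with 10 classes. The number of filters and neurons is smaller for the student.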

# Deeper neural network class to be used as teacher:
class DeepNN(nn.Module):
    def __init__(self, num_classes=10):
        super(DeepNN, self).__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 128, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv2d(128, 64, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=2, stride=2),
            nn.Conv2d(64, 64, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv2d(64, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=2, stride=2),
        )
        self.classifier = nn.Sequential(
            nn.Linear(2048, 512),
            nn.ReLU(),
            nn.Dropout(0.1),
            nn.Linear(512, num_classes)
        )

    def forward(self, x):
        x = self.features(x)
        x = torch.flatten(x, 1)
        x = self.classifier(x)
        return x

# Lightweight neural network class to be used as student:
class LightNN(nn.Module):
    def __init__(self, num_classes=10):
        super(LightNN, self).__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=2, stride=2),
            nn.Conv2d(16, 16, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=2, stride=2),
        )
        self.classifier = nn.Sequential(
            nn.Linear(1024, 256),
            nn.ReLU(),
            nn.Dropout(0.1),
            nn.Linear(256, num_classes)
        )

    def forward(self, x):
        x = self.features(x)
        x = torch.flatten(x, 1)
        x = self.classifier(x)
        return x
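
The classifier input sizes in the classes above (2048 for the teacher, 1024 for the student) follow from the feature extractors: each 3x3 convolution with padding 1 keeps the spatial size, and each max-pooling layer halves it, so a 32x32 image ends up as an 8x8 map. A minimal sketch of how to check this with a dummy input, assuming 3 x 32 x 32 inputs (the helper name `flattened_size` is hypothetical):

```python
import torch
import torch.nn as nn

def flattened_size(features: nn.Module, input_shape=(3, 32, 32)) -> int:
    """Number of features per image after the extractor and flattening."""
    with torch.no_grad():
        dummy = torch.zeros(1, *input_shape)
        return torch.flatten(features(dummy), 1).shape[1]

# Same layer layout as the student's feature extractor.
student_features = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.MaxPool2d(kernel_size=2, stride=2),
    nn.Conv2d(16, 16, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.MaxPool2d(kernel_size=2, stride=2),
)

student_size = flattened_size(student_features)
print(student_size)  # 16 channels * 8 * 8 = 1024
```

Running a check like this after changing the number of filters or pooling layers is a quick way to find the value the first `nn.Linear` layer expects.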

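We employ 2 functions to help us produce and evaluate the results on our original classification task. One function is called train and takes the following arguments:

  • model: A model instance to train (update its weights) via this function.

  • train_loader: We defined our train_loader above, and its job is to feed the data into the model.

  • epochs: How many times we loop over the dataset.

  • learning_rate: The learning rate determines how large our steps towards convergence should be. Steps that are too large or too small can be detrimental.

  • device: Determines the device to run the workload on. Can be either CPU or GPU depending on availability.

Our test function is similar, but it will be invoked with test_loader to load images from the test set.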

Figure: Train both networks with cross-entropy. The student will be used as a baseline.

def train(model, train_loader, epochs, learning_rate, device):
    criterion = nn.CrossEntropyLoss()
    optimizer = optim.Adam(model.parameters(), lr=learning_rate)

    model.train()

    for epoch in range(epochs):
        running_loss = 0.0
        for inputs, labels in train_loader:
            # inputs: A collection of batch_size images
            # labels: A vector of dimensionality batch_size with integers denoting class of each image
            inputs, labels = inputs.to(device), labels.to(device)

            optimizer.zero_grad()
            outputs = model(inputs)

            # outputs: Output of the network for the collection of images. A tensor of dimensionality batch_size x num_classes
            # labels: The actual labels of the images. Vector of dimensionality batch_size
            loss = criterion(outputs, labels)
            loss.backward()
            optimizer.step()

            running_loss += loss.item()

        print(f"Epoch {epoch+1}/{epochs}, Loss: {running_loss / len(train_loader)}")

def test(model, test_loader, device):
    model.to(device)
    model.eval()

    correct = 0
    total = 0

    with torch.no_grad():
        for inputs, labels in test_loader:
            inputs, labels = inputs.to(device), labels.to(device)

            outputs = model(inputs)
            _, predicted = torch.max(outputs.data, 1)

            total += labels.size(0)
            correct += (predicted == labels).sum().item()

    accuracy = 100 * correct / total
    print(f"Test Accuracy: {accuracy:.2f}%")
    return accuracy

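Cross-entropy runs

For reproducibility, we need to set the torch manual seed. We train networks using different methods, so to compare them fairly, it makes sense to initialize the networks with the same weights. Start by training the teacher network using cross-entropy: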

torch.manual_seed(42)
nn_deep = DeepNN(num_classes=10).to(device)
train(nn_deep, train_loader, epochs=10, learning_rate=0.001, device=device)
test_accuracy_deep = test(nn_deep, test_loader, device)

# Instantiate the lightweight network:
torch.manual_seed(42)
nn_light = LightNN(num_classes=10).to(device)
Epoch 1/10, Loss: 1.334357043971186
Epoch 2/10, Loss: 0.8678901834256204
Epoch 3/10, Loss: 0.6820475687761136
Epoch 4/10, Loss: 0.5395163333476962
Epoch 5/10, Loss: 0.42809074324415164
Epoch 6/10, Loss: 0.3238168299350592
Epoch 7/10, Loss: 0.23283474278800628
Epoch 8/10, Loss: 0.17823990300069076
Epoch 9/10, Loss: 0.1520518097559662
Epoch 10/10, Loss: 0.1167274760296735
Test Accuracy: 75.16%

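We instantiate one more lightweight network model to compare their performances. Backpropagation is sensitive to weight initialization, so we need to make sure these two networks have the exact same initialization.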

torch.manual_seed(42)
new_nn_light = LightNN(num_classes=10).to(device)

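To ensure we have created a copy of the first network, we inspect the norm of its first layer. If the norms match, we can be reasonably confident that the two networks are indeed the same.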

# Print the norm of the first layer of the initial lightweight model
print("Norm of 1st layer of nn_light:", torch.norm(nn_light.features[0].weight).item())
# Print the norm of the first layer of the new lightweight model
print("Norm of 1st layer of new_nn_light:", torch.norm(new_nn_light.features[0].weight).item())
Norm of 1st layer of nn_light: 2.327361822128296
Norm of 1st layer of new_nn_light: 2.327361822128296
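
Matching norms are a quick sanity check rather than a proof of equality. A stricter comparison, sketched below under the assumption that both models share the same architecture, walks over the state dictionaries and compares every tensor exactly. The helper names `make_model` and `same_parameters` and the small model are hypothetical stand-ins.

```python
import torch
import torch.nn as nn

def make_model(seed: int) -> nn.Module:
    # Hypothetical small model; any nn.Module works the same way.
    torch.manual_seed(seed)
    return nn.Sequential(
        nn.Conv2d(3, 16, kernel_size=3, padding=1),
        nn.ReLU(),
        nn.Flatten(),
        nn.Linear(16 * 4 * 4, 10),
    )

def same_parameters(model_a: nn.Module, model_b: nn.Module) -> bool:
    state_a, state_b = model_a.state_dict(), model_b.state_dict()
    if state_a.keys() != state_b.keys():
        return False
    # torch.equal requires identical shapes and identical values.
    return all(torch.equal(state_a[k], state_b[k]) for k in state_a)

model_1 = make_model(seed=42)
model_2 = make_model(seed=42)
model_3 = make_model(seed=7)

print(same_parameters(model_1, model_2))  # same seed, same initialization
print(same_parameters(model_1, model_3))  # different seed
```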

Print the total number of parameters in each model:

total_params_deep = "{:,}".format(sum(p.numel() for p in nn_deep.parameters()))
print(f"DeepNN parameters: {total_params_deep}")
total_params_light = "{:,}".format(sum(p.numel() for p in nn_light.parameters()))
print(f"LightNN parameters: {total_params_light}")
DeepNN parameters: 1,186,986
LightNN parameters: 267,738
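
The totals above can be reproduced by hand from the layer definitions: a convolution holds one kernel per input/output channel pair plus one bias per output channel, and a linear layer holds a weight matrix plus one bias per output unit. A small sketch of that arithmetic (the helper names are hypothetical):

```python
def conv2d_params(in_channels: int, out_channels: int, kernel_size: int) -> int:
    # One kernel_size x kernel_size filter per (in, out) channel pair, plus biases.
    return in_channels * out_channels * kernel_size * kernel_size + out_channels

def linear_params(in_features: int, out_features: int) -> int:
    # Weight matrix plus one bias per output unit.
    return in_features * out_features + out_features

light_total = (
    conv2d_params(3, 16, 3)
    + conv2d_params(16, 16, 3)
    + linear_params(1024, 256)
    + linear_params(256, 10)
)

deep_total = (
    conv2d_params(3, 128, 3)
    + conv2d_params(128, 64, 3)
    + conv2d_params(64, 64, 3)
    + conv2d_params(64, 32, 3)
    + linear_params(2048, 512)
    + linear_params(512, 10)
)

print(f"DeepNN parameters: {deep_total:,}")
print(f"LightNN parameters: {light_total:,}")
```

In both networks the first linear layer of the classifier accounts for most of the parameters, which is why the flattened feature size matters so much for model size.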

Train and test the lightweight network with cross-entropy loss:

train(nn_light, train_loader, epochs=10, learning_rate=0.001, device=device)
test_accuracy_light_ce = test(nn_light, test_loader, device)
Epoch 1/10, Loss: 1.4695049030396639
Epoch 2/10, Loss: 1.161587066662586
Epoch 3/10, Loss: 1.0320059902527754
Epoch 4/10, Loss: 0.9294197153862175
Epoch 5/10, Loss: 0.8507145019748327
Epoch 6/10, Loss: 0.7838947931518945
Epoch 7/10, Loss: 0.7171505662181493
Epoch 8/10, Loss: 0.6594943256329393
Epoch 9/10, Loss: 0.6066120313409039
Epoch 10/10, Loss: 0.5574908487479705
Test Accuracy: 69.80%

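As we can see, based on test accuracy, we can now compare the deeper network that is to be used as a teacher with the lightweight network that is our supposed student. So far, our student has not interacted with the teacher, so this performance is achieved by the student itself. The metrics so far can be seen with the following lines: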

print(f"Teacher accuracy: {test_accuracy_deep:.2f}%")
print(f"Student accuracy: {test_accuracy_light_ce:.2f}%")
Teacher accuracy: 75.16%
Student accuracy: 69.80%

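Knowledge distillation run

Now let's try to improve the test accuracy of the student network by incorporating the teacher. Knowledge distillation is a straightforward technique to achieve this, based on the fact that both networks output a probability distribution over our classes. Therefore, the two networks share the same number of output neurons. The method works by incorporating an additional loss into the traditional cross-entropy loss, which is based on the softmax output of the teacher network. The assumption is that the output activations of a properly trained teacher network carry additional information that can be leveraged by a student network during training. The original work suggests that utilizing ratios of smaller probabilities in the soft targets can help achieve the underlying objective of deep neural networks, which is to create a similarity structure over the data where similar objects are mapped closer together. For example, in CIFAR-10, a truck could be mistaken for an automobile or airplane if its wheels are present, but it is less likely to be mistaken for a dog. Therefore, it makes sense to assume that valuable information resides not only in the top prediction of a properly trained model but in the entire output distribution. However, cross-entropy alone does not sufficiently exploit this information, as the activations for non-predicted classes tend to be so small that propagated gradients do not meaningfully change the weights to construct this desirable vector space.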

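As we continue defining our first helper function that introduces a teacher-student dynamic, we need to include a few extra parameters:

  • T: Temperature controls the smoothness of the output distributions. Larger T leads to smoother distributions, so smaller probabilities get a larger boost.

  • soft_target_loss_weight: A weight assigned to the extra objective we are about to include.

  • ce_loss_weight: A weight assigned to cross-entropy. Tuning these weights pushes the network towards optimizing for either objective.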
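
The effect of the temperature T can be seen on a single made-up example. The sketch below divides hypothetical logits by T before the softmax and prints the resulting distribution together with its entropy; the logits and the `entropy` helper are illustrative assumptions, not part of the tutorial's code.

```python
import torch
import torch.nn.functional as F

# Hypothetical logits for one image over four classes.
logits = torch.tensor([[6.0, 2.0, 1.0, -1.0]])

def entropy(p: torch.Tensor) -> torch.Tensor:
    # Shannon entropy of each row; higher means a flatter distribution.
    return -(p * p.log()).sum(dim=-1)

for T in (1.0, 2.0, 5.0):
    probs = F.softmax(logits / T, dim=-1)
    rounded = [round(v, 4) for v in probs.squeeze().tolist()]
    print(f"T={T}: probs={rounded}, entropy={entropy(probs).item():.4f}")

probs_t1 = F.softmax(logits / 1.0, dim=-1)
probs_t5 = F.softmax(logits / 5.0, dim=-1)
```

As T grows, the largest probability shrinks and the smallest one grows, so the relative ordering of the non-predicted classes becomes visible to the loss.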

Figure: Distillation loss is calculated from the logits of the networks. It only returns gradients to the student.

def train_knowledge_distillation(teacher, student, train_loader, epochs, learning_rate, T, soft_target_loss_weight, ce_loss_weight, device):
    ce_loss = nn.CrossEntropyLoss()
    optimizer = optim.Adam(student.parameters(), lr=learning_rate)

    teacher.eval()  # Teacher set to evaluation mode
    student.train() # Student to train mode

    for epoch in range(epochs):
        running_loss = 0.0
        for inputs, labels in train_loader:
            inputs, labels = inputs.to(device), labels.to(device)

            optimizer.zero_grad()

            # Forward pass with the teacher model - do not save gradients here as we do not change the teacher's weights
            with torch.no_grad():
                teacher_logits = teacher(inputs)

            # Forward pass with the student model
            student_logits = student(inputs)

            #Soften the student logits by applying softmax first and log() second
            soft_targets = nn.functional.softmax(teacher_logits / T, dim=-1)
            soft_prob = nn.functional.log_softmax(student_logits / T, dim=-1)

            # Calculate the soft targets loss. Scaled by T**2 as suggested by the authors of the paper "Distilling the knowledge in a neural network"
            soft_targets_loss = torch.sum(soft_targets * (soft_targets.log() - soft_prob)) / soft_prob.size()[0] * (T**2)

            # Calculate the true label loss
            label_loss = ce_loss(student_logits, labels)

            # Weighted sum of the two losses
            loss = soft_target_loss_weight * soft_targets_loss + ce_loss_weight * label_loss

            loss.backward()
            optimizer.step()

            running_loss += loss.item()

        print(f"Epoch {epoch+1}/{epochs}, Loss: {running_loss / len(train_loader)}")

# Apply ``train_knowledge_distillation`` with a temperature of 2. Arbitrarily set the weights to 0.75 for CE and 0.25 for distillation loss.
train_knowledge_distillation(teacher=nn_deep, student=new_nn_light, train_loader=train_loader, epochs=10, learning_rate=0.001, T=2, soft_target_loss_weight=0.25, ce_loss_weight=0.75, device=device)
test_accuracy_light_ce_and_kd = test(new_nn_light, test_loader, device)

# Compare the student test accuracy with and without the teacher, after distillation
print(f"Teacher accuracy: {test_accuracy_deep:.2f}%")
print(f"Student accuracy without teacher: {test_accuracy_light_ce:.2f}%")
print(f"Student accuracy with CE + KD: {test_accuracy_light_ce_and_kd:.2f}%")
Epoch 1/10, Loss: 2.410450777129444
Epoch 2/10, Loss: 1.8915844320336266
Epoch 3/10, Loss: 1.6681236293919557
Epoch 4/10, Loss: 1.5058916444363801
Epoch 5/10, Loss: 1.3813653934337293
Epoch 6/10, Loss: 1.270200352077289
Epoch 7/10, Loss: 1.1714994689387739
Epoch 8/10, Loss: 1.087507491983721
Epoch 9/10, Loss: 1.0078320823362112
Epoch 10/10, Loss: 0.9452352542096697
Test Accuracy: 70.61%
Teacher accuracy: 75.16%
Student accuracy without teacher: 69.80%
Student accuracy with CE + KD: 70.61%
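
The soft-target term in the training loop above is the Kullback-Leibler divergence between the softened teacher and student distributions, averaged over the batch and scaled by T**2. As a rough cross-check, assuming random logits in place of real network outputs, the same value can be obtained from `torch.nn.functional.kl_div` with `reduction="batchmean"`:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
T = 2.0
batch_size, num_classes = 8, 10

# Hypothetical logits standing in for the teacher and student outputs.
teacher_logits = torch.randn(batch_size, num_classes)
student_logits = torch.randn(batch_size, num_classes)

soft_targets = F.softmax(teacher_logits / T, dim=-1)
soft_prob = F.log_softmax(student_logits / T, dim=-1)

# Manual form, written the same way as in the training loop.
manual_loss = torch.sum(soft_targets * (soft_targets.log() - soft_prob)) / soft_prob.size(0) * (T ** 2)

# Built-in form: kl_div expects log-probabilities as input and probabilities as target.
builtin_loss = F.kl_div(soft_prob, soft_targets, reduction="batchmean") * (T ** 2)

print(manual_loss.item(), builtin_loss.item())
```

Either form can be used; the built-in call is shorter, while the manual form keeps every step of the computation visible.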

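Cosine loss minimization run

Feel free to play around with the temperature parameter, which controls the softness of the softmax function, and with the loss coefficients. In neural networks, it is easy to include additional loss functions alongside the main objective to achieve goals like better generalization. Let's try including an objective for the student, but now let's focus on its hidden states rather than its output layer. Our goal is to convey information from the teacher's representation to the student by including a naive loss function, whose minimization implies that the flattened vectors subsequently passed to the classifiers become more similar as the loss decreases. Of course, the teacher does not update its weights, so the minimization depends only on the weights of the student. The rationale behind this method is the assumption that the teacher model has a better internal representation that is unlikely to be achieved by the student without external intervention, so we artificially push the student to mimic the internal representation of the teacher.

Whether this ends up helping the student is not straightforward, though. Pushing the lightweight network towards this point could be a good thing, assuming that we have found an internal representation that leads to better test accuracy, but it could also be harmful because the networks have different architectures and the student does not have the same learning capacity as the teacher. In other words, there is no reason for these two vectors, the student's and the teacher's, to match per component. The student could reach an internal representation that is a permutation of the teacher's and it would be just as efficient. Nonetheless, we can still run a quick experiment to figure out the impact of this method. We will be using the CosineEmbeddingLoss, which is given by the following formula: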

\text{loss}(x, y) = \begin{cases} 1 - \cos(x_1, x_2), & \text{if } y = 1 \\ \max(0, \cos(x_1, x_2) - \text{margin}), & \text{if } y = -1 \end{cases}

Formula for CosineEmbeddingLoss
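
With a target of 1 for every pair, `nn.CosineEmbeddingLoss` reduces to 1 - cos(x1, x2), averaged over the batch by default. The sketch below checks this numerically on random vectors that stand in for the student and teacher representations; the tensor names are hypothetical.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
student_repr = torch.randn(8, 1024)
teacher_repr = torch.randn(8, 1024)
target = torch.ones(8)  # +1 means the pair should be similar

cosine_loss = nn.CosineEmbeddingLoss()  # default reduction is the batch mean
loss = cosine_loss(student_repr, teacher_repr, target)

# The same quantity written out with cosine_similarity.
manual = (1 - F.cosine_similarity(student_repr, teacher_repr, dim=1)).mean()
print(loss.item(), manual.item())

# Identical vectors have cosine similarity 1, so the loss is approximately zero.
loss_identical = cosine_loss(teacher_repr, teacher_repr, target)
print(loss_identical.item())
```

Minimizing this loss therefore increases the cosine similarity between the two representations, which is the behavior the training loop relies on.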

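Obviously, there is one thing we need to resolve first. When we applied distillation to the output layer, we mentioned that both networks have the same number of neurons, equal to the number of classes. However, this is not the case for the layer following our convolutional layers. Here, the teacher has more neurons than the student after the flattening of the final convolutional layer. Our loss function accepts two vectors of equal dimensionality as inputs, so we need to somehow match them. We will solve this by including an average pooling layer after the teacher's convolutional layer to reduce its dimensionality to match that of the student.

To proceed, we will modify our model classes, or create new ones. Now, the forward function returns not only the logits of the network but also the flattened hidden representation after the convolutional layer. We include the aforementioned pooling for the modified teacher.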

class ModifiedDeepNNCosine(nn.Module):
    def __init__(self, num_classes=10):
        super(ModifiedDeepNNCosine, self).__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 128, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv2d(128, 64, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=2, stride=2),
            nn.Conv2d(64, 64, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv2d(64, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=2, stride=2),
        )
        self.classifier = nn.Sequential(
            nn.Linear(2048, 512),
            nn.ReLU(),
            nn.Dropout(0.1),
            nn.Linear(512, num_classes)
        )

    def forward(self, x):
        x = self.features(x)
        flattened_conv_output = torch.flatten(x, 1)
        x = self.classifier(flattened_conv_output)
        flattened_conv_output_after_pooling = torch.nn.functional.avg_pool1d(flattened_conv_output, 2)
        return x, flattened_conv_output_after_pooling

# Create a similar student class where we return a tuple. We do not apply pooling after flattening.
class ModifiedLightNNCosine(nn.Module):
    def __init__(self, num_classes=10):
        super(ModifiedLightNNCosine, self).__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=2, stride=2),
            nn.Conv2d(16, 16, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=2, stride=2),
        )
        self.classifier = nn.Sequential(
            nn.Linear(1024, 256),
            nn.ReLU(),
            nn.Dropout(0.1),
            nn.Linear(256, num_classes)
        )

    def forward(self, x):
        x = self.features(x)
        flattened_conv_output = torch.flatten(x, 1)
        x = self.classifier(flattened_conv_output)
        return x, flattened_conv_output

# We do not have to train the modified deep network from scratch of course, we just load its weights from the trained instance
modified_nn_deep = ModifiedDeepNNCosine(num_classes=10).to(device)
modified_nn_deep.load_state_dict(nn_deep.state_dict())

# Once again ensure the norm of the first layer is the same for both networks
print("Norm of 1st layer for deep_nn:", torch.norm(nn_deep.features[0].weight).item())
print("Norm of 1st layer for modified_deep_nn:", torch.norm(modified_nn_deep.features[0].weight).item())

# Initialize a modified lightweight network with the same seed as our other lightweight instances. This will be trained from scratch to examine the effectiveness of cosine loss minimization.
torch.manual_seed(42)
modified_nn_light = ModifiedLightNNCosine(num_classes=10).to(device)
print("Norm of 1st layer:", torch.norm(modified_nn_light.features[0].weight).item())
Norm of 1st layer for deep_nn: 7.542240142822266
Norm of 1st layer for modified_deep_nn: 7.542240142822266
Norm of 1st layer: 2.327361822128296

Naturally, we need to change the train loop because the model now returns a tuple (logits, hidden_representation). Using a sample input tensor, we can print their shapes.

# Create a sample input tensor
sample_input = torch.randn(128, 3, 32, 32).to(device) # Batch size: 128, Channels: 3, Image size: 32x32

# Pass the input through the student
logits, hidden_representation = modified_nn_light(sample_input)

# Print the shapes of the tensors
print("Student logits shape:", logits.shape) # batch_size x total_classes
print("Student hidden representation shape:", hidden_representation.shape) # batch_size x hidden_representation_size

# Pass the input through the teacher
logits, hidden_representation = modified_nn_deep(sample_input)

# Print the shapes of the tensors
print("Teacher logits shape:", logits.shape) # batch_size x total_classes
print("Teacher hidden representation shape:", hidden_representation.shape) # batch_size x hidden_representation_size
Student logits shape: torch.Size([128, 10])
Student hidden representation shape: torch.Size([128, 1024])
Teacher logits shape: torch.Size([128, 10])
Teacher hidden representation shape: torch.Size([128, 1024])

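In our case, hidden_representation_size is 1024. This is the flattened feature map of the final convolutional layer of the student and, as you can see, it is the input for its classifier. It is 1024 for the teacher too, because we used avg_pool1d to reduce it from 2048 to 1024. The loss applied here only affects the weights of the student that come before the loss calculation. In other words, it does not affect the classifier of the student. The modified training loop is the following: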
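
The pooling step that brings the teacher's representation from 2048 to 1024 values can be examined in isolation. In this sketch a random tensor stands in for the flattened teacher features; with a kernel size of 2 (the stride defaults to the kernel size), `avg_pool1d` averages adjacent pairs of values.

```python
import torch
import torch.nn.functional as F

# Hypothetical flattened teacher features: 4 images, 2048 values each.
torch.manual_seed(0)
flat = torch.randn(4, 2048)

# A 2-D input is treated as (channels, length), so each row is pooled independently.
pooled = F.avg_pool1d(flat, 2)

# The same result written explicitly as the mean of adjacent pairs.
manual = flat.view(4, 1024, 2).mean(dim=-1)

print(pooled.shape)  # torch.Size([4, 1024])
```

Note that this averages neighbours in the flattened vector, which is a simple way to match dimensions rather than a semantically motivated alignment of teacher and student features.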

Figure: In cosine loss minimization, we want to maximize the cosine similarity of the two representations by returning gradients to the student.

def train_cosine_loss(teacher, student, train_loader, epochs, learning_rate, hidden_rep_loss_weight, ce_loss_weight, device):
    ce_loss = nn.CrossEntropyLoss()
    cosine_loss = nn.CosineEmbeddingLoss()
    optimizer = optim.Adam(student.parameters(), lr=learning_rate)

    teacher.to(device)
    student.to(device)
    teacher.eval()  # Teacher set to evaluation mode
    student.train() # Student to train mode

    for epoch in range(epochs):
        running_loss = 0.0
        for inputs, labels in train_loader:
            inputs, labels = inputs.to(device), labels.to(device)

            optimizer.zero_grad()

            # Forward pass with the teacher model and keep only the hidden representation
            with torch.no_grad():
                _, teacher_hidden_representation = teacher(inputs)

            # Forward pass with the student model
            student_logits, student_hidden_representation = student(inputs)

            # Calculate the cosine loss. Target is a vector of ones. From the loss formula above we can see that is the case where loss minimization leads to cosine similarity increase.
            hidden_rep_loss = cosine_loss(student_hidden_representation, teacher_hidden_representation, target=torch.ones(inputs.size(0)).to(device))

            # Calculate the true label loss
            label_loss = ce_loss(student_logits, labels)

            # Weighted sum of the two losses
            loss = hidden_rep_loss_weight * hidden_rep_loss + ce_loss_weight * label_loss

            loss.backward()
            optimizer.step()

            running_loss += loss.item()

        print(f"Epoch {epoch+1}/{epochs}, Loss: {running_loss / len(train_loader)}")

We need to modify our test function for the same reason. Here, we ignore the hidden representation returned by the model.

def test_multiple_outputs(model, test_loader, device):
    model.to(device)
    model.eval()

    correct = 0
    total = 0

    with torch.no_grad():
        for inputs, labels in test_loader:
            inputs, labels = inputs.to(device), labels.to(device)

            outputs, _ = model(inputs) # Disregard the second tensor of the tuple
            _, predicted = torch.max(outputs.data, 1)

            total += labels.size(0)
            correct += (predicted == labels).sum().item()

    accuracy = 100 * correct / total
    print(f"Test Accuracy: {accuracy:.2f}%")
    return accuracy

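In this case, we could easily include both knowledge distillation and cosine loss minimization in the same function. It is common to combine methods to achieve better performance in teacher-student paradigms. For now, we can run a simple train-test session.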

# Train and test the lightweight network with cross entropy loss plus cosine loss
train_cosine_loss(teacher=modified_nn_deep, student=modified_nn_light, train_loader=train_loader, epochs=10, learning_rate=0.001, hidden_rep_loss_weight=0.25, ce_loss_weight=0.75, device=device)
test_accuracy_light_ce_and_cosine_loss = test_multiple_outputs(modified_nn_light, test_loader, device)
Epoch 1/10, Loss: 1.3040333937501054
Epoch 2/10, Loss: 1.067599084371191
Epoch 3/10, Loss: 0.9676838376942802
Epoch 4/10, Loss: 0.8940548546173993
Epoch 5/10, Loss: 0.838494420661341
Epoch 6/10, Loss: 0.793050117199988
Epoch 7/10, Loss: 0.7527368469616337
Epoch 8/10, Loss: 0.7177437879240421
Epoch 9/10, Loss: 0.681629899670096
Epoch 10/10, Loss: 0.6526979580712136
Test Accuracy: 70.93%
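
The earlier remark that knowledge distillation and cosine loss minimization could live in the same function can be sketched as a single loss helper. This is an illustration under stated assumptions, not the tutorial's method: the function name `combined_distillation_loss` and the three default weights are hypothetical, and random tensors stand in for network outputs.

```python
import torch
import torch.nn.functional as F

def combined_distillation_loss(student_logits, student_hidden, teacher_logits, teacher_hidden, labels,
                               T=2.0, soft_target_loss_weight=0.25, hidden_rep_loss_weight=0.25, ce_loss_weight=0.5):
    # Soft-target (KD) term on temperature-scaled logits, scaled by T**2.
    soft_targets = F.softmax(teacher_logits / T, dim=-1)
    soft_prob = F.log_softmax(student_logits / T, dim=-1)
    kd_loss = F.kl_div(soft_prob, soft_targets, reduction="batchmean") * (T ** 2)

    # Cosine term on the hidden representations; a target of +1 asks for similarity.
    target = torch.ones(student_hidden.size(0), device=student_hidden.device)
    hidden_loss = F.cosine_embedding_loss(student_hidden, teacher_hidden, target)

    # Regular cross-entropy on the true labels.
    label_loss = F.cross_entropy(student_logits, labels)

    return soft_target_loss_weight * kd_loss + hidden_rep_loss_weight * hidden_loss + ce_loss_weight * label_loss

torch.manual_seed(0)
student_logits = torch.randn(8, 10, requires_grad=True)
student_hidden = torch.randn(8, 1024, requires_grad=True)
teacher_logits = torch.randn(8, 10)    # the teacher side carries no gradients
teacher_hidden = torch.randn(8, 1024)
labels = torch.randint(0, 10, (8,))

loss = combined_distillation_loss(student_logits, student_hidden, teacher_logits, teacher_hidden, labels)
loss.backward()
print(loss.item())
```

In a training loop, the teacher outputs would come from a forward pass under `torch.no_grad()`, as in the functions above, and the weights would need tuning like any other hyperparameter.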

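Intermediate regressor run

Our naive minimization does not guarantee better results for several reasons, one being the dimensionality of the vectors. Cosine similarity generally works better than Euclidean distance for higher-dimensional vectors, but we were dealing with vectors with 1024 components each, so it is much harder to extract meaningful similarities. Furthermore, as we mentioned, pushing towards a match of the hidden representations of the teacher and the student is not supported by theory. There are no good reasons why we should be aiming for a 1:1 match of these vectors.

We will provide a final example of training intervention by including an extra network called a regressor. The objective is to first extract the feature map of the teacher after a convolutional layer, then extract a feature map of the student after a convolutional layer, and finally try to match these maps. However, this time we will introduce a regressor between the networks to facilitate the matching process. The regressor will be trainable and ideally will do a better job than our naive cosine loss minimization scheme. Its main job is to match the dimensionality of these feature maps so that we can properly define a loss function between the teacher and the student. Defining such a loss function provides a teaching "path", which is basically a flow of back-propagated gradients that will change the student's weights. Focusing on the output of the convolutional layers right before each classifier in our original networks, we have the following shapes: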

# Pass the sample input only from the convolutional feature extractor
convolutional_fe_output_student = nn_light.features(sample_input)
convolutional_fe_output_teacher = nn_deep.features(sample_input)

# Print their shapes
print("Student's feature extractor output shape: ", convolutional_fe_output_student.shape)
print("Teacher's feature extractor output shape: ", convolutional_fe_output_teacher.shape)
Student's feature extractor output shape:  torch.Size([128, 16, 8, 8])
Teacher's feature extractor output shape:  torch.Size([128, 32, 8, 8])

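The teacher has 32 filters and the student has 16. We will include a trainable layer that converts the feature map of the student to the shape of the feature map of the teacher. In practice, we modify the lightweight class to return the hidden state after an intermediate regressor that matches the sizes of the convolutional feature maps, and the teacher class to return the output of the final convolutional layer without pooling or flattening.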
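
A minimal sketch of the shape matching described above, with random tensors standing in for the two feature maps: a single 3x3 convolution with padding 1 maps the student's 16 channels to the teacher's 32 while keeping the 8x8 spatial size, after which the MSE between the two tensors is well defined.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
student_map = torch.randn(128, 16, 8, 8)  # stand-in for the student's feature map
teacher_map = torch.randn(128, 32, 8, 8)  # stand-in for the teacher's feature map

# Trainable layer that lifts the student's channels to the teacher's.
regressor = nn.Conv2d(16, 32, kernel_size=3, padding=1)
regressor_output = regressor(student_map)
print(regressor_output.shape)  # torch.Size([128, 32, 8, 8])

# With equal shapes, the element-wise MSE is properly defined.
mse = nn.MSELoss()(regressor_output, teacher_map)
print(mse.item())
```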

Figure: The trainable layer matches the shapes of the intermediate tensors, so Mean Squared Error (MSE) is properly defined.

class ModifiedDeepNNRegressor(nn.Module):
    def __init__(self, num_classes=10):
        super(ModifiedDeepNNRegressor, self).__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 128, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv2d(128, 64, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=2, stride=2),
            nn.Conv2d(64, 64, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv2d(64, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=2, stride=2),
        )
        self.classifier = nn.Sequential(
            nn.Linear(2048, 512),
            nn.ReLU(),
            nn.Dropout(0.1),
            nn.Linear(512, num_classes)
        )

    def forward(self, x):
        x = self.features(x)
        conv_feature_map = x
        x = torch.flatten(x, 1)
        x = self.classifier(x)
        return x, conv_feature_map

class ModifiedLightNNRegressor(nn.Module):
    def __init__(self, num_classes=10):
        super(ModifiedLightNNRegressor, self).__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=2, stride=2),
            nn.Conv2d(16, 16, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=2, stride=2),
        )
        # Include an extra regressor (in our case linear)
        self.regressor = nn.Sequential(
            nn.Conv2d(16, 32, kernel_size=3, padding=1)
        )
        self.classifier = nn.Sequential(
            nn.Linear(1024, 256),
            nn.ReLU(),
            nn.Dropout(0.1),
            nn.Linear(256, num_classes)
        )

    def forward(self, x):
        x = self.features(x)
        regressor_output = self.regressor(x)
        x = torch.flatten(x, 1)
        x = self.classifier(x)
        return x, regressor_output

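After that, we have to update our train loop again. This time, we extract the regressor output of the student and the feature map of the teacher, we calculate the MSE on these tensors (they have the exact same shape, so it is properly defined), and we back-propagate gradients based on that loss in addition to the regular cross-entropy loss of the classification task.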

def train_mse_loss(teacher, student, train_loader, epochs, learning_rate, feature_map_weight, ce_loss_weight, device):
    ce_loss = nn.CrossEntropyLoss()
    mse_loss = nn.MSELoss()
    optimizer = optim.Adam(student.parameters(), lr=learning_rate)

    teacher.to(device)
    student.to(device)
    teacher.eval()  # Teacher set to evaluation mode
    student.train() # Student to train mode

    for epoch in range(epochs):
        running_loss = 0.0
        for inputs, labels in train_loader:
            inputs, labels = inputs.to(device), labels.to(device)

            optimizer.zero_grad()

            # Again ignore teacher logits
            with torch.no_grad():
                _, teacher_feature_map = teacher(inputs)

            # Forward pass with the student model
            student_logits, regressor_feature_map = student(inputs)

            # Calculate the loss
            hidden_rep_loss = mse_loss(regressor_feature_map, teacher_feature_map)

            # Calculate the true label loss
            label_loss = ce_loss(student_logits, labels)

            # Weighted sum of the two losses
            loss = feature_map_weight * hidden_rep_loss + ce_loss_weight * label_loss

            loss.backward()
            optimizer.step()

            running_loss += loss.item()

        print(f"Epoch {epoch+1}/{epochs}, Loss: {running_loss / len(train_loader)}")

# Notice how our test function remains the same here with the one we used in our previous case. We only care about the actual outputs because we measure accuracy.

# Initialize a ModifiedLightNNRegressor
torch.manual_seed(42)
modified_nn_light_reg = ModifiedLightNNRegressor(num_classes=10).to(device)

# We do not have to train the modified deep network from scratch of course, we just load its weights from the trained instance
modified_nn_deep_reg = ModifiedDeepNNRegressor(num_classes=10).to(device)
modified_nn_deep_reg.load_state_dict(nn_deep.state_dict())

# Train and test once again
train_mse_loss(teacher=modified_nn_deep_reg, student=modified_nn_light_reg, train_loader=train_loader, epochs=10, learning_rate=0.001, feature_map_weight=0.25, ce_loss_weight=0.75, device=device)
test_accuracy_light_ce_and_mse_loss = test_multiple_outputs(modified_nn_light_reg, test_loader, device)
Epoch 1/10, Loss: 1.7916893879775806
Epoch 2/10, Loss: 1.387149176024415
Epoch 3/10, Loss: 1.229984131310602
Epoch 4/10, Loss: 1.1302453178883818
Epoch 5/10, Loss: 1.0499040078933892
Epoch 6/10, Loss: 0.986985869114966
Epoch 7/10, Loss: 0.9317489292310632
Epoch 8/10, Loss: 0.8807459126043198
Epoch 9/10, Loss: 0.8373106495498697
Epoch 10/10, Loss: 0.7982133969931346
Test Accuracy: 71.52%
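
To make the teaching "path" concrete, the sketch below back-propagates only the feature-map MSE through a scaled-down, hypothetical student (`TinyStudent` is not one of the tutorial's classes) and inspects which parameters receive gradients. The feature extractor and the regressor lie on the path; the classifier does not, because the logits are not part of this loss.

```python
import torch
import torch.nn as nn

class TinyStudent(nn.Module):
    # Hypothetical, scaled-down student with a regressor branch.
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(nn.Conv2d(3, 4, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2))
        self.regressor = nn.Conv2d(4, 8, kernel_size=3, padding=1)
        self.classifier = nn.Linear(4 * 4 * 4, 10)

    def forward(self, x):
        x = self.features(x)
        regressor_output = self.regressor(x)
        logits = self.classifier(torch.flatten(x, 1))
        return logits, regressor_output

torch.manual_seed(0)
student = TinyStudent()
inputs = torch.randn(2, 3, 8, 8)
teacher_feature_map = torch.randn(2, 8, 4, 4)  # stand-in for the teacher's map

_, regressor_output = student(inputs)
nn.MSELoss()(regressor_output, teacher_feature_map).backward()

def has_grad(module: nn.Module) -> bool:
    return all(p.grad is not None for p in module.parameters())

print("features:", has_grad(student.features))
print("regressor:", has_grad(student.regressor))
print("classifier:", has_grad(student.classifier))
```

In the full training loop the cross-entropy term is added to the loss, so the classifier is still trained; the MSE term only shapes the layers that precede the regressor.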

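It is expected that the final method will work better than CosineLoss because we have now allowed a trainable layer between the teacher and the student, which gives the student some wiggle room when it comes to learning, rather than pushing the student to copy the teacher's representation. Including the extra network is the idea behind hint-based distillation.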

print(f"Teacher accuracy: {test_accuracy_deep:.2f}%")
print(f"Student accuracy without teacher: {test_accuracy_light_ce:.2f}%")
print(f"Student accuracy with CE + KD: {test_accuracy_light_ce_and_kd:.2f}%")
print(f"Student accuracy with CE + CosineLoss: {test_accuracy_light_ce_and_cosine_loss:.2f}%")
print(f"Student accuracy with CE + RegressorMSE: {test_accuracy_light_ce_and_mse_loss:.2f}%")
Teacher accuracy: 75.16%
Student accuracy without teacher: 69.80%
Student accuracy with CE + KD: 70.61%
Student accuracy with CE + CosineLoss: 70.93%
Student accuracy with CE + RegressorMSE: 71.52%

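Conclusion

None of the methods above increases the number of parameters of the network or its inference time, so the performance increase comes at the small cost of computing gradients during training. The regressor in the last method does add a few parameters, but it is used only for the training loss and is not needed to compute the logits. In ML applications, we mostly care about inference time because training happens before model deployment. If our lightweight model is still too heavy for deployment, we can apply different ideas, such as post-training quantization. Additional losses can be applied in many tasks, not just classification, and you can experiment with quantities like coefficients, temperature, or number of neurons. Feel free to tune any of the numbers in the tutorial above, but keep in mind that if you change the number of neurons or filters, a shape mismatch is likely to occur.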

For more information, see:

Total running time of the script: (7 minutes 45.329 seconds)
