Gradcheck mechanics¶
This note presents an overview of how the gradcheck() and gradgradcheck() functions work.
It will cover both forward and backward mode AD for both real and complex-valued functions, as well as higher-order derivatives. This note also covers both the default behavior of gradcheck and the case where the fast_mode=True argument is passed (referred to as fast gradcheck below).
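As a quick reference, here is a minimal sketch of how these two utilities are typically invoked (the function fn below is made up for the example; fast_mode is assumed to be available in your PyTorch version):

```python
import torch
from torch.autograd import gradcheck, gradgradcheck

# Hypothetical function under test; gradcheck works best with double
# precision (or complex double) inputs that require grad.
def fn(x):
    return (x ** 2).sum()

x = torch.randn(4, dtype=torch.double, requires_grad=True)

# Slow (default) gradcheck: reconstructs and compares full Jacobians.
assert gradcheck(fn, (x,))
# Fast gradcheck: compares a single scalar reduction instead.
assert gradcheck(fn, (x,), fast_mode=True)
# Second-order check described at the end of this note.
assert gradgradcheck(fn, (x,))
```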
Notations and background information¶
Throughout this note, we will use the following convention:
$x$, $y$, $a$, $b$, $v$, $u$, $ur$ and $ui$ are real-valued vectors and $z$ is a complex-valued vector that can be rewritten in terms of two real-valued vectors as $z = a + i b$.
$N$ and $M$ are two integers that we will use for the dimension of the input and output space respectively.
$f: \mathbb{R}^N \to \mathbb{R}^M$ is our basic real-to-real function such that $y = f(x)$.
$g: \mathbb{C}^N \to \mathbb{R}^M$ is our basic complex-to-real function such that $y = g(z)$.
For the simple real-to-real case, we write as $J_f$ the Jacobian matrix associated with $f$, of size $M \times N$. This matrix contains all the partial derivatives such that the entry at position $(i, j)$ contains $\frac{\partial y_i}{\partial x_j}$. Backward mode AD is then computing, for a given vector $v$ of size $M$, the quantity $v^T J_f$. Forward mode AD, on the other hand, is computing, for a given vector $u$ of size $N$, the quantity $J_f u$.
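For illustration, here is a small sketch (with a made-up function f) of how these two products can be obtained in PyTorch: torch.autograd.grad gives the vector-Jacobian product $v^T J_f$ and torch.autograd.functional.jvp gives the Jacobian-vector product $J_f u$:

```python
import torch

# Toy real-to-real function f: R^N -> R^M (chosen only for illustration).
def f(x):
    return torch.stack([x.sum(), (x ** 2).sum(), x.prod()])

N, M = 4, 3
x = torch.randn(N, dtype=torch.double, requires_grad=True)
v = torch.randn(M, dtype=torch.double)
u = torch.randn(N, dtype=torch.double)

# Backward mode AD computes v^T J_f (a vector of size N).
vT_J = torch.autograd.grad(f(x), x, grad_outputs=v)[0]

# Forward mode AD computes J_f u (a vector of size M).
_, J_u = torch.autograd.functional.jvp(f, (x,), (u,))

# Both reductions agree on the scalar v^T J_f u.
assert torch.allclose(vT_J @ u, v @ J_u)
```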
For functions that contain complex values, the story is a lot more complex. We only provide the gist here; the full description can be found in Autograd for Complex Numbers.
The constraints to satisfy complex differentiability (the Cauchy-Riemann equations) are too restrictive for all real-valued loss functions, so we instead opted to use Wirtinger calculus. In a basic setting of Wirtinger calculus, the chain rule requires access to both the Wirtinger derivative (called $W$ below) and the Conjugate Wirtinger derivative (called $CW$ below). Both $W$ and $CW$ need to be propagated because, in general, despite their names, one is not the complex conjugate of the other.
To avoid having to propagate both values, for backward mode AD we always work under the assumption that the function whose derivative is being calculated is either a real-valued function or is part of a bigger real-valued function. This assumption means that all the intermediary gradients we compute during the backward pass are also associated with real-valued functions. In practice, this assumption is not restrictive when doing optimization, as such problems require real-valued objectives (since there is no natural ordering of the complex numbers).
Under this assumption, using the $W$ and $CW$ definitions, we can show that $W = CW^*$ (we use $*$ to denote complex conjugation here), and so only one of the two values actually needs to be “backwarded through the graph” as the other one can easily be recovered. To simplify internal computations, PyTorch uses $2 * CW$ as the value it backwards and returns when the user asks for gradients. Similarly to the real case, when the output is actually in $\mathbb{R}^M$, backward mode AD does not compute $2 * CW$ but only $v^T (2 * CW)$ for a given vector $v \in \mathbb{R}^M$.
For forward mode AD, we use a similar logic, in this case assuming that the function is part of a larger function whose input is in $\mathbb{R}$. Under this assumption, we can make a similar claim that every intermediary result corresponds to a function whose input is in $\mathbb{R}$, and in this case, using the $W$ and $CW$ definitions, we can show that $W = CW$ for the intermediary functions. To make sure the forward and backward mode compute the same quantities in the elementary case of a one-dimensional function, the forward mode also computes $2 * CW$. Similarly to the real case, when the input is actually in $\mathbb{R}^N$, forward mode AD does not compute $2 * CW$ but only $(2 * CW) u$ for a given vector $u \in \mathbb{R}^N$.
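A minimal sketch of this convention in practice, assuming the example function $g(z) = |z|^2$, for which $CW = \partial y / \partial z^* = z$, so the gradient returned by autograd should be $2 * CW = 2z$ under the convention above:

```python
import torch

# g(z) = |z|^2 is real-valued, with CW = dy/dz* = z, so the value
# autograd returns as the gradient should be 2 * CW = 2 * z.
z = torch.tensor(1.0 + 2.0j, requires_grad=True)
y = (z * z.conj()).real
y.backward()
assert torch.allclose(z.grad, 2 * z.detach())
```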
Default backward mode gradcheck behavior¶
Real-to-real functions¶
To test a function $f: \mathbb{R}^N \to \mathbb{R}^M,\ x \to y$, we reconstruct the full Jacobian matrix $J_f$ of size $M \times N$ in two ways: analytically and numerically. The analytical version uses our backward mode AD while the numerical version uses finite differences. The two reconstructed Jacobian matrices are then compared elementwise for equality.
Default real input numerical evaluation¶
If we consider the elementary case of a one-dimensional function ($N = M = 1$), then we can use the basic finite difference formula from the wikipedia article. We use the “central difference” for better numerical properties:

$$\frac{\partial y}{\partial x} \approx \frac{f(x + eps) - f(x - eps)}{2 * eps}$$
This formula easily generalizes for multiple outputs ($M \gt 1$) by having $\frac{\partial y}{\partial x}$ be a column vector of size $M \times 1$, like $f(x + eps)$. In that case, the above formula can be re-used as-is and approximates the full Jacobian matrix with only two evaluations of the user function (namely $f(x + eps)$ and $f(x - eps)$).
Handling the case with multiple inputs ($N \gt 1$) is more computationally expensive. In this scenario, we loop over all the inputs one after the other and apply the $eps$ perturbation to each element of $x$ in turn. This allows us to reconstruct the $J_f$ matrix column by column.
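A simplified sketch of this column-by-column numerical reconstruction (not PyTorch's actual implementation, which also handles sparse inputs, multiple tensors, and dtype details):

```python
import torch

def numerical_jacobian(f, x, eps=1e-6):
    """Sketch: reconstruct J_f column by column with central differences."""
    y = f(x)
    jac = torch.zeros(y.numel(), x.numel(), dtype=x.dtype)
    for j in range(x.numel()):
        delta = torch.zeros_like(x)
        delta.view(-1)[j] = eps
        # Each column requires two evaluations of the user function.
        col = (f(x + delta) - f(x - delta)) / (2 * eps)
        jac[:, j] = col.reshape(-1)
    return jac
```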
Default real input analytical evaluation¶
For the analytical evaluation, we use the fact, described above, that backward mode AD computes $v^T J_f$. For functions with a single output, we simply use $v = 1$ to recover the full Jacobian matrix with a single backward pass.
For functions with more than one output, we resort to a for-loop which iterates over the outputs, where each $v$ is a one-hot vector corresponding to one output after the other. This allows us to reconstruct the $J_f$ matrix row by row.
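A matching sketch of the row-by-row analytical reconstruction; the slow gradcheck then essentially compares the two reconstructed matrices elementwise:

```python
import torch

def analytical_jacobian(f, x):
    """Sketch: reconstruct J_f row by row from vector-Jacobian products."""
    x = x.detach().requires_grad_(True)
    y = f(x)
    jac = torch.zeros(y.numel(), x.numel(), dtype=x.dtype)
    for i in range(y.numel()):
        v = torch.zeros_like(y)
        v.view(-1)[i] = 1.0  # one-hot v selects one output, i.e. one row of J_f
        (row,) = torch.autograd.grad(y, x, grad_outputs=v, retain_graph=True)
        jac[i, :] = row.reshape(-1)
    return jac
```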
Complex-to-real functions¶
To test a function $g: \mathbb{C}^N \to \mathbb{R}^M,\ z \to y$ with $z = a + i b$, we reconstruct the (complex-valued) matrix that contains $2 * CW$.
Default complex input numerical evaluation¶
Consider the elementary case where $N = M = 1$ first. We know from (chapter 3 of) this research paper that:

$$CW := \frac{\partial y}{\partial z^*} = \frac{1}{2} \left(\frac{\partial y}{\partial a} + i \frac{\partial y}{\partial b}\right)$$
Note that $\frac{\partial y}{\partial a}$ and $\frac{\partial y}{\partial b}$, in the above equation, are $\mathbb{R} \to \mathbb{R}$ derivatives. To evaluate these numerically, we use the method described above for the real-to-real case. This allows us to compute the $CW$ matrix and then multiply it by $2$.
Note that the code, as of the time of writing, computes this value in a slightly convoluted way:
```python
# Code from https://github.com/pytorch/pytorch/blob/58eb23378f2a376565a66ac32c93a316c45b6131/torch/autograd/gradcheck.py#L99-L105
# Notation changes in this code block:
# s here is y above
# x, y here are a, b above

ds_dx = compute_gradient(eps)
ds_dy = compute_gradient(eps * 1j)
# conjugate wirtinger derivative
conj_w_d = 0.5 * (ds_dx + ds_dy * 1j)
# wirtinger derivative
w_d = 0.5 * (ds_dx - ds_dy * 1j)
d[d_idx] = grad_out.conjugate() * conj_w_d + grad_out * w_d.conj()

# Since grad_out is always 1, and W and CW are complex conjugate of each other, the last line
# ends up computing exactly `conj_w_d + w_d.conj() = conj_w_d + conj_w_d = 2 * conj_w_d`.
```
Default complex input analytical evaluation¶
Since backward mode AD already computes exactly twice the $CW$ derivative, we simply use the same trick as for the real-to-real case here and reconstruct the matrix row by row when there are multiple real outputs.
Functions with complex outputs¶
In this case, the user-provided function does not follow the assumption from autograd that the function we compute backward AD for is real-valued. This means that using autograd directly on this function is not well defined. To solve this, we replace the test of the function $h: \mathbb{P}^N \to \mathbb{C}^M$ (where $\mathbb{P}$ can be either $\mathbb{R}$ or $\mathbb{C}$) with two functions, $hr$ and $hi$, such that:

$$\begin{aligned} hr(q) &:= real(f(q)) \\ hi(q) &:= imag(f(q)) \end{aligned}$$

where $q \in \mathbb{P}$. We then do a basic gradcheck for both $hr$ and $hi$, using either the real-to-real or the complex-to-real case described above, depending on $\mathbb{P}$.
Note that the code, as of the time of writing, does not create these functions explicitly but performs the chain rule with the $real$ or $imag$ functions manually by passing the $\text{grad\_out}$ arguments to the different functions. When $\text{grad\_out} = 1$, we are considering $hr$. When $\text{grad\_out} = 1j$, we are considering $hi$.
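Conceptually, the split looks like the following sketch (h is a made-up complex-output function for the example; the real code performs this implicitly via grad_out as described above):

```python
import torch

# Hypothetical complex-output function under test.
def h(q):
    return q * (1 + 2j)

# gradcheck conceptually tests the two real-valued functions hr and hi instead.
def hr(q):
    return torch.real(h(q))

def hi(q):
    return torch.imag(h(q))

q = torch.randn(3, dtype=torch.cdouble, requires_grad=True)
assert torch.autograd.gradcheck(hr, (q,))
assert torch.autograd.gradcheck(hi, (q,))
```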
Fast backward mode gradcheck¶
While the above formulation of gradcheck is great, both to ensure correctness and debuggability, it is very slow because it reconstructs the full Jacobian matrices. This section presents a way to perform gradcheck faster without affecting its correctness. The debuggability can be recovered by adding special logic when we detect an error: in that case, we can run the default version that reconstructs the full matrix to give full details to the user.
The high-level strategy here is to find a scalar quantity that can be computed efficiently by both the numerical and analytical methods and that represents the full matrix computed by the slow gradcheck well enough to ensure that it will catch any discrepancy in the Jacobians.
Fast gradcheck for real-to-real functions¶
The scalar quantity that we want to compute here is $v^T J_f u$ for a given random vector $v$ and a random unit-norm vector $u$.
For the numerical evaluation, we can efficiently compute

$$J_f u \approx \frac{f(x + u * eps) - f(x - u * eps)}{2 * eps}$$
We then perform the dot product of this vector with $v$ to get the scalar value of interest.
For the analytical version, we can use backward mode AD to compute $v^T J_f$ directly. We then perform the dot product with $u$ to get the expected value.
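A simplified sketch of both sides of this fast check, using a made-up function f:

```python
import torch

def f(x):
    return torch.stack([x.sum(), (x ** 3).sum()])

x = torch.randn(5, dtype=torch.double, requires_grad=True)
v = torch.randn(2, dtype=torch.double)
u = torch.randn(5, dtype=torch.double)
u = u / u.norm()  # random unit-norm vector
eps = 1e-6

# Numerical side: J_f u from a single central difference along u, then dot with v.
Jf_u = (f(x.detach() + u * eps) - f(x.detach() - u * eps)) / (2 * eps)
numerical = v @ Jf_u

# Analytical side: v^T J_f from one backward pass, then dot with u.
(vT_Jf,) = torch.autograd.grad(f(x), x, grad_outputs=v)
analytical = vT_Jf @ u

assert torch.allclose(numerical, analytical, atol=1e-5, rtol=1e-4)
```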
Fast gradcheck for complex-to-real functions¶
Similar to the real-to-real case, we want to perform a reduction of the full matrix. But the $2 * CW$ matrix is complex-valued, so in this case we will compare to complex scalars.
Due to some constraints on what we can compute efficiently in the numerical case, and to keep the number of numerical evaluations to a minimum, we compute the following (albeit surprising) scalar quantity:

$$s := 2 * v^T (real(CW)\, ur + i * imag(CW)\, ui)$$
where $v \in \mathbb{R}^M$, $ur \in \mathbb{R}^N$ and $ui \in \mathbb{R}^N$.
Fast complex input numerical evaluation¶
We first consider how to compute $s$ with a numerical method. To do so, keeping in mind that we’re considering $g: \mathbb{C}^N \to \mathbb{R}^M,\ z \to y$ with $z = a + i b$, and that $CW = \frac{1}{2} \left(\frac{\partial y}{\partial a} + i \frac{\partial y}{\partial b}\right)$, we rewrite it as follows:

$$\begin{aligned} s &= 2 * v^T (real(CW)\, ur + i * imag(CW)\, ui) \\ &= v^T \left(\frac{\partial y}{\partial a} ur + i * \frac{\partial y}{\partial b} ui\right) \\ &= v^T \left(\frac{\partial y}{\partial a} ur\right) + i * v^T \left(\frac{\partial y}{\partial b} ui\right) \end{aligned}$$
In this formula, we can see that $\frac{\partial y}{\partial a} ur$ and $\frac{\partial y}{\partial b} ui$ can be evaluated the same way as the fast version for the real-to-real case. Once these real-valued quantities have been computed, we can reconstruct the complex vector on the right side and do a dot product with the real-valued $v$ vector.
Fast complex input analytical evaluation¶
For the analytical case, things are simpler and we rewrite the formula as:

$$\begin{aligned} s &= 2 * v^T (real(CW)\, ur + i * imag(CW)\, ui) \\ &= v^T real(2 * CW)\, ur + i * v^T imag(2 * CW)\, ui \\ &= real(v^T (2 * CW))\, ur + i * imag(v^T (2 * CW))\, ui \end{aligned}$$
We can thus use the fact that backward mode AD provides us with an efficient way to compute $v^T (2 * CW)$, and then perform a dot product of the real part with $ur$ and of the imaginary part with $ui$ before reconstructing the final complex scalar $s$.
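A simplified sketch of both evaluations of $s$ for a made-up complex-to-real function $g(z) = \sum |z|^2$ (written here for a single complex input tensor only):

```python
import torch

def g(z):
    return (z * z.conj()).real.sum().reshape(1)

N, M = 3, 1
a = torch.randn(N, dtype=torch.double)
b = torch.randn(N, dtype=torch.double)
z = torch.complex(a, b).requires_grad_(True)
v = torch.randn(M, dtype=torch.double)
ur = torch.randn(N, dtype=torch.double)
ui = torch.randn(N, dtype=torch.double)
eps = 1e-6

def g_ab(a, b):
    return g(torch.complex(a, b))

# Numerical side: (dy/da) ur and (dy/db) ui via one central difference each.
dyda_ur = (g_ab(a + ur * eps, b) - g_ab(a - ur * eps, b)) / (2 * eps)
dydb_ui = (g_ab(a, b + ui * eps) - g_ab(a, b - ui * eps)) / (2 * eps)
s_numerical = torch.complex(v @ dyda_ur, v @ dydb_ui)

# Analytical side: backward mode gives v^T (2 * CW); reduce its real part
# with ur and its imaginary part with ui.
(vT_2CW,) = torch.autograd.grad(g(z), z, grad_outputs=v)
s_analytical = torch.complex(vT_2CW.real @ ur, vT_2CW.imag @ ui)

assert torch.allclose(s_numerical, s_analytical, atol=1e-5)
```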
Why not use a complex $u$¶
At this point, you might be wondering why we did not select a complex $u$ and just performed the reduction $2 * v^T CW u'$. To dive into this, in this paragraph we will use the complex version of $u$, noted $u' = ur' + i ui'$. Using such a complex $u'$, the problem is that when doing the numerical evaluation, we would need to compute:

$$\begin{aligned} 2 * v^T CW u' &= v^T \left(\frac{\partial y}{\partial a} + i \frac{\partial y}{\partial b}\right)(ur' + i ui') \\ &= v^T \frac{\partial y}{\partial a} ur' + i\, v^T \frac{\partial y}{\partial a} ui' + i\, v^T \frac{\partial y}{\partial b} ur' - v^T \frac{\partial y}{\partial b} ui' \end{aligned}$$
This would require four evaluations of real-to-real finite differences (twice as many as the approach proposed above). Since this approach does not have more degrees of freedom (the same number of real-valued variables) and we are trying to get the fastest possible evaluation here, we use the other formulation above.
Fast gradcheck for functions with complex outputs¶
Just like in the slow case, we consider two real-valued functions and use the appropriate rule from above for each function.
Gradgradcheck implementation¶
PyTorch also provides a utility to verify second-order gradients. The goal here is to make sure that the backward implementation is also properly differentiable and computes the right thing.
This feature is implemented by considering the function $F: x, v \to v^T J_f$ and using the gradcheck defined above on this function. Note that $v$ in this case is just a random vector with the same type as $f(x)$.
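A conceptual sketch of this construction, with a made-up function f (the real implementation handles multiple inputs and outputs, complex types, and the various gradcheck options):

```python
import torch

# gradgradcheck is roughly gradcheck applied to F(x, v) = v^T J_f,
# where the backward pass is built with create_graph=True so it is
# itself differentiable.
def f(x):
    return (x ** 3).sum().reshape(1)

def F(x, v):
    y = f(x)
    (vT_Jf,) = torch.autograd.grad(y, x, grad_outputs=v, create_graph=True)
    return vT_Jf

x = torch.randn(4, dtype=torch.double, requires_grad=True)
v = torch.randn(1, dtype=torch.double, requires_grad=True)  # same type as f(x)

assert torch.autograd.gradcheck(F, (x, v))
```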
The fast version of gradgradcheck is implemented by using the fast version of gradcheck on that same function $F$.