Manually/explicitly calculate gradients of Conv kernels

111414 · April 30, 2022, 6:23am

1. Background:

I can calculate the gradient of a cost function loss with respect to x in two ways: (1) by writing out the explicit analytic formula by hand, and (2) by using the torch.autograd package. Here is my example:

import torch
import torch.nn.functional as F

for i in range(10):
    x = torch.randn(8, 1, 128, 128)
    y = torch.randn(8, 512, 4, 4)
    k = torch.randn(512, 1, 32, 32)

    loss = lambda z: 0.5 * (F.conv2d(z, k, stride=32) - y).pow(2).sum(dim=[1,2,3])  # cost function is [(1/2)||k*x-y||_F^2]

    # 1: calculate gradient of x explicitly and manually
    x_grad_manual = F.conv2d(x, k, stride=32) - y
    x_grad_manual = F.conv_transpose2d(x_grad_manual, k, stride=32)

    # 2: calculate gradient of x using torch.autograd
    x_var = x.detach().requires_grad_(True)  # leaf tensor; torch.autograd.Variable is deprecated
    x_var_loss = loss(x_var)
    x_grad_auto = torch.autograd.grad(x_var_loss, x_var, torch.ones_like(x_var_loss), create_graph=True, retain_graph=True)[0]

    # check if the results of implementations 1 and 2 are equal
    print((x_grad_manual - x_grad_auto).pow(2).mean())

Since the mean squared error between the results of the two implementations is very small (about 3.4*10^(-8)), I think that they match and that the manual implementation works correctly.
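
The formula used in implementation 1 can be justified in one line. Writing the strided convolution with k as a linear operator K (a notation introduced here only for illustration), the cost is a least-squares term, and its gradient is the adjoint of K applied to the residual. F.conv_transpose2d with the same kernel and stride computes this adjoint when (input size - kernel size) is divisible by the stride, as with 128 and 32 here:

```latex
L(x) = \tfrac{1}{2}\,\lVert Kx - y \rVert_F^2
\quad\Longrightarrow\quad
\nabla_x L(x) = K^{\top}(Kx - y)
```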

2. My Problem

I am confused about how to conveniently write out the explicit gradients with respect to variables (features and Conv kernels) when the forward pass is a composition of several operations. For instance, I do not know how to calculate the gradients with respect to the feature x and the Conv kernel w1 in the following context:

import torch
import torch.nn.functional as F

for i in range(10):
    x = torch.randn(8, 32, 128, 128)
    y = torch.randn(8, 512, 4, 4)
    k = torch.randn(512, 1, 32, 32)
    w1 = torch.randn(1, 32, 3, 3)

    def loss(z, w):
        z_forward = F.conv2d(z, w, padding=1)  # z = w1 * x
        return 0.5 * (F.conv2d(z_forward, k, stride=32) - y).pow(2).sum(dim=[1,2,3])  # cost function is [(1/2)||k*z-y||_F^2]

    # 1: calculate gradients of x and w1 explicitly and manually
    x_grad_manual = ???
    w1_grad_manual = ???

    # 2: calculate gradients of x and w1 using torch.autograd
    x_var = x.detach().requires_grad_(True)  # leaf tensors; torch.autograd.Variable is deprecated
    w1_var = w1.detach().requires_grad_(True)
    x_var_loss = loss(x_var, w1_var)
    x_grad_auto, w1_grad_auto = torch.autograd.grad(x_var_loss, [x_var, w1_var], torch.ones_like(x_var_loss), create_graph=True, retain_graph=True)

    # check if the results of implementations 1 and 2 are equal
    print((x_grad_manual - x_grad_auto).pow(2).mean())
    print((w1_grad_manual - w1_grad_auto).pow(2).mean())
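
One possible way to fill in the two ??? placeholders above is to apply the chain rule by hand, one layer at a time. The following is a minimal sketch, not a definitive implementation. The helper name manual_grads_two_layer, the small tensor sizes and the float64 dtype are assumptions made only to keep the check fast and tight. It uses torch.nn.grad.conv2d_weight for the kernel gradient and assumes that conv_transpose2d needs no output_padding, which holds when (input size + 2 * padding - kernel size) is divisible by the stride.

```python
import torch
import torch.nn.functional as F
from torch.nn.grad import conv2d_weight


def manual_grads_two_layer(x, w1, k, y, stride):
    """Hypothetical helper: gradients of 0.5 * ||k * (w1 * x) - y||^2 w.r.t. x and w1."""
    z = F.conv2d(x, w1, padding=1)              # forward: z = w1 * x
    r = F.conv2d(z, k, stride=stride) - y       # residual k * z - y
    # Gradient w.r.t. z: the adjoint of the strided conv applied to the residual.
    z_grad = F.conv_transpose2d(r, k, stride=stride)
    # Chain rule through z = w1 * x.
    x_grad = F.conv_transpose2d(z_grad, w1, padding=1)
    w1_grad = conv2d_weight(x, w1.shape, z_grad, padding=1)
    return x_grad, w1_grad


# Small sizes (an assumption, chosen only to keep the check fast).
torch.manual_seed(0)
x = torch.randn(2, 3, 16, 16, dtype=torch.float64)
y = torch.randn(2, 5, 2, 2, dtype=torch.float64)
k = torch.randn(5, 1, 8, 8, dtype=torch.float64)
w1 = torch.randn(1, 3, 3, 3, dtype=torch.float64)

x_grad_manual, w1_grad_manual = manual_grads_two_layer(x, w1, k, y, stride=8)

# Reference gradients from autograd.
x_var = x.clone().requires_grad_(True)
w1_var = w1.clone().requires_grad_(True)
z_var = F.conv2d(x_var, w1_var, padding=1)
loss = 0.5 * (F.conv2d(z_var, k, stride=8) - y).pow(2).sum()
x_grad_auto, w1_grad_auto = torch.autograd.grad(loss, [x_var, w1_var])

print((x_grad_manual - x_grad_auto).abs().max())
print((w1_grad_manual - w1_grad_auto).abs().max())
```

Summing the loss over the whole batch gives the same gradients as passing torch.ones_like(...) as grad_outputs to a per-sample loss, which is what the code above does.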

3. Extension:

Furthermore, if the forward pass is more complicated than the one above, with two intermediate Conv layers and a ReLU activation, how can I write out the gradients? Please see the following problem:

import torch
import torch.nn.functional as F

for i in range(10):
    x = torch.randn(8, 32, 128, 128)
    y = torch.randn(8, 512, 4, 4)
    k = torch.randn(512, 1, 32, 32)
    w1 = torch.randn(32, 32, 3, 3)
    w2 = torch.randn(1, 32, 3, 3)

    def loss(z, q1, q2):
        z_forward = F.conv2d(z, q1, padding=1)  # z = w1 * x
        z_forward = F.relu(z_forward, inplace=True)  # z = ReLU(w1 * x)
        z_forward = F.conv2d(z_forward, q2, padding=1)  # z = w2 * ReLU(w1 * x)
        return 0.5 * (F.conv2d(z_forward, k, stride=32) - y).pow(2).sum(dim=[1,2,3])  # cost function is [(1/2)||k*z-y||_F^2]

    # 1: calculate gradients of x, w1 and w2 explicitly and manually
    x_grad_manual = ???
    w1_grad_manual = ???
    w2_grad_manual = ???

    # 2: calculate gradients of x, w1 and w2 using torch.autograd
    x_var = x.detach().requires_grad_(True)  # leaf tensors; torch.autograd.Variable is deprecated
    w1_var = w1.detach().requires_grad_(True)
    w2_var = w2.detach().requires_grad_(True)
    x_var_loss = loss(x_var, w1_var, w2_var)
    x_grad_auto, w1_grad_auto, w2_grad_auto = torch.autograd.grad(x_var_loss, [x_var, w1_var, w2_var], torch.ones_like(x_var_loss), create_graph=True, retain_graph=True)

    # check if the results of implementations 1 and 2 are equal
    print((x_grad_manual - x_grad_auto).pow(2).mean())
    print((w1_grad_manual - w1_grad_auto).pow(2).mean())
    print((w2_grad_manual - w2_grad_auto).pow(2).mean())
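
The same layer-by-layer chain rule extends to the ReLU case: keep the intermediates of the forward pass, then walk backwards and multiply by the ReLU mask where the activation was applied. This is a minimal sketch under stated assumptions. The helper name manual_grads_three_layer, the small tensor sizes and float64 are illustrative choices, and the forward pass uses a non-in-place ReLU because the pre-activation is needed again for the mask.

```python
import torch
import torch.nn.functional as F
from torch.nn.grad import conv2d_weight


def manual_grads_three_layer(x, w1, w2, k, y, stride):
    """Hypothetical helper: gradients of 0.5 * ||k * (w2 * ReLU(w1 * x)) - y||^2."""
    # Forward pass, keeping the intermediates that the backward pass needs.
    a = F.conv2d(x, w1, padding=1)              # pre-activation
    h = F.relu(a)                               # not in-place: a is reused below
    z = F.conv2d(h, w2, padding=1)
    r = F.conv2d(z, k, stride=stride) - y       # residual
    # Backward pass, one layer at a time, from the loss back to x.
    z_grad = F.conv_transpose2d(r, k, stride=stride)
    w2_grad = conv2d_weight(h, w2.shape, z_grad, padding=1)
    h_grad = F.conv_transpose2d(z_grad, w2, padding=1)
    a_grad = h_grad * (a > 0).to(a.dtype)       # ReLU passes gradient only where a > 0
    w1_grad = conv2d_weight(x, w1.shape, a_grad, padding=1)
    x_grad = F.conv_transpose2d(a_grad, w1, padding=1)
    return x_grad, w1_grad, w2_grad


# Small sizes (an assumption, chosen only to keep the check fast).
torch.manual_seed(0)
x = torch.randn(2, 3, 16, 16, dtype=torch.float64)
y = torch.randn(2, 5, 2, 2, dtype=torch.float64)
k = torch.randn(5, 1, 8, 8, dtype=torch.float64)
w1 = torch.randn(4, 3, 3, 3, dtype=torch.float64)
w2 = torch.randn(1, 4, 3, 3, dtype=torch.float64)

manual = manual_grads_three_layer(x, w1, w2, k, y, stride=8)

# Reference gradients from autograd.
leaves = [t.clone().requires_grad_(True) for t in (x, w1, w2)]
h_var = F.relu(F.conv2d(leaves[0], leaves[1], padding=1))
z_var = F.conv2d(h_var, leaves[2], padding=1)
loss = 0.5 * (F.conv2d(z_var, k, stride=8) - y).pow(2).sum()
auto = torch.autograd.grad(loss, leaves)

for g_manual, g_auto in zip(manual, auto):
    print((g_manual - g_auto).abs().max())
```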

4. Guarantee of Differentiability

As in my first example, I hope that the manual gradient calculations are fully explicit and are themselves differentiable, so that I can inject some of them into my neural network implementation. Could you please teach me how to achieve this?
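
Because the explicit formula from the first example is built only from F.conv2d and F.conv_transpose2d, it can be differentiated again by the outer backward pass. The sketch below illustrates this under stated assumptions: the inner gradient is used in one hypothetical gradient step on x, an arbitrary outer loss is placed on the result, and the outer gradient with respect to k is compared between the manual formula and torch.autograd.grad with create_graph=True. The function names, the step size and the outer loss are illustrative only.

```python
import torch
import torch.nn.functional as F


def x_grad_manual(x, k, y, stride):
    """Explicit gradient of 0.5 * ||k * x - y||^2 w.r.t. x, built from differentiable ops."""
    return F.conv_transpose2d(F.conv2d(x, k, stride=stride) - y, k, stride=stride)


def x_grad_auto(x, k, y, stride):
    """The same gradient from autograd; create_graph=True keeps it differentiable."""
    x_var = x.detach().requires_grad_(True)
    loss = 0.5 * (F.conv2d(x_var, k, stride=stride) - y).pow(2).sum()
    return torch.autograd.grad(loss, x_var, create_graph=True)[0]


torch.manual_seed(0)
x = torch.randn(2, 1, 16, 16, dtype=torch.float64)
y = torch.randn(2, 5, 2, 2, dtype=torch.float64)
k = torch.randn(5, 1, 8, 8, dtype=torch.float64, requires_grad=True)  # learnable kernel

step = 1e-3  # hypothetical inner step size


def outer_loss(inner_grad):
    # One inner gradient step on x, followed by an arbitrary outer objective.
    x_new = x - step * inner_grad
    return x_new.pow(2).mean()


# Differentiate the outer loss through the inner gradient, with respect to k.
k_grad_manual = torch.autograd.grad(outer_loss(x_grad_manual(x, k, y, 8)), k)[0]
k_grad_auto = torch.autograd.grad(outer_loss(x_grad_auto(x, k, y, 8)), k)[0]

print((k_grad_manual - k_grad_auto).abs().max())
```

If the two outer gradients agree, as they should up to round-off, the nested autograd call and the explicit formula are interchangeable inside a larger network, at least for this cost function.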

5. The Reason for Posting This Problem

In a neural network I constructed, I need to calculate the gradients of my pre-defined cost functions (as you can see above) with respect to some features and Conv kernels. In my current implementation, I directly use the torch.autograd package to calculate these gradients. However, it seems that some errors accumulate and mislead the learning process when I train such a network.

(The whole neural network has its own loss function and backward process. I just added some extra inner gradient calculations to achieve my goals.)

I conjecture that I should calculate the gradients manually and not use torch.autograd directly inside an ordinary network forward pass, since some computational graphs and backward passes may be nested and lead to wrong weight updates.

In my experiments, I trained two networks (with manual and automatic calculations, as in the first example) and got similar results. But when I extend to more complicated forward passes (like the two problems I posted), training is not stable. So I want to write out the gradients manually to avoid implementation mistakes and conduct more experiments.

KFrank · April 30, 2022, 9:29pm

Hi Bin!

To get the gradient of the result of one function applied to the result of
another function, that is, of the composition of two functions, you would
use the chain rule. This is how autograd computes the gradient when
many functions are composed together, such as the successive layers
in a network.
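
As an illustration of the chain rule for the two-layer problem in the question, write the convolution with w1 as a linear operator W_1 and the strided convolution with k as K (notation introduced here only for this sketch). The gradient with respect to the intermediate feature z is computed first and then pushed back through the first layer:

```latex
\begin{aligned}
z &= W_1 x, \qquad L = \tfrac{1}{2}\,\lVert K z - y \rVert_F^2 \\
\frac{\partial L}{\partial z} &= K^{\top}(K z - y) \\
\frac{\partial L}{\partial x} &= W_1^{\top}\,\frac{\partial L}{\partial z} \\
\frac{\partial L}{\partial w_1} &= \Big(\frac{\partial z}{\partial w_1}\Big)^{\top}\frac{\partial L}{\partial z}
\end{aligned}
```

With a ReLU between two Conv layers, the same pattern applies, with an extra elementwise multiplication by the mask of positive pre-activations at the point where the gradient passes back through the ReLU.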

It is true that floating-point round-off error can accumulate during
backpropagation (as it can during the forward pass, as well). Underflow
and overflow "errors" can occur as well. Nonetheless, autograd does
an altogether solid job of performing these numerical computations.
It is unlikely that you will be able to do better calculating your own
gradients, or, in effect, writing your own version of autograd.

If you really are having problems with numerical stability during
backpropagation, you would be better off identifying the root cause
and addressing it directly, presumably by using more numerically
stable functions or implementations in your forward pass.
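
One hedged way to start looking for a root cause is to measure how much round-off is actually present, by computing the same gradient in float32 and in float64 and comparing the two. The sketch below does this for the first cost function in the question. The helper name x_grad and the use of a relative error are illustrative choices, not a prescribed diagnostic.

```python
import torch
import torch.nn.functional as F


def x_grad(x, k, y, stride):
    """Autograd gradient of 0.5 * ||k * x - y||^2 with respect to x."""
    x_var = x.detach().requires_grad_(True)
    loss = 0.5 * (F.conv2d(x_var, k, stride=stride) - y).pow(2).sum()
    return torch.autograd.grad(loss, x_var)[0]


torch.manual_seed(0)
x = torch.randn(8, 1, 128, 128)
y = torch.randn(8, 512, 4, 4)
k = torch.randn(512, 1, 32, 32)

g32 = x_grad(x, k, y, 32)                                   # single precision
g64 = x_grad(x.double(), k.double(), y.double(), 32)        # double-precision reference

# Relative error of the float32 gradient against the float64 reference.
rel_err = (g32.double() - g64).norm() / g64.norm()
print(rel_err)
```

A relative error close to float32 precision would suggest that round-off in the gradient itself is not what destabilizes training, and that the cause lies elsewhere.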

Best.

K. Frank