1. Background:
I can calculate the gradient of x with respect to a cost function loss in two ways: (1) manually writing out the explicit analytic formula, and (2) using the torch.autograd package. Here is my example:
import torch
import torch.nn.functional as F
for i in range(10):
    x = torch.randn(8, 1, 128, 128)
    y = torch.randn(8, 512, 4, 4)
    k = torch.randn(512, 1, 32, 32)
    loss = lambda z: 0.5 * (F.conv2d(z, k, stride=32) - y).pow(2).sum(dim=[1,2,3]) # cost function is [(1/2)||k*x-y||_F^2]
    # 1: calculate gradient of x explicitly and manually
    x_grad_manual = F.conv2d(x, k, stride=32) - y
    x_grad_manual = F.conv_transpose2d(x_grad_manual, k, stride=32)
    # 2: calculate gradient of x using torch.autograd
    x_var = torch.autograd.Variable(x, requires_grad=True)
    x_var_loss = loss(x_var)
    x_grad_auto = torch.autograd.grad(x_var_loss, x_var, torch.ones_like(x_var_loss), create_graph=True, retain_graph=True)[0]
    # check if the results of implementations 1 and 2 are equal
    print((x_grad_manual - x_grad_auto).pow(2).mean())
Since the mean squared error between the results of the two implementations is very small (about 3.4*10^(-8)), I think that they match and that the manual implementation works correctly.
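The manual formula in the first example relies on F.conv_transpose2d being the adjoint (transpose) of F.conv2d for the same kernel and stride, so that the gradient of (1/2)||k*x-y||_F^2 is K^T(Kx - y). A minimal sketch of checking that identity numerically is shown here; the small shapes and the probe tensor v are assumptions chosen for speed, not the sizes from the example above.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
# Small, hypothetical shapes so the check runs quickly; float64 limits rounding error.
x = torch.randn(2, 1, 16, 16, dtype=torch.float64)
k = torch.randn(6, 1, 8, 8, dtype=torch.float64)
v = torch.randn(2, 6, 2, 2, dtype=torch.float64)  # same shape as F.conv2d(x, k, stride=8)

# Adjoint identity: <K x, v> should equal <x, K^T v>.
lhs = (F.conv2d(x, k, stride=8) * v).sum()
rhs = (x * F.conv_transpose2d(v, k, stride=8)).sum()
print(lhs.item(), rhs.item())
```

If the two inner products agree up to rounding, applying conv_transpose2d to the residual is the chain-rule step for the strided convolution.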
2. My Problem
I am confused about how to conveniently write out the explicit gradients of variables (features and Conv kernels) through compound processes. For instance, I do not know how to calculate the gradients of the feature x and the Conv kernel w1 in the following context:
import torch
import torch.nn.functional as F
for i in range(10):
    x = torch.randn(8, 32, 128, 128)
    y = torch.randn(8, 512, 4, 4)
    k = torch.randn(512, 1, 32, 32)
    w1 = torch.randn(1, 32, 3, 3)
    def loss(z, w):
        z_forward = F.conv2d(z, w, padding=1) # z = w1 * x
        return 0.5 * (F.conv2d(z_forward, k, stride=32) - y).pow(2).sum(dim=[1,2,3]) # cost function is [(1/2)||k*z-y||_F^2]
    # 1: calculate gradients of x and w1 explicitly and manually
    x_grad_manual = ???
    w1_grad_manual = ???
    # 2: calculate gradients of x and w1 using torch.autograd
    x_var = torch.autograd.Variable(x, requires_grad=True)
    w1_var = torch.autograd.Variable(w1, requires_grad=True)
    x_var_loss = loss(x_var, w1_var)
    x_grad_auto, w1_grad_auto = torch.autograd.grad(x_var_loss, [x_var, w1_var], torch.ones_like(x_var_loss), create_graph=True, retain_graph=True)
    # check if the results of implementations 1 and 2 are equal
    print((x_grad_manual - x_grad_auto).pow(2).mean())
    print((w1_grad_manual - w1_grad_auto).pow(2).mean())
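One possible way to fill in the two manual gradients is to apply the chain rule layer by layer: push the residual back through k with conv_transpose2d, then through w1 with another conv_transpose2d for x, and correlate x with the incoming gradient for w1. The sketch below is an illustration under assumptions, not a confirmed answer: the helper conv_weight_grad is a hypothetical name, it only covers stride-1 convolutions, and the shapes are reduced (stride 8 instead of 32) so it runs quickly.

```python
import torch
import torch.nn.functional as F

def conv_weight_grad(inp, grad_out, padding):
    """Hypothetical helper: gradient of a stride-1 conv2d w.r.t. its weight.

    Swapping the batch and channel axes lets a single conv2d call sum over
    the batch and over all spatial positions.
    """
    g = F.conv2d(inp.transpose(0, 1), grad_out.transpose(0, 1), padding=padding)
    return g.transpose(0, 1)

torch.manual_seed(0)
# Small, hypothetical shapes (not the sizes from the post); float64 for a tight comparison.
x = torch.randn(2, 4, 16, 16, dtype=torch.float64)
y = torch.randn(2, 6, 2, 2, dtype=torch.float64)
k = torch.randn(6, 1, 8, 8, dtype=torch.float64)
w1 = torch.randn(1, 4, 3, 3, dtype=torch.float64)

# 1: manual gradients via the chain rule
z = F.conv2d(x, w1, padding=1)                      # z = w1 * x
r = F.conv2d(z, k, stride=8) - y                    # residual k*z - y
z_grad = F.conv_transpose2d(r, k, stride=8)         # d loss / d z
x_grad_manual = F.conv_transpose2d(z_grad, w1, padding=1)
w1_grad_manual = conv_weight_grad(x, z_grad, padding=1)

# 2: autograd reference (summing the loss over the batch matches passing ones as grad_outputs)
x_var = x.clone().requires_grad_(True)
w1_var = w1.clone().requires_grad_(True)
loss = 0.5 * (F.conv2d(F.conv2d(x_var, w1_var, padding=1), k, stride=8) - y).pow(2).sum()
x_grad_auto, w1_grad_auto = torch.autograd.grad(loss, [x_var, w1_var])

print((x_grad_manual - x_grad_auto).abs().max())
print((w1_grad_manual - w1_grad_auto).abs().max())
```

Every step is an ordinary differentiable PyTorch op, so the manual expressions can themselves be placed inside a larger graph.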
3. Extension:
Furthermore, if the forward process is more complicated than the one above, with two intermediate Conv layers and a ReLU activation, how can I write out the gradients? Please see the following problem:
import torch
import torch.nn.functional as F
for i in range(10):
    x = torch.randn(8, 32, 128, 128)
    y = torch.randn(8, 512, 4, 4)
    k = torch.randn(512, 1, 32, 32)
    w1 = torch.randn(32, 32, 3, 3)
    w2 = torch.randn(1, 32, 3, 3)
    def loss(z, q1, q2):
        z_forward = F.conv2d(z, q1, padding=1) # z = w1 * x
        z_forward = F.relu(z_forward, inplace=True) # z = ReLU(w1 * x)
        z_forward = F.conv2d(z_forward, q2, padding=1) # z = w2 * ReLU(w1 * x)
        return 0.5 * (F.conv2d(z_forward, k, stride=32) - y).pow(2).sum(dim=[1,2,3]) # cost function is [(1/2)||k*z-y||_F^2]
    # 1: calculate gradients of x, w1 and w2 explicitly and manually
    x_grad_manual = ???
    w1_grad_manual = ???
    w2_grad_manual = ???
    # 2: calculate gradients of x, w1 and w2 using torch.autograd
    x_var = torch.autograd.Variable(x, requires_grad=True)
    w1_var = torch.autograd.Variable(w1, requires_grad=True)
    w2_var = torch.autograd.Variable(w2, requires_grad=True)
    x_var_loss = loss(x_var, w1_var, w2_var)
    # the inputs list must contain w2_var (not w1_var twice) so that the third result is the gradient of w2
    x_grad_auto, w1_grad_auto, w2_grad_auto = torch.autograd.grad(x_var_loss, [x_var, w1_var, w2_var], torch.ones_like(x_var_loss), create_graph=True, retain_graph=True)
    # check if the results of implementations 1 and 2 are equal
    print((x_grad_manual - x_grad_auto).pow(2).mean())
    print((w1_grad_manual - w1_grad_auto).pow(2).mean())
    print((w2_grad_manual - w2_grad_auto).pow(2).mean())
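For the two-layer case with a ReLU, the same layer-by-layer chain rule could be applied, with the ReLU contributing an elementwise mask on the positions where its input was positive. The following is a sketch under assumptions rather than a definitive answer: conv_weight_grad is a hypothetical helper for stride-1 convolutions, the shapes are reduced for speed, and the non-inplace F.relu is used so the pre-activation stays available for the mask.

```python
import torch
import torch.nn.functional as F

def conv_weight_grad(inp, grad_out, padding):
    """Hypothetical helper: gradient of a stride-1 conv2d w.r.t. its weight."""
    g = F.conv2d(inp.transpose(0, 1), grad_out.transpose(0, 1), padding=padding)
    return g.transpose(0, 1)

torch.manual_seed(0)
# Small, hypothetical shapes (not the sizes from the post); float64 for a tight comparison.
x = torch.randn(2, 4, 16, 16, dtype=torch.float64)
y = torch.randn(2, 6, 2, 2, dtype=torch.float64)
k = torch.randn(6, 1, 8, 8, dtype=torch.float64)
w1 = torch.randn(5, 4, 3, 3, dtype=torch.float64)
w2 = torch.randn(1, 5, 3, 3, dtype=torch.float64)

# forward pass, keeping the intermediates needed for the backward formulas
a = F.conv2d(x, w1, padding=1)            # pre-activation
h = F.relu(a)                             # ReLU(w1 * x)
z = F.conv2d(h, w2, padding=1)            # w2 * ReLU(w1 * x)
r = F.conv2d(z, k, stride=8) - y          # residual k*z - y

# 1: manual gradients, walking backward one layer at a time
z_grad = F.conv_transpose2d(r, k, stride=8)
w2_grad_manual = conv_weight_grad(h, z_grad, padding=1)
h_grad = F.conv_transpose2d(z_grad, w2, padding=1)
a_grad = h_grad * (a > 0).to(a.dtype)     # ReLU passes gradient only where its input was positive
w1_grad_manual = conv_weight_grad(x, a_grad, padding=1)
x_grad_manual = F.conv_transpose2d(a_grad, w1, padding=1)

# 2: autograd reference
x_var = x.clone().requires_grad_(True)
w1_var = w1.clone().requires_grad_(True)
w2_var = w2.clone().requires_grad_(True)
z_auto = F.conv2d(F.relu(F.conv2d(x_var, w1_var, padding=1)), w2_var, padding=1)
loss = 0.5 * (F.conv2d(z_auto, k, stride=8) - y).pow(2).sum()
x_grad_auto, w1_grad_auto, w2_grad_auto = torch.autograd.grad(loss, [x_var, w1_var, w2_var])

print((x_grad_manual - x_grad_auto).abs().max())
print((w1_grad_manual - w1_grad_auto).abs().max())
print((w2_grad_manual - w2_grad_auto).abs().max())
```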
4. Guarantee of Differentiability
As in my first example, I hope that the manual gradient calculations are fully explicit and are themselves differentiable, so that I can inject some of them into my neural network implementation. Could you please teach me how to achieve this?
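One way to check that a manual gradient is itself differentiable is to build an outer objective on top of it and compare the outer gradients with those obtained when the inner gradient comes from torch.autograd.grad with create_graph=True. The sketch below does this for the single-convolution case from the first example; the outer objective (sum of squares of the inner gradient), the function names, and the small shapes are all assumptions made for illustration.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
# Small, hypothetical shapes; x and k both require grad so the outer pass reaches them.
x = torch.randn(2, 1, 16, 16, dtype=torch.float64, requires_grad=True)
y = torch.randn(2, 6, 2, 2, dtype=torch.float64)
k = torch.randn(6, 1, 8, 8, dtype=torch.float64, requires_grad=True)

def inner_grad_manual(x, k):
    # Explicit formula K^T (K x - y), built only from differentiable ops.
    r = F.conv2d(x, k, stride=8) - y
    return F.conv_transpose2d(r, k, stride=8)

def inner_grad_auto(x, k):
    # Same quantity from autograd; create_graph=True keeps it differentiable.
    loss = 0.5 * (F.conv2d(x, k, stride=8) - y).pow(2).sum()
    return torch.autograd.grad(loss, x, create_graph=True)[0]

# Hypothetical outer objective that depends on the inner gradient.
outer_manual = inner_grad_manual(x, k).pow(2).sum()
outer_auto = inner_grad_auto(x, k).pow(2).sum()

gx_manual, gk_manual = torch.autograd.grad(outer_manual, [x, k])
gx_auto, gk_auto = torch.autograd.grad(outer_auto, [x, k])

print((gx_manual - gx_auto).abs().max())
print((gk_manual - gk_auto).abs().max())
```

If both pairs agree, the explicit formula can stand in for the autograd call inside a larger network without changing the outer gradients.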
5. The Reason for Posting This Problem
In a neural network I constructed, I need to calculate the gradients of some features and Conv kernels with respect to my pre-defined cost functions (as you can see above). In my current implementation, I directly use the torch.autograd package to calculate the various gradients. However, it seems that some errors accumulate and mislead the learning process when I train the network.
(The whole neural network has its own loss function and backward process. I just added some extra inner gradient calculations to achieve my goals.)
I conjecture that I should calculate the gradients manually rather than call torch.autograd directly inside the ordinary forward pass of the network, since some computational graphs and backward passes may become nested and lead to wrong weight updates.
In my experiments, I trained two networks (one with manual and one with automatic gradient calculations, as in the first example) and got similar results. But when I extend this to more complicated forward processes (like the two problems posted above), training is not stable. So I want to write out the gradients manually to avoid implementation mistakes and conduct more experiments.