My question is how autograd behaves when you take the grad of a grad. Suppose y = w*x and out = relu(y). Then dout/dw = dout/dy * dy/dw = dout/dy * x. As I understand it, if I differentiate dout/dw with respect to w again, d(dout/dw)/dw should be all zeros, because w does not appear in dout/dw.
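To make that expectation concrete, here is a minimal scalar sketch of the same setup (not my real model). The tensors `x` and `w` and their values are made up for illustration; the only assumption is that `torch.relu` and a plain multiplication stand in for the conv+relu pair.

```python
import torch

# Toy stand-ins (hypothetical values): x is the input, w is the single weight.
x = torch.tensor([1.5, -2.0, 0.7])
w = torch.tensor(0.8, requires_grad=True)

# out = relu(w * x), summed so that it is a scalar
out = torch.relu(w * x).sum()

# First derivative d(out)/dw, kept differentiable with create_graph=True
(g,) = torch.autograd.grad(out, w, create_graph=True)

# Second derivative d(g)/dw: the relu mask is piecewise constant in w
(gg,) = torch.autograd.grad(g, w)

print(g, gg)
```

Here `g` is the sum of the entries of `x` where `w * x > 0`, and `gg` comes back as zero, which matches what I expected on paper.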
However, in the following example, conv+relu gives all zeros, conv+batchnorm+relu gives a derivative that is not all zero, and conv alone raises the error "One of the differentiated Tensors appears to not have been used in the graph."
Can anyone help explain this? Thanks!
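import torch
import torch.nn as nn
from pytorchcv.model_provider import get_model as ptcv_get_model


def main():
    resnet = ptcv_get_model("resnet18", pretrained=True)
    resnet_modules = list(resnet.modules())
    # Only the first conv block of ResNet-18; toggle bn to compare the three cases.
    model = nn.Sequential(
        resnet_modules[0].features.init_block.conv.conv,
        # resnet_modules[0].features.init_block.conv.bn,
        resnet_modules[0].features.init_block.conv.activ,
    )
    inputs = torch.rand(size=(1, 3, 224, 224), dtype=torch.float32, requires_grad=True)
    outputs = model(inputs)
    # create_graph=True keeps the backward pass differentiable for the second grad.
    outputs.backward(torch.ones_like(outputs, dtype=torch.float), create_graph=True)
    params, grads = get_params_grad(model)
    model.zero_grad()
    # generate Rademacher random variables (entries are +1 or -1)
    v = [
        torch.randint_like(p, high=2)
        for p in params
    ]
    for v_i in v:
        v_i[v_i == 0] = -1
    # Hessian-vector product: differentiate sum(grads * v) w.r.t. params
    Hv = torch.autograd.grad(grads,
                             params,
                             grad_outputs=v,
                             only_inputs=True,
                             retain_graph=True)
    print(Hv)


def get_params_grad(model):
    """Collect the trainable parameters and copies of their gradients."""
    params = []
    grads = []
    for param in model.parameters():
        if not param.requires_grad:
            continue
        params.append(param)
        # "+ 0." makes a new tensor that stays attached to the graph
        grads.append(0. if param.grad is None else param.grad + 0.)
    return params, grads


if __name__ == "__main__":
    main()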
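For the conv-only case, the same error seems to show up in a much smaller setting. This is a sketch under assumptions: a plain multiplication stands in for the conv layer (both are linear in the weight), and `x` and `w` are made-up toy tensors. `x` has `requires_grad=True` to mirror `inputs` in `main()`.

```python
import torch

# Toy stand-ins (hypothetical values); x requires grad like `inputs` in main().
x = torch.tensor([1.5, -2.0, 0.7], requires_grad=True)
w = torch.tensor(0.8, requires_grad=True)

# Linear in w, like a bare conv layer with no activation after it
out = (w * x).sum()

# g = d(out)/dw = sum(x): it depends on x but no longer on w
(g,) = torch.autograd.grad(out, w, create_graph=True)

try:
    torch.autograd.grad(g, w, retain_graph=True)
    raised = False
except RuntimeError as err:
    raised = True
    print(err)

# With allow_unused=True the missing dependency is reported as None
(gg,) = torch.autograd.grad(g, w, allow_unused=True)
print(gg)
```

So when the first gradient does not depend on `w` at all, autograd appears to report "not used in the graph" (or `None` with `allow_unused=True`) rather than returning a zero tensor.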
What I actually want is to use the grad of a grad to estimate the Hessian trace, but I don't understand what the result of this computation means.
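For reference, this is my understanding of the trace estimate I am after, as a minimal sketch on a toy function with a known Hessian. The helper names `rademacher_like` and `hutchinson_trace` are hypothetical, and the toy loss `(w ** 3).sum()` is an assumption chosen so the answer can be checked by hand. The idea is that for Rademacher vectors v, the expectation of v^T H v equals trace(H), and H v can be obtained by differentiating the gradient against v.

```python
import torch


def rademacher_like(p):
    # Hypothetical helper: entries are +1 or -1 with equal probability
    return torch.randint_like(p, high=2) * 2 - 1


def hutchinson_trace(loss, params, n_samples=10):
    # Hypothetical helper: average v^T H v over random Rademacher vectors v
    grads = torch.autograd.grad(loss, params, create_graph=True)
    estimates = []
    for _ in range(n_samples):
        vs = [rademacher_like(p) for p in params]
        # Hessian-vector product: d(sum_i <g_i, v_i>)/d(params) = H v
        Hv = torch.autograd.grad(grads, params, grad_outputs=vs, retain_graph=True)
        estimates.append(sum((h * v).sum() for h, v in zip(Hv, vs)))
    return torch.stack(estimates).mean()


# Toy check: loss = sum(w^3) has Hessian diag(6 * w), so the trace is 6 * sum(w).
w = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
loss = (w ** 3).sum()
trace_est = hutchinson_trace(loss, [w])
print(trace_est)
```

Because the Hessian of this toy loss is diagonal, every sample gives the exact trace; with off-diagonal terms the estimate would only match on average.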