Understanding gradients of gradients

chenqb1989 · April 20, 2023, 6:24am

My question is how autograd behaves when you take the gradient of a gradient. Suppose y = w*x and out = relu(y), so dout/dw = dout/dy * dy/dw = dout/dy * x. In my opinion, if I differentiate dout/dw again, d(dout/dw)/dw should be all zeros, because there is no w in dout/dw.
However, in the following example, conv+relu gives all zeros, conv+batchnorm+relu gives a derivative that is not all zero, and conv alone raises the error “One of the differentiated Tensors appears to not have been used in the graph.”
Can anyone help explain this? Thanks!
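
A smaller reproduction of the three cases may help isolate the behavior. This is only a sketch under stated assumptions, not the ResNet code: `second_derivative` is a hypothetical helper, an elementwise `w * x` stands in for the convolution, and a hand-written normalization stands in for batch norm.

```python
import torch

def second_derivative(fn):
    """Differentiate d(out)/dw once more w.r.t. w, where out = fn(w * x).sum().

    Returns None when the first gradient has no graph to differentiate.
    """
    x = torch.tensor([1.0, -2.0, 3.0])
    w = torch.full((3,), 0.5, requires_grad=True)
    out = fn(w * x).sum()
    # create_graph=True records the backward pass so it can be differentiated again
    (g,) = torch.autograd.grad(out, w, create_graph=True)
    if not g.requires_grad:
        # The first gradient does not depend on anything that requires grad
        return None
    (gg,) = torch.autograd.grad(g.sum(), w, allow_unused=True)
    return gg

# Stand-in for conv only: out is linear in w
linear = second_derivative(lambda y: y)
# Stand-in for conv + relu: piecewise linear in w
relu = second_derivative(torch.relu)
# Stand-in for conv + batchnorm + relu (assumption: plain mean/std normalization)
normed = second_derivative(lambda y: torch.relu((y - y.mean()) / y.std()))
print(linear, relu, normed)
```

In this toy setting the linear case has nothing left to differentiate, the ReLU case differentiates to zeros, and the normalized case does not, presumably because the mean and standard deviation themselves depend on w.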

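import torch
import torch.nn as nn
from pytorchcv.model_provider import get_model as ptcv_get_model

def main():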
    resnet = ptcv_get_model("resnet18", pretrained=True)
    resnet_modules = list(resnet.modules())

    model = nn.Sequential(
        resnet_modules[0].features.init_block.conv.conv,
        # resnet_modules[0].features.init_block.conv.bn,
        resnet_modules[0].features.init_block.conv.activ,
    )

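    # Random input batch of one 224x224 RGB image
    inputs = torch.rand(size=(1, 3, 224, 224), dtype=torch.float32, requires_grad=True)
    outputs = model(inputs)
    # create_graph=True records the backward pass so the gradients can be differentiated again
    outputs.backward(torch.ones_like(outputs, dtype=torch.float), create_graph=True)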

    params, grads = get_params_grad(model)
    model.zero_grad()
    v = [
        torch.randint_like(p, high=2)
        for p in params
    ]

    # Turn the 0/1 samples into Rademacher (+1/-1) random variables
    for v_i in v:
        v_i[v_i == 0] = -1

    Hv = torch.autograd.grad(grads,
                             params,
                             grad_outputs=v,
                             only_inputs=True,
                             retain_graph=True)
    print(Hv)

def get_params_grad(model):
    params = []
    grads = []
    for param in model.parameters():
        if not param.requires_grad:
            continue
        params.append(param)
        grads.append(0. if param.grad is None else param.grad + 0.)
    return params, grads

if __name__ == "__main__":
    main()

What I actually want is to use the gradient of the gradient to compute the Hessian trace, but I don't understand what it means here.
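
For reference, the quantity this setup is usually aiming at is Hutchinson's estimate, tr(H) ≈ E[vᵀHv] with Rademacher v, where Hv comes from the second `torch.autograd.grad` call. The sketch below is a minimal illustration on a toy loss, not the ResNet setup above; `hutchinson_trace`, `loss_fn`, `a`, and `w` are hypothetical names chosen for the example.

```python
import torch

def hutchinson_trace(loss_fn, params, n_samples=10):
    """Estimate tr(H) of loss_fn() w.r.t. params as the mean of v^T H v."""
    loss = loss_fn()
    # First-order gradients, kept differentiable for the Hessian-vector products
    grads = torch.autograd.grad(loss, params, create_graph=True)
    estimates = []
    for _ in range(n_samples):
        # Rademacher vectors: entries are +1 or -1 with equal probability
        vs = [torch.randint_like(p, high=2) * 2 - 1 for p in params]
        # Differentiating sum(grad * v) w.r.t. params gives H v
        Hv = torch.autograd.grad(grads, params, grad_outputs=vs, retain_graph=True)
        estimates.append(sum((h * v).sum() for h, v in zip(Hv, vs)))
    return torch.stack(estimates).mean()

# Toy loss with a known Hessian: H = diag(2 * a), so tr(H) = 2 * a.sum()
a = torch.tensor([1.0, 2.0, 3.0])
w = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
estimate = hutchinson_trace(lambda: (a * w ** 2).sum(), [w])
print(estimate.item(), (2 * a.sum()).item())
```

Because the toy Hessian is diagonal, every sample of vᵀHv equals the trace exactly; for a general Hessian the off-diagonal terms only cancel on average, so more samples are needed.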