Autograd.grad occasionally returns None for some variables

I’m using torch’s autograd.grad to perform analysis on my network. But I’ve run into a problem: when I run grad() on different data items, I occasionally get None for some layers on certain data items, which then causes errors further down.

I’m confused why this happens. Why does the same network get None gradients on some data items but not on others? There are no noisy items or missing values in the dataset.

Part of my network is as follows (I did not post the whole network to keep this short, but it’s enough to show the problem):

def forward(self, input):
    x = input
    out = self.backbone(x)
    predict_up_conv = self.relu2(self.bn2(self.conv_up_conv(out)))
    predict_down_conv = self.relu3(self.bn3(self.conv_down_conv(out)))
    predict_cls_conv = self.relu4(self.bn4(self.conv_cls_conv(out)))
    predict_up = self.conv_up(predict_up_conv)
    predict_down = self.conv_down(predict_down_conv)
    predict_cls = self.conv_cls(predict_cls_conv)
    if self.phase == 'test':
        predict_cls_softmax = self.softmax(predict_cls)
        return predict_up, predict_down, predict_cls_softmax
    return predict_up, predict_down, predict_cls

the code I use to calculate the gradient is:

grads = grad(y, w, retain_graph=True, create_graph=True, allow_unused=True)

y is the target value (a scalar), and w is the list of all parameters of the network.

grads is the resulting tuple of gradients. In normal cases, it is a tuple of PyTorch tensors whose shapes match the parameters of each layer in the network. However, on some data items the returned grads contains elements that are None, and the associated layers are always self.conv_up_conv, self.conv_down_conv, self.conv_up, and self.conv_down.
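A quick way to see which layers are affected is to pair each returned gradient with its parameter name. This is a minimal sketch on a toy two-layer model (not the network above), where every parameter contributes to y, so no entry is None:

```python
import torch
from torch import nn
from torch.autograd import grad

# Toy model: every parameter participates in computing y.
model = nn.Sequential(nn.Linear(4, 3), nn.Linear(3, 1))
x = torch.randn(2, 4)
y = model(x).sum()

# Pair each gradient with its parameter name to spot None entries.
names, params = zip(*model.named_parameters())
grads = grad(y, params, create_graph=True, allow_unused=True)
missing = [n for n, g in zip(names, grads) if g is None]
print(missing)  # -> [] since all parameters were used
```

On a model where some parameters do not contribute to y, the missing list would name exactly those layers.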

I’m confused about what’s wrong with my code.

Thank you all for helping me!!!


First, retain_graph=True is not needed if you use create_graph=True.
Also, if you expect all the parameters to have been used to compute y, you should not pass allow_unused=True.

Can you share how y is computed from predict_up, predict_down, and predict_cls?
Because if y only depends on predict_cls, then given the structure of your forward, it is expected that these conv layers get no gradients, since their outputs are never used to compute y.
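Here is a small demonstration of that effect, using a hypothetical two-head model that mirrors your forward: the loss only uses one head, so with allow_unused=True the other head’s parameters come back as None:

```python
import torch
from torch import nn
from torch.autograd import grad

class TwoHeads(nn.Module):
    """Toy model: a shared backbone feeding two output heads."""
    def __init__(self):
        super().__init__()
        self.backbone = nn.Linear(4, 4)
        self.head_cls = nn.Linear(4, 2)  # used by the loss below
        self.head_up = nn.Linear(4, 2)   # computed, but NOT used by the loss

    def forward(self, x):
        out = self.backbone(x)
        return self.head_up(out), self.head_cls(out)

model = TwoHeads()
pred_up, pred_cls = model(torch.randn(3, 4))
y = pred_cls.sum()  # the loss ignores pred_up entirely

grads = grad(y, list(model.parameters()),
             create_graph=True, allow_unused=True)
for (name, _), g in zip(model.named_parameters(), grads):
    # head_up.weight and head_up.bias print as None; the rest are tensors.
    print(name, None if g is None else tuple(g.shape))
```

The None entries are not a bug in autograd: head_up simply did not participate in the graph that produced y, exactly like your conv_up/conv_down branches if y only depends on predict_cls.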