All my gradients are None, even my pred.grad, after backward() call

Hi all,

I am relatively new to PyTorch, so please bear with me :smiley:

I have trained a network that takes as input 3 images. The net is based on this:

After training, I want a heatmap for each input image to help interpret the results.

I make a prediction based on a triplet of images and call backward(), but none of my parameters, and not even my prediction, has a gradient:

print(pred.grad) # gives None!

Here is the full printout of the gradients for the model parameters:

trained_model.features_a4c.0.weight None
trained_model.features_a4c.0.bias None
trained_model.features_a4c.1.weight None
trained_model.features_a4c.1.bias None
trained_model.features_a4c.5.weight None
trained_model.features_a4c.5.bias None
trained_model.features_a4c.6.weight None
trained_model.features_a4c.6.bias None
trained_model.linear_block.0.weight None
trained_model.linear_block.0.bias None
trained_model.linear_block.3.weight None
trained_model.linear_block.3.bias None
linear_block.0.weight None
linear_block.0.bias None
linear_block.3.weight None
linear_block.3.bias None

model gradients = None

I set requires_grad to True on the model parameters before the backward() call, and set model.eval().

Here is the gradcam model:

class MultiInputNet_GradCAM(nn.Module):
    # Implement GradCAM hooks
    def __init__(self, trained_model):
        super().__init__() # inherit methods and attributes of nn.Module
        self.trained_model = trained_model
        self.features_a4c = self.trained_model.features_a4c[:8]
        self.features_a2c = self.trained_model.features_a2c[:8]
        self.features_a3c = self.trained_model.features_a3c[:8]

        # we include the last maxpool2d and dropout in the forward function.
        self.last_maxpool2d = nn.MaxPool2d(kernel_size=(2, 2), padding=1)
        self.last_dropout = nn.Dropout(0.3)
        self.gradients = None

        self.linear_block = self.trained_model.linear_block

    def activations_hook(self, grad):
        self.gradients = grad

    def forward(self, x_a4c, x_a2c, x_a3c):
        x1 = self.features_a4c(x_a4c)
        x2 = self.features_a2c(x_a2c)
        x3 = self.features_a3c(x_a3c)

        h1 = x1.register_hook(self.activations_hook)
        h2 = x2.register_hook(self.activations_hook)
        h3 = x3.register_hook(self.activations_hook)

        x1 = self.last_maxpool2d(x1)
        x1 = self.last_dropout(x1)
        x2 = self.last_maxpool2d(x2)
        x2 = self.last_dropout(x2)
        x3 = self.last_maxpool2d(x3)
        x3 = self.last_dropout(x3)

        x_stack = torch.cat((x1, x2, x3), 1) # concatenate
        x_stack = x_stack.view(x_stack.size(0), -1) # flatten batchwise (not fully), we want size of (batch_size, __)
        out = self.linear_block(x_stack)

        return out

    # method for the gradient extraction
    def get_gradient(self):
        return self.gradients

    # method for the activation extraction
    def get_activations(self, x):
        return self.features(x)

And finally (sorry for blocky codes :S ):

def my_gradcam(model, imgs, target_class):

    # get the most likely prediction of the model
    pred = model(imgs[0].cuda(), imgs[1].cuda(), imgs[2].cuda()).argmax(dim=1)

    for name, param in model.named_parameters():
        print(name, param.grad) # returns all None

    # pull the gradients out of the model
    gradients = model.get_gradient() # here self.gradients None too

Thank you in advance for your help. Any insight will be much appreciated!

argmax is not differentiable and will detach the resulting tensor from the computation graph, so you would have to remove it.
I’m not familiar with your use case, but if you are working on, e.g., a multi-class classification, you should pass the raw logits to nn.CrossEntropyLoss instead.
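
To illustrate the point about argmax, here is a minimal sketch. It does not use the model from the question: `features`, `classifier`, and the tensor shapes are hypothetical stand-ins. It shows that the result of argmax has no grad_fn, and that calling backward() on the selected class score (taken from the raw logits) does populate both the hook and the parameter gradients.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical stand-in for the trained network: a small conv feature
# extractor followed by a linear classifier with 3 classes.
features = nn.Sequential(nn.Conv2d(1, 4, kernel_size=3, padding=1), nn.ReLU())
classifier = nn.Linear(4 * 8 * 8, 3)

captured = {}

def activations_hook(grad):
    # Called during backward() with the gradient w.r.t. the feature map.
    captured["grad"] = grad

x = torch.randn(2, 1, 8, 8)
feats = features(x)
feats.register_hook(activations_hook)
logits = classifier(feats.flatten(1))

# argmax returns integer indices with no grad_fn: the graph ends here.
pred = logits.argmax(dim=1)
print(pred.requires_grad, pred.grad_fn)  # False None

# Use the indices only to select the class scores, then backpropagate
# from the differentiable logits.
score = logits.gather(1, pred.unsqueeze(1)).sum()
score.backward()

print(captured["grad"].shape)  # torch.Size([2, 4, 8, 8])
print(classifier.weight.grad is not None)  # True
```

Note that pred.grad would still be None here: it is an integer tensor outside the graph, so the gradients to inspect are the ones captured by the hook and the .grad attributes of the parameters.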

You are my absolute hero Mr. Bialecki.
I’ve been at this for the past two days. I should’ve knocked on the door earlier! :smiley:

Thank you so much!

Hi,
I have the same issue. I also thought the problem was argmax detaching the output from the graph, but how do I calculate the loss if I remove argmax?
In my example, the labels have shape [batch_size, 256, 256] and the model output has shape [batch_size, number_of_class, 256, 256].

def forward(self, input):
    output = self.encoder(input)
    output = self.decoder(output)
    output = self.sigmoid(output)
    return output

optimizer = torch.optim.Adam(model.parameters(), lr=args.lr_start)

# in train
images = Variable(images, requires_grad=True).cuda(0)

logits = model(images)

logits = torch.argmax(logits, dim=1)

criterion = nn.BCEWithLogitsLoss()

loss = criterion(logits.float(), labels.float())



However, none of the tensors involved require grad, even though I loaded the data with requires_grad set to True, and I get this error:


RuntimeError: element 0 of tensors does not require grad and does not have a grad_fn

nn.BCEWithLogitsLoss expects raw logits as the model output, so just remove the argmax as well as the sigmoid.
Also, based on the output shape and the use of nn.BCEWithLogitsLoss, it seems you are working on a multi-label segmentation (each pixel can have zero, one, or multiple classes associated with it).
If that’s not the case and you are working on a multi-class segmentation (each pixel belongs to one class only), use nn.CrossEntropyLoss and also pass the raw logits to this loss function.
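
The two cases can be sketched as follows. This is only an illustration under assumed shapes: the random `logits` tensor stands in for the decoder output (without sigmoid or argmax), and the sizes and target tensors are made up. A label of shape [batch_size, 256, 256] holding class indices would match the target layout used in the multi-class branch.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

batch_size, num_classes, height, width = 2, 3, 8, 8

# Stand-in for the decoder output: raw logits, no sigmoid and no argmax.
logits = torch.randn(batch_size, num_classes, height, width, requires_grad=True)

# Multi-class: one integer class index per pixel, shape [batch_size, H, W].
class_labels = torch.randint(0, num_classes, (batch_size, height, width))
ce_loss = nn.CrossEntropyLoss()(logits, class_labels)
ce_loss.backward()
ce_grad_shape = logits.grad.shape  # same shape as logits

# Multi-label: one float 0/1 target per class and pixel, same shape as logits.
logits.grad = None
multi_hot = torch.randint(0, 2, (batch_size, num_classes, height, width)).float()
bce_loss = nn.BCEWithLogitsLoss()(logits, multi_hot)
bce_loss.backward()

# argmax is still fine for reporting predictions once the loss is computed.
pred = logits.detach().argmax(dim=1)
```

In both branches the loss is computed from the raw logits, so it keeps a grad_fn and backward() works; argmax is applied only afterwards, for metrics or visualization.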