Issue using loss.backward(): missing grad_fn error

Hi, I am new to PyTorch. I am trying to train a Unet model. All the layers are custom-written for my project. The layers individually work and return the expected outcomes, but when I try to put them together in my Unet model, it seems I am facing some problem with the gradient computation. It gives me the following error:

```
element 0 of tensors does not require grad and does not have a grad_fn
```

It seems like I have messed up some indexing somewhere that I cannot see. I have printed the shapes and requires_grad of all layers and they are all in the expected condition:
```python
for name, param in model.named_parameters():
    print(name, param.requires_grad)
```
This check shows that requires_grad is True for all parameters. But in the training loop I find that loss.requires_grad is False, which traces back to the last layer in the Unet: a deconvolution layer that returns an output with requires_grad set to False. Can anyone suggest some options for debugging why this would be the case? Since everything is custom-written, I am not sure how to provide a script that reproduces the error. I can share my GitHub; let me know.
Many Thanks
Wasim

Can you share a minimal reproducible example of the code and wrap any code with ```?
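In the meantime, one quick way to locate where the graph breaks is to print requires_grad and grad_fn after each block. A minimal sketch, assuming your blocks are registered as submodules and run sequentially:

```python
x = input_tensor
for name, layer in model.named_children():  # assumes the submodules run in this order
    x = layer(x)
    print(name, x.requires_grad, x.grad_fn)  # the first block printing False/None is the culprit
```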

Hi, thanks for your prompt response. All my convolution, pooling, upsampling, and deconvolution layers are custom-written in PyTorch without nn.Conv1d, nn.MaxPool1d, etc.; they are calculated manually. So I am not sure how to share a small reproducible script. But since the issue comes up in the last layer (a deconvolution layer), I can share my deconvolution function here:

```python
import torch

def custom_deconvolution(convolved_maps, nside, kernel, num_iterations=5, lr=0.01,
                         activation='sigmoid', padding='same'):
    kernel_torch = kernel.clone().detach()

    # Initialize the deconvolved map with the convolved map as a starting point
    deconvolved_maps = convolved_maps.clone().detach().requires_grad_(True)
    optimizer = torch.optim.Adam([deconvolved_maps], lr=lr)

    for _ in range(num_iterations):
        optimizer.zero_grad()
        # custom_convolution_torch is my manually written convolution
        output = custom_convolution_torch(deconvolved_maps, nside, kernel_torch,
                                          activation, padding)
        loss = torch.nn.functional.mse_loss(output, convolved_maps)
        loss.backward(retain_graph=True)
        optimizer.step()

    return deconvolved_maps.detach()
```

and the subclass I feed this into is as follows:

```python
import torch
import torch.nn as nn

# subclass for deconvolution
class CustomDeconvolution(torch.nn.Module):
    def __init__(self, nside, kernel, num_iterations=5, lr=0.05,
                 activation='sigmoid', padding='same'):
        super(CustomDeconvolution, self).__init__()
        self.nside = nside
        self.kernel = nn.Parameter(torch.tensor(kernel, dtype=torch.float32),
                                   requires_grad=True)
        self.num_iterations = num_iterations
        self.lr = lr
        self.activation = activation
        self.padding = padding

    def forward(self, convolved_map):
        return custom_deconvolution(convolved_map, self.nside, self.kernel,
                                    self.num_iterations, self.lr,
                                    self.activation, self.padding)
```

Many Thanks
Wasim

Why are you detaching your outputs?

I was under the impression that detaching deconvolved_maps is needed to ensure a fresh graph is built in the next iteration. Do you reckon this is what creates the problem?

I don’t know about that, but when you detach a tensor you detach it from the computation graph, i.e. you destroy its gradient history. So this most likely explains why you don’t get a gradient.
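A minimal sketch of the effect on toy tensors (not your model):

```python
import torch

x = torch.randn(3, requires_grad=True)
y = (x * 2).detach()       # detach cuts the graph: y has no grad_fn
loss = y.sum()

print(loss.requires_grad)  # False
print(loss.grad_fn)        # None
# loss.backward() would raise:
# RuntimeError: element 0 of tensors does not require grad and does not have a grad_fn
```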

Thanks a lot. I removed that detaching. Turns out I had done the same in a few other places as well; removed it from all of them, and now the model is running. Thanks a lot for your help.

Another quick question: in my deconvolution loop I have to use loss.backward(retain_graph=True); without retain_graph=True it doesn't work. But a problem appears when I try to predict with:

```python
with torch.no_grad():
    output_tensor = model(input_tensor)
```
I have seen that turning gradients off is recommended for inference in a few examples. But if I do that, it breaks right at that loss.backward(retain_graph=True) call.
Do I need to disable gradients when inferring? What happens with and without it?

Many Thanks
Wasim

The reason people recommend turning off gradients for inference is simply to reduce memory usage (we neither care about nor need gradients when just computing the output of the model), and the torch.no_grad() decorator is an easy way to do this without having to change too much of the source code.
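For example, the decorator form skips gradient tracking for everything inside the function (predict here is just an illustrative name):

```python
import torch

@torch.no_grad()  # no graph is built inside predict(), so activations can be freed sooner
def predict(model, input_tensor):
    model.eval()  # also switch layers like dropout/batch norm to eval mode, if any
    return model(input_tensor)
```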