Weights turn into NaNs

Hello all,

running this code:

    def do_something(self, img):

        # Create mask with weak edges set to 1, else 0
        masked_img = img.clone()
        masked_img[img != 0.5] = 0
        masked_img[img == 0.5] = 1

        # Calculate weak edges that are changed to strong edges
        changed_edges = img.clone()
        changed_edges[:] = 0
        changed_edges[((self.conv4(img) > 1) & (masked_img == 1))] = 1

        # Add changed edges to already good edges
        changed_edges[img == 1] = 1

        return changed_edges

results in:
    RuntimeError: Function 'PowBackward0' returned nan values in its 0th output.

And with anomaly detection disabled I can see that my kernel weights have turned to NaNs.

If I delete the line "changed_edges[:] = 0", the network trains without problems. I've tried a few different ways to work around this, but I'm out of ideas. It seems like the gradients become extremely large, so my network blows up during training? Help or some hints would be highly appreciated. Thanks.

If you're trying to fill a tensor with zeros you can either do .fill_(0) or changed_edges.zero_(). These are still in-place operations, which might cause issues for the gradient computation, but give it a go and see if it solves your problem. If not, you could try something like changed_edges = torch.zeros_like(changed_edges) to avoid the in-place operation.
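
For illustration, here is a minimal standalone sketch of those three options (toy tensor, not your training code; the name and shape are made up):

    import torch

    changed_edges = torch.rand(4, 4)

    # Option 1: in-place fill; autograd records the overwrite, which can
    # clash with operations that need the old values for backward
    changed_edges.fill_(0)

    # Option 2: in-place zeroing, equivalent to fill_(0)
    changed_edges.zero_()

    # Option 3: out-of-place; allocates a fresh tensor and leaves the
    # original tensor (and its autograd history) untouched
    changed_edges = torch.zeros_like(changed_edges)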

You could also try debugging your code with torch.autograd.set_detect_anomaly; more information about it can be found in the documentation here:

https://pytorch.org/docs/stable/autograd.html#torch.autograd.set_detect_anomaly

Try running your code within the torch.autograd.set_detect_anomaly context manager and see where the error is raised. The RuntimeError says PowBackward0 returned NaN in its 0th output, so look for any use of ** or .pow() in your code; the gradient of that operation is what's producing the NaNs.
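
For illustration, a minimal self-contained repro (not your actual model) of how PowBackward0 can return NaN even though the forward pass is finite: a zero incoming gradient multiplied by the infinite derivative of x ** 0.5 at x = 0 gives 0 * inf = nan. Masked regions full of zeros feeding a fractional power are a common source of exactly this error:

    import torch

    with torch.autograd.set_detect_anomaly(True):
        x = torch.zeros(1, requires_grad=True)
        y = x ** 0.5              # d/dx = 0.5 * x ** -0.5, which is inf at x = 0
        (y * 0).sum().backward()  # 0 * inf = nan -> 'PowBackward0' returned nan values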

Thanks for sharing the solution!