Weights turn into NaNs

Hello all,

running this code:

    def do_something(self, img):

        # Create mask with weak edges set to 1, else 0
        masked_img = img.clone()
        masked_img[img != 0.5] = 0
        masked_img[img == 0.5] = 1

        # Calculate weak edges that are changed to strong edges
        changed_edges = img.clone()
        changed_edges[:] = 0
        changed_edges[((self.conv4(img) > 1) & (masked_img == 1))] = 1

        # Add changed edges to already good edges
        changed_edges[img == 1] = 1

        return changed_edges

results in:
    RuntimeError: Function 'PowBackward0' returned nan values in its 0th output.

And with anomaly detection disabled I can see that my kernel weights have turned to NaNs.

If I delete the line "changed_edges[:] = 0", the network trains without problems. I've tried a few different ways to work around this, but I'm out of ideas. It seems like the gradients become extremely large, so my network blows up during training? Help or some hints would be highly appreciated. Thanks.

If you're trying to fill a tensor with zeros you can either do .fill_(0) or changed_edges.zero_(). These are still in-place operations, which might cause issues for the gradient computation, but give it a go and see if it solves your problem. If not, you could try something like changed_edges = torch.zeros_like(changed_edges) to avoid the in-place operation.
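
For illustration, here is a minimal standalone sketch of those three options (toy tensor, not your training code; the name and shape are made up):

    import torch

    changed_edges = torch.rand(4, 4)

    # Option 1: in-place fill; autograd records the overwrite, which can
    # clash with operations that need the old values for backward
    changed_edges.fill_(0)

    # Option 2: in-place zeroing, equivalent to fill_(0)
    changed_edges.zero_()

    # Option 3: out-of-place; allocates a fresh tensor and leaves the
    # original tensor (and its autograd history) untouched
    changed_edges = torch.zeros_like(changed_edges)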

You could also try debugging your code with torch.autograd.set_detect_anomaly; more information about it can be found in the documentation here:

https://pytorch.org/docs/stable/autograd.html#torch.autograd.set_detect_anomaly

Try running your code within the torch.autograd.set_detect_anomaly context manager and see where the error is raised. The RuntimeError says PowBackward0 returned NaN in its 0th output, so look for any use of ** or .pow() in your code; the gradient of that operation is what's producing the NaNs.
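
For illustration, a minimal self-contained repro (not your actual model) of how PowBackward0 can return NaN even though the forward pass is finite: a zero incoming gradient multiplied by the infinite derivative of x ** 0.5 at x = 0 gives 0 * inf = nan. Masked regions full of zeros feeding a fractional power are a common source of exactly this error:

    import torch

    with torch.autograd.set_detect_anomaly(True):
        x = torch.zeros(1, requires_grad=True)
        y = x ** 0.5              # d/dx = 0.5 * x ** -0.5, which is inf at x = 0
        (y * 0).sum().backward()  # 0 * inf = nan -> 'PowBackward0' returned nan values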

Thanks for sharing the solution!