Function 'PowBackward1' returned nan values in its 1th output

I am attempting to implement an operation from a paper: a learnable exponent applied to the input. This is what it looks like in code:

import torch
import torch.nn as nn

class FrontEnd(nn.Module):
    def __init__(self):
        super().__init__()
        self.bn = nn.BatchNorm1d(80, affine=False)
        self.register_parameter('alpha', torch.nn.Parameter(torch.tensor(0.5)))
        
    def forward(self, x):
        bs, im_num, ch, y_dim, x_dim = x.shape
        x = x ** torch.sigmoid(self.alpha) # <----- line causing issues
        x = x.view(-1, y_dim, x_dim)
        x = self.bn(x)
        return x.view(bs, im_num, ch, y_dim, x_dim)
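
For context, the module is called on a 5D batch. A rough shape sketch (the sizes below are made up for illustration; only y_dim = 80 is fixed by the BatchNorm1d):

fe = FrontEnd()
x = torch.rand(4, 3, 1, 80, 100)  # (bs, im_num, ch, y_dim, x_dim)
out = fe(x)
print(out.shape)  # torch.Size([4, 3, 1, 80, 100])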

If I set the alpha parameter to not require grad, everything is fine. If, however, I make it learnable, the loss turns to nan after a single iteration.

When running with anomaly detection, this is the message that I get:

---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
<ipython-input-19-953c2962d9a9> in <module>
     10         loss = criterion(outputs, labels)
     11         with autograd.detect_anomaly():
---> 12             loss.backward()
     13     #         print(model.frontend.alpha.grad)
     14             optimizer.step()

/opt/conda/lib/python3.7/site-packages/torch/tensor.py in backward(self, gradient, retain_graph, create_graph)
    193                 products. Defaults to ``False``.
    194         """
--> 195         torch.autograd.backward(self, gradient, retain_graph, create_graph)
    196 
    197     def register_hook(self, hook):

/opt/conda/lib/python3.7/site-packages/torch/autograd/__init__.py in backward(tensors, grad_tensors, retain_graph, create_graph, grad_variables)
     97     Variable._execution_engine.run_backward(
     98         tensors, grad_tensors, retain_graph, create_graph,
---> 99         allow_unreachable=True)  # allow_unreachable flag
    100 
    101 

RuntimeError: Function 'PowBackward1' returned nan values in its 1th output.

If I register a backward hook on frontend as follows:

# backward hook: receives the module, then the gradient tuples
def printgradvals(module, grad_input, grad_output):
    print(grad_input[0].ne(grad_input[0]).any())    # any NaNs in grad_input?
    print(grad_output[0].ne(grad_output[0]).any())  # any NaNs in grad_output?
    print(grad_input[0].abs().mean())
    print(grad_output[0].abs().mean())
    print(grad_input[0].abs().min())
    print(grad_output[0].abs().min())

model.frontend.register_backward_hook(printgradvals)

I get the following output:

tensor(False, device='cuda:0')
tensor(False, device='cuda:0')
tensor(1.9707e-05, device='cuda:0')
tensor(1.9707e-05, device='cuda:0')
tensor(1.3642e-12, device='cuda:0')
tensor(1.3642e-12, device='cuda:0')

The gradients are very small. But why doesn’t the input grad differ from the output grad? :thinking: And what could be causing the NaNs?

Thank you very much for any help that you can give me on this :pray:

EDIT: Just noticed that in the code I copied I initialize alpha to 0.5 - I also tried initializing it to zero, but I get the same results.

I’ve put together a small example hoping it could help find the issue, but in that example everything works as expected.

I also created a simplified architecture, but still no go:

class FrontEnd(nn.Module):
    def __init__(self):
        super().__init__()
        self.register_parameter('alpha', torch.nn.Parameter(torch.tensor(0.)))
        
    def forward(self, x):
        x = x ** torch.sigmoid(self.alpha) # <----- line causing issues
        return x

class Model(nn.Module):
    def __init__(self):
        super().__init__()
        self.frontend = FrontEnd()
        self.lin = nn.Linear(508800, len(classes))
    
    def forward(self, x):
        bs, im_num, ch, y_dim, x_dim = x.shape
        x = self.frontend(x)
        return self.lin(x.view(16, -1))

What could I try in order to fix this or troubleshoot it further? How can I pinpoint why PowBackward1 returned nan values?
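
One thing I could try (just a sketch of the idea, the names in it are mine) is a tensor-level hook directly on the result of the pow, to see whether the gradient flowing back into that op is already NaN or whether PowBackward1 itself produces it:

def forward(self, x):
    x = x ** torch.sigmoid(self.alpha)  # <----- line causing issues
    if x.requires_grad:
        # fires during backward with the gradient w.r.t. the pow output
        x.register_hook(lambda grad: print('nan reaching pow output:', torch.isnan(grad).any()))
    return x

If the gradient printed there is still finite while alpha.grad ends up NaN, the NaN has to be created inside PowBackward1 itself.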

If you initialize self.alpha to zero, torch.sigmoid(self.alpha) will have the value 0.5.
If the input x contains negative values, you would then be calculating the square root of these negative values, which yields a NaN output:

x = torch.tensor([[-1.]])
print(x ** 0.5)
> tensor([[nan]])
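
Also note that the backward of x ** a with respect to the exponent a involves log(x), so zero entries in x can also produce a NaN gradient for alpha, depending on whether your PyTorch version special-cases a zero base in PowBackward. A quick check (the printed value may differ between versions):

x = torch.tensor([0., 1., 2.])
a = torch.tensor(0.5, requires_grad=True)
(x ** a).sum().backward()
print(a.grad)  # nan, if the zero base is not special-cased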

Did the paper mention something about a non-negative condition for the input to this layer?

Thank you very much for your help @ptrblck, really appreciate it.

All the inputs should be non-negative - I added a check to make sure that is the case:
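
It is essentially an assert at the top of forward (a rough sketch, not the exact line from my notebook):

def forward(self, x):
    assert (x >= 0).all(), 'negative values in the input'  # make sure the base of the pow is non-negative
    x = x ** torch.sigmoid(self.alpha)  # <----- line causing issues
    return x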

Even if I remove the sigmoid nonlinearity and initialize alpha to a value > 1, the issue persists.

That’s a bit weird. I cannot reproduce this issue using this code snippet:

import torch
import torch.nn as nn

class FrontEnd(nn.Module):
    def __init__(self):
        super().__init__()
        self.register_parameter('alpha', torch.nn.Parameter(torch.tensor(0.)))
        
    def forward(self, x):
        x = x ** torch.sigmoid(self.alpha) # <----- line causing issues
        return x

class Model(nn.Module):
    def __init__(self):
        super().__init__()
        self.frontend = FrontEnd()
        self.lin = nn.Linear(10, 2)
    
    def forward(self, x):
        x = self.frontend(x)
        return self.lin(x)


model = Model()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

x = torch.empty(10, 10).uniform_(0, 10)
target = torch.randint(0, 2, (10,))

for i in range(100):
    optimizer.zero_grad()
    out = model(x)
    print(torch.isfinite(out).all())
    loss = criterion(out, target)
    loss.backward()
    optimizer.step()

I reduced the number of neurons a bit, but that shouldn’t make a difference.
Could you run my code and check if you are still getting NaNs?
Also, do you see any difference between my code and yours?

PS: It’s better to post code directly by wrapping it in three backticks ```, so that I can simply copy-paste it :wink:

Thank you very much again for your help @ptrblck, really, really appreciate it! :pray: Very sorry regarding the images - my apologies!

I couldn’t reproduce the issue with synthetic data, but if I modify the example that you shared to work with my data, the issue appears.

I saved the data as follows (I uploaded it here - it is under 8 MB, but the forum wouldn’t let me attach it directly):

it = iter(train_dl)
x1, t1 = next(it)
x2, t2 = next(it)

torch.save([[x1, t1], [x2, t2]], 'data.pth')

This is the code I run:

import torch
import torch.nn as nn

data = torch.load('data.pth')

class FrontEnd(nn.Module):
    def __init__(self):
        super().__init__()
        self.register_parameter('alpha', torch.nn.Parameter(torch.tensor(0.)))
        
    def forward(self, x):
        x = x ** torch.sigmoid(self.alpha) # <----- line causing issues
        return x

class Model(nn.Module):
    def __init__(self):
        super().__init__()
        self.frontend = FrontEnd()
        self.lin = nn.Linear(10, 264)
    
    def forward(self, x):
        x = self.frontend(x)
        x = x.mean((2,3,4))
        return self.lin(x)
    
model = Model()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
criterion = nn.BCEWithLogitsLoss()

for x, target in data:
    optimizer.zero_grad()
    out = model(x)
    print(torch.isfinite(out).all())
    if not torch.isfinite(out).all(): break
    loss = criterion(out, target)
    loss.backward()
    optimizer.step()

And this is the output I receive:

tensor(True)
tensor(False)

The author of the paper implemented this operation using Lasagne and Theano - my data is different, but I believe I am preprocessing it the same way he does in the paper.

Thanks for the data!
I loaded it and executed the provided code snippet; however, I’m getting all finite outputs (even when executing the training loop several times).

Which PyTorch version are you currently using?
I tried to reproduce it with 1.6.0.dev20200611 and it’s working fine.
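You can check the installed version directly with:

import torch
print(torch.__version__)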

Really appreciate all your help @ptrblck! :pray::pray::pray: I am using a GCP VM with their deep learning image. I was on PyTorch 1.4.0, and that was what was causing the issue! Upgrading to 1.5.1 fixed the problem but made the DataLoader slow, so I compiled PyTorch from source on GitHub (assuming that this way I would also get all the computation support libraries specific to the platform), and everything seems to be working fine now! :slightly_smiling_face:

Apologies for having bothered you with something that ended up being a version issue. I wouldn’t have imagined that this could be the culprit, so it didn’t cross my mind to try a newer version (and searching didn’t turn up any results). Thanks again for all your help!!!

I’m glad it’s working and thanks for the code snippets and the debugging! :slight_smile:
I’m surprised I haven’t seen this issue before (at least I cannot remember a similar error).

Hi radek (@radek),
How did you find the “line causing the issue” from the anomaly detection error message “…Function: PowBackward1…”?