Got nan contrastive loss value after a few epochs

I used a Siamese network with a contrastive loss (as in the image below), but after a few epochs the loss became nan with this message:

RuntimeError: Function 'PowBackward0' returned nan values in its 0th output.

I have searched for this issue but haven’t found a solution yet. I think the problem might be my loss function. Can someone give me some advice? Thanks
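
For context, a standard contrastive loss looks roughly like the sketch below; this is a generic formulation rather than my exact code, but it shows where torch.pow enters the computation:

import torch
import torch.nn.functional as F

# Generic contrastive-loss sketch (Hadsell et al. style), not the exact code from this post.
# label == 0 marks a similar pair, label == 1 a dissimilar pair.
def contrastive_loss(out1, out2, label, margin=2.0):
    # pairwise_distance adds a small eps internally, which keeps the sqrt in the
    # distance (and its backward) away from exactly zero
    dist = F.pairwise_distance(out1, out2)
    loss = (1 - label) * torch.pow(dist, 2) + \
           label * torch.pow(torch.clamp(margin - dist, min=0.0), 2)
    return loss.mean()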

Try to isolate the iteration which causes this issue and check the inputs to and outputs of torch.pow. Based on your code I cannot find anything obviously wrong. Also, I would recommend posting code snippets directly by wrapping them in three backticks ``` (as you’ve already done), as it makes debugging easier and the search engine can also use them for results in case other users face similar issues.
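
To isolate the failing iteration, something like the sketch below would stop at the first bad batch; train_loader, model, criterion, optimizer, and device are placeholders for your own objects:

import torch

# Report the exact op that produces the first nan in the backward pass
torch.autograd.set_detect_anomaly(True)

for i, (x0, x1, label) in enumerate(train_loader):
    x0, x1, label = x0.to(device), x1.to(device), label.to(device)
    # the inputs themselves should already be finite
    if not (torch.isfinite(x0).all() and torch.isfinite(x1).all()):
        print(f"non-finite input at iteration {i}")
        break
    out1, out2 = model(x0, x1)
    if not (torch.isfinite(out1).all() and torch.isfinite(out2).all()):
        print(f"non-finite network output at iteration {i}")
        break
    loss = criterion(out1, out2, label)
    if not torch.isfinite(loss):
        print(f"non-finite loss at iteration {i}")
        break
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()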

Thanks for your support, I have fixed it.

Good to hear you’ve fixed it! Was the pow operation creating the invalid values or what was the issue (in case you can share it)?

Actually, I found out the problem was my custom Siamese net, not my loss function. I want to use a pretrained VGG-Face model and continue training it on my dataset. My Siamese net looks like this:

import torch.nn as nn

class SiameseNetwork(nn.Module):
    def __init__(self, vgg_model):
        super(SiameseNetwork, self).__init__()
        # both branches reuse the same backbone, so the weights are shared
        self.vgg = vgg_model

    def forward(self, x0, x1):
        out1 = self.vgg(x0)
        out2 = self.vgg(x1)
        return out1, out2
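
Both inputs go through the same self.vgg, so the two branches share weights. A quick usage sketch (vgg_model and device refer to the loading code further down; the batch size is just for illustration):

import torch

net = SiameseNetwork(vgg_model).to(device)
x0 = torch.randn(4, 3, 224, 224, device=device)
x1 = torch.randn(4, 3, 224, 224, device=device)
out1, out2 = net(x0, x1)   # two embeddings produced with the same shared weights
print(out1.shape, out2.shape)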

And I found out that the nan values come from the output of my network, so now I’m trying my best to discover why the network produces them. I have checked the inputs for anomalies but found nothing abnormal. Here is a snippet of one input tensor and the corresponding outputs; can you give me some advice? I would appreciate it.

[[0.9608, 0.9569, 0.9451,  ..., 0.1098, 0.1059, 0.1059],
          [0.9569, 0.9490, 0.9333,  ..., 0.1098, 0.1098, 0.1098],
          [0.9451, 0.9333, 0.9098,  ..., 0.1137, 0.1137, 0.1137],
          ...,
          [0.9529, 0.9529, 0.9490,  ..., 0.4235, 0.4275, 0.4275],
          [0.9490, 0.9490, 0.9529,  ..., 0.4235, 0.4275, 0.4275],
          [0.9490, 0.9490, 0.9529,  ..., 0.4235, 0.4275, 0.4275]],

         [[0.8941, 0.8902, 0.8784,  ..., 0.0902, 0.0863, 0.0863],
          [0.8902, 0.8863, 0.8706,  ..., 0.0902, 0.0863, 0.0863],
          [0.8863, 0.8745, 0.8471,  ..., 0.0941, 0.0902, 0.0902],
          ...,
          [0.9569, 0.9569, 0.9529,  ..., 0.2902, 0.2902, 0.2902],
          [0.9529, 0.9529, 0.9529,  ..., 0.2863, 0.2902, 0.2902],
          [0.9529, 0.9529, 0.9529,  ..., 0.2863, 0.2902, 0.2902]]]],
       device='cuda:0')
----------------------------------------------------------
output: tensor([[-0.0138, -0.0085,  0.0077,  ..., -0.0088, -0.0003, -0.0021],
        [    nan,     nan,     nan,  ...,     nan,     nan,     nan],
        [ 0.0359,  0.0134,  0.0074,  ...,  0.0280,  0.0116,  0.0102],
        ...,
        [    nan,     nan,     nan,  ...,     nan,     nan,     nan],
        [ 0.0099,  0.0099, -0.0156,  ...,  0.0109, -0.0009,  0.0184],
        [-0.0121, -0.0051,  0.0370,  ...,  0.0406,  0.0065, -0.0012]],
       device='cuda:0', dtype=torch.float16, grad_fn=<AddmmBackward>) tensor([[    nan,     nan,     nan,  ...,     nan,     nan,     nan],
        [-0.0158, -0.0266,  0.0180,  ..., -0.0115,  0.0026, -0.0345],
        [    nan,     nan,     nan,  ...,     nan,     nan,     nan],
        ...,
        [ 0.0249,  0.0007, -0.0102,  ..., -0.0132,  0.0214,  0.0118],
        [ 0.0028,  0.0037,  0.0042,  ...,  0.0135,  0.0115, -0.0005],
        [ 0.0220,  0.0144,  0.0100,  ...,  0.0045,  0.0385, -0.0046]],
       device='cuda:0', dtype=torch.float16, grad_fn=<AddmmBackward>)
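
To narrow this down, I can check which rows of these outputs are non-finite and then look at the corresponding input images (a sketch using the two output tensors above, here called out1 and out2):

import torch

# which samples in the batch produced non-finite embeddings?
bad1 = ~torch.isfinite(out1).all(dim=1)
bad2 = ~torch.isfinite(out2).all(dim=1)
print("bad rows in out1:", bad1.nonzero(as_tuple=True)[0].tolist())
print("bad rows in out2:", bad2.nonzero(as_tuple=True)[0].tolist())
# the offending input images can then be inspected directly, e.g. x0[bad1]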

Additionally, I load the VGG model like this:

import torch
from torchsummary import summary
from vgg_face_dag import vgg_face_dag  # assuming the vgg_face_dag.py definition that accompanies the pretrained weights

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
vgg_model = vgg_face_dag('pretrained/vgg_face_dag.pth').to(device)

# This block would freeze the whole backbone; I keep it commented out:
# for param in vgg_model.parameters():
#     param.requires_grad = False

# Instead, freeze only the first 33 child modules and leave the rest trainable
idx = 0
for layer in vgg_model.children():
    idx += 1
    if idx < 34:
        for param in layer.parameters():
            param.requires_grad = False

summary(vgg_model, (3, 224, 224))
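
And a quick sanity check (sketch) to confirm which parameters are actually left trainable after the loop above, since .children() only iterates the direct sub-modules:

trainable = [name for name, p in vgg_model.named_parameters() if p.requires_grad]
print(f"{len(trainable)} trainable parameter tensors remain:")
print(trainable)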

I don’t know what might be failing inside your model, but in case you are using an older PyTorch release, update to the latest one (or the nightly) and apply the same debugging strategy by isolating the iteration that fails. Then check the inputs, intermediate activations, and gradients for any invalid values.
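
One way to localize the first layer that produces invalid activations is to register forward hooks on every sub-module; a sketch (model stands for your SiameseNetwork instance):

import torch

def make_nan_hook(name):
    def hook(module, inputs, output):
        if isinstance(output, torch.Tensor) and not torch.isfinite(output).all():
            print(f"non-finite activation after {name} ({module.__class__.__name__})")
    return hook

for name, module in model.named_modules():
    module.register_forward_hook(make_nan_hook(name))

# after loss.backward() the gradients can be checked in the same spirit:
for name, p in model.named_parameters():
    if p.grad is not None and not torch.isfinite(p.grad).all():
        print(f"non-finite gradient in {name}")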

Thank you, I’ll try it!