CTC loss produces NaN when using spatial features from AlexNet

I am applying CTC loss to a continuous sign-language recognition task, using the model architecture proposed in SubUNets.

I first used ResNet18 as the CNN for spatial feature extraction. That works without problems, but ResNet18 consumes a lot of memory.

So I changed the CNN to AlexNet to get a lighter model.

But when I use AlexNet as the spatial feature extractor, the CTC loss becomes NaN after a few epochs.

I'm not sure where the problem is, because I use the same parameters for both models.
The only difference between the two setups is the CNN model.
Please check my code for both models.

This is the code for AlexNet:

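import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision import models

# `device` and `cpu` are globals defined elsewhere in the original script;
# the two definitions below are an assumption about what they look like.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
cpu = torch.device("cpu")

class CnnEncode(nn.Module):
    def __init__(self, cnn_embed_dim=128, hidden=(128, 128, 128), drop_p=0.3):
        super(CnnEncode, self).__init__()

        self.cnn_embed_dim = cnn_embed_dim
        self.h1, self.h2, self.h3 = hidden
        self.drop_p = drop_p

        alexnet = models.alexnet(pretrained=False)
        # Replace the first convolution so the network accepts 1-channel (grayscale) frames.
        alexnet.features[0] = nn.Conv2d(1, 64, kernel_size=(11, 11), stride=(4, 4), padding=(2, 2))
        # Keep only `features` and `avgpool`; the classifier is dropped.
        modules = [list(alexnet.children())[0]] + [list(alexnet.children())[1]]
        self.alexnet = nn.Sequential(*modules)

    def forward(self, x_3d):
        # x_3d: (batch, time, channels, height, width)
        cnn_embed_seq = []
        print(x_3d.shape)
        for t in range(x_3d.size(1)):
            # Encode one frame at a time to limit GPU memory use.
            x = x_3d[:, t, :, :, :].to(device=device)
            x = F.relu(self.alexnet(x))
            x = x.view(x.size(0), -1)  # flatten to (batch, features)
            x = x.to(device=cpu)

            cnn_embed_seq.append(x)

        # (batch, features, time) -> (time, batch, features)
        cnn_embed_seq = torch.stack(cnn_embed_seq, dim=2)
        x = cnn_embed_seq.permute(2, 0, 1)
        return x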

This is the code for ResNet18:

class CnnEncode(nn.Module):
    def __init__(self, cnn_embed_dim=128, hidden=(128, 128, 128), drop_p=0.3):
        super(CnnEncode, self).__init__()

        self.cnn_embed_dim = cnn_embed_dim
        self.h1, self.h2, self.h3 = hidden
        self.drop_p = drop_p

        resnet = models.resnet18(pretrained=False)
        # Replace the first convolution so the network accepts 1-channel (grayscale) frames.
        resnet.conv1 = nn.Conv2d(1, 64, kernel_size=7, stride=2, padding=3, bias=False)
        modules = list(resnet.children())[:-1]  # delete the last fc layer
        self.resnet = nn.Sequential(*modules)

    def forward(self, x_3d):
        # x_3d: (batch, time, channels, height, width)
        cnn_embed_seq = []
        print(x_3d.shape)
        for t in range(x_3d.size(1)):
            # Encode one frame at a time to limit GPU memory use.
            x = x_3d[:, t, :, :, :].to(device=device)
            x = F.relu(self.resnet(x))
            x = x.view(x.size(0), -1)  # flatten to (batch, features)
            x = x.to(device=cpu)
            cnn_embed_seq.append(x)

        # (batch, features, time) -> (time, batch, features)
        cnn_embed_seq = torch.stack(cnn_embed_seq, dim=2)
        x = cnn_embed_seq.permute(2, 0, 1)
        return x

This post suggests using zero_infinity. Could you use it, if you haven't already, and check the targets for all blanks?

@ptrblck's suggestion is probably the first thing to look out for. If that is the cause, it means that for some of your data, the output length of your AlexNet is too short to produce the target sequence.

If that doesn’t help in itself:

  • Be sure to use the latest PyTorch version.
  • If you want to find out what is going on, you can do the following:
    • at the end of each training step (after optimizer.step()), check whether any weight has a NaN,
    • look at the last inputs into the CTC loss when a weight turned NaN (which means that the gradient was NaN),
    • check whether it happens on CPU as well.

Note that you need to check the weights for NaN, not the loss. Very likely the gradients go NaN first and you'll see the loss turn NaN in the next iteration.
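The weight check described above can be sketched as follows. This is a minimal illustration, not code from the thread: the helper name `find_nan_parameters` and the toy `nn.Linear` model are assumptions, and in a real training loop the call would go right after `optimizer.step()`.

```python
import torch
from torch import nn


def find_nan_parameters(model):
    """Return the names of all parameters that contain at least one NaN."""
    return [name for name, p in model.named_parameters() if torch.isnan(p).any()]


# Toy stand-in for the real network (hypothetical, for illustration only).
model = nn.Linear(4, 2)
clean = find_nan_parameters(model)  # freshly initialised weights: nothing to report

# Simulate a bad update by writing a NaN into the bias.
with torch.no_grad():
    model.bias[0] = float("nan")

bad = find_nan_parameters(model)
print(bad)
```

When the returned list is non-empty, the batch that was just processed is the one whose CTC inputs (log-probabilities, targets, and both length vectors) are worth saving and inspecting.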

Things to look out for:

  • targets must not contain blank.
  • input length should be larger than target length + number of repeat positions (+1? I can’t remember).
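Both conditions can be checked per sample before the batch reaches the loss. The sketch below is plain Python with hypothetical helper names and toy data. It assumes the bound is target length plus the number of adjacent repeated labels (CTC needs a blank between two equal neighbours) with no extra +1, and that blank has index 0.

```python
def min_ctc_input_length(target):
    """Target length plus the number of adjacent repeated labels."""
    repeats = sum(1 for a, b in zip(target, target[1:]) if a == b)
    return len(target) + repeats


def check_batch(input_lengths, targets, target_lengths, blank=0):
    """Return (sample index, message) pairs for samples that violate a condition."""
    problems = []
    for i, (n_in, tgt, n_tgt) in enumerate(zip(input_lengths, targets, target_lengths)):
        labels = list(tgt[:n_tgt])  # ignore the padded tail
        if blank in labels:
            problems.append((i, "blank inside target"))
        if n_in < min_ctc_input_length(labels):
            problems.append((i, "input too short"))
    return problems


# Toy batch (made-up values): sample 1 has a blank inside the unpadded target,
# sample 2 has a repeated label but only 2 input steps.
problems = check_batch(
    input_lengths=[5, 3, 2],
    targets=[[1, 2, 2, 3], [1, 0, 4, 0], [7, 7, 0, 0]],
    target_lengths=[4, 3, 2],
)
print(problems)
```

The input length here means the number of time steps the network actually emits, which can be shorter than the number of video frames if the model downsamples in time.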

Best regards

Thomas

Sorry for the late reply. I have used zero_infinity=True, but it didn't solve the problem.

I found that with a batch size of 1 there is no problem, but with a batch size greater than 1 the weights and the loss always become NaN after a few training iterations.

So I think the problem may be caused by padding.

I pad each input video sequence with its last frame, up to the maximum input length in the mini-batch,

and I pad each target with 'blank', up to the maximum target length in the mini-batch.

Is my padding method correct?

The input lengths look like this:

[150 246 220 120]
(each element is the length of one input in the batch; in this case the batch size is 4)

The targets look like this:

[[1 30 4 3 14 1 0 0 0 0]
 [1 2 5 29 3 14 30 2 4 1]
 [1 5 3 7 1 0 0 0 0 0]
 [1 29 3 1 0 0 0 0 0 0]]
(padded with 'blank' up to the maximum target length in the mini-batch)

and the target lengths look like this:

[6 10 5 4]

I don't know for sure whether padding the targets creates a problem, but you can certainly provide the targets as a 1D sequence to CTCLoss.
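A minimal sketch of the 1D form, using the lengths and targets posted above. The number of classes (32) and the random log-probabilities are assumptions made only so the example runs; in the real model `log_probs` would be the network output with shape (time, batch, classes) after `log_softmax`.

```python
import torch
from torch import nn

torch.manual_seed(0)

num_classes = 32  # assumption: labels go up to 30, index 0 is blank
input_lengths = torch.tensor([150, 246, 220, 120])
target_lengths = torch.tensor([6, 10, 5, 4])
padded_targets = torch.tensor([
    [1, 30, 4, 3, 14, 1, 0, 0, 0, 0],
    [1, 2, 5, 29, 3, 14, 30, 2, 4, 1],
    [1, 5, 3, 7, 1, 0, 0, 0, 0, 0],
    [1, 29, 3, 1, 0, 0, 0, 0, 0, 0],
])

# Drop the padding and concatenate the real labels into one 1D tensor.
flat_targets = torch.cat(
    [row[:n] for row, n in zip(padded_targets, target_lengths.tolist())]
)

# Random stand-in for the network output: (max input length, batch, classes).
log_probs = torch.randn(246, 4, num_classes).log_softmax(dim=2)

ctc = nn.CTCLoss(blank=0, zero_infinity=True)
loss_1d = ctc(log_probs, flat_targets, input_lengths, target_lengths)
loss_2d = ctc(log_probs, padded_targets, input_lengths, target_lengths)
print(flat_targets.shape, loss_1d.item(), loss_2d.item())
```

Comparing `loss_1d` and `loss_2d` on a real batch is a quick way to see whether the target padding has any effect on the loss at all.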