nn.CTCLoss returns inf

I tried to train a CTC model on the LibriSpeech dataset, but CTCLoss always returns inf (or 0 if I use zero_infinity=True).

Here is my decoder:

import torch.nn as nn


class ConvDecoder(nn.Module):
    def __init__(self, in_channels, vocab_size):
        super().__init__()
        # 1x1 convolution projecting each time step's features to vocab logits
        self.decoder = nn.Conv1d(
            in_channels=in_channels, out_channels=vocab_size, kernel_size=1)

    def forward(self, x):
        return self.decoder(x)
  • It takes a tensor of size (batch, channels, width) and returns (batch, 28, width), where 28 is my vocab size. My vocab labels start from 1, and label 0 is reserved for the blank id.

  • Then I reshape the model output to (width, batch, 28); here is the exact code:

logits = self.forward(inputs)
batch_size, channels, sequence = logits.size()
logits = logits.view((sequence, batch_size, channels))
probs = nn.functional.log_softmax(logits, dim=-1)
  • And finally the CTC loss:
loss = self.loss(probs, outputs, input_lengths, output_lengths)

How I instantiate the loss:
self.loss = nn.CTCLoss(blank=0)

And this is my vocab dictionary:

{'a': 1,
 'b': 2,
 'c': 3,
 'd': 4,
 'e': 5,
 'f': 6,
 'g': 7,
 'h': 8,
 'i': 9,
 'j': 10,
 'k': 11,
 'l': 12,
 'm': 13,
 'n': 14,
 'o': 15,
 'p': 16,
 'q': 17,
 'r': 18,
 's': 19,
 't': 20,
 'u': 21,
 'v': 22,
 'w': 23,
 'x': 24,
 'y': 25,
 'z': 26,
 ' ': 27,
 '[unk]': 28}

Indexes start from 1, and I didn't use any character for the blank in my vocab.

I tried several example CTC models and they worked; I just can't figure out what I've done wrong.

Here are the training logs:

Epoch 0:   0%|          | 2/28539 [00:04<17:53:05,  2.26s/it, loss=inf, v_num=36]
output shape: torch.Size([732, 1, 28])

Epoch 0:   0%|          | 3/28539 [00:05<15:16:36,  1.93s/it, loss=inf, v_num=36]
output shape: torch.Size([1508, 1, 28])

Epoch 0:   0%|          | 4/28539 [00:08<16:28:45,  2.08s/it, loss=inf, v_num=36]
output shape: torch.Size([549, 1, 28])

Note that the output shape refers to the final tensor that goes into self.loss; I set the batch size to 1 for simplicity.

Thank you.

You almost certainly want permute here and not view.
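
As a minimal sketch (using dummy values in place of your decoder output, which I'm assuming really is (batch, channels, sequence)), the reshape could look like this:

import torch
import torch.nn as nn

logits = torch.randn(1, 28, 732)                   # (batch, channels, sequence), dummy data
logits = logits.permute(2, 0, 1)                   # -> (sequence, batch, channels): axes are swapped
probs = nn.functional.log_softmax(logits, dim=-1)
# view() would only reinterpret the same memory under a new shape,
# mixing values from different time steps and batch elements.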

A loss of inf means your input sequence is too short to be aligned to your target sequence (i.e. the data has likelihood 0 given the model; CTC loss is a negative log likelihood, after all).
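
You can see this with a toy example (random data, not your model): when the target length exceeds the input length, no alignment exists and the loss is inf.

import torch
import torch.nn as nn

ctc = nn.CTCLoss(blank=0)
log_probs = torch.randn(5, 1, 28).log_softmax(dim=-1)   # only 5 time steps
targets = torch.randint(1, 28, (1, 10))                 # 10 target labels
input_lengths = torch.tensor([5])
target_lengths = torch.tensor([10])
print(ctc(log_probs, targets, input_lengths, target_lengths))  # tensor(inf)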

Best regards

Thomas

Thanks @tom

I debugged my audio preprocessor and found that it returns tensor(1) as the input length for every audio sample. It's a shame I totally missed checking input_lengths … well, I couldn't have found it without your guidance.
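
For anyone else hitting this: input_lengths must give the real number of time frames per sample that reach the loss. A hypothetical sketch (here `features` is just a placeholder for the list of per-sample feature tensors, assuming the model doesn't downsample in time):

# input_lengths[i] must be the actual number of frames for sample i, not a constant like 1
input_lengths = torch.tensor([feat.shape[-1] for feat in features])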

Cool, glad you solved it!