I am trying to train a CTC model on the LibriSpeech dataset, but CTCLoss always returns inf (or 0 if I use zero_infinity=True).
Here is my decoder:
class ConvDecoder(nn.Module):
    def __init__(self, in_channels, vocab_size):
        super().__init__()
        self.decoder = nn.Conv1d(
            in_channels=in_channels, out_channels=vocab_size, kernel_size=1)

    def forward(self, x):
        return self.decoder(x)
It takes a tensor of shape (batch, channels, width) and returns (batch, 28, width), where 28 is my vocab size. My vocab labels start from 1; label 0 is reserved for the blank id.
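To rule out the decoder itself, I checked its output shape with a quick standalone test (the input channel count of 64 is just an arbitrary example, not my real model config):

```python
import torch
import torch.nn as nn

class ConvDecoder(nn.Module):
    def __init__(self, in_channels, vocab_size):
        super().__init__()
        # 1x1 convolution: maps channels -> vocab_size, keeps width unchanged
        self.decoder = nn.Conv1d(
            in_channels=in_channels, out_channels=vocab_size, kernel_size=1)

    def forward(self, x):
        return self.decoder(x)

decoder = ConvDecoder(in_channels=64, vocab_size=28)
x = torch.randn(1, 64, 100)   # (batch, channels, width)
out = decoder(x)
print(out.shape)              # torch.Size([1, 28, 100])
```

So the shape coming out of the decoder looks right.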
Then I reshape the model output to (width, batch, 28); here is the exact code:
logits = self.forward(inputs)
batch_size, channels, sequence = logits.size()
logits = logits.view((sequence, batch_size, channels))
probs = nn.functional.log_softmax(logits, dim=-1)
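As a sanity check on the log_softmax step, the exponentiated log-probabilities should sum to 1 over the class dimension. A minimal sketch with random numbers (sizes are made up):

```python
import torch
import torch.nn as nn

logits = torch.randn(100, 1, 28)  # (width, batch, vocab), as above
probs = nn.functional.log_softmax(logits, dim=-1)

# exp(log_softmax) is a proper probability distribution over the last dim
ok = torch.allclose(probs.exp().sum(dim=-1), torch.ones(100, 1))
print(ok)  # True
```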
And finally the CTC loss:
loss = self.loss(probs, outputs, input_lengths, output_lengths)
This is how I instantiate the loss:
self.loss = nn.CTCLoss(blank=0)
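For comparison, CTCLoss instantiated the same way does return a finite value when I feed it random but shape-consistent inputs (all sizes here are made up, not taken from my training run):

```python
import torch
import torch.nn as nn

ctc = nn.CTCLoss(blank=0)

T, N, C = 100, 1, 28  # made-up input length, batch size, class count
log_probs = torch.randn(T, N, C).log_softmax(dim=-1)    # (T, N, C)
targets = torch.randint(1, 28, (N, 20))                 # random non-blank labels
input_lengths = torch.full((N,), T, dtype=torch.long)
target_lengths = torch.full((N,), 20, dtype=torch.long)

loss = ctc(log_probs, targets, input_lengths, target_lengths)
print(torch.isfinite(loss))  # tensor(True)
```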
And this is my vocab dictionary:
{'a': 1,
'b': 2,
'c': 3,
'd': 4,
'e': 5,
'f': 6,
'g': 7,
'h': 8,
'i': 9,
'j': 10,
'k': 11,
'l': 12,
'm': 13,
'n': 14,
'o': 15,
'p': 16,
'q': 17,
'r': 18,
's': 19,
't': 20,
'u': 21,
'v': 22,
'w': 23,
'x': 24,
'y': 25,
'z': 26,
' ': 27,
'[unk]': 28}
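For context, this is roughly how I turn text into targets with that vocab (encode is a simplified stand-in for my preprocessing, not my exact training code):

```python
# Rebuild the vocab programmatically (equivalent to the dict above)
vocab = {ch: i + 1 for i, ch in enumerate('abcdefghijklmnopqrstuvwxyz ')}
vocab['[unk]'] = 28

def encode(text):
    # map each character to its label, falling back to [unk]
    return [vocab.get(ch, vocab['[unk]']) for ch in text]

print(encode('ab c'))  # [1, 2, 27, 3]
```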
Indexes start from 1, and I didn't assign any character to the blank id in my vocab.
I tried several examples of CTC models and they worked; I just can't figure out what I've done wrong.
Here are the training logs:
Epoch 0: 0%| | 2/28539 [00:04<17:53:05, 2.26s/it, loss=inf, v_num=36]
output shape: torch.Size([732, 1, 28])
Epoch 0: 0%| | 3/28539 [00:05<15:16:36, 1.93s/it, loss=inf, v_num=36]
output shape: torch.Size([1508, 1, 28])
Epoch 0: 0%| | 4/28539 [00:08<16:28:45, 2.08s/it, loss=inf, v_num=36]
output shape: torch.Size([549, 1, 28])
Note that "output shape" refers to the final tensor that goes into self.loss; I set the batch size to 1 for simplicity.
Thank you.