I am doing seq2seq where the input is a sequence of images and the output is text (a sequence of word tokens). My model is a pretrained CNN, followed by a self-attention encoder (or LSTM), then a linear layer with logSoftmax applied to get the log probabilities over the classes plus the blank label, shaped (batch, seq, classes+1), and finally CTC.
I am using PyTorch's ctc_loss. I pad all target sequences with the blank token (index 0). I followed all the instructions in the docs (https://pytorch.org/docs/stable/nn.html#ctcloss).
During training, my model starts predicting only blanks after a few batches. The loss decreases slowly and stays very high even after many epochs; only after many, many epochs does the model produce some non-blank tokens. (I take torch.max(output, dim=-1) to get the predictions.)
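As a side note, a plain per-timestep argmax is not a full CTC decode: you also have to collapse repeated labels and drop the blanks. Here is a minimal greedy-decode sketch I would use to inspect the predictions (the function name is my own, not from any library):

```python
import torch

def ctc_greedy_decode(log_probs, blank=0):
    """Greedy CTC decode: argmax per timestep, collapse repeats, drop blanks.

    log_probs: (T, N, C) tensor, in the same layout fed to ctc_loss.
    Returns a list of N decoded label sequences.
    """
    best = log_probs.argmax(dim=-1)          # (T, N): best class per timestep
    decoded = []
    for n in range(best.shape[1]):
        prev = None
        seq = []
        for t in best[:, n].tolist():
            if t != prev and t != blank:     # collapse repeats, skip blanks
                seq.append(t)
            prev = t
        decoded.append(seq)
    return decoded

# toy check: per-timestep argmaxes [1, 1, 0, 2, 2] should decode to [1, 2]
logits = torch.full((5, 1, 3), -10.0)
for t, c in enumerate([1, 1, 0, 2, 2]):
    logits[t, 0, c] = 0.0
print(ctc_greedy_decode(logits.log_softmax(-1)))  # [[1, 2]]
```

With a decode like this, early "all blank" output is easier to distinguish from a model that is emitting labels but has them drowned in blanks.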
When I use an encoder-decoder approach instead, removing the CTC layer entirely, I get a decent BLEU score but a horrible WER (~80).
y shape = torch.Size([2, 13])
y = tensor([[  1,  55,   9, 413, 344,  29, 318,  38,  15, 305, 196, 144,  54],
            [  1, 217, 163,   4, 222,  93,  45,  54,   0,   0,   0,   0,   0]])
output shape = torch.Size([54, 2, 1232])   # (seq, batch, classes+blank)
x_lengths = tensor([44, 54], dtype=torch.int32)
y_lengths = tensor([13, 8], dtype=torch.int32)
loss = ctc_loss(output, y, x_lengths.cpu(), y_lengths.cpu())
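For reference, here is a self-contained sketch that reproduces the call above with the same shapes but random logits (the random data is just a placeholder, not my actual model output). Two details worth double-checking: ctc_loss expects log-probabilities, and padded target entries past each y_lengths[i] are ignored, which is the only reason padding targets with 0 (the blank index) is safe:

```python
import torch
import torch.nn as nn

T, N, C = 54, 2, 1232          # seq length, batch, classes + blank (shapes from above)
torch.manual_seed(0)

# ctc_loss expects log-probabilities, so apply log_softmax over the class dim
output = torch.randn(T, N, C).log_softmax(dim=-1)

# padded targets: valid labels are 1..C-1 (blank must not appear inside the
# valid region); entries past y_lengths[i] are ignored by ctc_loss
y = torch.randint(1, C, (N, 13))
x_lengths = torch.tensor([44, 54], dtype=torch.int32)
y_lengths = torch.tensor([13, 8], dtype=torch.int32)

ctc_loss = nn.CTCLoss(blank=0, zero_infinity=True)  # zero_infinity guards against inf losses
loss = ctc_loss(output, y, x_lengths, y_lengths)
print(loss.item())  # a finite positive scalar
```

Note that y_lengths must be <= x_lengths elementwise (13 <= 44, 8 <= 54 here), otherwise the alignment is infeasible and the loss goes to inf unless zero_infinity=True.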
I am not sure whether this is normal, since I have heard it is pretty hard to train with CTC loss, or whether I am doing something wrong.
I appreciate you taking the time to read this, and any feedback you can give.