CTCLoss predicts blanks

Hi,

I am doing seq2seq where the input is a sequence of images and the output is a text (a sequence of word tokens). My model is a pretrained CNN + a self-attention encoder (or LSTM) + a linear layer, and I apply logSoftmax to get the log probs over the classes + blank label (batch, seq, classes+1), followed by CTC.
I am using PyTorch's ctc_loss, and I am padding all sequences with the blank token = 0. I followed all the instructions in the docs (https://pytorch.org/docs/stable/nn.html#ctcloss).
When training, my model seems to predict only blanks after a few batches. The loss decreases slowly and stays very high even after many epochs. After many, many epochs the model produces some non-blank tokens (I use torch.max(output, dim=-1) to get the predictions).
When I use an encoder-decoder approach with the CTC layer completely removed, I get a decent BLEU score but a horrible WER (~80).

y shape = torch.Size([2, 13])
y = tensor([[  1,  55,   9, 413, 344,  29, 318,  38,  15, 305, 196, 144,  54],
            [  1, 217, 163,   4, 222,  93,  45,  54,   0,   0,   0,   0,   0]],
           device='cuda:0')

(seq, batch, classes+blank)
output shape = torch.Size([54, 2, 1232])

x_lengths = tensor([44, 54], dtype=torch.int32)
y_lengths = tensor([13, 8], dtype=torch.int32)

loss = ctc_loss(output, y, x_lengths.cpu(), y_lengths.cpu())
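
A minimal greedy decode over this output (argmax per step, collapse repeated tokens, then drop blanks) would be a sketch like this:

def greedy_ctc_decode(output, blank=0):
    # output: (seq, batch, classes) log probs -> list of decoded token lists
    best = output.argmax(dim=-1)          # (seq, batch)
    decoded = []
    for n in range(best.size(1)):
        prev = blank
        tokens = []
        for t in best[:, n].tolist():
            if t != blank and t != prev:  # collapse repeats, drop blanks
                tokens.append(t)
            prev = t
        decoded.append(tokens)
    return decoded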

I am not sure if this is normal, as I have heard it is pretty hard to train with a CTC loss, or whether I am doing something wrong.

I appreciate you taking the time to read this, and any feedback you can give.
Thank you.

You should not pad target sequences with blank; that's the main reason your model predicts blanks.
In general, it is harder to train the full architecture end-to-end with only the CTC loss, depending on the model architecture and depth.
Also try reduction='mean' and zero_infinity=True; I found it works better with these parameters.
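
For example, a minimal sketch with made-up shapes and labels; the key points are that index 0 is reserved for the blank, padding uses a different index, and target_lengths excludes the padding:

import torch

ctc = torch.nn.CTCLoss(blank=0, reduction='mean', zero_infinity=True)

T, N, C = 54, 2, 1232                             # (seq, batch, classes) as above
log_probs = torch.randn(T, N, C).log_softmax(-1)
PAD = 1                                           # any non-blank index works for padding
targets = torch.tensor([[5, 9, 7], [3, 6, PAD]])  # padded label batch
target_lengths = torch.tensor([3, 2])             # true lengths, excluding padding
input_lengths = torch.tensor([44, 54])

loss = ctc(log_probs, targets, input_lengths, target_lengths)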


Hey, sorry for the late response, and thank you for your reply.

I tried reduction='mean' and zero_infinity=True, but they didn't work.
It seems that doing loss = ctc_loss(output, y.cpu(), x_lengths.cpu(), y_lengths.cpu()) and updating PyTorch made it work perfectly.

The CTC loss can also be computed on the CUDA device; there is no need to move the tensors to the CPU.
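
For example, a sketch with made-up shapes; with a recent PyTorch, the native implementation accepts CUDA tensors for all arguments:

import torch

device = 'cuda'
T, N, C = 54, 2, 1232
log_probs = torch.randn(T, N, C, device=device).log_softmax(-1)
targets = torch.randint(1, C, (N, 13), device=device)   # random non-blank labels
input_lengths = torch.tensor([44, 54], device=device)
target_lengths = torch.tensor([13, 8], device=device)

loss = torch.nn.functional.ctc_loss(log_probs, targets, input_lengths,
                                    target_lengths, blank=0)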

Hi, I am hitting the same issue. I pad the target labels with PAD_ID = 1, and BLANK_ID = 0 in my vocabulary. After some batches, the model only predicts blank.

def CTCLoss(ctc_decoder_out, samples):
    # (seq, batch, classes) logits -> log probabilities
    ctcprobs = ctc_decoder_out.log_softmax(-1)
    loss_func = torch.nn.CTCLoss(blank=0, reduction="mean", zero_infinity=False)
    ctc_target = samples["real_target"]    # labels padded with PAD_ID = 1
    src_lengths = samples["src_lengths"]
    # true target lengths = number of non-pad tokens per sequence
    tgt_lengths = ctc_target.ne(tgt_dict.pad()).sum(-1)
    ctc_loss = loss_func(ctcprobs, ctc_target.cpu(), src_lengths.cpu(), tgt_lengths.cpu())
    return ctc_loss

This is my code. I don't know whether it is just hard to train with the CTC loss or whether there is something wrong in my code. After one epoch it still only predicts blank, and the CTC loss decreases really slowly. Could you help solve my problem by any chance? Thanks.

Hi Yan,

Make sure your source lengths are long enough, i.e. src_lengths >= tgt_lengths (otherwise CTC cannot produce a valid alignment), or simply try zero_infinity=True to avoid infinite losses.
Other than that, training with CTC is a bit tricky, so expect to get a lot of blanks in the early epochs; that is perfectly normal.
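
A quick sanity check before the loss call can catch this (a sketch using the variable names from your snippet; strictly, CTC needs even more input steps when a target contains repeated consecutive tokens, since a blank must separate the repeats):

assert (src_lengths >= tgt_lengths).all(), "input too short for a CTC alignment"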

Hope that helps.

Hi PersiQ,

Thank you for your reply. I am sure my source lengths are longer than the target lengths. I tried both reduction="sum" and reduction="mean", but neither worked.
My task is video-based translation, so maybe it is just difficult to train with the CTC loss. My model is a pretrained EfficientNet + self-attention + linear layer, followed by CTCLoss. After 1 epoch the CTC loss is approximately 5.5 (the printed loss is divided by the number of tokens), and after 30 epochs it is still about 5.2. The model still only predicts blank.
Do you have experience training similar models? I hope you can provide some suggestions.

Thank you.

Hi All,

I tried the solutions proposed in this thread, but unfortunately I still get blank predictions. Any idea?

P.S. I noticed that changing the learning rate can change whether the blank predictions appear.

The code is here: https://github.com/cwig/start_follow_read/blob/e9a4f7cb92a63e803c958239be4b907f35efa346/hw_pretraining.py#L65

Thanks a million,
Mughrabi

Hi all,

I don't understand how to pad the targets in a batch.

If I have blank = 0 and a vocabulary of 1..N (so num_classes in my model is N), what is my padding value?

I hope I'm clear. Thanks,
Antoine

Your padded value is your blank. You can simply do:

import torch

labels = [
    torch.Tensor([1, 2]), 
    torch.Tensor([3, 1, 2]),
    torch.Tensor([2])
]

padded_label_batch = torch.nn.utils.rnn.pad_sequence(
    sequences=labels,
    batch_first=True,
    padding_value=0,
).long()

>>> print(padded_label_batch)
tensor([[1, 2, 0],
        [3, 1, 2],
        [2, 0, 0]])

That is, your list (batch) of variable sized labels will be padded at the trailing end to the length of the longest label in the batch. A good place to do this operation is inside the collate function of the dataloader.

Later, for CTC loss, you will need the length of the labels. That can be computed with count_nonzero easily, since the padding was zero.

label_lengths = torch.count_nonzero(padded_label_batch, dim=1)
>>> print(label_lengths)
tensor([2, 3, 1])

Ok, thank you Mashrur Morshed, it's very clear.

However, I can't get my model to converge.

I have 1000 training examples and 100 for validation.
Maybe my network is not deep enough, or I should use the pretrained weights from ImageNet, or maybe I just don't have enough data… If you have any advice, I'll take it.

I get about 10/100 perfect matches after 100 epochs with LR = 0.0001.

My model is the following:

class CRNN(nn.Module):
    def __init__(self, num_classes):
        super(CRNN, self).__init__()

        self.conv1 = nn.Conv2d(3, 32, kernel_size=(3, 3))
        self.norm1 = nn.InstanceNorm2d(32)
        self.conv2 = nn.Conv2d(32, 32, kernel_size=(3, 3), stride=2)
        self.norm2 = nn.InstanceNorm2d(32)
        self.conv3 = nn.Conv2d(32, 64, kernel_size=(3, 3))
        self.norm3 = nn.InstanceNorm2d(64)
        self.conv4 = nn.Conv2d(64, 64, kernel_size=(3, 3), stride=2)
        self.norm4 = nn.InstanceNorm2d(64)

        self.linear1 = nn.Linear(320, 128)
        self.gru = nn.GRU(128, 32, 2, batch_first=True, bidirectional=True)
        self.fc = nn.Linear(64, num_classes)

    def forward(self, x):
        batch_size = len(x)

        ## Convolution part
        out = self.conv1(x)
        out = self.norm1(out)
        out = nn.functional.leaky_relu(out)

        out = self.conv2(out)
        out = self.norm2(out)
        out = nn.functional.leaky_relu(out)

        out = self.conv3(out)
        out = self.norm3(out)
        out = nn.functional.leaky_relu(out)

        out = self.conv4(out)
        out = self.norm4(out)
        out = nn.functional.leaky_relu(out)  # 1, 64, 5, 29

        ## Reshape for the GRU
        out = out.permute(0, 3, 1, 2)  # 1, 29, 64, 5 (width first, then channels and height)
        out = out.view(batch_size, out.size(1), -1)  # 1, 29, 320 (320 values per width step)

        ## A linear layer to reduce the feature size
        out = self.linear1(out)  # 1, 29, 128

        ## Apply the GRU
        out, _ = self.gru(out)  # 1, 29, 64 (doubled because bidirectional)
        out = nn.functional.log_softmax(self.fc(out), dim=-1)
        return out
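
For reference, the forward above returns (batch, seq, classes) while nn.CTCLoss expects (seq, batch, classes), so a permute is needed before the loss. A minimal usage sketch (the input size is illustrative, chosen so the 320-feature reshape works):

import torch

model = CRNN(num_classes=30)
x = torch.randn(4, 3, 30, 125)          # (batch, 3, H, W); gives 64 * 5 = 320 features
log_probs = model(x)                    # (batch, seq=29, num_classes)
log_probs = log_probs.permute(1, 0, 2)  # (seq, batch, classes) for nn.CTCLoss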

How is your model's performance on the training data? If the training loss isn't converging, your model could be underfitting. In that case you could try increasing the number of GRU layers (e.g. to 4). (Then, if it later overfits, you can regularize by turning on the dropout argument of the GRU, i.e. adding dropout between the GRU layers.)

1000 training examples is also quite low, especially if you're training from scratch rather than finetuning; there isn't enough data to generalize to validation/test. If getting more data isn't possible, a pretrained CNN extractor might help (e.g. some model from timm; timm also has a guide on feature extraction from their models which you might find helpful).

Also, are you applying any augmentation to your training data?
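
If it helps, here is a minimal sketch of swapping in a pretrained backbone with timm's features_only mode (the model name and stage index are just examples):

import timm
import torch

# Pretrained extractor; out_indices selects which stage's feature map to return.
backbone = timm.create_model('resnet18', pretrained=True,
                             features_only=True, out_indices=(2,))

x = torch.randn(1, 3, 64, 256)  # illustrative image batch
feats = backbone(x)[0]          # (1, C, H', W') feature map from the chosen stage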

Thanks. After 100 epochs my loss is around 0.2 (starting from a value of 9).

When I run my model on the training data, I only get 22 perfect predictions out of the 1000…

The errors are quite close to the ground truth, for example:
true: elecsprint
pred: eleecsprint

Yes, I should try a pretrained model to extract the features, and no, I am not using data augmentation at the moment.

It's not easy to get more data; I have to annotate it manually.