Hi, I would like some help with using CTCLoss for a handwritten text recognition task.
The problem I am facing is that although the loss decreases quite rapidly at the start, it levels out and no longer decreases at around 10% into an epoch. When I inspect the output of the model's forward pass, it seems the model has decided that the best way to minimize CTCLoss is to predict only the first character, followed by blanks for the rest of the sequence.
Originally, based on another post, it seemed the problem was that I was padding with the blank label when encoding label strings into label vectors (because I couldn't batch them if they were of varying length). I have since changed to concatenating all labels into a single vector, as stated in the documentation. However, the loss still levels out and the model predicts gibberish.
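For reference, this is roughly what the concatenated-targets format from the docs looks like in my understanding (a toy char-to-index mapping stands in for my actual tokenizer, with index 0 reserved for the blank):

import torch

CHARS = "abcdefghijklmnopqrstuvwxyz "                   # toy alphabet; 0 is reserved for the CTC blank
char_to_idx = {c: i + 1 for i, c in enumerate(CHARS)}

labels = ["hello", "world", "ctc"]                      # one string per sample in the batch
targets = torch.tensor([char_to_idx[c] for c in "".join(labels)], dtype=torch.long)
target_lengths = torch.tensor([len(s) for s in labels], dtype=torch.long)

# targets has shape (sum(target_lengths),) = (13,) and contains no blanks or padding;
# target_lengths = tensor([5, 5, 3]) tells CTCLoss where each label starts and ends.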
Below is the training code and how I used CTCLoss.
import numpy as np
import torch
from tqdm import tqdm

# model, optimizer, ctc_criterion, tokenizer, train_set, train_loader and device are defined earlier.
for epoch in range(1):
    training_loss = np.array([])
    validation_loss = np.array([])
    with tqdm(total=len(train_set), position=0, leave=True, desc="Epoch {}".format(epoch)) as pbar:
        for i, data in enumerate(train_loader, 0):
            # Zero parameter gradients
            optimizer.zero_grad()
            images, labels, label_lengths = data['image'].to(device), data['label'], data['label_length'].to(device)
            # Forward pass
            outputs = model(images)
            # Mostly 64, except for the last batch
            batch_size = len(data['label_length'])
            # Every sample in the batch produces 128 output time steps
            input_lengths = torch.full(size=(batch_size,), fill_value=128, dtype=torch.long)
            # Concatenate labels into a single tensor of size (sum(target_lengths))
            targets = torch.from_numpy(tokenizer.encode(''.join(labels)))
            # Loss calculation
            loss = ctc_criterion(outputs, targets.to(device), input_lengths, label_lengths)
            # Backpropagate loss
            loss.backward()
            # Gradient descent
            optimizer.step()
            training_loss = np.append(training_loss, loss.item())
            pbar.update(batch_size)
            pbar.set_description(desc="Epoch {}, Train Loss {}".format(epoch, loss.item()))
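In case it matters, this is my understanding of the input shapes nn.CTCLoss expects, as a self-contained example with random values (not my actual model output):

import torch
import torch.nn as nn

T, N, C = 128, 64, 80                                   # time steps, batch size, classes incl. blank 0
log_probs = torch.randn(T, N, C).log_softmax(2)         # (T, N, C) log-probabilities
input_lengths = torch.full((N,), T, dtype=torch.long)
target_lengths = torch.randint(5, 30, (N,), dtype=torch.long)
targets = torch.randint(1, C, (int(target_lengths.sum()),), dtype=torch.long)

ctc = nn.CTCLoss(blank=0)
loss = ctc(log_probs, targets, input_lengths, target_lengths)
print(loss.item())

I am assuming my model's outputs already match this (seq_len=128, batch, classes) log-probability layout, since I pass 128 as the input length for every sample.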
Am I doing anything wrong with the way I use CTC? Let me know if more information is required.
Thanks.