Model only predicts blank labels (using CTCLoss)

I’m trying to implement this model in PyTorch:

Here is my implementation so far:

My problem is that after a couple of batches of training the network predicts only the blank label at every time step, and I don’t know why.
I would be thankful for any help.
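
For reference, the symptom can be confirmed by greedy-decoding the per-time-step argmax of the network output and measuring how many frames are blank. This is only a minimal sketch: the helper names `greedy_ctc_decode` and `blank_fraction` and the example sequences are hypothetical, and it assumes the blank has index 0 (the default for `nn.CTCLoss`).

```python
from itertools import groupby

BLANK = 0  # assumption: index 0 is the CTC blank (the nn.CTCLoss default)


def greedy_ctc_decode(frame_ids, blank=BLANK):
    """Collapse consecutive repeats, then drop blanks (greedy CTC decoding)."""
    collapsed = [label for label, _ in groupby(frame_ids)]
    return [label for label in collapsed if label != blank]


def blank_fraction(frame_ids, blank=BLANK):
    """Fraction of time steps whose argmax is the blank label."""
    if not frame_ids:
        return 0.0
    return sum(1 for label in frame_ids if label == blank) / len(frame_ids)


# Hypothetical per-time-step argmax outputs for one sequence.
healthy = [0, 3, 3, 0, 0, 5, 5, 5, 0, 3]
collapsed_model = [0] * 10  # what "only predicts blank" looks like

print(greedy_ctc_decode(healthy), blank_fraction(healthy))
print(greedy_ctc_decode(collapsed_model), blank_fraction(collapsed_model))
```

A blank fraction of 1.0 with an empty decoded sequence on every sample is the behaviour described above.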