nn.Transformer not learning, generates repeated tokens

Hello, I am using nn.Transformer, but training does not converge and the model only generates repeated tokens. My training loop looks like this:

# Teacher forcing: the decoder sees every target token except the last
# and is trained to predict every target token except the first.
tgt_output = tgt_input[1:, :]
decoder_input = tgt_input[:-1, :]
tgt_mask = tgt_mask[:, :-1]  # trim the padding mask to match decoder_input

output = model(src_input, src_key_mask=src_mask,
               tgt=decoder_input, tgt_key_mask=tgt_mask)

# Flatten logits and targets so the criterion sees (tokens, vocab_size) vs. (tokens,)
output_flat = output.view(-1, vocab_size)
tgt_flat = tgt_output.flatten()
loss = criterion(output_flat, tgt_flat)

loss.backward()
nn.utils.clip_grad_norm_(model.parameters(), 0.5)
optimizer.step()
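
For context, the model is a thin wrapper around nn.Transformer roughly like the sketch below. This is only an assumption to make the snippet above concrete (the actual model definition is not in this post): the class name Seq2SeqTransformer is hypothetical, and src_key_mask / tgt_key_mask are assumed to be forwarded as the key padding masks.

import torch
import torch.nn as nn

class Seq2SeqTransformer(nn.Module):  # hypothetical wrapper, not the actual posted model
    def __init__(self, vocab_size, d_model=512, nhead=8):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.transformer = nn.Transformer(d_model=d_model, nhead=nhead)
        self.out = nn.Linear(d_model, vocab_size)

    def forward(self, src, tgt, src_key_mask=None, tgt_key_mask=None):
        # src, tgt: (seq_len, batch) token indices; masks: (batch, seq_len) booleans
        dec = self.transformer(
            self.embed(src), self.embed(tgt),
            src_key_padding_mask=src_key_mask,
            tgt_key_padding_mask=tgt_key_mask,
        )
        return self.out(dec)  # (tgt_len, batch, vocab_size) logits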

The source and target batches are sequences of token indices. The model just learns to predict the most frequent token in the vocabulary, so when I evaluate I get output like this:

[1, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4]
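
Evaluation here is plain greedy decoding, roughly like the sketch below. This is an assumption (the actual generation code is not in the post); bos_idx and max_len are placeholder names, and the EOS stopping check is omitted for brevity.

import torch

def greedy_decode(model, src, src_mask, max_len, bos_idx):
    # Sketch of the assumed evaluation loop: feed the tokens generated so far
    # back into the decoder and always pick the highest-scoring next token.
    model.eval()
    with torch.no_grad():
        ys = torch.full((1, src.size(1)), bos_idx, dtype=torch.long, device=src.device)
        for _ in range(max_len - 1):
            logits = model(src, src_key_mask=src_mask, tgt=ys, tgt_key_mask=None)
            next_tok = logits[-1].argmax(dim=-1, keepdim=True).T  # (1, batch)
            ys = torch.cat([ys, next_tok], dim=0)
    return ys  # (max_len, batch) generated token indices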

I am having the same issue.
If you figured it out, may I ask what the problem was?

I am having exactly the same issue. Have you figured out the problem? I have been stuck on this for quite a long time.