Hello, I am getting repeated tokens in the output and training is not converging when using nn.Transformer. My training loop looks like this:
# teacher forcing: the decoder sees tokens up to t-1, the loss is computed against tokens from t
tgt_output = tgt_input[1:, :]
decoder_input = tgt_input[:-1, :]
tgt_mask = tgt_mask[:, :-1]   # key padding mask trimmed to match decoder_input length
output = model(src_input, src_key_mask=src_mask,
               tgt=decoder_input, tgt_key_mask=tgt_mask)
output_flat = output.view(-1, vocab_size)
tgt_flat = tgt_output.flatten()
loss = criterion(output_flat, tgt_flat)
loss.backward()
nn.utils.clip_grad_norm_(model.parameters(), 0.5)
optimizer.step()
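For context, model is a thin wrapper around nn.Transformer. Below is a simplified sketch of what I intend it to do (Seq2SeqTransformer, the default sizes, and the mask handling are illustrative here, not my exact code; positional encoding is omitted for brevity):

import torch
import torch.nn as nn

class Seq2SeqTransformer(nn.Module):
    # illustrative wrapper: embed token indices, run nn.Transformer,
    # project the decoder states back to vocabulary logits
    def __init__(self, vocab_size, d_model=512, nhead=8):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.transformer = nn.Transformer(d_model=d_model, nhead=nhead)
        self.out = nn.Linear(d_model, vocab_size)

    def forward(self, src, tgt, src_key_mask=None, tgt_key_mask=None):
        # causal mask so each decoder position only attends to earlier positions
        causal = self.transformer.generate_square_subsequent_mask(tgt.size(0)).to(tgt.device)
        hidden = self.transformer(
            self.embed(src), self.embed(tgt),
            tgt_mask=causal,
            src_key_padding_mask=src_key_mask,    # (batch, src_len), True marks padding
            tgt_key_padding_mask=tgt_key_mask,    # (batch, tgt_len), True marks padding
            memory_key_padding_mask=src_key_mask,
        )
        return self.out(hidden)                   # (tgt_len, batch, vocab_size)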
The source and target batches are sequences of token indices. The model just learns to predict the most frequent token, so when I try to evaluate I get output like this:
[1, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4]
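For reference, this is roughly how I decode at evaluation time (a simplified greedy-decoding sketch; greedy_decode, bos_idx and eos_idx are illustrative names, not my exact code):

import torch

@torch.no_grad()
def greedy_decode(model, src, src_mask, bos_idx, eos_idx, max_len=128):
    # start from BOS, feed the growing sequence back into the decoder,
    # and take the argmax of the logits at the last position each step
    ys = torch.full((1, src.size(1)), bos_idx, dtype=torch.long, device=src.device)
    for _ in range(max_len):
        logits = model(src, src_key_mask=src_mask, tgt=ys, tgt_key_mask=None)
        next_token = logits[-1].argmax(dim=-1).unsqueeze(0)   # (1, batch)
        ys = torch.cat([ys, next_token], dim=0)
        if (next_token == eos_idx).all():
            break
    return ys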