Transformer Not Training Properly

I don’t believe I am training/using my transformer correctly. After 20 epochs it produces predictions made up of only one or two token types (shown below). This is my first time building a transformer, and I’m really not sure I’m performing the training loop correctly. I don’t know how to shift the output to the right, or whether that would make a difference. I’m also not feeding previous predictions back into the decoder to generate the next token in the sequence (I was told you only do that at inference time); the whole prediction is computed in one forward pass through the decoder. Could someone help me diagnose this issue?
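
From what I've read, the "shift right" in teacher forcing means the decoder input is the target sequence without its last token, and the loss target is the same sequence without its first token, so position t is trained to predict token t+1. Here is a minimal sketch of my understanding, assuming batch-first (batch, seq_len) tensors and PAD id 0 (the same assumption my loss function makes below), with hypothetical token ids:

import torch

# toy batch with hypothetical ids: PAD=0, BOS=1, EOS=2, words >= 3
trg = torch.tensor([[1, 5, 7, 9, 2, 0, 0]])

trg_input = trg[:, :-1]   # decoder input (target without its last token): [[1, 5, 7, 9, 2, 0]]
trg_output = trg[:, 1:]   # loss target (target without BOS):              [[5, 7, 9, 2, 0, 0]]

# then, roughly:
#   out = model(src, trg_input, mask)                                # (B, T-1, vocab)
#   loss = loss_fn(out.reshape(-1, tgt_vocab), trg_output.reshape(-1))

Is this the right idea?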

target:  [['BOS', 'six', 'divided', 'by', 'two', 'equals', 'three', '.', 'EOS', 'PAD', 'PAD', 'PAD', 'PAD', 'PAD', 'PAD', 'PAD', 'PAD', 'PAD',...]]
predict:  [['BOS', 'EOS', 'EOS', 'EOS', 'BOS', 'EOS', 'EOS', 'EOS', 'EOS', 'EOS', 'EOS', 'EOS', 'EOS', 'EOS', 'EOS', 'EOS', 'EOS',...]]
target:  [['BOS', 'people', 'love', 'to', 'talk', '.', 'EOS', 'PAD', 'PAD', 'PAD', 'PAD', 'PAD', 'PAD', 'PAD', 'PAD', 'PAD', 'PAD',...]]
predict:  [['BOS', '.', 'BOS', '.', '.', '.', '.', '.', '.', '.', '.', '.', '.', '.', '.', '.', '.', '.', '.', '.', '.', '.', ....]]
target:  [['BOS', 'my', 'mother', 'is', 'writing', 'a', 'letter', 'now', '.', 'EOS', 'PAD', 'PAD', 'PAD', 'PAD', 'PAD', 'PAD',...]]
predict:  [['BOS', 'EOS', 'EOS', 'EOS', 'EOS', 'EOS', 'EOS', 'EOS', 'EOS', 'BOS', 'EOS', 'EOS', 'EOS', 'EOS', 'EOS', 'EOS', 'EOS',...]]
Epoch[20/20] train_loss: 4.4391326904296875 val_loss: 4.598245028791757

Here is my training loop:

import torch

# optimization loop
best_loss = 1e5
best_epoch = 0
optimizer = torch.optim.Adam(params=model.parameters(), lr=1e-3)
loss_fn = torch.nn.CrossEntropyLoss(ignore_index=0)  # 0 is the PAD id
train_losses = []
val_losses = []
for epoch in range(1, EPOCHS + 1):

    # train loop
    for i, (src, trg) in enumerate(train_data):

        # move tensors to device
        src = torch.as_tensor(src, dtype=torch.long, device=DEVICE)
        trg = torch.as_tensor(trg, dtype=torch.long, device=DEVICE)

        # causal (lower-triangular) mask over the full padded length
        mask = torch.tril(torch.ones((MAX_LENGTH, MAX_LENGTH))).to(DEVICE)

        # forward pass
        out = model(src, trg, mask)

        # print prediction vs target sentence
        trg_sentence = id_to_word(trg, en_index_dict)
        print('target: ', trg_sentence)
        ind = out.argmax(-1)
        pred_sentence = id_to_word(ind, en_index_dict)
        print('predict: ', pred_sentence)

        # compute loss
        train_loss = loss_fn(out.view(-1, tgt_vocab), trg.view(-1))

        # backprop
        optimizer.zero_grad()
        train_loss.backward()

        # update weights
        optimizer.step()
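
For reference, right now I only build the causal mask and never mask out PAD positions in the decoder's attention, which I suspect could also be part of the problem. Here is a sketch of how I understand combining the two masks (make_trg_mask is my own hypothetical helper; I'm assuming PAD id 0 and the convention that 1/True means "may attend", which I know depends on the model implementation):

import torch

PAD_ID = 0  # assumption: 0 is the PAD id, matching ignore_index=0 above

def make_trg_mask(trg):
    # causal mask: position t may attend only to positions <= t
    T = trg.size(1)
    causal = torch.tril(torch.ones((T, T), dtype=torch.bool, device=trg.device))
    # padding mask: never attend to PAD positions
    pad = (trg != PAD_ID).unsqueeze(1)   # (B, 1, T)
    # broadcast to (B, T, T); 1 = may attend, 0 = masked
    return causal.unsqueeze(0) & pad

trg_batch = torch.tensor([[1, 5, 7, 2, 0, 0]])
print(make_trg_mask(trg_batch).int())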

I can provide more code if necessary! Thanks