Transformer training with shifted targets

I was able to train my transformer down to a loss below 0.04 over 40 epochs, but I could not get autoregressive inference to work at all. I then realized that during training I was not shifting the target sentence: the decoder input should be the target shifted right by one position, which is what trains the model to predict the next word in the sequence. So I added the shift, but it completely ruined training, and now I can't get the loss below 4.0 over 40 epochs. I've attached my code and some outputs below; if anyone could take a look, I'd greatly appreciate it. My transformer model is based on Harvard NLP's modern PyTorch implementation (the Annotated Transformer).
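To be concrete about the shift, here is a minimal sketch of the standard teacher-forcing setup (a toy Python list instead of a batch tensor; the names dec_input/dec_target are mine):

# One padded target sequence:
trg = ['BOS', 'you', 'and', 'i', 'EOS', 'PAD']

dec_input  = trg[:-1]  # ['BOS', 'you', 'and', 'i', 'EOS']  -> fed to the decoder
dec_target = trg[1:]   # ['you', 'and', 'i', 'EOS', 'PAD']  -> scored against the output

# At position t the decoder sees dec_input[:t+1] and must predict dec_target[t],
# i.e. the next token; the causal mask keeps it from looking ahead.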

import torch

optimizer = torch.optim.Adam(params=model.parameters(), lr=5e-4)
loss_fn = torch.nn.CrossEntropyLoss(ignore_index=0)  # don't score PAD (id 0) positions
train_losses = []
val_losses = []
bleu_scores = []
for epoch in range(1,EPOCHS+1):

    # train loop 
    for i, (src, trg) in enumerate(train_data):

        # move the batch to the device as LongTensors
        src = torch.as_tensor(src, dtype=torch.long, device=DEVICE)
        trg = torch.as_tensor(trg, dtype=torch.long, device=DEVICE)

        tgt = trg[:, :-1]   # decoder input: all tokens except the last
        tgt_y = trg[:, 1:]  # prediction targets: the decoder input shifted ahead by one

        # debug prints: sanity-check decoder input/target for one batch
        print('tgt: ', id_to_word(tgt, en_index_dict))
        print('tgt_y: ', id_to_word(tgt_y, en_index_dict))
        print('tgt shape: ', tgt.size())
        print('tgt_y shape: ', tgt_y.size())

        src_mask = (src != PAD).unsqueeze(-2)   # hide PAD positions from encoder attention
        tgt_mask_ = (tgt != PAD).unsqueeze(-2)  # padding mask for the decoder input
        # combine the padding mask with the causal mask so position t cannot attend past t
        tgt_mask = tgt_mask_ & transformer_nlp.subsequent_mask(tgt.size(-1)).type_as(tgt_mask_.data)
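        # For reference, the Annotated Transformer's subsequent_mask(3) is the
        # lower-triangular pattern
        #   [[[ True, False, False],
        #     [ True,  True, False],
        #     [ True,  True,  True]]]
        # so each position can attend only to itself and earlier positions.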

        # forward pass; expected output shape: (batch, tgt_len, tgt_vocab)
        out = model(src, tgt, src_mask, tgt_mask)

        # print the greedy (argmax) prediction for comparison with the target
        pred_ids = out.argmax(dim=-1)
        print('out: ', id_to_word(pred_ids, en_index_dict))
        

        # compute loss over flattened (batch * tgt_len, vocab) logits vs (batch * tgt_len,) ids
        train_loss = loss_fn(out.reshape(-1, tgt_vocab), tgt_y.reshape(-1))
        
        # backprop 
        optimizer.zero_grad()
        train_loss.backward()

        # update weights 
        optimizer.step()
Output (from the debug prints above):

src:  [['BOS', '你', '和', '我', '有', '相', '同', '的', '想', '法', '。', 'EOS', 'PAD',...]]
tgt:  [['BOS', 'you', 'and', 'i', 'have', 'the', 'same', 'idea', '.', 'EOS', 'PAD', 'PAD', 'PAD',...]]
tgt_y:  [['you', 'and', 'i', 'have', 'the', 'same', 'idea', '.', 'EOS', 'PAD', 'PAD','PAD','PAD','PAD'...]]
tgt shape:  torch.Size([64, 59])
tgt_y shape:  torch.Size([64, 59])
out:  [["'ve", 'cradle', 'artificial', 'minutes', 'letting', 'match', 'mysteries', ...]]