LSTM loss is stagnating

Hi there. I know this is a recurrent issue, but I could not find a fitting solution to my problem in other threads.
The task is sequence-to-sequence (Seq2Seq).
I have the following model:

StackedResidualLSTM(
  (encoder): RecurrentEncoder(
    (embed_tokens): Embedding(32100, 128, padding_idx=0)
    (dropout): Dropout(p=0, inplace=False)
    (rnn): LSTM(128, 128, num_layers=2, batch_first=True)
  )
  (decoder): RecurrentDecoder(
    (embed_tokens): Embedding(32100, 128, padding_idx=0)
    (dropout_in_module): Dropout(p=0, inplace=False)
    (dropout_out_module): Dropout(p=0, inplace=False)
    (rnn): LSTM(128, 128, num_layers=2, batch_first=True)
    (dropout): Dropout(p=0, inplace=False)
    (fc_out): Linear(in_features=128, out_features=32100, bias=True)
  )
)

My forward function is the following:

def forward(self, input_ids, tgt, return_loss=True, **kwargs):
    # Encode the source sequence; the encoder state initialises the decoder
    state = self.encoder(input_ids, **kwargs)
    # Teacher forcing: the decoder is fed the targets shifted right by one position
    prev_output_tokens = self._shift_right(tgt)
    decoder_out, _ = self.decoder(prev_output_tokens, state=state, **kwargs)
    if return_loss:
        return self.loss(decoder_out, tgt, ignore_index=kwargs.get("ignore_index", -100))
    return decoder_out
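
The loss helper is roughly the following (a sketch, not the exact code; the point is that the (batch, length, vocab) logits from fc_out are flattened and padded positions are excluded via ignore_index):

import torch.nn.functional as F

def loss(self, decoder_out, tgt, ignore_index=-100):
    # decoder_out: (batch, tgt_len, vocab_size) logits from fc_out
    # tgt:         (batch, tgt_len) gold token ids
    return F.cross_entropy(
        decoder_out.reshape(-1, decoder_out.size(-1)),
        tgt.reshape(-1),
        ignore_index=ignore_index,
    )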

The previous output tokens are the targets shifted right by one position, starting with the eos_token.
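
For clarity, the shift I am describing looks roughly like this (a minimal sketch; self.eos_token_id is just a name for the eos id I take from the tokenizer):

import torch

def _shift_right(self, tgt):
    # Prepend the eos token and drop the last position, so the decoder
    # input at step t is the gold token from step t-1 (teacher forcing).
    start = torch.full(
        (tgt.size(0), 1), self.eos_token_id, dtype=tgt.dtype, device=tgt.device
    )
    return torch.cat([start, tgt[:, :-1]], dim=1)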
The loss is CrossEntropyLoss with ignore_index set to the pad token id.
The optimizer is SGD with lr=1e-3, momentum=0.9, and weight_decay=1e-2.
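
Concretely, the training setup is essentially this (pad_token_id and model stand in for my tokenizer's pad id and the model printed above):

import torch.nn as nn
import torch.optim as optim

# Padded target positions do not contribute to the loss
criterion = nn.CrossEntropyLoss(ignore_index=pad_token_id)
optimizer = optim.SGD(model.parameters(), lr=1e-3, momentum=0.9, weight_decay=1e-2)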
Do you have any insight into what might be preventing the model from training properly? Thanks in advance for any help you can provide.