Hi there. I know this is a recurring issue, but I could not find a fitting solution to my problem in other threads.
The task is a Seq2Seq task.
I have the following model:
```
StackedResidualLSTM(
  (encoder): RecurrentEncoder(
    (embed_tokens): Embedding(32100, 128, padding_idx=0)
    (dropout): Dropout(p=0, inplace=False)
    (rnn): LSTM(128, 128, num_layers=2, batch_first=True)
  )
  (decoder): RecurrentDecoder(
    (embed_tokens): Embedding(32100, 128, padding_idx=0)
    (dropout_in_module): Dropout(p=0, inplace=False)
    (dropout_out_module): Dropout(p=0, inplace=False)
    (rnn): LSTM(128, 128, num_layers=2, batch_first=True)
    (dropout): Dropout(p=0, inplace=False)
    (fc_out): Linear(in_features=128, out_features=32100, bias=True)
  )
)
```
My forward function is the following:
```python
def forward(self, input_ids, tgt, return_loss=True, **kwargs):
    state = self.encoder(input_ids, **kwargs)
    prev_output_tokens = self._shift_right(tgt)
    decoder_out, _ = self.decoder(prev_output_tokens, state=state, **kwargs)
    if return_loss:
        return self.loss(decoder_out, tgt, ignore_index=kwargs.get("ignore_index", -100))
    return decoder_out
```
The previous output tokens are the targets shifted right by one token, starting with the eos_token.
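For concreteness, the right shift I mean is sketched below (pure Python over token-id lists for illustration; my actual `_shift_right` operates on tensors, and `eos_id` here is an assumed token id):

```python
def shift_right(tgt, eos_id):
    """Prepend eos_id and drop the last token, so the decoder input at
    position t is the target token at position t-1 (teacher forcing)."""
    return [eos_id] + tgt[:-1]

# Example: targets [5, 8, 2] become decoder inputs [1, 5, 8] with eos_id=1
print(shift_right([5, 8, 2], eos_id=1))  # → [1, 5, 8]
```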
The loss is cross-entropy with `ignore_index` set to the pad token id, so padded positions do not contribute to the loss.
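To be explicit about the masking behaviour I expect from `ignore_index`, here is a minimal pure-Python sketch (assuming the usual convention: ignored positions are skipped and the loss is averaged over the remaining ones):

```python
import math

def masked_cross_entropy(logits, targets, ignore_index):
    """Cross-entropy averaged only over positions whose target id is
    not ignore_index; ignored positions contribute nothing."""
    total, count = 0.0, 0
    for row, t in zip(logits, targets):
        if t == ignore_index:
            continue  # padded position: skipped entirely
        log_z = math.log(sum(math.exp(x) for x in row))  # log-sum-exp over the vocab
        total += log_z - row[t]                          # -log softmax(row)[t]
        count += 1
    return total / count
```

For example, with uniform logits over two classes, each non-ignored position contributes ln(2) regardless of how many padded positions follow it.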
The optimizer is SGD with lr=1e-3, momentum=0.9, and weight_decay=1e-2.
Do you have any insight into what might be hindering proper training of this model? Thanks in advance for any help you can provide.