Hi there. I know this is a recurring issue, but I could not find a fitting solution to my problem in other threads.
The task is a Seq2Seq task.
I have the following model:
```
StackedResidualLSTM(
  (encoder): RecurrentEncoder(
    (embed_tokens): Embedding(32100, 128, padding_idx=0)
    (dropout): Dropout(p=0, inplace=False)
    (rnn): LSTM(128, 128, num_layers=2, batch_first=True)
  )
  (decoder): RecurrentDecoder(
    (embed_tokens): Embedding(32100, 128, padding_idx=0)
    (dropout_in_module): Dropout(p=0, inplace=False)
    (dropout_out_module): Dropout(p=0, inplace=False)
    (rnn): LSTM(128, 128, num_layers=2, batch_first=True)
    (dropout): Dropout(p=0, inplace=False)
    (fc_out): Linear(in_features=128, out_features=32100, bias=True)
  )
)
```
My forward function is the following:
```python
def forward(self, input_ids, tgt, return_loss=True, **kwargs):
    state = self.encoder(input_ids, **kwargs)
    prev_output_tokens = self._shift_right(tgt)
    decoder_out, _ = self.decoder(prev_output_tokens, state=state, **kwargs)
    if return_loss:
        return self.loss(decoder_out, tgt, ignore_index=kwargs.get("ignore_index", -100))
    return decoder_out
```
The previous output tokens are the targets shifted right by one token, starting with the eos_token.
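For concreteness, the right shift I mean is sketched below (pure Python over token-id lists for illustration; my actual `_shift_right` operates on tensors, and `eos_id` here is an assumed token id):

```python
def shift_right(tgt, eos_id):
    """Prepend eos_id and drop the last token, so the decoder input at
    position t is the target token at position t-1 (teacher forcing)."""
    return [eos_id] + tgt[:-1]

# Example: targets [5, 8, 2] become decoder inputs [1, 5, 8] with eos_id=1
print(shift_right([5, 8, 2], eos_id=1))  # → [1, 5, 8]
```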
The loss is cross-entropy with `ignore_index` set to the pad token id, so padded positions do not contribute to the loss.
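To be explicit about the masking behaviour I expect from `ignore_index`, here is a minimal pure-Python sketch (assuming the usual convention: ignored positions are skipped and the loss is averaged over the remaining ones):

```python
import math

def masked_cross_entropy(logits, targets, ignore_index):
    """Cross-entropy averaged only over positions whose target id is
    not ignore_index; ignored positions contribute nothing."""
    total, count = 0.0, 0
    for row, t in zip(logits, targets):
        if t == ignore_index:
            continue  # padded position: skipped entirely
        log_z = math.log(sum(math.exp(x) for x in row))  # log-sum-exp over the vocab
        total += log_z - row[t]                          # -log softmax(row)[t]
        count += 1
    return total / count
```

For example, with uniform logits over two classes, each non-ignored position contributes ln(2) regardless of how many padded positions follow it.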
The optimizer is SGD with lr=1e-3, momentum=0.9, and weight_decay=1e-2.
Do you have any insight into what might be hindering proper training of this model? Thanks in advance for any help you can provide.