I was able to train my transformer down to a loss below 0.04 over 40 epochs, but I could not get any sort of autoregressive inference to work. I realized that during training I was not shifting the target sentence by one position at all; that shift is what lets the model learn to predict the next word in the sequence, which inference depends on. So I added the shift, but it completely ruined training: I can't get the loss below 4.0 over 40 epochs. I've attached my code and some outputs below. If anyone could take a look, I'd greatly appreciate it. My transformer model is based on Harvard NLP's modern PyTorch implementation.
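To make the shift concrete, here is a minimal sketch on a toy batch, separate from the training loop. The token ids (`PAD=0`, `BOS=1`, `EOS=2`) and the helper names `subsequent_mask` and `make_decoder_batch` are assumptions for illustration; the mask is a local stand-in and not necessarily identical to `transformer_nlp.subsequent_mask`.

```python
import torch

PAD, BOS, EOS = 0, 1, 2  # assumed special-token ids; PAD=0 matches ignore_index=0


def subsequent_mask(size):
    """Lower-triangular mask: position i may attend to positions <= i."""
    return torch.tril(torch.ones(1, size, size, dtype=torch.bool))


def make_decoder_batch(trg, pad=PAD):
    """Split a padded target batch into decoder input, labels, and mask."""
    tgt_in = trg[:, :-1]  # BOS w1 w2 ... (last position dropped)
    tgt_y = trg[:, 1:]    # w1 w2 ... EOS (BOS dropped), one step ahead of tgt_in
    pad_mask = (tgt_in != pad).unsqueeze(-2)                 # (batch, 1, len)
    tgt_mask = pad_mask & subsequent_mask(tgt_in.size(-1))   # (batch, len, len)
    return tgt_in, tgt_y, tgt_mask


# Toy target sentence: BOS 5 6 7 EOS PAD
trg = torch.tensor([[BOS, 5, 6, 7, EOS, PAD]])
tgt_in, tgt_y, tgt_mask = make_decoder_batch(trg)
```

At every position `t`, the decoder sees `tgt_in[:, :t+1]` and is scored against `tgt_y[:, t]`, so the label is always the token that comes next rather than the token it was just given.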
optimizer = torch.optim.Adam(params=model.parameters(), lr=5e-4)
loss_fn = torch.nn.CrossEntropyLoss(ignore_index=0)
train_losses = []
val_losses = []
bleu_scores = []
for epoch in range(1, EPOCHS + 1):
    # train loop
    for i, (src, tgt) in enumerate(train_data):
        # move tensors to device
        src = torch.Tensor(src).to(DEVICE).long()
        trg = torch.Tensor(tgt).to(DEVICE).long()
        tgt = trg[:, :-1]   # decoder input: starts at BOS, last position dropped
        tgt_y = trg[:, 1:]  # decoder target: same tokens one step ahead, BOS dropped
        # print statements
        print('tgt: ', id_to_word(tgt, en_index_dict))
        print('tgt_y: ', id_to_word(tgt_y, en_index_dict))
        print('tgt shape: ', tgt.size())
        print('tgt_y shape: ', tgt_y.size())
        # padding mask for the source; padding + causal mask for the target
        src_mask = (src != PAD).unsqueeze(-2)
        tgt_mask_ = (tgt != PAD).unsqueeze(-2)
        tgt_mask = tgt_mask_ & transformer_nlp.subsequent_mask(tgt.size(-1)).type_as(tgt_mask_.data)
        # forward pass
        out = model(src, tgt, src_mask, tgt_mask)
        # print prediction vs target sentence
        val, ind = torch.max(out, -1)
        pred_sentence = id_to_word(ind, en_index_dict)
        print('out: ', id_to_word(ind, en_index_dict))
        # compute loss
        train_loss = loss_fn(out.contiguous().view(-1, tgt_vocab), tgt_y.contiguous().view(-1))
        # backprop
        optimizer.zero_grad()
        train_loss.backward()
        # update weights
        optimizer.step()
src: [['BOS', '你', '和', '我', '有', '相', '同', '的', '想', '法', '。', 'EOS', 'PAD',...]]
tgt: [['BOS', 'you', 'and', 'i', 'have', 'the', 'same', 'idea', '.', 'EOS', 'PAD', 'PAD', 'PAD',...]]
tgt_y: [['you', 'and', 'i', 'have', 'the', 'same', 'idea', '.', 'EOS', 'PAD', 'PAD','PAD','PAD','PAD'...]]
tgt shape: torch.Size([64, 59])
tgt_y shape: torch.Size([64, 59])
out: [["'ve", 'cradle', 'artificial', 'minutes', 'letting', 'match', 'mysteries', ...]]
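For reference, this is roughly the kind of autoregressive inference I am trying to get working, written as a minimal greedy-decoding sketch. `next_token_logits` is a hypothetical callable standing in for a forward pass of the trained model, and `toy_logits` is only a stub so the loop can run on its own; neither is part of my actual code, and the `BOS`/`EOS` ids are assumed.

```python
import torch

BOS, EOS = 1, 2  # assumed special-token ids


def greedy_decode(next_token_logits, src, max_len, bos=BOS, eos=EOS):
    """Generate one sequence token by token, feeding each prediction back in.

    next_token_logits(src, ys) is a hypothetical callable: it runs the model
    on the source and the tokens generated so far (shape (1, t)) and returns
    the logits for the next token, shape (1, vocab).
    """
    ys = torch.full((1, 1), bos, dtype=torch.long)
    for _ in range(max_len - 1):
        logits = next_token_logits(src, ys)
        next_id = logits.argmax(dim=-1, keepdim=True)  # (1, 1), most likely token
        ys = torch.cat([ys, next_id], dim=1)
        if next_id.item() == eos:
            break
    return ys


def toy_logits(src, ys, vocab=10):
    """Stub model: counts up from token 3 to token 5, then emits EOS."""
    last = ys[0, -1].item()
    nxt = EOS if last >= 5 else max(last + 1, 3)
    logits = torch.zeros(1, vocab)
    logits[0, nxt] = 1.0
    return logits


decoded = greedy_decode(toy_logits, src=torch.tensor([[BOS, 4, EOS]]), max_len=10)
```

The loop only ever conditions on tokens it has already produced, which is why training has to use the shifted input and target pair: the model must learn to emit token `t+1` from tokens up to `t`.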