I don’t believe I am training/using my transformer correctly. After 20 epochs it produces predictions consisting of only one or two types of tokens (shown below). This is my first time building a transformer, and I’m really not sure whether I’m performing the training loop correctly. I don’t know how to shift the output to the right, or whether that would make a difference. I’m also not feeding the previous predictions back into the decoder to generate the next token in the sequence (I was told you only do that at inference); the whole prediction is computed in one forward pass through the decoder. Could someone help me diagnose this issue?
target: [['BOS', 'six', 'divided', 'by', 'two', 'equals', 'three', '.', 'EOS', 'PAD', 'PAD', 'PAD', 'PAD', 'PAD', 'PAD', 'PAD', 'PAD', 'PAD',...]]
predit: [['BOS', 'EOS', 'EOS', 'EOS', 'BOS', 'EOS', 'EOS', 'EOS', 'EOS', 'EOS', 'EOS', 'EOS', 'EOS', 'EOS', 'EOS', 'EOS', 'EOS',...]]
target: [['BOS', 'people', 'love', 'to', 'talk', '.', 'EOS', 'PAD', 'PAD', 'PAD', 'PAD', 'PAD', 'PAD', 'PAD', 'PAD', 'PAD', 'PAD',...]]
predit: [['BOS', '.', 'BOS', '.', '.', '.', '.', '.', '.', '.', '.', '.', '.', '.', '.', '.', '.', '.', '.', '.', '.', '.', ....]]
target: [['BOS', 'my', 'mother', 'is', 'writing', 'a', 'letter', 'now', '.', 'EOS', 'PAD', 'PAD', 'PAD', 'PAD', 'PAD', 'PAD',...]]
predit: [['BOS', 'EOS', 'EOS', 'EOS', 'EOS', 'EOS', 'EOS', 'EOS', 'EOS', 'BOS', 'EOS', 'EOS', 'EOS', 'EOS', 'EOS', 'EOS', 'EOS',...]]
Epoch[20/20] train_loss: 4.4391326904296875 val_loss: 4.598245028791757
Here is my training loop:
# optimization loop
best_loss = 1e5
best_epoch = 0
optimizer = torch.optim.Adam(params=model.parameters(), lr=1e-3)
loss_fn = torch.nn.CrossEntropyLoss(ignore_index=0)
train_losses = []
val_losses = []
for epoch in range(1, EPOCHS + 1):
    # train loop
    for i, (src, trg) in enumerate(train_data):
        # place tensors on device
        src = torch.Tensor(src).to(DEVICE).long()
        trg = torch.Tensor(trg).to(DEVICE).long()
        mask = torch.tril(torch.ones((MAX_LENGTH, MAX_LENGTH))).to(DEVICE)
        # forward pass
        out = model(src, trg, mask)
        # print prediction vs target sentence
        trg_sentence = id_to_word(trg, en_index_dict)
        print('target: ', trg_sentence)
        val, ind = torch.max(out, -1)
        pred_sentence = id_to_word(ind, en_index_dict)
        print('predit: ', pred_sentence)
        # compute loss
        train_loss = loss_fn(out.view(-1, tgt_vocab), trg.view(-1))
        # backprop
        optimizer.zero_grad()
        train_loss.backward()
        # update weights
        optimizer.step()
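From what I’ve read, I think the right-shift means the decoder input drops the last target position and the loss target drops the first (BOS), so position i of the input predicts position i of the target — but I’m not sure. Here’s a tiny standalone sketch of what I believe that would look like (the toy ids and the boolean mask convention are just my assumptions, not my actual model):

```python
import torch

# Toy batch of padded target ids: 0 = PAD, 1 = BOS, 2 = EOS
trg = torch.tensor([[1, 5, 6, 7, 2, 0, 0],
                    [1, 8, 9, 2, 0, 0, 0]])

# Right-shift for teacher forcing:
trg_in = trg[:, :-1]   # [BOS, w1, w2, ...] -> fed to the decoder
trg_out = trg[:, 1:]   # [w1, w2, ..., EOS] -> compared against the logits

# Causal mask sized to the decoder input length, not MAX_LENGTH
L = trg_in.size(1)
mask = torch.tril(torch.ones(L, L, dtype=torch.bool))

print(trg_in.shape, trg_out.shape, mask.shape)
```

If that’s right, I guess the forward pass and loss would become `out = model(src, trg_in, mask)` and `loss_fn(out.view(-1, tgt_vocab), trg_out.reshape(-1))` — is that the missing piece?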
I can provide more code if necessary! Thanks
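Edit: for context, this is my understanding of how predictions get fed back in at inference (greedy decoding). The `fake_decoder_step` below is a made-up stand-in for my model’s forward pass, just so the loop shape is concrete — please correct me if the loop itself is wrong:

```python
import torch

BOS, EOS, VOCAB, MAX_LEN = 1, 2, 10, 8

def fake_decoder_step(seq):
    # Stand-in for model(src, seq, mask): returns logits over the vocab
    # for every position. It deterministically favors one token per step
    # and favors EOS once the sequence has 4 tokens.
    logits = torch.zeros(1, seq.size(1), VOCAB)
    next_tok = EOS if seq.size(1) >= 4 else (seq[0, -1].item() + 3) % VOCAB
    logits[0, -1, next_tok] = 1.0
    return logits

# Greedy decoding: start from BOS and repeatedly append the argmax of the
# LAST position's logits until EOS or the length limit is reached.
seq = torch.tensor([[BOS]])
for _ in range(MAX_LEN - 1):
    logits = fake_decoder_step(seq)
    next_tok = logits[0, -1].argmax().item()
    seq = torch.cat([seq, torch.tensor([[next_tok]])], dim=1)
    if next_tok == EOS:
        break

print(seq.tolist())
```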