Answered my own question on this thread.
The code I replaced it with looks like this:
# Model requires both "inputs" and "targets"
for i in range(2, targets.size(1)):
    opt.zero_grad()
    trimmed_tgt = targets[:, :i].contiguous()
    in_tgt = trimmed_tgt[:, :-1]
    exp_tgt = trimmed_tgt[:, 1:]
    # The decoder input length changes every iteration, so the causal mask
    # (and the tgt padding mask) must be regenerated/sliced to match it
    tgt_mask = nn.Transformer.generate_square_subsequent_mask(in_tgt.size(1))
    # Some code missing here, assume in_tgt gets converted to in_tgt_emb
    out = model(inp_emb, in_tgt_emb, tgt_mask=tgt_mask,
                src_key_padding_mask=inp_padding_mask,
                tgt_key_padding_mask=tgt_padding_mask)
    # Note: if criterion is CrossEntropyLoss, the logits and targets need
    # flattening, e.g. criterion(out.reshape(-1, out.size(-1)), exp_tgt.reshape(-1))
    loss = criterion(out, exp_tgt)
    loss.backward()
    opt.step()
    sch.step()
I'm not sure whether I'm supposed to accumulate the losses in the loop or not, but this seems to be giving more realistic results than what I was getting before.
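For what it's worth, with a causal tgt_mask the per-prefix loop usually isn't needed at all: one forward pass over the full shifted target scores every prefix at once, so there is no per-step loss to accumulate. Below is a minimal, self-contained sketch of that single-pass teacher-forcing setup; all the names, sizes, and the tiny model here are illustrative assumptions, not your actual model.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
# Hypothetical toy sizes, purely for illustration
vocab, d_model, batch, src_len, tgt_len = 20, 16, 4, 7, 6

embed = nn.Embedding(vocab, d_model)
model = nn.Transformer(d_model=d_model, nhead=4,
                       num_encoder_layers=1, num_decoder_layers=1,
                       dim_feedforward=32, batch_first=True)
to_vocab = nn.Linear(d_model, vocab)          # assumed output projection
criterion = nn.CrossEntropyLoss()
opt = torch.optim.Adam(
    list(model.parameters()) + list(embed.parameters()) + list(to_vocab.parameters()),
    lr=1e-3)

src = torch.randint(0, vocab, (batch, src_len))
targets = torch.randint(0, vocab, (batch, tgt_len))

# Shift once: decoder sees targets[:-1], is trained to predict targets[1:]
in_tgt, exp_tgt = targets[:, :-1], targets[:, 1:]
# Causal mask prevents each position from attending to later positions,
# so this one pass is equivalent to training on every prefix
tgt_mask = nn.Transformer.generate_square_subsequent_mask(in_tgt.size(1))

opt.zero_grad()
out = model(embed(src), embed(in_tgt), tgt_mask=tgt_mask)
logits = to_vocab(out)                        # (batch, tgt_len - 1, vocab)
loss = criterion(logits.reshape(-1, vocab), exp_tgt.reshape(-1))
loss.backward()
opt.step()
```

The loss is averaged over all positions in one backward pass, which is the usual way to train an encoder-decoder transformer with teacher forcing.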