I followed the tutorial given here. However, the Transformer implementation in the PyTorch codebase is significantly different, and it is closer to the approach proposed by the authors.
Can someone guide me on how to use the PyTorch Transformer for a sequence-to-sequence translation task? I have described the problem in some detail below.
Transformer(src, tgt) parameters: src: the sequence to the encoder (required), tgt: the sequence to the decoder (required). EDIT: For example, take an English-language dataset.
src: The dataset is [32, 5, 256], where 32 is the total number of sentences in the dataset, 5 is the number of words in every sentence, and 256 is the embedding size for each of the 5 words.
tgt: I don’t know what to provide for this argument to the Transformer.
EDIT: I have a similar dataset for French, say of shape [32, 7, 256].
Assume that the positional encodings have already been added to the above src.
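For reference, here is a minimal sketch of the shapes described above, assuming the model is built with batch_first=True so that tensors are laid out as [batch, seq, emb]. The random tensors and the small layer counts are placeholders for illustration, not part of the original setup.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Sizes from the post: 32 sentences, 5 English and 7 French words per
# sentence, 256-dimensional embeddings. Random data stands in for real
# embeddings with positional encodings already added.
src = torch.randn(32, 5, 256)   # sequence to the encoder
tgt = torch.randn(32, 7, 256)   # sequence to the decoder

# batch_first=True gives [batch, seq, emb]; the default layout is [seq, batch, emb].
# One layer per stack is an assumption to keep the sketch fast.
model = nn.Transformer(d_model=256, nhead=8, num_encoder_layers=1,
                       num_decoder_layers=1, dim_feedforward=512,
                       batch_first=True)

out = model(src, tgt)
print(out.shape)  # torch.Size([32, 7, 256])
```

The output has the same shape as tgt: one 256-dimensional vector per decoder position. src and tgt may have different sequence lengths, but their batch size and embedding size must agree.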
Thanks @zhangguanheng66. I am looking for translation; therefore, tgt is required. I was confused about how to format the tgt tensor. However, after some experimenting I figured out the difference between the training and inference-time settings for translation. Thanks nonetheless.
@paganpasta @zhangguanheng66 Can you share how you handled the difference between training and test? I’m having a similar issue and haven’t figured it out yet.
Specifically, given that I’m training this way:
opt.zero_grad()
# Model requires both "inputs" and "targets"
out = model(inp_emb, tgt_emb, src_mask=src_mask, tgt_mask=tgt_mask, src_key_padding_mask=inp_padding_mask, tgt_key_padding_mask=tgt_padding_mask)
loss = # ...
loss.backward()
opt.step()
sch.step()
How do I call the model for inference? I haven’t seen any code online that uses nn.Transformer with a decoder at inference time.
model.eval()
with torch.no_grad():
    # What goes in "tgt_emb" and masks??
    out = model(inp_emb, tgt_emb, src_mask=src_mask, tgt_mask=tgt_mask, src_key_padding_mask=inp_padding_mask, tgt_key_padding_mask=tgt_padding_mask)
    # ...
# Model requires both "inputs" and "targets"
for i in range(2, targets.size(1)):
    opt.zero_grad()
    trimmed_tgt = targets[:, :i].contiguous()
    in_tgt = trimmed_tgt[:, :-1]
    exp_tgt = trimmed_tgt[:, 1:]
    # Some code missing here, assume in_tgt gets converted to in_tgt_emb
    out = model(inp_emb, in_tgt_emb, tgt_mask=tgt_mask,
                src_key_padding_mask=inp_padding_mask, tgt_key_padding_mask=tgt_padding_mask)
    loss = criterion(out, exp_tgt)
    loss.backward()
    opt.step()
    sch.step()
Not sure if I’m supposed to accumulate losses in the loop there or not, but this seems to be getting more realistic results than I was getting before.
For training, you pass the complete inp_emb and tgt_emb, both with positional encodings added. tgt_emb holds your ground-truth embeddings. The causal tgt_mask gives each position access to only the previous outputs in time. Note that nn.Transformer does not build this mask for you; you pass it in yourself, for example from generate_square_subsequent_mask.
However, during inference you don’t have access to the ground-truth tgt_emb and have to feed output[i-1] as the tgt_emb at time i. I did it something like this:
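import torch
# Let's assume batch_size = 1 and a [batch, seq, emb] layout (batch_first=True)
initial_dec_input = torch.zeros(1, 1, emb_dim)  # all 0s, acts as the start input
tgt_emb = torch.zeros(1, tgt_size, emb_dim)
tgt_emb[0, 0, :] = initial_dec_input[0, 0, :]
# Causal mask, as in training, so step i ignores the still-empty later positions
tgt_mask = torch.triu(torch.full((tgt_size, tgt_size), float("-inf")), diagonal=1)
with torch.no_grad():
    for i in range(tgt_size - 1):  # stop one early so i + 1 stays in bounds
        out = model(inp_emb, tgt_emb, tgt_mask=tgt_mask)
        tgt_emb[0, i + 1, :] = out[0, i, :]  # output of step i becomes input i + 1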
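A variant of the same idea, as a rough sketch rather than a definitive recipe: encode the source once, then grow the decoder input one step at a time instead of pre-allocating it. This assumes model is a plain nn.Transformer built with batch_first=True, so that model.encoder and model.decoder are available; the sizes and random input are placeholders.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
emb_dim, tgt_size = 16, 4   # placeholder sizes for the sketch

model = nn.Transformer(d_model=emb_dim, nhead=2, num_encoder_layers=1,
                       num_decoder_layers=1, dim_feedforward=32,
                       batch_first=True)
model.eval()

inp_emb = torch.randn(1, 5, emb_dim)      # batch_size = 1, 5 source positions
generated = torch.zeros(1, 1, emb_dim)    # all-zero start input, as in the post above

with torch.no_grad():
    memory = model.encoder(inp_emb)       # encode the source only once
    for _ in range(tgt_size):
        # Causal mask sized to the prefix generated so far
        tgt_mask = model.generate_square_subsequent_mask(generated.size(1))
        out = model.decoder(generated, memory, tgt_mask=tgt_mask)
        # Append the newest output step as the next decoder input
        generated = torch.cat([generated, out[:, -1:, :]], dim=1)

decoded = generated[:, 1:, :]             # drop the start input
print(decoded.shape)
```

For a token-based translation model you would normally project each output step to vocabulary logits, pick a token (for example with argmax), and re-embed it before appending, stopping at an end-of-sequence token. The sketch feeds the raw output vectors back in only because the post above does the same.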
Ah, perfect! Yep, this makes sense. Thanks very much for explaining it.
I also want to offer a revision to my previous post. It turns out you get very bad model performance if you run a training sub-loop like I did. The correct way to do teacher forcing is to feed the decoder the targets without their last token and to compute the loss against the targets without their first token, so the two are offset by one position. Since the attention mask already stops the model from cheating by looking ahead, each output token has to be predicted from all the previous tokens in the decoder input. Effectively, this accomplishes the same thing as my training sub-loop, but without the massive inefficiency and bad performance.
# Model requires both "inputs" and "targets"
opt.zero_grad()
in_tgt = targets[:, :-1]
exp_tgt = targets[:, 1:]
tgt_padding_mask = tgt_padding_mask[:, :-1]
# Some code missing here, assume in_tgt gets converted to in_tgt_emb
out = model(inp_emb, in_tgt_emb, tgt_mask=tgt_mask, src_key_padding_mask=inp_padding_mask, tgt_key_padding_mask=tgt_padding_mask)
loss = criterion(out, exp_tgt)
loss.backward()
opt.step()
sch.step()
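The snippet above leaves out how tgt_mask is built. Here is one way it could look, as a sketch under assumptions: a plain nn.Transformer with batch_first=True, random tensors in place of real embeddings, no padded positions, and MSE as a stand-in criterion. The point is that the causal mask and the target padding mask must both match the length of the shifted decoder input.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
batch, src_len, tgt_len, emb_dim = 2, 5, 7, 16   # placeholder sizes

model = nn.Transformer(d_model=emb_dim, nhead=2, num_encoder_layers=1,
                       num_decoder_layers=1, dim_feedforward=32,
                       batch_first=True)

inp_emb = torch.randn(batch, src_len, emb_dim)
tgt_emb = torch.randn(batch, tgt_len, emb_dim)   # full embedded target sequence

in_tgt_emb = tgt_emb[:, :-1]   # decoder input: everything but the last step
exp_tgt = tgt_emb[:, 1:]       # expected output: everything but the first step

# Boolean padding masks, True marks a padded position (nothing is padded here).
inp_padding_mask = torch.zeros(batch, src_len, dtype=torch.bool)
tgt_padding_mask = torch.zeros(batch, tgt_len, dtype=torch.bool)[:, :-1]

# Boolean causal mask, True means "may not attend". Its size is tgt_len - 1,
# the length of the shifted decoder input, not tgt_len.
n = in_tgt_emb.size(1)
tgt_mask = torch.triu(torch.ones(n, n, dtype=torch.bool), diagonal=1)

out = model(inp_emb, in_tgt_emb, tgt_mask=tgt_mask,
            src_key_padding_mask=inp_padding_mask,
            tgt_key_padding_mask=tgt_padding_mask)

# MSE is an assumption for continuous targets; token targets would use
# logits and a cross-entropy loss instead.
loss = nn.functional.mse_loss(out, exp_tgt)
loss.backward()
print(out.shape, tgt_mask.shape)
```

A boolean causal mask is used here so that it has the same dtype as the padding masks. The float mask returned by generate_square_subsequent_mask expresses the same constraint.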
Thanks all for the discussion! Has anyone had problems at inference despite training with teacher forcing, with the decoder input and the loss target offset by one position as described above? I am beginning to suspect that too much teacher forcing during training is disrupting the network’s ability to estimate the sequence on its own, without ground-truth priors. Inference on the training data itself fails badly in my case, despite successful training. I am training the transformer model to generate spectrogram-like features.
You might have to elaborate on what you mean by successful training.
In a nutshell, during training the model predicts the GT at t+1 using the GT from 0...t.
For inference, since you don’t have the GT, you have to iteratively feed the model’s own output back in to arrive at 1…t.
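To make the two regimes concrete, here is a framework-free toy, purely illustrative. The lookup table stands in for a trained model and is wrong for exactly one input; none of it comes from the posts above.

```python
# Toy stand-in for a trained model: maps the previous token to a predicted
# next token. It is wrong for exactly one input (2 -> 4 instead of 2 -> 3).
next_token = {0: 1, 1: 2, 2: 4, 3: 4, 4: 5, 5: 0}

gt = [0, 1, 2, 3, 4, 5]   # ground-truth sequence, gt[0] is the start token
expected = gt[1:]

# Training-style (teacher forcing): every step sees the ground-truth previous token.
teacher_forced = [next_token[tok] for tok in gt[:-1]]

# Inference-style: every step sees the model's own previous output.
free_running = []
prev = gt[0]
for _ in range(len(expected)):
    prev = next_token[prev]
    free_running.append(prev)

def count_errors(pred):
    """Number of positions where the prediction differs from the ground truth."""
    return sum(p != e for p, e in zip(pred, expected))

print(teacher_forced, count_errors(teacher_forced))  # [1, 2, 4, 4, 5] 1
print(free_running, count_errors(free_running))      # [1, 2, 4, 5, 0] 3
```

With teacher forcing the single mistake stays local, because the next step is fed the ground truth again. When the model consumes its own output, the same mistake derails every later step. That is one way a low training loss can coexist with poor free-running inference.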