How to use/train Transformer in PyTorch

I followed the tutorial given here. However, the implementation of Transformer in the PyTorch codebase is significantly different, the latter being closer to the approach proposed by the authors.

Can someone guide me on how to use the PyTorch Transformer for a sequence-to-sequence translation task? I have described the problem in some detail below.

Transformer(src, tgt) parameters: src: the sequence to the encoder (required), tgt: the sequence to the decoder (required). EDIT: For example, an English-language dataset:

src: The dataset has shape [32, 5, 256], where 32 is the total number of sentences in the dataset, 5 is the number of words per sentence, and 256 is the embedding dimension for each word.

tgt: I don’t know what to provide for this argument to the Transformer.

EDIT: I have a similar dataset for French; say its shape is [32, 7, 256].

Assume that the positional encodings have already been added to the src above.
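For concreteness, here is a minimal sketch of the call I am trying to make with these shapes (random tensors stand in for the real embeddings; I understand nn.Transformer expects (seq_len, batch, d_model) by default, so the batch-first tensors are transposed):

import torch
import torch.nn as nn

model = nn.Transformer(d_model=256, nhead=8)

# [32, 5, 256] English batch and [32, 7, 256] French batch, transposed to
# the (seq_len, batch, d_model) layout that nn.Transformer expects by default
src = torch.rand(32, 5, 256).transpose(0, 1)   # -> (5, 32, 256)
tgt = torch.rand(32, 7, 256).transpose(0, 1)   # -> (7, 32, 256)

out = model(src, tgt)
print(out.shape)   # torch.Size([7, 32, 256]) -- one vector per target position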


If you use TransformerEncoder alone, you don’t need to provide tgt. For a translation task, tgt is the target-language sequence, as in a typical translation setup.
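For example, a minimal encoder-only sketch (sizes made up) that involves no tgt at all:

import torch
import torch.nn as nn

enc_layer = nn.TransformerEncoderLayer(d_model=256, nhead=8)
encoder = nn.TransformerEncoder(enc_layer, num_layers=6)

src = torch.rand(5, 32, 256)   # (seq_len, batch, d_model)
memory = encoder(src)          # same shape as src; no tgt argument needed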


Thanks @zhangguanheng66. I am working on translation, so tgt is required. I was confused about how to format the tgt tensor, but after some experimenting I figured out the difference between the training-time and inference-time settings for translation. Thanks nonetheless.

Yes, the setups for training and inference are different.

@paganpasta @zhangguanheng66 Can you share how you handled the difference between training and test? I’m having a similar issue and haven’t figured it out yet.

Specifically, given that I’m training this way:

opt.zero_grad()
# Model requires both "inputs" and "targets"
out = model(inp_emb, tgt_emb,
            src_mask=src_mask, tgt_mask=tgt_mask,
            src_key_padding_mask=inp_padding_mask,
            tgt_key_padding_mask=tgt_padding_mask)
loss = ...  # compute the loss between out and the targets
loss.backward()
opt.step()
sch.step()

How do I call the model for inference? I haven’t seen any code online that uses nn.Transformer with a decoder at inference time.

model.eval()
with torch.no_grad():
  # What goes in "tgt_emb" and the masks??
  out = model(inp_emb, tgt_emb,
              src_mask=src_mask, tgt_mask=tgt_mask,
              src_key_padding_mask=inp_padding_mask,
              tgt_key_padding_mask=tgt_padding_mask)
  # ...

Answered my own question on this thread.

The code I replaced it with looks like this:

# Model requires both "inputs" and "targets"
for i in range(2, targets.size(1)):
  opt.zero_grad()
  trimmed_tgt = targets[:, :i].contiguous()
  in_tgt = trimmed_tgt[:, :-1]
  exp_tgt = trimmed_tgt[:, 1:]
  # Some code missing here, assume in_tgt gets converted to in_tgt_emb
  out = model(inp_emb, in_tgt_emb, tgt_mask=tgt_mask,
              src_key_padding_mask=inp_padding_mask,
              tgt_key_padding_mask=tgt_padding_mask)
  loss = criterion(out, exp_tgt)
  loss.backward()
  opt.step()
  sch.step()

Not sure if I’m supposed to accumulate losses in the loop there or not, but this seems to be getting more realistic results than I was getting before.


Hi @dav-ell.

For training, you pass the complete inp_emb and tgt_emb, updated with positional embeddings. tgt_emb is your ground-truth embeddings; the mask (the square subsequent mask the Transformer module can generate) ensures the decoder only has access to the previous outputs in time.
However, during inference you don’t have access to the ground-truth tgt_emb and have to feed output[i-1] in as the tgt_emb at time i. What I did was something like this:

# let's assume batch_size = 1
initial_dec_input = torch.zeros(1, 1, emb_dim)  # all 0s as the start-of-sequence input
tgt_emb = torch.zeros(1, tgt_size, emb_dim)
tgt_emb[0, 0, :] = initial_dec_input
for i in range(tgt_size - 1):
  out = model(inp_emb, tgt_emb)
  # feed the prediction at step i in as the decoder input for step i+1
  tgt_emb[0, i + 1, :] = out[0, i, :]

Cheers.


Ah, perfect! Yep, this makes sense. Thanks very much for explaining it.

I also want to offer a revision to my previous post. It turns out you get very bad model performance if you use a training sub-loop like the one above. The correct way to do teacher forcing is simply to pass the targets shifted by one position: drop the last token for the decoder input and drop the first token for the expected output. Since the attention mask already prevents the model from cheating by looking ahead, each output token has to be predicted from all the previous tokens in the input. This accomplishes the same thing as my training sub-loop, but without the massive inefficiency and bad performance.

# Model requires both "inputs" and "targets"
opt.zero_grad()
in_tgt = targets[:, :-1]
exp_tgt = targets[:, 1:]
tgt_padding_mask = tgt_padding_mask[:, :-1]
# Some code missing here, assume in_tgt gets converted to in_tgt_emb
out = model(inp_emb, in_tgt_emb, tgt_mask=tgt_mask,
            src_key_padding_mask=inp_padding_mask,
            tgt_key_padding_mask=tgt_padding_mask)
loss = criterion(out, exp_tgt)
loss.backward()
opt.step()
sch.step()
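For reference, the tgt_mask above is assumed to be the usual causal (square subsequent) mask sized for the shifted decoder input; a minimal sketch of building it, reusing model and in_tgt from the snippet:

# causal mask so position i can only attend to positions <= i
tgt_len = in_tgt.size(1)
tgt_mask = model.generate_square_subsequent_mask(tgt_len).to(inp_emb.device)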

So then can the Transformer network only predict ONE timestep of output at a time?

What if the input and target have the same length?

During inference, for most use cases, YES. For training, the predictions happen in parallel.


During training, I’m doing:

pred = model(x, y)

where both have the same shape, (batch, seq, features), and pred has that shape too. But if inference happens one step at a time, how do I get one time-step at a time?

This answer explains the working process. I’m not sure what problem you are facing; hopefully the responses on the other thread can help you out.


Thanks so much! I think I got something working now. Only problem is that inference is super slow because it’s autoregressive.

I’m doing:

# start with a single "start" frame of -1s, then grow tgt autoregressively
tgt = torch.ones(x.size(0), 1, 128) * -1

for i in range(x.size(1)):
    pred = reconstruct_spect_model(x, tgt)
    # append the last predicted frame as the next decoder input
    tgt = torch.cat((tgt, pred[:, -1, :].unsqueeze(1)), 1)

I assume I don’t need any tgt_mask during inference?

Thanks all for the discussion! Has anyone had problems at inference despite training with teacher forcing (decoder input shifted left and loss computed on the target shifted right)? I am beginning to suspect that too much teacher forcing during training disrupts the network’s ability to estimate the sequence on its own, without ground-truth priors. In my case, inference fails badly even on the training data, despite a seemingly successful training run. I am training the transformer model to generate spectrogram-like features.

You might have to elaborate on what you mean by a successful training.
In a nutshell, during training the model predicts the GT at t+1 using the GT from 0...t.
For inference, since you don’t have the GT, you have to iteratively feed the model its own output to arrive at 1…t.
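A compact sketch of that contrast (names like src_emb, tgt_emb, batch_size, d_model, and max_len are placeholders; it assumes seq-first tensors with positional encodings already added and model being an nn.Transformer):

import torch

# training: teacher forcing -- one forward pass predicts every position in
# parallel, with the ground-truth embeddings (shifted by one) as decoder input
dec_in = tgt_emb[:-1]                                    # GT for steps 0...t-1
tgt_mask = model.generate_square_subsequent_mask(dec_in.size(0)).to(dec_in.device)
out = model(src_emb, dec_in, tgt_mask=tgt_mask)          # predictions for steps 1...t

# inference: no GT available, so generate one step at a time from a zero "start" frame
model.eval()
with torch.no_grad():
    generated = torch.zeros(1, batch_size, d_model)      # (seq=1, batch, d_model)
    for _ in range(max_len):
        step_out = model(src_emb, generated)
        # append the newest prediction as the next decoder input
        generated = torch.cat([generated, step_out[-1:]], dim=0)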