The naming convention can be a bit confusing. GPT-style models are often referred to as “decoder only” because they require a causal mask, which is not typically used on the encoder side of an encoder-decoder model. Decoders in encoder-decoder models also use cross attention, though, which is not found in GPT-style “decoder only” models. It’s that cross attention that expects encoder outputs, and that’s what is giving you trouble here. Some might prefer to call GPT-style models “encoder only” instead of “decoder only.”
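For example, here is a minimal sketch (shapes and layer sizes are arbitrary) showing why `nn.TransformerDecoder` keeps asking for encoder outputs: its forward pass takes a required `memory` argument that feeds the cross-attention sublayer.

```python
import torch
import torch.nn as nn

layer = nn.TransformerDecoderLayer(d_model=64, nhead=4, batch_first=True)
decoder = nn.TransformerDecoder(layer, num_layers=2)

tgt = torch.randn(2, 10, 64)     # target sequence embeddings
memory = torch.randn(2, 12, 64)  # encoder outputs consumed by cross attention
out = decoder(tgt, memory)       # cannot be called without `memory`
```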
I haven’t tried it with the PyTorch transformer modules, but your best bet for a GPT-style model might actually be to use the PyTorch encoder instead of the decoder. nn.TransformerEncoder.forward does accept an is_causal parameter. Then again, you may just want to implement your own module for a GPT-style transformer.
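Something along these lines should work, though take it as an untested sketch; the `MiniGPT` name and all hyperparameters are just placeholders for illustration:

```python
import torch
import torch.nn as nn

class MiniGPT(nn.Module):
    def __init__(self, vocab_size, d_model=256, nhead=4, num_layers=4, max_len=512):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, d_model)
        self.pos_emb = nn.Embedding(max_len, d_model)
        layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=nhead, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, idx):
        # idx: (batch, seq_len) token ids
        seq_len = idx.size(1)
        pos = torch.arange(seq_len, device=idx.device)
        x = self.tok_emb(idx) + self.pos_emb(pos)
        # Causal mask so position i can only attend to positions <= i
        mask = nn.Transformer.generate_square_subsequent_mask(seq_len).to(idx.device)
        x = self.encoder(x, mask=mask, is_causal=True)
        return self.lm_head(x)  # logits over the vocabulary

model = MiniGPT(vocab_size=1000)
logits = model(torch.randint(0, 1000, (2, 16)))  # (2, 16, 1000)
```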
First of all, thank you for your answer. I checked the paper again and you seem to be right. I have another question, if it’s OK.
So BERT also uses nearly the same architecture, basically stacked encoders. As far as I can see, only the way they are trained (masked vs. causal LM) differs. Is that correct?
Yes, at their core, they are all transformer-based, using only “one half”: decoder-only or encoder-only, as @nairbv nicely summarized.
The difference is in how they are used, i.e., the training setup. GPT variants use autoregressive training that relies on the “you are not allowed to peek into the future” mask :). BERT is a masked LM where certain words are masked out and have to be predicted during training.
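Roughly, the two masking styles look like this (a toy sketch; the token ids, mask id, and 15% masking rate are just illustrative values):

```python
import torch

tokens = torch.tensor([[5, 9, 2, 7]])  # (batch=1, seq_len=4)
seq_len = tokens.size(1)

# GPT-style: a causal *attention* mask. True entries are blocked,
# so position i can only attend to positions <= i (no peeking ahead).
causal_mask = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)

# BERT-style: replace some *input tokens* with a [MASK] id and train
# the model to predict the original tokens at exactly those positions.
MASK_ID = 103                                       # placeholder mask token id
mask_positions = torch.rand(tokens.shape) < 0.15
masked_input = tokens.masked_fill(mask_positions, MASK_ID)
labels = tokens.masked_fill(~mask_positions, -100)  # ignore unmasked positions in the loss
```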