The naming convention can be a bit confusing. GPT-style models are often referred to as “decoder only” because they require a causal mask, which is not typically used on the encoder side of an encoder-decoder model. Decoders in encoder-decoder models also use cross attention, though, which is not found in GPT-style “decoder only” models. It’s that cross attention that expects encoder outputs, and that’s what is giving you trouble here. Some might prefer to call GPT-style models “encoder only” instead of “decoder only.”
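For example, here is a minimal sketch (shapes and layer sizes are arbitrary) showing why `nn.TransformerDecoder` keeps asking for encoder outputs: its forward pass takes a required `memory` argument that feeds the cross-attention sublayer.

```python
import torch
import torch.nn as nn

layer = nn.TransformerDecoderLayer(d_model=64, nhead=4, batch_first=True)
decoder = nn.TransformerDecoder(layer, num_layers=2)

tgt = torch.randn(2, 10, 64)     # target sequence embeddings
memory = torch.randn(2, 12, 64)  # encoder outputs consumed by cross attention
out = decoder(tgt, memory)       # cannot be called without `memory`
```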
I haven’t tried it with the PyTorch transformer modules, but your best bet for a GPT-style model might actually be to use the PyTorch encoder instead of the decoder. nn.TransformerEncoder.forward does accept an is_causal parameter. Then again, you may just want to implement your own module for a GPT-style transformer.
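Something along these lines should work, though take it as an untested sketch; the `MiniGPT` name and all hyperparameters are just placeholders for illustration:

```python
import torch
import torch.nn as nn

class MiniGPT(nn.Module):
    def __init__(self, vocab_size, d_model=256, nhead=4, num_layers=4, max_len=512):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, d_model)
        self.pos_emb = nn.Embedding(max_len, d_model)
        layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=nhead, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, idx):
        # idx: (batch, seq_len) token ids
        seq_len = idx.size(1)
        pos = torch.arange(seq_len, device=idx.device)
        x = self.tok_emb(idx) + self.pos_emb(pos)
        # Causal mask so position i can only attend to positions <= i
        mask = nn.Transformer.generate_square_subsequent_mask(seq_len).to(idx.device)
        x = self.encoder(x, mask=mask, is_causal=True)
        return self.lm_head(x)  # logits over the vocabulary

model = MiniGPT(vocab_size=1000)
logits = model(torch.randint(0, 1000, (2, 16)))  # (2, 16, 1000)
```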
First of all, thank you for your answer. I checked the paper again and you seem to be right. I have another question, if it’s OK.
So BERT also uses nearly the same architecture, basically stacked encoders. As far as I can see, only the way they are trained (masked vs. causal LM) differs. Is that correct?
Yes, at their core, they are all transformer-based, using only “one half”: decoder-only or encoder-only, as @nairbv nicely summarized.
The difference is in how they are used, i.e., the training setup. GPT variants use autoregressive training that relies on the “you are not allowed to peek into the future” mask :). BERT is a masked LM where certain words are masked out and have to be predicted during training.
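Roughly, the two masking styles look like this (a toy sketch; the token ids, mask id, and 15% masking rate are just illustrative values):

```python
import torch

tokens = torch.tensor([[5, 9, 2, 7]])  # (batch=1, seq_len=4)
seq_len = tokens.size(1)

# GPT-style: a causal *attention* mask. True entries are blocked,
# so position i can only attend to positions <= i (no peeking ahead).
causal_mask = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)

# BERT-style: replace some *input tokens* with a [MASK] id and train
# the model to predict the original tokens at exactly those positions.
MASK_ID = 103                                       # placeholder mask token id
mask_positions = torch.rand(tokens.shape) < 0.15
masked_input = tokens.masked_fill(mask_positions, MASK_ID)
labels = tokens.masked_fill(~mask_positions, -100)  # ignore unmasked positions in the loss
```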