PyTorch built-in layer for nanoGPT

I’m looking at the nanoGPT code, which defines a single causal self-attention block:

class Block(nn.Module):

    def __init__(self, config):
        super().__init__()
        self.ln_1 = LayerNorm(config.n_embd, bias=config.bias)
        self.attn = CausalSelfAttention(config)
        self.ln_2 = LayerNorm(config.n_embd, bias=config.bias)
        self.mlp = MLP(config)

    def forward(self, x):
        x = x + self.attn(self.ln_1(x))
        x = x + self.mlp(self.ln_2(x))
        return x

I’m trying to represent the decoder-only Transformer model in nanoGPT using one of PyTorch’s built-in layers. At first I thought of using nn.TransformerDecoder (PyTorch 2.1 documentation), but the code suggests it is built to be used as a decoder that consumes an encoder’s output, i.e. the memory argument is required:

  • memory (Tensor) – the sequence from the last layer of the encoder (required).

I’m hoping there’s a simple one- or two-line solution that doesn’t require me to re-implement large parts of the code.

My initial attempt was to use TransformerEncoder and pass is_causal=True. However, I get the same result when I pass is_causal=False, which confuses me.

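import torch
import torch.nn as nn

x = torch.randn(10, 30, 16)  # (batch, sequence, features)

# Build the encoder once so both calls share the same weights,
# and switch to eval mode so dropout does not randomize the outputs.
enc_layer = nn.TransformerEncoderLayer(d_model=16, nhead=4, batch_first=True)
enc = nn.TransformerEncoder(enc_layer, num_layers=2)
enc.eval()

ys = {}

for is_causal in (False, True):
    # No mask is passed here; only the is_causal flag changes.
    ys[is_causal] = enc(x, is_causal=is_causal)

print(torch.allclose(ys[False], ys[True]))  # prints 'True'.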
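
A more direct way to test causality is sketched below. It assumes that is_causal is only a hint about the mask (which is how the PyTorch documentation describes it) and that the causal mask itself has to be passed through the mask argument. The names causal_mask and x_perturbed are illustrative, and norm_first=True is my assumption for mimicking nanoGPT's pre-LayerNorm block. The check replaces the last token and verifies that the outputs at earlier positions do not change.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

batch, seq_len, d_model = 2, 8, 16

# norm_first=True gives the pre-LayerNorm ordering used by the nanoGPT Block.
enc_layer = nn.TransformerEncoderLayer(
    d_model=d_model, nhead=4, batch_first=True, norm_first=True
)
enc = nn.TransformerEncoder(enc_layer, num_layers=2, enable_nested_tensor=False)
enc.eval()  # disable dropout so the comparison is deterministic

# Float mask: -inf above the diagonal blocks attention to future positions.
causal_mask = torch.triu(
    torch.full((seq_len, seq_len), float("-inf")), diagonal=1
)

x = torch.randn(batch, seq_len, d_model)
x_perturbed = x.clone()
x_perturbed[:, -1, :] = torch.randn(batch, d_model)  # replace only the last token

y_masked = enc(x, mask=causal_mask)
y_masked_perturbed = enc(x_perturbed, mask=causal_mask)
y_unmasked = enc(x)
y_unmasked_perturbed = enc(x_perturbed)

# With the mask, positions before the last one must not see the change.
causal_ok = torch.allclose(
    y_masked[:, :-1], y_masked_perturbed[:, :-1], atol=1e-5
)
# Without the mask, earlier positions attend to the last token and do change.
leaks_without_mask = not torch.allclose(
    y_unmasked[:, :-1], y_unmasked_perturbed[:, :-1], atol=1e-5
)

print(causal_ok, leaks_without_mask)
```

If both flags come out True, the masked encoder behaves like a decoder-only stack as far as attention is concerned. nanoGPT's token and position embeddings, final LayerNorm, and output head would still need to be added around it.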

I also found a tutorial on next-word prediction using TransformerEncoder, "Language Modeling with nn.Transformer and torchtext" (PyTorch Tutorials 2.1.1+cu121 documentation). However, it seems the attention mask argument is not passed to the model’s forward() method.