Using Subsequent Mask for Transformer Decoder Leads to NaN values


I’m trying to implement an autoregressive transformer model similar to the one in the paper “Attention Is All You Need”. As I understand it, to replicate the architecture fully I need to give the transformer decoder three masks:

1 - Target subsequent mask: this enforces causality, so a position cannot attend to later positions.
2 - Target padding mask: so attention only looks at non-padded target positions.
3 - Encoder padding mask: so attention only looks at non-padded outputs from the encoder.
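
For context, here is a minimal sketch of how these three masks would map onto the arguments of `torch.nn.TransformerDecoder`, assuming that is the decoder in use (my own `self.decoder` may differ). The sizes, the padding index `PAD`, and the token tensors are hypothetical values chosen only for illustration. Under that module's convention, a boolean mask marks with `True` the positions that must be ignored.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

PAD = 0        # hypothetical padding index
D_MODEL = 16   # hypothetical model width

# Assumption: the decoder is a stock torch.nn.TransformerDecoder.
layer = nn.TransformerDecoderLayer(
    d_model=D_MODEL, nhead=2, dim_feedforward=32, dropout=0.0, batch_first=True
)
decoder = nn.TransformerDecoder(layer, num_layers=1)

tgt_tokens = torch.tensor([[4, 8, 3, PAD, PAD]])     # (batch, tgt_len)
src_tokens = torch.tensor([[6, 2, 9, 1, PAD, PAD]])  # (batch, src_len)

# Stand-ins for the embedded target sequence and the encoder output.
y = torch.randn(1, tgt_tokens.size(1), D_MODEL)
x = torch.randn(1, src_tokens.size(1), D_MODEL)

# 1 - Subsequent mask: True above the diagonal means "may not attend".
tgt_len = tgt_tokens.size(1)
tgt_mask = torch.triu(torch.ones(tgt_len, tgt_len, dtype=torch.bool), diagonal=1)

# 2 and 3 - Padding masks: True marks padded positions to be ignored.
tgt_key_padding_mask = tgt_tokens == PAD
memory_key_padding_mask = src_tokens == PAD

out = decoder(
    y,
    x,
    tgt_mask=tgt_mask,
    tgt_key_padding_mask=tgt_key_padding_mask,
    memory_key_padding_mask=memory_key_padding_mask,
)
print(out.shape, torch.isnan(out).any())
```

With this orientation every query position still has at least one key it is allowed to attend to, so the softmax inside the attention layers stays finite.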

The snippet is here:

y = self.decoder(y, x,

With masks being generated like this:

    def generate_no_peek_mask(self, sz):
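        # Boolean lower-triangular matrix: True where attention is allowed.
        mask = (torch.triu(torch.ones(sz, sz)) == 1).transpose(0, 1)
        # Convert to an additive float mask: 0.0 where allowed, -inf where blocked.
        mask = mask.float().masked_fill(mask == 0, float("-inf")).masked_fill(mask == 1, float(0.0))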
        mask =
        return mask

    def generate_padding_mask(self, seq, pad_idx):
        # Boolean mask: True where the token is NOT padding.
        return (seq != pad_idx).to(self.device)

The problem is that with these masks the softmax produces NaN values. Without the masks, the model does not produce any NaNs.
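
Here is a minimal reproduction outside the model, assuming the attention layer treats `True` in a boolean padding mask as "ignore this key" (that is the convention of `torch.nn.MultiheadAttention`; whether my decoder follows it is an assumption). The padding index `PAD`, the toy sequence, and the helper `masked_softmax` are hypothetical. It shows that a row whose scores are all `-inf` turns into NaN after the softmax, and that this happens when the padding mask is `True` on the real tokens.

```python
import torch

PAD = 0  # hypothetical padding index
seq = torch.tensor([5, 7, 9, PAD, PAD])
sz = seq.size(0)

# Same values as generate_no_peek_mask: 0.0 on and below the diagonal, -inf above.
causal = torch.triu(torch.full((sz, sz), float("-inf")), diagonal=1)


def masked_softmax(ignore: torch.Tensor) -> torch.Tensor:
    # Uniform raw scores; `ignore` marks key positions to drop (True = ignored).
    scores = torch.zeros(sz, sz) + causal
    scores = scores.masked_fill(ignore, float("-inf"))
    return torch.softmax(scores, dim=-1)


# True on real tokens, as generate_padding_mask returns it.
attn_keep = masked_softmax(seq != PAD)
# True on padding tokens instead.
attn_pad = masked_softmax(seq == PAD)

# Per-row NaN check for the first orientation, overall check for the second.
print(torch.isnan(attn_keep).any(dim=-1))
print(torch.isnan(attn_pad).any())
```

In the first case the rows for the three real tokens have every key blocked, either by the causal mask or by the padding mask, so those rows come out as NaN. In the second case each row keeps at least the first key, and no NaN appears.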