Mask not making any difference for time series transformers

I am using transformers for a time series forecasting task. Below are my mask code and transformer code. I am having a few issues, one of which is that the validation MAE is 0 while the model performs very poorly during inference (there may be other problems letting the model score perfectly on validation, but I wanted to start here). I suspect the cause is an improper implementation of the mask. To check this hypothesis, I also set the masks to all zeros, and nothing changed. Any ideas why that would be? Any help would be greatly appreciated.

mask code:

```python
def mask(dim1: int, dim2: int):
    # Additive mask: -inf above the diagonal, 0 elsewhere
    return torch.triu(torch.ones(dim1, dim2) * float('-inf'), diagonal=1)
```
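For reference, a quick sanity check of what this helper produces (reproducing the function above so the snippet runs standalone):

```python
import torch

def mask(dim1: int, dim2: int):
    # -inf above the diagonal (future positions), 0 on and below it
    return torch.triu(torch.ones(dim1, dim2) * float('-inf'), diagonal=1)

m = mask(3, 3)
print(m)
# tensor([[0., -inf, -inf],
#         [0.,   0., -inf],
#         [0.,   0.,   0.]])
```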

transformer code:

```python
class Model(nn.Module):

    def __init__(self,
                 d_model: int,
                 heads: int,
                 dropout: float,
                 dim_feedforward: int,
                 stack: int,

                 # [Embedding]
                 channel_in: int,
                 window_size: int,
                 pred_size: int):
        super().__init__()

        self.embedding = Embedding(channel_in=channel_in, window_size=window_size)

        # [Encoder]
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=heads, dim_feedforward=dim_feedforward,
            dropout=dropout, activation='gelu', batch_first=True, norm_first=True)
        self.encoder = nn.TransformerEncoder(
            encoder_layer=encoder_layer, num_layers=stack, norm=nn.LayerNorm(d_model))

        # [Mask] -- uses the module-level mask() helper defined above
        self.tgt_mask = mask(pred_size, pred_size).to(DEVICE)
        self.src_mask = mask(pred_size, window_size).to(DEVICE)

        # [Decoder]
        decoder_layer = nn.TransformerDecoderLayer(
            d_model=d_model, nhead=heads, dim_feedforward=dim_feedforward,
            dropout=dropout, activation='gelu', batch_first=True, norm_first=True)
        self.decoder = nn.TransformerDecoder(
            decoder_layer=decoder_layer, num_layers=stack, norm=nn.LayerNorm(d_model))

        self.out = nn.Linear(d_model, 1)

    def forward(self, x, tgt):
        x = self.embedding(x)
        tgt = self.embedding(tgt)
        memory = self.encoder(x)
        out = self.decoder(tgt, memory, tgt_mask=self.tgt_mask,
                           memory_mask=self.src_mask)
        return self.out(out)
```

P.S. I apologize in advance for the weird formatting. I am new to the PyTorch forums and don't know how to make it so all the code bits are greyed out. If anyone knows how to do that, it may help in future questions.

I still haven't figured it out completely, but in case someone else runs into these issues: the float mask used by `nn.TransformerDecoder` is an *additive* mask, which means that setting the mask to all zeros is equivalent to applying no mask at all, not to masking everything.
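A sketch of why an all-zeros float mask is a no-op: the additive mask is added to the raw attention scores before the softmax, so adding 0 leaves the attention weights unchanged, while adding `-inf` drives a position's weight to exactly 0. A minimal pure-Python illustration (not PyTorch's actual implementation, just the arithmetic):

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of scores
    mx = max(xs)
    es = [math.exp(x - mx) for x in xs]
    total = sum(es)
    return [e / total for e in es]

scores = [2.0, 1.0, 0.5]  # raw attention scores for one query

# All-zeros additive mask: scores are unchanged, so the weights are too
masked_zero = [s + 0.0 for s in scores]
print(softmax(masked_zero) == softmax(scores))  # True

# -inf additive mask: masked positions get exactly zero attention weight
masked_inf = [scores[0] + 0.0, float('-inf'), float('-inf')]
print(softmax(masked_inf))  # [1.0, 0.0, 0.0]
```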