Help wanted: How to pad and mask correctly for variable length label sequences


I implemented a transformer encoder which takes some cp_trajectories and has to create a fitting log mel spectrogram for them. Because the inputs as well as the labels are variable in length, I use a custom_collate_fn to pad them like this:

```
import torch
from torch.nn.utils.rnn import pad_sequence

def pad_and_mask(batch):
    # Assuming each element in 'batch' is a tuple (sequence, label)
    sequences = [torch.tensor(item[0]) for item in batch]
    labels = [torch.tensor(item[1]) for item in batch]

    # Pad the sequences to have the same length
    sequences_padded = pad_sequence(sequences, batch_first=True, padding_value=0)
    labels_padded = pad_sequence(labels, batch_first=True, padding_value=0)

    # Create attention masks for sequences
    # Create attention masks for sequences: 1 = real timestep, 0 = padding.
    # Shape must be (batch, max_seq_len) to match the indexing below.
    attention_masks = torch.zeros((len(batch), sequences_padded.size(1)), dtype=torch.float32)
    for i, seq in enumerate(sequences):
        attention_masks[i, :len(seq)] = 1

    # Create label masks for labels
    label_masks = torch.zeros_like(labels_padded, dtype=torch.float32)
    for i, label in enumerate(labels):
        label_masks[i, :len(label)] = 1

    return sequences_padded, attention_masks, labels_padded, label_masks
```
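As a side note on feeding that mask to the transformer: PyTorch's `src_key_padding_mask` uses the opposite convention (`True` marks the positions to *ignore*), so the 1-for-valid mask from the collate fn has to be inverted first. A minimal sketch with made-up dimensions (`d_model=16`, `nhead=4` are just placeholders):

```python
import torch
import torch.nn as nn

encoder_layer = nn.TransformerEncoderLayer(d_model=16, nhead=4, batch_first=True)
encoder = nn.TransformerEncoder(encoder_layer, num_layers=2)

x = torch.randn(2, 5, 16)  # (batch, time, d_model), dummy input
attention_masks = torch.tensor([[1, 1, 1, 0, 0],
                                [1, 1, 1, 1, 1]], dtype=torch.float32)

# src_key_padding_mask expects True at the *padded* positions,
# so invert the 1-for-valid mask from the collate fn:
out = encoder(x, src_key_padding_mask=(attention_masks == 0))
```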

I use the attention_masks in the transformer, and I believe they work fine when I pass them as key_padding_mask. Now I want to calculate the loss with MSELoss, but I don't want the loss to be skewed, because I don't mask the output of my transformer. How would I implement that? Or is that simply a completely wrong approach?
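One common way to do this is to compute the loss with `reduction="none"`, zero out the padded positions, and average only over the real ones. A minimal sketch, assuming `label_masks` has the same shape as `labels_padded` (which is what the `pad_and_mask` above produces, since it uses `zeros_like`); `masked_mse_loss` is a hypothetical helper name:

```python
import torch
import torch.nn as nn

def masked_mse_loss(predictions, targets, label_masks):
    # Per-element squared errors, same shape as targets
    per_element = nn.MSELoss(reduction="none")(predictions, targets)
    # Zero out the contributions from padded frames
    masked = per_element * label_masks
    # Average only over the real (unmasked) elements;
    # clamp avoids division by zero if a batch were all padding
    return masked.sum() / label_masks.sum().clamp(min=1)
```

With this, padded frames contribute exactly zero to both the numerator and the denominator, so padding neither inflates nor dilutes the loss.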

Thanks for the help :D