Masked Transformer for CSV-format data

I am trying to develop a Masked Transformer for play-by-play football data. The data is stored in CSV files, where each file is a game, each row is a play, and each column is an attribute of that play.

Example of the data turned into a tensor:

    input_tensor = torch.tensor([[ 1., 0.,  0., 0.,  0.],
                                 [ 1., 0.,  9., 3.,  3.],
                                 [ 1., 0., 52., 2.,  6.],
                                 [ 1., 1., 32., 1., 10.]])

Each row in this tensor is a play, and the values in that row are its attributes/features. For example, the first row is one play with features [1., 0., 0., 0., 0.].
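To make the layout concrete, here is a minimal sketch of turning one game's CSV into such a tensor. It assumes every column is already numeric. The column names and the `game_to_tensor` helper are hypothetical and not taken from the real data.

```python
import csv
import io

import torch

# Hypothetical CSV contents for one game; the column names are made up.
csv_text = """quarter,possession,yard_line,down,distance
1,0,0,0,0
1,0,9,3,3
1,0,52,2,6
1,1,32,1,10
"""


def game_to_tensor(file_obj):
    """Read one game's CSV into a float tensor of shape (num_plays, num_features)."""
    reader = csv.reader(file_obj)
    header = next(reader)  # first line holds the column names
    rows = [[float(value) for value in row] for row in reader if row]
    return header, torch.tensor(rows, dtype=torch.float32)


# io.StringIO stands in for an opened file such as open("game.csv", newline="").
header, input_tensor = game_to_tensor(io.StringIO(csv_text))
print(header)
print(input_tensor.shape)
```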

I have a Transformer architecture of the following format.

    import torch.nn as nn

    class FootballTransformer(nn.Module):
        def __init__(self, input_dim, output_dim, nhead=2, num_layers=2):
            super().__init__()
            # Note: d_model (input_dim / output_dim here) must be divisible by nhead.
            self.encoder_layer = nn.TransformerEncoderLayer(input_dim, nhead)
            self.encoder = nn.TransformerEncoder(self.encoder_layer, num_layers)
            self.decoder_layer = nn.TransformerDecoderLayer(output_dim, nhead)
            self.decoder = nn.TransformerDecoder(self.decoder_layer, num_layers)
            self.linear = nn.Linear(input_dim, output_dim)

I want to implement several training techniques. The first is to randomly mask different attributes in different rows so the model learns the structure of the data. The second is to fine-tune the model by masking only the last row/play in the input so that it learns to predict the next play/row.
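A minimal sketch of what these two masking schemes could look like, assuming a masked value is simply overwritten with a placeholder. The placeholder 0.0 is an assumption and collides with real zeros in the data, so a dedicated sentinel value may work better. The helper names `mask_random_features` and `mask_last_play` are illustrative. In both cases the boolean mask has the same shape as the input and is applied to the data itself before it reaches the model.

```python
import torch

input_tensor = torch.tensor([[1., 0., 0., 0., 0.],
                             [1., 0., 9., 3., 3.],
                             [1., 0., 52., 2., 6.],
                             [1., 1., 32., 1., 10.]])


def mask_random_features(x, mask_prob=0.15, mask_value=0.0, generator=None):
    """Hide each individual feature with probability mask_prob.

    Returns the masked copy of x and a boolean mask of the same shape,
    where True marks a value that was hidden.
    """
    mask = torch.rand(x.shape, generator=generator) < mask_prob
    x_masked = x.clone()
    x_masked[mask] = mask_value
    return x_masked, mask


def mask_last_play(x, mask_value=0.0):
    """Hide every feature of the last row/play only."""
    mask = torch.zeros_like(x, dtype=torch.bool)
    mask[-1, :] = True
    x_masked = x.clone()
    x_masked[mask] = mask_value
    return x_masked, mask


gen = torch.Generator().manual_seed(0)  # fixed seed so the run is repeatable
pretrain_input, pretrain_mask = mask_random_features(input_tensor, 0.5, generator=gen)
finetune_input, finetune_mask = mask_last_play(input_tensor)

# The original values at the hidden positions are the training targets.
pretrain_targets = input_tensor[pretrain_mask]
finetune_targets = input_tensor[finetune_mask]
print(pretrain_mask)
print(finetune_targets)
```

Because the mask records exactly which positions were hidden, a loss can be restricted to those positions by indexing both the model output and the original tensor with the same mask.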

I am currently confused about how to mask properly. I assumed the model would want a mask with the same shape as the input, where the features/rows to be masked sit at the same positions in both tensors.

When I use the example data above, the model wants a mask of shape (4, 4), which is (sequence_len, sequence_len). I do not understand this. How would I either alter my model to use a mask of the same size as the input, so I can randomly select features/rows to mask, or build a mask that allows me to do this?
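For context, the (sequence_len, sequence_len) shape matches the `mask` argument of `nn.TransformerEncoder`, which is an attention mask: entry [i, j] controls whether play i may attend to play j, so it has no feature dimension. The sketch below illustrates this with random stand-in data. The choices of `nhead=1`, `dim_feedforward=16`, `dropout=0.0` and `batch_first=True` are assumptions made only to keep the example small and repeatable.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
seq_len, num_features = 4, 5
x = torch.rand(1, seq_len, num_features)  # (batch, plays, features), stand-in data

# nhead=1 because d_model (5 here) must be divisible by nhead.
layer = nn.TransformerEncoderLayer(d_model=num_features, nhead=1,
                                   dim_feedforward=16, dropout=0.0,
                                   batch_first=True)
encoder = nn.TransformerEncoder(layer, num_layers=2)

# Boolean attention mask of shape (seq_len, seq_len).
# True at [i, j] means play i is NOT allowed to attend to play j.
attn_mask = torch.zeros(seq_len, seq_len, dtype=torch.bool)
attn_mask[:, -1] = True     # no play may look at the last play ...
attn_mask[-1, -1] = False   # ... except the last play itself

with torch.no_grad():
    out = encoder(x, mask=attn_mask)

    # Changing the last play leaves the outputs of the earlier plays unchanged,
    # because those plays never attend to it.
    x_changed = x.clone()
    x_changed[0, -1] += 1.0
    out_changed = encoder(x_changed, mask=attn_mask)

print(out.shape)
print(torch.allclose(out[0, :-1], out_changed[0, :-1], atol=1e-6))
```

This kind of mask can hide whole plays from each other but cannot hide individual features within a play. Hiding single features would have to happen on the input values themselves.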

If masking specific features is not an option and I can only mask a whole row/play, how would I create a mask so that I know exactly which rows/plays I am masking? With longer sequences of, say, 100+ plays, I want to be able to mask a large number of random rows/features.
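As an illustration of row-level masking where the masked rows are known exactly, here is a minimal sketch. It assumes a sequence of 100 random stand-in plays and a placeholder value of 0.0 for masked rows. The helper name `mask_random_plays` is hypothetical.

```python
import torch


def mask_random_plays(x, num_masked, mask_value=0.0, generator=None):
    """Hide num_masked randomly chosen rows/plays of x.

    Returns the masked copy, a boolean row mask of shape (seq_len,),
    and the sorted indices of the rows that were hidden.
    """
    seq_len = x.shape[0]
    # A random permutation gives distinct row indices without repeats.
    masked_rows = torch.randperm(seq_len, generator=generator)[:num_masked]
    masked_rows, _ = torch.sort(masked_rows)
    row_mask = torch.zeros(seq_len, dtype=torch.bool)
    row_mask[masked_rows] = True
    x_masked = x.clone()
    x_masked[row_mask] = mask_value  # overwrites every feature of those rows
    return x_masked, row_mask, masked_rows


gen = torch.Generator().manual_seed(0)
plays = torch.rand(100, 5, generator=gen)  # stand-in for a 100-play game
x_masked, row_mask, masked_rows = mask_random_plays(plays, num_masked=30, generator=gen)

print(masked_rows.tolist())   # exactly which plays were hidden
print(int(row_mask.sum()))    # how many plays were hidden
```

Keeping `masked_rows` alongside the masked tensor makes it possible to look up the original values of exactly those plays later, for example as prediction targets.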

I am willing to pay for working code that gets me what I want.