Dimensions of attention mask

Hello!

I’m having quite a hard time getting a custom transformer architecture to work, and I’ve run out of options to ask for help, because I have literally read the whole documentation, all the existing forum threads, and the relevant Stack Overflow articles.

I would love it if someone would be so kind as to explain to me how in God’s name PyTorch’s nn.Transformer expects the mask dimensions to be laid out, so that I can use it for self-supervised training.

I’m passing batch_first=True to the TransformerEncoderLayer and my data has dimensions of batch_size x seq_length. I’m working with a set number of distinct features, let’s say 500, and have a batch size of 32.

In this case my data is a 32 x 500 tensor. And since I want a different mask for each data point in the batch, I would think my mask should be 32 x 500 too.

But of course that is not the case, and at this point I’m so lost that I don’t even have a guess as to what it should be. For the record, I tried adding dimensions with unsqueeze to get 32 x 1 x 500 and 32 x 500 x 1, but to no avail.

Any help would be appreciated, thank you!

How many attention heads nhead did you specify when you constructed your Transformer? If you want a different mask per sample, it looks like you’ll want to specify a [batch_size * nhead, seq_len, seq_len] tensor. Also see Transformer — PyTorch 2.1 documentation
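If it helps, here’s a minimal sketch of building such a per-sample mask (the sizes below are just placeholders). As far as I can tell the first dimension is laid out batch-major, so repeating each sample’s [seq_len, seq_len] mask once per head with repeat_interleave lines it up with the heads:

import torch

batch_size, nhead, seq_len = 32, 5, 500

# One [seq_len, seq_len] boolean mask per sample; True means "not allowed to attend"
per_sample = torch.rand(batch_size, seq_len, seq_len) < 0.15

# Repeat each sample's mask once per head -> [batch_size * nhead, seq_len, seq_len]
attn_mask = per_sample.repeat_interleave(nhead, dim=0)
print(attn_mask.shape)  # torch.Size([160, 500, 500])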

Currently I’m using 5 heads, but that is a hyperparameter of my model. I’m creating the mask in the data collator, which takes my model as a parameter to extract structural information.


from transformers import default_data_collator, DefaultDataCollator
from torch import rand

# This was separated out, because in the future dynamic padding will be needed
# and HF offers no built-in tool for our case

class MyDataCollator(DefaultDataCollator):
    def __init__(self, model, mask_prob=0.15):
        self.model_nhead = model.nhead
        self.d_model = model.d_model
        self.mask_prob = mask_prob

    def __call__(self, input):
        batch = default_data_collator(input)
        # Mask should be size: [batch_size * nhead, seq_len, seq_len]

        mask = rand(batch['src'].shape) < self.mask_prob

        batch_size = batch['src'].shape[0]  # This is now 8 due to the low memory of my laptop
        seq_len = batch['src'].shape[1]     # This is 500, however the embedding transforms it into d_model

        mask = rand(self.model_nhead * batch_size, self.d_model, self.d_model)
        batch['src_mask'] = mask

        return batch

Now when I run this I get the error:

AssertionError: Expected attn_mask shape to be (5, 8, 8) but got torch.Size([40, 100, 100]).

So the size I pass is actually what you suggested, and it is also what I found in the documentation, however the model wants (5, 8, 8).

Could you double check that you are passing in batched inputs? You’d only run into that error if your inputs (i.e. query) are 2D.

Well, I am unsqueezing a third dimension so my tensor is [batch_size x seq_len x 1], and I send in a mask along with it with dimensions [batch_size*nheads, seq_len, seq_len].

This gets me the error:
AssertionError: Expected attn_mask shape to be (5, 8, 8) but got torch.Size([40, 100, 100])
Note that here nheads is 5, batch size is 8, and seq_len is 100, so the wanted size is actually [nheads, batch, batch], but that cannot be right.

If I don’t unsqueeze and send in my data as batch_size x seq_len, and my mask as [nheads*batch_size, seq_len], then I run into:

RuntimeError: The shape of the 2D attn_mask is torch.Size([40, 100]), but should be (8, 8).

Here again the “correct” size is [batch_size x batch_size].

Here is the updated version of my current data collator:


class MyDataCollator(DefaultDataCollator):
    def __init__(self, model, mask_prob=0.15):
        self.model_nhead = model.nhead
        self.d_model = model.d_model
        self.mask_prob = mask_prob

    def __call__(self, input):
        batch = default_data_collator(input)
        # Mask should be size: [batch_size * nhead, seq_len, seq_len], but that doesn't work

        batch['src'] = batch['src'].unsqueeze(2)
        batch_size = batch['src'].shape[0]  # This is now 8 due to the low memory of my laptop
        seq_len = batch['src'].shape[1]     # This is 500, however the embedding transforms it into d_model

        mask = rand(self.model_nhead * batch_size, self.d_model, self.d_model) < self.mask_prob
        batch['src_mask'] = mask

        return batch

Also, your comment about whether I send in batched inputs intrigued me, so I dug into the PyTorch code. I found that the error comes from inside an if statement which checks whether my data is two- or three-dimensional, and interestingly it comes from the 2D branch. So even though I send in batched inputs, somehow PyTorch detects them as unbatched.

I found this in the torch/nn/functional.py file, in the function _mha_shape_check.

But then I printed out the shape of my input tensor right before passing it to the nn.TransformerEncoder, and it was the correct [batch_size, seq_len, 1].

Also, since my first dimension is the batch dimension, I am passing the batch_first=True parameter to the nn.TransformerEncoderLayer. However, it really seemed like it had no effect, so I tested it by swapping the axes of my input so that I had [seq_len x batch_size x 1]. Then it passed the shape check but failed later on a different one.

AssertionError: was expecting embedding dimension of 100, but got 1. I honestly don’t know what the hell is going on with this one, but I thought I’d share it in case it helps.
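For reference, here is a small repro of that shape check with placeholder sizes matching my error (d_model=100, nhead=5): a 2D input is treated as a single unbatched sequence of length 8 with embedding dimension 100, which seems to be exactly why it asks for an attn_mask of (5, 8, 8).

import torch
import torch.nn as nn

layer = nn.TransformerEncoderLayer(d_model=100, nhead=5, batch_first=True)

# A 2D tensor is read as ONE unbatched sequence: (seq_len=8, d_model=100)
x_2d = torch.rand(8, 100)
mask = torch.zeros(40, 100, 100, dtype=torch.bool)  # the [nhead * batch, seq, seq] mask from above

try:
    layer(x_2d, src_mask=mask)
except (AssertionError, RuntimeError) as err:
    print(err)  # Expected attn_mask shape to be (5, 8, 8) but got torch.Size([40, 100, 100])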

Here is an example using masks in the nn.Transformer class you can use and modify as needed:

import torch
import torch.nn as nn

encoder_depth = 512   # d_model: size of the feature dimension
nhead = 8
seq_len = 10
seq_len_tgt = 20
model = nn.Transformer(d_model=encoder_depth, nhead=nhead, batch_first=True)

batch_size = 32

# With batch_first=True the inputs are [batch_size, seq_len, d_model]
src = torch.rand((batch_size, seq_len, encoder_depth))
tgt = torch.rand((batch_size, seq_len_tgt, encoder_depth))

# Per-sample boolean masks of shape [batch_size * nhead, seq_len, seq_len];
# True means "may not attend", so keep the rate low so no row is fully masked
src_mask = torch.rand((batch_size * nhead, seq_len, seq_len)) < 0.15
tgt_mask = torch.rand((batch_size * nhead, seq_len_tgt, seq_len_tgt)) < 0.15

output = model(src, tgt, src_mask, tgt_mask)
print(output.size())  # torch.Size([32, 20, 512])

Thank you for this compact example code, it helped me ask myself the right questions and ultimately solve my problem. I’m gonna leave one thought here for anyone else who runs into my problem in the future.

So originally my problem came from the fact that I was trying to consider single data points rather than sequences of them, and I was mixing up the terminology, since the transformer architecture was created to deal with NLP tasks where we consider sequences all the time.

In fact the data has to be [batch_size, seq_len, feature_dim], where feature_dim is the same as d_model.
And the attention mask controls which data points IN THE SEQUENCE can pay attention to each other, and as such has to be [batch_size*nhead, seq_len, seq_len].

It was my mistake to try to set up a training regimen where I would only have sequences of length one, because that is not what this architecture was made to do. All in all, this was more of a structural design error on my part than anything else. Thank you for your kindness guys :blush:
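For completeness, here is a minimal sketch of those shapes with a plain nn.TransformerEncoder (the sizes below are placeholders, not my actual model):

import torch
import torch.nn as nn

batch_size, seq_len, d_model, nhead = 8, 100, 64, 4

layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=nhead, batch_first=True)
encoder = nn.TransformerEncoder(layer, num_layers=2)

# Data: [batch_size, seq_len, feature_dim] with feature_dim == d_model
src = torch.rand(batch_size, seq_len, d_model)

# One [seq_len, seq_len] mask per sample, repeated per head; True = may not attend
mask = (torch.rand(batch_size, seq_len, seq_len) < 0.15).repeat_interleave(nhead, dim=0)

out = encoder(src, mask=mask)
print(out.shape)  # torch.Size([8, 100, 64])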

By the way, Huggingface has some very good resources, models and libraries built on PyTorch, along with some tutorials. Might save you some time and effort so you’re not reinventing the wheel, so to speak.

Hello,

I too landed here since I feel as though I have run out of options. I am experiencing an identical problem to the one you have outlined in this post:

I’m batching sequences as tensors of shape: [batch_size, (max)_seq_length] and I believed I could pass in an attention mask with the same dimensions indicating which tokens are padding tokens, but that is not the case.

I’ve read this post and worked through the math for scaled-dot-product attention several times. I think I understand why what I provide is incorrect, however I can’t figure out how to get my mask in the correct shape.

If I’ve got a mask like so:

[
  [True, True, True, False, False],
  [True, True, True, True, True],
]

I’m curious to know what code you ran to get it into the appropriate shape of [batch_size*n_head, seq_len, seq_len].

Any help is very much appreciated. Thank you!

Well, this was some time ago, but I think you want to use the src_key_padding_mask argument of the nn.TransformerEncoderLayer, and that will take care of the mask reshaping. If you’re running a custom implementation, I recommend checking the original PyTorch code.

Yeah, that was it. I appreciate it. I saw that but it just didn’t seem like the answer. I suppose I should have just tried it. It doesn’t seem too well documented. From another thread:

The main difference is that ‘src_key_padding_mask’ looks at masks applied to entire tokens. So for example, when you set a value in the mask Tensor to ‘True’, you are essentially saying that the token is a ‘pad token’ and should not be attended by any other tokens.
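To make that concrete with the example mask from above (where True marked real tokens): PyTorch uses the opposite convention for src_key_padding_mask, so the mask has to be inverted before it is passed in. A minimal sketch with placeholder sizes for d_model and nhead:

import torch
import torch.nn as nn

d_model, nhead = 16, 4  # placeholder sizes
layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=nhead, batch_first=True)

# The mask from above: True = real token, False = padding
keep = torch.tensor([
    [True, True, True, False, False],
    [True, True, True, True, True],
])

src = torch.rand(keep.shape[0], keep.shape[1], d_model)  # [batch_size, seq_len, d_model]

# src_key_padding_mask expects True for positions to IGNORE, so invert it;
# the layer broadcasts it across heads and query positions internally
out = layer(src, src_key_padding_mask=~keep)
print(out.shape)  # torch.Size([2, 5, 16])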