Well, I am unsqueezing a third dimension, so my tensor is sized [batch_size x seq_len x 1], and I send a mask along with it with dimensions [batch_size*nheads, seq_len, seq_len].

This gets me the error:

AssertionError: Expected `attn_mask` shape to be (5, 8, 8) but got torch.Size([40, 100, 100])

Note that here nheads is 5, batch_size is 8, and seq_len is 100, so the expected size is actually [nheads, batch_size, batch_size], but that cannot be right.

If I don’t unsqueeze and send in my data as [batch_size x seq_len], and my mask as [nheads*batch_size, seq_len], then I run into:

RuntimeError: The shape of the 2D attn_mask is torch.Size([40, 100]), but should be (8, 8).

Here again the “correct” size is [batch_size x batch_size].
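For reference, here is a minimal sketch (using the sizes from above, and a hypothetical all-False boolean mask, i.e. "attend everywhere") of the two mask shapes I understand nn.TransformerEncoder to accept: a 2D mask of (seq_len, seq_len) shared across the batch, or a 3D mask of (batch_size*nheads, seq_len, seq_len):

```python
import torch
import torch.nn as nn

batch_size, seq_len, d_model, nhead = 8, 100, 100, 5

layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=nhead, batch_first=True)
encoder = nn.TransformerEncoder(layer, num_layers=1)

# Batched input: (batch_size, seq_len, d_model) with batch_first=True
src = torch.rand(batch_size, seq_len, d_model)

# 2D mask: (seq_len, seq_len), shared across the whole batch
mask_2d = torch.zeros(seq_len, seq_len, dtype=torch.bool)
out = encoder(src, mask=mask_2d)

# 3D mask: (batch_size * nhead, seq_len, seq_len), one per head and sample
mask_3d = torch.zeros(batch_size * nhead, seq_len, seq_len, dtype=torch.bool)
out = encoder(src, mask=mask_3d)
```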

Here is the updated version of my current data collator:

```
import torch
from transformers import DefaultDataCollator, default_data_collator

class MyDataCollator(DefaultDataCollator):
    def __init__(self, model, mask_prob=0.15):
        self.model_nhead = model.nhead
        self.d_model = model.d_model
        self.mask_prob = mask_prob

    def __call__(self, input):
        batch = default_data_collator(input)
        # Mask should be sized [batch_size * nhead, seq_len, seq_len], but that doesn't work
        batch['src'] = batch['src'].unsqueeze(2)
        batch_size = batch['src'].shape[0]  # This is now 8 due to the low memory of my laptop
        seq_len = batch['src'].shape[1]     # This is 500; however, the embedding transforms it into d_model
        mask = torch.rand(self.model_nhead * batch_size, self.d_model, self.d_model) < self.mask_prob
        batch['src_mask'] = mask
        return batch
```
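Just as a sketch of what the comment in the collator actually asks for (hypothetical stand-in sizes, since the real values come from the model), the mask would be built from seq_len rather than d_model:

```python
import torch

nhead, batch_size, seq_len, mask_prob = 5, 8, 100, 0.15

# Boolean mask with the shape the comment describes:
# (batch_size * nhead, seq_len, seq_len)
mask = torch.rand(nhead * batch_size, seq_len, seq_len) < mask_prob
print(mask.shape)  # torch.Size([40, 100, 100])
```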

Also, your comment on whether I send in batched inputs intrigued me, so I followed it into the PyTorch code. I found that the error comes from inside an if statement which checks whether my data is 2- or 3-dimensional, and, interestingly, from the 2D branch. So even though I send in batched inputs, PyTorch somehow detects them as unbatched.

I found this in the `torch/nn/functional.py` file, in the function `_mha_shape_check`.
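As far as I can tell, that check decides batched vs. unbatched purely from the tensor's dimensionality; a minimal sketch mirroring that logic (my own simplified re-statement, not the actual torch source):

```python
import torch

def is_batched_like_mha_shape_check(query: torch.Tensor) -> bool:
    # Mirrors the core of torch's _mha_shape_check: a 3-D query is
    # treated as batched, a 2-D query as unbatched.
    if query.dim() == 3:
        return True
    elif query.dim() == 2:
        return False
    raise AssertionError(
        f"query should be unbatched 2D or batched 3D, got {query.dim()}-D"
    )

print(is_batched_like_mha_shape_check(torch.rand(8, 100, 1)))  # True  (batched)
print(is_batched_like_mha_shape_check(torch.rand(8, 100)))     # False (unbatched)
```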

But then I printed out the shape of my input tensor right before passing it to the nn.TransformerEncoder, and it was the correct [batch_size x seq_len x 1].

Also, since my first dimension is the batch dimension, I am passing the batch_first=True parameter to the nn.TransformerEncoderLayer. However, it seemed to have no effect, so I tested it by swapping the axes of my input so I had [seq_len x batch_size x 1]. That passed the shape check but then failed later on a different one:

AssertionError: was expecting embedding dimension of 100, but got 1

I honestly don’t know what the hell is going on with this one, but I thought I would share it; maybe it helps.
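For what it's worth, that last assertion seems to say the encoder wants the last dimension of the input to equal d_model (100 here), while my input has feature size 1. A minimal sketch that satisfies the shape checks, with a hypothetical nn.Linear projection standing in for whatever embedding the model applies:

```python
import torch
import torch.nn as nn

d_model, nhead, batch_size, seq_len = 100, 5, 8, 100

# The encoder expects the LAST dim to equal d_model, so a scalar-per-position
# input of shape (batch, seq, 1) has to be projected/embedded to d_model first.
proj = nn.Linear(1, d_model)  # hypothetical projection for scalar features
layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=nhead, batch_first=True)
encoder = nn.TransformerEncoder(layer, num_layers=1)

src = torch.rand(batch_size, seq_len, 1)  # raw input, feature dim 1
out = encoder(proj(src))                  # (batch, seq, d_model)
print(out.shape)  # torch.Size([8, 100, 100])
```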