TransformerEncoder output size doesn't match input size

I’m looking for suggestions on how to debug the following issue, where the output size of a nn.TransformerEncoder does not match the input. As part of a larger network, I have a nn.TransformerEncoder layer containing a single nn.TransformerEncoderLayer. It’s created as:

enc_layers = nn.TransformerEncoderLayer(16, 2,
    dim_feedforward=32, dropout=0.0, activation='gelu')
self.encoder = nn.TransformerEncoder(enc_layers, 1)

In the forward method, the snippet where this is called is:

xs = xs.view(-1, 20, 16)
xsa = self.encoder(xs, src_key_padding_mask=mask)

I’m getting:

torch.Size([1024, 20, 16])
torch.Size([1024, 14, 16])

The strange thing is that when I run a single batch in isolation in a notebook, the sizes match and everything looks fine, but when run in the training script, I see the size mismatch. I’ve printed the sizes of all of the model’s parameter tensors in both runs, and they match. The mask size is `torch.Size([1024, 20])`.

I’ve also used `python -m pdb` to drop in and inspect things when the training script fails at the next step because of the unexpected tensor size. When I pass `xs` to `self.encoder` again from the debugger, I get the expected size, so this is really mysterious to me.

I’m using torch 2.0.0 on a CUDA GPU, but the same thing happens when running on the CPU as well.

Is there a situation where a nn.TransformerEncoder’s output could have a different size than the input? Thoughts/suggestions for how to further debug this?
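For reference, here is a scaled-down, self-contained sketch of the setup. Note that `batch_first=True` is my assumption based on the mask shape, and padding every sequence after 14 real tokens is a guess meant to mirror the 14 in the unexpected output size; whether the sequence dimension actually shrinks seems to depend on version and mode:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

enc_layers = nn.TransformerEncoderLayer(16, 2, dim_feedforward=32,
                                        dropout=0.0, activation='gelu',
                                        batch_first=True)
encoder = nn.TransformerEncoder(enc_layers, 1)

# Scaled-down batch: 8 sequences of length 20, model dim 16.
xs = torch.randn(8, 20, 16)
# Boolean padding mask (True = padded position); every sequence here
# has only 14 real tokens, mirroring the shapes in the post.
mask = torch.zeros(8, 20, dtype=torch.bool)
mask[:, 14:] = True

encoder.eval()
with torch.no_grad():
    out = encoder(xs, src_key_padding_mask=mask)
print(xs.shape, out.shape)  # the seq dim of `out` may come back as 14, not 20
```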


I am experiencing exactly the same problem. Changing the batch size of my model triggers this behaviour for some reason.

transformer_output = self.encoder(embeddings, is_causal=self.autoregressive, src_key_padding_mask=attention_mask)

where `embeddings` has shape [9, 2048, 256] and `attention_mask` has shape [9, 2048], but the output has shape [9, 2041, 256].
Is there any explanation? I don’t know what is going on, and it’s difficult to debug.
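One hypothesis worth testing (assuming `attention_mask` is boolean with `True` marking padded positions): if 2041 happens to equal the longest unpadded sequence in the batch, the encoder may be converting the input to a nested tensor internally and re-padding the output only to that length. A quick sanity check with a scaled-down stand-in mask, plus the documented `enable_nested_tensor=False` switch on `nn.TransformerEncoder` as a possible workaround (the layer sizes here are made up for the sketch):

```python
import torch
import torch.nn as nn

# Scaled-down stand-in for the batch in the post: 9 sequences,
# padded length 32 instead of 2048, at most 25 real tokens each.
attention_mask = torch.zeros(9, 32, dtype=torch.bool)  # True = padded
attention_mask[:, 25:] = True

# Longest unpadded sequence in the batch. If the analogous number for the
# real batch is 2041, the truncated output length would be explained by
# the encoder dropping shared trailing padding.
longest = (~attention_mask).sum(dim=1).max().item()
print(longest)  # 25 for this stand-in mask

# Possible workaround: opt out of the nested-tensor conversion.
layer = nn.TransformerEncoderLayer(16, 2, batch_first=True, dropout=0.0)
encoder = nn.TransformerEncoder(layer, 1, enable_nested_tensor=False)
encoder.eval()
with torch.no_grad():
    out = encoder(torch.randn(9, 32, 16), src_key_padding_mask=attention_mask)
print(out.shape)  # stays [9, 32, 16] with the conversion disabled
```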