nn.Transformer src/tgt/memory masks fail to work

nn.TransformerEncoderLayer produces exactly the same output for the same src, no matter what src_key_padding_mask or src_mask I pass.

Likewise, the nn.TransformerDecoderLayer output is not affected by any of tgt_mask, memory_mask, tgt_key_padding_mask, or memory_key_padding_mask.

Does anyone know what's going wrong? How can I make the masks work correctly? Thanks a lot.

[Input0]:
import torch
import torch.nn as nn
encoder_layer = nn.TransformerEncoderLayer(d_model=6, nhead=2)
encoder_layer.eval()
src = torch.ones((4, 3, 6))  # (seq_len, batch, d_model); batch_first defaults to False
encoder_layer(src)

[Output0]:
tensor([[[ 0.9927,  0.2251, -1.5199,  1.2508, -1.0397,  0.0910],
         [ 0.9927,  0.2251, -1.5199,  1.2508, -1.0397,  0.0910],
         [ 0.9927,  0.2251, -1.5199,  1.2508, -1.0397,  0.0910]],

        [[ 0.9927,  0.2251, -1.5199,  1.2508, -1.0397,  0.0910],
         [ 0.9927,  0.2251, -1.5199,  1.2508, -1.0397,  0.0910],
         [ 0.9927,  0.2251, -1.5199,  1.2508, -1.0397,  0.0910]],

        [[ 0.9927,  0.2251, -1.5199,  1.2508, -1.0397,  0.0910],
         [ 0.9927,  0.2251, -1.5199,  1.2508, -1.0397,  0.0910],
         [ 0.9927,  0.2251, -1.5199,  1.2508, -1.0397,  0.0910]],

        [[ 0.9927,  0.2251, -1.5199,  1.2508, -1.0397,  0.0910],
         [ 0.9927,  0.2251, -1.5199,  1.2508, -1.0397,  0.0910],
         [ 0.9927,  0.2251, -1.5199,  1.2508, -1.0397,  0.0910]]],
       grad_fn=<NativeLayerNormBackward0>)

[Input2]:
print(encoder_layer(src, src_mask=torch.zeros((4, 4)).bool()))  # all False: nothing masked
print(encoder_layer(src, src_mask=torch.tensor(
    [[0, 1, 1, 1],
     [0, 0, 1, 1],
     [0, 0, 0, 1],
     [0, 0, 0, 0]]
).bool()))  # causal mask (True = position blocked)
print(encoder_layer(src, src_mask=torch.tensor(
    [[0, 1, 0, 1],
     [1, 0, 1, 1],
     [0, 1, 0, 1],
     [0, 1, 1, 1]]
).bool()))  # arbitrary mask

[Output2]:
tensor([[[ 0.9927,  0.2251, -1.5199,  1.2508, -1.0397,  0.0910],
         [ 0.9927,  0.2251, -1.5199,  1.2508, -1.0397,  0.0910],
         [ 0.9927,  0.2251, -1.5199,  1.2508, -1.0397,  0.0910]],

        [[ 0.9927,  0.2251, -1.5199,  1.2508, -1.0397,  0.0910],
         [ 0.9927,  0.2251, -1.5199,  1.2508, -1.0397,  0.0910],
         [ 0.9927,  0.2251, -1.5199,  1.2508, -1.0397,  0.0910]],

        [[ 0.9927,  0.2251, -1.5199,  1.2508, -1.0397,  0.0910],
         [ 0.9927,  0.2251, -1.5199,  1.2508, -1.0397,  0.0910],
         [ 0.9927,  0.2251, -1.5199,  1.2508, -1.0397,  0.0910]],

        [[ 0.9927,  0.2251, -1.5199,  1.2508, -1.0397,  0.0910],
         [ 0.9927,  0.2251, -1.5199,  1.2508, -1.0397,  0.0910],
         [ 0.9927,  0.2251, -1.5199,  1.2508, -1.0397,  0.0910]]],
       grad_fn=<NativeLayerNormBackward0>)

tensor([[[ 0.9927,  0.2251, -1.5199,  1.2508, -1.0397,  0.0910],
         [ 0.9927,  0.2251, -1.5199,  1.2508, -1.0397,  0.0910],
         [ 0.9927,  0.2251, -1.5199,  1.2508, -1.0397,  0.0910]],

        [[ 0.9927,  0.2251, -1.5199,  1.2508, -1.0397,  0.0910],
         [ 0.9927,  0.2251, -1.5199,  1.2508, -1.0397,  0.0910],
         [ 0.9927,  0.2251, -1.5199,  1.2508, -1.0397,  0.0910]],

        [[ 0.9927,  0.2251, -1.5199,  1.2508, -1.0397,  0.0910],
         [ 0.9927,  0.2251, -1.5199,  1.2508, -1.0397,  0.0910],
         [ 0.9927,  0.2251, -1.5199,  1.2508, -1.0397,  0.0910]],

        [[ 0.9927,  0.2251, -1.5199,  1.2508, -1.0397,  0.0910],
         [ 0.9927,  0.2251, -1.5199,  1.2508, -1.0397,  0.0910],
         [ 0.9927,  0.2251, -1.5199,  1.2508, -1.0397,  0.0910]]],
       grad_fn=<NativeLayerNormBackward0>)

tensor([[[ 0.9927,  0.2251, -1.5199,  1.2508, -1.0397,  0.0910],
         [ 0.9927,  0.2251, -1.5199,  1.2508, -1.0397,  0.0910],
         [ 0.9927,  0.2251, -1.5199,  1.2508, -1.0397,  0.0910]],

        [[ 0.9927,  0.2251, -1.5199,  1.2508, -1.0397,  0.0910],
         [ 0.9927,  0.2251, -1.5199,  1.2508, -1.0397,  0.0910],
         [ 0.9927,  0.2251, -1.5199,  1.2508, -1.0397,  0.0910]],

        [[ 0.9927,  0.2251, -1.5199,  1.2508, -1.0397,  0.0910],
         [ 0.9927,  0.2251, -1.5199,  1.2508, -1.0397,  0.0910],
         [ 0.9927,  0.2251, -1.5199,  1.2508, -1.0397,  0.0910]],

        [[ 0.9927,  0.2251, -1.5199,  1.2508, -1.0397,  0.0910],
         [ 0.9927,  0.2251, -1.5199,  1.2508, -1.0397,  0.0910],
         [ 0.9927,  0.2251, -1.5199,  1.2508, -1.0397,  0.0910]]],
       grad_fn=<NativeLayerNormBackward0>)

All of the above produce the same result! What's going wrong?

Problem solved! It's a special case caused by filling the input with all ones. Although the attention weights differ under different masks, attn_weights.sum(dim=-1) always equals 1, and the masks do not change the value matrix, whose rows are all identical. So when the attention weights (which sum to 1 along the last dim) multiply the value matrix, every output row is that same shared value row, regardless of the mask, and after the feed-forward sublayer the outputs are therefore still identical.
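
For anyone who lands here, a minimal sketch of the check I'd suggest (my own code, not part of the original snippet; the seed and the causal mask are just illustrative): it shows that the masks do take effect once src is not constant, and why an all-ones input hides them.

import torch
import torch.nn as nn

torch.manual_seed(0)
encoder_layer = nn.TransformerEncoderLayer(d_model=6, nhead=2)
encoder_layer.eval()

src = torch.randn(4, 3, 6)                                 # random input instead of all ones
causal = torch.triu(torch.ones(4, 4), diagonal=1).bool()   # True = attention to that position is blocked

with torch.no_grad():
    out_plain = encoder_layer(src)
    out_masked = encoder_layer(src, src_mask=causal)
print(torch.allclose(out_plain, out_masked))               # False: the mask changes the output

# Why all-ones input hides the mask: self-attention computes attn_weights @ V.
# If every row of V is the same vector v, each output row is
# (row-sum of the softmax weights) * v = 1 * v, whatever the masked weights are.
attn_weights = torch.softmax(torch.randn(4, 4), dim=-1)    # any valid attention weights
V = torch.ones(4, 6)                                       # identical value rows
print(torch.allclose(attn_weights @ V, V))                 # True: the weights don't matter here

The same reasoning applies to the decoder layer: with a non-constant tgt and memory, tgt_mask, memory_mask, and the key-padding masks all change the output as expected.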