Mask in Attention: scalar vs torch.tensor vs torch.zeros

Hello all,

I implemented a Transformer model and I want to apply masking to the attention score table.

I implemented the masking in 3 different ways that should produce exactly the same mask, yet I get 3 different results.

The score table is a square matrix of shape (time_len * joint_num, time_len * joint_num):
Way 1:

    t_mask = torch.ones(time_len * joint_num, time_len * joint_num)
    filtered_area = torch.zeros(joint_num, joint_num)

    for i in range(time_len):
        row_begin = i * joint_num
        column_begin = row_begin
        row_num = joint_num
        column_num = row_num

        # zero the (joint_num x joint_num) diagonal block for time step i
        t_mask[row_begin: row_begin + row_num, column_begin: column_begin + column_num] *= filtered_area

Way 2:

    t_mask = torch.ones(time_len * joint_num, time_len * joint_num)

    for i in range(time_len):
        row_begin = i * joint_num
        column_begin = row_begin
        row_num = joint_num
        column_num = row_num

        # same block, zeroed by multiplying with a Python float
        t_mask[row_begin: row_begin + row_num, column_begin: column_begin + column_num] *= 0.0

Way 3:

    t_mask = torch.ones(time_len * joint_num, time_len * joint_num)

    for i in range(time_len):
        row_begin = i * joint_num
        column_begin = row_begin
        row_num = joint_num
        column_num = row_num

        # same block, zeroed by multiplying with a 0-dim tensor
        t_mask[row_begin: row_begin + row_num, column_begin: column_begin + column_num] *= torch.tensor(0.0)
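
To confirm that the three variants really build the same mask, they can be compared element-wise. A minimal sketch (time_len and joint_num are small illustrative values here, not my real sizes):

    import torch

    time_len, joint_num = 4, 5  # illustrative sizes only

    def build_mask(block):
        # block is the factor applied to each diagonal (joint_num x joint_num) block
        t_mask = torch.ones(time_len * joint_num, time_len * joint_num)
        for i in range(time_len):
            b = i * joint_num
            t_mask[b: b + joint_num, b: b + joint_num] *= block
        return t_mask

    m1 = build_mask(torch.zeros(joint_num, joint_num))  # Way 1
    m2 = build_mask(0.0)                                # Way 2
    m3 = build_mask(torch.tensor(0.0))                  # Way 3

    print(torch.equal(m1, m2), torch.equal(m2, m3))  # expected: True True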

The accuracy of Way 1 is better than the accuracy of Ways 2 and 3.
Why does this happen?
torch version: 1.7.0

As far as I can see, there is no differentiating factor in Way 1 that would make it perform better than Ways 2 and 3.
What performance difference are you observing?

Me neither! It does not make sense! I use the same seed every time!
The difference is around 2%.
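
Concretely, by "same seed" I mean something along these lines (a simplified sketch of the standard torch 1.7 recipe, not my exact script):

    import random
    import numpy as np
    import torch

    def set_seed(seed):
        # pin every RNG source that can affect a training run
        random.seed(seed)
        np.random.seed(seed)
        torch.manual_seed(seed)
        torch.cuda.manual_seed_all(seed)
        # force cuDNN onto deterministic kernels
        torch.backends.cudnn.deterministic = True
        torch.backends.cudnn.benchmark = False

    set_seed(42)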

OK, 2% might be a pretty big difference depending on the task.
Do you think it could be due to a difference in weight initialization?
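
One way to rule that out: re-seed, build the model twice, and compare the initial state_dicts element-wise; with identical seeding they should match exactly. A minimal sketch (nn.Linear stands in for your actual Transformer):

    import torch
    import torch.nn as nn

    def fresh_model(seed):
        # re-seed immediately before construction, as the training script should
        torch.manual_seed(seed)
        return nn.Linear(16, 16)  # stand-in for the actual Transformer

    a, b = fresh_model(0), fresh_model(0)
    same = all(torch.equal(p, q)
               for p, q in zip(a.state_dict().values(), b.state_dict().values()))
    print("identical initial weights:", same)  # expected: True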