TransformerEncoderLayer outputs NaN

Hi everyone,
I am trying to use a TransformerEncoderLayer with src_key_padding_mask as the encoder in a multi-turn dialogue generation task, but I get NaN outputs.

I found that the attention output is NaN whenever a sentence consists entirely of PAD tokens. Because the mask sets the attention score for every position in that sentence to -inf, the softmax over a row of all -inf returns NaN.
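Here is a minimal sketch that reproduces this (toy dimensions, assuming PyTorch's nn.TransformerEncoderLayer with the default batch_first=False layout):

```python
import torch
import torch.nn as nn

layer = nn.TransformerEncoderLayer(d_model=8, nhead=2)

# One batch of 3 sentences, 5 tokens each: (seq_len, batch, d_model)
src = torch.randn(5, 3, 8)

# src_key_padding_mask is (batch, seq_len); True marks a PAD position.
# The third sentence is entirely PAD, so its whole row is True.
mask = torch.tensor([[False, False, False, True, True],
                     [False, False, True,  True, True],
                     [True,  True,  True,  True, True]])

out = layer(src, src_key_padding_mask=mask)
print(out[:, 2, :])  # the all-PAD sentence: every value is NaN
```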

However, to keep all contexts in the batch the same size, I need sentences that are entirely PAD tokens.

Here is a simple example:

```
context: [[[1,2,3,0,0],[1,1,0,0,0],[0,0,0,0,0]],
          [[1,2,3,0,0],[1,1,0,0,0],[2,5,6,0,0]]]
mask: [[[False,False,False,True,True],[False,False,True,True,True],[True,True,True,True,True]],
       [[False,False,False,True,True],[False,False,True,True,True],[False,False,False,True,True]]]
```
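For reference, the mask here is just an equality test against the PAD id; a minimal sketch:

```python
import torch

pad_id = 0
context = torch.tensor([[[1,2,3,0,0],[1,1,0,0,0],[0,0,0,0,0]],
                        [[1,2,3,0,0],[1,1,0,0,0],[2,5,6,0,0]]])
mask = context.eq(pad_id)  # True marks a PAD position; matches the mask above
```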

0 is the PAD id.

I feed the contexts to the encoder layer one sentence at a time, so [0,0,0,0,0] produces NaN.

How can I solve this?
Is the mask essential in the TransformerEncoderLayer?

  1. Are the values in context independent of each other, i.e., does context[1] have any relation to context[0]? A mask is typically used when we want to uncover one token at a time (and hide the future tokens).
  2. If you’re simply inputting this sequence one at a time, you may not need a mask at all. Simply padding with 0 values at the end would let your model learn that these PAD tokens are not useful. The softmax can still be applied, and it should work just fine (see the sketch after this list).
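A minimal sketch of that suggestion, reusing the toy shapes from the repro above: with no src_key_padding_mask, no attention row is all -inf, so no NaN appears.

```python
import torch
import torch.nn as nn

layer = nn.TransformerEncoderLayer(d_model=8, nhead=2)
src = torch.randn(5, 3, 8)  # (seq_len, batch, d_model); PAD embeddings included as-is

# No src_key_padding_mask: every attention row stays finite,
# even for the all-PAD sentence, so the output contains no NaN.
out = layer(src)
print(torch.isnan(out).any())  # tensor(False)
```

If you do want to keep the mask, a common workaround (not from this thread) is to leave one position unmasked in all-PAD sentences, or to clean the output afterwards, e.g. with torch.nan_to_num(out).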

Thank you for your answer. I will try it.

Appreciate it.