TransformerEncoderLayer outputs NaN

Hi everyone,
I am trying to use a TransformerEncoderLayer with src_key_padding_mask as the encoder in a multi-turn dialogue generation task, but I get NaN outputs.

I found that the attention output is NaN whenever a sentence consists entirely of PAD tokens. Because the mask sets the attention score for every position in that sentence to -inf, the softmax over a row of all -inf returns NaN.
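Here is a minimal sketch that reproduces this (toy dimensions, assuming PyTorch's nn.TransformerEncoderLayer with the default batch_first=False layout):

```python
import torch
import torch.nn as nn

layer = nn.TransformerEncoderLayer(d_model=8, nhead=2)

# One batch of 3 sentences, 5 tokens each: (seq_len, batch, d_model)
src = torch.randn(5, 3, 8)

# src_key_padding_mask is (batch, seq_len); True marks a PAD position.
# The third sentence is entirely PAD, so its whole row is True.
mask = torch.tensor([[False, False, False, True, True],
                     [False, False, True,  True, True],
                     [True,  True,  True,  True, True]])

out = layer(src, src_key_padding_mask=mask)
print(out[:, 2, :])  # the all-PAD sentence: every value is NaN
```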

However, to keep all contexts in the batch the same size, I need sentences that are entirely PAD tokens.

Here is a simple example:

```
context: [[[1,2,3,0,0],[1,1,0,0,0],[0,0,0,0,0]],
          [[1,2,3,0,0],[1,1,0,0,0],[2,5,6,0,0]]]
mask: [[[False,False,False,True,True],[False,False,True,True,True],[True,True,True,True,True]],
       [[False,False,False,True,True],[False,False,True,True,True],[False,False,False,True,True]]]
```
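For reference, the mask here is just an equality test against the PAD id; a minimal sketch:

```python
import torch

pad_id = 0
context = torch.tensor([[[1,2,3,0,0],[1,1,0,0,0],[0,0,0,0,0]],
                        [[1,2,3,0,0],[1,1,0,0,0],[2,5,6,0,0]]])
mask = context.eq(pad_id)  # True marks a PAD position; matches the mask above
```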

0 is the PAD id.

I feed the contexts to the encoder layer one sentence at a time, so [0,0,0,0,0] produces NaN.

How can I solve this?
Is the mask essential in the TransformerEncoderLayer?

  1. Are the values in context independent of each other, i.e., does context[1] have any relation to context[0]? A mask is typically used when we want to uncover one token at a time (and hide the future tokens).
  2. If you’re simply inputting this sequence one at a time, you may not need a mask at all. Simply padding with 0 values at the end would let your model learn that these PAD tokens are not useful. The softmax can still be applied, and it should work just fine (see the sketch after this list).
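A minimal sketch of that suggestion, reusing the toy shapes from the repro above: with no src_key_padding_mask, no attention row is all -inf, so no NaN appears.

```python
import torch
import torch.nn as nn

layer = nn.TransformerEncoderLayer(d_model=8, nhead=2)
src = torch.randn(5, 3, 8)  # (seq_len, batch, d_model); PAD embeddings included as-is

# No src_key_padding_mask: every attention row stays finite,
# even for the all-PAD sentence, so the output contains no NaN.
out = layer(src)
print(torch.isnan(out).any())  # tensor(False)
```

If you do want to keep the mask, a common workaround (not from this thread) is to leave one position unmasked in all-PAD sentences, or to clean the output afterwards, e.g. with torch.nan_to_num(out).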

Thank you for your answer. I will try it.

Appreciate it.