Hi everyone,
I am trying to use a Transformer Encoder Layer with src_key_padding_mask as the encoder in a multi-turn dialogue generation task, but I get NaN.
I found that the attention output is NaN when a sentence is all PAD. Because every attention score for such a sentence is set to -inf, the softmax returns NaN.
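To illustrate what I mean, here is a minimal sketch of the softmax failure in isolation (plain NumPy, not the actual attention code):

```python
import numpy as np

# A row of attention scores where every key position is masked out
scores = np.array([-np.inf, -np.inf, -np.inf, -np.inf, -np.inf])

exps = np.exp(scores)       # exp(-inf) == 0.0, so the row is all zeros
probs = exps / exps.sum()   # 0/0 under IEEE arithmetic -> nan

print(np.isnan(probs).all())  # True
```

So as soon as one sentence is fully masked, its whole attention distribution becomes NaN.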
However, to keep every context in the batch the same size (the same number of turns), I need some sentences that consist entirely of PAD tokens.
Here is a simple example:
context: [[[1,2,3,0,0],[1,1,0,0,0],[0,0,0,0,0]],
[[1,2,3,0,0],[1,1,0,0,0],[2,5,6,0,0]]]
mask: [[[False,False,False,True,True],[False,False,True,True,True],[True,True,True,True,True]],
[[False,False,False,True,True],[False,False,True,True,True],[False,False,False,True,True]]]
0 is the PAD id.
I feed the sentences into the encoder layer one by one, so [0,0,0,0,0] produces NaN.
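For reference, this is roughly how I can reproduce it with a standalone encoder layer (the d_model/nhead values here are just placeholders, not my real model):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
layer = nn.TransformerEncoderLayer(d_model=8, nhead=2)
layer.eval()

# One sentence of 5 tokens, every position is PAD
src = torch.zeros(5, 1, 8)                     # (seq_len, batch, d_model)
pad_mask = torch.ones(1, 5, dtype=torch.bool)  # True = ignore this position

with torch.no_grad():
    out = layer(src, src_key_padding_mask=pad_mask)

print(torch.isnan(out).any())  # the output contains NaN
```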
How can I solve this?
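One workaround I have been considering (not sure if it is the recommended approach) is to leave a single position unmasked in every all-PAD sentence, so the softmax is always defined, and then ignore those sentences' outputs downstream:

```python
import torch

# True = masked; the second sentence is entirely PAD
mask = torch.tensor([[False, False, False, True, True],
                     [True,  True,  True,  True, True]])

all_pad = mask.all(dim=1)  # which sentences are fully masked
mask[all_pad, 0] = False   # let them attend to one (PAD) token

# 'all_pad' can then be used later to zero out / skip those outputs
```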
Is it essential to use the mask in the Transformer Encoder Layer?