I have a transformer model where 0 is an actual value in the input sequence and the values range from 0 to 49 (i.e. dictionary size = 50), so a sequence can look like s = [0,1,3,5,8,20]. The embedding layer has input_dim=50.
I pad the sequences with 0 so they are all the same length. The attention layer requires a padding mask, which I build as in the code below. The problem is that valid 0 values get masked along with the padding, and then the training loop fails because the predictions are nan. I can't pad with -1 because input_dim=50 won't work anymore.
I can't seem to find a solution to this, even though it seems like it should be a common problem in language modeling. Please help!
seq_len = query.shape[0]
nopeek_mask = torch.triu(torch.ones(seq_len, seq_len), diagonal=1)
nopeek_mask[nopeek_mask == 1] = -float("Inf")
pad_token_index = 0
pad_mask = (x == pad_token_index)
query_att, _ = self.atten_head(query, key, values, attn_mask=nopeek_mask, key_padding_mask=pad_mask)
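To see where the nan comes from: with the causal mask, position 0 can only attend to itself, and if that position holds a valid 0 it also gets padding-masked, leaving the softmax row with no unmasked key. A minimal sketch of this (hypothetical shapes, with a standalone nn.MultiheadAttention standing in for self.atten_head):

```python
import torch

torch.manual_seed(0)
emb, L = 8, 4
attn = torch.nn.MultiheadAttention(emb, num_heads=1)

x = torch.tensor([[0, 1, 3, 5]])        # a valid 0 at position 0
q = torch.randn(L, 1, emb)              # (seq, batch, emb)

# causal mask: position i may only attend to positions <= i
nopeek = torch.triu(torch.ones(L, L), diagonal=1)
nopeek[nopeek == 1] = -float("Inf")

pad_mask = (x == 0)                     # wrongly flags the valid 0 as padding
out, _ = attn(q, q, q, attn_mask=nopeek, key_padding_mask=pad_mask)
# row 0 has every key masked (causal + padding), so its softmax is nan
print(torch.isnan(out[0]).any())
```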
A solution that I can think of is to change the embedding dictionary so that 0 represents only the padding token. This is what is usually done in the literature.
Thanks for the response @AbdulsalamBande. I think that's what it's currently doing. Using the example where the sequence values can be 0-49, the embedding layer is embeddings = torch.nn.Embedding(50, self.emb_size). So 0s are being embedded, but then the padding mask masks both the 0s that are due to padding and the valid 0s earlier in the sequence.
E.g. the sequence can be s = [0,0,0,1,1,2,3,0,0,0], where the last 3 zeros are padding but the first 3 are actual values.
Don't forget that in PyTorch key_padding_mask is a boolean mask with True (1) for padding positions that should be ignored and False (0) for actual tokens. So in the case of [0,0,0,1,1,2,3,0,0,0], the padding mask should be [0,0,0,0,0,0,0,1,1,1].
For padding as an index, you can keep zero as your padding value and encode each real value i as i + 1. For example, 0 is encoded as 1, 1 as 2, and so on.
In this case you will have 51 indices instead of 50, and the padding mask (x == 0) now only flags real padding.
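A minimal sketch of this remapping (the encode helper and constant names are hypothetical): shift every real token by +1 so index 0 is reserved for padding, and size the embedding as 51:

```python
import torch

VOCAB = 50
PAD = 0

def encode(seq, max_len):
    # shift real tokens 0..49 -> 1..50, then pad with 0
    shifted = [t + 1 for t in seq]
    return shifted + [PAD] * (max_len - len(shifted))

batch = torch.tensor([encode([0, 1, 3, 5, 8, 20], 8),
                      encode([0, 0, 2], 8)])
# 51 rows: one per shifted token plus the pad index
embeddings = torch.nn.Embedding(VOCAB + 1, 16, padding_idx=PAD)
pad_mask = (batch == PAD)   # now only true padding is masked
emb = embeddings(batch)
```

With padding_idx=PAD the pad embedding is frozen at zero, so the padded positions contribute nothing even before masking.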
@AbdulsalamBande that makes sense, but there is no way for me to know in advance where in the sequence a 0 will appear, hence for each batch the mask is generated as
pad_mask = (x == pad_token_index), which then masks every 0.
@brighteningeyes thank you, this seems like a reasonable solution to me. I think this just means that in the output, when I get the probability for each value in the sequence, I have to discard the one for index 0, since I'll have 51 predicted values instead of 50.
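A sketch of that output-side handling (assuming the final projection now produces 51 logits; shapes and names are hypothetical). You can either drop the pad logit before decoding, or keep all 51 and shift back:

```python
import torch

logits = torch.randn(2, 8, 51)       # (batch, seq, vocab + pad)

# option 1: never predict the pad index
real_logits = logits[..., 1:]        # (batch, seq, 50)
pred = real_logits.argmax(dim=-1)    # already back in the original 0..49 range

# option 2: keep all 51 logits and undo the shift
pred2 = logits.argmax(dim=-1) - 1    # -1 here would mean "pad"
```

Option 1 is the safer default, since it also keeps the model from wasting probability mass on the pad token at inference time.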