Does the ChatBot Tutorial have a bug in the masking code?

I was going through the masking code in the chatbot tutorial and noticed that it masks with a zero the entries that are 0 but are NOT padding tokens (e.g. the first token). Is that a bug? Is the fix to use the lengths of the sequences to build the mask instead?

# Returns padded target sequence tensor, padding mask, and max target length
def outputVar(l, voc):
    '''
    padVar = padded (transposed) tensor of the batch of sentences
        tensor([[1391,  188,  122,   53, 5091],
        [   4,   53,   12,  154, 7708],
        [   2, 3026, 1048,  747,    4],
        [   0,    4,  115, 5747,    2],
        [   0,    2,   12, 2281,    0],
        [   0,    0, 1048,    4,    0],
        [   0,    0,    4,    2,    0],
        [   0,    0,    2,    0,    0]])
    mask = mask indicating where words occur and which positions are not words (i.e. 0 for padding)
        tensor([[1, 1, 1, 1, 1],
        [1, 1, 1, 1, 1],
        [1, 1, 1, 1, 1],
        [0, 1, 1, 1, 1],
        [0, 1, 1, 1, 0],
        [0, 0, 1, 1, 0],
        [0, 0, 1, 1, 0],
        [0, 0, 1, 0, 0]], dtype=torch.uint8)
    max_target_len = length of longest target sentence
        max_target_len = 8
    '''
    # list of index representations of sentences [[124, 101, 102, 4401, 98, 382, 4, 2], ..., [67, 188, 38, 4, 2]]
    indexes_batch = [indexesFromSentence(voc, sentence) for sentence in l]
    # get length of the longest (target) sentence
    max_target_len = max([len(indexes) for indexes in indexes_batch])
    # (transposed) list of index representations of sentences with padding zeros at the end [(124, 25, 25, 218, 67), ..., (4, 2, 0, 0, 0), (2, 0, 0, 0, 0)]
    padList = zeroPadding(indexes_batch) # pads sentences that are shorter than the longest with zeros
    # returns the mask indicating which positions are words and which are just padding zeros (marked with 0)
    mask = binaryMatrix(padList)
    #mask = torch.Tensor(padList) != PAD_token ## ALSO BUGGY?!
    mask = torch.ByteTensor(mask)
    # tensorfy the (transposed) list of index representations of sentences with padded zeros at the end. The last list is now a tensor/matrix
    padVar = torch.LongTensor(padList)
    return padVar, mask, max_target_len

That’s not a bug; the token with index 0 is the padding token:

PAD_token = 0  # Used for padding short sentences
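
Since index 0 is reserved for the padding token, no real word in the vocabulary ever gets index 0, so the mask can never zero out an actual word. As a quick illustration, here is a compact sketch (my own reconstruction, not the verbatim tutorial code) of what binaryMatrix effectively computes on a small transposed batch:

import torch

PAD_token = 0  # reserved index: no real word is ever assigned 0

# 1 wherever the entry is a word index, 0 wherever it is PAD_token
def binaryMatrix(padded):
    return [[0 if token == PAD_token else 1 for token in seq] for seq in padded]

padded = [(1391, 53), (4, 154), (2, 0)]  # tiny (max_len, batch) example; second sentence is shorter
mask = torch.ByteTensor(binaryMatrix(padded))
# tensor([[1, 1],
#         [1, 1],
#         [1, 0]], dtype=torch.uint8)

The commented-out alternative mask = torch.Tensor(padList) != PAD_token should give an equivalent mask for the same reason (an element-wise comparison against the padding index), just with a different dtype.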

Best regards

Thomas

I also missed that things get transposed at some point, so that confused me too (because I saw 2 pad tokens per row in the example at the beginning of the code).
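
For anyone else tripped up by the transposition: here is a minimal sketch of a zip_longest-based zeroPadding (my reconstruction of the tutorial's helper, so treat the exact signature as an assumption) showing how the batch ends up in (max_length, batch_size) order, with padding collecting at the bottom of each column:

import itertools

PAD_token = 0

# zip_longest(*batch) turns a list of sentences of shape (batch_size, max_len)
# into a list of time steps of shape (max_len, batch_size), filling the shorter
# sentences with PAD_token
def zeroPadding(batch, fillvalue=PAD_token):
    return list(itertools.zip_longest(*batch, fillvalue=fillvalue))

batch = [[124, 101, 4, 2],  # sentence of length 4
         [67, 188, 2]]      # sentence of length 3
print(zeroPadding(batch))
# [(124, 67), (101, 188), (4, 2), (2, 0)]
# each tuple is one time step across the whole batch, so the pad token of the
# shorter sentence appears at the end of its column rather than the end of its row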