Can torch.nn.AdaptiveLogSoftmaxWithLoss specify a target value that is ignored and does not contribute to the input gradient, like the ignore_index argument of torch.nn.CrossEntropyLoss?

In many cases we need to pad text so that all sequences are the same length and can be processed in batches. So I think being able to specify a target value that is ignored, and that contributes nothing to the input gradient, is necessary.
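For context, here is a minimal sketch of the workaround I have in mind, assuming a hypothetical PAD_IDX and simply masking out the padded positions before calling forward (since AdaptiveLogSoftmaxWithLoss's loss is a mean over the samples it is given, the padded positions then contribute no gradient):

```python
import torch
import torch.nn as nn

PAD_IDX = 0  # hypothetical padding index for illustration

in_features, n_classes = 32, 1000
asm = nn.AdaptiveLogSoftmaxWithLoss(in_features, n_classes, cutoffs=[100, 500])

# A padded batch: (batch, seq_len, in_features) hidden states and (batch, seq_len) targets
hidden = torch.randn(4, 10, in_features)
targets = torch.randint(1, n_classes, (4, 10))
targets[:, 7:] = PAD_IDX  # pretend the sequence tails are padding

flat_hidden = hidden.view(-1, in_features)
flat_targets = targets.view(-1)

# Drop padded positions before the loss, so they contribute no gradient
mask = flat_targets != PAD_IDX
out = asm(flat_hidden[mask], flat_targets[mask])
loss = out.loss  # mean negative log-likelihood over non-padded positions only
```

Is something like this the intended way to handle padding, or is there built-in support I am missing?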